whole web crawl


by Gaurang Patel


All-

I am a novice with Nutch. Can anyone tell me the estimated size (in TB, I suppose) required to store the crawled results for the whole web? I want to estimate the storage requirements for my project, which uses the Nutch web crawler.



Regards,
Gaurang Patel

Re: whole web crawl

by kevin chen-6



The estimated size of Google's index is 15 billion pages. So even at 1 KB per
page, that would be 15 TB. But I think the average page size is way more
than 1 KB.
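
As a rough sketch of the arithmetic above (not a measurement), the estimate is just page count times average page size. The helper name crawl_size_tb, the decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes), and the 10 KB and 15 KB page sizes are assumptions for illustration only; only the 15 billion page count and the 1 KB figure come from this message.

```python
# Decimal units are an assumption: 1 KB = 10**3 bytes, 1 TB = 10**12 bytes.
KB = 10**3
TB = 10**12


def crawl_size_tb(pages, bytes_per_page):
    """Return the raw storage, in TB, for `pages` pages of `bytes_per_page` each."""
    return pages * bytes_per_page / TB


if __name__ == "__main__":
    pages = 15 * 10**9  # index size quoted in this message
    # 1 KB is the figure from the message; 10 KB and 15 KB are hypothetical.
    for kb_per_page in (1, 10, 15):
        size = crawl_size_tb(pages, kb_per_page * KB)
        print(f"{kb_per_page:>2} KB/page -> {size:.0f} TB")
```

This counts only raw page bytes; any overhead a crawler adds for its own databases and indexes is not modelled here.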


On Sun, 2009-10-04 at 17:28 -0700, Gaurang Patel wrote:

> All-
>
> I am novice to using Nutch. Can anyone tell me the estimated size in
> (I suppose, in TBs) that will be required to store the crawled results
> for whole web? I want to get estimate of the memory requirements for
> my project, that uses Nutch web crawler.
>
>
>
> Regards,
> Gaurang Patel


Re: whole web crawl

by Gaurang Patel


Hey Kevin,

You are right. It's around 30-40 TB for Google. But as far as Nutch is concerned, I think Jack Yu is right.

Here is what he said, in case you did not receive it:
0.1 billion pages for 1.5 TB
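
To make that figure easier to reuse, here is a minimal sketch that turns it into a per-page size and scales it linearly to another crawl size. The function names, the decimal units (1 TB = 10^12 bytes), and the 1 billion page target are hypothetical; whether the same ratio holds at a larger scale is an assumption, not something stated in this thread.

```python
# Decimal units are an assumption: 1 TB = 10**12 bytes.
TB = 10**12


def bytes_per_page(total_bytes, pages):
    """Average stored bytes per page implied by a (size, page count) pair."""
    return total_bytes / pages


def scaled_size_tb(target_pages, per_page_bytes):
    """Linearly scale a per-page size to `target_pages`, in TB."""
    return target_pages * per_page_bytes / TB


if __name__ == "__main__":
    # Figure quoted above: 0.1 billion pages for 1.5 TB.
    per_page = bytes_per_page(1_500_000_000_000, 100_000_000)
    print(f"about {per_page / 1000:.0f} KB per page")

    # Hypothetical target of 1 billion pages, assuming the ratio stays constant.
    print(f"{scaled_size_tb(10**9, per_page):.1f} TB for 1 billion pages")
```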

Regards,
Gaurang
