Parallel Webcrawler Implementation

by Tobias N. Sasse

Hi Guys,

I am working on a parallel webcrawler implementation in Java. I could
use some help with some design questions and a bug that is costing me
sleep ;-)

First thing, this is my design: I have a list which stores URLs that
have already been crawled. Further, I have a queue which is responsible
for providing the crawler with the next URL to fetch. Then I have a
ThreadController which spawns new crawler threads until a maximum number
is reached. Finally, there are crawler threads that each process a URL
given by the queue. They work until the queue size is zero and then the
system stops.
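
The design described above can be sketched roughly as follows. This is
only an illustration, not the poster's code: the class name
CrawlFrontier, its method names and the fetchLinks callback are
assumptions made for the example. It counts pending URLs instead of
stopping as soon as the queue is empty, because the queue can be
momentarily empty while another thread is still fetching a page that
will yield new links.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: visited set + URL queue + a fixed number of worker threads.
class CrawlFrontier {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Number of URLs that are queued or currently being processed.
    private final AtomicInteger pending = new AtomicInteger();

    /** Enqueues a URL unless it has been seen before. */
    void offer(String url) {
        if (seen.add(url)) {
            pending.incrementAndGet();
            queue.add(url);
        }
    }

    /** Runs the workers until no URL is queued or in flight; returns all seen URLs. */
    Set<String> crawl(int threads, Function<String, List<String>> fetchLinks)
            throws InterruptedException {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (pending.get() > 0) {
                        String url = queue.poll(50, TimeUnit.MILLISECONDS);
                        if (url == null) {
                            continue; // queue empty, but another worker may still add links
                        }
                        try {
                            for (String link : fetchLinks.apply(url)) {
                                offer(link);
                            }
                        } finally {
                            pending.decrementAndGet();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return seen;
    }
}
```

Usage would be along the lines of calling offer() with the seed URLs and
then crawl(maxThreads, url -> linksFoundOn(url)), where linksFoundOn is
a placeholder for the actual fetch-and-parse step.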

Here is my question: I am using (basically) the following statements.
As I am new to httpclient this could well be a dumb approach, and I
would be happy to get feedback.

<snip from WebCrawlerThread>
DefaultHttpClient client;
HttpGet get;

    public void run() {
        // a new client (and thus a new connection) is created for every URL
        client = new DefaultHttpClient();
        HttpResponse response = client.execute(get);
        HttpEntity entity = response.getEntity();
        String mimetype = entity.getContentType().getValue();
        String rawPage = EntityUtils.toString(entity);
        // closes the connection again right after the page has been read
        client.getConnectionManager().shutdown();

        // (...) doing crawler things
    }
</snip>

First thing: Is the thread the right place to host the client object, or
should it be shared?
Second: Would it enhance performance if I reuse the connection somehow?

And most importantly, the bug: as the number of pages increases, I
receive zillions of

"java.net.BindException: Address already in use: connect"

exceptions from the logger.

What could be the cause of that?

BTW: Once the code is working properly I want to publish it as an
open-source project.

HTH,
Tobi

---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@...
For additional commands, e-mail: httpclient-users-help@...


Re: Parallel Webcrawler Implementation

by Tobias N. Sasse

Tobias N. Sasse wrote:
> And most important the bug: With increasing number of pages I receive
> zillions of
> "java.net.BindException: Address already in use: connect"

Just to let you know, I have tracked this bug down. The cause of the
problem was that I had been testing under Windows XP, and this crappy OS
only uses a range of ~4000 ports for outgoing TCP/IP connections and
needs up to 4 minutes to release them again. Thus new connections fail
once those 4000 ports are saturated, until the OS releases them again.
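
As a rough back-of-the-envelope check of those figures, the sketch below
computes how many new connections per second can be sustained before the
port range runs dry. It assumes every fetch opens a fresh connection and
every closed connection holds its port for the full 4 minutes; the class
and method names are made up for the illustration.

```java
import java.util.Locale;

// Hypothetical helper: sustainable rate of new outgoing connections.
class PortBudget {
    /** Ports available divided by the time each port stays blocked after use. */
    static double maxConnectionsPerSecond(int ephemeralPorts, int releaseSeconds) {
        return (double) ephemeralPorts / releaseSeconds;
    }

    public static void main(String[] args) {
        // Figures from the message: ~4000 ports, up to 4 minutes (240 s) to release.
        double rate = maxConnectionsPerSecond(4000, 240);
        System.out.println(String.format(Locale.ROOT, "%.1f new connections/s", rate));
    }
}
```

Under these assumptions the budget is roughly 17 new connections per
second, so a crawler that opens one connection per page at a much higher
rate would be expected to exhaust the range quickly.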

Under a Linux environment I did not encounter these problems. The design
questions remain open and I am looking forward to your feedback!

Some interesting numbers:

Hardware: dual-core 2x1.8 GHz, 2 GB RAM, S-ATA drive
Client: Java 6
Server: local with Apache HTTPD 2 (standard config)

I reach a maximum of 146 pages/second. Note: the crawler does heavy
string processing and strips out all HTML tags and the like. As I am
actually working on a search engine, I only want the plain text of a
website; the HTML, script, image and similar content is not relevant
for my search algorithms, so I delete it. You could cut back the string
processing, which would push the pages/second rate even further, but
then there would be more I/O because the files being written are larger.

With kind regards,
Tobi


Re: Parallel Webcrawler Implementation

by Ken Krugler

Hi Tobi,

First, I'd suggest getting and reading through the sources of existing  
Java-based web crawlers. They all use HttpClient, and thus would  
provide much useful example code:

Nutch (Apache)
Droids (Apache)
Heritrix (Archive)
Bixo (http://bixo.101tec.com)

Some comments below:

On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:

> First thing: Is the thread the right place to host the client  
> object, or should it be shared?

You should use the ThreadSafeClientConnManager, and reuse the same  
DefaultHttpClient instance for all threads.

See the init() method of Bixo's SimpleHttpFetcher class for an example  
of setting this up.
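
The idea of one client shared by all crawler threads can be sketched as
below. Note the swap: the advice above refers to Apache HttpClient 4.0
(a DefaultHttpClient built on a ThreadSafeClientConnManager), whereas
this sketch uses the JDK's built-in java.net.http.HttpClient (Java 11
and later) purely so that it runs without extra libraries. The class
name SharedFetcher and the 10 second timeout are assumptions for the
example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical sketch: a single client instance used by every crawler thread.
class SharedFetcher {
    // Built once; the JDK client can be used from many threads and
    // manages its own pool of reusable connections.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    /** Fetches the body of a page as text using the shared client. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Each crawler thread would then call SharedFetcher.fetch(url) instead of
creating and shutting down its own client per URL.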

> Second: Would it enhance performance if I reuse the connection  
> somehow?

Yes, via keep-alive. Though you then have to be a bit more careful  
about handling stale connections (ones that the server has shut down).

Again, take a look at the Bixo SimpleHttpFetcher class for some code  
that tries (at least) to do this properly.
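
One common way to cope with a stale kept-alive connection is to retry
the request when it fails with an I/O error, since the retry normally
goes out on a fresh connection. The sketch below shows only that retry
pattern; it is not taken from Bixo, and the names StaleRetry, Fetch and
withRetry are invented for the example. It is only reasonable for
idempotent requests such as GET.

```java
import java.io.IOException;

// Hypothetical sketch: retry a fetch that failed with an I/O error.
class StaleRetry {
    /** A fetch operation that may fail with an I/O error. */
    interface Fetch {
        String run() throws IOException;
    }

    /** Runs the fetch, retrying up to maxRetries times after an IOException. */
    static String withRetry(Fetch fetch, int maxRetries) throws IOException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must not be negative");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return fetch.run();
            } catch (IOException e) {
                // Possibly a reused connection that the server had already closed.
                last = e;
            }
        }
        throw last; // every attempt failed; report the most recent error
    }
}
```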

> And most important the bug: With increasing number of pages I  
> receive zillions of
>
> "java.net.BindException: Address already in use: connect"

No idea, sorry.

But I think that by default HttpClient limits the number of parallel
requests to one host to two. Not sure if that would be a factor in
your case, given how you're creating a new client for each request.

-- Ken



--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


Re: Parallel Webcrawler Implementation

by sebb-2-2

On 24/09/2009, Ken Krugler <kkrugler_lists@...> wrote:

> > And most important the bug: With increasing number of pages I receive
> > zillions of
> >
> > "java.net.BindException: Address already in use: connect"

I've seen this error generated when a WinXP host runs out of sockets,
i.e. the message is misleading in this case.



Re: Parallel Webcrawler Implementation

by olegk

sebb wrote:

>>> And most important the bug: With increasing number of pages I receive
>>> zillions of
>>> "java.net.BindException: Address already in use: connect"
>
> I've seen this error generated when a WinXP host runs out of sockets.
> i.e. the message is misleading in this case.

Which is hardly surprising given that the crawler creates a new
connection for EACH request / link.

Oleg


