Niocchi - java asynchronous crawl library released


Niocchi - java asynchronous crawl library released

by Lukáš Vlček

Hi,

I just noticed that Niocchi has been released recently.

Niocchi is a Java asynchronous crawl library implemented with NIO. It is designed to crawl several thousand hosts in parallel on a single low-end server. It is currently being used in production by Enormo to crawl thousands of websites daily, and by Vitalprix.

Regards,
Lukas

Re: Niocchi - java asynchronous crawl library released

by Andrzej Bialecki

Lukáš Vlček wrote:

> Hi,
>
> I just noticed that Niocchi has been released recently.
> http://www.niocchi.com/
>
> Niocchi is a java asynchronous crawl library implemented with NIO. It is
> designed to crawl several thousands of hosts in parallel on a single low
> end server. It is currently being used in production by Enormo
> <http://www.enormo.com/> to crawl thousands of websites daily, and
> by Vitalprix <http://www.vitalprix.com/>.

Well, of course we should optimize our use of resources, and we could
check what this library can offer - but I doubt that optimizations on
this level would bring significant benefits in terms of increased speed
of crawling - low-level IO handling is rarely the bottleneck. Most of
the time the politeness limits (max rate of requests per host) are the
bottleneck.
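
To illustrate the point about politeness: with a fixed minimum delay per host, each host yields at most one page per delay interval, however fast the I/O layer is. Below is a minimal sketch of such a per-host gate. The class name `PolitenessGate` and its method are hypothetical and are not taken from Nutch or Niocchi.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-host politeness gate; not the Nutch or Niocchi implementation. */
final class PolitenessGate {
    private final long minDelayMillis;
    // Earliest time (in millis) at which each host may be contacted again.
    private final ConcurrentHashMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Returns true and reserves the next slot if the host may be fetched at nowMillis;
     * returns false if the caller has to wait.
     */
    boolean tryAcquire(String host, long nowMillis) {
        boolean[] granted = {false};
        // compute() is atomic per key, so two threads cannot both win the same slot.
        nextAllowed.compute(host, (h, next) -> {
            if (next == null || nowMillis >= next) {
                granted[0] = true;
                return nowMillis + minDelayMillis;
            }
            return next;
        });
        return granted[0];
    }
}
```

With an assumed delay of 5 seconds per host, a crawl over 1,000 hosts is capped at 200 pages per second in total, regardless of how the sockets are handled.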


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: Niocchi - java asynchronous crawl library released

by Funtick

I like the architectural ideas behind Apache MINA (inspired by SEDA): CPU-intensive processing (such as parsing content) needs a single thread per CPU core, while I/O-bound work needs many more threads (waiting for responses from network sockets). It's not just NIO...
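
A rough sketch of that staged idea, using plain `java.util.concurrent` rather than MINA's own API: a large pool for I/O-bound fetching feeds a pool sized to the number of cores for parsing. The class `StagedCrawler`, the pool size of 64, and the placeholder `fetch` and `parse` methods are all assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical two-stage pipeline: a large I/O pool feeding a core-sized CPU pool. */
final class StagedCrawler implements AutoCloseable {
    // I/O-bound stage: threads mostly wait on sockets, so the pool is large (64 is an assumption).
    private final ExecutorService ioPool = Executors.newFixedThreadPool(64);
    // CPU-bound stage: one thread per available core.
    private final ExecutorService cpuPool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    /** Placeholder for a blocking network fetch; returns canned HTML. */
    private String fetch(String url) {
        return "<html><a href=\"" + url + "/next\">next</a></html>";
    }

    /** Placeholder for CPU-intensive parsing: counts anchor tags. */
    private int parse(String html) {
        return html.split("<a ", -1).length - 1;
    }

    /** Fetches on the I/O pool, then hands the content to the CPU pool for parsing. */
    CompletableFuture<Integer> crawl(String url) {
        return CompletableFuture.supplyAsync(() -> fetch(url), ioPool)
                .thenApplyAsync(this::parse, cpuPool);
    }

    @Override
    public void close() {
        ioPool.shutdown();
        cpuPool.shutdown();
    }
}
```

The point of the split is that blocked fetch threads never occupy a parsing slot, and parsing never spawns more runnable threads than there are cores.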

-Fuad

> From: Lukáš Vlček [mailto:lukas.vlcek@...]
> Sent: October-18-09 7:12 AM
> To: nutch-dev@...; droids-dev@...
> Subject: Niocchi - java asynchronous crawl library released
>
> Hi,
>
> I just noticed that Niocchi has been released recently.
>
> Niocchi is a java asynchronous crawl library implemented with NIO. It is
> designed to crawl several thousands of hosts in parallel on a single low
> end server. It is currently being used in production by Enormo to crawl
> thousands of websites daily, and by Vitalprix.
>
> Regards,
> Lukas


RE: Niocchi - java asynchronous crawl library released

by Funtick

Hi Andrzej,

The real bottleneck in Nutch is RegexURLNormalizer: it is still a synchronized singleton (shared by multiple threads). There are similar synchronized plugins, which should probably be refactored into Nutch core...

-Fuad


> Most of
> the time the politeness limits (max rate of requests per host) are the
> bottleneck.



Re: Niocchi - java asynchronous crawl library released

by Andrzej Bialecki

Fuad Efendi wrote:
> Hi Andrzej,
>
> Real bottleneck of Nutch is RegexURLNormalizer, it is still synchronized singleton (shared by multiple threads). And similar synchronized plugins which should be probably refactored to Nutch core...

It's not a singleton, but it's true that the normalize() method is
synchronized. Did you actually measure the impact of this
synchronization on the crawling speed? I very much doubt it outweighs
the impact of politeness limits.



RE: Niocchi - java asynchronous crawl library released

by Funtick


Hi Andrzej,

Yes, I measured and compared (two years ago). I am actually using simplified, rewritten code based on Nutch, with a non-synchronized instance per thread.

Imagine 1024 threads, each with 100 Outlinks, all trying to call a synchronized method: that is 102,400 concurrent calls to the synchronized method in total (within a window of about 3 seconds on average, given network delays). With the synchronized version I was even able to run 1024 concurrent threads without any performance impact! Also, each synchronization costs additional CPU cycles (500-1000) even when concurrency is low.

With the non-synchronized version, I can't run more than 128 threads because the CPU is overloaded. It runs faster.
-Fuad
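
A comparison like this could be reproduced with a small harness along the following lines. This is only a sketch: the normalization rule is a stand-in (it is not Nutch's RegexURLNormalizer), and the class and method names are made up for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

/** Hypothetical harness; the normalizer below is a stand-in, not Nutch's RegexURLNormalizer. */
final class NormalizerContention {
    private static final Pattern DOUBLE_SLASH = Pattern.compile("(?<!:)/{2,}");

    /** Stand-in rule: collapse repeated slashes, except right after the scheme's colon. */
    static String normalize(String url) {
        return DOUBLE_SLASH.matcher(url).replaceAll("/");
    }

    /** One shared normalizer guarded by a single lock, as with a synchronized method. */
    static UnaryOperator<String> sharedSynchronized() {
        Object lock = new Object();
        return url -> { synchronized (lock) { return normalize(url); } };
    }

    /** Runs `threads` workers making `callsPerThread` calls each; returns elapsed nanoseconds. */
    static long run(int threads, int callsPerThread,
                    Supplier<UnaryOperator<String>> perThread, AtomicLong completed) {
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            UnaryOperator<String> normalizer = perThread.get();
            new Thread(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    normalizer.apply("http://example.com//a///b");
                    completed.incrementAndGet();
                }
                done.countDown();
            }).start();
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }
}
```

Passing a supplier that always returns the same `sharedSynchronized()` instance gives the contended case; passing `() -> NormalizerContention::normalize` gives the lock-free case. Timings from such a micro-benchmark depend heavily on the JVM and hardware, so they are only a rough indication.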


> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@...]
> Sent: October-19-09 5:47 AM
> To: nutch-dev@...
> Subject: Re: Niocchi - java asynchronous crawl library released
>
> Fuad Efendi wrote:
> > Hi Andrzej,
> >
> > Real bottleneck of Nutch is RegexURLNormalizer, it is still synchronized
> > singleton (shared by multiple threads). And similar synchronized plugins which
> > should be probably refactored to Nutch core...
>
> It's not a singleton, but it's true that the normalize() method is
> synchronized. Did you actually measure the impact of this
> synchronization on the crawling speed? I very much doubt it outweighs
> the impact of politeness limits.




Re: Niocchi - java asynchronous crawl library released

by Andrzej Bialecki

Fuad Efendi wrote:
> Hi Andrzej,
>
> Yes, I measured/compared (two years ago), I am actually using
> simplified rewritten code based on Nutch, with non-synchronized
> instance per thread.

This was probably based on the original Fetcher code (now
OldFetcher.java) - the new Fetcher uses threads very differently.

>
> Imagine 1024 threads, each having 100 Outlinks and trying to call
> synchronized method... total 102,400 concurrent calls to synchronized
> method (during, in average (network delays), 3-seconds frame)... I
> was even able to have 1024 concurrent threads without any performance
> impact! Also, each synchronization requires additional CPU cycles
> (500-1000) even when concurrency is small.
>
> With non-synchronized, I can't have more than 128 threads - CPU
> overloaded. It run faster. -Fuad

Ok, sounds cool - could you prepare a patch for the RegexURLNormalizer
that removes this problem?




RE: Niocchi - java asynchronous crawl library released

by Funtick

> Ok, sounds cool - could you prepare a patch for the RegexURLNormalizer
> that removes this problem?

At least I can try :)
Leaving it as a plugin means I'll need to use ThreadLocal or something...
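
The ThreadLocal approach could look roughly like this. It is a sketch only: `PerThreadNormalizer` and the nested `Normalizer` are hypothetical stand-ins, not the actual Nutch plugin classes, and the regex rule is just an example.

```java
import java.util.regex.Pattern;

/** Hypothetical per-thread normalizer holder; not the actual Nutch plugin code. */
final class PerThreadNormalizer {
    /** Stand-in for a normalizer with unsynchronized mutable state. */
    static final class Normalizer {
        private final Pattern doubleSlash = Pattern.compile("(?<!:)/{2,}");
        private int calls; // mutable state, safe only because each thread has its own instance

        String normalize(String url) {
            calls++;
            return doubleSlash.matcher(url).replaceAll("/");
        }

        int calls() {
            return calls;
        }
    }

    // Each thread lazily gets its own instance on first access, so no lock is needed.
    private static final ThreadLocal<Normalizer> LOCAL = ThreadLocal.withInitial(Normalizer::new);

    static Normalizer get() {
        return LOCAL.get();
    }
}
```

Callers would use `PerThreadNormalizer.get().normalize(url)` instead of a shared instance. The trade-off is one normalizer (and its compiled patterns) per thread, which costs memory when there are many threads.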