Niocchi - java asynchronous crawl library released

Hi,

I just noticed that Niocchi has been released recently.
http://www.niocchi.com/

Niocchi is a Java asynchronous crawl library implemented with NIO. It is designed to crawl several thousand hosts in parallel on a single low-end server. It is currently being used in production by Enormo <http://www.enormo.com/> to crawl thousands of websites daily, and by Vitalprix <http://www.vitalprix.com/>.

Regards,
Lukas

Re: Niocchi - java asynchronous crawl library released

Lukáš Vlček wrote:
> Hi,
>
> I just noticed that Niocchi has been released recently.
> http://www.niocchi.com/
>
> Niocchi is a java asynchronous crawl library implemented with NIO. It is
> designed to crawl several thousands of hosts in parallel on a single low
> end server. It is currently being used in production by Enormo
> <http://www.enormo.com/> to crawl thousands of websites daily, and
> by Vitalprix <http://www.vitalprix.com/>.

Well, of course we should optimize our use of resources, and we could check what this library can offer - but I doubt that optimizations on this level would bring significant benefits in terms of increased speed of crawling - low-level IO handling is rarely the bottleneck. Most of the time the politeness limits (max rate of requests per host) are the bottleneck.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
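
To make the politeness point concrete, here is a minimal sketch of a per-host delay gate. It is not Nutch or Niocchi code: the class name PolitenessGate, its methods, and the 5-second delay used in the note below are all assumptions chosen for illustration. It shows why the fetch rate is capped by the number of distinct hosts and the per-host delay, however fast the I/O layer is.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host politeness gate; an illustration, not Nutch or Niocchi code. */
final class PolitenessGate {
    private final long minDelayMillis;
    // Earliest time (in millis) at which each host may be fetched again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /** Returns true and reserves the next slot if the host may be fetched at nowMillis. */
    synchronized boolean tryAcquire(String host, long nowMillis) {
        Long next = nextAllowed.get(host);
        if (next != null && nowMillis < next) {
            return false; // too soon for this host, regardless of spare I/O capacity
        }
        nextAllowed.put(host, nowMillis + minDelayMillis);
        return true;
    }

    /** Upper bound on the overall fetch rate, in pages per second. */
    double maxPagesPerSecond(int distinctHosts) {
        return distinctHosts * 1000.0 / minDelayMillis;
    }
}
```

With an assumed delay of 5000 ms, ten hosts allow at most two pages per second in total, so a faster socket layer only helps once the host list is wide enough.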

RE: Niocchi - java asynchronous crawl library released

I like the architectural ideas behind Apache MINA (inspired by SEDA): for some (CPU-intensive) processing, such as parsing of content, we need a single thread per CPU core; for other (I/O-bound) work we need many more threads (waiting for a response from a network socket). It's not just NIO...

-Fuad

From: Lukáš Vlček [mailto:lukas.vlcek@...]

> Hi,
>
> I just noticed that Niocchi has been released recently.
>
> Niocchi is a java asynchronous crawl library implemented with NIO. It is
> designed to crawl several thousands of hosts in parallel on a single low
> end server. It is currently being used in production by Enormo to crawl
> thousands of websites daily, and by Vitalprix.
>
> Regards,
> Lukas
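
The split between I/O-bound and CPU-bound work described above can be sketched with two plain java.util.concurrent pools. This is neither MINA nor SEDA code. The class StagedCrawler, the injected fetcher function, and the word-count "parse" step are stand-ins invented for the example. The only point is the sizing: a large pool for threads that mostly wait on the network, and one thread per core for parsing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical two-stage pipeline: many I/O threads, one parser thread per core. */
final class StagedCrawler {
    private final ExecutorService fetchPool; // I/O bound: threads mostly wait on sockets
    private final ExecutorService parsePool; // CPU bound: sized to the number of cores
    private final Function<String, String> fetcher; // stand-in for a real HTTP fetch

    StagedCrawler(int fetchThreads, Function<String, String> fetcher) {
        this.fetchPool = Executors.newFixedThreadPool(fetchThreads);
        this.parsePool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        this.fetcher = fetcher;
    }

    /** Fetches on the I/O pool, then hands the content to the CPU pool for parsing. */
    CompletableFuture<Integer> countWords(String url) {
        return CompletableFuture
                .supplyAsync(() -> fetcher.apply(url), fetchPool)
                .thenApplyAsync(StagedCrawler::parse, parsePool);
    }

    /** Toy "parser": counts whitespace-separated tokens. */
    static int parse(String content) {
        String trimmed = content.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    void shutdown() {
        fetchPool.shutdown();
        parsePool.shutdown();
    }
}
```

A caller would construct it with something like new StagedCrawler(64, myFetch), where 64 is an arbitrary figure; the right number of fetch threads depends on network latency and politeness limits.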

RE: Niocchi - java asynchronous crawl library released

Hi Andrzej,

The real bottleneck of Nutch is RegexURLNormalizer: it is still a synchronized singleton (shared by multiple threads). There are similar synchronized plugins, which should probably be refactored into the Nutch core...

-Fuad

> Most of the time the politeness limits (max rate of requests per host)
> are the bottleneck.

Re: Niocchi - java asynchronous crawl library released

Fuad Efendi wrote:
> Hi Andrzej,
>
> Real bottleneck of Nutch is RegexURLNormalizer, it is still synchronized
> singleton (shared by multiple threads). And similar synchronized plugins
> which should be probably refactored to Nutch core...

It's not a singleton, but it's true that the normalize() method is synchronized. Did you actually measure the impact of this synchronization on the crawling speed? I very much doubt it outweighs the impact of politeness limits.

--
Best regards,
Andrzej Bialecki

RE: Niocchi - java asynchronous crawl library released

Hi Andrzej,

Yes, I measured and compared (two years ago). I am actually using simplified, rewritten code based on Nutch, with a non-synchronized instance per thread.

Imagine 1024 threads, each having 100 Outlinks and trying to call a synchronized method... a total of 102,400 concurrent calls to the synchronized method (within a frame of, on average, 3 seconds, given network delays)... I was even able to have 1024 concurrent threads without any performance impact! Also, each synchronization requires additional CPU cycles (500-1000) even when concurrency is small.

With non-synchronized, I can't have more than 128 threads - CPU overloaded. It runs faster.

-Fuad

> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@...]
> Sent: October-19-09 5:47 AM
> To: nutch-dev@...
> Subject: Re: Niocchi - java asynchronous crawl library released
>
> Fuad Efendi wrote:
> > Hi Andrzej,
> >
> > Real bottleneck of Nutch is RegexURLNormalizer, it is still synchronized
> > singleton (shared by multiple threads). And similar synchronized plugins
> > which should be probably refactored to Nutch core...
>
> It's not a singleton, but it's true that the normalize() method is
> synchronized. Did you actually measure the impact of this
> synchronization on the crawling speed? I very much doubt it outweighs
> the impact of politeness limits.

Re: Niocchi - java asynchronous crawl library released

Fuad Efendi wrote:
> Hi Andrzej,
>
> Yes, I measured/compared (two years ago), I am actually using
> simplified rewritten code based on Nutch, with non-synchronized
> instance per thread.

This was probably based on the original Fetcher code (now OldFetcher.java) - the new Fetcher uses threads very differently.

> Imagine 1024 threads, each having 100 Outlinks and trying to call
> synchronized method... total 102,400 concurrent calls to synchronized
> method (during, in average (network delays), 3-seconds frame)... I
> was even able to have 1024 concurrent threads without any performance
> impact! Also, each synchronization requires additional CPU cycles
> (500-1000) even when concurrency is small.
>
> With non-synchronized, I can't have more than 128 threads - CPU
> overloaded. It run faster. -Fuad

Ok, sounds cool - could you prepare a patch for the RegexURLNormalizer that removes this problem?

--
Best regards,
Andrzej Bialecki

RE: Niocchi - java asynchronous crawl library released

> Ok, sounds cool - could you prepare a patch for the RegexURLNormalizer
> that removes this problem?

At least I can try :) Leaving it as a plugin means I'll need to use ThreadLocal or something...
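
For readers unfamiliar with the ThreadLocal idea mentioned above, here is a rough sketch of what a per-thread normalizer could look like. It is not the actual RegexURLNormalizer and not a proposed patch. The classes PerThreadNormalizer and Rules, and the single jsessionid rule, are invented for illustration, on the assumption that the real normalizer holds mutable state that is unsafe to share between threads.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of one normalizer instance per thread, with no lock. */
final class PerThreadNormalizer {

    /** Stand-in for a stateful normalizer that is not safe to share across threads. */
    static final class Rules {
        // Compiled patterns are immutable and may be shared freely.
        private static final Pattern SESSION_ID =
                Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);
        // Mutable scratch buffer: the reason a single instance must not be shared.
        private final StringBuilder scratch = new StringBuilder();

        String apply(String url) {
            scratch.setLength(0);
            scratch.append(SESSION_ID.matcher(url).replaceAll(""));
            return scratch.toString();
        }
    }

    // Each thread lazily creates its own Rules the first time it calls normalize().
    private static final ThreadLocal<Rules> RULES = ThreadLocal.withInitial(Rules::new);

    /** Unsynchronized: every caller works on its own thread-confined instance. */
    static String normalize(String url) {
        return RULES.get().apply(url);
    }
}
```

The trade-off is memory: each fetcher thread keeps its own copy of the rules, which is usually cheap compared with threads queuing on a single lock.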