crawl always stops at depth=3

My crawl always stops at depth=3. It gets documents but does not continue any further.
Here is my nutch-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(crawl|regex)|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>1000</value>
  </property>
</configuration>
Re: crawl always stops at depth=3

Try:

bin/nutch readdb crawl/crawldb -stats

Are there any unfetched pages?

nutchcase wrote:
> My crawl always stops at depth=3. It gets documents but does not continue any
> further.
Re: crawl always stops at depth=3

Here is the output from that:

TOTAL urls: 297
retry 0: 297
min score: 0.0
avg score: 0.023377104
max score: 2.009
status 2 (db_fetched): 295
status 5 (db_redir_perm): 2
Re: crawl always stops at depth=3

The crawler has stopped fetching because all URLs are already fetched. There are no unfetched URLs left. Do you expect to have more URLs fetched?

Either you need more seed URLs or you need to change your URL filters. The default Nutch URL filter configuration excludes the deep web, that is, every URL with a query part (?).
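The check described above (are there any unfetched pages left in the crawldb?) can be sketched in a few lines of Python. This is only an illustration, not part of Nutch: the helper name `status_counts` is made up here, the input is the `-stats` output exactly as posted in this thread, and it assumes that unfetched entries would show up on a `status ... (db_unfetched)` line.

```python
import re

# Output of `bin/nutch readdb crawl/crawldb -stats`, as posted in this thread.
STATS = """\
TOTAL urls: 297
retry 0: 297
min score: 0.0
avg score: 0.023377104
max score: 2.009
status 2 (db_fetched): 295
status 5 (db_redir_perm): 2
"""

def status_counts(stats_text):
    """Return {status_name: count} for every 'status N (name): count' line."""
    counts = {}
    for line in stats_text.splitlines():
        m = re.match(r"\s*status\s+\d+\s+\((\w+)\):\s+(\d+)", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

counts = status_counts(STATS)
# Assumption: unfetched URLs would be reported under the name "db_unfetched".
unfetched = counts.get("db_unfetched", 0)
print(counts)
print("unfetched:", unfetched)
```

With the numbers from this thread there is no unfetched entry at all, which matches the explanation that the generator simply has nothing left to schedule for another round.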
Re: crawl always stops at depth=3

Right, I have commented that part of the filter out and it gets urls with queries, but only to a depth of 3. Here is my url filter:

-^(https|telnet|file|ftp|mailto):

# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# allow urls in foofactory.fi domain
+^http://([a-z0-9\-A-Z]*\.)*.foo.com/

# deny anything else
#-.
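One way to debug a filter like this outside of Nutch is to replay its rules against a few sample URLs. The sketch below is a rough Python approximation, not Nutch code. It assumes the regex URL filter applies the first matching rule and rejects a URL that matches no rule, and that Python's `re` treats these particular patterns the same way Java's regex engine does. The suffix list is abbreviated, the sample URLs are invented, and `foo.com` is evidently a placeholder, so the real pattern may behave differently.

```python
import re

# Rules as posted above; the suffix list is abbreviated here for readability.
POSTED_FILTER = r"""
-^(https|telnet|file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|css|js|JS|pdf|PDF)$
#-[?*!@=]
+^http://([a-z0-9\-A-Z]*\.)*.foo.com/
#-.
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if line[:1] in ("+", "-"):  # skips blank lines and '#' comments
            rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Assumed semantics: first matching rule decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

rules = parse_rules(POSTED_FILTER)

# Hypothetical variant: drop the unescaped '.' in front of 'foo' and escape
# the dot in 'foo.com', so the host part must be foo.com or a subdomain of it.
variant = parse_rules(POSTED_FILTER.replace(r")*.foo.com/", r")*foo\.com/"))

for url in [
    "http://www.foo.com/page?id=1",
    "http://www.xfoo.com/page?id=1",
    "https://www.foo.com/",
    "http://www.xfoo.com/logo.png",
]:
    print("posted:", accepts(rules, url), "variant:", accepts(variant, url), url)
```

Under these assumptions the unescaped `.` in front of `foo.com` matches any single character, so the pattern as written requires one extra character before `foo`. If the real filter has the same shape, comparing it against the URLs that appear in the segment dump might show which outlinks are silently dropped.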
Re: crawl always stops at depth=3

And you are missing some URLs that should have been crawled? Which ones?

With

bin/nutch readdb crawl/crawldb -dump <some directory>

you can dump the content of the crawl db into a readable format. There you will see the status and the next fetch time of each URL.

With

bin/nutch readseg -dump crawl/segments/<segment_dir> <output_dir>

you can dump a segment into a readable format and see which links have been extracted.

nutchcase wrote:
> Right, I have commented that part of the filter out and it gets urls with
> queries, but only to a depth of 3.
>
> reinhard schwab wrote:
>> the crawler has stopped fetching because all urls are already fetched.
>> there are no unfetched urls left.
Re: crawl always stops at depth=3

All the URLs that are queued are crawled. The problem is that it doesn't look further than depth 3 for URLs, so anything below that depth doesn't end up in the segments. If I disable URL filtering completely by removing it from nutch-site.xml, it gets too many URLs, so I guess it is a problem with my filter definition. I just can't seem to get the filter right.