Nutch indexes less pages, then it fetches

Hi All,

I've got a strange problem: Nutch indexes far fewer URLs than it fetches. Take, for example, the URL http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm. I assume it was fetched successfully, because the fetch log mentions it only once:

    2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

But it was not sent to the indexer during the indexing phase (I'm using a custom NutchIndexWriter, and it logs every page for which its write method is executed). What could be the reason? Is there a way to browse the crawldb to make sure the page was really fetched? What else could I check?

Thanks
|
|
Re: Nutch indexes less pages, then it fetches

Check the parse data first; maybe the parse was unsuccessful.
|
|
Re: Nutch indexes less pages, then it fetches

I have had a similar experience.

Reinhard Schwab suggested a possible fix. See his mail in this group from Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT). I haven't had a chance to try it out yet.
|
|
Re: Nutch indexes less pages, then it fetches

What is the db status of this URL in your crawl db? If it is STATUS_DB_NOTMODIFIED, that may be the reason. You can check it by reading your crawl db with:

    reinhard@thord:>bin/nutch readdb <crawldb> -url <url>

A URL has this status if it is recrawled and its signature does not change. The signature is an MD5 hash of the content.

Another reason may be that you have some indexing filters, but I don't believe that is the reason here.

Regards
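To make the signature check described above concrete, here is a minimal sketch of comparing MD5 hashes of page content across two fetches. It is not Nutch's own signature code: the class name `SignatureSketch` and the helpers `md5Hex` and `unchanged` are hypothetical, and the sketch assumes the signature is computed over the raw content bytes only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration, not part of Nutch.
final class SignatureSketch {

    /** Returns the hex-encoded MD5 digest of the raw page content. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** True when the refetched content hashes to the previously stored signature. */
    static boolean unchanged(String previousSignature, byte[] newContent) {
        return previousSignature != null
                && previousSignature.equals(md5Hex(newContent));
    }

    public static void main(String[] args) {
        byte[] first = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        byte[] refetched = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        byte[] edited = "<html>edited page</html>".getBytes(StandardCharsets.UTF_8);

        String stored = md5Hex(first);
        // Identical bytes give an identical signature; any change gives a different one.
        System.out.println("refetched unchanged: " + unchanged(stored, refetched));
        System.out.println("edited unchanged:    " + unchanged(stored, edited));
    }
}
```

A page whose refetched content yields the same hash would be the "not modified" case mentioned above; whether it is then skipped at indexing time depends on the Nutch version and configuration.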
|
|
Re: Nutch indexes less pages, then it fetches

Sorry, but how could I do this?
|
|
|
Re: Nutch indexes less pages, then it fetches

Thanks, that was really helpful. I've moved forward but still have not found the solution.

The status of the initial URL (http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm) is:

    Status: 5 (db_redir_perm)
    Metadata: _pst_: moved(12), lastModified=0: http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm

So that answers the question of why the initial page was not indexed: it was redirected. Now checking the status of the redirect target:

    Status: 2 (db_fetched)

So it was successfully fetched. But according to the indexing log, it still was not sent to the indexer!
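The manual check in this post (look up the source URL, read the redirect target from its metadata, then look up the target) can be sketched as a small lookup loop. This is only an illustration over an in-memory map: the `RedirectSketch` class, its `Entry` type, and the `resolve` helper are hypothetical and do not read a real crawl db; the status strings are the ones shown in the output above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not part of Nutch.
final class RedirectSketch {

    /** Minimal stand-in for one crawl db record: a status and an optional redirect target. */
    static final class Entry {
        final String status;
        final String redirectTarget; // null when the URL is not redirected

        Entry(String status, String redirectTarget) {
            this.status = status;
            this.redirectTarget = redirectTarget;
        }
    }

    /** Follows redirect targets until a URL without one is reached or a URL repeats. */
    static String resolve(Map<String, Entry> db, String url) {
        Set<String> seen = new HashSet<>();
        String current = url;
        while (seen.add(current)) {
            Entry entry = db.get(current);
            if (entry == null || entry.redirectTarget == null) {
                return current;
            }
            current = entry.redirectTarget;
        }
        return current; // redirect loop: stop at the first repeated URL
    }

    public static void main(String[] args) {
        String source = "http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm";
        String target = "http://www.1stdirectory.com/Companies/1627406_Darwins_Catering_Limited.htm";

        Map<String, Entry> db = new HashMap<>();
        db.put(source, new Entry("db_redir_perm", target));
        db.put(target, new Entry("db_fetched", null));

        String finalUrl = resolve(db, source);
        System.out.println(finalUrl + " -> " + db.get(finalUrl).status);
    }
}
```

The point of the sketch is that the URL to investigate at indexing time is the final redirect target, not the URL that was originally generated.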
|
|
|
Re: Nutch indexes less pages, then it fetches

Yes, it's permanently redirected. You can also check the segment status of this URL. Here is an example:

    reinhard@thord:>bin/nutch readseg -get crawl/segments/20091028122455 "http://www.krems.at/fotoalbum/fotoalbum.asp?albumid=37&big=1&seitenid=20"

It will show you whether the page was parsed, along with the extracted outlinks. It shows any data related to this URL that is stored in the segment.

Regards
|
|
Re: Nutch indexes less pages, then it fetches

Thanks, I checked; it was parsed. Still no answer as to why it was not indexed.
|
|
|
Re: Nutch indexes less pages, then it fetches

Hmm, I have no idea now. Check the reduce method in IndexerMapReduce and add some debug statements there. Recompile Nutch and try it again.
|
|
Re: Nutch indexes less pages, then it fetches

In IndexerMapReduce.reduce there is this code:

    if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
        CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
      continue;
    }

And the status of the redirect target URL really is linked. That's why it's skipped. But what does this status mean?
|
|
|
Re: Nutch indexes less pages, then it fetches

Some more information. Debugging the reduce method, I noticed that just before this code:

    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }

my page has fetchDatum, parseText and parseData not null, but dbDatum is null. That's why it's skipped :) Any ideas about the reason?
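The two conditions quoted in this thread can be modeled in isolation to show why a page with a missing db record never reaches the index writer. The sketch below is a simplified stand-in, not the Nutch source: the `ReduceSketch` class and its `Kind` enum are hypothetical, and how Nutch actually assigns each value to `dbDatum` or `fetchDatum` is assumed rather than taken from the code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Nutch.
final class ReduceSketch {

    /** Stand-ins for the kinds of values the reducer may receive for one URL. */
    enum Kind { DB_DATUM, FETCH_DATUM, LINKED, SIGNATURE, PARSE_TEXT, PARSE_DATA }

    /** Mirrors the quoted checks: skip linked/signature values, then require all four parts. */
    static boolean wouldIndex(List<Kind> valuesForUrl) {
        boolean dbDatum = false;
        boolean fetchDatum = false;
        boolean parseText = false;
        boolean parseData = false;

        for (Kind kind : valuesForUrl) {
            switch (kind) {
                case DB_DATUM:    dbDatum = true;    break;
                case FETCH_DATUM: fetchDatum = true; break;
                case PARSE_TEXT:  parseText = true;  break;
                case PARSE_DATA:  parseData = true;  break;
                default:          break; // LINKED and SIGNATURE contribute nothing
            }
        }
        // Equivalent to the quoted early return when any of the four is null.
        return fetchDatum && dbDatum && parseText && parseData;
    }

    public static void main(String[] args) {
        List<Kind> normalPage = Arrays.asList(
                Kind.DB_DATUM, Kind.FETCH_DATUM, Kind.PARSE_TEXT, Kind.PARSE_DATA);
        List<Kind> redirectTarget = Arrays.asList(
                Kind.LINKED, Kind.FETCH_DATUM, Kind.PARSE_TEXT, Kind.PARSE_DATA);

        System.out.println("normal page indexed:     " + wouldIndex(normalPage));
        System.out.println("redirect target indexed: " + wouldIndex(redirectTarget));
    }
}
```

In this model the redirect target has everything except the db record, so the final check fails exactly as described in the post.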
|
|
|
Re: Nutch indexes less pages, then it fetches

caezar wrote:
> my page has fetchDatum, parseText and parseData not null, but dbDatum is
> null. Thats why it's skipped :)
> Any ideas about the reason?

Yes - you should run updatedb with this segment, and also run invertlinks with this segment, _before_ trying to index. Otherwise the db status won't be updated properly.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
|
|
Re: Nutch indexes less pages, then it fetches

I'm pretty sure that I ran both commands before indexing.
|
|
|
Re: Nutch indexes less pages, then it fetches

I've compared the segment data of a URL that has no redirect and was indexed correctly with that of this "bad" URL, and there really is a difference. The first one has a db record in the segment:

    Crawl Generate::
    Version: 7
    Status: 1 (db_unfetched)
    Fetch time: Wed Oct 28 16:01:05 EET 2009
    Modified time: Thu Jan 01 02:00:00 EET 1970
    Retries since fetch: 0
    Retry interval: 2592000 seconds (30 days)
    Score: 1.0
    Signature: null
    Metadata: _ngt_: 1256738472613

The second one has no such record, which seems reasonable: it was not added to the segment at the generate stage; it was added at the fetch stage. Is this a bug in Nutch? Or am I missing some configuration option?
|
|
|
Re: Nutch indexes less pages, then it fetches

Is your problem solved now?

This can be OK. Newly discovered URLs are added to a segment when fetched documents are parsed, provided these URLs pass the filters. They will not have a Generate crawl datum, because they are unknown until they are extracted.

Regards
|
|
Re: Nutch indexes less pages, then it fetches

No, the problem is not solved. Everything happens as you described, but the page is not indexed because of this condition in the IndexerMapReduce code:

    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      return;                                     // only have inlinks
    }

For this page dbDatum is null, so it is not indexed!
|
|
|
Re: Nutch indexes less pages, then it fetches

What is in the crawl db?

    reinhard@thord:>bin/nutch readdb <crawldb> -url <url>
|
|
Re: Nutch indexes less pages, then it fetches

Status: 5 (db_redir_perm) for the redirect source
and Status: 2 (db_fetched) for the redirect target
|
|
|
Re: Nutch indexes less pages, then it fetches

Does anybody know how to solve this problem?
|
|
|
Re: Nutch indexes less pages, then it fetches

I've solved this problem by modifying the Nutch code. If this solution is acceptable for you, I can provide the details.
|