Show db_gone in crawlDB

Show db_gone in crawlDB

by schroedi

Hi Nutch Guys,

I already know how to show the crawldb stats (output below). Now I want
to see which URLs are db_gone (meaning a 404 error or some other failure).
How can I list the db_gone URLs?

bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:    2157
retry 0:    2154
retry 5:    3
min score:    0.0
avg score:    0.018363468
max score:    3.01
status 1 (db_unfetched):    1971
status 2 (db_fetched):    158
status 3 (db_gone):    13
status 4 (db_redir_temp):    1
status 5 (db_redir_perm):    14
CrawlDb statistics: done

thanks,

Mario

--

Mario Schröder | http://www.finanz-checks.de
Office: +49 361 2152062
Phone: +49 34464 62301 Cell: +49 163 27 09 807
http://www.xing.com/go/invite/6035007.9c143c


Re: Show db_gone in crawlDB

by Xiangjun(XJ) Wang

bin/nutch readdb <crawldb> -dump <out_dir> -format csv
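
The dump contains every record, not only the gone ones, so it still has to be filtered. A minimal sketch of that step, assuming the dump directory holds plain-text part files (for example part-00000) with one record per line and the status name written out on that line; the function name list_gone and the crawl/dump path are made up for illustration, and the exact column layout may differ between Nutch versions:

```shell
#!/bin/sh
# Print every record in a readdb dump whose line mentions db_gone.
# Assumption: <dump_dir>/part-* are plain-text files, one record per line,
# and each line carries the status name (db_fetched, db_gone, ...).
list_gone() {
    dump_dir=$1
    # -h suppresses file names so the output is just the matching records
    grep -h 'db_gone' "$dump_dir"/part-*
}

# Typical use, after running:
#   bin/nutch readdb crawl/crawldb -dump crawl/dump -format csv
# list_gone crawl/dump
```

For the crawldb in the question this should print 13 records, matching the "status 3 (db_gone)" count from -stats, provided the assumptions about the dump layout hold.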

schroedi wrote:

> Hi Nutch Guys,
>
> I already know how to show the crawldb stats (output below). Now I want
> to see which URLs are db_gone (meaning a 404 error or some other failure).
> How can I list the db_gone URLs?
>
> bin/nutch readdb crawl/crawldb -stats
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:    2157
> retry 0:    2154
> retry 5:    3
> min score:    0.0
> avg score:    0.018363468
> max score:    3.01
> status 1 (db_unfetched):    1971
> status 2 (db_fetched):    158
> status 3 (db_gone):    13
> status 4 (db_redir_temp):    1
> status 5 (db_redir_perm):    14
> CrawlDb statistics: done
>
> thanks,
>
> Mario