Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Alexis Votta

The recrawl script for Nutch 0.9 that I found at
http://wiki.apache.org/nutch/IntranetRecrawl is not working. The first
run succeeds, but the second run fails with this error:

merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory crawl/index already exists!
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
        at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)

I am trying this with the latest version available in trunk. Please
help me to rectify this.

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Jeff Van Boxtel

hmm yea, in my case I merge the indexes into a temp dir, then I delete
the existing index dir and move mine in there:
 
rm -rf crawl/MERGEDindexes
$NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes
 
# in nutch-site, hadoop.tmp.dir points to crawl/tmp
rm -rf crawl/tmp/*
 
# we have to stop tomcat because sometimes it is still accessing the index file
sudo /etc/init.d/tomcat5.5 stop
 
# replace indexes with indexes_merged
rm -rf crawl/OLDindexes
mv --verbose crawl/index crawl/OLDindexes
mv --verbose crawl/MERGEDindexes crawl/index

echo "----- Restarting Tomcat (Step 10 of $steps) -----"
sudo /etc/init.d/tomcat5.5 start
 
I also found I have to stop the tomcat service otherwise I can't delete
the index files. You may not need to do this if you aren't using
Tomcat.
 
-Jeff



Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Alexis Votta

Hi Jeff...

Your block of code comes from the Nutch 0.9 crawl script, which is
covered in a different article: http://wiki.apache.org/nutch/Crawl
My problem is with the Nutch 0.9 recrawl script, which I found in this
article: http://wiki.apache.org/nutch/IntranetRecrawl

Even if I follow your approach, I lose the index of the previous crawl.
You are merging only the new indexes in this line:

$NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes

I want to merge the new indexes with the old index, which is what the
Nutch 0.9 recrawl script tries to do, but it fails with this error:

merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory crawl/index already exists!
       at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
       at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
       at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
       at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)

Has anyone used the re-crawl script successfully with trunk? Does it
work for nutch 0.9?


Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Lyndon Maydwell

I've been having problems with the merge portion of the script too.

My solution was to check the success status of the merge ( $? ), and
if it failed, try again, or wait until next time.

# merge all segments under $segments into $merged_segment
nutch_bin/nutch mergesegs "$merged_segment" -dir "$segments"
if [ $? -ne 0 ]
then
        echo "merging segments failed, let's abort now in case of large failure"
        echo "this was the main sticking point for some reason"
        echo "removing the merged segment in case of fire"

        rm -r "$merged_segment"
        # exit non-zero so that callers can tell the merge failed
        exit 1
else
        # replace the old segments with the merged one
        rm -r "$segments"/*
        mv "$merged_segment"/* "$segments"/
        rm -r "$merged_segment"
fi
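
The "try again" part could be wrapped in a small helper. This is only a sketch: retry is a hypothetical function name, and the attempt count and delay are assumptions, not something taken from the wiki recrawl script.

```shell
#!/bin/sh
# retry MAX DELAY CMD [ARGS...]
# Runs CMD up to MAX times, sleeping DELAY seconds after each failed attempt.
# Returns 0 on the first success, otherwise the status of the last failure.
# (Hypothetical helper; not part of Nutch or the wiki script.)
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_n=1
    while :; do
        "$@" && return 0
        retry_status=$?
        if [ "$retry_n" -ge "$retry_max" ]; then
            return "$retry_status"
        fi
        retry_n=$((retry_n + 1))
        sleep "$retry_delay"
    done
}

# Example call (not run here):
#   retry 3 60 nutch_bin/nutch mergesegs "$merged_segment" -dir "$segments"
```

A failed mergesegs may leave a partial $merged_segment behind, so in practice the retried command would probably need to remove it before each attempt.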

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Tomislav Poljak

Hi,
I had the same problem using the re-crawl scripts from the wiki. They all
work fine with Nutch versions up to and including 0.9, but with
nutch-1.0-dev (from trunk) they break at the index merge. The reason is
that in nutch-0.9 the merge command from the re-crawl scripts:

bin/nutch merge crawl/indexes crawl/NEWindexes

merged the old indexes from crawl/indexes with the new indexes from
crawl/NEWindexes and stored the result in crawl/indexes. With
nutch-1.0-dev (from trunk), however, merge requires a new output folder
that does not exist yet.

A solution that works (I have tried it) is the following:

bin/nutch merge crawl/index crawl/indexes crawl/NEWindexes

where crawl/index is the new (output) folder, crawl/indexes holds the old
indexes and crawl/NEWindexes holds the new indexes. Note that you can
merge as many indexes as you want this way (one per re-crawl); you only
have to run:

bin/nutch merge crawl/index crawl/indexes1 crawl/indexes2 ...

but crawl/index must not exist beforehand (delete it or back it up).
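
The "delete it or back it up" step could be scripted roughly like this. It is a sketch under assumptions: the NUTCH variable, the merge_into_fresh function name and the .bak suffix are made up for illustration; only the merge command line itself comes from this thread.

```shell
#!/bin/sh
# Sketch: merge part indexes into an output directory that does not exist
# yet, since the merge in trunk refuses to write into an existing one.
# NUTCH and the ".bak" naming are assumptions for illustration.
NUTCH="${NUTCH:-bin/nutch}"

# merge_into_fresh OUTPUT INDEX_DIR...
merge_into_fresh() {
    mif_out=$1
    shift
    # Move any previous merged index aside so the output path is free.
    if [ -e "$mif_out" ]; then
        rm -rf "$mif_out.bak"
        mv "$mif_out" "$mif_out.bak"
    fi
    if "$NUTCH" merge "$mif_out" "$@"; then
        return 0
    fi
    # Merge failed: drop partial output and restore the backup, if any.
    rm -rf "$mif_out"
    if [ -e "$mif_out.bak" ]; then
        mv "$mif_out.bak" "$mif_out"
    fi
    return 1
}

# Example call (not run here):
#   merge_into_fresh crawl/index crawl/indexes crawl/NEWindexes
```

Keeping the old index as a backup until the merge succeeds means a failed merge does not leave the search application without an index.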

The Nutch search web application will use the merged index from
crawl/index; this is from my web application log:

2007-09-09 20:30:58,949 INFO  searcher.NutchBean - creating new bean
2007-09-09 20:30:59,128 INFO  searcher.NutchBean - opening merged index in /home/nutch/test/trunk/crawl/index


Hope this will help,
               
Tomislav


 


Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Alexis Votta

Hi Tomislav and Nutch users

I could not solve the problem with your instructions.

I crawled twice. The re-crawl generated crawl/NEWindexes; crawl/indexes
was generated by the first crawl.

I merged them with: bin/nutch merge crawl/index crawl/indexes/ crawl/NEWindexes/

Now search.jsp shows this error:
type Exception report

message

description The server encountered an internal error () that prevented
it from fulfilling this request.

exception

org.apache.jasper.JasperException: java.lang.RuntimeException:
java.lang.NullPointerException
        org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:532)
        org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:426)
        org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
        org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

root cause

java.lang.RuntimeException: java.lang.NullPointerException
        org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)
        org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:342)
        org.apache.jsp.search_jsp._jspService(search_jsp.java:247)
        org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:384)
        org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
        org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

root cause

java.lang.NullPointerException
        org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159)
        org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177)

Is there any Crawl guru who can help?


Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Tomislav Poljak

Hi Alexis,
I think your problem is not so much in the index (or in merging indexes)
as in the segments, because if you look at the exception you will see the
root cause:

java.lang.RuntimeException: java.lang.NullPointerException
org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)

I guess you have the segments from the old crawl in one directory and
the segments from the re-crawl in another. I think all segments should be
in the same place, because the web application log says at startup:

2007-09-09 20:30:59,461 INFO  searcher.NutchBean - opening segments in /home/nutch/test/trunk/crawl/segments
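
One way to look into this is to check that every segment under crawl/segments has the parts the searcher reads when building summaries. The following is a sketch only: it assumes the usual Nutch segment layout with parse_text and parse_data subdirectories, check_segments is a hypothetical name, and a missing part is just one possible cause of the NullPointerException.

```shell
#!/bin/sh
# check_segments SEGMENTS_DIR
# Prints each segment that lacks parse_text or parse_data and returns 1 if
# any such segment was found. The list of required parts is an assumption
# based on the usual Nutch segment layout.
check_segments() {
    cs_bad=0
    for cs_seg in "$1"/*; do
        [ -d "$cs_seg" ] || continue
        for cs_part in parse_text parse_data; do
            if [ ! -d "$cs_seg/$cs_part" ]; then
                echo "incomplete segment: $cs_seg (no $cs_part)"
                cs_bad=1
            fi
        done
    done
    return "$cs_bad"
}

# Example call (not run here):
#   check_segments crawl/segments
```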

Tomislav





Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Alexis Votta

I have merged the old as well as the new segments into the segments dir.
The same error still occurs.


Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Susam Pal

We can do two things to solve this problem.

SOLUTION 'A'

1. Once the 'depth' loop is complete, merge the segments in
'crawl/segments/'. ('crawl/segments/' will have one merged segment of
the past plus all the segments generated in the depth loop, one for
each iteration of the loop.) They are now merged as a single segment
in MERGEDsegments with the following command.

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

2. Now replace 'crawl/segments' with 'crawl/MERGEDsegments'.

rm -rf crawl/segments
mv $MVARGS crawl/MERGEDsegments crawl/segments

3. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment
   (here $segment is the merged segment now under crawl/segments)
4. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
5. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
6. $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes
7. Delete crawl/NEWindexes. We are done!

I think this is very similar to Jeff's solution. Alexis argued that:-

> I am losing index of previous crawl.

The thing to notice here is that we can safely delete crawl/NEWindexes
(or OLDindexes in Jeff's case) because, in step 4, the indexes are
generated from a merged segment into which the old segments have also
been merged. So we are not losing anything.
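
Put together, SOLUTION 'A' might look like the function below. This is a sketch, not a tested recrawl script. Assumptions: NUTCH points at the nutch launcher, solution_a is a hypothetical name, crawl/segments/* stands in for $segment in step 3 (there is only one segment after the merge), and any existing crawl/index is removed before step 6 because trunk's merge will not write into an existing directory.

```shell
#!/bin/sh
# Sketch of SOLUTION 'A' (steps 1-7) for a crawl directory passed as $1.
# NUTCH and the function name are assumptions for illustration.
NUTCH="${NUTCH:-bin/nutch}"

solution_a() {
    sa_crawl=$1
    # 1-2. Merge all segments into one and swap it into place.
    "$NUTCH" mergesegs "$sa_crawl/MERGEDsegments" "$sa_crawl"/segments/* || return 1
    rm -rf "$sa_crawl/segments"
    mv "$sa_crawl/MERGEDsegments" "$sa_crawl/segments"
    # 3. Invert links for the merged segment.
    "$NUTCH" invertlinks "$sa_crawl/linkdb" "$sa_crawl"/segments/* || return 1
    # 4-5. Index everything, then remove duplicates.
    "$NUTCH" index "$sa_crawl/NEWindexes" "$sa_crawl/crawldb" \
        "$sa_crawl/linkdb" "$sa_crawl"/segments/* || return 1
    "$NUTCH" dedup "$sa_crawl/NEWindexes" || return 1
    # 6. Merge into a fresh crawl/index (assumption: the old one is deleted).
    rm -rf "$sa_crawl/index"
    "$NUTCH" merge "$sa_crawl/index" "$sa_crawl/NEWindexes" || return 1
    # 7. The part indexes are no longer needed.
    rm -rf "$sa_crawl/NEWindexes"
}

# Example call (not run here):
#   solution_a crawl
```

If Tomcat is serving crawl/index, it may need to be stopped around step 6, as Jeff described earlier in the thread.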

SOLUTION 'B'

1. Generate the new segments in another directory, say, NEWsegments.

$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/NEWsegments $topN -adddays $adddays

2. After the depth loop is over, merge the new segments, into
crawl/segments. 'crawl/segments' may have multiple merged segments
(one for each past crawl) if this is not the first crawl.

bin/nutch mergesegs crawl/segments crawl/NEWsegments/*

So, now 'crawl/segments' contains multiple merged segments (one for
each crawl in the past) and another merged segment from the current
re-crawl. Now we don't need 'crawl/NEWsegments'. So, we can delete
'crawl/NEWsegments'.

3. Store the latest merged segment in a variable.

segment=`ls -d crawl/segments/* | tail -1`

(From now onwards, we won't do the remaining operations for all the
segments like we did in solution 'A'. We will do the remaining
operations for the merged segment we have just created.)

4. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment
5. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb $segment
6. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

(So with steps 3-6, we generated indexes for the new merged segment
generated with this crawl only.)

7. Let's assume the past indexes were saved as 'crawl/indexes1',
'crawl/indexes2', etc. Now all of them can be merged as follows:

$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes/ crawl/indexes1/ crawl/indexes2

I think this is what Tomislav must have done whereas what Alexis must
have done is a mix of solution 'A' and solution 'B'.

For example, if you generate 'crawl/NEWindexes' from all the merged
segments (new as well as old) and then merge this NEWindexes (which is
not strictly new) with the old indexes, you will probably get that
error.

To summarize, in solution 'A' we are generating 'crawl/NEWindexes'
from all the merged segments (new as well as old). So it is not
strictly new. When merging, we merge only NEWindexes because it already
contains everything.

In solution 'B' we are generating 'crawl/NEWindexes' from the most
recent merged segment. So this is strictly new. So, while merging, we
are merging NEWindexes with the old indexes into 'crawl/index'.
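
For comparison, steps 2-7 of SOLUTION 'B' might be scripted like this. It is a sketch under assumptions: NUTCH points at the nutch launcher, solution_b is a hypothetical name, the depth loop that fills crawl/NEWsegments has already run, past part indexes are kept as crawl/indexes1, crawl/indexes2 and so on, and an existing crawl/index is removed before the final merge.

```shell
#!/bin/sh
# Sketch of SOLUTION 'B', steps 2-7, for a crawl directory passed as $1.
# NUTCH and the function name are assumptions for illustration.
NUTCH="${NUTCH:-bin/nutch}"

solution_b() {
    sb_crawl=$1
    # 2. Merge this re-crawl's segments into crawl/segments as one new segment.
    "$NUTCH" mergesegs "$sb_crawl/segments" "$sb_crawl"/NEWsegments/* || return 1
    rm -rf "$sb_crawl/NEWsegments"
    # 3. Take the last segment listed, as in step 3 of the text.
    sb_segment=$(ls -d "$sb_crawl"/segments/* | tail -1)
    # 4-6. Link inversion, indexing and dedup for the new segment only.
    "$NUTCH" invertlinks "$sb_crawl/linkdb" "$sb_segment" || return 1
    "$NUTCH" index "$sb_crawl/NEWindexes" "$sb_crawl/crawldb" \
        "$sb_crawl/linkdb" "$sb_segment" || return 1
    "$NUTCH" dedup "$sb_crawl/NEWindexes" || return 1
    # 7. Collect NEWindexes plus every saved indexesN that actually exists.
    set -- "$sb_crawl/NEWindexes"
    for sb_old in "$sb_crawl"/indexes[0-9]*; do
        if [ -d "$sb_old" ]; then
            set -- "$@" "$sb_old"
        fi
    done
    rm -rf "$sb_crawl/index"
    "$NUTCH" merge "$sb_crawl/index" "$@" || return 1
}

# Example call (not run here):
#   solution_b crawl
```

The sketch leaves out saving crawl/NEWindexes under the next indexesN name after the merge, which would be needed before the following re-crawl.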

Regards,
Susam Pal
http://susam.in/


RE: Nutch recrawl script for 0.9 doesn't work with trunk. Help

by Bolle, Jeffrey F.

Based on the org.apache.nutch.Crawl class, I've created a ReCrawl class
that can be run much like a Nutch intranet crawl (or like the recrawl
scripts). I've written the ReCrawl class to work in the manner of
Susam Pal's "SOLUTION 'A'". However, mine doesn't reload the web
application for you, since that wasn't something I needed for my uses.
The usage is something like:

bin/nutch recrawl -dir existingCrawlDir -depth i -add addDays -topN topN

Is there a reason this functionality wasn't previously built into
nutch?  Once I test this a bit more would the developers like a patch
with my additions?

Jeff

