Nutch recrawl script for 0.9 doesn't work with trunk. Help

The recrawl script for 0.9 that I found at
http://wiki.apache.org/nutch/IntranetRecrawl is not working. It runs
successfully the first time. The second time, it fails with this error:

merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory crawl/index already exists!
    at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
    at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)

I am trying this with the latest version available in trunk. Please
help me to rectify this.

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

Hi Jeff...

Your block of code comes from the Nutch 0.9 crawl script, which belongs
to a different article: http://wiki.apache.org/nutch/Crawl

I am facing the problem with the Nutch 0.9 recrawl script, which I found
in this article: http://wiki.apache.org/nutch/IntranetRecrawl

Even if I follow your approach, I lose the index of the previous crawl.
You are merging only the new indexes in this line:

    $NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes

I want to merge the new indexes with the old index, which is what the
Nutch 0.9 recrawl script tries to do, but it fails with this error:

merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory crawl/index already exists!
    at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
    at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)

Has anyone used the re-crawl script successfully with trunk? Does it
work for nutch 0.9?

On 9/20/07, Jeff Van Boxtel <jboxtel@...> wrote:
> hmm yea, in my case I merge the indexes into a temp dir, then I delete
> the existing index dir and move mine in there:
>
> rm -rf crawl/MERGEDindexes
> $NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes
>
> # in nutch-site, hadoop.tmp.dir points to crawl/tmp
> rm -rf crawl/tmp/*
>
> # we have to stop tomcat because sometimes it is still accessing the
> # index file
> sudo /etc/init.d/tomcat5.5 stop
>
> # replace indexes with indexes_merged
> rm -rf crawl/OLDindexes
> mv --verbose crawl/index crawl/OLDindexes
> mv --verbose crawl/MERGEDindexes crawl/index
>
> echo "----- Restarting Tomcat (Step 10 of $steps) -----"
> sudo /etc/init.d/tomcat5.5 start
>
> I also found I have to stop the tomcat service otherwise I can't delete
> the index files. You may not need to do this if you aren't using
> Tomcat.
>
> -Jeff
>
> >>> alexisvotta@... 9/19/2007 10:34 AM >>>
> [...]

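The swap step in Jeff's quoted script can be sketched as a small shell function. This is only an illustration under assumptions: the function name `swap_index` is hypothetical, the directory names follow Jeff's script, and stopping and starting Tomcat is left to the caller because the init script path depends on the installation.

```shell
#!/bin/sh
# swap_index CRAWL_DIR (hypothetical helper, not part of Nutch):
# promote CRAWL_DIR/MERGEDindexes to CRAWL_DIR/index and keep the
# previous index as CRAWL_DIR/OLDindexes, as in Jeff's script.
swap_index() {
    crawl=$1
    # Refuse to touch the live index if the merge produced nothing.
    if [ ! -d "$crawl/MERGEDindexes" ]; then
        echo "no merged index in $crawl, leaving index untouched" >&2
        return 1
    fi
    rm -rf "$crawl/OLDindexes"
    if [ -d "$crawl/index" ]; then
        mv "$crawl/index" "$crawl/OLDindexes" || return 1
    fi
    mv "$crawl/MERGEDindexes" "$crawl/index"
}
```

Checking for the merged directory first means a failed merge never costs you the index that the web application is currently serving.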
Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

I've been having problems with the merge portion of the script too.
My solution was to check the success status of the merge ( $? ), and if
it failed, try again, or wait until next time.

    nutch_bin/nutch mergesegs "$merged_segment" -dir "$segments"
    if [ $? -ne 0 ]
    then
        echo "merging segments failed, lets abort now in case of large failure"
        echo "this was the main sticking point for some reason"
        echo "removing the merged segment in case of fire"
        rm -r "$merged_segment"
        # exit non-zero so the caller can tell the merge failed
        exit 1
    else
        # the merge succeeded: replace the old segments with the merged one
        rm -r "$segments"/*
        mv "$merged_segment"/* "$segments"/
        rm -r "$merged_segment"
    fi

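The "try again" part of this approach could look something like the following. It is a minimal sketch, assuming a fixed number of attempts is acceptable; the function name `retry` is made up for illustration and is not part of Nutch or the wiki scripts.

```shell
#!/bin/sh
# retry N CMD [ARGS...] (hypothetical helper): run CMD up to N times
# and stop at the first attempt that exits with status 0.
retry() {
    attempts=$1
    shift
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $n of $attempts failed: $*" >&2
        n=$((n + 1))
    done
    # Every attempt failed; let the caller decide what to clean up.
    return 1
}
```

It would be used as, for example, `retry 3 nutch_bin/nutch mergesegs "$merged_segment" -dir "$segments"`. Note that a failed merge may leave a partial output directory behind, so the command being retried may need to remove it before each attempt.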
Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

Hi,

I had the same problem using the re-crawl scripts from the wiki. They
all work fine with nutch versions up to and including 0.9, but with
nutch-1.0-dev (from trunk) they break at the merge of indexes. The
reason is that merge in nutch-0.9 (as called from the re-crawl scripts):

    bin/nutch merge crawl/indexes crawl/NEWindexes

merged the old indexes from crawl/indexes with the new indexes from
crawl/NEWindexes and stored the result in crawl/indexes. With
nutch-1.0-dev (from trunk), merge requires a new output folder that
does not exist yet.

A solution that works (I have tried it) is the following:

    bin/nutch merge crawl/index crawl/indexes crawl/NEWindexes

where crawl/index is the new (output) folder, crawl/indexes holds the
old indexes and crawl/NEWindexes holds the new indexes. You can do this
with as many indexes as you want to merge (one per re-crawl); you only
have to run:

    bin/nutch merge crawl/index crawl/indexes1 crawl/indexes2 ...

but crawl/index must not exist (delete it or back it up first).

The Nutch search web application will use the merged index from
crawl/index; this is from my web application log:

    2007-09-09 20:30:58,949 INFO searcher.NutchBean - creating new bean
    2007-09-09 20:30:59,128 INFO searcher.NutchBean - opening merged index in /home/nutch/test/trunk/crawl/index

Hope this will help,

Tomislav

On Thu, 2007-09-20 at 14:54 +0800, Lyndon Maydwell wrote:
> /nutch mergesegs $merged_segment -dir $segments
> if [ $? -ne 0 ]

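The "delete it or back it up" step can be folded into a small wrapper around the merge command. The following is a sketch only: `merge_indexes` and the `NUTCH` variable are names invented for this example, and the timestamped backup name is an arbitrary choice.

```shell
#!/bin/sh
# NUTCH is an assumed variable holding the path to the nutch launcher.
NUTCH=${NUTCH:-bin/nutch}

# merge_indexes OUTPUT INPUT... (hypothetical helper): merge the INPUT
# index directories into OUTPUT. Because trunk refuses to write into an
# existing output directory, an existing OUTPUT is moved aside first.
merge_indexes() {
    out=$1
    shift
    if [ -e "$out" ]; then
        backup="$out.bak.$(date +%Y%m%d%H%M%S)"
        mv "$out" "$backup" || return 1
    fi
    "$NUTCH" merge "$out" "$@"
}
```

With this, Tomislav's command becomes `merge_indexes crawl/index crawl/indexes crawl/NEWindexes`, and it can be run repeatedly without hitting FileAlreadyExistsException.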
Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

Hi Tomislav and Nutch users,

I could not solve the problem with your instructions.

I crawled twice. The first crawl generated crawl/indexes, and the
re-crawl generated crawl/NEWindexes.

I merged them with:

    bin/nutch merge crawl/index crawl/indexes/ crawl/NEWindexes/

Now search.jsp shows this error:

type Exception report

message

description The server encountered an internal error () that prevented
it from fulfilling this request.

exception

org.apache.jasper.JasperException: java.lang.RuntimeException: java.lang.NullPointerException
    org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:532)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:426)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

root cause

java.lang.RuntimeException: java.lang.NullPointerException
    org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)
    org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:342)
    org.apache.jsp.search_jsp._jspService(search_jsp.java:247)
    org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:384)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

root cause

java.lang.NullPointerException
    org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159)
    org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177)

Is there any Crawl guru who can help?

On 9/20/07, Tomislav Poljak <tpoljak@...> wrote:
> [...]

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

Hi Alexis,

I think that your problem is not so much in the index (or in merging
indexes) but in the segments, because if you look at the exception you
will see the root cause:

    java.lang.RuntimeException: java.lang.NullPointerException
        org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)

I guess you have the segments from the old crawl in one directory and
the segments from the re-crawl in another. I think all segments should
be in the same place, because the web application says (from its
startup log):

    2007-09-09 20:30:59,461 INFO searcher.NutchBean - opening segments in /home/nutch/test/trunk/crawl/segments

Tomislav

On Thu, 2007-09-20 at 17:03 +0530, Alexis Votta wrote:
> s showing error.
> type Exception report

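One way to check the segments directory that the web application opens is to verify that every segment in it is complete. The sketch below assumes the usual segment layout with `crawl_fetch`, `parse_data` and `parse_text` subdirectories; the function name `check_segments` is hypothetical, and a missing `parse_text` is only one possible explanation for a NullPointerException in getSummary, not a confirmed diagnosis for this thread.

```shell
#!/bin/sh
# check_segments SEGMENTS_DIR (hypothetical helper): report every
# segment under SEGMENTS_DIR that lacks one of the subdirectories
# the searcher is assumed to need. Returns 1 if anything is missing.
check_segments() {
    segdir=$1
    status=0
    for seg in "$segdir"/*; do
        [ -d "$seg" ] || continue
        for part in crawl_fetch parse_data parse_text; do
            if [ ! -d "$seg/$part" ]; then
                echo "$seg: missing $part"
                status=1
            fi
        done
    done
    return $status
}
```

Running `check_segments crawl/segments` before restarting the web application lists any segment that was only partially fetched, parsed or merged.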
Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

I have merged the old as well as the new segments into the segments
dir. I still get the same error.

On 9/20/07, Tomislav Poljak <tpoljak@...> wrote:
> [...]

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

We can do two things to solve this problem.

SOLUTION 'A'

1. Once the 'depth' loop is complete, merge the segments in
   'crawl/segments/'. ('crawl/segments/' will have one merged segment
   from the past plus all the segments generated in the depth loop, one
   for each iteration of the loop.) They are merged into a single
   segment in MERGEDsegments with the following command.

   $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

2. Now replace 'crawl/segments' with 'crawl/MERGEDsegments'.

   rm -rf crawl/segments
   mv $MVARGS crawl/MERGEDsegments crawl/segments

3. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

4. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

5. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

6. $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

7. Delete crawl/NEWindexes.

We are done! I think this is very similar to Jeff's solution. Alexis
argued that:

> I am losing index of previous crawl.

The thing to notice here is that we can safely delete crawl/NEWindexes
(or OLDindexes in Jeff's case) because, in step 4, the indexes are
generated from a merged segment into which the old segments have also
been merged. So we are not losing anything.

SOLUTION 'B'

1. Generate the new segments in another directory, say, NEWsegments.

   $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/NEWsegments $topN -adddays $adddays

2. After the depth loop is over, merge the new segments into
   crawl/segments. 'crawl/segments' may already have multiple merged
   segments (one for each past crawl) if this is not the first crawl.

   bin/nutch mergesegs crawl/segments crawl/NEWsegments/*

   So now 'crawl/segments' contains multiple merged segments (one for
   each crawl in the past) and another merged segment from the current
   re-crawl. We no longer need 'crawl/NEWsegments', so we can delete it.

3. Store the latest merged segment in a variable.

   segment=`ls -d crawl/segments/* | tail -1`

   (From now onwards, we won't do the remaining operations for all the
   segments like we did in solution 'A'. We will do the remaining
   operations only for the merged segment we have just created.)

4. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment

5. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb $segment

6. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

   (So with steps 3-6, we generated indexes only for the new merged
   segment produced by this crawl.)

7. Let's assume the past indexes were saved as 'crawl/indexes1',
   'crawl/indexes2', etc. Now all of them can be merged with:

   $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes/ crawl/indexes1/ crawl/indexes2

I think this is what Tomislav must have done, whereas what Alexis must
have done is a mix of solution 'A' and solution 'B'. For example, if you
generate 'crawl/NEWindexes' from all the merged segments (new as well as
old) and then merge this NEWindexes (which is not strictly new) with the
old indexes, you will probably get that error.

To summarize, in solution 'A' we generate 'crawl/NEWindexes' from all
the merged segments (new as well as old), so it is not strictly new.
When merging, we merge only this one because it has everything. In
solution 'B' we generate 'crawl/NEWindexes' from the most recent merged
segment only, so it is strictly new. When merging, we therefore merge
NEWindexes with the old indexes into 'crawl/index'.

Regards,
Susam Pal
http://susam.in/

On 9/20/07, Alexis Votta <alexisvotta@...> wrote:
> [...]

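For reference, the post-fetch part of solution 'A' can be put together as one shell function. This is a sketch under stated assumptions rather than a tested recrawl script: `recrawl_finish` and the `NUTCH` variable are names made up here, and removing the old crawl/index before the final merge is an addition based on Tomislav's note that trunk requires a non-existent output directory.

```shell
#!/bin/sh
# NUTCH is an assumed variable holding the path to the nutch launcher.
NUTCH=${NUTCH:-bin/nutch}

# recrawl_finish CRAWL_DIR (hypothetical helper): run steps 1-7 of
# solution 'A' once the depth loop has finished.
recrawl_finish() {
    crawl=$1
    # Steps 1-2: merge old and new segments into one, then swap it in.
    rm -rf "$crawl/MERGEDsegments"
    "$NUTCH" mergesegs "$crawl/MERGEDsegments" "$crawl"/segments/* || return 1
    rm -rf "$crawl/segments"
    mv "$crawl/MERGEDsegments" "$crawl/segments" || return 1
    # Steps 3-5: rebuild the link database, index everything, dedup.
    "$NUTCH" invertlinks "$crawl/linkdb" "$crawl"/segments/* || return 1
    rm -rf "$crawl/NEWindexes"
    "$NUTCH" index "$crawl/NEWindexes" "$crawl/crawldb" "$crawl/linkdb" \
        "$crawl"/segments/* || return 1
    "$NUTCH" dedup "$crawl/NEWindexes" || return 1
    # Step 6: the old index is dropped first (assumption for trunk);
    # NEWindexes was built from all segments, so nothing is lost.
    rm -rf "$crawl/index"
    "$NUTCH" merge "$crawl/index" "$crawl/NEWindexes" || return 1
    # Step 7: the part indexes are no longer needed.
    rm -rf "$crawl/NEWindexes"
}
```

Each step returns early on failure so that a broken merge does not go on to delete the segments or the index produced by earlier runs. If the web application keeps the index open, it may need to be stopped around step 6, as Jeff described.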
RE: Nutch recrawl script for 0.9 doesn't work with trunk. Help

Based on the org.apache.nutch.crawl.Crawl class, I've created a ReCrawl
class that can be run similarly to how a nutch intranet crawl is run (or
how the recrawl scripts work). I've written the ReCrawl class to
function in the manner of "SOLUTION 'A'" from Susam's message below.
However, mine doesn't reload the web application for you, since that
wasn't something I needed for my uses.

The usage is something like:

    bin/nutch recrawl -dir existingCrawlDir -depth i -add addDays -topN topN

Is there a reason this functionality wasn't previously built into nutch?
Once I test this a bit more, would the developers like a patch with my
additions?

Jeff

-----Original Message-----
From: Susam Pal [mailto:susam.pal@...]
Sent: Thursday, September 20, 2007 9:54 AM
To: nutch-user@...
Subject: Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

[...]