Hi,
I was wondering if anyone has a simple script that uses Nutch 1.0 to crawl
an intranet site with multiple web servers. I can run

    /webroot/oscrawlers/nutch/bin/nutch crawl \
        /webroot/oscrawlers/nutch/urls/seed.txt \
        -dir /webroot/oscrawlers/nutch/crawl -depth 8 -topN 1000

and get a big chunk of the files. I then tried to follow the "Whole-web"
crawling steps outlined in the Nutch Tutorial,
http://wiki.apache.org/nutch/NutchTutorial, but nothing new seems to get
into the index. It seems to be crawling the same URLs, and when I run the
"-stats" command against the database I get the same stats output as before.
Here is my script:
#!/bin/sh
####################################################
# nutch_crawler.sh
####################################################
echo " Set UMASK ...";
umask 002;
echo ""
# Set Variables
LIMIT=1 # Loop below runs while A <= LIMIT, i.e. LIMIT+1 passes
A=0
NUTCHBINARY='/webroot/oscrawlers/nutch/bin/nutch'
NUTCHDB='/webroot/oscrawlers/nutch/crawl/crawldb'
NUTCHSEGMENTS='/webroot/oscrawlers/nutch/crawl/segments'
NUTCHINDEXES='/webroot/oscrawlers/nutch/crawl/indexes'
NUTCHLINKDB='/webroot/oscrawlers/nutch/crawl/linkdb'
# Inject starting URLs into the database
#echo " Injecting Starting URLs ..."
#echo ""
#$NUTCHBINARY inject $NUTCHDB /webroot/oscrawlers/nutch/urls/seed.txt
#sleep 30
while [ $A -le "$LIMIT" ]
do
# Generate a fetch list
echo " Generating fetch list ..."
$NUTCHBINARY generate $NUTCHDB $NUTCHSEGMENTS -topN 1000
# Find the newest created segment
echo ""
echo " Get segment ..."
s1=`ls -d $NUTCHSEGMENTS/2* | tail -1`
echo ""
echo " Segment is: $s1 ..."
# Fetch this segment
$NUTCHBINARY fetch $s1
# Add one to A and continue looping until LIMIT is reached
A=$(($A+1))
sleep 60
done
# Invert links
echo ""
echo " Building inverted links ... "
$NUTCHBINARY invertlinks $NUTCHLINKDB -dir $NUTCHSEGMENTS
# Before I can do this, I need to delete the current indexes. Doesn't
# seem to affect the current searches
echo ""
echo " Remove old indexes ..."
rm -rf $NUTCHINDEXES
# Index Segments
echo ""
echo " Build new indexes ..."
$NUTCHBINARY index $NUTCHINDEXES $NUTCHDB $NUTCHLINKDB $NUTCHSEGMENTS/*
echo ""
echo " Done ...";
###########################################################
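
For comparison, the Whole-web section of the tutorial follows each fetch with an updatedb call, which the loop above does not do. Below is a minimal sketch of one such round, assuming the same paths as in my script. The function name crawl_round and the variables NUTCH, CRAWLDB and SEGMENTS are made up for this example, and I have not confirmed that the missing step is the whole reason the same URLs keep coming back.

```shell
#!/bin/sh
# Sketch only: one generate / fetch / updatedb round.
# NUTCH, CRAWLDB, SEGMENTS and crawl_round are illustrative names;
# the default paths are the ones used in the script above.
NUTCH=${NUTCH:-/webroot/oscrawlers/nutch/bin/nutch}
CRAWLDB=${CRAWLDB:-/webroot/oscrawlers/nutch/crawl/crawldb}
SEGMENTS=${SEGMENTS:-/webroot/oscrawlers/nutch/crawl/segments}

crawl_round() {
    # Generate a fetch list; this creates a new timestamped segment directory
    "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 1000 || return 1

    # Segment names are timestamps, so the last one in sorted order is the newest
    segment=`ls -d "$SEGMENTS"/2* | tail -1`

    # Fetch the pages listed in that segment
    "$NUTCH" fetch "$segment" || return 1

    # Merge the fetch results and newly discovered links back into the crawldb,
    # so the next generate has something new to select
    "$NUTCH" updatedb "$CRAWLDB" "$segment"
}
```

Calling crawl_round inside the while loop, in place of the generate and fetch lines, would give one full round per pass.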
Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.new.facebook.com/people/Jake_Jacobson/622727274
Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
-- ANONYMOUS