How to fetch URLs with special characters '?' & '='
I am trying to crawl this URL, which contains the special characters '?' and '=':
http://answers.yahoo.com/dir/index;_ylt=AmQOyqS3boseCSYsZxA495Xpy6IX;_ylv=3?link=list&sid=396545327
This URL belongs to the Dining-out category of answers.yahoo.com, and I want to crawl the URLs that fall under this subcategory. However, the URL seems to get skipped. I have attached my urllist.txt, regex-urlfilter.txt and crawl-urlfilter.txt. Has anyone done a similar kind of crawl before?
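Attachments:
regex-urlfilter.txt (http://old.nabble.com/file/p26197881/regex-urlfilter.txt)
crawl-urlfilter.txt (http://old.nabble.com/file/p26197881/crawl-urlfilter.txt)
urllist.txt (http://old.nabble.com/file/p26197881/urllist.txt)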
|
|
RE: How to fetch URLs with special characters '?' & '='
Your page could be disallowed in the robots.txt of the Yahoo website, or marked noindex, nofollow. Also review your regular expression: the character '.' means any character, so you have to put a '\' in front of each '.', like this: \.

+^http://answers.yahoo.com/dir/index;_ylt=*
should be like this:
+^http://answers\.yahoo\.com/dir/index;_ylt=*

> Date: Wed, 4 Nov 2009 07:06:54 -0800
> From: saravanan-2.krishnamoorthy-2@...
> To: nutch-user@...
> Subject: How to fetch URLs with special characters '?' & '='
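To see what the escaping advice in this reply changes in practice, here is a small sketch. It uses Python's re module as a stand-in for the Java regular expressions Nutch uses (the two agree on these simple patterns), and the "lookalike" URL is made up for illustration. It suggests that the unescaped pattern is merely looser: it still matches the real URL, so the missing backslashes alone would probably not explain why the URL gets skipped.

```python
import re

# Pattern with unescaped dots: each '.' matches ANY single character.
unescaped = re.compile(r"^http://answers.yahoo.com/dir/index;_ylt=")
# Pattern with escaped dots: each '\.' matches only a literal dot.
escaped = re.compile(r"^http://answers\.yahoo\.com/dir/index;_ylt=")

real = "http://answers.yahoo.com/dir/index;_ylt=AmQ"
# Hypothetical URL where a dot of the host name is replaced by 'X'.
lookalike = "http://answersXyahoo.com/dir/index;_ylt=AmQ"


def matches(pattern, url):
    """True if the pattern is found in the URL (anchored by its own '^')."""
    return pattern.search(url) is not None


for name, pattern in (("unescaped", unescaped), ("escaped", escaped)):
    print(name, matches(pattern, real), matches(pattern, lookalike))
```

Both patterns accept the real URL; only the escaped one rejects the lookalike.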
|
|
Re: How to fetch URLs with special characters '?' & '='
By adding the expression +[?=] just below the line that contains -[*!@], the URLs you mentioned are crawled. You could try that and see if it succeeds for you.

-sroy

On Wed, Nov 4, 2009 at 8:36 PM, saravan.krish <saravanan-2.krishnamoorthy-2@...> wrote:
> I am trying to crawl the URL: [...]

--
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: sroy@...
http://www.profound.in
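To illustrate why the order of the filter lines matters, here is a simplified model of how a regex-urlfilter.txt style rule list might be evaluated. It is only a sketch under these assumptions: rules are tried top to bottom, the first rule whose pattern is found anywhere in the URL decides (+ accepts, - rejects), and a URL matching no rule is dropped. The two rule sets below are hypothetical, since the attached files are not shown in the thread; parse_rules and accepts are names invented for this example, not part of Nutch.

```python
import re

URL = ("http://answers.yahoo.com/dir/index;_ylt=AmQOyqS3boseCSYsZxA495Xpy6IX;"
       "_ylv=3?link=list&sid=396545327")

# Hypothetical rule set where '?' and '=' are still in the skip list.
SKIP_QUERIES = r"""
# skip URLs containing certain characters as probable queries
-[?*!@=]
+^http://answers\.yahoo\.com/dir/index;_ylt=
"""

# Hypothetical rule set following the suggestion in the reply above.
ALLOW_QUERIES = r"""
-[*!@]
+[?=]
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, regex = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(regex)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


print(accepts(parse_rules(SKIP_QUERIES), URL))   # rejected by -[?*!@=]
print(accepts(parse_rules(ALLOW_QUERIES), URL))  # accepted by +[?=]
```

Under these assumptions, a + line placed after a - line that already matches the URL never gets a chance to apply, which is consistent with the URL being skipped.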
|
|
decoding nutch readseg -dump 's output
Hi,

I'm trying to build a small Perl utility (it could be in any scripting language) that takes the output of nutch readseg -dump as its input, decodes the content field to UTF-8 (independent of what encoding the raw page was in) and outputs that decoded content. After a little experimentation, I find myself unable to decode the content field, even when I use the various charset hints that are available either in the content metadata or in the raw content itself.

I was wondering whether someone on the list has already succeeded in building this type of functionality, or whether the content returned by readseg uses a specific encoding that I don't know of?

cheers,
-y
|
|
Re: decoding nutch readseg -dump 's output
Yves Petinot wrote:
> [...] is the content returned by readseg using a specific encoding
> that i don't know of ?

The dump functionality is not intended to provide a bit-by-bit copy of the segment; it is mostly for debugging purposes. It uses System.out, which in turn uses the default platform encoding. Any characters outside this encoding will be replaced by question marks.

If you want to get an exact copy of the raw binary content, then please use the SegmentReader API.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
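The question-mark replacement described in this reply can be reproduced outside Nutch. The sketch below is plain Python, not Nutch code: dump_like is a made-up helper that imitates printing text through a stream whose encoding cannot represent every character, with ASCII standing in for the default platform encoding. It shows why no charset hint can recover the original text from such a dump.

```python
def dump_like(text, platform_encoding):
    """Encode text lossily: characters the encoding cannot represent become '?'."""
    return text.encode(platform_encoding, errors="replace")


# 'cafe' with an accented e, a space, then three CJK characters.
original = "caf\u00e9 \u65e5\u672c\u8a9e"

lossy = dump_like(original, "ascii")
print(lossy)  # b'caf? ???'

# Decoding the dumped bytes gives literal question marks, not the original.
print(lossy.decode("ascii") == original)  # False

# An encoding that covers every character round-trips without loss.
print(dump_like(original, "utf-8").decode("utf-8") == original)  # True
```

Once a character has been turned into '?', the information is gone, so the raw bytes have to come from the segment itself rather than from the dump.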
|
|
Re: decoding nutch readseg -dump 's output
Thanks a lot, Andrzej, this makes perfect sense.

-y