How to fetch URLs with special characters '?' & '='

by saravan.krish

I am trying to crawl a URL that contains the special characters '?' and '=': http://answers.yahoo.com/dir/index;_ylt=AmQOyqS3boseCSYsZxA495Xpy6IX;_ylv=3?link=list&sid=396545327 This URL belongs to the Dining-out category of answers.yahoo.com, and I want to crawl the URLs that fall under this subcategory, but it seems to get skipped. I have attached my urllist.txt, regex-urlfilter.txt and crawl-urlfilter.txt. Has anyone done a similar kind of crawl before?

regex-urlfilter.txt
crawl-urlfilter.txt
urllist.txt

RE: How to fetch URLs with special characters '?' & '='

by miagomiago

Your page could be disallowed in the Yahoo site's robots.txt, or marked noindex, nofollow in its robots meta tag.

Also review your regular expression: the character '.' matches any character, so to match a literal '.' you have to put a '\' before it, like this: \.
The rule

+^http://answers.yahoo.com/dir/index;_ylt=*

should be written like this:

+^http://answers\.yahoo\.com/dir/index;_ylt=*
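To illustrate the difference between the two rules above, here is a minimal Python sketch. Nutch itself uses Java regular expressions, but for patterns this simple the two should behave the same. The leading '+' only marks the rule as an accept rule, so it is left out of the regex. The look-alike host name and the helper name matches are made up for this example.

```python
import re

# Patterns taken from the filter rules above, without the leading '+'
# (in the filter file the '+' only marks the rule as an "accept" rule).
unescaped = re.compile(r"^http://answers.yahoo.com/dir/index;_ylt=*")
escaped = re.compile(r"^http://answers\.yahoo\.com/dir/index;_ylt=*")

real_url = ("http://answers.yahoo.com/dir/index;"
            "_ylt=AmQOyqS3boseCSYsZxA495Xpy6IX;_ylv=3?link=list&sid=396545327")
# Hypothetical look-alike host: 'x' stands where the dots should be.
lookalike = "http://answersxyahooxcom/dir/index;_ylt=abc"

def matches(pattern, url):
    """True if the pattern is found in the URL (search, not full match)."""
    return pattern.search(url) is not None

print(matches(unescaped, real_url))   # True
print(matches(escaped, real_url))     # True
print(matches(unescaped, lookalike))  # True: '.' matched the 'x' characters
print(matches(escaped, lookalike))    # False: '\.' only matches a literal dot
```

Both patterns match the URL from the question, so the missing backslashes alone would not explain why it is skipped. Note also that the trailing =* means "zero or more '=' characters", not "anything after '='".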






> Date: Wed, 4 Nov 2009 07:06:54 -0800
> From: saravanan-2.krishnamoorthy-2@...
> To: nutch-user@...
> Subject: How to fetch URLs with special characters '?' & '='
>
> [...]
>
> http://old.nabble.com/file/p26197881/regex-urlfilter.txt regex-urlfilter.txt
> http://old.nabble.com/file/p26197881/crawl-urlfilter.txt crawl-urlfilter.txt
> http://old.nabble.com/file/p26197881/urllist.txt urllist.txt

Re: How to fetch URLs with special characters '?' & '='

by Subhojit Roy

If you add the expression +[?=] just below the line that contains -[*!@],
the URLs you mentioned are crawled. You could try that and see if it works
for you.
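Why the position of the new rule matters can be sketched in Python. This assumes the filter checks the rules from top to bottom, lets the first matching rule decide, and drops URLs that match no rule. The function names and the two short rule lists are made up for this example and are not Nutch code. The first list assumes a reject rule that still contains '?' and '=' (I believe the stock file ships with -[?*!@=]).

```python
import re

def parse_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Assumed semantics: first matching rule wins, no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

url = ("http://answers.yahoo.com/dir/index;"
       "_ylt=AmQOyqS3boseCSYsZxA495Xpy6IX;_ylv=3?link=list&sid=396545327")

# A rule that rejects '?' and '=' ahead of the accept rule drops the URL.
rejecting = parse_rules([
    "-[?*!@=]",
    r"+^http://answers\.yahoo\.com/dir/index;_ylt=*",
])

# The suggestion above: '?' and '=' removed from the reject rule,
# and +[?=] added just below it.
suggested = parse_rules([
    "-[*!@]",
    "+[?=]",
])

print(accepts(rejecting, url))  # False
print(accepts(suggested, url))  # True
```

Keep in mind that +[?=] accepts every URL containing '?' or '=', on any host, so a narrower accept rule may be preferable for a focused crawl.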

-sroy


--
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: sroy@...
http://www.profound.in

decoding nutch readseg -dump 's output

by Yves Petinot

Hi,

I'm trying to build a small Perl utility (it could be any scripting
language) that takes the output of nutch readseg -dump as its input,
decodes the content field to UTF-8 (independent of what encoding the raw
page was in) and outputs that decoded content. After a little
experimentation, I find myself unable to decode the content field, even
when I try using the various charset hints that are available either in
the content metadata or in the raw content itself.

I was wondering whether someone on the list has already succeeded in
building this type of functionality, or whether the content returned by
readseg uses a specific encoding that I don't know of?

cheers,

-y

Re: decoding nutch readseg -dump 's output

by Andrzej Bialecki

The dump functionality is not intended to provide a bit-by-bit copy of
the segment; it's mostly for debugging purposes. It uses System.out,
which in turn uses the default platform encoding, so any characters
outside this encoding will be replaced by question marks.
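
The effect can be reproduced outside Nutch. Here is a minimal Python sketch, assuming a platform whose default encoding is US-ASCII (the sample string and the choice of ASCII are only for illustration). Encoding with replacement turns every unmappable character into '?', and the original text cannot be recovered afterwards.

```python
# Sample page text with characters outside US-ASCII (made up for this sketch):
# "caf" + e-acute, a space, then three Japanese characters.
text = "caf\u00e9 \u65e5\u672c\u8a9e"

# Encode the way a lossy output stream would: characters that the target
# encoding cannot represent are replaced by '?'.
dumped = text.encode("ascii", errors="replace")
print(dumped)  # b'caf? ???'

# Decoding the dumped bytes gives back the question marks, not the original
# characters: the information was lost when the text was written.
restored = dumped.decode("ascii")
print(restored == text)  # False
```

This is why no charset hint helps when post-processing the dump: the bytes that the hints describe are no longer there.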

If you want to get an exact copy of the raw binary content then please
use the SegmentReader API.
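
Once the raw bytes are available, the decoding step from the original question could be sketched as follows in Python. Obtaining the bytes through the SegmentReader API is not shown here. The function names, the regular expressions used to find the charset hint, and the fallback to UTF-8 are all assumptions made for this example.

```python
import re

def guess_charset(content_type, raw, default="utf-8"):
    """Pick a charset from the Content-Type header, then from a <meta> tag."""
    match = re.search(r"charset=([\w.:-]+)", content_type or "", re.I)
    if not match:
        # Look for charset=... in the first bytes of the page itself.
        match = re.search(rb"charset=[\"']?([\w.:-]+)", raw[:2048], re.I)
    if not match:
        return default
    name = match.group(1)
    return name.decode("ascii") if isinstance(name, bytes) else name

def to_utf8(content_type, raw):
    """Decode raw page bytes with the guessed charset and re-encode as UTF-8."""
    charset = guess_charset(content_type, raw)
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong hint: fall back to UTF-8 with replacement characters.
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")

# Usage with made-up sample data: a Latin-1 page declared in the header.
raw_page = "<html><body>caf\u00e9</body></html>".encode("iso-8859-1")
print(to_utf8("text/html; charset=ISO-8859-1", raw_page).decode("utf-8"))
```

The header hint is tried first and the in-page hint second. Real pages sometimes declare a charset they do not actually use, so a production tool would probably want a detection step as well.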

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: decoding nutch readseg -dump 's output

by Yves Petinot

Thanks a lot, Andrzej, this makes perfect sense.

-y
