Extract full urls from DOM

Extract full urls from DOM

by zzeran

Hello everyone,

I've created a plugin for Nutch 1.0 that extends the parser.

This plugin extracts several kinds of information from the document DOM.

In some cases I need to extract an "href" of a certain link. The link in the
DOM is still relative as it was originally written in the html document, so
for example it might be a link with an href of "/music".

My question is: how can I turn this link into an absolute URL, for example
turn "/music" into "http://www.example.com/music"?

Thanks a lot,
Eran

Re: Extract full urls from DOM

by Ken Krugler

On Oct 29, 2009, at 4:00am, Eran Zinman wrote:

> Hello everyone,
>
> I've created a plugin for Nutch 1.0 that extends the parser.
>
> This plugin extracts several kinds of information from the document DOM.
>
> In some cases I need to extract an "href" of a certain link. The link in the
> DOM is still relative as it was originally written in the html document, so
> for example it might be a link with an href of "/music".
>
> My question is - how can I make this link have an absolute url - for example
> make "/music" to "http://www.example.com/music"?

new URL(baseUrl, relativeString)

will return the full URL, leaving aside a few minor edge cases.

The baseUrl is the URL of the containing document, or the value of the
(potentially relative) Location: response header field if one exists, or
the href value of the <base> tag in the <head> element if one exists.
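
As a minimal sketch of the call above, assuming the URL class meant here is
java.net.URL: the class name UrlResolver, the helper resolve(), and the
example page URL are hypothetical and are not part of Nutch. Choosing the
right base (document URL, Location header, or <base> tag) is left to the
caller.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlResolver {

    // Hypothetical helper (not a Nutch API): resolves an href taken from the
    // DOM against the base URL chosen for the containing document.
    public static String resolve(String base, String href) throws MalformedURLException {
        URL baseUrl = new URL(base);
        // The two-argument constructor interprets href in the context of baseUrl.
        return new URL(baseUrl, href).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Assumed example page; any absolute URL can serve as the base.
        String page = "http://www.example.com/artists/index.html";
        System.out.println(resolve(page, "/music"));      // http://www.example.com/music
        System.out.println(resolve(page, "albums.html")); // http://www.example.com/artists/albums.html
    }
}
```

An href that is already absolute is returned unchanged by the same call, so
the helper can be applied to every link without checking first.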

-- Ken


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


Re: Extract full urls from DOM

by zzeran

Ken,

Thanks a lot! You solved my problem.

Thanks,
Eran
