Extract full urls from DOM

Hello everyone,

I've created a plugin for Nutch 1.0 that extends the parser. This plugin extracts several kinds of information from the document DOM.

In some cases I need to extract the "href" of a certain link. The link in the DOM is still relative, as it was originally written in the HTML document, so for example it might be a link with an href of "/music".

My question is: how can I turn this link into an absolute URL, for example turn "/music" into "http://www.example.com/music"?

Thanks a lot,
Eran
Re: Extract full urls from DOM

On Oct 29, 2009, at 4:00am, Eran Zinman wrote:

> Hello everyone,
>
> I've created a plugin for Nutch 1.0 that extends the parser.
>
> This plugin extract several kinds of information from the document DOM.
>
> In some cases I need to extract an "href" of a certain link. The link in the
> DOM is still relative as it was originally written in the html document, so
> for example it might be a link with an href of "/music".
>
> My question is - how can I make this link have an absolute url - for example
> make "/music" to "http://www.example.com/music"?

new URL(baseUrl, relativeString)

will return the full URL, leaving aside a few minor edge cases.

The baseUrl will be the URL of the containing document, or the value of the (potentially relative) location: response header field if it exists, or the value of the <base> tag in the <head> element, if that exists.

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
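
For anyone finding this thread later, here is a minimal, standalone sketch of the `new URL(baseUrl, relativeString)` approach using plain `java.net.URL`. The class name `ResolveHref`, the helper `toAbsolute`, and the example URLs are made up for illustration and are not part of Nutch. The sketch assumes you already have the base URL as a string; picking the right base (document URL, response header, or `<base>` tag) is left out. On recent JDKs the `URL` constructors are deprecated, and `URI.resolve` is the usual alternative.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveHref {

    // Hypothetical helper: resolves an href taken from the DOM against
    // the base URL of the document that contained it.
    static String toAbsolute(String baseUrl, String href) throws MalformedURLException {
        URL base = new URL(baseUrl);
        // The two-argument constructor resolves href relative to base.
        // If href is already absolute, the result is simply href.
        return new URL(base, href).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        String base = "http://www.example.com/artists/index.html";

        // Host-relative link, as in the original question.
        System.out.println(toAbsolute(base, "/music"));

        // Path-relative link: resolved against the base document's directory.
        System.out.println(toAbsolute(base, "albums/1.html"));

        // Already absolute link: left unchanged.
        System.out.println(toAbsolute(base, "http://other.example.org/x"));
    }
}
```

With the base shown, the first call yields "http://www.example.com/music", which is the case asked about above.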
Re: Extract full urls from DOM

Ken,

Thanks a lot! You solved my problem.

Thanks,
Eran

On Thu, Oct 29, 2009 at 2:35 PM, Ken Krugler <kkrugler_lists@...> wrote:

> new URL(baseUrl, relativeString)
>
> will return the full URL, leaving aside a few minor edge cases.