Help with HTML parsing


Help with HTML parsing

by vivnet

Hello,

I'm new to Watir/Ruby and need to solve a problem that involves HTML
parsing - you could also call it screen scraping. I haven't used either
library before, but I wanted to know whether it is better to use Hpricot or
open-uri. The problem is similar to the one below:

Let's say I'm searching Google for some string, "Dungeons & Dragons" for
instance. I want to parse the first results page and get the
title text and URL for the top 5 results. How would I do this using
Hpricot or open-uri or both?

Please help!


Viv.
--
Posted via http://www.ruby-forum.com/.


Re: Help with HTML parsing

by dandiebolt

open-uri gives you access to http/https/ftp, while Hpricot is an HTML parser. So for your intended application you would probably use both: open-uri would fetch the HTML document, and Hpricot would do the parsing using XPath/CSS selectors etc. open-uri is what you use when you are loading your HTML from a web URL. The open-uri part happens in the blink of an eye, and then you are off manipulating your Hpricot doc:

  require 'rubygems'
  require 'open-uri'
  require 'hpricot'
  doc = open("http://qwantz.com/") { |f| Hpricot(f) }



Re: Help with HTML parsing

by vivnet

ok, i've started to work with hpricot. i'm debating if i should use
open-uri or watir. but thats another discussion. my current issue is,
though, with xpath. this is what i have so far...

#!ruby
require 'rubygems'   # load RubyGems first so the gem-installed libraries below can be found
require 'watir'
require 'open-uri'
require 'hpricot'


Watir.options_file = 'c:/ruby/options.yml'
Watir::Browser.default = 'ie'

test_site = "http://www.google.com"

br = Watir::Browser.new

br.goto test_site
br.text_field(:name, 'q').set("Dungeons & Dragons")
br.button(:name, 'btnG').click

doc = Hpricot(br.html)

#after the above, i'm trying to store all result elements in an array

x = Array.new
x << doc.search("//li[@class='g w0']/h3/a")

the XPath checker in firefox shows that I have the right path. but the
command doesn't work in irb. how do i drill down to specific elements so
that i can extract the title text and url? and wats wrong with my
xpath??

plz help!


Viv.


Re: Help with HTML parsing

by Phlip

Vivek Netha wrote:

> x = Array.new
> x << doc.search("//li[@class='g w0']/h3/a")
>
> the XPath checker in firefox shows that I have the right path. but the
> command doesn't work in irb. how do i drill down to specific elements so
> that i can extract the title text and url? and wats wrong with my
> xpath??

Don't use Hpricot - its XPath support covers only a few predicate and path
types. Use libxml-ruby, if you can install it, or REXML, if you don't mind the
slow speed (REXML is Ruby Electric XML, a pure-Ruby parser that leans heavily on
regular expressions), or nokogiri, which I don't know how to recommend yet, but
I will soon.

BTW Google for my street-name and XPath for all kinds of fun in Ruby with them.

--
   Phlip


Re: Help with HTML parsing

by vivnet

Hi Philip,

could you give me more pointers on how exactly you would do it using
REXML. with actual code, if possible.

thx.

Phlip wrote:

> Vivek Netha wrote:
>
>> x = Array.new
>> x << doc.search("//li[@class='g w0']/h3/a")
>>
>> the XPath checker in firefox shows that I have the right path. but the
>> command doesn't work in irb. how do i drill down to specific elements so
>> that i can extract the title text and url? and wats wrong with my
>> xpath??
>
> Don't use Hpricot - its XPath support covers only a few predicate and path
> types. Use libxml-ruby, if you can install it, or REXML, if you don't mind
> the slow speed (REXML is Ruby Electric XML, a pure-Ruby parser that leans
> heavily on regular expressions), or nokogiri, which I don't know how to
> recommend yet, but I will soon.
>
> BTW Google for my street-name and XPath for all kinds of fun in Ruby
> with them.



Re: Help with HTML parsing

by Phlip

Vivek Netha wrote:

> could you give me more pointers on how exactly you would do it using
> REXML. with actual code, if possible.

>>> x = Array.new
>>> x << doc.search("//li[@class='g w0']/h3/a")

Sure, but this is just tutorial-level REXML (hint hint), and uncompiled:

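   require 'rexml/document'
   x = []
   # my_xml_string must hold well-formed XML/XHTML, or REXML will raise a parse error
   doc = REXML::Document.new(my_xml_string)
   link = REXML::XPath.first(doc, '//li[ @class = "g w0" ]/h3/a')
   x << link.text if link   # XPath.first returns nil when nothing matches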
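
For the original question (title text and URL of the top results), a minimal sketch along the same lines might look like the following. The SAMPLE_XHTML markup, the example.com URLs, and the top_results helper are made up for illustration; real-world HTML is often not well-formed XML and would need tidying before REXML accepts it.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a results page (not Google's real markup).
SAMPLE_XHTML = <<-XHTML
<html><body><ol>
  <li class="g w0"><h3><a href="http://example.com/1">Result one</a></h3></li>
  <li class="g w0"><h3><a href="http://example.com/2">Result two</a></h3></li>
  <li class="g w0"><h3><a href="http://example.com/3">Result three</a></h3></li>
</ol></body></html>
XHTML

# Return up to `limit` hashes holding the link text and href of each match.
def top_results(xhtml, limit = 5)
  doc = REXML::Document.new(xhtml)
  links = REXML::XPath.match(doc, '//li[@class="g w0"]/h3/a')
  links.first(limit).map do |a|
    { :title => a.text, :url => a.attributes['href'] }
  end
end

top_results(SAMPLE_XHTML).each do |r|
  puts "#{r[:title]} -> #{r[:url]}"
end
```

XPath.match returns every matching node as an array, so taking the first five is just Array#first. The exact-match predicate on @class still has the ordering problem described next.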

Now we come to the very nub of the gist. A @class of "g w0" is the same as "w0
g", or any other permutation, yet XPath is literal-minded, and unaware of CSS,
so it will only match one of those permutations, when any other could have been
just as significant.

You could try contains(@class, "w0"), but that would match "w0p0p", which is a
different class.

This is why nokogiri is interesting - it might do CSS Selector notation, which
is less exact than XPath, and more aware of CSS rules. (Look up assert_select()
to see what I mean.)

For further REXML abuse, try this...

   http://www.google.com/codesearch?q=REXML%3A%3AXPath.first

...but be warned (again, I think), it's _almost_ as slow as Windows Vista with
more than one program running, so /caveat emptor/.

--
   Phlip


Re: Help with HTML parsing

by Tom Morris-5

On 2009-01-08, Phlip <phlip2005@...> wrote:
> Vivek Netha wrote:
>>>> x << doc.search("//li[@class='g w0']/h3/a")
>
> You could try contains(@class, "w0"), but that would match "w0p0p", which is a
> different class.
>

XPath that solves this:
contains(concat(' ', @class, ' '), ' w0 ')

You can join multiple classes together like this (XPath spells the
operator "and", not "&&"):
contains(concat(' ', @class, ' '), ' w0 ') and contains(concat(' ',
@class, ' '), ' g ')

It's not pretty. I've written it too much in my life.
I do wish XPath had a class selector.
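
As a rough sketch of how that predicate can be generated and used from Ruby, here is one way to do it with REXML. The class_predicate helper and the CLASS_SAMPLE markup are hypothetical names invented for this example, not part of any library.

```ruby
require 'rexml/document'

# Build an XPath predicate that matches whole class tokens, in any order.
# (Hypothetical helper, written for this example.)
def class_predicate(*names)
  names.map { |n| "contains(concat(' ', @class, ' '), ' #{n} ')" }.join(' and ')
end

# Made-up sample markup: two orderings of the same classes plus a near miss.
CLASS_SAMPLE = <<-EOX
<ul>
  <li class="g w0">first</li>
  <li class="w0 g">second</li>
  <li class="g w0p0p">third</li>
</ul>
EOX

doc = REXML::Document.new(CLASS_SAMPLE)
matches = REXML::XPath.match(doc, "//li[#{class_predicate('g', 'w0')}]")
matches.each { |li| puts li.text }
```

Padding @class with spaces on both sides is what stops ' w0 ' from matching inside "w0p0p", while still accepting "g w0" and "w0 g" alike.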

--
Tom Morris
<http://tommorris.org>


Re: Help with HTML parsing

by Phlip

Tom Morris wrote:

> XPath that solves this:
> contains(concat(' ', @class, ' '), ' w0 ')
>
> You can join multiple classes together like this:
> contains(concat(' ', @class, ' '), ' w0 ') and contains(concat(' ',
> @class, ' '), ' g ')
>
> It's not pretty. I've written it too much in my life.
> I do wish XPath had a class selector.

Noted. I'm writing an XPath DSL, above the level of a raw XML library, and I
just added to its pending feature list these line-items:

#  TODO  :class => :symbol should do the trick contains(concat(' ', @class, ' '), ' w0 ')
#  TODO  :class => [] should do the trick contains(concat(' ', @class, ' '), ' w0 ') and contains(concat(' ', @class, ' '), ' g ')
#  TODO  :class => a string should be raw.

Those won't fix low-level XPath, but they will be useful when my DSL targets
XHTML...


Re: Help with HTML parsing

by Aaron Patterson-3

On Sat, Jan 10, 2009 at 06:49:32PM +0900, Phlip wrote:

> Tom Morris wrote:
>
>> XPath that solves this:
>> contains(concat(' ', @class, ' '), ' w0 ')
>>
>> You can join multiple classes together like this:
>> contains(concat(' ', @class, ' '), ' w0 ') and contains(concat(' ',
>> @class, ' '), ' g ')
>>
>> It's not pretty. I've written it too much in my life.
>> I do wish XPath had a class selector.
>
> Noted. I'm writing an XPath DSL, above the level of a raw XML library,
> and I just added to its pending feature list these line-items:
>
> #  TODO  :class => :symbol should do the trick contains(concat(' ', @class, ' '), ' w0 ')
> #  TODO  :class => [] should do the trick contains(concat(' ', @class, ' '), ' w0 ') and contains(concat(' ', @class, ' '), ' g ')
> #  TODO  :class => a string should be raw.
>
> Those won't fix low-level XPath, but they will be useful when my DSL
> targets XHTML...

Couldn't you use CSS selectors when targeting XHTML (or XML even)?  In
Nokogiri, I have a CSS to XPath converter and just push the XPath down
to libxml.  For example:

  require 'nokogiri'
 
  doc = Nokogiri::HTML(<<-eoxml)
  <html>
    <body>
      <div class="one two">
        Hello
      </div>
    </body>
  </html>
  eoxml
 
  puts doc.at('.one.two').content

It also doesn't care if you're using XML or HTML.  The CSS selectors will get
converted to XPath either way.

--
Aaron Patterson
http://tenderlovemaking.com/


Re: Help with HTML parsing

by Phlip

> Couldn't you use CSS selectors when targeting XHTML (or XML even)?

>   puts doc.at('.one.two').content

I'm going to. As soon as I get off my lazy butt, I will make nokogiri into a
pluggable back-end to the DSL. Then the CSS selection notation is free.

But I can't make the DSL itself into a CSS hybrid because that would
disadvantage the other pluggable XML parsers. So if you want pure XML
compliance, use REXML, if you want some XHTML awareness use LibXML, and if you
want CSS notation use Nokogiri. (After I'm done!)


Re: Help with HTML parsing

by Aaron Patterson-3

On Sun, Jan 11, 2009 at 04:09:54AM +0900, Phlip wrote:

>> Couldn't you use CSS selectors when targeting XHTML (or XML even)?
>
>>   puts doc.at('.one.two').content
>
> I'm going to. As soon as I get off my lazy butt, I will make nokogiri
> into a pluggable back-end to the DSL. Then the CSS selection notation is
> free.
>
> But I can't make the DSL itself into a CSS hybrid because that would  
> disadvantage the other pluggable XML parsers. So if you want pure XML  
> compliance, use REXML, if you want some XHTML awareness use LibXML, and
> if you want CSS notation use Nokogiri. (After I'm done!)

The CSS to XPath conversion can stand alone as a pure ruby library.
Feel free to take it from my source if you like.  Then you can give CSS
notation to anything that supports XPath. :-)
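
To give a feel for why such a conversion can live in plain Ruby, here is a toy sketch. It is not Nokogiri's converter: simple_css_to_xpath is a hypothetical function written for this example, and it only understands an optional tag name followed by #id and .class parts.

```ruby
# Toy CSS-to-XPath converter (hypothetical, for illustration only).
# Supports selectors such as "div", ".one.two", or "li#top.g"; anything
# else (combinators, attribute selectors, pseudo-classes) is rejected.
def simple_css_to_xpath(selector)
  m = selector.match(/\A([A-Za-z][\w-]*|\*)?((?:[.#][\w-]+)*)\z/)
  raise ArgumentError, "unsupported selector: #{selector}" unless m

  tag = m[1] || '*'
  tests = m[2].scan(/([.#])([\w-]+)/).map do |kind, name|
    if kind == '#'
      "@id='#{name}'"
    else
      # Whole-token class match, independent of class order.
      "contains(concat(' ', @class, ' '), ' #{name} ')"
    end
  end
  tests.empty? ? "//#{tag}" : "//#{tag}[#{tests.join(' and ')}]"
end

puts simple_css_to_xpath('.one.two')
puts simple_css_to_xpath('li#top.g')
```

The output is an ordinary XPath string, so it can be handed to whichever XPath engine is in use.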

--
Aaron Patterson
http://tenderlovemaking.com/


Re: Help with HTML parsing

by Jun Young Kim

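You can also use the Ruby library Sanitize (http://wonko.com/post/sanitize).

It is a whitelist-based HTML sanitizer rather than a parser: it strips every
element and attribute you have not explicitly allowed, which makes it easy to
reduce an HTML fragment to plain text or to a safe subset of markup.

Let's look at the following examples.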

Using Sanitize is easy. First, install it:
  sudo gem install sanitize

Then call it like so:

  require 'rubygems'
  require 'sanitize'

  html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

  Sanitize.clean(html) # => 'foo'

By default, Sanitize removes all HTML. You can use one of the built-in  
configs to tell Sanitize to allow certain attributes and elements:

  Sanitize.clean(html, Sanitize::Config::RESTRICTED)
  # => '<b>foo</b>'

  Sanitize.clean(html, Sanitize::Config::BASIC)
  # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

  Sanitize.clean(html, Sanitize::Config::RELAXED)
  # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Or, if you’d like more control over what’s allowed, you can provide  
your own custom configuration:

  Sanitize.clean(html, :elements => ['a', 'span'],
      :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
      :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})

good one :)

On 2009. 01. 02, at 6:42 AM, Vivek Netha wrote:

> Hello,
>
> I'm new to Watir/Ruby and need to solve a problem that involves HTML
> parsing - you could also call it screen scraping. I haven't used either
> library before, but I wanted to know whether it is better to use Hpricot or
> open-uri. The problem is similar to the one below:
>
> Let's say I'm searching Google for some string, "Dungeons & Dragons" for
> instance. I want to parse the first results page and get the
> title text and URL for the top 5 results. How would I do this using
> Hpricot or open-uri or both?
>
> Please help!
>
>
> Viv.
> --
> Posted via http://www.ruby-forum.com/.
>
>



Re: Help with HTML parsing

by Santosh Turamari

Hi,

  I am using Sanitize.clean() to strip HTML tags from content, but
the difficulty is that I want to keep some of the tags. This is what I
have:

html = File.new(file).read
soup = BeautifulSoup.new(html)
soup.title.contents = ['']
soup.find_all.each do |tag|
  if tag.string != nil
    # wrap the contents in a semantic tag when the inline style asks for it
    tag.contents = ['<strong>' + tag.contents.to_s + '</strong>'] if tag['style'] =~ /bold/
    tag.contents = ['<em>' + tag.contents.to_s + '</em>'] if tag['style'] =~ /italic/
    tag.contents = ['<u>' + tag.contents.to_s + '</u>'] if tag['style'] =~ /underline/
  end
end
soup_string = str_replace(soup.html.to_s)

return Sanitize.clean(soup_string.to_s, :elements =>
  ['div', 'p', 'span', 'center', 'table', 'tr', 'th', 'td', 'blockquote', 'br',
   'cite', 'code', 'dd', 'dl', 'dt', 'em', 'i', 'li', 'ol', 'pre', 'q',
   'small', 'strike', 'strong', 'sub', 'sup', 'u', 'ul', 'tbody'])

  The problem is that I also want to preserve center and right
justification, which does not happen even when I list 'center' here. If
anybody knows how to preserve justification, please help me.

Thanks In Advance,
Santosh




Jun Young Kim wrote:

> you can also use ruby library Sanitize (http://wonko.com/post/sanitize)
>
> This library can make you parse html template very easily.
>
> let's see the following examples.
>
> Using Sanitize is easy. First, install it:
>   sudo gem install sanitize
>
> Then call it like so:
>
>   require 'rubygems'
>   require 'sanitize'
>
>   html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
>
>   Sanitize.clean(html) # => 'foo'
>
> By default, Sanitize removes all HTML. You can use one of the built-in
> configs to tell Sanitize to allow certain attributes and elements:
>
>   Sanitize.clean(html, Sanitize::Config::RESTRICTED)
>   # => '<b>foo</b>'
>
>   Sanitize.clean(html, Sanitize::Config::BASIC)
>   # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
>
>   Sanitize.clean(html, Sanitize::Config::RELAXED)
>   # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
>
> Or, if you’d like more control over what’s allowed, you can provide
> your own custom configuration:
>
>   Sanitize.clean(html, :elements => ['a', 'span'],
>       :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
>       :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
>
> good one :)
>
> On 2009. 01. 02, at 6:42 AM, Vivek Netha wrote:
