Parsing HTML documents?

Parsing HTML documents?

by Bradley Kite-2

Hello.

I have been looking for a C library that provides a DOM interface to
parsed HTML documents, but I have been struggling to make it work
the way that I'd like (no doubt because I'm trying to use it
incorrectly!).

Firstly, can gdome be used to parse HTML documents? I am aware that
it's more geared towards XML, which, although similar, has obvious
differences!

In any case, I'm using one of the examples as a start, however I'm
getting this error while calling parse():

parser error : StartTag: invalid element name
<!doctype html><head><title>

I guess the main question I have is: am I using the right tool? Or
should I be using something more suited to HTML? If so, would anybody
have any recommendations? I need to be able to modify various
components of the DOM, and it needs to be written in C (or C++, but
preferably C).

Many thanks
--
Bradley Kite
_______________________________________________
gdome mailing list
gdome@...
http://mail.gnome.org/mailman/listinfo/gdome

Re: Parsing HTML documents?

by Luca Padovani

Hello Bradley,

On Sun, Aug 16, 2009 at 10:59 PM, Bradley Kite <bradley.kite@...> wrote:
> I guess the main question I have is: am I using the right tool? Or
> should I be using something more suited to HTML? If so, would anybody
> have any recommendations? I need to be able to modify various
> components of the DOM, and it needs to be written in C (or C++, but
> preferably C).

I don't remember the current status of HTML support in Gdome2.
In any case, I'd recommend having a look at libxml2:

  http://xmlsoft.org/

It is the engine that underlies Gdome2 (which is just a wrapper on top
of it), and it definitely supports HTML parsing. Its API is not exactly
the DOM one (which is why Gdome2 exists), but it comes very close. And
it's written in plain C.

Best regards,
--luca

Re: Parsing HTML documents?

by Paolo Casarini

Hello Bradley,

gdome2 works only with well-formed XML documents. The only released modules (with respect to the DOM specifications) are Main, Events and XPath; the HTML module is an old, non-working, unmaintained implementation.

If you deal only with well-formed XHTML documents, you can use gdome2; otherwise, as Luca already said, libxml2 is the right choice for you.

Best regards,
  Paolo.


Re: Parsing HTML documents?

by Bradley Kite-2

2009/8/17 Paolo Casarini <paolo@...>:

> Hello Bradley,
>
> gdome2 works only with well-formed XML documents. The only released
> modules (with respect to the DOM specifications) are Main, Events and
> XPath; the HTML module is an old, non-working, unmaintained implementation.
>
> If you deal only with well-formed XHTML documents, you can use gdome2;
> otherwise, as Luca already said, libxml2 is the right choice for you.
>
> Best regards,
>   Paolo.

Hi Luca and Paolo

Many thanks for your responses. I have started using libxml2 and found
it to be exactly what I require!

Kind Regards
--
Brad.