parsing ill-formed rss


by Aaron Dixon

I'm using ROME to parse generated RSS feeds, but of course many RSS
feeds are ill-formed XML and cause conforming XML parsers to fail. (An
example of the actual ill-formed RSS feed I'm getting is found below.)

Does anyone have experience using a "loose parser" that is forgiving
of ill-formed XML? I know the kind of "loose rules" I would like to
apply, but I don't want to have to implement my own parser or stream
filter. I'd prefer to use a loose parser that lets me hook in some
specific behavior. Has anyone done this before?

Part of the feed I'm parsing looks like this:
...
<item>
<title>D;< ugggh.. [cousin's idiotic friends] stupid!!! >;[[[</title>
<link>http://twitter.com/Santaysiaaa/statuses/1920295596</link>
<description><![CDATA[  ]]></description>
<pubDate>Tue, 26 May 2009 04:28:33 +0000</pubDate>
<guid>http://twitter.com/Santaysiaaa/statuses/1920295596</guid>
</item>
...

Thanks!
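
To make the failure concrete, here is a minimal sketch (not from the original post) that checks well-formedness with the JDK's built-in parser. The class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for illustration. The point it shows: the raw "<" inside <title> is what a conforming parser rejects, while the same text with "&lt;" parses.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of ROME.
final class WellFormedCheck {

    /** Returns true if the JDK's default XML parser accepts the input. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown for ill-formed input, e.g. a raw '<' inside character data.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the <title> line from the feed above, isWellFormed returns false for the raw text and true once the "<" is written as "&lt;".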

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: parsing ill-formed rss

by Charles HOPE

Have you tried Tag Soup? <http://home.ccil.org/~cowan/XML/tagsoup/>







--
Never did I see a second sun
Never did my skin touch a land of glass
Never did my rifle point but true
But in a land empty of enemies
Waiting for the tick-tick-tick of the want
A uranium angel
Crying “behold,”
This land that knew fire is yours
Taken from Corruption
To begin anew

Re: parsing ill-formed rss

by Aaron Dixon

I tried the TagSoup parser but ran into the same issues.

I wrote an RssFixerReader that lets me register the tag names that I
expect might contain bad data. I wrap the RSS input reader with this
fixer whenever I parse a feed that I expect to be ill-formed.
It's a pretty simple state machine, but fairly tailored to my problem.
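
The RssFixerReader itself was not posted, so the following is only a sketch of the general idea under stated assumptions, not the author's code. The class name TagTextEscaper and its methods register, wrap, and fix are hypothetical. It assumes the registered elements have no attributes and are not nested, and it buffers the whole feed in memory instead of streaming. It alternates between two states: outside a registered element (copy verbatim) and inside one (escape raw markup characters, leaving CDATA sections and existing entity references alone).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; not the RssFixerReader described in the thread.
final class TagTextEscaper {
    private final Set<String> tagNames = new LinkedHashSet<>();

    /** Registers an element whose text may contain raw '<', '>' or '&'. */
    TagTextEscaper register(String tagName) {
        tagNames.add(tagName);
        return this;
    }

    /** Reads the whole input, fixes it, and returns a Reader over the result. */
    Reader wrap(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return new StringReader(fix(sb.toString()));
    }

    String fix(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 64);
        int pos = 0;
        while (pos < xml.length()) {
            // State "outside": find the earliest opening tag among registered names.
            String tag = null;
            int open = -1;
            for (String name : tagNames) {
                int i = xml.indexOf("<" + name + ">", pos);
                if (i != -1 && (open == -1 || i < open)) {
                    open = i;
                    tag = name;
                }
            }
            int bodyStart = (tag == null) ? -1 : open + tag.length() + 2;
            int close = (tag == null) ? -1 : xml.indexOf("</" + tag + ">", bodyStart);
            if (close == -1) {
                // No further (complete) registered element: copy the rest verbatim.
                out.append(xml, pos, xml.length());
                break;
            }
            out.append(xml, pos, bodyStart);
            // State "inside": escape the element's text up to its closing tag.
            escape(xml, bodyStart, close, out);
            out.append("</").append(tag).append('>');
            pos = close + tag.length() + 3;
        }
        return out.toString();
    }

    private static void escape(String s, int from, int to, StringBuilder out) {
        if (s.startsWith("<![CDATA[", from)) {
            out.append(s, from, to); // already protected by CDATA; leave untouched
            return;
        }
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c == '<') {
                out.append("&lt;");
            } else if (c == '>') {
                out.append("&gt;");
            } else if (c == '&' && !isEntityAt(s, i, to)) {
                out.append("&amp;");
            } else {
                out.append(c);
            }
        }
    }

    /** True if the '&' at index i starts something shaped like "&name;" or "&#123;". */
    private static boolean isEntityAt(String s, int i, int to) {
        int semi = s.indexOf(';', i + 1);
        if (semi == -1 || semi >= to || semi == i + 1) {
            return false;
        }
        for (int k = i + 1; k < semi; k++) {
            char c = s.charAt(k);
            if (!Character.isLetterOrDigit(c) && c != '#') {
                return false;
            }
        }
        return true;
    }
}
```

A caller would wrap the feed's Reader, for example new TagTextEscaper().register("title").wrap(feedReader), and pass the result to the feed parser in place of the original reader. A streaming version would need lookahead to recognize the closing tag, which is where the state machine gets more involved.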

On Tue, Jun 2, 2009 at 4:21 PM, Charles HOPE
<lookslikeiwasright@...> wrote:

> Have you tried Tag Soup? <http://home.ccil.org/~cowan/XML/tagsoup/>