Parsing feeds with wrongly defined namespaces

by Thibaut_

Hi,

Some feeds don't have all namespaces defined (maybe due to an error on their site). Rome should still try to parse those feeds and ignore the wrong tags.

E.g., parsing the following feed, http://fr.techcrunch.com/2009/07/21/cest-lete-blog-au-ralenti/feed/, fails with the exception below because it contains XML like "<title>Par : <fb:name linked="false" useyou="false" uid="630011441">Jonathan Fischer</fb:name></title>", where the "fb" prefix is never declared:


com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 113: The prefix "fb" for element "fb:name" is not bound.
        at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:198)
        at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:123)
        at Test101.main(Test101.java:133)
Caused by: org.jdom.input.JDOMParseException: Error on line 113: The prefix "fb" for element "fb:name" is not bound.
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:533)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:946)
        at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:194)
        ... 4 more
Caused by: org.xml.sax.SAXParseException: The prefix "fb" for element "fb:name" is not bound.
        at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:518)
        ... 6 more


Any ideas on what I can change in the SAXBuilder/Xerces setup code to ignore tags with undefined namespaces?

Thanks,
Thibaut

Re: Parsing feeds with wrongly defined namespaces

by Robert (kebernet) Cooper

Wow, that feed is all kinds of a mess.

Looking at it, if you just need the basic Atom info, you could do:

namespaces=false

namespace-prefixes=true


as SAX parser features (http://xml.org/sax/features/namespaces and http://xml.org/sax/features/namespace-prefixes). You would lose module support, though, for all of the extensions in the file. That would parse the XML tree and give you a namespace-free document with "fb:whatever" and "dc:whatever" as the element names. Otherwise, I am not sure any of the current Java parsers would deal with that.
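
For reference, here is a minimal sketch of what those two settings look like in code. It assumes the JDK's built-in SAX parser and talks to it directly rather than through Rome or JDOM. The class name UnboundPrefixDemo and the sample XML are made up for illustration; only the two feature URIs are standard SAX2.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Rome or JDOM.
class UnboundPrefixDemo {

    /** Parses xml with namespace processing off and returns the raw element names. */
    static List<String> elementNames(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // The two settings from the post, expressed as standard SAX2 features.
        reader.setFeature("http://xml.org/sax/features/namespaces", false);
        reader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);

        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // With namespaces off, qName is the name exactly as written, e.g. "fb:name".
                names.add(qName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Cut-down stand-in for the broken feed: "fb" is never declared.
        String xml = "<feed><title>Par : <fb:name uid=\"630011441\">"
                + "Jonathan Fischer</fb:name></title></feed>";
        System.out.println(elementNames(xml)); // prints [feed, title, fb:name]
    }
}
```

Judging by the stack trace, Rome creates its SAXBuilder inside WireFeedInput, so getting these features applied there would presumably need a patched or subclassed input class; that part is not shown here.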

For stuff like this, and especially *hehe* it being on TechCrunch, emailing the site admin and just telling them to fix it isn't a bad idea either.


On Tue, Aug 4, 2009 at 10:57 AM, Thibaut_ <tbritz@...> wrote:

[snip: original message, quoted in full above]

--
View this message in context: http://www.nabble.com/Parsing-feeds-with-wrongly-defined-namespaces-tp24810235p24810235.html
Sent from the Rome - Development mailing list archive at Nabble.com.


--
:Robert "kebernet" Cooper
::kebernet@...
Alice's cleartext
Charlie is the attacker
Bob signs and encrypts
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x9E8759F8

Re: Parsing feeds with wrongly defined namespaces

by Thibaut_

Thanks,
I will try your suggestions out.

It also happens sometimes on the main site, not just the French subsidiary. But it's better to handle the problem at the root (e.g., by modifying Rome to add dummy namespace declarations), because other sites might do this as well.
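
The "dummy namespace declarations" idea could be sketched as a pre-processing step on the raw feed text before it reaches the parser. This is only a rough illustration, not anything Rome does: the class DummyNamespaceInjector, its method, and the urn:dummy: URIs are all made up, and the regex scan ignores prefixed attributes and does not understand comments or CDATA sections.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Rome.
class DummyNamespaceInjector {

    // Captures the prefix of a prefixed start tag such as <fb:name ...>.
    private static final Pattern PREFIXED_TAG =
            Pattern.compile("<([A-Za-z_][\\w.-]*):[A-Za-z_]");

    // First start tag in the text; skips <?xml ...?> and <!DOCTYPE ...>.
    private static final Pattern ROOT_TAG = Pattern.compile("<[A-Za-z_][^\\s/>]*");

    /** Returns xml with a placeholder xmlns:prefix declaration added to the root for each undeclared prefix. */
    static String addMissingDeclarations(String xml) {
        Set<String> missing = new LinkedHashSet<>();
        Matcher tags = PREFIXED_TAG.matcher(xml);
        while (tags.find()) {
            String prefix = tags.group(1);
            // Treat a prefix as declared if xmlns:prefix= appears anywhere in the text.
            boolean declared = Pattern
                    .compile("xmlns:" + Pattern.quote(prefix) + "\\s*=")
                    .matcher(xml).find();
            if (!declared && !prefix.equals("xml")) {
                missing.add(prefix);
            }
        }
        Matcher root = ROOT_TAG.matcher(xml);
        if (missing.isEmpty() || !root.find()) {
            return xml;
        }
        StringBuilder decls = new StringBuilder();
        for (String prefix : missing) {
            // Placeholder URI (an assumption): it only makes the document well-formed.
            decls.append(" xmlns:").append(prefix)
                 .append("=\"urn:dummy:").append(prefix).append('"');
        }
        // Insert the declarations right after the root element's name.
        return xml.substring(0, root.end()) + decls + xml.substring(root.end());
    }
}
```

Elements bound to a placeholder URI would then get through a namespace-aware parser, but no module would be expected to recognise them, since modules are matched on their real namespace URIs.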

Thibaut



