a strange Encoding issue?

by ianwong

Hi everyone,

I want to use Rome to get news entries from the following URL:
http://www.trainingpressreleases.co.uk/rss.ashx?incCat=1
but I get an exception: Invalid XML: Error on line 1: Content is not allowed in prolog.

That RSS feed works fine with Firefox, IE and some RSS software I tested.

I tried to print out the content returned by the connection, and it looks strange:
?< ? x m l  v e r s i o n = .....

How can I use Rome to parse a URL like that? Is it an encoding issue?

Thanks

ian
 

Re: a strange Encoding issue?

by Martin Kurz

Hi Ian,

What version of Rome are you using, and how are you reading the feed? The
problem is encoding related: the feed is UTF-16, which uses two-byte code
units, and the file's first two bytes mark the UTF-16 variant (big endian
or little endian; the so-called "byte order mark", or "BOM" for short).
So when you try to read the feed, the parser does not seem to recognize
the UTF-16 encoding and sees some bytes before the opening XML
declaration, which is not allowed. I made a simple test case:


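     // Imports assume the Rome 1.0 package layout; the class name is arbitrary.
     import java.io.PrintWriter;
     import java.net.URL;
     import com.sun.syndication.feed.synd.SyndFeed;
     import com.sun.syndication.io.SyndFeedInput;
     import com.sun.syndication.io.SyndFeedOutput;
     import com.sun.syndication.io.XmlReader;

     public class FeedEncodingTest {
       public static void main( String[] args ) {
         try {
           URL feedUrl = new URL( "http://www.trainingpressreleases.co.uk/rss.ashx?incCat=1" );
           SyndFeedInput input = new SyndFeedInput();
           // XmlReader works out the charset from the BOM and the XML declaration
           XmlReader xr = new XmlReader( feedUrl.openStream() );
           System.out.println( "Encoding " + xr.getEncoding() );
           SyndFeed feed = input.build( xr );
           feed.setEncoding( "UTF-8" ); // encoding declared in the generated output
           PrintWriter pw = new PrintWriter( System.out );
           SyndFeedOutput output = new SyndFeedOutput();
           output.output( feed, pw, true ); // true = pretty print
           pw.flush();
         } catch ( Exception ex ) {
           ex.printStackTrace();
         }
       }
     }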

I can parse the feed and convert it to UTF-8 for output without any
problem with Rome (tested with Rome 1.0). Could you verify that you can
parse and output the feed with the code above?

Greetings,

Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...
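
As a side note to Martin's explanation: the byte order mark can be checked by looking at the first bytes of the response before handing them to any parser. The sketch below is plain JDK code, not part of Rome. The class name BomSniffer and the method detect are made up for illustration, and it only knows the UTF-8 and UTF-16 marks (a UTF-32 little-endian mark would be misreported as UTF-16LE).

```java
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    /**
     * Returns the encoding implied by a leading byte order mark,
     * or null when the data does not start with a known BOM.
     * Only UTF-8 and UTF-16 marks are checked (illustrative, not exhaustive).
     */
    public static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        // Java's "UTF-16" charset writes a big-endian BOM when encoding.
        byte[] withBom = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_16);
        byte[] withoutBom = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(withBom));     // UTF-16BE
        System.out.println(detect(withoutBom));  // null
    }
}
```

If something like this reports UTF-16 for the start of a feed while your own code decodes the stream as UTF-8 or Latin-1, that would explain the spaced-out "?< ? x m l" text in the original post.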


Re: a strange Encoding issue?

by Robert (kebernet) Cooper

I am quite certain what you are seeing is a UTF-16 XML file that is being declared as UTF-8 in the HTTP header.

Martin is definitely correct. The big thing is to make sure you are using XmlReader or RomeFetcher when you are parsing. There is a good bit of dark magic in there to deal with everyone's broken content types on the internet.

On Tue, Apr 14, 2009 at 4:36 PM, Martin Kurz <info@...> wrote:
> Hi Ian,
>
> what version of Rome are you using, and how are you reading the feed? [...]

--
:Robert "kebernet" Cooper
::kebernet@...
Alice's cleartext
Charlie is the attacker
Bob signs and encrypts
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x9E8759F8
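
To illustrate the mismatch Robert describes, here is a minimal sketch that uses only the JDK's built-in XML parser (no Rome classes) on a small in-memory UTF-16 document. The class name PrologDemo, its method names and the sample feed are all made up for illustration. Rome's XmlReader does its own, more thorough detection; this only shows why decoding the bytes with the wrong charset up front breaks the prolog.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class PrologDemo {

    /** A tiny made-up feed, encoded as UTF-16 with a leading BOM. */
    public static byte[] sampleFeed() {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?>"
                + "<rss version=\"2.0\"><channel><title>demo</title></channel></rss>";
        return xml.getBytes(StandardCharsets.UTF_16);
    }

    /** Hands the raw bytes to the parser, which detects the encoding itself. */
    public static String rootFromBytes(byte[] data) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(data));
        return doc.getDocumentElement().getNodeName();
    }

    /** Decodes the bytes as UTF-8 first, as a wrong HTTP charset would suggest. */
    public static boolean failsWhenDecodedAsUtf8(byte[] data) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader wrong = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        try {
            builder.parse(new InputSource(wrong));
            return false;
        } catch (SAXException e) {
            // The mis-decoded BOM arrives as junk characters before "<?xml".
            System.out.println("Parse failed: " + e.getMessage());
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] feed = sampleFeed();
        System.out.println(rootFromBytes(feed));           // rss
        System.out.println(failsWhenDecodedAsUtf8(feed));
    }
}
```

The point of the design is the same as with XmlReader: give the XML layer the raw byte stream and let it decide the encoding, rather than wrapping the stream in a Reader with a charset taken on trust from the HTTP header.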

Re: a strange Encoding issue?

by ianwong

Thanks for the help, Martin.

I am using Rome 1.0. Your explanation is really helpful.

Ian