Encoding problem


by Daniele Dellafiore

Hi everyone.

I am having trouble parsing this feed:

http://rateyourmusic.com/rss/latest

The problem is with the body of the review. It displays fine if I open
that URL in Firefox.
Rome, however, parses it this way:

"They of course miss &#160;&#34;Miss misery&#34;, but there is really
underrated &#34;This month&#39;s messiah&#34;"

and it really looks like an encoding error. I parse it this way:

    XmlReader reader = new XmlReader(stream);
    SyndFeed feed = input.build(reader);

XmlReader reports UTF-8 as its encoding when asked, while SyndFeed reports a "null" encoding.

Any ideas?

--
Daniele Dellafiore
http://blog.ildella.net/

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Encoding problem

by Jasha Joachimsthal-2

Hi Daniele,

2009/3/31 Daniele Dellafiore <ildella@...>
> Hi everyone.
>
> I am having trouble parsing this feed:
>
> http://rateyourmusic.com/rss/latest
>
> Problem is with the body of of the review. I can see it fine if I open
> that URL with Firefox.
> Rome instead is parsing this way:
>
> "They of course miss &#160;&#34;Miss misery&#34;, but there is really
> underrated &#34;This month&#39;s messiah&#34;"
>
> and it really seems some encoding error. I parse this way:
>
>     XmlReader reader = new XmlReader(stream);
>     SyndFeed feed = input.build(reader);
>
> XmlReader has utf-8 encoding, if asked. while SyndFeed has "null" encoding.
>
> Any idea?

In Rome 1.0 you can do something like
WireFeedInput input = new WireFeedInput();
 XmlReader reader = new XmlReader(stream);
Feed feed = (Feed) input.build(reader, true, "UTF-8"));

Regards,

--
Jasha Joachimsthal

j.joachimsthal@... - jasha@...

www.onehippo.com
Amsterdam - Hippo B.V. Oosteinde 11 1017 WT Amsterdam +31(0)20-5224466
San Francisco - Hippo USA Inc. 101 H Street, suite Q Petaluma CA 94952-5100 +1 (707) 773-4646

Re: Encoding problem

by Daniele Dellafiore

Hi, thanks for the reply.

WireFeedInput does not have a method that accepts an XmlReader, a boolean
and a String as parameters.
I can build the XmlReader with

XmlReader reader = new XmlReader(stream, true, "UTF-8");

but the XmlReader was already able to identify the UTF-8 character encoding,
so the problem really is with the feed that has been built.



Re: Encoding problem

by Jasha Joachimsthal-2




Ah you're right, I misread the parentheses. WireFeed does however have a setEncoding method, can't you call that?



Re: Encoding problem

by Daniele Dellafiore

On Tue, Mar 31, 2009 at 9:44 AM, Jasha Joachimsthal
<j.joachimsthal@...> wrote:

> Ah you're right, I misread the parentheses. WireFeed does however have a
> setEncoding method, can't you call that?

I already tried that on the SyndFeed, but you can only call setEncoding
after the feed has been built via the FeedInput, and it has no effect.
I can try it on the WireFeed, but I cannot work out how to get the list of
entries from the WireFeed: getModules() returns an empty list and
there is no getEntries() like in SyndFeed.



Re: Encoding problem

by Daniele Dellafiore

You can try running this code to see what happens:

    URL url = new URL("http://rateyourmusic.com/rss/latest");
    XmlReader reader = new XmlReader(url.openConnection());
    SyndFeed feed = new SyndFeedInput().build(reader);
    List entries = feed.getEntries();
    for (Iterator it = entries.iterator(); it.hasNext();) {
        SyndEntry entry = (SyndEntry) it.next();
        // getDescription() returns a SyndContent; println prints its toString()
        System.out.println(entry.getDescription());
        System.out.println("---------------*************************************--------------------------");
        System.out.println();
        System.out.println();
    }




Re: Encoding problem

by Martin Kurz

Hi Daniele,

your problem isn't really related to character encoding. All non-ASCII
characters in the feed are escaped as numeric entities (similar to writing
\u00A0 in Java to insert a special Unicode character in a string). For
example, the entity &#160; is a non-breaking space (&nbsp; in HTML) and
&#34; is the double quote sign ("). When you look at the feed in Firefox,
these entities are unescaped to the characters by Firefox. You can of
course do the same thing in Java: for example, with commons-lang.jar
(http://commons.apache.org) you could use

StringEscapeUtils.unescapeXml( entry.getDescription().getValue() )

to get the unescaped characters instead of the entities.
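
To make the idea concrete, here is a minimal sketch of what unescaping numeric entities amounts to, using only the JDK rather than commons-lang. The class name NumericEntityDecoder and its decode method are made up for this illustration and are not part of ROME or commons-lang; named entities such as &amp; are deliberately left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of ROME or commons-lang.
class NumericEntityDecoder {
    // Matches decimal (&#160;) and hexadecimal (&#xA0;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    static String decode(String text) {
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                int codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                // Leave the reference as-is if it is not a valid code point.
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : m.group();
            } catch (NumberFormatException e) {
                // Number too large to parse; leave the reference as-is.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample text taken from the feed excerpt quoted in this thread.
        System.out.println(decode("They of course miss &#160;&#34;Miss misery&#34;"));
    }
}
```

For real feeds a library such as commons-lang is probably the better choice, since it also handles named entities.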

Greetings,

Martin


RE: Encoding problem

by Nick Lothian


There are a number of things going on here:

1) The charset encoding isn't set by the feed parsers. I'm not entirely sure why this is, but it is documented in the Javadoc for WireFeed.setEncoding(..)

2) You can get access to the underlying original WireFeed by doing the following:

    SyndFeedInput input = new SyndFeedInput();
    input.setPreserveWireFeed(true);
    SyndFeed feed = input.build(reader);
    WireFeed wireFeed = feed.originalWireFeed();
    ...

3) You aren't actually having a charset encoding problem here. What you are seeing is HTML-entity-encoded characters inside an XML CDATA section. For example, a description looks like this:

<description><![CDATA[
<img src="http://static.rateyourmusic.com/album_images/s535747.jpg" width="150" style="margin:15px;" align="right" />   Rated: 2.0 Stars<br /> <br />Medebor plays really bland progressive death metal. Attempts to produce an emotive sound by using a little too much melody result in nothing poignant. "Starlight" is the only remotely good song. They sound like a band that Opeth-haters would use as an alternative to Opeth in order to seem more 'true'. That sums up Medebor here pretty well; a lesser Opeth. <em class="rymfmt">Phantasma</em> is more metal that is too serious for its own good, when it really has no point to make. I don't know why I chose this, of all metal albums, to say that for, but there it is, and it applies to hundreds more albums that no one will ever listen to just like this one. ]]></description>

Handling this is outside the scope of ROME, but I believe that StringEscapeUtils from Apache Commons Lang should be able to help you (e.g. http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html#unescapeHtml(java.lang.String) )
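
The difference a CDATA section makes can be checked with the JDK's own XML parser. The sketch below (the class and method names are hypothetical, chosen only for this example) parses the same text once as ordinary element content and once inside CDATA. Only in the first case does the parser turn &#34; into a quote character, which would explain why the description comes back with the entities intact.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses the JDK's DOM parser, not ROME.
class CdataEntityDemo {
    // Parses a small XML document and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Ordinary element content: the XML parser resolves the references.
        System.out.println(rootText("<description>&#34;Starlight&#34;</description>"));
        // CDATA content: the references are passed through as literal text.
        System.out.println(rootText("<description><![CDATA[&#34;Starlight&#34;]]></description>"));
    }
}
```

The first call yields the text with real quote characters, while the second still contains the literal &#34; sequences, so a second unescaping step is needed for CDATA content.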

I hope this helps you out!

Regards
  Nick Lothian





Re: Encoding problem

by Daniele Dellafiore

Thanks for all the replies. StringEscapeUtils does indeed solve my problem :)
