entities: text or rawText?

by Lukas Theussl-4

Vincent,

I'm trying to understand some of the issues we have with entities in the
XmlParser. Is there a special reason why entities are emitted as rawText and not text?

I think they should be emitted as text:

First, custom entities can be used to simply define some replacement text inside
documents (eg <!ENTITY version "1.0">).

Second, the resulting events should be consumable by all sinks, not just x(ht)ml
based ones. Consider for instance the text "&amp;&AElig;" (where AElig is defined
as <!ENTITY AElig "&#198;">). Currently it is emitted by the XhtmlBaseParser as
one text event "&" and one rawText event "&#198;". This means that eg the Latex
Sink will produce wrong output (the AElig should be converted to "\AE" in latex).

IMO the resolved entity should be emitted in a format-independent way, eg as one
(unicode?) character, just like &amp; is emitted as one character above. The
consuming sink then has to transform that into a format-specific representation.
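
To make the idea concrete, here is a minimal sketch of the sink side, assuming the parser has already delivered plain Unicode characters. The class LatexTextSketch and its two-entry mapping table are hypothetical and only for illustration; they are not the Doxia Latex Sink.

```java
import java.util.Map;

// Hypothetical helper, not part of Doxia: shows only the sink-side mapping
// from format-independent characters to a LaTeX representation.
class LatexTextSketch {
    // Assumed, deliberately tiny mapping table. The braces around \AE keep
    // the command from swallowing letters that follow it.
    private static final Map<Character, String> LATEX = Map.of(
        '&', "\\&",
        '\u00C6', "{\\AE}");

    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            String mapped = LATEX.get(c);
            // Characters without a mapping are passed through unchanged.
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }
}
```

With this sketch the sink never needs to know that the character came from an entity: it receives "&" and "Æ" as ordinary text and picks its own representation.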

WDYT?
-Lukas



Re: entities: text or rawText?

by Vincent Siveton

Hi Lukas,

2009/5/4 Lukas Theussl <ltheussl@...>:
>
> Vincent,
>
> I'm trying to understand some of the issues we have with entities in the
> XmlParser. Is there a special reason why entities are emitted as rawText and
> not text?

The text used by XhtmlBaseParser#handleEntity() could contain
predefined entities [1] and numeric character references (ie &AElig; will
become &#198; by XmlPullParser).
XhtmlBaseSink#text() escapes chars, whereas XhtmlBaseSink#rawText() does not.

So rawText() is used to make sure that text containing entities is not escaped again.
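
A minimal sketch of that difference, assuming a much simplified sink. MiniXhtmlSink is hypothetical and is not the Doxia XhtmlBaseSink; it only mimics the escaping behaviour described above.

```java
// Hypothetical, simplified stand-in for an XHTML sink; not the Doxia class.
class MiniXhtmlSink {
    private final StringBuilder out = new StringBuilder();

    // text(): escapes markup characters before writing them.
    void text(String s) {
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
    }

    // rawText(): writes the string verbatim, with no escaping.
    void rawText(String s) {
        out.append(s);
    }

    String result() {
        return out.toString();
    }
}
```

In this sketch, passing a character reference such as "&#198;" to text() writes "&amp;#198;" (the ampersand is escaped a second time), while rawText() writes it unchanged.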

> I think they should be emitted as text:
>
> First, custom entities can be used to simply define some replacement text
> inside documents (eg <!ENTITY version "1.0">).
>
> Second, the resulting events should be consumable by all sinks, not just
> x(ht)ml based ones. Consider for instance the text "&amp;&AElig;" (where
> AElig is defined as <!ENTITY AElig "&#198;">). Currently it is emitted by
> the XhtmlBaseParser as one text event "&" and one rawText event "&#198;".
> This means that eg the Latex Sink will produce wrong output (the AElig
> should be converted to "\AE" in latex).
>
> IMO the resolved entity should be emitted in a format-independent way, eg as
> one (unicode?) character, just like &amp; is emitted as one character above.
> The consuming sink then has to transform that into a format-specific
> representation.

That could be an alternative implementation:
XhtmlBaseParser#handleEntity() could unescape the XML and call only sink.text().
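
A rough sketch of what such an unescape step might look like, assuming only the five predefined entities and numeric character references need handling. EntityUnescapeSketch is a hypothetical helper, not existing Doxia code; the result would then be passed to sink.text().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Doxia: resolves predefined entities and
// numeric character references to plain characters in a single pass.
class EntityUnescapeSketch {
    private static final Pattern REF =
        Pattern.compile("&(#x[0-9a-fA-F]+|#[0-9]+|amp|lt|gt|quot|apos);");

    static String unescape(String raw) {
        Matcher m = REF.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(resolve(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    // No validation here: an out-of-range code point would throw.
    private static String resolve(String name) {
        if (name.startsWith("#x")) {
            return new String(Character.toChars(Integer.parseInt(name.substring(2), 16)));
        }
        if (name.startsWith("#")) {
            return new String(Character.toChars(Integer.parseInt(name.substring(1))));
        }
        switch (name) {
            case "amp": return "&";
            case "lt": return "<";
            case "gt": return ">";
            case "quot": return "\"";
            default: return "'"; // "apos", the only remaining alternative in REF
        }
    }
}
```

Because each reference is resolved exactly once, already escaped input such as "&amp;#198;" comes out as the literal text "&#198;" rather than being unescaped twice.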

Cheers,

Vincent

[1] http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-predefined-ent

Re: entities: text or rawText?

by Lukas Theussl-4


For reference: the XhtmlBaseParser in Doxia 1.1.1 emits entities as text, unless
they are not recognized (ie haven't been declared), in which case they are
emitted as unknown events.

-Lukas

