Hi Lukas,
2009/5/4 Lukas Theussl <ltheussl@...>:
>
> Vincent,
>
> I'm trying to understand some of the issues we have with entities in the
> XmlParser. Is there a special reason why entities are emitted as rawText and
> not text?
The text used by XhtmlBaseParser#handleEntity() could contain
predefined entities [1] and numeric code entities (i.e. &#198; will
become Æ by XmlPullParser).
XhtmlBaseSink#text() escapes characters, whereas XhtmlBaseSink#rawText() does not.
So rawText() is used to make sure that text containing entities is not escaped a second time.
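
A rough sketch of that difference, using a hypothetical stand-in class (SketchXhtmlSink is not the real XhtmlBaseSink, and the set of escaped characters is an assumption):

```java
// Hypothetical stand-in for an XHTML sink, for illustration only.
class SketchXhtmlSink {
    private final StringBuilder out = new StringBuilder();

    // Escapes markup characters before writing, as text() is described above.
    void text(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
    }

    // Writes the string verbatim, with no escaping.
    void rawText(String s) {
        out.append(s);
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        SketchXhtmlSink escaped = new SketchXhtmlSink();
        escaped.text("&#198;");      // the '&' of the entity gets escaped again
        System.out.println(escaped.result());

        SketchXhtmlSink raw = new SketchXhtmlSink();
        raw.rawText("&#198;");       // the entity is written untouched
        System.out.println(raw.result());
    }
}
```

With text(), the entity's leading '&' is escaped a second time, which is what rawText() avoids.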
> I think they should be emitted as text:
>
> First, custom entities can be used to simply define some replacement text
> inside documents (eg <!ENTITY version "1.0">).
>
> Second, the resulting events should be consumable by all sinks, not just
> x(ht)ml based ones. Consider for instance the text "&amp;&AElig;" (where
> AElig is defined as <!ENTITY AElig "&#198;">). Currently it is emitted by
> the XhtmlBaseParser as one text event "&" and one rawText event "&#198;".
> This means that eg the Latex Sink will produce wrong output (the AElig
> should be converted to "\AE" in latex).
>
> IMO the resolved entity should be emitted in a format-independent way, eg as
> one (unicode?) character, just like & is emitted as one character above.
> The consuming sink then has to transform that into a format-specific
> representation.
That could be an alternative implementation:
XhtmlBaseParser#handleEntity() could unescape the XML and call only sink.text().
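
A minimal sketch of what such an unescape step might look like, assuming only the predefined entities and numeric character references need resolving (EntityUnescapeSketch and its methods are hypothetical names, not existing Doxia code):

```java
// Hypothetical helper, for illustration only; not part of Doxia.
class EntityUnescapeSketch {

    // Replaces predefined entities and numeric character references
    // with the characters they stand for; anything else is left as-is.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int end = (c == '&') ? s.indexOf(';', i + 1) : -1;
            String resolved = (end > i + 1) ? resolve(s.substring(i + 1, end)) : null;
            if (resolved != null) {
                out.append(resolved);
                i = end + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Returns the replacement for an entity name, or null if it is unknown.
    static String resolve(String name) {
        switch (name) {
            case "amp": return "&";
            case "lt": return "<";
            case "gt": return ">";
            case "quot": return "\"";
            case "apos": return "'";
            default: break;
        }
        if (name.length() > 1 && name.charAt(0) == '#') {
            try {
                boolean hex = name.length() > 2 && name.charAt(1) == 'x';
                int codePoint = hex
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1), 10);
                if (Character.isValidCodePoint(codePoint)) {
                    return new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // Not a valid numeric reference; fall through and leave it alone.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The result is plain unicode text that could go to sink.text().
        System.out.println(unescape("&amp;&#198;"));
    }
}
```

The unescaped string would then go to sink.text(), and each sink would apply its own format-specific escaping.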
Cheers,
Vincent
[1] http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-predefined-ent