entities: text or rawText?

Vincent,

I'm trying to understand some of the issues we have with entities in the XmlParser. Is there a special reason why entities are emitted as rawText and not text?

I think they should be emitted as text:

First, custom entities can be used to simply define some replacement text inside documents (e.g. <!ENTITY version "1.0">).

Second, the resulting events should be consumable by all sinks, not just x(ht)ml based ones. Consider for instance the text "&amp;&AElig;" (where AElig is defined as <!ENTITY AElig "&#198;">). Currently it is emitted by the XhtmlBaseParser as one text event "&" and one rawText event "&#198;". This means that e.g. the Latex Sink will produce wrong output (the AElig should be converted to "\AE" in latex).

IMO the resolved entity should be emitted in a format-independent way, e.g. as one (unicode?) character, just like &amp; is emitted as one character above. The consuming sink then has to transform that into a format-specific representation.

WDYT?

-Lukas
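
To illustrate the proposal above, here is a minimal sketch of two sinks consuming the same format-independent text event and each applying its own escaping. The `TextSink` interface and both sink classes are hypothetical stand-ins written for this example; they are not the Doxia Sink API, and the LaTeX mapping covers only the two characters from the example.

```java
public class SinkEscapingSketch {

    /** Hypothetical stand-in for a sink's text handling; not the Doxia API. */
    public interface TextSink {
        String text(String s);
    }

    /** Escapes XML markup characters and leaves other characters as they are. */
    public static class XhtmlLikeSink implements TextSink {
        @Override
        public String text(String s) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                switch (c) {
                    case '&': out.append("&amp;"); break;
                    case '<': out.append("&lt;"); break;
                    case '>': out.append("&gt;"); break;
                    default:  out.append(c);
                }
            }
            return out.toString();
        }
    }

    /** Maps characters to LaTeX commands; only '&' and U+00C6 are handled here. */
    public static class LatexLikeSink implements TextSink {
        @Override
        public String text(String s) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                switch (c) {
                    case '&':      out.append("\\&"); break;
                    // "{}" ends the command so following letters are not absorbed
                    case '\u00C6': out.append("\\AE{}"); break;
                    default:       out.append(c);
                }
            }
            return out.toString();
        }
    }
}
```

With the parser emitting the two-character string "&" followed by U+00C6 as plain text, the XHTML-like sink would write `&amp;` followed by the character itself, while the LaTeX-like sink would write `\&\AE{}`. Neither sink needs to know how the character was spelled in the source document.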
Re: entities: text or rawText?

Hi Lukas,

2009/5/4 Lukas Theussl <ltheussl@...>:
>
> Vincent,
>
> I'm trying to understand some of the issues we have with entities in the
> XmlParser. Is there a special reason why entities are emitted as rawText and
> not text?

The text used by XhtmlBaseParser#handleEntity() could contain predefined entities [1] and numeric code entities (i.e. &#198; will become Æ by XmlPullParser).
XhtmlBaseSink#text() escapes chars and XhtmlBaseSink#rawText() does not.

So rawText() is used to be sure that text containing entities is not escaped.

> I think they should be emitted as text:
>
> First, custom entities can be used to simply define some replacement text
> inside documents (eg <!ENTITY version "1.0">).
>
> Second, the resulting events should be consumable by all sinks, not just
> x(ht)ml based ones. Consider for instance the text "&amp;&AElig;" (where
> AElig is defined as <!ENTITY AElig "&#198;">). Currently it is emitted by
> the XhtmlBaseParser as one text event "&" and one rawText event "&#198;".
> This means that eg the Latex Sink will produce wrong output (the AElig
> should be converted to "\AE" in latex).
>
> IMO the resolved entity should be emitted in a format-independent way, eg as
> one (unicode?) character, just like &amp; is emitted as one character above.
> The consuming sink then has to transform that into a format-specific
> representation.

That could be another implementation: XhtmlBaseParser#handleEntity() could unescape the XML and call only sink.text().

Cheers,

Vincent

[1] http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-predefined-ent
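
The alternative mentioned at the end of this message (unescape first, then call only `sink.text()`) could look roughly like the sketch below. It is an illustration under assumptions, not the Doxia implementation: the class and method names are hypothetical, and it resolves only the five predefined XML entities and numeric character references, leaving anything else untouched.

```java
public class EntityUnescapeSketch {

    /** Replaces predefined entities and numeric character references in raw. */
    public static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            // Look for a closing ';' only when we are at an '&'
            int semi = (c == '&') ? raw.indexOf(';', i) : -1;
            if (semi > i + 1) {
                String resolved = resolve(raw.substring(i + 1, semi));
                if (resolved != null) {
                    out.append(resolved);
                    i = semi + 1;
                    continue;
                }
            }
            // Not a recognized reference: copy the character unchanged
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** Returns the replacement for an entity name, or null if unrecognized. */
    static String resolve(String name) {
        switch (name) {
            case "amp":  return "&";
            case "lt":   return "<";
            case "gt":   return ">";
            case "quot": return "\"";
            case "apos": return "'";
            default:     break;
        }
        if (name.length() > 1 && name.charAt(0) == '#') {
            try {
                boolean hex = name.charAt(1) == 'x';
                int codePoint = hex
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1), 10);
                if (Character.isValidCodePoint(codePoint)) {
                    return new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // Malformed number: fall through and treat as unrecognized
            }
        }
        return null;
    }
}
```

The resulting string would then be passed to the sink's text method, so that each sink applies its own escaping exactly once. Custom entities declared in a DTD are outside the scope of this sketch.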
Re: entities: text or rawText?

For reference: the XhtmlBaseParser in Doxia 1.1.1 emits entities as text, except when they are not recognized (i.e. haven't been declared), in which case they are emitted as unknown events.

-Lukas