Hello everyone,
I'm building an input sanitizer around Charles Reitzel's Tidy.NET
bindings; I'm using it to take potentially-malformed XHTML +
proprietary-namespaced tag soup and produce some kind of valid XML from
it. It's working OK so far but I'm a bit surprised by one of my
testcases - the output is valid XML but it's not what I was expecting it
to be.
From the default options, I explicitly turn on:
input-xml
output-xml
force-output
and give it the input string:
<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, <i<i>world!</b></body></html>
Note the '<i<i>' construct before 'world!'. I was expecting the output:
<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, &lt;i<i>world!</i></b></body></html>
whereby the first < in the <i<i> is encoded as an entity. Instead, what
I'm getting is:
<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, <i i="">world!</i></b></body></html>
In other words, the <i<i> is becoming <i i="">. How come?
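
To make the difference concrete, here is a rough check of the two strings using Python's standard-library XML parser (not Tidy itself, so it says nothing about why Tidy picks one over the other). The describe helper and the variable names are just mine for illustration. It only shows that both outputs are well-formed XML and where the stray "<i" ends up in each:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The output I expected: the stray '<' escaped as an entity.
expected = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, '
    '&lt;i<i>world!</i></b></body></html>'
)

# The output I actually get back from Tidy.
actual = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><b>Hello, '
    '<i i="">world!</i></b></body></html>'
)


def describe(xml_text):
    """Return (text inside <b> before the <i> child, attributes of <i>).

    Hypothetical helper for this post only; ET.fromstring raises
    ParseError if the string is not well-formed XML.
    """
    root = ET.fromstring(xml_text)
    b = root.find(f".//{{{XHTML_NS}}}b")
    i = b.find(f"{{{XHTML_NS}}}i")
    return b.text, dict(i.attrib)


expected_view = describe(expected)
actual_view = describe(actual)

print("expected:", expected_view)
print("actual:  ", actual_view)
```

In the expected form the "<i" survives as literal text inside the <b>; in the actual form it is gone from the text and shows up as an empty attribute named i on the <i> element.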
(I have a suspicion that the TidyATL/Tidy.NET packages on Charles' page
are a bit out of date - they're certainly missing some of the more
recent options - so if this is a bug that's already been fixed or
something, I apologise for wasting your time...)
Thanks in advance,
- Richard