XML, control characters and MHonArc

by Chris Hastie

I've recently been looking at revamping an archive and having MHonArc
output XML which is then pulled into a PHP based application using
XML_Unserialize.

Mostly this is working fine, but I have the occasional problem with
control characters in badly formatted emails. Specifically, a QP email
with the string =12 - MHonArc outputs the associated control character
to the XML. These characters are not valid in XML and the XML parser
chokes on them.
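
For illustration, the failure can be reproduced outside MHonArc. The
following is a minimal Python sketch, not the PHP/XML_Unserialize setup
described in this post: the sample body and the <body> element name are
made up, and Python's standard quopri and ElementTree modules stand in
for the real decoder and parser.

```python
import quopri
import xml.etree.ElementTree as ET

# Hypothetical sample body: "=12" is the QP escape for byte 0x12.
qp_body = b"see=12 below"
decoded = quopri.decodestring(qp_body)

# Wrap the decoded text in a made-up <body> element, as an archive might.
xml_doc = "<body>" + decoded.decode("latin-1") + "</body>"

try:
    ET.fromstring(xml_doc)
    parsed = True
except ET.ParseError:
    # U+0012 is outside the XML 1.0 Char production, so the parser
    # rejects the whole document.
    parsed = False

print(repr(decoded), parsed)
```

The decoded body carries the raw control byte, and the parser refuses
the document rather than skipping the character.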

I see a quick mention of a similar problem back in 2000:
http://www.mhonarc.org/archive/html/mhonarc-users/2000-07/msg00040.html

Have things changed? Is there any way short of writing a custom filter,
or hacking/patching an existing one, that I can persuade MHonArc to
strip out XML illegal control characters?

If not, any hints on where to start hacking?

Thanks

--
Chris Hastie


Re: XML, control characters and MHonArc

by Earl Hood

On October 5, 2007 at 08:45, Chris Hastie wrote:

> Mostly this is working fine, but I have the occasional problem with
> control characters in badly formatted emails. Specifically, a QP email
> with the string =12 - MHonArc outputs the associated control character
> to the XML. These characters are not valid in XML and the XML parser
> chokes on them.

Have you tried out the TEXTENCODE resource to see how the
control characters are handled?  If generating XML, you may
want to use TEXTENCODE to normalize all character data to UTF-8.
See manual for examples.

> I see a quick mention of a similar problem back in 2000:
> http://www.mhonarc.org/archive/html/mhonarc-users/2000-07/msg00040.html
>
> Have things changed? Is there any way short of writing a custom filter,
> or hacking/patching an existing one, that I can persuade MHonArc to
> strip out XML illegal control characters?

Check the minimal API documented in an appendix of the manual.  There
is a callback you can register after a message has been converted.
Your callback can check for invalid characters and remove them.

--ewh

P.S. Please post your resource settings for creating XML.  Others
may be interested and it may be something to include in the docs.


Re: XML, control characters and MHonArc

by Chris Hastie

On Fri, 05 Oct 2007, Earl Hood <earl@...> wrote:

> On October 5, 2007 at 08:45, Chris Hastie wrote:
>
>> Mostly this is working fine, but I have the occasional problem with
>> control characters in badly formatted emails. Specifically, a QP email
>> with the string =12 - MHonArc outputs the associated control character
>> to the XML. These characters are not valid in XML and the XML parser
>> chokes on them.
>
> Have you tried out the TEXTENCODE resource to see how the
> control characters are handled?  If generating XML, you may
> want to use TEXTENCODE to normalize all character data to UTF-8.
> See manual for examples.

I did experiment with TEXTENCODE. It produced some surprising results,
but I may have been getting the wrong end of the stick.

I started by converting everything to UTF-8 and then passing it through
mhonarc::htmlize. My list is UK based, so '£' occurs quite often. This
looked fine in the XML output (viewed with Notepad++), but the final
output failed to display it correctly. I presumed that some issue with
PHP reading UTF-8 was to blame. It was noticeable, however, that
Notepad++ reported the file as being encoded as ANSI.

I then tried taking everything to UTF-8 with TEXTENCODE and passing it through
MHonArc::CharEnt::str2sgml. The result was my '£' got encoded as something
very odd, � IIRC.

I'm sure outputting UTF-8 is the 'correct' way to go, it just seems to
cause me some headaches with later processing.
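
One possible explanation for the '£' symptoms, offered as an assumption
rather than a diagnosis of this setup, is that UTF-8 bytes are being
read back as a single-byte encoding somewhere along the way. The Python
sketch below shows what that mismatch looks like; cp1252 is used only
as an example of a single-byte "ANSI" code page.

```python
pound = "£"

# '£' is U+00A3; in UTF-8 it is the two bytes C2 A3.
utf8_bytes = pound.encode("utf-8")

# Reading those two bytes as cp1252 yields two characters, not one.
misread = utf8_bytes.decode("cp1252")

# A numeric character reference built from the real code point.
entity = "&#%d;" % ord(pound)

print(utf8_bytes, misread, entity)
```

If a later stage treats each byte as its own character, the entity
encoding step sees two characters instead of '£', which would give odd
looking output.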

>> I see a quick mention of a similar problem back in 2000:
>> http://www.mhonarc.org/archive/html/mhonarc-users/2000-07/msg00040.html
>>
>> Have things changed? Is there any way short of writing a custom filter,
>> or hacking/patching an existing one, that I can persuade MHonArc to
>> strip out XML illegal control characters?
>
> Check the minimal API documented in an appendix of the manual.  There
> is a callback you can register after a message has been converted.
> Your callback can check for invalid characters and remove them.
>

Thanks, I'll take a look at that. At the moment I'm stripping non-legal XML
characters in my PHP script before passing the XML to the parser.
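
The PHP code itself is not shown in this thread. As a rough sketch of
the same idea in Python, the function below removes every character
that falls outside the Char production of the XML 1.0 specification.
The name strip_xml_illegal is hypothetical.

```python
import re

# Anything NOT matched by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text: str) -> str:
    """Return text with characters that XML 1.0 forbids removed."""
    return _XML_ILLEGAL.sub("", text)


# Hypothetical sample: 0x12 is dropped; tab, newline and '£' are kept.
cleaned = strip_xml_illegal("cost\x12: £5\tper item\n")
print(repr(cleaned))
```

Stripping before parsing loses the control characters silently, which
is usually acceptable for archive display but worth knowing about.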

> P.S. Please post your resource settings for creating XML.  Others
> may be interested and it may be something to include in the docs.

Will try to tidy them up enough to be useful to someone else in the
next day or two. Where should they be posted - to this list? As
attachments?

Thanks
--
Chris Hastie