Problem using international characters

by NickRob

I have a mysql database and use a php application that captures, stores,
retrieves and displays data correctly - including French language words
with accents. It has been running for around five years. I've recently
written an extension that creates an openoffice writer document using
this data. Everything works apart from these wretched French
characters!!! If I unzip the odt package and examine content.xml, then
the characters are wrong - but simply cutting and pasting correct ones
in gives me a working document, so the error is definitely in the way I
am creating the content using php.

An example of the problem is Côte. As I've just typed it, the o has a
circumflex accent or 'hat' on it. Within the odt file, the o-circumflex
is shown as Ã´. Piping this to od -c gives 303 203 302 264. If I take
the o-circumflex character from gnome charmap and od -c this, then I get
303 264. If I copy the character from my php/web app then it is correct.
Where are these two middle bytes coming from? I've tried various
combinations of mbstring functions and ini file settings but without
joy.

Thanks for any help you can give me.


--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


Re: Problem using international characters

by Nisse Engström

On Mon, 29 Jun 2009 06:52:23 +0100, Nicholas Robinson wrote:

> I have a mysql database and use a php application that captures, stores,
> retrieves and displays data correctly - including French language words
> with accents. It has been running for around five years. I've recently
> written an extension that creates an openoffice writer document using
> this data. Everything works apart from these wretched French
> characters!!! If I unzip the odt package and examine content.xml, then
> the characters are wrong - but simply cutting and pasting correct ones
> in gives me a working document, so the error is definitely in the way I
> am creating the content using php.
>
> An example of the problem is Côte. As I've just typed it, the o has a
> circumflex accent or 'hat' on it. Within the odt file, the o-circumflex
> is shown as Ã´. Piping this to od -c gives 303 203 302 264. If I take
> the o-circumflex character from gnome charmap and od -c this, then I get
> 303 264. If I copy the character from my php/web app then it is correct.
> Where are these two middle bytes coming from? I've tried various
> combinations of mbstring functions and ini file settings but without
> joy.

Hexadecimal is easier on my eyes, so:

  303 203 302 264  ==  c3 83 c2 b4
  303 264          ==  c3 b4
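
The octal-to-hex step above can be checked mechanically. This is a
minimal Python sketch, not part of the original exchange; the helper
name octal_to_hex is made up for illustration.

```python
def octal_to_hex(octal_dump: str) -> str:
    """Convert space-separated octal byte values (as printed by od -c)
    into space-separated two-digit hexadecimal values."""
    return " ".join(f"{int(token, 8):02x}" for token in octal_dump.split())


# The two byte sequences quoted in the question.
broken = octal_to_hex("303 203 302 264")
correct = octal_to_hex("303 264")

print(broken)   # c3 83 c2 b4
print(correct)  # c3 b4
```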

These are UTF-8 encodings:

  <c3 83><c2 b4>  == U+00C3 (LATIN CAPITAL LETTER A WITH TILDE),
                     U+00B4 (ACUTE ACCENT)
  <c3 b4>         == U+00F4 (LATIN SMALL LETTER O WITH CIRCUMFLEX)
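
To confirm which characters those byte sequences stand for, one can
decode them as UTF-8 and look up the Unicode names. A small Python
sketch follows; the function name describe is hypothetical and only
serves this illustration.

```python
import unicodedata


def describe(hex_bytes: str) -> list:
    """Decode a hex byte string as UTF-8 and list each code point
    together with its official Unicode character name."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [f"U+{ord(ch):04X} ({unicodedata.name(ch)})" for ch in text]


for sample in ("c3 83 c2 b4", "c3 b4"):
    print(sample, "->", describe(sample))
```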


In other words, somewhere in the process, a perfectly fine
UTF-8 encoded character:

  <c3 b4> (U+00F4)

has been misread as two ISO-8859-1 (or similar) characters and
converted to UTF-8 a second time, resulting in:

  <c3 83><c2 b4> (U+00C3, U+00B4)
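
The double encoding can be reproduced, and undone, in a few lines.
The Python sketch below only demonstrates the byte arithmetic; it
makes no claim about where in the PHP code the extra conversion
happens. (PHP's utf8_encode() performs an ISO-8859-1 to UTF-8
conversion of this kind, but nothing in the thread says it is the
call involved.)

```python
# A correctly UTF-8 encoded o-circumflex (U+00F4): bytes c3 b4.
good = "\u00f4".encode("utf-8")

# The bug: the two UTF-8 bytes are read as two ISO-8859-1 characters
# (U+00C3 and U+00B4) and then encoded to UTF-8 again.
double = good.decode("iso-8859-1").encode("utf-8")

# The reverse: decode as UTF-8, then map each character back to a
# single ISO-8859-1 byte, which recovers the original UTF-8 bytes.
repaired = double.decode("utf-8").encode("iso-8859-1")

print(good.hex(" "))      # c3 b4
print(double.hex(" "))    # c3 83 c2 b4
print(repaired.hex(" "))  # c3 b4
```

Repairing stored data this way is only safe when every affected
string is known to be double encoded; the better fix is to find and
remove the extra conversion.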


Perhaps this gives you some idea of what's going wrong.


/Nisse
