Invalid UTF-8 for Arabic

by Yoann Roman-2

I'm using the trunk fribidi2 code, compiled with VS2003 on Windows XP,
from Python with the Pyfribidi extension, also compiled on VS 2003.

When giving log2vis this Arabic Unicode string:

u'\u0647\u0644 \u062a\u062a\u0646\u0627\u0648\u0644 \u0627\u0644\u0641
\u0637\u0648\u0631 \u0643\u0644 \u064a\u0648\u0645\u061f \u0644\u0645
\u0627\u0630\u0627 \u0623\u0648 \u0644\u0645\u0627\u0630\u0627 \u0644
\u0627\u061f'

I get back a byte string that is not valid UTF-8, as shown below:

'\xd8\x9f\xef\xbb\xbb\xef\xbb\xbf \xef\xba\x8d\xef\xba\xab\xef\xba\x8e
\xef\xbb\xa4\xef\xbb\x9f \xef\xbb\xad\xef\xba\x83 \xef\xba\x8d\xef\xba
\xab\xef\xba\x8e\xef\xbb\xa4\xef\xbb\x9f \xd8\x9f\xef\xbb\xa1\xef\xbb
\xae\xef\xbb\xb3 \xef\xbb\x9e\xef\xbb\x9b \xef\xba\xad\xef\xbb\xae\xef
\xbb\x84\xef\xbb'

The last \xef\xbb should have one more byte.
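
For anyone who wants to confirm where the byte string stops being valid,
here is a minimal Python 3 sketch. It relies only on the standard UTF-8
decoder; the helper name utf8_error is made up for illustration, and the
tail bytes are copied from the end of the output quoted above.

```python
def utf8_error(data: bytes):
    """Return None if data is valid UTF-8, else (offset, reason)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        return exc.start, exc.reason
    return None


# Last 11 bytes of the output above: three complete 3-byte sequences
# (U+FEAD, U+FEEE, U+FEC4) followed by a sequence cut off after 2 bytes.
tail = b"\xef\xba\xad\xef\xbb\xae\xef\xbb\x84\xef\xbb"

result = utf8_error(tail)
print(result)
```

The reported offset points at the final \xef, which is the lead byte of a
3-byte sequence with only one continuation byte after it.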

The Pyfribidi extension works fine with the 0.10.9 code (minus the
Arabic joining), and its only update for the 0.19.1 code is to use
FriBidiParType instead of FriBidiCharType.

Any ideas?

--
Yoann Roman

_______________________________________________
fribidi mailing list
fribidi@...
http://lists.freedesktop.org/mailman/listinfo/fribidi

Re: Invalid UTF-8 for Arabic

by Behdad Esfahbod-3

Yoann Roman wrote:
> I'm using the trunk fribidi2 code, compiled with VS2003 on Windows XP,
> from Python with the Pyfribidi extension, also compiled on VS 2003.

The first step to debug this is to make sure PyFriBidi is not the culprit.
That is, can you reproduce the bug using C?  If yes, please send the code here.

Cheers,
behdad


Re: Invalid UTF-8 for Arabic

by Yoann Roman-3

Behdad Esfahbod wrote:
>> I'm using the trunk fribidi2 code, compiled with VS2003 on Windows
>> XP, from Python with the Pyfribidi extension, also compiled on VS
>> 2003.
>
> The first step to debug this is to make sure PyFriBidi is not the
> culprit. That is, can you reproduce the bug using C?  If yes, please
> send the code here.

I'm no C expert, so I took a slightly different approach to pull
Pyfribidi out of the equation. I compiled fribidi.exe and used a Hex
editor to check its output. Looks like the lost final byte may be a
Pyfribidi problem. This test did bring up another bug, though.

Attached is a zip with:

  - arabic.input: the Arabic string straight out of Python. This will
    show up correctly in anything with bidi support (e.g., Notepad on
    Windows XP with Arabic support installed). There is no BOM.
 
  - arabic.output: output from running bin\fribidi.exe --nopad
    arabic.input. No Python involved here.
 
  - arabic-correct.png: a correct Word visual representation

  - arabic-incorrect.png: what I get using arabic.output

If you open arabic.output in a hex editor, you'll see that bytes 5
through 7 (counting from zero) contain the UTF-8 BOM sequence, EF BB BF.
It looks like no characters are missing, though.
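
A quick way to see the same thing without a hex editor is to decode the
first few bytes and name the code points. This is only a sketch in
Python 3; the eight bytes are the start of the output quoted earlier in
the thread, and the variable names are mine.

```python
import unicodedata

# First 8 bytes of the fribidi output discussed in this thread.
head = bytes.fromhex("d8 9f ef bb bb ef bb bf")

chars = head.decode("utf-8")
for ch in chars:
    # Print each code point with its Unicode character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")

# Zero-based offset of the UTF-8 encoding of U+FEFF (the BOM bytes).
bom_offset = head.find(b"\xef\xbb\xbf")
print(bom_offset)
```

The three code points are U+061F, U+FEFB and U+FEFF, so the BOM bytes sit
right next to a LAM WITH ALEF ligature rather than at the start of the
file.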

Is this enough info to track this new issue down?

Thanks,

--
Yoann Roman



Attachment: Arabic.zip (14K)

Re: Invalid UTF-8 for Arabic

by Behnam ZWNJ Esfahbod

Dear Yoann,

I haven't used fribidi on Windows, but I just ran the fribidi
executable on Linux on your input.  The result differs a lot:

$ hexdump -C arabic.output-unix
00000000  d8 9f d8 a7 d9 84 20 d8  a7 d8 b0 d8 a7 d9 85 d9  |...... .........|
00000010  84 20 d9 88 d8 a3 20 d8  a7 d8 b0 d8 a7 d9 85 d9  |. .... .........|
00000020  84 20 d8 9f d9 85 d9 88  d9 8a 20 d9 84 d9 83 20  |. ........ .... |
00000030  d8 b1 d9 88 d8 b7 d9 81  d9 84 d8 a7 20 d9 84 d9  |............ ...|
00000040  88 d8 a7 d9 86 d8 aa d8  aa 20 d9 84 d9 87        |......... ....|
0000004e

compared to your output:

$ hexdump -C arabic.output
00000000  d8 9f ef bb bb ef bb bf  20 ef ba 8d ef ba ab ef  |........ .......|
00000010  ba 8e ef bb a4 ef bb 9f  20 ef bb ad ef ba 83 20  |........ ...... |
00000020  ef ba 8d ef ba ab ef ba  8e ef bb a4 ef bb 9f 20  |............... |
00000030  d8 9f ef bb a1 ef bb ae  ef bb b3 20 ef bb 9e ef  |........... ....|
00000040  bb 9b 20 ef ba ad ef bb  ae ef bb 84 ef bb 94 ef  |.. .............|
00000050  bb 9f ef ba 8d 20 ef bb  9d ef bb ad ef ba 8e ef  |..... ..........|
00000060  bb a8 ef ba 98 ef ba 97  20 ef bb 9e ef bb ab     |........ ......|
0000006f

I'm almost sure the Unix output contains no invalid UTF-8 sequences,
but the Windows output looks quite wrong.

Just for reference, the input for both outputs was this:

$ hexdump -C arabic.input
00000000  d9 87 d9 84 20 d8 aa d8  aa d9 86 d8 a7 d9 88 d9  |.... ...........|
00000010  84 20 d8 a7 d9 84 d9 81  d8 b7 d9 88 d8 b1 20 d9  |. ............ .|
00000020  83 d9 84 20 d9 8a d9 88  d9 85 d8 9f 20 d9 84 d9  |... ........ ...|
00000030  85 d8 a7 d8 b0 d8 a7 20  d8 a3 d9 88 20 d9 84 d9  |....... .... ...|
00000040  85 d8 a7 d8 b0 d8 a7 20  d9 84 d8 a7 d8 9f        |....... ......|
0000004e
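
As a side check on the dumps above: the Unix output appears to be nothing
more than the input's code points in reverse order, with no shaping
applied. The Python 3 sketch below compares the two. The logical string
is the one from the first message, the expected bytes are typed in from
the arabic.output-unix dump, and plain reversal is only a stand-in for
the bidi algorithm that happens to work for a line that is entirely
right-to-left.

```python
# Logical-order string from the first message in this thread.
logical = (
    "\u0647\u0644 \u062a\u062a\u0646\u0627\u0648\u0644 "
    "\u0627\u0644\u0641\u0637\u0648\u0631 \u0643\u0644 "
    "\u064a\u0648\u0645\u061f \u0644\u0645\u0627\u0630\u0627 "
    "\u0623\u0648 \u0644\u0645\u0627\u0630\u0627 \u0644\u0627\u061f"
)

# Bytes of arabic.output-unix, copied from the hexdump above.
unix_output = bytes.fromhex(" ".join([
    "d8 9f d8 a7 d9 84 20 d8 a7 d8 b0 d8 a7 d9 85 d9",
    "84 20 d9 88 d8 a3 20 d8 a7 d8 b0 d8 a7 d9 85 d9",
    "84 20 d8 9f d9 85 d9 88 d9 8a 20 d9 84 d9 83 20",
    "d8 b1 d9 88 d8 b7 d9 81 d9 84 d8 a7 20 d9 84 d9",
    "88 d8 a7 d9 86 d8 aa d8 aa 20 d9 84 d9 87",
]))

# Assumption: for an all-RTL line, visual order is the logical order reversed.
reversed_bytes = logical[::-1].encode("utf-8")
same = reversed_bytes == unix_output
print(len(reversed_bytes), same)
```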

Hope it helps.

-Behnam ZWNJ


--
    '     بهنام اسفهبد
    '     Behnam Esfahbod
   '
  *  ..   http://behnam.esfahbod.info
 *  `  *
  * o *   http://zwnj.org

Re: Invalid UTF-8 for Arabic

by Yoann Roman-2

> I haven't used fribidi on Windows, but I just ran the fribidi
> executable on Linux on your input.  The result differs a lot:
>
> $ hexdump -C arabic.output-unix
> 00000000  d8 9f d8 a7 d9 84 20 d8  a7 d8 b0 d8 a7 d9 85 d9
> 00000010  84 20 d9 88 d8 a3 20 d8  a7 d8 b0 d8 a7 d9 85 d9
> 00000020  84 20 d8 9f d9 85 d9 88  d9 8a 20 d9 84 d9 83 20
> 00000030  d8 b1 d9 88 d8 b7 d9 81  d9 84 d8 a7 20 d9 84 d9
> 00000040  88 d8 a7 d9 86 d8 aa d8  aa 20 d9 84 d9 87
> 0000004e
>
> compared to your output:
>
> $ hexdump -C arabic.output
> 00000000  d8 9f ef bb bb ef bb bf  20 ef ba 8d ef ba ab ef
> 00000010  ba 8e ef bb a4 ef bb 9f  20 ef bb ad ef ba 83 20
> 00000020  ef ba 8d ef ba ab ef ba  8e ef bb a4 ef bb 9f 20
> 00000030  d8 9f ef bb a1 ef bb ae  ef bb b3 20 ef bb 9e ef
> 00000040  bb 9b 20 ef ba ad ef bb  ae ef bb 84 ef bb 94 ef
> 00000050  bb 9f ef ba 8d 20 ef bb  9d ef bb ad ef ba 8e ef
> 00000060  bb a8 ef ba 98 ef ba 97  20 ef bb 9e ef bb ab
> 0000006f
>
> I'm almost sure the Unix output contains no invalid UTF-8 sequences,
> but the Windows output looks quite wrong.
>
> [snip]

Behdad,

Thanks for the response.

Your output is what I get from fribidi 0.10.9, but that doesn't do
Arabic joining. Other than that BOM mark, the Windows output from
0.19.1 is correct and matches what other non-Fribidi-based, bidi
programs do.

--
Yoann Roman


Re: Invalid UTF-8 for Arabic

by Shachar Shemesh-12

Yoann Roman wrote:

>
> Behdad,
>
> Thanks for the response.
>
> Your output is what I get from fribidi 0.10.9, but that doesn't do
> Arabic joining. Other than that BOM mark, the Windows output from
> 0.19.1 is correct and matches what other non-Fribidi-based, bidi
> programs do.
>
>  
I should just point out, as an aside, that Windows joining is font-aware.
If a certain letter needs a left/right join, but the font does not have
the right glyph, then Windows will use (I don't remember the details)
the right-only or left-only joined glyph.

Not really relevant to the discussion, but logical-only joining is not
foolproof.

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com


Re: Invalid UTF-8 for Arabic

by Behnam ZWNJ Esfahbod

Hi Yoann,

> Behdad, Thanks for the response.

No problem at all, but please note that I'm behNaM, behDaD's brother.


> Your output is what I get from fribidi 0.10.9, but that doesn't do
> Arabic joining. Other than that BOM mark, the Windows output from
> 0.19.1 is correct and matches what other non-Fribidi-based, bidi
> programs do.

I didn't notice that you are getting shaping on the string too.  So the
Windows output is correct, except for the BOM.

It seems the BOM appears in place of the first character of the
LAM+ALEF ligature.  This might be a bug in the ligature replacement of
the shaping function.

Yep, it's in the CVS HEAD, lib/fribidi-arabic.c
fribidi_shape_arabic_ligature(), which replaces the first char of the
ligature with FRIBIDI_CHAR_FILL, which is ZWNBSP/BOM.

Behdad, why don't you use U+FFFF for this purpose?  And why don't
fribidi_shape_arabic() or fribidi_shape_arabic_ligature() clean up
these CHAR_FILLs?

Behdad, I can fix the problem in CVS if you tell me what's the best
way to fix this.


-Behnam


--
    '     بهنام اسفهبد
    '     Behnam Esfahbod
   '
  *  ..   http://behnam.esfahbod.info
 *  `  *
  * o *   http://zwnj.org

Re: Invalid UTF-8 for Arabic

by Yoann Roman-2

>> Behdad, Thanks for the response.
>
> No problem at all, but please note that I'm behNaM, behDaD's brother.

Sorry for the name mishap. I was looking at Behdad's message while
replying to yours.

>> Your output is what I get from fribidi 0.10.9, but that doesn't do
>> Arabic joining. Other than that BOM mark, the Windows output from
>> 0.19.1 is correct and matches what other non-Fribidi-based, bidi
>> programs do.
>
> I didn't notice that you are getting shaping on the string too.  So the
> Windows output is correct, except for the BOM.
>
> It seems the BOM appears in place of the first character of the
> LAM+ALEF ligature.  This might be a bug in the ligature replacement of
> the shaping function.
>
> Yep, it's in the CVS HEAD, lib/fribidi-arabic.c
> fribidi_shape_arabic_ligature(), which replaces the first char of the
> ligature with FRIBIDI_CHAR_FILL, which is ZWNBSP/BOM.
>
> Behdad, why don't you use U+FFFF for this purpose?  And why don't
> fribidi_shape_arabic() or fribidi_shape_arabic_ligature() clean up
> these CHAR_FILLs?
>
> Behdad, I can fix the problem in CVS if you tell me what's the best
> way to fix this.

Glad to know it's not a problem just on my end. Thanks for the research.

--
Yoann Roman


Re: Invalid UTF-8 for Arabic

by Yoann Roman-2

>> Thanks for the response.
>>
>> Your output is what I get from fribidi 0.10.9, but that doesn't do
>> Arabic joining. Other than that BOM mark, the Windows output from
>> 0.19.1 is correct and matches what other non-Fribidi-based, bidi
>> programs do.
>>
>>  
> I should just point out, as an aside, that Windows joining is font-aware.
> If a certain letter needs a left/right join, but the font does not have
> the right glyph, then Windows will use (I don't remember the details)
> the right-only or left-only joined glyph.
>
> Not really relevant to the discussion, but logical-only joining is not
> foolproof.

Thanks for the heads up. In this case, I control the font being used as
well, so I'll just make sure it has all the joining glyphs.

Out of curiosity, how is this handled on other platforms?

--
Yoann Roman


Re: Invalid UTF-8 for Arabic

by Behdad Esfahbod-3

On 03/10/2009 06:08 AM, Behnam Esfahbod ZWNJ wrote:

> It seems the BOM appears in place of the first character of the
> LAM+ALEF ligature.  This might be a bug in the ligature replacement of
> the shaping function.
>
> Yep, it's in the CVS HEAD, lib/fribidi-arabic.c
> fribidi_shape_arabic_ligature(), which replaces the first char of
> ligature with FRIBIDI_CHAR_FILL, which is ZWNBSP/BOM.

That's expected.  It will be removed by fribidi_remove_bidi_marks().

> Behdad, why don't you use U+FFFF for this purpose?

Because U+FFFF is not a valid character.  U+FEFF is harmless at that
position: it has the BN bidi category and is one of the characters that
any rendering pipeline removes anyway.

> And why don't
> fribidi_shape_arabic() or fribidi_shape_arabic_ligature() clean up
> these CHAR_FILLs?

The shaping functions do not remove any characters.  If they did it would be
hard to keep the mapping between visual and logical strings.  The idea is that
we add filler chars when needing to remove any characters, and
fribidi_remove_bidi_marks() or similar functions will remove the fillers later.
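
To make the filler idea concrete, here is a toy sketch in Python 3. It is
not the fribidi code: shape_lam_alef and remove_fillers are hypothetical
helpers, the ligature is always the isolated form U+FEFB, and reversing
the string stands in for real reordering. The only point is that shaping
keeps the string length (so positions still line up with the logical
text) and that the fillers are dropped in a separate, later step.

```python
FILL = "\ufeff"  # the thread says FRIBIDI_CHAR_FILL is ZWNBSP/BOM
LAM, ALEF, LAM_ALEF = "\u0644", "\u0627", "\ufefb"


def shape_lam_alef(logical: str) -> str:
    """Replace each LAM+ALEF pair with FILL + ligature; length is unchanged."""
    out = list(logical)
    for i in range(len(out) - 1):
        if out[i] == LAM and out[i + 1] == ALEF:
            out[i], out[i + 1] = FILL, LAM_ALEF
    return "".join(out)


def remove_fillers(text: str) -> str:
    """Drop the filler characters once positions no longer matter."""
    return text.replace(FILL, "")


logical = "\u0644\u0627\u061f"      # LAM, ALEF, ARABIC QUESTION MARK
shaped = shape_lam_alef(logical)    # same length as the logical string
visual = shaped[::-1]               # toy reordering for an all-RTL string
clean = remove_fillers(visual)

print(visual.encode("utf-8").hex(" "))
print(clean.encode("utf-8").hex(" "))
```

With this input the uncleaned visual string encodes to the same eight
bytes that open the Windows output earlier in the thread, and the cleaned
one no longer contains EF BB BF.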

> Behdad, I can fix the problem in CVS if you tell me what's the best
> way to fix this.

No fix needed AFAI'm concerned :).

behdad


Re: Invalid UTF-8 for Arabic

by Yoann Roman-2

>> It seems the BOM appears in place of the first character of the
>> LAM+ALEF ligature.  This might be a bug in the ligature replacement
>> of the shaping function.
>>
>> Yep, it's in the CVS HEAD, lib/fribidi-arabic.c
>> fribidi_shape_arabic_ligature(), which replaces the first char of
>> ligature with FRIBIDI_CHAR_FILL, which is ZWNBSP/BOM.
>
> That's expected.  It will be removed by fribidi_remove_bidi_marks().
>
>> Behdad, why don't you use U+FFFF for this purpose?
>
> Because U+FFFF is not a valid character.  U+FEFF is harmless at that
> position. And has a BN bidi category and is one of the characters
> that any rendering pipeline removes anyway.
>
>> And why don't
>> fribidi_shape_arabic() or fribidi_shape_arabic_ligature() clean up
>> these CHAR_FILLs?
>
> The shaping functions do not remove any characters.  If they did it
> would be hard to keep the mapping between visual and logical strings.
> The idea is that we add filler chars when needing to remove any
> characters, and fribidi_remove_bidi_marks() or similar functions will
> remove the fillers later.
>
>> Behdad, I can fix the problem in CVS if you tell me what's the best
>> way to fix this.
>
> No fix needed AFAI'm concerned :).

So, practically, I need to call fribidi_remove_bidi_marks after calling
log2vis? I see that's done by fribidi.exe if I pass in --clean.

Thanks,

--
Yoann Roman


Re: Invalid UTF-8 for Arabic

by Behdad Esfahbod-3

On 03/11/2009 10:41 AM, Yoann Roman wrote:

>>> It seems the BOM appears in place of the first character of the
>>> LAM+ALEF ligature.  This might be a bug in the ligature replacement
>>> of the shaping function.
>>>
>>> Yep, it's in the CVS HEAD, lib/fribidi-arabic.c
>>> fribidi_shape_arabic_ligature(), which replaces the first char of
>>> ligature with FRIBIDI_CHAR_FILL, which is ZWNBSP/BOM.
>> That's expected.  It will be removed by fribidi_remove_bidi_marks().
>>
>>> Behdad, why don't you use U+FFFF for this purpose?
>> Because U+FFFF is not a valid character.  U+FEFF is harmless at that
>> position. And has a BN bidi category and is one of the characters
>> that any rendering pipeline removes anyway.
>>
>>> And why don't
>>> fribidi_shape_arabic() or fribidi_shape_arabic_ligature() clean up
>>> these CHAR_FILLs?
>> The shaping functions do not remove any characters.  If they did it
>> would be hard to keep the mapping between visual and logical strings.
>> The idea is that we add filler chars when needing to remove any
>> characters, and fribidi_remove_bidi_marks() or similar functions will
>> remove the fillers later.
>>
>>> Behdad, I can fix the problem in CVS if you tell me what's the best
>>> way to fix this.
>> No fix needed AFAI'm concerned :).
>
> So, practically, I need to call fribidi_remove_bidi_marks after calling
> log2vis? I see that's done by fribidi.exe if I pass in --clean.

Yes.

behdad


Re: Invalid UTF-8 for Arabic

by Behnam ZWNJ Esfahbod

Got it.  Thanks for the explanation. :)


--
    '     بهنام اسفهبد
    '     Behnam Esfahbod
   '
  *  ..   http://behnam.esfahbod.info
 *  `  *
  * o *   http://zwnj.org