Invalid UTF-8 for Arabic

I'm using the trunk fribidi2 code, compiled with VS2003 on Windows XP, from Python with the Pyfribidi extension, also compiled with VS2003.

When I give log2vis this Arabic Unicode string:

u'\u0647\u0644 \u062a\u062a\u0646\u0627\u0648\u0644 \u0627\u0644\u0641
\u0637\u0648\u0631 \u0643\u0644 \u064a\u0648\u0645\u061f \u0644\u0645
\u0627\u0630\u0627 \u0623\u0648 \u0644\u0645\u0627\u0630\u0627 \u0644
\u0627\u061f'

I get back an invalid UTF-8 sequence, as shown below:

'\xd8\x9f\xef\xbb\xbb\xef\xbb\xbf \xef\xba\x8d\xef\xba\xab\xef\xba\x8e
\xef\xbb\xa4\xef\xbb\x9f \xef\xbb\xad\xef\xba\x83 \xef\xba\x8d\xef\xba
\xab\xef\xba\x8e\xef\xbb\xa4\xef\xbb\x9f \xd8\x9f\xef\xbb\xa1\xef\xbb
\xae\xef\xbb\xb3 \xef\xbb\x9e\xef\xbb\x9b \xef\xba\xad\xef\xbb\xae\xef
\xbb\x84\xef\xbb'

The final \xef\xbb should be followed by one more byte.

The Pyfribidi extension works fine with the 0.10.9 code (minus the Arabic joining), and its only update for the 0.19.1 code is to use FriBidiParType instead of FriBidiCharType.

Any ideas?

--
Yoann Roman
_______________________________________________
fribidi mailing list
fribidi@...
http://lists.freedesktop.org/mailman/listinfo/fribidi
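A quick way to confirm that a byte string like the one above is cut off mid-character is to try decoding it and look at where the decoder gives up. This is only a minimal sketch in Python 3 (the thread itself uses Python 2 literals); the helper name `first_invalid_utf8` is made up for illustration, and the extra `\x94` byte is a hypothetical completion used purely to show the contrast.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Tail of the output quoted above: three complete 3-byte sequences,
# followed by a sequence that stops after two bytes.
tail = b"\xef\xba\xad\xef\xbb\xae\xef\xbb\x84\xef\xbb"

truncated_at = first_invalid_utf8(tail)

# Hypothetical: appending any one continuation byte makes the tail decodable.
completed = first_invalid_utf8(tail + b"\x94")
```

Here `truncated_at` points at the lead byte of the incomplete sequence, and `completed` is `None` because nothing is left dangling.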
|
|
Re: Invalid UTF-8 for Arabic

Yoann Roman wrote:
> I'm using the trunk fribidi2 code, compiled with VS2003 on Windows XP,
> from Python with the Pyfribidi extension, also compiled on VS 2003.

The first step to debug this is to make sure PyFriBidi is not the culprit. That is, can you reproduce the bug using C? If yes, please send the code here.

Cheers,
behdad
|
|
Re: Invalid UTF-8 for Arabic

Behdad Esfahbod wrote:
> The first step to debug this is to make sure PyFriBidi is not the
> culprit. That is, can you reproduce the bug using C? If yes, please
> send the code here.

I'm no C expert, so I took a slightly different approach to pull Pyfribidi out of the equation. I compiled fribidi.exe and used a hex editor to check its output. Looks like the lost final byte may be a Pyfribidi problem. This test did bring up another bug, though.

Attached is a zip with:

- arabic.input: the Arabic string straight out of Python. This will show up correctly in anything with bidi support (e.g., Notepad on Windows XP with Arabic support installed). There is no BOM.

- arabic.output: output from running bin\fribidi.exe --nopad arabic.input. No Python involved here.

- arabic-correct.png: a correct Word visual representation

- arabic-incorrect.png: what I get using arabic.output

If you open arabic.output in a hex editor, you'll see that bytes 5 through 7 contain the UTF-8 BOM sequence. It looks like no characters are missing, though.

Is this enough info to track this new issue down?

Thanks,

--
Yoann Roman
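For anyone without a hex editor at hand, the same check can be scripted. The sketch below (Python 3, helper name `find_all` invented for the example) searches for the UTF-8 BOM bytes in the first nine bytes of the output quoted at the start of the thread; it assumes fribidi.exe produced the same leading bytes, which is what the report above suggests.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def find_all(data, needle):
    """Return every byte offset at which needle occurs in data."""
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets


# First nine bytes of the output quoted in the first message.
head = b"\xd8\x9f\xef\xbb\xbb\xef\xbb\xbf\x20"

bom_offsets = find_all(head, UTF8_BOM)
```

The offsets are zero-based, so a BOM found at offset 5 occupies bytes 5 through 7, as described in the message.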
|
|
Re: Invalid UTF-8 for Arabic

Dear Yoann,

I haven't used fribidi on Windows, but I just ran the fribidi executable on Linux with your input. The result differs a lot:

$ hexdump -C arabic.output-unix
00000000  d8 9f d8 a7 d9 84 20 d8  a7 d8 b0 d8 a7 d9 85 d9  |...... .........|
00000010  84 20 d9 88 d8 a3 20 d8  a7 d8 b0 d8 a7 d9 85 d9  |. .... .........|
00000020  84 20 d8 9f d9 85 d9 88  d9 8a 20 d9 84 d9 83 20  |. ........ .... |
00000030  d8 b1 d9 88 d8 b7 d9 81  d9 84 d8 a7 20 d9 84 d9  |............ ...|
00000040  88 d8 a7 d9 86 d8 aa d8  aa 20 d9 84 d9 87        |......... ....|
0000004e

compared to your output:

$ hexdump -C arabic.output
00000000  d8 9f ef bb bb ef bb bf  20 ef ba 8d ef ba ab ef  |........ .......|
00000010  ba 8e ef bb a4 ef bb 9f  20 ef bb ad ef ba 83 20  |........ ...... |
00000020  ef ba 8d ef ba ab ef ba  8e ef bb a4 ef bb 9f 20  |............... |
00000030  d8 9f ef bb a1 ef bb ae  ef bb b3 20 ef bb 9e ef  |........... ....|
00000040  bb 9b 20 ef ba ad ef bb  ae ef bb 84 ef bb 94 ef  |.. .............|
00000050  bb 9f ef ba 8d 20 ef bb  9d ef bb ad ef ba 8e ef  |..... ..........|
00000060  bb a8 ef ba 98 ef ba 97  20 ef bb 9e ef bb ab     |........ ......|
0000006f

I'm almost sure the Unix output contains no invalid UTF-8 sequence, but the Windows output looks quite wrong.

Just for reference, the input for both outputs was this:

$ hexdump -C arabic.input
00000000  d9 87 d9 84 20 d8 aa d8  aa d9 86 d8 a7 d9 88 d9  |.... ...........|
00000010  84 20 d8 a7 d9 84 d9 81  d8 b7 d9 88 d8 b1 20 d9  |. ............ .|
00000020  83 d9 84 20 d9 8a d9 88  d9 85 d8 9f 20 d9 84 d9  |... ........ ...|
00000030  85 d8 a7 d8 b0 d8 a7 20  d8 a3 d9 88 20 d9 84 d9  |....... .... ...|
00000040  85 d8 a7 d8 b0 d8 a7 20  d9 84 d8 a7 d8 9f        |....... ......|
0000004e

Hope it helps.

-Behnam

On Sat, Mar 7, 2009 at 2:25 AM, Yoann Roman <yroman@...> wrote:
> Is this enough info to track this new issue down?

--
بهنام اسفهبد
Behnam Esfahbod
http://behnam.esfahbod.info
http://zwnj.org
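The two dumps are easier to compare as code points than as raw bytes. The following Python 3 sketch decodes the opening bytes of each output from the hexdumps above; the helper name `codepoints` is hypothetical, and only the leading bytes up to the first space are used so that each sample ends on a character boundary.

```python
def codepoints(data):
    """Decode UTF-8 bytes and return the code points as 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in data.decode("utf-8")]


# Opening bytes of arabic.output-unix (up to and including the first space).
unix_head = bytes.fromhex("d8 9f d8 a7 d9 84 20")

# Opening bytes of arabic.output from Windows (same stretch of text).
windows_head = bytes.fromhex("d8 9f ef bb bb ef bb bf 20")

unix_cps = codepoints(unix_head)
windows_cps = codepoints(windows_head)
```

Read this way, the Unix output carries plain Arabic letters from the U+06xx block, while the Windows output carries characters from the U+FExx presentation-forms range plus U+FEFF, which is the code point the UTF-8 BOM bytes encode.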
|
|
Re: Invalid UTF-8 for Arabic

> I haven't used fribidi on windows, but I just run the fribid
> executable in linux on your input. The result differs a lot:
>
> [snip]
>
> I'm almost sure the unix output has no wrong utf-8 sequence, but the
> windows output seems so wrong.

Behdad,

Thanks for the response.

Your output is what I get from fribidi 0.10.9, but that doesn't do Arabic joining. Other than that BOM mark, the Windows output from 0.19.1 is correct and matches what other non-Fribidi-based bidi programs do.

--
Yoann Roman
|
|
Re: Invalid UTF-8 for Arabic

Yoann Roman wrote:
> Behdad,
>
> Thanks for the response.
>
> Your output is what I get from fribidi 0.10.9, but that doesn't do
> Arabic joining. Other than that BOM mark, the Windows output from
> 0.19.1 is correct and matches what other non-Fribidi-based, bidi
> programs do.

I should just point out, as an aside, that the Windows joining is font aware. If a certain letter needs to have a left/right join, but the font does not have the right glyph, then Windows will use (don't remember the details) the right-only or left-only joined glyph.

Not really relevant to the discussion, but I should point out that logical-only joining is not foolproof.

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com
|
|
Re: Invalid UTF-8 for Arabic

Hi Yoann,

> Behdad, Thanks for the response.

No problem at all, but please note that I'm behNaM, behDaD's brother.

> Your output is what I get from fribidi 0.10.9, but that doesn't do
> Arabic joining. Other than that BOM mark, the Windows output from
> 0.19.1 is correct and matches what other non-Fribidi-based, bidi
> programs do.

I didn't notice that your output also has shaping applied. So the Windows output is correct, except for the BOM mark.

It seems the BOM mark appears in the place of the first character of the LAM+ALEF ligature. This might be a bug in the ligature replacement of the shaping function.

Yep, it's in the CVS HEAD, lib/fribidi-arabic.c fribidi_shape_arabic_ligature(), which replaces the first char of the ligature with FRIBIDI_CHAR_FILL, which is ZWNBSP/BOM.

Behdad, why don't you use U+FFFF for this purpose? And why don't fribidi_shape_arabic() or fribidi_shape_arabic_ligature() clean up these CHAR_FILLs?

Behdad, I can fix the problem in CVS if you tell me the best way to fix this.

-Behnam
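To make the mechanism described above concrete, here is a much simplified Python 3 sketch of a length-preserving ligature replacement. It is not FriBidi's code: the function name `shape_lam_alef` is invented, it always picks the isolated ligature form U+FEFB instead of consulting joining context, and it produces visual order by plain string reversal, which only holds for a pure right-to-left run. The filler value U+FEFF is the one the message attributes to FRIBIDI_CHAR_FILL.

```python
LAM, ALEF = "\u0644", "\u0627"
LAM_ALEF_ISOLATED = "\ufefb"  # assumption: isolated form, joining context ignored
CHAR_FILL = "\ufeff"          # value of FRIBIDI_CHAR_FILL according to the thread


def shape_lam_alef(logical):
    """Replace each LAM+ALEF pair with filler+ligature, keeping the length."""
    out = list(logical)
    for i in range(len(out) - 1):
        if out[i] == LAM and out[i + 1] == ALEF:
            out[i] = CHAR_FILL              # first char of the pair becomes filler
            out[i + 1] = LAM_ALEF_ISOLATED  # second char becomes the ligature
    return "".join(out)


# Last word of the input string from the first message: LAM, ALEF, question mark.
last_word = "\u0644\u0627\u061f"

shaped = shape_lam_alef(last_word)
visual = shaped[::-1]  # simplification: reversal stands in for bidi reordering
visual_utf8 = visual.encode("utf-8")
```

Under these assumptions `visual_utf8` comes out as the same eight bytes that open arabic.output, with the filler landing right after the ligature, which is consistent with the explanation in the message.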
|
|
Re: Invalid UTF-8 for Arabic

>> Behdad, Thanks for the response.
>
> No problam at all, but please note that I'm behNaM, behDaD's brother.

Sorry for the name mishap. I was looking at Behdad's message while replying to yours.

> Yep, it's in the CVS HEAD, lib/fribidi-arabic.c
> fribidi_shape_arabic_ligature(), which replaces the first char of
> ligature with FRIBIDI_CHAR_FILL, which is ZWNBSP/BOM.
>
> Behdad, I can fix the problem in CVS if you tell me what's the best
> way to fix this.

Glad to know it's not a problem just on my end. Thanks for the research.

--
Yoann Roman
|
|
Re: Invalid UTF-8 for Arabic

> I should just point out, as an aside, that the Windows joining is font
> aware. If a certain letter needs to have a left/right join, but the
> font does not have the right glyph, then Windows will use (don't
> remember the details) the right only or left only joined glyph.
>
> Not really relevant to the discussion, but I should point out that
> logical only joining is not full proof.

Thanks for the heads up. In this case, I control the font being used as well, so I'll just make sure it has all the joining glyphs.

Out of curiosity, how is this handled on other platforms?

--
Yoann Roman
|
|
Re: Invalid UTF-8 for Arabic

On 03/10/2009 06:08 AM, Behnam Esfahbod ZWNJ wrote:
> Seems the BOM mark appears in the place of the first character of the
> LAM+ALEF ligature. This might be a but in the ligature replacement of
> the shaping function.
>
> Yep, it's in the CVS HEAD, lib/fribidi-arabic.c
> fribidi_shape_arabic_ligature(), which replaces the first char of
> ligature with FRIBIDI_CHAR_FILL, which is ZWNBSP/BOM.

That's expected. It will be removed by fribidi_remove_bidi_marks().

> Behdad, why don't you use U+FFFF for this purpose?

Because U+FFFF is not a valid character. U+FEFF is harmless at that position. It also has the BN bidi category and is one of the characters that any rendering pipeline removes anyway.

> and why
> fribidi_shape_arabic() or fribidi_shape_arabic_ligature() doesn't
> clean this CHAR_FILLs?

The shaping functions do not remove any characters. If they did, it would be hard to keep the mapping between visual and logical strings. The idea is that we add filler chars whenever a character needs to be removed, and fribidi_remove_bidi_marks() or similar functions remove the fillers later.

> Behdad, I can fix the problem in CVS if you tell me what's the best
> way to fix this.

No fix needed AFAI'm concerned :).

behdad
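The character properties mentioned above can be checked against Python's bundled Unicode database, and the "fill now, remove later" idea can be sketched alongside. This is only an illustration in Python 3: `remove_fill` is a made-up helper, not a binding for fribidi_remove_bidi_marks(), and it drops nothing except U+FEFF.

```python
import unicodedata

FILL = "\ufeff"  # the filler character discussed above

# Properties from the Unicode database shipped with Python.
fill_bidi_class = unicodedata.bidirectional(FILL)  # bidi class of U+FEFF
ffff_category = unicodedata.category("\uffff")     # general category of U+FFFF


def remove_fill(visual):
    """Drop filler characters from a visual string.

    Returns the cleaned string and, for every kept character, its index
    in the original string, so positions can still be mapped back.
    """
    kept = [(i, ch) for i, ch in enumerate(visual) if ch != FILL]
    cleaned = "".join(ch for _, ch in kept)
    positions = [i for i, _ in kept]
    return cleaned, positions


# Question mark, LAM-ALEF ligature, filler, space: the start of the shaped output.
cleaned, positions = remove_fill("\u061f\ufefb\ufeff ")
```

Because the filler keeps every string the same length until this final step, the index list is the only place where positions shift, which is the design point made in the message.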
|
|
Re: Invalid UTF-8 for Arabic

> The shaping functions do not remove any characters. If they did it
> would be hard to keep the mapping between visual and logical strings.
> The idea is that we add filler chars when needing to remove any
> characters, and fribidi_remove_bidi_marks() or similar functions will
> remove the fillers later.
>
> No fix needed AFAI'm concerned :).

So, practically, I need to call fribidi_remove_bidi_marks() after calling log2vis? I see that fribidi.exe does this if I pass in --clean.

Thanks,

--
Yoann Roman
|
|
Re: Invalid UTF-8 for Arabic

On 03/11/2009 10:41 AM, Yoann Roman wrote:
> So, practically, I need to call fribidi_remove_bidi_marks after calling
> log2vis? I see that's done by fribidi.exe if I pass in --clean.

Yes.

behdad
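For callers who only have the UTF-8 result in hand, a stopgap along the same lines is to strip the filler after the fact. The Python 3 sketch below is an assumption-laden stand-in, not the library call: `strip_fill_utf8` is a hypothetical helper that removes only U+FEFF, whereas fribidi_remove_bidi_marks() is described in the thread as the proper cleanup step. The sample bytes are the first 25 bytes of arabic.output from the hexdump earlier in the thread.

```python
FILL = "\ufeff"


def strip_fill_utf8(visual_utf8):
    """Decode UTF-8, drop U+FEFF filler characters, and re-encode."""
    text = visual_utf8.decode("utf-8")
    return text.replace(FILL, "").encode("utf-8")


# First 25 bytes of arabic.output, ending on a character boundary.
raw = bytes.fromhex(
    "d89f efbbbb efbbbf 20 "
    "efba8d efbaab efba8e efbba4 efbb9f 20"
)

clean = strip_fill_utf8(raw)
```

After this step the BOM byte sequence no longer appears in the data, and the result is exactly three bytes shorter than the input.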
|
|
Re: Invalid UTF-8 for Arabic

Got it. Thanks for the explanation. :)

On Wed, Mar 11, 2009 at 5:31 PM, Behdad Esfahbod <behdad@...> wrote:
> The shaping functions do not remove any characters. If they did it would be
> hard to keep the mapping between visual and logical strings. The idea is
> that we add filler chars when needing to remove any characters, and
> fribidi_remove_bidi_marks() or similar functions will remove the fillers
> later.

--
Behnam Esfahbod