Java JNI functions' UTF-8 not valid


by Andrew Cowie

Tests this week have uncovered serious problems when you try to put a
supplementary character (a Unicode character whose code point is > 0xFFFF
and which therefore needs more than one 2-byte char to represent) into a
java-gnome widget. The case I've been working with is U+1D45B, the 𝑛
character from the Mathematical Alphanumeric Symbols block:
http://www.fileformat.info/info/unicode/char/1d45b/index.htm
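
To make the "more than one 2-byte char" point concrete, here is a minimal Java sketch. The class and method names are made up for illustration; only the `java.lang.Character` and `java.lang.String` calls are standard API.

```java
// Illustrative only: SupplementaryDemo and utf16Units are hypothetical names.
final class SupplementaryDemo {
    // U+1D45B MATHEMATICAL ITALIC SMALL N, the character discussed above.
    static final int CODE_POINT = 0x1D45B;

    /** Returns the UTF-16 code units Java uses to store a single code point. */
    static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        String s = new String(utf16Units(CODE_POINT));
        // length() counts Java chars (UTF-16 code units), not characters.
        System.out.println("length() in chars: " + s.length());
        System.out.println("code point count:  " + s.codePointCount(0, s.length()));
        for (char c : s.toCharArray()) {
            System.out.printf("  code unit 0x%04X%n", (int) c);
        }
    }
}
```

For U+1D45B the two code units are the surrogate pair 0xD835 0xDC5B, so `length()` reports 2 even though there is only one code point.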

It doesn't matter to us what Java does internally because when we leave
Java and pass string data to GNOME we have [of course] used the JNI
function GetStringUTFChars() which takes a jstring [ie,
java.lang.String] and returns a char* [ie, const gchar*] to a UTF-8
representation of the string.

I'd tested this heavily in the past; ValidatePangoTextRendering and
others. All good, especially with characters > 0x7F and > 0x7FF, ie, two
or three bytes in UTF-8.

But when I tried entering 𝑛 and 𝕊 [which have to be 2 Java chars wide
(UTF-16) and more to the point are *four* UTF-8 bytes wide] into a
TextView, the resulting call to gtk_text_buffer_insert() triggered a
g_utf8_validate() assertion failure. Oh shit.

Not valid UTF-8! What the heck?

The obvious conclusion is that GetStringUTFChars() doesn't return valid
UTF-8 after all. Serkan tells me that this is a fairly well known
problem; he pointed me to http://www.ingrid.org/java/i18n/utf-16/
(frankly that page makes it even _more_ confusing), but digging around
the JNI spec, I found that it vaguely talks about the UTF-8 support being
limited to 3-byte sequences. So, oh shit, confirmed.
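
The difference can be reproduced from pure Java, without JNI. The sketch below assumes that `DataOutputStream.writeUTF()` emits the same "modified UTF-8" format that the JNI spec describes for GetStringUTFChars(); that equivalence is my reading of the documentation, not something tested against java-gnome. The class and helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: ModifiedUtf8Demo and its helpers are hypothetical names.
final class ModifiedUtf8Demo {
    /** Standard UTF-8, the form GLib and GTK expect. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8 as written by writeUTF(), minus its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Strict check, playing the role g_utf8_validate() plays on the C side. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String n = new String(Character.toChars(0x1D45B));
        System.out.println("standard: " + hex(standardUtf8(n)) + " valid=" + isValidUtf8(standardUtf8(n)));
        System.out.println("modified: " + hex(modifiedUtf8(n)) + " valid=" + isValidUtf8(modifiedUtf8(n)));
    }
}
```

Standard UTF-8 encodes U+1D45B as the four bytes F0 9D 91 9B. The modified form encodes each half of the surrogate pair separately as a 3-byte sequence (ED A0 B5 ED B1 9B), which is six bytes that a strict UTF-8 validator rejects.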

Yes, I could live without the 𝑛 character, but the work I'm doing is
aimed at (among other things) mathematical and physics manuscripts, and
so I know I'm going to have people hitting this problem. More to the
point, it just doesn't do for a valid character that the rest of GNOME
handles fine to not work when pasted or typed into a java-gnome Widget.
And I get the impression that there's a lot of CJK and Indic activity up
in the supplementary range, so this needs addressing.

What to do about this?

I poked around a bit, and it didn't take long to find that GLib has a
set of conversion functions. g_utf16_to_utf8() seems promising, right?

Anyone who read 1b-Homework knows that JNI has two sets of string
functions: GetStringChars() and GetStringUTFChars(); NewString() and
NewStringUTF(). I'd long wondered about the existence of the first set;
their documentation talks about "getting real unicode characters"
instead of UTF-8 encoded ones.

        [Ah, such hubris. Remember when everyone laughed at Java because
        we all thought 2 byte char was hardly necessary? Amazing that 10
        years later 2 bytes is not enough; makes the UTF-8 choice back
        around the time of GLib and XML look like a pretty good call]

So that got me wondering just what Java's (well, JNI's) idea of a
"unicode character" (sic) actually is. It took a bit of digging, but it
seems they actually mean UTF-16 encoding. See
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#core-textrep
which seems to say fairly authoritatively (nothing like a FAQ, heh) that
Java uses UTF-16 in the char[] that back String objects.

Ok. If GetStringChars()'s "unicode characters" (sic) are valid UTF-16
characters, then we can use g_utf16_to_utf8() to get *real* UTF-8 to
feed to GTK's and other GNOME libraries' functions.

And g_utf8_to_utf16() + NewString() to go the other way.
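
For readers who want to see what such a conversion involves, here is a rough Java sketch of turning UTF-16 code units into standard UTF-8 and back. It is not GLib's implementation and not the code in the branch; the class and method names are hypothetical, and the return trip simply leans on the JDK's own decoder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: a hand-rolled converter, not what g_utf16_to_utf8() does internally.
final class Utf16ToUtf8Sketch {
    /** Encodes UTF-16 code units as standard UTF-8, pairing up surrogates first. */
    static byte[] utf16ToUtf8(char[] units) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(units.length * 3);
        int i = 0;
        while (i < units.length) {
            char c = units[i++];
            int cp;
            if (Character.isHighSurrogate(c)) {
                if (i >= units.length || !Character.isLowSurrogate(units[i])) {
                    throw new IllegalArgumentException("unpaired high surrogate at index " + (i - 1));
                }
                // Two code units become one supplementary code point.
                cp = Character.toCodePoint(c, units[i++]);
            } else if (Character.isLowSurrogate(c)) {
                throw new IllegalArgumentException("unpaired low surrogate at index " + (i - 1));
            } else {
                cp = c;
            }
            if (cp < 0x80) {                       // 1 byte
                out.write(cp);
            } else if (cp < 0x800) {               // 2 bytes
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {             // 3 bytes
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                               // 4 bytes, the case modified UTF-8 lacks
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    /** The reverse direction, delegated to the JDK's UTF-8 decoder. */
    static char[] utf8ToUtf16(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).toCharArray();
    }

    public static void main(String[] args) {
        char[] units = Character.toChars(0x1D45B);
        byte[] utf8 = utf16ToUtf8(units);
        System.out.println(units.length + " UTF-16 units -> " + utf8.length + " UTF-8 bytes");
    }
}
```

The key step is combining a surrogate pair into one code point before encoding; encoding each surrogate on its own is exactly what produces the invalid six-byte form.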

++

I prototyped it and it stopped my apps from thundering in. Good sign. So
I implemented it for real. Doing the code generator was a snap (yeay);
it only took a few hours to get all the hand written code as well.

See branch 'hackers/andrew/unicode' at
bzr://research.operationaldynamics.com/bzr/java-gnome/hackers/andrew/unicode/


New functions:

        bindings_java_getString();
        bindings_java_newString();

There's a:

        bindings_java_releaseString()

which mirrors JNI's ReleaseStringUTFChars() and should be called on
strings returned by the bindings_java_getString() above.

I was actually able to use JNI's GetStringCritical() and
ReleaseStringCritical() functions. These are "dangerous" but give us a
jchar* pointer to the actual char[] in the VM's managed heap space,
which saves us a copy. I think they're being used correctly.

Did I use const correctly?

++

As far as I can tell this all works and is safe.

Performance of bindings_java_getString() seems comparable to the JNI
GetStringUTFChars(), so that's good.

Please PLEASE test this branch with your apps. It is a hugely invasive
change to java-gnome's internal working, so it's not just a case of "oh,
I don't use high-range characters". This impacts *all* string handling,
so I'd appreciate some testing before we decide to accept this approach.

If you've got a better idea for an implementation, we now have all Java
-> C string conversion abstracted into one place, so you can experiment
there if you want to.

AfC
Sydney

--
Andrew Frederick Cowie

Operational Dynamics is an operations and engineering consultancy
focusing on IT strategy, organizational architecture, systems
review, and effective procedures for change management: enabling
successful deployment of mission critical information technology in
enterprises, worldwide.

http://www.operationaldynamics.com/

Sydney   New York   Toronto   London


------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
_______________________________________________
java-gnome-hackers mailing list
java-gnome-hackers@...
https://lists.sourceforge.net/lists/listinfo/java-gnome-hackers


Re: Java JNI functions' UTF-8 not valid

by Serkan Kaba-2

Although I don't work with those math symbols or weird characters,
thanks for discovering and fixing the issue. I tested libnotify and my
last.fm tutorial against it and they work just fine.

Sincerely,
Serkan KABA


Re: Java JNI functions' UTF-8 not valid

by Andrew Cowie

On Fri, 2009-07-31 at 11:34 +0300, Serkan Kaba wrote:
> Although I don't work with those math symbols or weird characters,
> thanks for discovering and fixing the issue. I tested libnotify and my
> last.fm tutorial against it and they work just fine.

Thanks Serkan. That makes three people who have checked in and said
things were ok.

I think we'll go with this; merged to 'mainline'.

If anyone encounters a problem, *please* send a TestCase subclass
[sure, go ahead and just add a fixture to ValidateUnicode] that fails so
we can debug with your use case. Thanks!

AfC
Sydney



Re: Java JNI functions' UTF-8 not valid

by Serkan Kaba-2

Just to point out, I found the original documents I intended to mention.

http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/types.html, section
on "Modified UTF-8" (actually I spotted this somewhere in the API docs,
but anyway it explains the exact same stuff)
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
and also the javadoc of java.lang.Character

Sincerely
Serkan KABA
