
Re: Jython 2.5.1 and various encodings support - LookupError: unknown encoding

by Chris Clark-2

Chris Clark wrote:

> Philip Jenvey wrote:
>  
>> #1066 is the main bug for this issue -- we just currently lack support for the asian codecs like shiftjis. The ImportError in sample #2 is a symptom of that. The same ImportError happens when you attempt to use the codec but it's masked as a LookupError.
>>
>> Supporting these via the JVM's nio codecs is definitely doable but nobody's gotten around to it yet.
>>  
>>    
>
> Is http://java.sun.com/j2se/1.4.2/docs/guide/nio/ the package you are
> referring to? I'm not a big Java guy but I may start hacking on a Python
> layer on top of this as an experiment/proof-of-concept. Presumably
> http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/package-summary.html 
> is what needs wrapping?
>  
I had some time this afternoon whilst waiting for some builds to
complete, so I started experimenting with using nio from Python, along
with a quick attempt at a shift_jis codec.

I'm seeking feedback on a very INCOMPLETE demo that is attached. Sample
session:

C:\users\clach04\python\jython_character_encoding>c:\jython2.5.1\jython.bat
Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.6.0_02
Type "help", "copyright", "credits" or "license" for more information.
 >>> x=''
 >>> x.decode('shift_jis') # at this point there is a shift_jis.py in curdir
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
LookupError: unknown encoding 'shift_jis'
 >>> import shift_jis # register the local module/encoding
 >>> x.decode('shift_jis')
u''
 >>>
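
The session above relies on the import having a side effect: the module registers a codec search function. A minimal sketch of that mechanism follows, written for CPython 3 so it can run off the JVM. The codec name demo_sjis and all the demo_* helpers are hypothetical, and the encode/decode functions simply delegate to CPython's built-in shift_jis codec where the attached module calls java.nio instead.

```python
import codecs

DEMO_NAME = "demo_sjis"  # hypothetical codec name, not from the original post


def demo_encode(text, errors="strict"):
    # Stand-in for the nio-based encoder: delegate to CPython's own
    # shift_jis codec (an assumption made so the sketch runs without a JVM).
    return codecs.encode(text, "shift_jis", errors), len(text)


def demo_decode(data, errors="strict"):
    # Stand-in for the nio-based decoder, same assumption as above.
    return codecs.decode(data, "shift_jis", errors), len(data)


def demo_search_function(name):
    # The codecs module passes the normalized (lower-cased) encoding name
    # to each registered search function; return None for unknown names.
    if name == DEMO_NAME:
        return codecs.CodecInfo(demo_encode, demo_decode, name=DEMO_NAME)
    return None


def is_known(name):
    # True if some registered search function recognizes the name.
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


before = is_known(DEMO_NAME)           # lookup fails: nothing registered yet
codecs.register(demo_search_function)  # what "import shift_jis" triggers
after = is_known(DEMO_NAME)            # lookup now succeeds
```

Once registered, u"\u3042".encode("demo_sjis") goes through demo_encode, in the same way that the import makes x.decode('shift_jis') work in the session above.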

There is no support for the errors argument (or any less strict
conversion options), there are imports in the middle of the script, and
you have to import the encoding you need explicitly (right now there is
only one, but it would be easy to do several from a template). I'm
beginning to wonder if it would simply be cleaner to use the CPython
gencodec.py script and generate its input from the CPython encodings.
I've done this for some Windows (single byte) encodings that are not
supported by Python, by auto-generating tables from Windows codepages
like cp708. The tables would be pretty big though :-)
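
For comparison, the table-driven alternative for a single-byte encoding can be sketched as below. This is roughly the shape of the charmap-based modules that gencodec.py emits, but the table itself is made up: putting Greek capitals at bytes 0x80-0x82 is purely illustrative, and the toy_* names are hypothetical. The sketch is written for CPython 3 (on Python 2 the table would be built with unichr).

```python
import codecs

# Hypothetical 256-entry decoding table: bytes 0x00-0x7F map to ASCII,
# bytes 0x80-0x82 map to GREEK CAPITAL ALPHA, BETA, GAMMA (invented for
# this example), and every other byte is undefined.
_upper = u"\u0391\u0392\u0393"
decoding_table = (
    u"".join(chr(i) for i in range(128))
    + _upper
    + u"\ufffe" * (128 - len(_upper))  # U+FFFE marks an undefined byte
)

# Build the reverse (unicode -> byte) mapping from the decoding table.
encoding_table = codecs.charmap_build(decoding_table)


def toy_decode(data, errors="strict"):
    # Returns (unicode_text, number_of_bytes_consumed).
    return codecs.charmap_decode(data, errors, decoding_table)


def toy_encode(text, errors="strict"):
    # Returns (byte_string, number_of_characters_consumed).
    return codecs.charmap_encode(text, errors, encoding_table)
```

Because the charmap helpers take the errors argument directly, 'strict', 'replace' and 'ignore' come for free here. A double-byte encoding such as shift_jis does not fit in a 256-entry table, which is where the size concern comes from.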

I'm really looking for "yes, the nio-from-Python approach is worth pursuing"
or "this is stupid, you should stop now" comments. I'm pretty sure that
performance-wise this approach is not a good idea, but it is infinitely
faster than "doesn't work at all" :-)


Here is a slightly more real example:
C:\users\clach04\python\jython_character_encoding>c:\jython2.5.1\jython.bat
Jython 2.5.1 (Release_2_5_1:6813, Sep 26 2009, 13:47:54)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.6.0_02
Type "help", "copyright", "credits" or "license" for more information.
 >>> import shift_jis # register the local module/encoding
 >>> x = u"\u3042" # '3042 HIRAGANA LETTER A'
 >>> x.encode('shift_jis')
'\x82\xa0'
 >>> # hey! Looks like it matches
http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003&b=82&s=ALL#layout

Finally, does anyone know how IronPython handles CJK (or do they simply
make use of .NET strings)?

Chris


# java imports
import java.nio.charset
import java.nio.CharBuffer
import java.nio.ByteBuffer

# python imports
import array


class MyBaseException(Exception):
    pass


def nio_unicode_to_bytes(nio_charset_name, unicode_string_data):
    """Take Python Unicode string and return python str type (byte)
    encoded in nio_charset_name
        nio_charset_name is a java.nio.charset name
    """
    assert isinstance(unicode_string_data, unicode)

    nio_charset = java.nio.charset.Charset.forName(nio_charset_name)
    # TODO lookup could fail

    nio_charset_encoder = nio_charset.newEncoder()
    try:
        bbuf = nio_charset_encoder.encode(java.nio.CharBuffer.wrap(unicode_string_data))
    except java.nio.charset.UnmappableCharacterException:
        # not possible to represent one or more Unicode character(s) in this encoding
        raise MyBaseException('nio encoding failure - not implemented support yet')
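    # bbuf.array() is the whole backing array, which can be longer than the
    # encoded data, so keep only the bytes up to the buffer's limit
    tmp_byte_array = array.array('b', bbuf.array()[:bbuf.limit()])
    return tmp_byte_array.tostring()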


def nio_bytes_to_unicode(nio_charset_name, byte_string_data):
    """Take Python str (byte) string and return python Unicode string type decoded using nio_charset_name
    nio_charset_name is a java.nio.charset name
    """
    assert isinstance(byte_string_data, str)

    nio_charset = java.nio.charset.Charset.forName(nio_charset_name)
    # TODO lookup could fail

    nio_charset_decoder = nio_charset.newDecoder()
    tmp_byte_buffer = java.nio.ByteBuffer.wrap(byte_string_data)
    try:
        cbuf = nio_charset_decoder.decode(tmp_byte_buffer)
    except java.nio.charset.MalformedInputException:
        raise MyBaseException('nio decoding failure - not implemented support yet')
    tmp_unicode_str = cbuf.toString()
    return tmp_unicode_str


## could probably use a decorator here....
def decode_Shift_JIS(input):
    return nio_bytes_to_unicode('Shift_JIS', input)


def encode_Shift_JIS(input):
    return nio_unicode_to_bytes('Shift_JIS', input)

#### Pretty much boilerplate codec

import codecs


def decode(input, errors='strict'):
    return decode_Shift_JIS(input), len(input)


def encode(input, errors='strict'):
    return encode_Shift_JIS(input), len(input)


class Codec(codecs.Codec):
    def decode(self, input, errors='strict'):
        return decode(input, errors)

    def encode(self, input, errors='strict'):
        return encode(input, errors)


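# Inherit from the local Codec class (not codecs.Codec) so the stream
# classes pick up the encode/decode methods defined above
class StreamReader(Codec, codecs.StreamReader):
    pass


class StreamWriter(Codec, codecs.StreamWriter):
    pass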


# entry point
def getregentry():
    return (encode, decode, StreamReader, StreamWriter)


##### not so boilerplate.....


def shift_jis_search_function(name):
    if name == 'shift_jis':
        import shift_jis
        codec = shift_jis.Codec()
        return (codec.encode, codec.decode, shift_jis.StreamReader, shift_jis.StreamWriter)
    else:
        return None

## works for 2.4 and 2.5
codecs.register(shift_jis_search_function)

_______________________________________________
Jython-users mailing list
Jython-users@...
https://lists.sourceforge.net/lists/listinfo/jython-users
