Weird issue with codecs.BOM_UTF8

Weird issue with codecs.BOM_UTF8

by Leonides Saguisag

Hi everyone,

I am encountering a weird issue with getting codecs.BOM_UTF8 to work correctly.  I am using SharpDevelop 3.1.

Here is the test script that I put together:


import sys
sys.path.append(r'D:\Python25\Lib')
import codecs

print sys.version
myfile = open(r'D:\Temp\text_file_with_utf8_bom.txt', 'r')
lines = myfile.readlines()
myfile.close()
if lines[0].startswith(codecs.BOM_UTF8):
        print ('UTF-8 BOM detected!')
else:
        print ('UTF-8 BOM not detected!')

myfile = open(r'D:\Temp\text_file_without_utf8_bom.txt', 'r')
lines = myfile.readlines()
myfile.close()
if lines[0].startswith(codecs.BOM_UTF8):
        print ('UTF-8 BOM detected!')
else:
        print ('UTF-8 BOM not detected!')


If I run the executable that I get from SharpDevelop this is what I get:
bin\Debug> Test.exe
2.5.0 ()
UTF-8 BOM detected!
UTF-8 BOM detected!


But if I run the same script using the standard python interpreter, this is what I get:
bin\Debug> D:\Python25\python.exe ..\..\Program.py
2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)]
UTF-8 BOM detected!
UTF-8 BOM not detected!


The script works correctly with the standard python interpreter but for some reason is not working right with IronPython.

Any ideas what is going wrong?

Thanks!

Best regards,
-- Leo
_______________________________________________
Users mailing list
Users@...
http://lists.ironpython.com/listinfo.cgi/users-ironpython.com

Re: Weird issue with codecs.BOM_UTF8

by Michael Foord-5

Leonides Saguisag wrote:

> Any ideas what is going wrong?
>  
I'm not in a position to check right now, but this could happen if
codecs.BOM_UTF8 is set to the empty string.
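
To illustrate why an empty constant would explain the symptom: every Python string starts with the empty string, so the check would succeed for any file. A minimal sketch, where `empty_bom` and the two sample lines are made up for illustration and are not taken from the files in this thread:

```python
import codecs

# Hypothetical stand-in for a BOM constant that was wrongly set to ''.
empty_bom = ''

# Made-up first lines of two files, one with a BOM and one without.
line_with_bom = '\xef\xbb\xbfhello\n'
line_without_bom = 'hello\n'

# startswith('') is True for every string, so both checks "detect" a BOM.
print(line_with_bom.startswith(empty_bom))
print(line_without_bom.startswith(empty_bom))

# Printing the real constant with repr() shows whether it is actually empty.
print(repr(codecs.BOM_UTF8))
```

If repr(codecs.BOM_UTF8) shows three characters or bytes, this hypothesis can be ruled out.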

Michael

--
http://www.ironpythoninaction.com/

Re: Weird issue with codecs.BOM_UTF8

by Leonides Saguisag

Thank you for taking the time to reply.  Any idea why this would happen in IronPython but not with the standard Python interpreter?  What is weirding me out is that the exact same script behaves differently depending on whether I use IronPython or the standard Python interpreter.

Thanks!

-- Leo

-----Original Message-----
From: users-bounces@... [mailto:users-bounces@...] On Behalf Of Michael Foord
Sent: 2009-11-10 13:17
To: Discussion of IronPython
Subject: Re: [IronPython] Weird issue with codecs.BOM_UTF8

I'm not in a position to check right now, but this could happen if codecs.BOM_UTF8 is set to the empty string.

Michael

Re: Weird issue with codecs.BOM_UTF8

by Michael Foord-5

Leonides Saguisag wrote:
> Thank you for taking the time to reply.  Any idea why this would happen in IronPython but not with the standard Python interpreter?  What is weirding me out is that the exact same script behaves differently depending on whether I use IronPython or the standard Python interpreter.
>  
Well, if codecs.BOM_UTF8 is set to the empty string (you didn't say if
you have tried this yet?) then it would be due to a bug in IronPython
somewhere - but at least you would know what was causing it.

If it is the empty string, purely speculating, it could be due to the
way the .NET framework treats the BOM at the start of strings. Pure
speculation though - that might not be the problem at all or it could be
caused by something entirely different.

In .NET it would be more normal to check for the BOM with bytes, as by
the time you have a string you have (usually) decoded already.
IronPython 2.X is a bit odd for the .NET framework in this respect.
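
A byte-level check along those lines might look like the sketch below. The helper name `starts_with_utf8_bom` is hypothetical, and the code assumes that reading in 'rb' mode returns a value directly comparable with codecs.BOM_UTF8, which holds on CPython but has not been verified here on IronPython 2.0.

```python
import codecs


def starts_with_utf8_bom(path):
    """Return True if the file at `path` begins with the UTF-8 BOM bytes."""
    # Binary mode: nothing is decoded or translated before the comparison.
    f = open(path, 'rb')
    try:
        # Read only as many bytes as the BOM is long (three for UTF-8).
        head = f.read(len(codecs.BOM_UTF8))
    finally:
        f.close()
    return head == codecs.BOM_UTF8
```

It could be called as starts_with_utf8_bom(r'D:\Temp\text_file_without_utf8_bom.txt'). Only the first three bytes are read, so the cost does not depend on the file size.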

Michael

Re: Weird issue with codecs.BOM_UTF8

by Leonides Saguisag

Hi Michael,

I just verified the empty string theory that you mentioned and Python25\lib\codecs.py (comes with the standard library in Python 2.5) has the following defined:


# UTF-8
BOM_UTF8 = '\xef\xbb\xbf'


So it is not an empty string.

Maybe I am approaching this wrong and you guys can provide me with an alternative way of doing this.  I am trying to read a file and determine whether it is encoded in UTF-8.  The approach I took was to use Python's built-in open function to read the text file into a list of strings and check whether the first line starts with the UTF-8 byte order mark by using line.startswith(codecs.BOM_UTF8).  As I noted below, this works fine in Python 2.5, but IronPython keeps reporting a UTF-8 BOM even when none is present.
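
One caveat on the goal itself: the BOM is optional in UTF-8, so a file without one can still be UTF-8. A possible alternative, sketched below under the assumption that the whole file fits in memory, is to read the raw bytes and attempt a strict decode. The function name `classify_utf8` and its three return labels are hypothetical. A successful decode only shows that the bytes are valid UTF-8, not that the author intended that encoding.

```python
import codecs


def classify_utf8(path):
    """Return 'utf-8-bom', 'utf-8', or 'not-utf-8' for the file at `path`."""
    f = open(path, 'rb')
    try:
        data = f.read()
    finally:
        f.close()
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        data.decode('utf-8')
    except UnicodeDecodeError:
        return 'not-utf-8'
    # Valid UTF-8: report separately whether it starts with the BOM.
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-bom'
    return 'utf-8'
```

Pure ASCII files are classified as 'utf-8' here, since ASCII is a subset of UTF-8.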

Thanks!

-- Leo

-----Original Message-----
From: users-bounces@... [mailto:users-bounces@...] On Behalf Of Michael Foord
Sent: 2009-11-10 13:32
To: Discussion of IronPython
Subject: Re: [IronPython] Weird issue with codecs.BOM_UTF8

Leonides Saguisag wrote:
> Thank you for taking the time to reply.  Any idea why this would happen in IronPython but not with the standard Python interpreter?  What is weirding me out is that the exact same script behaves differently depending on whether I use IronPython or the standard Python interpreter.
>  
Well, if codecs.BOM_UTF8 is set to the empty string (you didn't say if you have tried this yet?) then it would be due to a bug in IronPython somewhere - but at least you would know what was causing it.

If it is the empty string, purely speculating, it could be due to the way the .NET framework treats the BOM at the start of strings. Pure speculation though - that might not be the problem at all or it could be caused by something entirely different.

In .NET it would be more normal to check for the BOM with bytes, as by the time you have a string you have (usually) decoded already.
IronPython 2.X is a bit odd for the .NET framework in this respect.

Michael

Re: Weird issue with codecs.BOM_UTF8

by Michael Foord-5

Leonides Saguisag wrote:
> Hi Michael,
>
> I just verified the empty string theory that you mentioned and Python25\lib\codecs.py (comes with the standard library in Python 2.5) has the following defined:
>
>  
You're using IronPython 2.0 then?

If I use IronPython 2.6 it correctly reports a text file as not starting
with the BOM:

 >>> import codecs
 >>> codecs.BOM_UTF8
u'\xef\xbb\xbf'
 >>> lines = open('foo.txt').readlines()
 >>> lines
['foo']
 >>> lines[0].startswith(codecs.BOM_UTF8)
False


All the best,

Michael Foord

Re: Weird issue with codecs.BOM_UTF8

by Leonides Saguisag

Hi Michael,

I am using SharpDevelop 3.1 which comes with "2.5.0 (IronPython 2.0.2 (2.0.0.0) on .NET 2.0.50727.3603)".

So this issue is resolved with IronPython 2.6, then?

Thanks!

-- Leo

-----Original Message-----
From: users-bounces@... [mailto:users-bounces@...] On Behalf Of Michael Foord
Sent: 2009-11-10 14:05
To: Discussion of IronPython
Subject: Re: [IronPython] Weird issue with codecs.BOM_UTF8

Leonides Saguisag wrote:
> Hi Michael,
>
> I just verified the empty string theory that you mentioned and Python25\lib\codecs.py (comes with the standard library in Python 2.5) has the following defined:
>
>  
You're using IronPython 2.0 then?

If I use IronPython 2.6 it correctly reports a text file as not starting with the BOM:

 >>> import codecs
 >>> codecs.BOM_UTF8
u'\xef\xbb\xbf'
 >>> lines = open('foo.txt').readlines()
 >>> lines
['foo']
 >>> lines[0].startswith(codecs.BOM_UTF8)
False


All the best,

Michael Foord

Re: Weird issue with codecs.BOM_UTF8

by Michael Foord-5

Leonides Saguisag wrote:
> Hi Michael,
>
> I am using SharpDevelop 3.1 which comes with "2.5.0 (IronPython 2.0.2 (2.0.0.0) on .NET 2.0.50727.3603)".
>
>  
Yeah, that's IronPython 2.0 - which is fine but not as good as
IronPython 2.6. ;-)

> So this issue is resolved with IronPython 2.6, then?
>
>  

No idea. I can't reproduce the problem with IronPython 2.6 though. Try
installing IronPython 2 and seeing what happens from the interactive
interpreter (whether you can reproduce the problem or not). It is
*possible* that it's caused by the way SharpDevelop generates its
executables, but that's highly unlikely to be the cause of the problem.
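
When comparing interpreters this way, it may help to print the same handful of values in each one. The sketch below is only a suggestion: the helper name `bom_diagnostics` and the dictionary keys are hypothetical, and it reads the first line in 'rb' mode, which differs from the 'r' mode used in the original script.

```python
import codecs
import sys


def bom_diagnostics(path):
    """Collect values worth comparing between two interpreters."""
    f = open(path, 'rb')
    try:
        first_line = f.readline()
    finally:
        f.close()
    return {
        'interpreter': sys.version,
        # repr() makes an empty or unexpectedly typed constant visible.
        'bom': repr(codecs.BOM_UTF8),
        'bom_length': len(codecs.BOM_UTF8),
        'first_line': repr(first_line),
        'starts_with_bom': first_line.startswith(codecs.BOM_UTF8),
    }
```

Running it on the same file under python.exe and ipy.exe and comparing the two dictionaries should show whether the constant, the data read from the file, or the startswith call is where the results diverge.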

All the best,

Michael Foord

Re: Weird issue with codecs.BOM_UTF8

by Leonides Saguisag

Hi Michael,

It seems to be a bug with IronPython 2.0.x then.  I just installed IronPython 2.0.3 and this is what I found:


C:\>"C:\Program Files\IronPython 2.0.3\ipy.exe"
IronPython 2.0.3 (2.0.0.0) on .NET 2.0.50727.3603
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append(r'D:\Python25\lib')
>>> import codecs
>>> lines = open(r'D:\Temp\text_file_without_utf8_bom.txt', 'r').readlines()
>>> print lines
['This is a text file without a UTF-8 BOM.\n', 'Line 2\n', 'Line 3']
>>> lines[0].startswith(codecs.BOM_UTF8)
True
>>> ^Z

C:\>
 

It returned 'True' even though the text file did not have a UTF-8 BOM.  Contrasted with standard Python 2.5:


C:\> D:\Python25\python.exe
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import codecs
>>> lines = open(r'D:\Temp\text_file_without_utf8_bom.txt', 'r').readlines()
>>> print lines
['This is a text file without a UTF-8 BOM.\n', 'Line 2\n', 'Line 3']
>>> lines[0].startswith(codecs.BOM_UTF8)
False
>>> ^Z

C:\>


So it looks like there was a bug in IronPython 2.0.x in the handling of codecs.BOM_UTF8 that now appears to be fixed in IronPython 2.6.  Does that sound like a fair assessment?
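
One hypothesis raised earlier in this thread was that codecs.BOM_UTF8 might be the empty string, since every string starts with the empty string. A rough sketch for seeing what is actually being compared on each interpreter follows. It assumes the line is a byte string, as returned by open(..., 'r') on Python 2, and describe_bom_check is a made-up helper name, not part of any library:

```python
import codecs

def describe_bom_check(first_line):
    """Collect the values involved in the BOM test so two interpreters can be compared."""
    bom = codecs.BOM_UTF8
    return {
        'bom_repr': repr(bom),                        # what the constant looks like
        'bom_len': len(bom),                          # 0 would confirm the empty-string theory
        'prefix_repr': repr(first_line[:len(bom)]),   # what the line really starts with
        'startswith': first_line.startswith(bom),     # the check used in the script
        'slice_equal': first_line[:len(bom)] == bom,  # the same test without startswith
    }
```

Running describe_bom_check(lines[0]) under both ipy.exe and python.exe and comparing the results should show whether the constant, the file contents, or startswith itself differs between the two.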

Thanks!

-- Leo


Re: Weird issue with codecs.BOM_UTF8

by Michael Foord-5

Leonides Saguisag wrote:

> Hi Michael,
>
> It seems to be a bug with IronPython 2.0.x then.  I just installed IronPython 2.0.3 and this is what I found:
>
>
> C:\>"C:\Program Files\IronPython 2.0.3\ipy.exe"
> IronPython 2.0.3 (2.0.0.0) on .NET 2.0.50727.3603
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> sys.path.append(r'D:\Python25\lib')
> >>> import codecs
> >>> lines = open(r'D:\Temp\text_file_without_utf8_bom.txt', 'r').readlines()
> >>> print lines
> ['This is a text file without a UTF-8 BOM.\n', 'Line 2\n', 'Line 3']
> >>> lines[0].startswith(codecs.BOM_UTF8)
> True
> >>> ^Z
>
> C:\>
>  
>
> It returned 'True' even though the text file did not have a UTF-8 BOM.  Contrasted with standard Python 2.5:
>
>
> C:\> D:\Python25\python.exe
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> import codecs
> >>> lines = open(r'D:\Temp\text_file_without_utf8_bom.txt', 'r').readlines()
> >>> print lines
> ['This is a text file without a UTF-8 BOM.\n', 'Line 2\n', 'Line 3']
> >>> lines[0].startswith(codecs.BOM_UTF8)
> False
> >>> ^Z
>
> C:\>
>
>
> So it looks like there was a bug in IronPython 2.0.x with regards to the handling of codecs.BOM_UTF8 that now appears to be fixed in IronPython 2.6.  Does that sound like a fair assessment?
>  

Looks to me like you are right. If the code that checks for the BOM is
in your code rather than a library module then you'll have to find
another way to do the check I'm afraid. (Or upgrade to IronPython 2.6 of
course.)
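
As one possible "other way", here is a minimal sketch that reads the first bytes of the file in binary mode and compares them to the BOM with ==, so that str.startswith is not involved at all. The function name has_utf8_bom is made up for illustration, and this has not been verified against IronPython 2.0.x, so treat it as something to try rather than a confirmed workaround:

```python
import codecs

def has_utf8_bom(path):
    """Return True if the file at `path` begins with the UTF-8 byte order mark."""
    f = open(path, 'rb')  # binary mode, so no newline or encoding translation
    try:
        head = f.read(len(codecs.BOM_UTF8))  # at most the first 3 bytes
    finally:
        f.close()
    # Plain equality on the leading bytes; avoids str.startswith entirely.
    return head == codecs.BOM_UTF8
```

With the files from the original post, has_utf8_bom(r'D:\Temp\text_file_with_utf8_bom.txt') should be True and the call on the file without a BOM should be False. Note that a missing BOM does not mean the file is not UTF-8, since UTF-8 files are often written without one.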

All the best,

Michael Foord
