I have a question about the Big5 plus HKSCS character set.
Does the ICU code page for Big5 plus HKSCS include every character that Microsoft's code pages provide?
Does anybody know where I can get a complete ICU code page for Big5 plus HKSCS?
Any information would be much appreciated.
Regards
Zhao Yingying
Internationalization Development Engineer
SAS Research and Development (Beijing) Co., Ltd.
TEL: +86 10 63103355
Email: yingying.zhao@...
Web: www.sas.com
SAS ... The Power to Know
-----Original Message-----
From: icu-support-bounces@... [mailto:icu-support-bounces@...] On Behalf Of icu-support-request@...
Sent: Tuesday, September 19, 2006 7:27 AM
To: icu-support@...
Subject: icu-support Digest, Vol 4, Issue 12
Today's Topics:
1. UCharacter.getNumericValue(int) returns -1 on some full width Latin capital letters (Tony Wu)
2. A suggestion on character detection function (Yingying Zhao)
3. why is udatatst.c:205 commented out? (stat(data/out/build/unorm.icu)) (Markus Scherer)
4. Lenient Calendar not adding values properly (Raziel Alvarez)
----------------------------------------------------------------------
Message: 1
Date: Mon, 18 Sep 2006 15:30:33 +0800
From: "Tony Wu" <wuyuehao@...>
Subject: [icu-support] UCharacter.getNumericValue(int) returns -1 on some full width Latin capital letters
To: icu-support@...
Message-ID: <211709bc0609180030m1f368c06qc7691ad9c078b113@...>
Content-Type: text/plain; charset="iso-2022-jp"
I have run into a problem with the numeric values of full-width Latin capital
letters: UCharacter.getNumericValue(int) returns -1 for every character after U+FF31.
Here is the test case. Can you help me?
import com.ibm.icu.lang.UCharacter;

public class TestICU {
    public static void main(String[] args) {
        // Fullwidth Latin capital letters O (U+FF2F) through Z (U+FF3A)
        System.out.println(UCharacter.getNumericValue((int)'\uFF2F'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF30'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF31'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF32'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF33'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF34'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF35'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF36'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF37'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF38'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF39'));
        System.out.println(UCharacter.getNumericValue((int)'\uFF3A'));
    }
}
On JDK 1.5 this prints:
24
25
26
-1
-1
-1
-1
-1
-1
-1
-1
-1
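For comparison, here is a minimal sketch that runs the same range through the JDK's own java.lang.Character rather than ICU4J's UCharacter. The JDK documents the full-width Latin letters as having numeric values 10 through 35, so this presumably shows what the ICU4J call was expected to return. The class and helper names are only illustrative.

```java
public class FullWidthNumericValues {
    /** Numeric value of a code point according to the JDK's java.lang.Character. */
    static int jdkNumericValue(int codePoint) {
        return Character.getNumericValue(codePoint);
    }

    public static void main(String[] args) {
        // Same range as the test case above: U+FF2F (fullwidth O) to U+FF3A (fullwidth Z).
        for (int cp = 0xFF2F; cp <= 0xFF3A; cp++) {
            System.out.printf("U+%04X -> %d%n", cp, jdkNumericValue(cp));
        }
    }
}
```

According to the java.lang.Character documentation, the loop should print the values 24 through 35 with no -1 entries.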
--
Tony Wu
China Software Development Lab, IBM
------------------------------
Message: 2
Date: Mon, 18 Sep 2006 16:44:04 +0800
From: "Yingying Zhao" <yingying.zhao@...>
Subject: [icu-support] A suggestion on character detection function
To: <icu-support@...>
Cc: andy.heninger@...
Message-ID: <983FA94EFCF1AF49B0CE828BB7D72403831C8E@...>
Content-Type: text/plain; charset="us-ascii"
This mail raises two points:
1.
Between version 3.4.4 and the latest ICU4J version, line 80 of CharsetRecog_mbcs.java (in match()) was changed from
if (doubleByteCharCount == 0 && badCharCount == 0)
to
if (doubleByteCharCount == 10 && badCharCount == 0)
Although this change removes the possibility of a division by zero on line 112:
double scaleFactor = 90.0 / maxVal;
it reduces accuracy for short text strings. It turns out that detection always fails for strings shorter than 20 bytes.
2.
I am confused by line 111:
double maxVal = Math.log((float)doubleByteCharCount / 4);
Why is doubleByteCharCount divided by 4? Why not follow the pattern of line 113 and simply add 1? That keeps maxVal positive (> 0) as long as doubleByteCharCount is at least 1, and so avoids any further problem on line 112.
From the calculation that follows, we get this relationship:
confidence = (int)(Math.log(commonCharCount+1) * 90.0 / Math.log((float)doubleByteCharCount / 4) + 10);
When the number of common characters is larger than 1/4 of the number of double-byte characters, the confidence rises above 100. That value is invalid and has to be corrected on line 114:
confidence = Math.min(confidence, 100);
If line 111 instead followed the pattern of line 113 and were changed to:
double maxVal = Math.log(doubleByteCharCount + 1);
the confidence would stay within 100.
Line 80 could then be restored to
(doubleByteCharCount == 0 && badCharCount == 0)
which would also remove the inaccuracy described in point 1.
Is this suggestion feasible? Or can it lead to any other problems?
Any discussion about it will be much appreciated.
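To make the comparison concrete, here is a small standalone sketch of the two variants. It copies only the arithmetic quoted above (lines 111 to 113) and leaves out the clamp on line 114; the class name, method names and the sample counts are hypothetical and are not part of ICU4J. It assumes the doubleByteCharCount == 0 case has already been handled by the check on line 80.

```java
public class ConfidenceSketch {
    /** Shared tail of the calculation (lines 112-113 as quoted), without the clamp. */
    static int rawConfidence(int commonCharCount, double maxVal) {
        double scaleFactor = 90.0 / maxVal;
        return (int) (Math.log(commonCharCount + 1) * scaleFactor + 10);
    }

    /** Line 111 as it stands: the double-byte count is divided by 4. */
    static int currentRaw(int commonCharCount, int doubleByteCharCount) {
        return rawConfidence(commonCharCount, Math.log((float) doubleByteCharCount / 4));
    }

    /** Suggested variant: add 1 instead, matching the pattern of line 113. */
    static int proposedRaw(int commonCharCount, int doubleByteCharCount) {
        return rawConfidence(commonCharCount, Math.log(doubleByteCharCount + 1));
    }

    public static void main(String[] args) {
        // Hypothetical sample: 5 common characters among 20 double-byte characters.
        int common = 5;
        int doubleByte = 20;
        System.out.println("current:  " + currentRaw(common, doubleByte));
        System.out.println("proposed: " + proposedRaw(common, doubleByte));
    }
}
```

For this sample the current formula yields 110 before clamping, while the suggested one yields 62.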
Regards
Zhao Yingying
Internationalization Development Engineer
SAS Research and Development (Beijing) Co., Ltd.
TEL: +86 10 63103355
Email: yingying.zhao@...
Web: www.sas.com
SAS ... The Power to Know
------------------------------
Message: 3
Date: Mon, 18 Sep 2006 16:15:18 -0700
From: "Markus Scherer" <markus.icu@...>
Subject: [icu-support] why is udatatst.c:205 commented out? (stat(data/out/build/unorm.icu))
To: "ICU support mailing list" <icu-support@...>
Message-ID: <6bb028490609181615k33e43601w2f21ce89705daca7@...>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Why is udatatst.c line 205 commented out? It checks whether
source/data/out/build/unorm.icu exists, so that certain tests run only if
the file does exist.
Well, in my build, the file does not exist, and the line is commented
out, so those tests run anyway and fail. :-(
It looks like George commented out the line...
Code surrounding the line, with CVS annotate:
1.39 (srl 17-Jul-02): strcat(icuDataFilePath, U_ICUDATA_NAME);
1.39 (srl 17-Jul-02): strcat(icuDataFilePath, "_");
1.51 (grhoten- 28-Aug-03): strcat(icuDataFilePath, "unorm.icu");
1.39 (srl 17-Jul-02):
1.39 (srl 17-Jul-02): /* lots_of_mallocs(); */
1.51 (grhoten- 28-Aug-03): /* if (stat(icuDataFilePath, &stat_buf) == 0)*/
1.30 (aheninge 16-Nov-01): {
1.30 (aheninge 16-Nov-01): int i;
1.39 (srl 17-Jul-02): log_verbose("%s exists, so..\n", icuDataFilePath);
CVS log:
Revision : 1.51
Date : 2003/8/28 22:52:38
Author : 'grhoten-oss'
State : 'Exp'
Lines : +9 -8
Description :
jitterbug 2966: tz.icu no longer exists.
Ideas?
markus
------------------------------
Message: 4
Date: Mon, 18 Sep 2006 19:27:20 -0400
From: "Raziel Alvarez" <raziel.ag@...>
Subject: [icu-support] Lenient Calendar not adding values properly
To: "ICU support mailing list" <icu-support@...>
Message-ID: <2b256b200609181627m3737eb32wc0848f7cd53be406@...>
Content-Type: text/plain; charset="iso-8859-1"
---------- Forwarded message ----------
From: Raziel Alvarez <raziel.ag@...>
Date: Sep 15, 2006 4:04 PM
Subject: Lenient Calendar not adding values properly
To: ICU support mailing list <icu-support@...>
I create a lenient IslamicCalendar and set dates on it whose month may be
out of range; for example, the month value may be 12 or -1, and I expect it
to roll over to the next or previous year respectively.
This works fine for regular years, but for leap years I get what seems to me
inconsistent behaviour: it gives me the same month and the same year, but
the day is the last day of the month. My guess is that the implementation
adds a fixed number of days (to the internal Julian day representation)
without taking into account that in leap years the last Islamic month has an
extra day, so when the date is recreated it ends up in the same month and
year, on the last day of the month.
For example:
    IslamicCalendar isCal = new IslamicCalendar();
    isCal.set(1427, 12, 1); // 1427 is not a leap year
    System.out.println(
        isCal.get(Calendar.YEAR)+" "+          // 1428
        isCal.get(Calendar.MONTH)+" "+         // 0
        isCal.get(Calendar.DAY_OF_MONTH));     // 1
    isCal.set(1428, 12, 1); // 1428 is a leap year
    System.out.println(
        isCal.get(Calendar.YEAR)+" "+          // 1428, but I'd expect 1429/0/1
        isCal.get(Calendar.MONTH)+" "+         // 11
        isCal.get(Calendar.DAY_OF_MONTH));     // 30
The same happens if I use the add method, which states that it will "add a
signed amount to a specified field, *using this calendar's rules*":
    isCal.set(1428, 11, 1);
    isCal.add(Calendar.MONTH, 1);
    System.out.println(
        isCal.get(Calendar.YEAR)+" "+          // 1428
        isCal.get(Calendar.MONTH)+" "+         // 11
        isCal.get(Calendar.DAY_OF_MONTH));     // 30
Can somebody explain this to me please? This looks like a bug to me. I do the
same thing with a Gregorian calendar without any problem, probably because
its leap day is not added to the last month.
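To spell out the rollover I expect, here is a minimal sketch of the arithmetic (tabular) Islamic calendar rules, written without ICU4J. It assumes the common 30-year civil leap-year rule and 0-based months; whether IslamicCalendar is configured this way in every setup is an assumption on my part, and the class and method names are only illustrative.

```java
public class IslamicRolloverSketch {
    /** Civil (tabular) leap-year rule: 11 leap years in every 30-year cycle. */
    static boolean isCivilLeapYear(int year) {
        return (14 + 11 * year) % 30 < 11;
    }

    /** Length of a 0-based month: 30 and 29 days alternate, plus a leap day in month 11. */
    static int monthLength(int year, int month) {
        int length = 29 + (month + 1) % 2;
        if (month == 11 && isCivilLeapYear(year)) {
            length++;
        }
        return length;
    }

    /** Lenient normalization of an out-of-range month into {year, month}. */
    static int[] normalize(int year, int month) {
        return new int[] { year + Math.floorDiv(month, 12), Math.floorMod(month, 12) };
    }

    public static void main(String[] args) {
        System.out.println("1427 leap? " + isCivilLeapYear(1427));
        System.out.println("1428 leap? " + isCivilLeapYear(1428));
        System.out.println("days in 1428, month 11: " + monthLength(1428, 11));
        int[] next = normalize(1428, 12);
        System.out.println("1428, month 12 -> " + next[0] + " " + next[1]);
    }
}
```

Under these rules 1428 is a leap year whose last month has 30 days, and month 12 of 1428 normalizes to month 0 of 1429, regardless of the extra day.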
If this is a bug, is it going to be fixed in the next release? Currently
I'm working with v 3.4.
Thanks in advance
------------------------------
_______________________________________________
icu-support mailing list
icu-support@...
https://lists.sourceforge.net/lists/listinfo/icu-support
End of icu-support Digest, Vol 4, Issue 12
******************************************