Re: Release announcement simplified Chinese translation update

Anthony Wong wrote:
> I suggest 1. to convert all existing Chinese WML files for the Debian
> website from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion
> table to do both TC-SC and SC-TC conversions. This way, we no longer
> need to care which script the translators use and the burden for them to
> use Big5 is lifted.

> Any comments?

Excellent idea. :)
Can we get the MediaWiki conversion table packaged as a library for debian?

Cheers
Arne
--
Arne Götje (高盛華) <arne@...>
PGP/GnuPG key: 1024D/685D1E8C
Fingerprint: 2056 F6B7 DEA8 B478 311F 1C34 6E9F D06E 685D 1E8C
Key available at wwwkeys.pgp.net. Encrypted e-mail preferred.

--
To UNSUBSCRIBE, email to debian-chinese-big5-REQUEST@... with a subject of "unsubscribe". Trouble? Contact listmaster@...
|
|
Re: Release announcement simplified Chinese translation update

On Wed, Feb 18, 2009 at 02:43:11AM +0800, Anthony Wong wrote:
> I have been thinking that using Big5 as the primary encoding for both TC
> (Traditional Chinese) and SC (Simplified Chinese) versions of Debian website
> are detrimental to user contributions. To summarize the current situation of
> the Chinese versions of Debian website, translations must be done in Big5
> WML files, TC version is basically converted simply from WML to HTML, but to
> generate the SC versions, Big5 files must be converted to GB2312 first. It
> is done so due to the one-to-many SC-TC mappings problem. To deal with the
> differences of terms for the same meaning in TC and SC, like 文件 and 檔案, we
> use a simple mapping table written in Perl and for some terms that are
> rarely used, inline WML substitution syntax is used, like
> [CN:文件:][HKTW:檔案:].

> I suggest 1. to convert all existing Chinese WML files for the Debian
> website from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion
> table to do both TC-SC and SC-TC conversions.

Will conditionals as [CN:...] still be used/required?

What dialect (TC resp. SC) do you suggest for committed files? Both is
probably not possible or requires that the build system is extended to
either support a file name suffix to recognize the dialect or a WML
tag such as #use debian::chinese::Traditional_Chinese.

Jens
|
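The inline conditionals asked about above can be illustrated with a minimal Python sketch. It assumes the short form `[NAME:text:]` seen in this thread and takes the slice names per output from the `wml -o` options quoted elsewhere in the thread (`CN`, `CNHK`, `CNTW` for zh-cn, and so on). The function name `resolve_slices` and the regular expression are hypothetical; this is not WML's own slice implementation, only a picture of what the conditionals select.

```python
import re

# Slice names selected for each output, as listed in the `wml -o` options
# quoted in this thread (the UNDEF and ZH@ slices are omitted for brevity).
TARGET_SLICES = {
    "zh-cn": {"CN", "CNHK", "CNTW"},
    "zh-hk": {"HK", "CNHK", "HKTW"},
    "zh-tw": {"TW", "CNTW", "HKTW"},
}

# Matches the short inline form [NAME:text:] (assumed syntax, see lead-in).
SLICE_RE = re.compile(r"\[([A-Z]+):(.*?):\]", re.DOTALL)


def resolve_slices(text, target):
    """Keep inline variants whose slice name applies to `target`, drop the rest."""
    wanted = TARGET_SLICES[target]

    def pick(match):
        name, body = match.group(1), match.group(2)
        return body if name in wanted else ""

    return SLICE_RE.sub(pick, text)


if __name__ == "__main__":
    source = "[CN:文件:][HKTW:檔案:]"
    for target in ("zh-cn", "zh-hk", "zh-tw"):
        print(target, resolve_slices(source, target))
```

Text outside the brackets passes through untouched, so conditionals like these would still be needed for terms that an automatic converter gets wrong.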
|
Re: Release announcement simplified Chinese translation update

Anthony Wong <ypwong@...> writes:

> 2009/2/17 Arne Goetje <arne@...>
>>
>> Matt Kraai wrote:
>> > On Mon, Feb 16, 2009 at 11:58:20PM +0800, Arne Goetje wrote:
>> >> Matt Kraai wrote:
>> >>> We already appear to use a single source version for all three Chinese
>> >>> translations: Big5. Whether it's possible to change to UTF-8 is for
>> >>> someone more familiar with Chinese to say. It's not sufficient to
>> >>> just switch the encoding of this file, though:
>> >>>
>> >>> $ make
>> >>> cd . && wml -q -D CUR_YEAR=2009 -o UNDEFuZH@uCNuCNHKuCNTW:20090214.zh-cn.html.tmp@g+w -o UNDEFuZH@uHKuCNHKuHKTW:20090214.zh-hk.html.tmp@g+w -o UNDEFuZH@uTWuCNTWuHKTW:20090214.zh-tw.html.tmp@g+w --prolog=../../bin/fix_big5.pl 20090214.wml
>> >>> * Converting: [zh_CN.GB2312], /usr/bin/iconv: illegal input sequence at position 233
>> >>> make: *** [20090214.zh-cn.html] Error 1
>> >>>
>> >> Doesn't surprise me. A number of characters which are present in Big5
>> >> are not present in GB2312 (and vice versa). Using iconv to convert those
>> >> characters will lead to such errors.
>> >>
>> >> zh-autoconvert might give better results.
>> >>
>> >> Else, if you can give me the link to the source, then I can take a look.
>> >
>> > Sure, it's available in the webwml CVS module at
>> > chinese/News/2009/20090214.wml. You can find instructions for
>> > accessing the repository at
>> >
>> > http://www.debian.org/devel/website/using_cvs
>> >
>>
>> OK, attached are the results for review.
>>
>> Build-Depends: zh-autoconvert
>>
>> To convert from Big5 into GB2312:
>> autob5 -o gb < 20090214.wml > 20090214_gb2312.wml
>> To convert from Big5 into UTF-8:
>> autob5 -o utf8 < 20090214.wml > 20090214_zht_utf8.wml
>> To convert from Big5 into simplified Chinese UTF-8:
>> autob5 -o gb < 20090214.wml | autogb -o utf8 > 20090214_zhs_utf8.wml
>>
>> I used the latter two commands to generate the attached files.
>>
>> The difference between iconv and zh-autoconvert is that iconv simply
>> tries to convert the codepoints one to one and zh-autoconvert uses a
>> dictionary to map traditional characters to their simplified
>> counterparts. Since the database is quite old, it may not work for
>> simplified <-> traditional mappings where simplified characters have
>> been added later (GBK) or where the document contains HKSCS characters,
>> which use the Big5 Private Use Area. Those characters cannot be converted.
>> I have long wanted to create a new library where a full Unicode
>> compatible mapping takes place. Unfortunately I don't have the time for
>> that. But if there are any volunteers out there, I'm willing to
>> coordinate such a project.
>>
>> Cheers
>> Arne
>
> Hi all,
>
> I have been thinking that using Big5 as the primary encoding for both
> TC (Traditional Chinese) and SC (Simplified Chinese) versions of
> Debian website are detrimental to user contributions. To summarize the
> current situation of the Chinese versions of Debian website,
> translations must be done in Big5 WML files, TC version is basically
> converted simply from WML to HTML, but to generate the SC versions,
> Big5 files must be converted to GB2312 first. It is done so due to the
> one-to-many SC-TC mappings problem. To deal with the differences of
> terms for the same meaning in TC and SC, like 文件 and 檔案, we use a
> simple mapping table written in Perl and for some terms that are
> rarely used, inline WML substitution syntax is used, like
> [CN:文件:][HKTW:檔案:].
>
> This puts a hurdle for SC users to submit translations to Debian,
> because they write in SC but then have to use whatever method to
> convert it to Big5 for submission. And there is also the possibility
> that the converted Big5 file may not contain proper TC
> words/phrases. It also gives people the impression that SC
> contributors are treated like "second-class citizens" (am I too
> sensitive?). Not to mention that Big5 and GB2312 are both considered
> as outdated encodings now and should better be replaced by UTF-8, to
> make the same file accessible to both TC and SC users.
>
> I suggest 1. to convert all existing Chinese WML files for the Debian website
> from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion table to do
> both TC-SC and SC-TC conversions. This way, we no longer need to care which
> script the translators use and the burden for them to use Big5 is lifted.
>
> For MediaWiki's Chinese conversion system, please see:
>
> 1. http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese
> 2. http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/ZhConversion.php?revision=47314&view=markup
> 3. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hant
> 4. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hans
>
> Any comments?
>
> --
> Anthony

It is great to migrate to UTF-8 encoding to ease encoding conversion.
However, I'm a little concerned about the solution for automatic dialect
handling in MediaWiki, which is complicated and possibly error-prone.
It would be good if the inline diversion solution currently in use could be
retained. Plus, several diversions that are synonyms can be unified, as in
the example given above.

Ideas?

Regards,
Deng Xiyue
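Step 1 of the proposal (Big5 to UTF-8) is a pure re-encoding: every character keeps its identity and only the bytes change, so the text stays Traditional. A minimal sketch using Python's built-in codecs follows; it is an alternative illustration, not the zh-autoconvert tool mentioned above, and the function name `big5_to_utf8` is hypothetical. The `big5hkscs` codec is offered as an option on the assumption that some files may contain HKSCS characters.

```python
def big5_to_utf8(data: bytes, codec: str = "big5") -> bytes:
    """Re-encode Big5 bytes as UTF-8 without changing any character.

    Pass codec="big5hkscs" for files that may contain HKSCS characters.
    """
    return data.decode(codec).encode("utf-8")


if __name__ == "__main__":
    big5_bytes = "中國 檔案".encode("big5")
    utf8_bytes = big5_to_utf8(big5_bytes)
    # The text is unchanged; only the byte representation differs.
    assert utf8_bytes.decode("utf-8") == "中國 檔案"
    print(len(big5_bytes), "Big5 bytes ->", len(utf8_bytes), "UTF-8 bytes")
```

Producing Simplified output from such a file is a separate, dictionary-based step, which is where the conversion tables discussed in this thread come in.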
|
|
Re: Release announcement simplified Chinese translation update

On Wed, 2009-02-18 at 02:43 +0800, Anthony Wong wrote:
> I suggest 1. to convert all existing Chinese WML files for the Debian website
> from Big5 to UTF-8
>
> Any comments?
>

Converting everything to UTF-8 may cause a problem. Suppose two users (one
writing Simplified, one writing Traditional) each contribute a translation:

% cat foo.tc
中國

% cat foo.sc
中国

% enca foo.sc foo.tc
foo.sc: Universal transformation format 8 bits; UTF-8
foo.tc: Universal transformation format 8 bits; UTF-8

Converting the Simplified user's translation from UTF-8 to GB2312 works:
~% iconv -f utf8 -t gb2312 foo.sc > foo.sc.gb

But converting the Traditional user's translation from UTF-8 to GB2312 fails:
~% iconv -f utf8 -t gb2312 foo.tc > foo.tc.gb
iconv: illegal input sequence at position 3

Likewise, converting the Simplified user's translation from UTF-8 to Big5
also fails:
~% iconv -f utf8 -t big5 foo.sc > foo.sc.big
iconv: illegal input sequence at position 3

~% iconv -f utf8 -t big5 foo.tc > foo.tc.big

~% enca foo.*
foo.sc: Universal transformation format 8 bits; UTF-8
foo.sc.big: Traditional Chinese Industrial Standard; Big5
foo.sc.gb: Simplified Chinese National Standard; GB2312
foo.tc: Universal transformation format 8 bits; UTF-8
foo.tc.big: Traditional Chinese Industrial Standard; Big5
foo.tc.gb: Simplified Chinese National Standard; GB2312

--
Vern
2009-02-18
|
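The iconv failures shown above come from the target character sets, not from UTF-8: GB2312 has no code for 國 and Big5 has no code for 国. The same check can be sketched with Python's built-in codecs; the helper name `can_encode` is hypothetical, and Python's `gb2312` and `big5` codecs are assumed to cover the same repertoire as the iconv charsets used in the message.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    samples = {"foo.tc": "中國", "foo.sc": "中国"}
    for name, text in samples.items():
        for codec in ("utf-8", "gb2312", "big5"):
            status = "ok" if can_encode(text, codec) else "illegal sequence"
            print(f"{name} -> {codec}: {status}")
```

A charset conversion never turns 國 into 国; it can only succeed or fail, which is why a script conversion table is needed in addition.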
|
Re: Release announcement simplified Chinese translation update

2009/2/18 Vern Sun <s5unty@...>:
> On Wed, 2009-02-18 at 02:43 +0800, Anthony Wong wrote:
>> I suggest 1. to convert all existing Chinese WML files for the Debian website
>> from Big5 to UTF-8
>>
>> Any comments?
>>
> Converting everything to UTF-8 may cause a problem. Suppose two users (one
> writing Simplified, one writing Traditional) each contribute a translation:
>
> % cat foo.tc
> 中國
>
> % cat foo.sc
> 中国
>
> % enca foo.sc foo.tc
> foo.sc: Universal transformation format 8 bits; UTF-8
> foo.tc: Universal transformation format 8 bits; UTF-8
>
> Converting the Simplified user's translation from UTF-8 to GB2312 works:
> ~% iconv -f utf8 -t gb2312 foo.sc > foo.sc.gb
>
> But converting the Traditional user's translation from UTF-8 to GB2312 fails:
> ~% iconv -f utf8 -t gb2312 foo.tc > foo.tc.gb
> iconv: illegal input sequence at position 3
>
> Likewise, converting the Simplified user's translation from UTF-8 to Big5
> also fails:
> ~% iconv -f utf8 -t big5 foo.sc > foo.sc.big
> iconv: illegal input sequence at position 3
>
> ~% iconv -f utf8 -t big5 foo.tc > foo.tc.big
>

I don't understand why we keep clinging to GB2312/Big5. Wouldn't it be
better to use UTF-8 directly?

SC <=> TC conversion should convert only the content; there is no need for
the extra step of converting to an outdated encoding.

---
Dongsheng Song
|
|
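Re: Release announcement simplified Chinese translation update

On Thu, 2009-02-19 at 09:06 +0800, Dongsheng Song wrote:
> I don't understand why we keep clinging to GB2312/Big5. Wouldn't it be
> better to use UTF-8 directly?
> SC <=> TC conversion should convert only the content; there is no need for
> the extra step of converting to an outdated encoding.
>

Suppose a user wants to contribute some translated entries to index.wml, for
example "Debian 是什麼". With the current approach, because the submitted
index.wml is Big5-encoded, iconv can convert "Debian 是什麼" directly into
"Debian 是什么", so both the Simplified and the Traditional HTML files
display correctly.

If index.wml is converted from Big5 to UTF-8, and assuming the generated HTML
files also use UTF-8, the question is how to show "Debian 是什么" in the
Simplified HTML file and "Debian 是什麼" in the Traditional HTML file.

Also consider a UTF-8 index.wml to which a Simplified user and a Traditional
user contribute entries one after the other, each typing in the script they
are used to. index.wml then holds text in both scripts. How do we convert the
WML file into HTML files so that each one uses the correct script?

For example, a Simplified user contributes the entry "<h2>Debian 是什么</h2>"
on line 6 of index.wml, and some months later a Traditional user contributes
the entry "<h2>現在開始</h2>" on line 23. Is there a way to achieve the
following?

* When converting index.wml to index.zh-cn.html,
  convert line 23 to "<h2>现在开始</h2>".
  Leave line 6 unchanged.

* When converting index.wml to index.zh-tw.html/index.zh-hk.html,
  convert line 6 to "<h2>Debian 是什麼</h2>".
  Leave line 23 unchanged.

Once this problem is solved, the WML files can be migrated from Big5 to UTF-8.

--
Vern
2009-02-19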
|
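The line 6 / line 23 scenario above can be sketched with a per-character table applied to every line, whatever script it was written in. Characters already in the target script are simply not in the table, so they pass through unchanged. The tables below are toy tables covering only the characters in the example, and the names `T2S`, `S2T` and `convert_line` are hypothetical; a real table, such as the MediaWiki one proposed in this thread, is far larger.

```python
# Toy tables covering only the characters in the example above.
T2S = str.maketrans({"現": "现", "開": "开", "麼": "么"})
S2T = str.maketrans({"现": "現", "开": "開", "么": "麼"})


def convert_line(line: str, target: str) -> str:
    """Convert one line to the script used by `target` ("zh-cn" or other)."""
    table = T2S if target == "zh-cn" else S2T
    return line.translate(table)


if __name__ == "__main__":
    line6 = "<h2>Debian 是什么</h2>"  # contributed in Simplified
    line23 = "<h2>現在開始</h2>"      # contributed in Traditional
    for target in ("zh-cn", "zh-tw"):
        print(target, convert_line(line6, target), convert_line(line23, target))
```

A per-character table cannot resolve one-to-many mappings (for example, Simplified 发 corresponds to both 發 and 髮), so phrase-level tables or inline conditionals are still needed for such cases.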
|
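Re: Release announcement simplified Chinese translation update

2009/2/19 Vern Sun <s5unty@...>:
> On Thu, 2009-02-19 at 09:06 +0800, Dongsheng Song wrote:
>> I don't understand why we keep clinging to GB2312/Big5. Wouldn't it be
>> better to use UTF-8 directly?
>> SC <=> TC conversion should convert only the content; there is no need for
>> the extra step of converting to an outdated encoding.
>>
>
> Suppose a user wants to contribute some translated entries to index.wml, for
> example "Debian 是什麼". With the current approach, because the submitted
> index.wml is Big5-encoded, iconv can convert "Debian 是什麼" directly into
> "Debian 是什么", so both the Simplified and the Traditional HTML files
> display correctly.
>
> If index.wml is converted from Big5 to UTF-8, and assuming the generated HTML
> files also use UTF-8, the question is how to show "Debian 是什么" in the
> Simplified HTML file and "Debian 是什麼" in the Traditional HTML file.
>
> Also consider a UTF-8 index.wml to which a Simplified user and a Traditional
> user contribute entries one after the other, each typing in the script they
> are used to. index.wml then holds text in both scripts. How do we convert the
> WML file into HTML files so that each one uses the correct script?
>
> For example, a Simplified user contributes the entry "<h2>Debian 是什么</h2>"
> on line 6 of index.wml, and some months later a Traditional user contributes
> the entry "<h2>現在開始</h2>" on line 23. Is there a way to achieve the
> following?
>
> * When converting index.wml to index.zh-cn.html,
>   convert line 23 to "<h2>现在开始</h2>".
>   Leave line 6 unchanged.
>
> * When converting index.wml to index.zh-tw.html/index.zh-hk.html,
>   convert line 6 to "<h2>Debian 是什麼</h2>".
>   Leave line 23 unchanged.
>
> Once this problem is solved, the WML files can be migrated from Big5 to UTF-8.
> --
> Vern
> 2009-02-19
>

At present everyone seems to be forced to contribute in Traditional (Big5),
so why do we now have to consider mixing Simplified and Traditional?

My suggestion is that each *.zh.wml use only Traditional or only Simplified,
depending on the choice of its main contributor.

For example, SC contributors use Simplified and TC contributors use
Traditional (there is no need to auto-detect the encoding; just add a marker,
similar to the way the corresponding English version information is
recorded). If an SC contributor stops maintaining a WML file and a TC
contributor takes over, the TC contributor decides whether to keep using
Simplified or to switch to Traditional, and vice versa.

Simplified/Traditional conversion cannot simply be done with iconv; we should
consider the MediaWiki conversion method discussed earlier. Of course, iconv
can be used for now as a stopgap.

Requiring everyone to use Traditional considerably discourages SC
contributors. It may also lower the quality of the Simplified text after
conversion (the correspondence between Simplified and Traditional is not
one-to-one) and lose characters (the GB2312 character set is too small).

---
Dongsheng Song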
|
|
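Re: Release announcement simplified Chinese translation update

On Thu, 2009-02-19 at 14:24 +0800, Dongsheng Song wrote:
> At present everyone seems to be forced to contribute in Traditional (Big5),
> so why do we now have to consider mixing Simplified and Traditional?
>

If we standardize on UTF-8, the mixed-script problem has to be considered.
Imagine a WML file with a lot of content, most of whose entries were
originally contributed by a Traditional user. A month or two later, a
Simplified user also wants to contribute to that file. This user probably
cares only about his own entries, so he does not realize that the entries he
wrote in Simplified Chinese have left the WML file with a mix of Simplified
and Traditional text.

> My suggestion is that each *.zh.wml use only Traditional or only Simplified,
> depending on the choice of its main contributor.
>

These three HTML files (*.zh-cn.html/*.zh-tw.html/*.zh-hk.html) are all
generated from a single *.wml file.

So which is it, Simplified or Traditional? Quite possibly A.wml uses
Simplified and B.wml uses Traditional. How do we handle that when generating
the HTML files? That is the question we need to discuss.

> For example, SC contributors use Simplified and TC contributors use
> Traditional (there is no need to auto-detect the encoding; just add a marker,
> similar to the way the corresponding English version information is
> recorded). If an SC contributor stops maintaining a WML file and a TC
> contributor takes over, the TC contributor decides whether to keep using
> Simplified or to switch to Traditional, and vice versa.
>

Contributing translations is not like maintaining a package; there is no such
thing as taking over. Different people work on a translation in turn.

> Simplified/Traditional conversion cannot simply be done with iconv; we should
> consider the MediaWiki conversion method discussed earlier. Of course, iconv
> can be used for now as a stopgap.
>

I am not familiar with MediaWiki. My understanding is that MediaWiki only
handles the conversion of idiomatic terms between Simplified and Traditional
Chinese; for example, "软件" in Simplified Chinese should be "軟體" in
Traditional Chinese. But can we rely on MediaWiki to convert every Simplified
character into its Traditional form, for example "国" into "國" and "么" into
"麼"?

> Requiring everyone to use Traditional considerably discourages SC
> contributors. It may also lower the quality of the Simplified text after
> conversion (the correspondence between Simplified and Traditional is not
> one-to-one) and lose characters (the GB2312 character set is too small).
>

I don't think the current translation mechanism discourages contributors;
after all, it was adopted for lack of anything better. Otherwise we would not
be having today's discussion.

--
Vern
2009-02-19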
|
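The question above, whether one converter can handle both idiomatic terms ("软件" to "軟體") and single characters ("国" to "國"), can be pictured with a small sketch that tries a phrase table first and falls back to a character table. This is one possible approach under stated assumptions, not MediaWiki's actual implementation; the tables hold only the examples from this thread, and the names `PHRASES_S2T`, `CHARS_S2T` and `to_traditional` are hypothetical.

```python
# Hypothetical, tiny tables; the entries are the examples from this thread.
PHRASES_S2T = {"软件": "軟體"}
CHARS_S2T = {"国": "國", "么": "麼", "软": "軟"}


def to_traditional(text: str) -> str:
    """Convert Simplified text, preferring the longest phrase match."""
    max_len = max(map(len, PHRASES_S2T), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in PHRASES_S2T:
                out.append(PHRASES_S2T[chunk])
                i += size
                break
        else:
            # No phrase matched: fall back to the per-character table.
            out.append(CHARS_S2T.get(text[i], text[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_traditional("中国软件"))
```

Without the phrase table, "软件" would become "軟件" character by character; the phrase entry is what produces the regional term "軟體".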
|
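Re: Release announcement simplified Chinese translation update

On Thu, Feb 19 2009, Vern Sun wrote:

> On Thu, 2009-02-19 at 14:24 +0800, Dongsheng Song wrote:
>> My suggestion is that each *.zh.wml use only Traditional or only Simplified,
>> depending on the choice of its main contributor.
>>
> These three HTML files (*.zh-cn.html/*.zh-tw.html/*.zh-hk.html) are all
> generated from a single *.wml file.
>
> So which is it, Simplified or Traditional? Quite possibly A.wml uses
> Simplified and B.wml uses Traditional. How do we handle that when generating
> the HTML files? That is the question we need to discuss.
>

Once we standardize on UTF-8, I think automatic conversion would be best. The
source .wml may then contain both Simplified and Traditional text, and the
contributors' writing habits no longer need to be taken into account.

>> For example, SC contributors use Simplified and TC contributors use
>> Traditional (there is no need to auto-detect the encoding; just add a marker,
>> similar to the way the corresponding English version information is
>> recorded). If an SC contributor stops maintaining a WML file and a TC
>> contributor takes over, the TC contributor decides whether to keep using
>> Simplified or to switch to Traditional, and vice versa.
>>
> Contributing translations is not like maintaining a package; there is no such
> thing as taking over. Different people work on a translation in turn.
>
>> Simplified/Traditional conversion cannot simply be done with iconv; we should
>> consider the MediaWiki conversion method discussed earlier. Of course, iconv
>> can be used for now as a stopgap.
>>
> I am not familiar with MediaWiki. My understanding is that MediaWiki only
> handles the conversion of idiomatic terms between Simplified and Traditional
> Chinese; for example, "软件" in Simplified Chinese should be "軟體" in
> Traditional Chinese. But can we rely on MediaWiki to convert every Simplified
> character into its Traditional form, for example "国" into "國" and "么" into
> "麼"?
>

In addition, combined with WML's existing language tag mechanism, it should be
possible to correct semantic errors made by the automatic conversion, so that
the generated files are more accurate.

>> Requiring everyone to use Traditional considerably discourages SC
>> contributors. It may also lower the quality of the Simplified text after
>> conversion (the correspondence between Simplified and Traditional is not
>> one-to-one) and lose characters (the GB2312 character set is too small).
>>
> I don't think the current translation mechanism discourages contributors;
> after all, it was adopted for lack of anything better. Otherwise we would not
> be having today's discussion.

I think it does have some effect; otherwise we would not need to consider
moving to UTF-8 now :)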
|
|
Re: Release announcement simplified Chinese translation update

On Wed, Feb 18 2009, Arne Goetje wrote:

> Anthony Wong wrote:
> > I suggest 1. to convert all existing Chinese WML files for the Debian
>> website from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion
>> table to do both TC-SC and SC-TC conversions. This way, we no longer
>> need to care which script the translators use and the burden for them to
>> use Big5 is lifted.
>
>> Any comments?
>
> Excellent idea. :)
> Can we get the MediaWiki conversion table packaged as a library for debian?
>
> Cheers
> Arne

Command-line interface or library interface?

Another source for a conversion table that I'd like to suggest is
TongWenTang[1,2], a JavaScript-based Firefox add-on that converts web pages
on the fly.

I have written a simple script, tongwen.py[3], that imports the table from
TongWenTang and converts a file. It's usable, though other similar projects
may exist :)

- Kanru

[1]: http://tongwen.mozdev.org/
[2]: http://of.openfoundry.org/projects/333/rt
[3]: http://github.com/kanru/tongwen-py/
|
|
Re: Release announcement simplified Chinese translation update

On Wed, Feb 18 2009, Jens Seidel wrote:

> On Wed, Feb 18, 2009 at 02:43:11AM +0800, Anthony Wong wrote:
>> I have been thinking that using Big5 as the primary encoding for both TC
>> (Traditional Chinese) and SC (Simplified Chinese) versions of Debian website
>> are detrimental to user contributions. To summarize the current situation of
>> the Chinese versions of Debian website, translations must be done in Big5
>> WML files, TC version is basically converted simply from WML to HTML, but to
>> generate the SC versions, Big5 files must be converted to GB2312 first. It
>> is done so due to the one-to-many SC-TC mappings problem. To deal with the
>> differences of terms for the same meaning in TC and SC, like 文件 and 檔案, we
>> use a simple mapping table written in Perl and for some terms that are
>> rarely used, inline WML substitution syntax is used, like
>> [CN:文件:][HKTW:檔案:].
>
>> I suggest 1. to convert all existing Chinese WML files for the Debian
>> website from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion
>> table to do both TC-SC and SC-TC conversions.
>
> Will conditionals as [CN:...] still be used/required?
>
> What dialect (TC resp. SC) do you suggest for committed files? Both is
> probably not possible or requires that the build system is extended to
> either support a file name suffix to recognize the dialect or a WML
> tag such as #use debian::chinese::Traditional_Chinese.
>

If we migrate to a UTF-8 based build system, then both dialects can be used
in one file.

- Kanru