Re: Release announcement simplified Chinese translation update

by Anthony Wong

2009/2/17 Arne Goetje <arne@...>
>
> Matt Kraai wrote:
> > On Mon, Feb 16, 2009 at 11:58:20PM +0800, Arne Goetje wrote:
> >> Matt Kraai wrote:
> >>> We already appear to use a single source version for all three Chinese
> >>> translations: Big5.  Whether it's possible to change to UTF-8 is for
> >>> someone more familiar with Chinese to say.  It's not sufficient to
> >>> just switch the encoding of this file, though:
> >>>
> >>>  $ make
> >>>  cd . && wml -q -D CUR_YEAR=2009 -o UNDEFuZH@uCNuCNHKuCNTW:20090214.zh-cn.html.tmp@g+w -o UNDEFuZH@uHKuCNHKuHKTW:20090214.zh-hk.html.tmp@g+w -o UNDEFuZH@uTWuCNTWuHKTW:20090214.zh-tw.html.tmp@g+w --prolog=../../bin/fix_big5.pl  20090214.wml
> >>>   * Converting: [zh_CN.GB2312], /usr/bin/iconv: illegal input sequence at position 233
> >>>  make: *** [20090214.zh-cn.html] Error 1
> >>>
> >> Doesn't surprise me. A number of characters which are present in Big5
> >> are not present in GB2312 (and vice versa). Using iconv to convert those
> >> characters will lead to such errors.
> >>
> >> zh-autoconvert might give better results.
> >>
> >> Else, if you can give me the link to the source, then I can take a look.
> >
> > Sure, it's available in the webwml CVS module at
> > chinese/News/2009/20090214.wml.  You can find instructions for
> > accessing the repository at
> >
> >  http://www.debian.org/devel/website/using_cvs
> >
>
> OK, attached are the results for review.
>
> Build-Depends: zh-autoconvert
>
> To convert from Big5 into GB2312:
>        autob5 -o gb < 20090214.wml > 20090214_gb2312.wml
> To convert from Big5 into UTF-8:
>        autob5 -o utf8 < 20090214.wml > 20090214_zht_utf8.wml
> To convert from Big5 into simplified Chinese UTF-8:
>        autob5 -o gb < 20090214.wml | autogb -o utf8 > 20090214_zhs_utf8.wml
>
> I used the latter two commands to generate the attached files.
>
> The difference between iconv and zh-autoconvert is that iconv simply
> tries to convert the codepoints one to one and zh-autoconvert uses a
> dictionary to map traditional characters to their simplified
> counterparts. Since the database is quite old, it may not work for
> simplified <-> traditional mappings where simplified characters have
> been added later (GBK) or where the document contains HKSCS characters,
> which use the Big5 Private Use Area. Those characters cannot be converted.
> I have long wanted to create a new library where a full Unicode
> compatible mapping takes place. Unfortunately I don't have the time for
> that. But if there are any volunteers out there, I'm willing to
> coordinate such a project.
>
> Cheers
> Arne

Hi all,

I have been thinking that using Big5 as the primary encoding for both the TC (Traditional Chinese) and SC (Simplified Chinese) versions of the Debian website is detrimental to user contributions. To summarize the current situation: translations must be written in Big5 WML files. The TC version is essentially converted directly from WML to HTML, but to generate the SC version, the Big5 files must first be converted to GB2312. It is done this way because of the one-to-many SC-to-TC mapping problem. To deal with terms that differ between TC and SC for the same meaning, like 文件 and 檔案, we use a simple mapping table written in Perl, and for terms that are rarely used we use inline WML substitution syntax, like [CN:文件:][HKTW:檔案:].
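
As a rough illustration of the inline substitution syntax mentioned above, the following Python sketch keeps or drops `[NAME:text:]` spans depending on the output variant. The slice groupings are read off the `-o` options of the `wml` command line quoted earlier in this message. The regular expression and the function name `resolve_slices` are assumptions made for illustration; this is not how WML itself implements slices.

```python
import re

# Slice names selected for each output file, read off the -o options of the
# wml invocation quoted above (e.g. zh-cn selects CN, CNHK and CNTW).
SLICES = {
    "zh-cn": {"CN", "CNHK", "CNTW"},
    "zh-hk": {"HK", "CNHK", "HKTW"},
    "zh-tw": {"TW", "CNTW", "HKTW"},
}

# Hypothetical pattern for an inline span such as [CN:文件:]; real WML slice
# parsing is more general than this.
SPAN = re.compile(r"\[([A-Z]+):(.*?):\]")


def resolve_slices(text: str, variant: str) -> str:
    """Keep the body of spans selected for `variant` and drop the others."""
    wanted = SLICES[variant]
    return SPAN.sub(lambda m: m.group(2) if m.group(1) in wanted else "", text)


source = "下載[CN:文件:][HKTW:檔案:]"
for variant in ("zh-cn", "zh-hk", "zh-tw"):
    print(variant, resolve_slices(source, variant))
```

With this toy input, the zh-cn output keeps 文件 while the zh-hk and zh-tw outputs keep 檔案.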

This creates a hurdle for SC users who want to submit translations to Debian, because they write in SC but then have to convert the text to Big5, by whatever method they can find, before submitting it. There is also the possibility that the converted Big5 file does not contain proper TC words and phrases. It also gives people the impression that SC contributors are treated like "second-class citizens" (am I too sensitive?). Not to mention that Big5 and GB2312 are both considered outdated encodings now and would better be replaced by UTF-8, which makes the same file accessible to both TC and SC users.

I suggest 1. to convert all existing Chinese WML files for the Debian website from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion table to do both TC-SC and SC-TC conversions. This way, we no longer need to care which script the translators use and the burden for them to use Big5 is lifted.

For MediaWiki's Chinese conversion system, please see:
  1. http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese
  2. http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/ZhConversion.php?revision=47314&view=markup
  3. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hant
  4. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hans
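
To make the one-to-many mapping problem concrete, here is a minimal Python sketch of table-driven conversion with greedy longest-match lookup, in the spirit of a phrase-table converter. The table entries are illustrative only and are not taken from MediaWiki's tables, and the function `convert` is a hypothetical helper, not MediaWiki's implementation.

```python
# Illustrative entries only; a real conversion table is far larger.
SC_TO_TC = {
    "软件": "軟體",  # phrase-level rule: the term itself differs
    "国": "國",      # character-level rules
    "现": "現",
    "发": "發",
    "头发": "頭髮",  # a longer match overrides the single-character rule for 发
}


def convert(text: str, table: dict) -> str:
    """Greedy longest-match replacement, scanning left to right."""
    longest = max(map(len, table))
    out = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            out.append(text[i])  # no rule applies; copy the character through
            i += 1
    return "".join(out)


print(convert("中国的软件", SC_TO_TC))
print(convert("头发", SC_TO_TC), convert("发现", SC_TO_TC))
```

The character 发 shows why a plain character-for-character mapping is not enough: it becomes 髮 in 头发 but 發 in 发现, so the phrase entry has to win over the single-character rule.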

Any comments?

--
Anthony

Re: Release announcement simplified Chinese translation update

by Arne Goetje

Anthony Wong wrote:
> I suggest 1. to convert all existing Chinese WML files for the Debian
> website from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion
> table to do both TC-SC and SC-TC conversions. This way, we no longer
> need to care which script the translators use and the burden for them to
> use Big5 is lifted.

> Any comments?

Excellent idea. :)
Can we get the MediaWiki conversion table packaged as a library for Debian?

Cheers
Arne
--
Arne Götje (高盛華) <arne@...>
PGP/GnuPG key: 1024D/685D1E8C
Fingerprint: 2056 F6B7 DEA8 B478 311F  1C34 6E9F D06E 685D 1E8C
Key available at wwwkeys.pgp.net.   Encrypted e-mail preferred.



--
To UNSUBSCRIBE, email to debian-chinese-big5-REQUEST@...
with a subject of "unsubscribe". Trouble? Contact listmaster@...


Re: Release announcement simplified Chinese translation update

by Jens Seidel

On Wed, Feb 18, 2009 at 02:43:11AM +0800, Anthony Wong wrote:

> I have been thinking that using Big5 as the primary encoding for both TC
> (Traditional Chinese) and SC (Simplified Chinese) versions of Debian website
> are detrimental to user contributions. To summarize the current situation of
> the Chinese versions of Debian website, translations must be done in Big5
> WML files, TC version is basically converted simply from WML to HTML, but to
> generate the SC versions, Big5 files must be converted to GB2312 first. It
> is done so due to the one-to-many SC-TC mappings problem. To deal with the
> differences of terms for the same meaning in TC and SC, like 文件 and 檔案, we
> use a simple mapping table written in Perl and for some terms that are
> rarely used, inline WML substitution syntax is used, like
> [CN:文件:][HKTW:檔案:].
 
> I suggest 1. to convert all existing Chinese WML files for the Debian
> website from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion
> table to do both TC-SC and SC-TC conversions.

Will conditionals such as [CN:...] still be used/required?

Which dialect (TC or SC) do you suggest for committed files? Supporting both
is probably not possible, or it requires that the build system be extended to
support either a file name suffix that identifies the dialect or a WML
tag such as #use debian::chinese::Traditional_Chinese.

Jens


Re: Release announcement simplified Chinese translation update

by Deng Xiyue

Anthony Wong <ypwong@...> writes:

> Hi all,
>
> I have been thinking that using Big5 as the primary encoding for both
> TC (Traditional Chinese) and SC (Simplified Chinese) versions of
> Debian website are detrimental to user contributions. To summarize the
> current situation of the Chinese versions of Debian website,
> translations must be done in Big5 WML files, TC version is basically
> converted simply from WML to HTML, but to generate the SC versions,
> Big5 files must be converted to GB2312 first. It is done so due to the
> one-to-many SC-TC mappings problem. To deal with the differences of
> terms for the same meaning in TC and SC, like 文件 and 檔案, we use a
> simple mapping table written in Perl and for some terms that are
> rarely used, inline WML substitution syntax is used, like
> [CN:文件:][HKTW:檔案:].
>
> This puts a hurdle for SC users to submit translations to Debian,
> because they write in SC but then have to use whatever method to
> convert it to Big5 for submission. And there is also the possibility
> that the converted Big5 file may not contain proper TC
> words/phrases. It also gives people the impression that SC
> contributors are treated like "second-class citizens" (am I too
> sensitive?).  Not to mention that Big5 and GB2312 are both considered
> as outdated encodings now and should better be replaced by UTF-8, to
> make the same file accessible to both TC and SC users.
>
> I suggest 1. to convert all existing Chinese WML files for the Debian website
> from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion table to do
> both TC-SC and SC-TC conversions. This way, we no longer need to care which
> script the translators use and the burden for them to use Big5 is lifted.
>
> For MediaWiki's Chinese conversion system, please see:
>
>  1. http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese
>  2. http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/ZhConversion.php?revision=47314&view=markup
>  3. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hant
>  4. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hans
>
>
> Any comments?
>
> --
> Anthony

Migrating to UTF-8 would be great, as it eases encoding conversion.
However, I'm a little concerned about MediaWiki's approach to automatic
dialect handling, which is complicated and possibly error-prone.
It would be good if the inline diversion solution currently in use could be
retained.  Also, several diversions that are synonyms could be unified, as
in the example given above.  Ideas?

Regards,
Deng Xiyue


Re: Release announcement simplified Chinese translation update

by Vern Sun

On Wed, 2009-02-18 at 02:43 +0800, Anthony Wong wrote:
> I suggest 1. to convert all existing Chinese WML files for the Debian website
> from Big5 to UTF-8
>
> Any comments?
>
Converting everything to UTF-8 may cause problems. Suppose two users, one
writing Simplified Chinese and one writing Traditional Chinese, each contribute
a translation:

% cat foo.tc
中國

% cat foo.sc
中国

% enca foo.sc foo.tc
foo.sc: Universal transformation format 8 bits; UTF-8
foo.tc: Universal transformation format 8 bits; UTF-8

Converting the Simplified user's translation from UTF-8 to GB2312 works:
~% iconv -f utf8 -t gb2312 foo.sc > foo.sc.gb

But converting the Traditional user's translation from UTF-8 to GB2312 fails:
~% iconv -f utf8 -t gb2312 foo.tc > foo.tc.gb
iconv: illegal input sequence at position 3

Likewise, converting the Simplified user's translation from UTF-8 to Big5 also
fails:
~% iconv -f utf8 -t big5   foo.sc > foo.sc.big
iconv: illegal input sequence at position 3

~% iconv -f utf8 -t big5   foo.tc > foo.tc.big

~% enca foo.*
foo.sc: Universal transformation format 8 bits; UTF-8
foo.sc.big: Traditional Chinese Industrial Standard; Big5
foo.sc.gb: Simplified Chinese National Standard; GB2312
foo.tc: Universal transformation format 8 bits; UTF-8
foo.tc.big: Traditional Chinese Industrial Standard; Big5
foo.tc.gb: Simplified Chinese National Standard; GB2312
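
The same behaviour can be reproduced without iconv. The sketch below uses Python's built-in `gb2312` and `big5` codecs; the helper name `try_encode` is only for illustration. It checks whether each of the two sample strings can be represented in each encoding.

```python
def try_encode(text: str, encoding: str) -> str:
    """Report whether `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return "ok"
    except UnicodeEncodeError as err:
        # err.object[err.start:err.end] is the part that could not be encoded.
        return f"fails on {err.object[err.start:err.end]!r}"


for text in ("中国", "中國"):
    for encoding in ("gb2312", "big5", "utf-8"):
        print(text, encoding, try_encode(text, encoding))
```

Both strings encode to UTF-8, but 國 has no GB2312 code and 国 has no Big5 code, which is why a pure encoding conversion cannot stand in for a Simplified/Traditional conversion.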

--
Vern
2009-02-18


Re: Release announcement simplified Chinese translation update

by Dongsheng Song

2009/2/18 Vern Sun <s5unty@...>:

> On Wed, 2009-02-18 at 02:43 +0800, Anthony Wong wrote:
>> I suggest 1. to convert all existing Chinese WML files for the Debian website
>> from Big5 to UTF-8
>>
>> Any comments?
>>
> Converting everything to UTF-8 may cause problems. Suppose two users, one
> writing Simplified Chinese and one writing Traditional Chinese, each contribute
> a translation:
>
> % cat foo.tc
> 中國
>
> % cat foo.sc
> 中国
>
> % enca foo.sc foo.tc
> foo.sc: Universal transformation format 8 bits; UTF-8
> foo.tc: Universal transformation format 8 bits; UTF-8
>
> Converting the Simplified user's translation from UTF-8 to GB2312 works:
> ~% iconv -f utf8 -t gb2312 foo.sc > foo.sc.gb
>
> But converting the Traditional user's translation from UTF-8 to GB2312 fails:
> ~% iconv -f utf8 -t gb2312 foo.tc > foo.tc.gb
> iconv: illegal input sequence at position 3
>
> Likewise, converting the Simplified user's translation from UTF-8 to Big5 also
> fails:
> ~% iconv -f utf8 -t big5   foo.sc > foo.sc.big
> iconv: illegal input sequence at position 3
>
> ~% iconv -f utf8 -t big5   foo.tc > foo.tc.big
>

I don't understand why we are still clinging to GB2312/Big5. Wouldn't it be
better to use UTF-8 directly? SC <=> TC conversion should only convert the
content; there is no need for the extra step of converting to an outdated
encoding.

---
Dongsheng Song

Re: Release announcement simplified Chinese translation update

by Vern Sun

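On Thu, 2009-02-19 at 09:06 +0800, Dongsheng Song wrote:
> I don't understand why we are still clinging to GB2312/Big5. Wouldn't it be
> better to use UTF-8 directly? SC <=> TC conversion should only convert the
> content; there is no need for the extra step of converting to an outdated
> encoding.
>

Suppose a user wants to contribute some translated entries to index.wml, for
example "Debian 是什麼". With the current process, because the submitted
index.wml is Big5-encoded, iconv can directly convert "Debian 是什麼" into
"Debian 是什么", so the Simplified and Traditional HTML files both display
correctly.

If index.wml is converted from Big5 to UTF-8, and we assume that the generated
HTML files are also UTF-8, the question is how to display "Debian 是什么" in
the Simplified HTML file and "Debian 是什麼" in the Traditional HTML file.

Also consider a UTF-8 index.wml to which a Simplified user and a Traditional
user contribute two translated entries, one after the other, each typing in the
script they are used to (Simplified or Traditional). The index.wml file then
contains text in both Simplified and Traditional Chinese. How do we convert the
wml file into html files that each use the correct script?

For example, a Simplified user contributes the entry "<h2>Debian 是什么</h2>"
on line 6 of index.wml, and some months later a Traditional user contributes
the entry "<h2>現在開始</h2>" on line 23 of index.wml. Is there a way to
achieve the following?

* When converting index.wml to index.zh-cn.html,
  convert line 23 to "<h2>现在开始</h2>"
  and leave line 6 unchanged.

* When converting index.wml to index.zh-tw.html/index.zh-hk.html,
  convert line 6 to "<h2>Debian 是什麼</h2>"
  and leave line 23 unchanged.

Once this problem is solved, the wml files can be migrated from Big5 to UTF-8.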
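
One possible way to look at the line 6 / line 23 question: a conversion *to* a given script leaves text that is already in that script untouched, so the same conversion can be applied to every line without knowing who wrote it. The Python sketch below illustrates this with a toy table that covers only the characters in the example; the table and the helper `to_variant` are assumptions for illustration, and a real converter would need a full table plus phrase-level rules for one-to-many mappings.

```python
# Tiny illustrative table covering only the characters in the example above.
TC_TO_SC = {"國": "国", "麼": "么", "現": "现", "開": "开"}
# Inverting is valid only because this toy table happens to be one-to-one.
SC_TO_TC = {sc: tc for tc, sc in TC_TO_SC.items()}


def to_variant(line: str, variant: str) -> str:
    """Convert a line to Simplified for zh-cn, to Traditional otherwise."""
    table = TC_TO_SC if variant == "zh-cn" else SC_TO_TC
    return line.translate(str.maketrans(table))


line6 = "<h2>Debian 是什么</h2>"  # contributed in Simplified
line23 = "<h2>現在開始</h2>"      # contributed in Traditional

for variant in ("zh-cn", "zh-tw"):
    print(variant, to_variant(line6, variant), to_variant(line23, variant))
```

For zh-cn, line 6 comes out unchanged and line 23 is converted; for zh-tw it is the other way round.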
--
Vern
2009-02-19


Re: Release announcement simplified Chinese translation update

by Dongsheng Song

2009/2/19 Vern Sun <s5unty@...>:

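> On Thu, 2009-02-19 at 09:06 +0800, Dongsheng Song wrote:
>> I don't understand why we are still clinging to GB2312/Big5. Wouldn't it be
>> better to use UTF-8 directly? SC <=> TC conversion should only convert the
>> content; there is no need for the extra step of converting to an outdated
>> encoding.
>>
>
> Suppose a user wants to contribute some translated entries to index.wml, for
> example "Debian 是什麼". With the current process, because the submitted
> index.wml is Big5-encoded, iconv can directly convert "Debian 是什麼" into
> "Debian 是什么", so the Simplified and Traditional HTML files both display
> correctly.
>
> If index.wml is converted from Big5 to UTF-8, and we assume that the generated
> HTML files are also UTF-8, the question is how to display "Debian 是什么" in
> the Simplified HTML file and "Debian 是什麼" in the Traditional HTML file.
>
> Also consider a UTF-8 index.wml to which a Simplified user and a Traditional
> user contribute two translated entries, one after the other, each typing in the
> script they are used to (Simplified or Traditional). The index.wml file then
> contains text in both Simplified and Traditional Chinese. How do we convert the
> wml file into html files that each use the correct script?
>
> For example, a Simplified user contributes the entry "<h2>Debian 是什么</h2>"
> on line 6 of index.wml, and some months later a Traditional user contributes
> the entry "<h2>現在開始</h2>" on line 23 of index.wml. Is there a way to
> achieve the following?
>
> * When converting index.wml to index.zh-cn.html,
>  convert line 23 to "<h2>现在开始</h2>"
>  and leave line 6 unchanged.
>
> * When converting index.wml to index.zh-tw.html/index.zh-hk.html,
>  convert line 6 to "<h2>Debian 是什麼</h2>"
>  and leave line 23 unchanged.
>
> Once this problem is solved, the wml files can be migrated from Big5 to UTF-8.
> --
> Vern
> 2009-02-19
>

At the moment everyone seems to be forced to contribute in Traditional Chinese
(Big5), so why do we now have to consider mixing Simplified and Traditional in
one file?

My opinion is that each *.zh.wml should use only Traditional or only
Simplified, depending on the choice of its main contributor.

For example, SC contributors use Simplified and TC contributors use Traditional
(there is no need to auto-detect the encoding; just add a tag, similar to how
the corresponding English version information is recorded). If an SC
contributor stops maintaining a wml file and a TC contributor takes over, the
TC contributor decides whether to continue in Simplified or switch to
Traditional, and vice versa.

Simplified/Traditional conversion cannot simply be done with iconv; we should
consider the MediaWiki conversion method discussed earlier. Of course, iconv
can be used for now as a stopgap.

Requiring everyone to use Traditional Chinese considerably dampens the
enthusiasm of SC contributors. It may also lower the quality of the Simplified
text after conversion (the mapping between Simplified and Traditional is not
one-to-one) and lose characters (the GB2312 character set is too small).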
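
A sketch of the "just add a tag" idea, under loud assumptions: the marker syntax is modelled on the `#use debian::chinese::Traditional_Chinese` tag floated earlier in the thread, the `Simplified_Chinese` counterpart is invented here by analogy, and no such WML module is known to exist. Files without a marker are treated as Traditional, matching the current Big5 sources.

```python
import re

# Hypothetical per-file marker; both tag names are assumptions, not real WML modules.
TAG = re.compile(
    r"^#use\s+debian::chinese::(Traditional|Simplified)_Chinese\s*$", re.M
)


def declared_script(wml_source: str, default: str = "Traditional") -> str:
    """Return the script a file declares, or `default` if it has no marker."""
    match = TAG.search(wml_source)
    return match.group(1) if match else default


def needs_conversion(wml_source: str, variant: str) -> bool:
    """True if the declared script differs from what `variant` should display."""
    target = "Simplified" if variant == "zh-cn" else "Traditional"
    return declared_script(wml_source) != target


sc_file = "#use debian::chinese::Simplified_Chinese\n<h2>Debian 是什么</h2>\n"
for variant in ("zh-cn", "zh-hk", "zh-tw"):
    print(variant, needs_conversion(sc_file, variant))
```

With a marker like this the build would not have to guess the script of a file; it would only convert when the declared script differs from the target variant.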

---
Dongsheng Song

Re: Release announcement simplified Chinese translation update

by Vern Sun

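On Thu, 2009-02-19 at 14:24 +0800, Dongsheng Song wrote:
> At the moment everyone seems to be forced to contribute in Traditional Chinese
> (Big5), so why do we now have to consider mixing Simplified and Traditional in
> one file?
>
If we standardize on UTF-8, the problem of mixed Simplified and Traditional
text does need to be considered. Imagine a wml file with a lot of content,
where most of the translated entries were originally contributed by a
Traditional user. A month or two later, a Simplified user also wants to
contribute to that wml file. This user probably cares only about his own
entries, so he does not realize that the entries he wrote in Simplified Chinese
have turned the wml file into a mix of Simplified and Traditional.

> My opinion is that each *.zh.wml should use only Traditional or only
> Simplified, depending on the choice of its main contributor.
>
The three html files (*.zh-cn.html/*.zh-tw.html/*.zh-hk.html) are all
generated from a single *.wml file.

So should it be Simplified or Traditional? It is quite possible that A.wml uses
Simplified and B.wml uses Traditional, so how is that handled when the html
files are generated? That is the problem we need to discuss.

> For example, SC contributors use Simplified and TC contributors use Traditional
> (there is no need to auto-detect the encoding; just add a tag, similar to how
> the corresponding English version information is recorded). If an SC
> contributor stops maintaining a wml file and a TC contributor takes over, the
> TC contributor decides whether to continue in Simplified or switch to
> Traditional, and vice versa.
>
Contributing translations is not like maintaining a package; there is no such
thing as taking over. Translators take turns working on the same file.

> Simplified/Traditional conversion cannot simply be done with iconv; we should
> consider the MediaWiki conversion method discussed earlier. Of course, iconv
> can be used for now as a stopgap.
>
I am not familiar with MediaWiki. My understanding is that MediaWiki only
handles the conversion of idiomatic terms between Simplified and Traditional
Chinese; for example, "软件" in Simplified Chinese should be "軟體" in
Traditional Chinese. But can we rely on MediaWiki to convert all Simplified
characters to Traditional characters, for example "国" to "國" and "么" to
"麼"?

> Requiring everyone to use Traditional Chinese considerably dampens the
> enthusiasm of SC contributors. It may also lower the quality of the Simplified
> text after conversion (the mapping between Simplified and Traditional is not
> one-to-one) and lose characters (the GB2312 character set is too small).
>
I don't think the current translation mechanism dampens contributors'
enthusiasm. After all, it was the only workable option; otherwise we would not
be having today's discussion.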

--
Vern
2009-02-19


Re: Release announcement simplified Chinese translation update

by Kan-Ru Chen

On Thu, Feb 19 2009, Vern Sun wrote:

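> On Thu, 2009-02-19 at 14:24 +0800, Dongsheng Song wrote:
>> My opinion is that each *.zh.wml should use only Traditional or only
>> Simplified, depending on the choice of its main contributor.
>>
> The three html files (*.zh-cn.html/*.zh-tw.html/*.zh-hk.html) are all
> generated from a single *.wml file.
>
> So should it be Simplified or Traditional? It is quite possible that A.wml uses
> Simplified and B.wml uses Traditional, so how is that handled when the html
> files are generated? That is the problem we need to discuss.
>

Once we standardize on UTF-8, I think it would be best if the conversion could
be automatic. The source .wml may then contain both Simplified and Traditional
text, and we no longer need to consider which script a contributor is used to.

>> For example, SC contributors use Simplified and TC contributors use Traditional
>> (there is no need to auto-detect the encoding; just add a tag, similar to how
>> the corresponding English version information is recorded). If an SC
>> contributor stops maintaining a wml file and a TC contributor takes over, the
>> TC contributor decides whether to continue in Simplified or switch to
>> Traditional, and vice versa.
>>
> Contributing translations is not like maintaining a package; there is no such
> thing as taking over. Translators take turns working on the same file.
>
>> Simplified/Traditional conversion cannot simply be done with iconv; we should
>> consider the MediaWiki conversion method discussed earlier. Of course, iconv
>> can be used for now as a stopgap.
>>
> I am not familiar with MediaWiki. My understanding is that MediaWiki only
> handles the conversion of idiomatic terms between Simplified and Traditional
> Chinese; for example, "软件" in Simplified Chinese should be "軟體" in
> Traditional Chinese. But can we rely on MediaWiki to convert all Simplified
> characters to Traditional characters, for example "国" to "國" and "么" to
> "麼"?
>
MediaWiki, or software like TongWenTang (同文堂), can handle both
Simplified/Traditional character conversion and the conversion of idiomatic
terms. Combined with wml's existing language tag mechanism, it should be
possible to handle semantic errors made by the automatic conversion, so that
the final generated files are more accurate.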
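
A minimal Python sketch of how automatic conversion and the existing inline tags could be combined: text outside `[NAME:text:]` spans is converted automatically, while the body of a selected span is emitted as written. The slice groupings follow the `wml` command line quoted earlier in the thread; the span pattern, the toy table, and the helpers `auto_convert` and `build` are assumptions for illustration, and the source is assumed to be Traditional.

```python
import re

# Slice names per output variant, as in the wml command line quoted in this thread.
SLICES = {
    "zh-cn": {"CN", "CNHK", "CNTW"},
    "zh-hk": {"HK", "CNHK", "HKTW"},
    "zh-tw": {"TW", "CNTW", "HKTW"},
}
SPAN = re.compile(r"\[([A-Z]+):(.*?):\]")  # hypothetical, simplified span syntax

# Toy Traditional-to-Simplified table; a real converter would use a full
# character and phrase table.
TC_TO_SC = str.maketrans({"載": "载", "檔": "档", "軟": "软", "體": "体"})


def auto_convert(text: str, variant: str) -> str:
    """Assume a Traditional source: convert for zh-cn, pass through otherwise."""
    return text.translate(TC_TO_SC) if variant == "zh-cn" else text


def build(text: str, variant: str) -> str:
    """Auto-convert text outside spans; emit selected span bodies unchanged."""
    wanted = SLICES[variant]
    pieces, last = [], 0
    for match in SPAN.finditer(text):
        pieces.append(auto_convert(text[last:match.start()], variant))
        if match.group(1) in wanted:
            pieces.append(match.group(2))  # manual override: never auto-converted
        last = match.end()
    pieces.append(auto_convert(text[last:], variant))
    return "".join(pieces)


source = "下載[CN:文件:][HKTW:檔案:]"
print(build(source, "zh-cn"))
print(build(source, "zh-tw"))
print(auto_convert("下載檔案", "zh-cn"))  # what character conversion alone would give
```

In this toy case, character conversion alone turns 檔案 into 档案, whereas the inline tag lets the translator ask for 文件 in the Simplified output.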

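>> Requiring everyone to use Traditional Chinese considerably dampens the
>> enthusiasm of SC contributors. It may also lower the quality of the Simplified
>> text after conversion (the mapping between Simplified and Traditional is not
>> one-to-one) and lose characters (the GB2312 character set is too small).
>>
> I don't think the current translation mechanism dampens contributors'
> enthusiasm. After all, it was the only workable option; otherwise we would not
> be having today's discussion.

I think it does have some effect; otherwise we would not need to consider
moving to UTF-8 now :)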


Re: Release announcement simplified Chinese translation update

by Kan-Ru Chen

On Wed, Feb 18 2009, Arne Goetje wrote:

> Anthony Wong wrote:
>> I suggest 1. to convert all existing Chinese WML files for the Debian
>> website from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion
>> table to do both TC-SC and SC-TC conversions. This way, we no longer
>> need to care which script the translators use and the burden for them to
>> use Big5 is lifted.
>
>> Any comments?
>
> Excellent idea. :)
> Can we get the MediaWiki conversion table packaged as a library for debian?
>
> Cheers
> Arne

I don't know much about the MediaWiki conversion engine. Does it have a
command line interface or a library interface?

Another source for a conversion table that I'd like to suggest is
TongWenTang[1,2], a JavaScript-based Firefox add-on that
provides online web page conversion. I have written a simple script,
tongwen.py[3], that imports the table from TongWenTang and converts a
file. It's usable, though other similar projects may exist :)

- Kanru

[1]: http://tongwen.mozdev.org/
[2]: http://of.openfoundry.org/projects/333/rt
[3]: http://github.com/kanru/tongwen-py/


Re: Release announcement simplified Chinese translation update

by Kan-Ru Chen

On Wed, Feb 18 2009, Jens Seidel wrote:

> On Wed, Feb 18, 2009 at 02:43:11AM +0800, Anthony Wong wrote:
>> I have been thinking that using Big5 as the primary encoding for both TC
>> (Traditional Chinese) and SC (Simplified Chinese) versions of Debian website
>> are detrimental to user contributions. To summarize the current situation of
>> the Chinese versions of Debian website, translations must be done in Big5
>> WML files, TC version is basically converted simply from WML to HTML, but to
>> generate the SC versions, Big5 files must be converted to GB2312 first. It
>> is done so due to the one-to-many SC-TC mappings problem. To deal with the
>> differences of terms for the same meaning in TC and SC, like 文件 and 檔案, we
>> use a simple mapping table written in Perl and for some terms that are
>> rarely used, inline WML substitution syntax is used, like
>> [CN:文件:][HKTW:檔案:].
>  
>> I suggest 1. to convert all existing Chinese WML files for the Debian
>> website from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion
>> table to do both TC-SC and SC-TC conversions.
>
> Will conditionals as [CN:...] still be used/required?
>
I think they are still required in case of mis-auto-conversion.

> What dialect (TC resp. SC) do you suggest for committed files? Both is
> probably not possible or requires that the build system is extended to
> either support a file name suffix to recognize the dialect or a WML
> tag such as #use debian::chinese::Traditional_Chinese.
>

If we migrate to a UTF-8 based build system, then both dialects can be
used in one file.

- Kanru

