[
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706928#action_12706928 ]
Xiaoping Gao commented on LUCENE-1629:
--------------------------------------
to McCandless:
A lot of the code depends on Java 1.5; I use enums and generics frequently. I did this because I saw these points on the Apache wiki:
* All core code to be included in 2.X releases should be compatible with Java 1.4.
* All contrib code should be compatible with *either Java 5 or 1.4*.
I have corrected the copyright header and @author tags, thank you.
to Schindler:
1. This is a really good idea. I want to move the data files into the jar in the next development cycle, but for now I need to make some changes to the data files independently. Can I just commit the code now?
2. I have changed the getInstance() method to be synchronized.
3. All the source files are now encoded in UTF-8, and I have put a notice in package.html. Should I do anything else?
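For point 2, here is a minimal sketch of what a synchronized getInstance() could look like. The class name DictionaryHolder, the loadCount counter, and the placeholder entry are hypothetical and only for illustration; they are not the names used in the patch. The code avoids anything newer than Java 1.5.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for dictionary data; not the class from the patch.
public class DictionaryHolder {
    private static DictionaryHolder instance;
    private static int loadCount = 0; // counts how often the data was loaded

    private final Map<String, Integer> wordFrequencies = new HashMap<String, Integer>();

    private DictionaryHolder() {
        loadCount++;
        wordFrequencies.put("placeholder", Integer.valueOf(1)); // stands in for real loading
    }

    // synchronized: only one thread at a time can run the null check and the
    // construction, so the data is loaded at most once.
    public static synchronized DictionaryHolder getInstance() {
        if (instance == null) {
            instance = new DictionaryHolder();
        }
        return instance;
    }

    public static synchronized int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    DictionaryHolder.getInstance();
                }
            });
            threads[i].start();
        }
        for (int i = 0; i < threads.length; i++) {
            threads[i].join();
        }
        System.out.println("loads: " + getLoadCount());
    }
}
```

Without the synchronized keyword, two threads could both see instance == null and load the data twice.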
Thank you all!
> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
> Reporter: Xiaoping Gao
> Attachments: analysis-data.zip, LUCENE-1629.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in Chinese. It is called "imdict-chinese-analyzer"; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I) "是" (am) "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must segment each sentence properly, or there will be misunderstandings throughout the index constructed by Lucene, and the accuracy of the search engine will suffer seriously!
> Although there are two analyzer packages in the Apache repository that can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word. This obviously does not match how Chinese words are formed, and it also increases the index size and hurts performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. The tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while that of the other analyzers is about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.
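To make the segmentation difference described in the issue concrete, here is a small sketch comparing the single-character strategy, the two-adjoining-characters strategy, and a word-based one. Note that the word-based method shown is plain forward maximum matching against a tiny hand-made word list, not the HMM algorithm of imdict-chinese-analyzer; the class name, method names, and word list are all hypothetical, and none of this is the code of the actual Lucene analyzers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only; these are not the Lucene classes.
public class SegmentationSketch {

    // One token per character.
    static List<String> unigrams(String s) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < s.length(); i++) {
            out.add(s.substring(i, i + 1));
        }
        return out;
    }

    // One token per pair of adjoining characters (overlapping pairs).
    static List<String> bigrams(String s) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    // Forward maximum matching: at each position take the longest word
    // found in the word list, falling back to a single character.
    static List<String> maxMatch(String s, Set<String> words, int maxLen) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < s.length()) {
            int end = Math.min(s.length(), i + maxLen);
            while (end > i + 1 && !words.contains(s.substring(i, end))) {
                end--;
            }
            out.add(s.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "我是中国人";
        // Hypothetical word list, just large enough for this sentence.
        Set<String> words = new HashSet<String>(
                Arrays.asList("我", "是", "中国", "国人", "中国人"));
        System.out.println(unigrams(sentence));           // 5 tokens
        System.out.println(bigrams(sentence));            // 4 tokens
        System.out.println(maxMatch(sentence, words, 3)); // 3 tokens
    }
}
```

For this sentence the word-based pass yields "我", "是", "中国人", while the other two strategies produce more tokens, most of which are not words.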