[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708892#action_12708892 ]

Uwe Schindler commented on LUCENE-1629:
---------------------------------------

I will look into it this evening and provide a patch.

Because of the file exclusion problems, I thought that having a separate resources directory (as Maven does) would be a good new approach. We could also do this for the tests. In my opinion, data files should be separated from source files. Adding the resources folder to the classpath during tests also saves a lot of disk space during compilation and testing (OK, that's not important). This way, setting up the compilation/test classpath and building the JAR files are separate tasks.

The only problem with my current approach is that the JAR packager fails when the directory is not available :( Is it so bad to just add an empty resources folder to every compilation unit? This would be similar to Maven.

> contrib intelligent Analyzer for Chinese
> ----------------------------------------
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
> Reporter: Xiaoping Gao
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: analysis-data.zip, bigramdict.mem, build-resources.patch, coredict.mem, LUCENE-1629-java1.4.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese language. It's called "imdict-chinese-analyzer"; the project on Google Code is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I) "是" (am) "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence properly, or there will be misunderstandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be seriously affected!
> Although there are two analyzer packages in the Apache repository which can handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word. This is obviously not true in reality; this strategy also increases the index size and hurts performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model (HMM), so it can tokenize Chinese sentences in a really intelligent way. The tokenization accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical Analyzer ICTCLAS", while that of the other analyzers is about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to contribute it to the Apache Lucene repository.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@...
For additional commands, e-mail: java-dev-help@...
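The segmentation problem in the issue description can be illustrated with a small sketch. It is a simplified stand-in for "each character" and "every two adjoining characters" tokenization, not the actual ChineseAnalyzer or CJKAnalyzer code; the class name `NGramSketch` and the method `ngrams` are hypothetical, and overlapping bigrams are an assumption about how "every two adjoining characters" is applied.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Splits text into overlapping tokens of n chars each
    // (n = 1: one token per character, n = 2: adjoining pairs).
    // Works on Java chars, so it ignores surrogate pairs; fine for this example.
    static List<String> ngrams(String text, int n) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            tokens.add(text.substring(i, i + n));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "我是中国人"; // "I am Chinese"
        System.out.println(ngrams(sentence, 1)); // [我, 是, 中, 国, 人]
        System.out.println(ngrams(sentence, 2)); // [我是, 是中, 中国, 国人]
    }
}
```

Neither output contains the word "中国人" as a single token, and the bigram output includes "是中", which is not a word. A dictionary- and HMM-based segmenter, as proposed in the issue, aims to emit "我", "是", "中国人" instead.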

[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708894#action_12708894 ]

Michael McCandless commented on LUCENE-1629:
--------------------------------------------

OK, I agree, separation of resources from source code is good.

Can we limit the required addition of src/resources/org/apache/lucene/* to just contrib/analyzers? I.e., somehow override only its jarify macro?

[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708909#action_12708909 ]

Uwe Schindler commented on LUCENE-1629:
---------------------------------------

Only the src/resources folder itself is needed, no subfolders. I think it would be no problem to add this folder to every compilation unit (I added it to my svn in minutes). The good thing is that future developers will then know where to put the resource files.

But I agree, there should be a better way to automatically detect the resources folder before Ant 1.7.1. Maybe we should ask Erik Hatcher, as the Ant specialist...!

[jira] Issue Comment Edited: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708912#action_12708912 ]

Erik Hatcher edited comment on LUCENE-1629 at 5/13/09 5:58 AM:
---------------------------------------------------------------

My initial thought is to move the <copy> excluding **/*.java and **/*.html to the "compile" macro. In the ancient past, Ant actually used to do this automatically with <javac>.

[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708912#action_12708912 ]

Erik Hatcher commented on LUCENE-1629:
--------------------------------------

My initial thought is to move the <copy> excluding **/*.java and **/*.html to the "compile" macro. In the ancient past, Ant actually used to do this automatically with <javac>.

[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1629:
----------------------------------

Attachment: build-resources.patch

Here is another try, following Erik's suggestion: I moved the <copy> task to the compile macro and extended the list of exclusions. With some work and verbose=true, I added all "source" files to the exclusions (also .jj and so on). Using this patch, you can compile Xiaoping Gao's patch, add the resources to cn/ and cn/smart/hhmm/, and they appear in the classpath for testing and in the final JAR file.

My problem with this is the messy exclusion list. While reading the Ant docs, I found out that the <copy> task can be told not to stop on errors. The idea is now again to put the data files into a Maven-like resources folder and just copy them to the classpath (if the folder does not exist, copy would simply do nothing). I will post a patch/test later.

[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1629:
----------------------------------

Attachment: build-resources-with-folder.patch

This is a second try, again with the resources folder. Having a src/resources folder is now optional; if it exists, all files inside it are copied to the build destination. The trick is that the copy task can additionally use a globmapping, and with that it does the following:
- The source fileset of the copy task uses the src/ folder directly.
- The fileset only includes resources/**.
- Because the target folder would then get an additional sub-folder "resources" (the base dir of the copy operation is "src/"), the filenames are rewritten by a globmapping that strips "resources/" from the relative path.

This patch also adds a simple test case showing that ArabicAnalyzer does not start correctly when the stopwords.txt file is not in the classpath. The test fails if the stopwords.txt file stays at its original location and/or the copy operation is commented out.

The patch does not contain the deletion of the Arabic stopwords file from the sources folder (it was binary), so remove it by hand or simply move it after applying the patch.
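The path rewriting and the classpath check described in this message can be sketched in Java. This only mimics the effect of the mapping (stripping "resources/" from a relative path) and of a classpath lookup; it is not Ant's implementation or the test from the patch. The names `ResourcePathSketch`, `stripPrefix`, and `isOnClasspath`, and the example paths, are hypothetical.

```java
import java.util.Optional;

public class ResourcePathSketch {
    // Mimics a mapping from "resources/*" to "*": paths under the prefix
    // lose it, all other paths are not copied at all (empty result).
    static Optional<String> stripPrefix(String relativePath, String prefix) {
        if (relativePath.startsWith(prefix)) {
            return Optional.of(relativePath.substring(prefix.length()));
        }
        return Optional.empty();
    }

    // Returns true if a resource with this name is visible on the classpath,
    // which is what an analyzer needs in order to load its data files.
    static boolean isOnClasspath(String resourceName) {
        return ResourcePathSketch.class.getClassLoader().getResource(resourceName) != null;
    }

    public static void main(String[] args) {
        // Relative to src/, a resource keeps its package path but loses "resources/".
        System.out.println(stripPrefix("resources/org/example/stopwords.txt", "resources/"));
        // prints Optional[org/example/stopwords.txt]
        System.out.println(stripPrefix("java/org/example/Foo.java", "resources/"));
        // prints Optional.empty
        System.out.println(isOnClasspath("no/such/dir/stopwords.txt"));
        // prints false
    }
}
```

If the copy step is skipped, a lookup like `isOnClasspath` fails for the data file, which matches the failure mode the added test case is meant to detect.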

[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709345#action_12709345 ]

Michael McCandless commented on LUCENE-1629:
--------------------------------------------

Awesome! I've applied your patch, Uwe, and moved ArabicAnalyzer's stopwords.txt, as well as SmartChineseAnalyzer's stopwords.txt, bigramdict.mem, and coredict.mem, into their respective subdirs under src/resources/*.

I confirmed TestArabicAnalyzer passes (and verified it really did instantiate ArabicAnalyzer).

All tests pass. I will commit shortly.

This issue is a delightful example of the collaboration that makes open source development work so well. Thanks Xiaoping, Uwe and Erik!

[jira] Resolved: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1629.
----------------------------------------

Resolution: Fixed

Thanks everyone!

[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709352#action_12709352 ]

Uwe Schindler commented on LUCENE-1629:
---------------------------------------

Fine! Should I commit the ArabicAnalyzer test, too? But I think the test is not really needed, as the new Chinese analyzer already tests for the resources implicitly.

One thing: The change is in the main changes.txt, normally it should be in contrib's changes.txt, or not? If it should stay there, we should also add Spatial and TrieRange to the main changes.txt.

And one other thing: The analyzer (and many more) use the old TokenStream API at the moment, we should change this before 2.9 for all contrib analyzers, see LUCENE-1460?

[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709355#action_12709355 ]

Michael McCandless commented on LUCENE-1629:
--------------------------------------------

> Should I commit the ArabicAnalyzer test, too?

Woops, I missed it -- I'll commit it. The more tests the better!

> The change is in the main changes.txt, normally it should be in contrib's changes.txt, or not?

Woops -- you're right. I'll move this to contrib's CHANGES.txt.

[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese

[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709357#action_12709357 ]

Michael McCandless commented on LUCENE-1629:
--------------------------------------------

> The analyzer (and many more) use the old TokenStream API at the moment, we should change this before 2.9 for all contrib analyzers, see LUCENE-1460?

Yes -- we need to resolve LUCENE-1460 (and a great many more; the list keeps growing!) before 2.9.
|
|
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1629: ---------------------------------- Attachment: LUCENE-1629-encoding-fix.patch Hi Mike, a small patch: The HTML files generated by Javadoc do not contain the charset header and are displayed as ISO-8859-1. This breaks the docs for the Chinese analyzer. The attached patch sets the output encoding correctly to UTF-8 using the <meta/> HTML tag. |
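Why the missing charset declaration breaks the Chinese docs can be shown with a small plain-Java sketch. It only simulates a reader that decodes UTF-8 bytes as ISO-8859-1, as the message above says happens; the class and method names are hypothetical, and nothing here is taken from the attached patch.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    /** Decodes UTF-8 bytes as ISO-8859-1, simulating a page without a charset declaration. */
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the same bytes with the charset they were written in. */
    static String readCorrectly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "中国人";
        // Each of the three characters takes three bytes in UTF-8, so the
        // misread string has nine single-byte characters instead of three.
        System.out.println(misread(word).length());            // 9
        System.out.println(readCorrectly(word).equals(word));  // true
    }
}
```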
|
|
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709416#action_12709416 ] Xiaoping Gao commented on LUCENE-1629: -------------------------------------- Test successful on my laptop now! Thank all of you for your patience and hard work! I will continue to maintain this analyzer and develop new features. Best Wishes! |
|
|
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709420#action_12709420 ] Michael McCandless commented on LUCENE-1629: -------------------------------------------- OK, I just committed that fix (javadocs encoding == UTF-8) Uwe. Thanks. |
|
|
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709425#action_12709425 ] Uwe Schindler commented on LUCENE-1629: --------------------------------------- Hi Xiaoping, Thanks! The code is now committed. For my own understanding (as I do not know Chinese and cannot read some of the comments), some questions/comments: The .mem files are serializations of the dictionaries. They are created by loading from the random access files (the .dct files) and then serializing to the .mem files. But for developers and further updates you need to have the .dct files and rerun these steps (that is what all these private methods are for). An interesting addition would be a custom build step that uses the .dct files and builds the .mem files from them. How could I invoke that? So maybe you could extract the unused .dct file loaders from the current classes and create a separate tool from them, invokable from Ant, that builds the .mem files. Uwe P.S.: By the way: In these private conversion methods (that are never called from the library code) you have these default try-catch blocks, which is bad programming practice. So the proposed separate conversion tool should handle the exceptions correctly, or better, not catch them at all and pass them up (side note: I hate Eclipse for generating these auto-catch blocks; it would be better to auto-add throws clauses to the method signatures!) |
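The separate conversion tool proposed above could look roughly like the following sketch, written in current Java rather than the Java 1.4/1.5 targeted by the patch. It only shows the serialization half and the "declare throws instead of catching" style from the P.S.; the class name, the method names, and the table layout (a word list plus frequencies) are hypothetical, since neither the .dct format nor the real .mem layout is described in this thread.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class MemFileSketch {
    /** Serializes the (hypothetical) dictionary tables; I/O errors propagate to the caller. */
    static void writeMem(Path memFile, char[][] words, int[] freqs) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(memFile)))) {
            out.writeObject(words);
            out.writeObject(freqs);
        }
    }

    /** Reads the tables back in the order they were written. */
    static Object[] readMem(Path memFile) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(memFile)))) {
            char[][] words = (char[][]) in.readObject();
            int[] freqs = (int[]) in.readObject();
            return new Object[] { words, freqs };
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in data; a real tool would obtain it from a .dct loader.
        char[][] words = { "中国人".toCharArray(), "我".toCharArray() };
        int[] freqs = { 10, 25 };
        Path memFile = Files.createTempFile("sketch", ".mem");
        writeMem(memFile, words, freqs);
        Object[] loaded = readMem(memFile);
        System.out.println(Arrays.deepEquals(words, (char[][]) loaded[0])); // true
        System.out.println(Arrays.equals(freqs, (int[]) loaded[1]));        // true
        Files.delete(memFile);
    }
}
```

Because no exception is swallowed, a failed conversion would abort the build step that invokes the tool instead of silently producing a broken .mem file.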
|
|
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709796#action_12709796 ] Mingfai Ma commented on LUCENE-1629: ------------------------------------ Hi Xiaoping, I'm interested in getting the Chinese analyzer to work for Traditional Chinese (UTF-8/Big5). I just wonder whether your coredict.mem comes from ICTCLAS (http://ictclas.org/Down_share.html)? If yes, is it the 2009 or the 2008 release? ICTCLAS has a Traditional Chinese edition for its 2008 release, but the distribution is not in .dct format. I wonder if we have a simple specification for .dct, so that I could find a way to convert the ICTCLAS lexical dictionary to the .dct format to work with your library? |
|
|
[jira] Issue Comment Edited: (LUCENE-1629) contrib intelligent Analyzer for Chinese[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709796#action_12709796 ] Mingfai Ma edited comment on LUCENE-1629 at 5/15/09 3:23 AM: ------------------------------------------------------------- Hi Xiaoping, I'm interested in getting the Chinese analyzer to work for Traditional Chinese (UTF-8/Big5). I just wonder whether your coredict.dct comes from ICTCLAS (http://ictclas.org/Down_share.html)? If yes, is it the 2009 or the 2008 release? ICTCLAS has a Traditional Chinese edition for its 2008 release, but the distribution is not in .dct format. I wonder if we have a simple specification for .dct, so that I could find a way to convert the ICTCLAS lexical dictionary to the .dct format to work with your library? |
|
|
[jira] Commented: (LUCENE-1629) contrib intelligent Analyzer for Chinese[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709867#action_12709867 ] Xiaoping Gao commented on LUCENE-1629: -------------------------------------- Hello Mingfai! coredict.mem is converted from coredict.dct, which comes from ICTCLAS1.0, neither 2008 nor 2009. The author authorized me to release just the lexical dictionary from ICTCLAS1.0 under APLv2, but he didn't authorize the dictionaries of ICTCLAS 2008~2009. As far as I know, coredict.dct contains only GB2312 characters, so it cannot support Big5. I think we should find a proper Big5 dictionary first, then I will help you convert it to a .dct file. |