Advice about n-grams and comparing vectors

Hello all,

I am a relative novice with Weka. I have been playing around and it has some awesome features! I am looking for a bit of advice about how to accomplish the following task. I have searched the archives to come up with what I have tried so far.

Known information: A bunch of files are provided. Each file (e.g. 1.txt) has "actions" listed in it, like

<begin>
eat
walk
sleep
eat
snore
walk
sleep
</begin>

Each file can be classified as active or dormant (only 2 states).

The task: To classify a new file.

What I have done: I wrote a program which calculates n-gram frequencies for each file. So for 1.txt, 3-gram would be {1,eat\nwalk},{2,walk\nsleep},{1,sleep\neat}..and so on. I now have a vector [1,2,1,..]. What I need to do now is develop a classification strategy so that, given a new file, I can predict whether it is active or dormant. I am currently thinking of using the Kolmogorov-Smirnov test to see if two vectors look similar and then base my decision on the p-value (0.05 threshold).

I am a newbie to machine learning and would greatly appreciate some pointers.

Thanks

_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
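For readers following along, the n-gram counting described in this post could look roughly like the Python sketch below. It is not the poster's actual program: the function names are made up for illustration, and the `<begin>`/`</begin>` handling assumes exactly the file layout shown above. Note that the counts quoted in the post (one "eat, walk", two "walk, sleep", one "sleep, eat") correspond to pairs of consecutive actions, i.e. n = 2.

```python
from collections import Counter


def read_actions(text):
    """Return the action lines found between <begin> and </begin>."""
    lines = [ln.strip() for ln in text.splitlines()]
    start, end = lines.index("<begin>"), lines.index("</begin>")
    return [ln for ln in lines[start + 1:end] if ln]


def ngram_counts(actions, n):
    """Count every run of n consecutive actions in one file."""
    # zip over n shifted copies of the list yields all windows of length n
    grams = zip(*(actions[i:] for i in range(n)))
    return Counter("\n".join(g) for g in grams)


sample = "<begin>\neat\nwalk\nsleep\neat\nsnore\nwalk\nsleep\n</begin>"
counts = ngram_counts(read_actions(sample), 2)
```

Each key of `counts` is one n-gram (actions joined by a newline, as in the post) and each value is its frequency in that file.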
|
|
Re: Advice about n-grams and comparing vectors

> I am looking for a bit of advice about how to accomplish the following task.
> [...]
> I am a newbie to machine learning and would greatly appreciate some
> pointers.

Have you thought about using Weka for all those steps? The 3.6.x/3.7.x versions of Weka already contain an NGramTokenizer.

Here's what you could do:

1. Import your text documents with the TextDirectoryLoader converter (see FAQ "How do I perform text classification?"), or generate the dataset manually via code yourself (see wiki article "Creating an ARFF file": http://weka.wikispaces.com/Creating+an+ARFF+file).

2. Convert the data via the StringToWordVector filter (package weka.filters.unsupervised.attribute) and the NGramTokenizer.

Using the FilteredClassifier meta-classifier with this StringToWordVector and your choice of base classifier, you don't have to run the filter separately. This approach also takes care of transforming new data correctly when classifying new data.

Link to the FAQs available from the Weka homepage.

Cheers, Pete

--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/
Ph. +64 (7) 858-5174
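The point about the FilteredClassifier (the filter and the base classifier travel together, so new data is transformed the same way as the training data) can be illustrated with a toy Python analogue. This is only a sketch of the idea, not Weka code: the class name `FilteredModel`, the bigram tokenizer and the dot-product scoring are all assumptions chosen to keep the example short.

```python
from collections import Counter


def bigrams(text):
    """Split on whitespace and return all pairs of consecutive tokens."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


class FilteredModel:
    """Toy stand-in for 'filter + base classifier' bundled in one object."""

    def fit(self, texts, labels):
        # The vocabulary (attribute list) is learned from the training texts only.
        self.vocab = sorted({g for t in texts for g in bigrams(t)})
        # The "classifier" is just one summed count vector per class.
        self.profiles = {}
        for text, label in zip(texts, labels):
            vec = self._vectorize(text)
            old = self.profiles.get(label, [0] * len(self.vocab))
            self.profiles[label] = [a + b for a, b in zip(old, vec)]
        return self

    def _vectorize(self, text):
        # New text is always mapped onto the stored training vocabulary.
        counts = Counter(bigrams(text))
        return [counts[g] for g in self.vocab]

    def predict(self, text):
        vec = self._vectorize(text)

        def overlap(label):
            return sum(a * b for a, b in zip(self.profiles[label], vec))

        return max(sorted(self.profiles), key=overlap)


model = FilteredModel().fit(
    ["work jump work jump dance", "sleep snore sleep snore"],
    ["active", "dormant"],
)
```

Because `predict` reuses the vocabulary stored by `fit`, a caller never has to vectorize a new file by hand, which is the convenience described in the reply above.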
|
|
Re: Advice about n-grams and comparing vectors

Awesome! I will try it out; my Weka install is a little dated. I will update it and try out your suggestions.

Thanks a ton.

From: Peter Reutemann <fracpete@...>
To: Weka machine learning workbench list. <wekalist@...>
Sent: Tue, October 27, 2009 12:06:17 PM
Subject: Re: [Wekalist] Advice about n-grams and comparing vectors

[...]
|
|
Re: Advice about n-grams and comparing vectors

Thanks Peter for your help. Being a newbie, I am facing another small hurdle. I'd greatly appreciate any advice here too.

I did all the steps you mentioned,

[code]
java weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/mydata/ -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" > data.arff
[/code]

however at the classification stage I get "problem evaluating classifier: weka.filters.supervised.attribute.discretize: cannot handle unary class!"

My arff file looks like

@relation _home_me_weka-3-6-1_mydata

@attribute text string
@attribute class {active}

@data

'work\njump\neat\ndance\nshout\nloudshout\nwear\ndance\nwork\n',active

Any advice?

Thanks,
-J

From: john <anir_iiit@...>
To: Weka machine learning workbench list. <wekalist@...>
Sent: Tue, October 27, 2009 12:11:12 PM
Subject: Re: [Wekalist] Advice about n-grams and comparing vectors

[...]
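The "unary class" message fits the ARFF header shown above, where the class attribute has the single value {active}. The TextDirectoryLoader takes its class labels from the names of the sub-directories of the -dir folder, so a quick pre-flight check of the directory layout can catch this before any Weka step runs. The Python sketch below is one hypothetical way to do that; the function names are invented for illustration.

```python
import os


def class_labels(data_dir):
    """Sub-directory names, which the loader would turn into class labels."""
    return sorted(
        name
        for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )


def check_layout(data_dir):
    """Raise if the directory would produce fewer than two classes."""
    labels = class_labels(data_dir)
    if len(labels) < 2:
        raise ValueError(
            f"need at least two class sub-directories, found {labels}"
        )
    return labels
```

Calling `check_layout` on a folder that contains only an `active` sub-directory raises a `ValueError`; once a second class folder is added, it returns both label names.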
|
|
Re: Advice about n-grams and comparing vectors

I found the issue: there was only one class. Now my arff file has more than one class, but I get an error message saying: cannot handle string attributes!

From: john <anir_iiit@...>
To: Weka machine learning workbench list. <wekalist@...>
Sent: Tue, October 27, 2009 3:16:30 PM
Subject: Re: [Wekalist] Advice about n-grams and comparing vectors

[...]
|
|
Re: Advice about n-grams and comparing vectors

update: I tried

[code]
java weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/mydata/ -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -filter weka.filters.unsupervised.attribute.StringToNominal > data.arff
[/code]

yet I get the J48 cannot handle string error. I loaded the file in via the Explorer; I can see two attributes, text and class. class seems to be nominal, but text seems to be "attribute is neither numeric nor nominal".

Any pointers?

From: john <anir_iiit@...>
To: Weka machine learning workbench list. <wekalist@...>
Sent: Tue, October 27, 2009 3:26:48 PM
Subject: Re: [Wekalist] Advice about n-grams and comparing vectors

[...]
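One way to see why J48 complains is to look at the attribute types declared in the ARFF header: as long as an attribute is still declared as `string`, the data has not been turned into word or n-gram counts yet. The Python sketch below reads those declarations. It is a simplified, hypothetical helper that assumes unquoted attribute names without spaces, as in the file shown in this thread; it is not a full ARFF parser.

```python
def attribute_types(arff_text):
    """Map attribute name -> declared type, reading only the ARFF header."""
    types = {}
    for line in arff_text.splitlines():
        line = line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section starts
        if line.lower().startswith("@attribute"):
            # "@attribute <name> <type>"; the type may itself contain spaces
            _, name, kind = line.split(None, 2)
            types[name] = kind
    return types


def string_attributes(arff_text):
    """Names of attributes that are still declared as raw strings."""
    return [
        name
        for name, kind in attribute_types(arff_text).items()
        if kind.lower() == "string"
    ]


header = (
    "@relation _home_me_weka-3-6-1_mydata\n"
    "\n"
    "@attribute text string\n"
    "@attribute class {active,lazy}\n"
    "\n"
    "@data\n"
)
```

For the header above, `string_attributes(header)` reports the `text` attribute, which matches what the Explorer shows as "neither numeric nor nominal".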
|
|
Re: Advice about n-grams and comparing vectors

Please no top-posting, see the mailing list etiquette for why (http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html). Also, please remove irrelevant stuff from the post. And keep in mind that you're sending this email to several thousand people. Sending an email every few minutes (sit down, think a bit and try a few things before firing off an email) could be considered spamming...

> update: I tried
>
> [code]
> java weka.core.converters.TextDirectoryLoader -dir
> /home/me/weka-3-6-1/mydata/ -tokenizer "weka.core.tokenizers.NGramTokenizer
> -min 2 -max 3" -filter weka.filters.unsupervised.attribute.StringToNominal >
> data.arff
> [/code]

Is that a single commandline that you're running? If so, that's not gonna work. You have to perform the steps separately:

1. generate the initial ARFF file with your documents:

   java weka.core.converters.TextDirectoryLoader \
     -dir /home/me/weka-3-6-1/mydata/ > /some/output/directory/raw.arff

2. preprocess the dataset with the StringToWordVector filter:

   java weka.filters.unsupervised.attribute.StringToWordVector \
     -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" \
     -c last \
     -i /some/output/directory/raw.arff \
     -o /some/output/directory/raw_processed.arff

3. run a classifier on the dataset (NB: the class attribute is now at position 1!)

   java weka.classifiers.trees.J48 -t /some/output/directory/raw_processed.arff

> yet I get the J48 cannot handle string error. I loaded the file in via the
> explorer, I can see two attributes, text and class, class seems to be
> nominal but text seems to be "attribute is neither numeric nor nominal"

You didn't actually process the data with your commandline, hence you're getting this error message when using J48 and the display in the Explorer (it will still be listed as a STRING attribute if you open the ARFF file with a text editor).

Cheers, Peter
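For anyone who prefers to script the three separate steps, the Python sketch below assembles them as argument lists that could be handed to `subprocess.run`. The function name and the paths are placeholders. The options are the ones quoted in the reply above, with one addition that is an assumption on my part: `-c 1` on the J48 call, added because of the note that the class attribute ends up at position 1. Keeping the tokenizer specification as a single list element avoids the quoting problems of a hand-typed command line.

```python
import shlex

# The tokenizer spec must reach Weka as ONE argument, so it stays one list item.
TOKENIZER = "weka.core.tokenizers.NGramTokenizer -min 2 -max 3"


def pipeline_commands(data_dir, raw_arff, processed_arff):
    """Return the three separate java invocations as argument lists."""
    # Step 1 prints the ARFF to stdout; the caller redirects it to raw_arff.
    load = ["java", "weka.core.converters.TextDirectoryLoader", "-dir", data_dir]
    # Step 2 turns the string attribute into n-gram count attributes.
    vectorize = [
        "java", "weka.filters.unsupervised.attribute.StringToWordVector",
        "-tokenizer", TOKENIZER,
        "-c", "last",
        "-i", raw_arff,
        "-o", processed_arff,
    ]
    # Step 3 trains and evaluates; "-c 1" is an assumption, see the lead-in.
    classify = [
        "java", "weka.classifiers.trees.J48",
        "-t", processed_arff,
        "-c", "1",
    ]
    return load, vectorize, classify


load, vectorize, classify = pipeline_commands(
    "/home/me/weka-3-6-1/mydata/",
    "/some/output/directory/raw.arff",
    "/some/output/directory/raw_processed.arff",
)
for cmd in (load, vectorize, classify):
    print(shlex.join(cmd))  # shell-quoted form, handy for copy and paste
```

Nothing is executed here; the loop only prints the commands so they can be inspected first.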
|
|
Re: Advice about n-grams and comparing vectors

From: Peter Reutemann <fracpete@...>
To: Weka machine learning workbench list. <wekalist@...>
Sent: Tue, October 27, 2009 4:59:14 PM
Subject: Re: [Wekalist] Advice about n-grams and comparing vectors

[...]

Peter,

My apologies.

> Please no top-posting, see mailing list etiquette why

got it

> Also, please remove irrelevant stuff from the post. And keep in mind
> that you're sending this email to several thousand people. [...]

understood

Thanks for the pointers. I have tried the following using weka-3-6-1:

1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/mydata/ > 1.arff

2) java -classpath ./weka.jar weka.filters.unsupervised.attribute.StringToWordVector -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -i /home/me/weka-3-6-1/1.arff -o 2.arff

3) used the GUI to select 2.arff, select and run a classification model, save the model

Now I created another directory structure with a test file that I want to classify and did

1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/testdata/ > 1.arff

2) java -classpath ./weka.jar weka.filters.unsupervised.attribute.StringToWordVector -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -i /home/me/weka-3-6-1/1.arff -o 2.arff

3) java -classpath ./weka.jar weka.classifiers.lazy.KStar -l lazy.KStar_oct_28_2009.model -T 2.arff -p 0

and ended up with this error:

java.lang.Exception: training and test set are not compatible
    at weka.classifiers.Evaluation.evaluateModel(Unknown Source)
    at weka.classifiers.Classifier.runClassifier(Unknown Source)
    at weka.classifiers.lazy.KStar.main(Unknown Source)

The testdata directory contains 2 dirs, active and lazy, and the active directory contains 1 file, 1.txt, with some info like

<begin>
jump
jump
walk
eat
snore
</begin>

Can someone please give me a pointer: having created the model, how can I classify a dataset?

Thanks,
-J
|
|
Re: Advice about n-grams and comparing vectors

[...]

> 3) java -classpath ./weka.jar weka.classifiers.lazy.KStar -l
> lazy.KStar_oct_28_2009.model -T 2.arff -p 0
>
> and ended up with this error:
>
> java.lang.Exception: training and test set are not compatible
> at weka.classifiers.Evaluation.evaluateModel(Unknown Source)
> at weka.classifiers.Classifier.runClassifier(Unknown Source)
> at weka.classifiers.lazy.KStar.main(Unknown Source)

Yes, Weka expects the training set and test set to have exactly the same structure.

> [...]
> Can someone please give me a pointer as to having created the model, how can
> I classify a dataset.

After you have generated the initial datasets using the TextDirectoryLoader, you have to use batch-filtering when using the StringToWordVector filter. See FAQ "How do I generate compatible train and test sets that get processed with a filter?" for more information.

Cheers, Peter
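Why two separate filter runs give incompatible files, and why batch filtering fixes it, can be shown with a small Python analogue. This is a sketch of the idea only, not of Weka's implementation: the function names and the bigram tokenizer are assumptions made for the example.

```python
from collections import Counter


def bigrams(text):
    """Split on whitespace and return all pairs of consecutive tokens."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def header_of(texts):
    """Attribute list a filter would build when it sees only these texts."""
    return sorted({g for t in texts for g in bigrams(t)})


def batch_vectorize(train_texts, test_texts):
    """Build the attribute list from the training set and reuse it for both sets."""
    header = header_of(train_texts)

    def rows(texts):
        out = []
        for text in texts:
            counts = Counter(bigrams(text))
            # n-grams that never occurred in training have no column and are dropped
            out.append([counts[g] for g in header])
        return out

    return header, rows(train_texts), rows(test_texts)


train = ["eat walk sleep", "jump jump walk"]
test = ["jump walk eat"]

separate_train, separate_test = header_of(train), header_of(test)
header, train_rows, test_rows = batch_vectorize(train, test)
```

Filtering each set on its own yields `separate_train` and `separate_test` with different attributes, which is the "not compatible" situation from the error message. In the batch variant every row, train or test, has one value per entry of the shared `header`.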
|
|
Re: Advice about n-grams and comparing vectors

From: Peter Reutemann <fracpete@...>
To: Weka machine learning workbench list. <wekalist@...>
Sent: Wed, October 28, 2009 5:18:06 PM
Subject: Re: [Wekalist] Advice about n-grams and comparing vectors

[...]

> Thanks for the pointers, I have tried the following using weka-3-6-1
>
> 1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/mydata/ > 1.arff
> 2) java -classpath ./weka.jar weka.filters.unsupervised.attribute.StringToWordVector -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -i /home/me/weka-3-6-1/1.arff -o 2.arff
> 3) used the gui to select 2.arff, select and run a classification model, save the model
>
> Now I created another directory structure with a test file that I want to classify and did
>
> 1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/testdata/ > 1.arff
> 2) java -classpath ./weka.jar weka.filters.unsupervised.attribute.StringToWordVector -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -i /home/me/weka-3-6-1/1.arff -o 2.arff
> 3) java -classpath ./weka.jar weka.classifiers.lazy.KStar -l lazy.KStar_oct_28_2009.model -T 2.arff -p 0
>
> and ended up with this error:
>
> java.lang.Exception: training and test set are not compatible
>     at weka.classifiers.Evaluation.evaluateModel(Unknown Source)
>     at weka.classifiers.Classifier.runClassifier(Unknown Source)
>     at weka.classifiers.lazy.KStar.main(Unknown Source)

Yes, Weka expects the training set and test set to have exactly the same structure.

> the testdata directory contains 2 dirs active and lazy and the active directory contains 1 file 1.txt with some info like
>
> <begin>
> jump
> jump
> walk
> eat
> snore
> </begin>
>
> Can someone please give me a pointer as to having created the model, how can I classify a dataset.

After you have generated the initial datasets using the TextDirectoryLoader, you have to use batch-filtering when using the StringToWordVector filter. See FAQ "How do I generate compatible train and test sets that get processed with a filter?" for more information.

Cheers, Peter

--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/ Ph. +64 (7) 858-5174

Once again, thanks for helping out, Peter. I have looked up the FAQ and have used the following:

0) Made 2 directories, mydata and testing. In mydata there are 2 directories, active and lazy, with text files in each. In testing there are two directories, active and lazy, which have one file each. Both the files in testing/active and testing/lazy are also present in the training (mydata) directories.

1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir ./testing/ > testing1.arff

2) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir ./mydata/ > training1.arff

3) java -classpath ./weka.jar weka.filters.unsupervised.attribute.StringToWordVector -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -b -i training1.arff -o training2.arff -r testing1.arff -s testing2.arff

Now I have a training and a testing set in the same format.

I used the GUI to select a classifier, build a model and save it.

4) java -classpath ./weka.jar weka.classifiers.lazy.LWL -l lazy.LWL_oct_30_2009.model -T testing2.arff -p 0

The output is as follows:

=== Predictions on test data ===

 inst#  actual  predicted  error
     1       0          0      0
     2       0          0      0

Did something go wrong?

The model has

Relative absolute error        19.021 %
Root relative squared error    59.9506 %
Attributes:                    1018
Test mode:                     16-fold cross-validation
Instances:                     18

Also, consider the case where I have a single file and want to test whether it is lazy or active (classification): will I have to redo the whole procedure, steps 1 through 4? It takes quite a few minutes on my puny laptop to crunch out the tokenized result.

Thanks in advance,
-J
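Editorial aside on the "training and test set are not compatible" error discussed in this message: a quick way to see whether two ARFF files declare the same attributes is to compare their header sections. The following is only a minimal sketch under stated assumptions. The function names `arff_attributes` and `same_arff_structure` are hypothetical helpers, not part of Weka, and the comparison is purely textual, so it can be stricter than Weka's own check (for instance, it would flag a difference in whitespace).

```shell
#!/bin/sh
# Hypothetical helpers (not part of Weka): compare the attribute
# declarations of two ARFF files as plain text.

# Print the @attribute lines of an ARFF file, stopping at the @data line.
arff_attributes() {
    # ARFF keywords are case-insensitive, so lower-case the first field before matching.
    awk 'tolower($1) == "@data" { exit }
         tolower($1) == "@attribute" { print }' "$1"
}

# Exit status 0 if both files declare the same attributes in the same order.
same_arff_structure() {
    [ "$(arff_attributes "$1")" = "$(arff_attributes "$2")" ]
}

# Example call (file names taken from step 3 above):
#   same_arff_structure training2.arff testing2.arff && echo "headers match"
```

Files produced separately, as in the first attempt quoted above, would normally fail this check because each run of StringToWordVector builds its own word list; files produced together with the -b option should pass it.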
|
|
Re: Advice about n-grams and comparing vectors

> I have looked up the FAQ and have used the following:
>
> 0) Made 2 directories, mydata and testing. In mydata there are 2 directories, active and lazy, with text files in each. In testing there are two directories, active and lazy, which have one file each. Both the files in testing/active and testing/lazy are also present in the training (mydata) directories.
>
> 1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir ./testing/ > testing1.arff
>
> 2) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir ./mydata/ > training1.arff
>
> 3) java -classpath ./weka.jar weka.filters.unsupervised.attribute.StringToWordVector -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -b -i training1.arff -o training2.arff -r testing1.arff -s testing2.arff
>
> Now I have a training and a testing set in the same format.
>
> I used the GUI to select a classifier, build a model and save it.
>
> 4) java -classpath ./weka.jar weka.classifiers.lazy.LWL -l lazy.LWL_oct_30_2009.model -T testing2.arff -p 0

Why do you use the GUI for generating the model? And why don't you just build the classifier on the fly?

-t: training set
-T: test set

If you really want to save the model and re-use it later, then use the -d option to do this.

> The output is as follows:
>
> === Predictions on test data ===
>
>  inst#  actual  predicted  error
>      1       0          0      0
>      2       0          0      0
>
> Did something go wrong?
>
> The model has
>
> Relative absolute error        19.021 %
> Root relative squared error    59.9506 %
> Attributes:                    1018
> Test mode:                     16-fold cross-validation
> Instances:                     18

Since you loaded the data in the Explorer, you should have noticed that the StringToWordVector puts the class attribute in first position. Unless you specify otherwise, the last attribute is used as class attribute (Explorer and commandline).

BTW 16-fold CV is rather odd...

> Also, consider the case where I have a file and want to test if it is lazy or active (classification), will I have to re-do the whole procedure again Steps 1 through 4 !! It takes quite a few minutes on my puny laptop to crunch out the tokenized result.

There's nothing you can do about that: machine learning and data mining are computationally expensive and memory-intensive processes. But instead of performing those steps manually, automate them by writing a script, for instance (batch, bash, groovy, jython, perl, ...).

Cheers, Peter

--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/ Ph. +64 (7) 858-5174
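Editorial aside: the advice in this reply (train with -t, test with -T, save with -d, set the class attribute explicitly, and script the steps) could be put together roughly as follows. This is a sketch under stated assumptions, not the poster's own script. The `WEKA_JAR` and `DRY_RUN` variables, the function names and the model file name `lwl.model` are all made up for illustration; `-c 1` selects the first attribute as the class, which is where StringToWordVector is said to put it.

```shell
#!/bin/sh
# Hypothetical wrapper script; WEKA_JAR, DRY_RUN and all function names
# are assumptions for illustration and are not part of Weka.
WEKA_JAR=${WEKA_JAR:-./weka.jar}

# Run a Weka class from the jar; with DRY_RUN=1 only print the command
# (the printed form does not preserve quoting of arguments with spaces).
weka() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "java -classpath $WEKA_JAR $*"
    else
        java -classpath "$WEKA_JAR" "$@"
    fi
}

# Train on $1, evaluate on $2 and save the model to $3.
# -c 1 makes the first attribute the class attribute.
train_and_test() {
    weka weka.classifiers.lazy.LWL -c 1 -t "$1" -T "$2" -d "$3"
}

# Load the saved model $1 and print predictions for the test set $2.
predict() {
    weka weka.classifiers.lazy.LWL -c 1 -l "$1" -T "$2" -p 0
}

# Example calls (file names from the quoted steps):
#   train_and_test training2.arff testing2.arff lwl.model
#   predict lwl.model testing2.arff
```

Both functions expect files that have already been batch-filtered together, so a new file still has to go through the StringToWordVector step; the FilteredClassifier approach mentioned earlier in the thread is one way to avoid running the filter separately.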
|
|
Re: Advice about n-grams and comparing vectors

From: Peter Reutemann <fracpete@...>
To: Weka machine learning workbench list. <wekalist@...>
Sent: Sun, November 1, 2009 1:27:19 PM
Subject: Re: [Wekalist] Advice about n-grams and comparing vectors

[...]

Thanks Peter, much obliged :-)