Advice about n-grams and comparing vectors

Advice about n-grams and comparing vectors

by ab-18

Hello all,
I am a relative novice with Weka. I have been playing around with it, and it has some awesome features!

I am looking for a bit of advice about how to accomplish the following task. I have searched the archives to come up with what I have tried till now.
Known information: A bunch of files are provided. Each file (1.txt) has "actions" listed in it, like
<begin>
eat
walk
sleep
eat
snore
walk
sleep
</begin>

Each file can be classified as active or dormant (only 2 states)

The task: To classify a new file

What I have done: I wrote a program which calculates n-gram frequencies for each file. So for 1.txt, the 3-gram output would be {1,eat\nwalk},{2,walk\nsleep},{1,sleep\neat}, and so on. I now have a vector [1,2,1,..]. What I need to do now is develop a classification strategy so that, given a new file, I can predict whether it is active or dormant. I am currently thinking of using the Kolmogorov-Smirnov test to see whether two vectors look similar, and then basing my decision on the p-value (0.05 threshold).
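
For readers following along, the counting step described above can be sketched in a few lines. This is a minimal illustration in Python, not the poster's actual program; the function name `ngram_counts` is made up here. Note that the three counts quoted in the post match pairs of consecutive actions (n = 2) in the sample file.

```python
from collections import Counter

def ngram_counts(actions, n):
    """Count every run of n consecutive actions in the list."""
    grams = (tuple(actions[i:i + n]) for i in range(len(actions) - n + 1))
    return Counter(grams)

# The example file from the post, one action per line.
actions = "eat walk sleep eat snore walk sleep".split()
counts = ngram_counts(actions, 2)

# Counts for the three pairs listed in the post.
print(counts[("eat", "walk")], counts[("walk", "sleep")], counts[("sleep", "eat")])
```

Putting the n-grams into a fixed order and reading off their counts gives a vector like the [1,2,1,..] mentioned in the post.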

 I am a newbie to machine learning and would greatly appreciate some pointers.

Thanks



_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Advice about n-grams and comparing vectors

by Peter Reutemann-3

Have you thought about using Weka for all those steps? The 3.6.x/3.7.x
versions of Weka already contain an NGramTokenizer.

Here's what you could do:
1. import your text documents with the TextDirectoryLoader converter
    (see FAQ "How do I perform text classification?"),
    or generate the dataset manually via code yourself; see the wiki
    article "Creating an ARFF file":
    http://weka.wikispaces.com/Creating+an+ARFF+file

2. convert the data via the StringToWordVector filter (package
weka.filters.unsupervised.attribute) and the NGramTokenizer. Using the
FilteredClassifier meta-classifier with this StringToWordVector and
your choice of base classifier, you don't have to run the filter
separately. This approach also takes care of transforming new data
correctly when classifying new data.
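
The point about the meta-classifier transforming new data correctly can be illustrated without Weka. The following is only a rough Python sketch of the idea: the class `FilterThenClassify`, the helper `bigrams`, the toy documents and the 1-nearest-neighbour rule are all made up for illustration and are not Weka's implementation.

```python
from collections import Counter

def bigrams(text):
    """Return the consecutive pairs of whitespace-separated actions."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

class FilterThenClassify:
    """Hypothetical stand-in for a filter wrapped around a base classifier."""

    def fit(self, docs, labels):
        # The vocabulary is fixed once, from the training documents only.
        self.vocab = sorted({g for d in docs for g in bigrams(d)})
        self.train = [(self._vector(d), y) for d, y in zip(docs, labels)]
        return self

    def _vector(self, doc):
        counts = Counter(bigrams(doc))
        return [counts[g] for g in self.vocab]

    def predict(self, doc):
        # New data goes through the same vocabulary, then 1-nearest-neighbour.
        v = self._vector(doc)

        def dist(item):
            return sum((a - b) ** 2 for a, b in zip(v, item[0]))

        return min(self.train, key=dist)[1]

model = FilterThenClassify().fit(
    ["walk jump walk eat", "sleep snore sleep snore"],
    ["active", "dormant"],
)
print(model.predict("jump walk eat dance"))
```

Because the vocabulary is stored inside the wrapper, a new document is always turned into a vector with the same attributes as the training data, and n-grams never seen in training are simply ignored.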

A link to the FAQs is available from the Weka homepage.

Cheers, Pete
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174


Re: Advice about n-grams and comparing vectors

by ab-18

Awesome! I will try it out. My Weka install is a little dated, so I will update it and then try out your suggestions.

Thanks a ton.



Re: Advice about n-grams and comparing vectors

by ab-18

Thanks Peter for your help. Being a newbie I am facing another small hurdle. I'd greatly appreciate any advice here too.

I did all the steps you mentioned,
[code]
java weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/mydata/ -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" > data.arff
[/code]
however at the classification stage I get
"problem evaluating classifier: weka.filters.supervised.attribute.discretize: cannot handle unary class!"

My arff file looks like

@relation _home_me_weka-3-6-1_mydata

@attribute text string
@attribute class {active}

@data
'work\njump\neat\ndance\nshout\nloudshout\nwear\ndance\nwork\n',active

Any advice?

Thanks,
-J


Re: Advice about n-grams and comparing vectors

by ab-18

I found the issue: there was only one class. Now my ARFF file has more than one class, but I get an error message saying: cannot handle string attributes!
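
On the unary-class error: as far as the Weka documentation describes it, TextDirectoryLoader takes the names of the subdirectories as class labels, so a data directory with a single subdirectory yields a single class value, as in the `{active}` declaration shown earlier. The Python sketch below mimics that layout convention to show the check; `load_text_directory` and `class_labels` are hypothetical helpers, and the directory contents are made up.

```python
import os
import tempfile

def load_text_directory(root):
    """Return (text, label) pairs, one label per subdirectory of root."""
    data = []
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name)) as handle:
                data.append((handle.read(), label))
    return data

def class_labels(data):
    """Return the distinct labels found in the loaded data."""
    return sorted({label for _, label in data})

# Build a throwaway layout: <root>/active/1.txt and <root>/dormant/2.txt.
root = tempfile.mkdtemp()
for label, name, text in [("active", "1.txt", "work\njump\n"),
                          ("dormant", "2.txt", "sleep\nsnore\n")]:
    os.makedirs(os.path.join(root, label), exist_ok=True)
    with open(os.path.join(root, label, name), "w") as handle:
        handle.write(text)

data = load_text_directory(root)
labels = class_labels(data)
if len(labels) < 2:
    print("only one class found; a classifier needs at least two")
print(labels)
```

With only the `active` subdirectory present, the same check would report a single class, which is the situation the error message complains about.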



Re: Advice about n-grams and comparing vectors

by ab-18

update: I tried

[code]
java weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/mydata/ -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -filter weka.filters.unsupervised.attribute.StringToNominal > data.arff
[/code]

Yet I still get the error that J48 cannot handle string attributes. I loaded the file via the Explorer and can see two attributes, text and class. The class attribute seems to be nominal, but for text the Explorer says "attribute is neither numeric nor nominal".

Any pointers?


Re: Advice about n-grams and comparing vectors

by Peter Reutemann-3

Please no top-posting, see mailing list etiquette why
(http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html).
Also, please remove irrelevant stuff from the post. And keep in mind
that you're sending this email to several thousand people. Sending an
email every few minutes (sit down, think a bit and try a few things
before firing off an email) could be considered spamming...

> update: I tried
>
> [code]
> java weka.core.converters.TextDirectoryLoader -dir
> /home/me/weka-3-6-1/mydata/ -tokenizer "weka.core.tokenizers.NGramTokenizer
> -min 2 -max 3" -filter weka.filters.unsupervised.attribute.StringToNominal >
> data.arff
> [/code]

Is that a single commandline that you're running? If so, that's not gonna work.

You have to perform the steps separately:
1. generate the initial ARFF file with your documents:
    java weka.core.converters.TextDirectoryLoader \
       -dir /home/me/weka-3-6-1/mydata/ > /some/output/directory/raw.arff

2. preprocess the dataset with the StringToWordVector filter:
    java weka.filters.unsupervised.attribute.StringToWordVector \
        -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" \
        -c last \
        -i /some/output/directory/raw.arff \
        -o /some/output/directory/raw_processed.arff

3. run a classifier on the dataset (NB: the class attribute is now at
position 1, hence "-c 1"):
    java weka.classifiers.trees.J48 -c 1 \
        -t /some/output/directory/raw_processed.arff

> yet I get the J48 cannot handle string error. I loaded the file in via the
> explorer, I can see two attributes, text and class, class seems to be
> nominal but text seems to be "attribute is neither numeric nor nominal"

Your command line never actually ran the StringToWordVector filter, so
the text attribute is still a STRING attribute (you can see this if you
open the ARFF file with a text editor). That is why J48 reports the
error and why the Explorer shows the attribute the way it does.

Cheers, Peter

Re: Advice about n-grams and comparing vectors

by ab-18

Peter, my apologies.

> Please no top-posting, see mailing list etiquette why
> (http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html).

Got it.

> Also, please remove irrelevant stuff from the post. And keep in mind
> that you're sending this email to several thousand people. Sending an
> email every few minutes (sit down, think a bit and try a few things
> before firing off an email) could be considered spamming...

Understood.

Thanks for the pointers, I have tried the following using weka-3-6-1

1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/mydata/ > 1.arff
2) java -classpath ./weka.jar weka.filters.unsupervised.attribute.StringToWordVector -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -i /home/me/weka-3-6-1/1.arff -o 2.arff
3) used the gui to select 2.arff, select and run a classification model, save the model

Now I created another directory structure with a test file that I want to classify and did

1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir /home/me/weka-3-6-1/testdata/ > 1.arff
2) java -classpath ./weka.jar weka.filters.unsupervised.attribute.StringToWordVector -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -i /home/me/weka-3-6-1/1.arff -o 2.arff
3) java -classpath ./weka.jar weka.classifiers.lazy.KStar -l lazy.KStar_oct_28_2009.model -T 2.arff -p 0

and ended up with this error:

java.lang.Exception: training and test set are not compatible
    at weka.classifiers.Evaluation.evaluateModel(Unknown Source)
    at weka.classifiers.Classifier.runClassifier(Unknown Source)
    at weka.classifiers.lazy.KStar.main(Unknown Source)

The testdata directory contains two directories, active and lazy, and the active directory contains one file, 1.txt, with content like:

<begin>
jump
jump
walk
eat
snore
</begin>

Can someone please give me a pointer: having created the model, how can I use it to classify a new dataset?

Thanks,
-J



Re: Advice about n-grams and comparing vectors

by Peter Reutemann-3

[...]

> Thanks for the pointers, I have tried the following using weka-3-6-1
>
> 1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir
> /home/me/weka-3-6-1/mydata/ > 1.arff
> 2) java -classpath ./weka.jar
> weka.filters.unsupervised.attribute.StringToWordVector -tokenizer
> "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -i
> /home/me/weka-3-6-1/1.arff -o 2.arff
> 3) used the gui to select 2.arff, select and run a classification model,
> save the model
>
> Now I created another directory structure with a test file that I want to
> classify and did
>
> 1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir
> /home/me/weka-3-6-1/testdata/ > 1.arff
> 2) java -classpath ./weka.jar
> weka.filters.unsupervised.attribute.StringToWordVector -tokenizer
> "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -i
> /home/me/weka-3-6-1/1.arff -o 2.arff
> 3) java -classpath ./weka.jar weka.classifiers.lazy.KStar -l
> lazy.KStar_oct_28_2009.model -T 2.arff -p 0
>
> and ended up with this error:
>
> java.lang.Exception: training and test set are not compatible
>     at weka.classifiers.Evaluation.evaluateModel(Unknown Source)
>     at weka.classifiers.Classifier.runClassifier(Unknown Source)
>     at weka.classifiers.lazy.KStar.main(Unknown Source)

Yes, Weka expects the training set and test set to have exactly the
same structure.

> the testdata directory contains 2 dirs active and lazy and the active
> directory contains 1 file 1.txt with some info like
>
> <begin>
> jump
> jump
> walk
> eat
> snore
> </begin>
>
> Can someone please give me a pointer as to having created the model, how can
> I classify a dataset.

After you have generated the initial datasets using the
TextDirectoryLoader, you have to use batch filtering when applying the
StringToWordVector filter, so that the test set is converted with the
same dictionary as the training set. See FAQ "How do I generate
compatible train and test sets that get processed with a filter?" for
more information.
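
Why filtering the two sets separately leads to incompatible structures can be shown with a toy example. The Python sketch below does not call Weka and is only an illustration under made-up data; `bigrams`, `build_vocab` and `vectorize` are hypothetical helpers. The FAQ mentioned above describes the actual Weka options for batch filtering.

```python
from collections import Counter

def bigrams(text):
    """Return the consecutive pairs of whitespace-separated actions."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

def build_vocab(docs):
    """Collect the distinct bigrams of a set of documents, in sorted order."""
    return sorted({g for d in docs for g in bigrams(d)})

def vectorize(doc, vocab):
    """Count the document's bigrams, one position per vocabulary entry."""
    counts = Counter(bigrams(doc))
    return [counts[g] for g in vocab]

train_docs = ["walk jump walk eat", "sleep snore sleep snore"]
test_docs = ["jump jump walk eat snore"]

# Filtering each set on its own gives two different attribute lists.
train_vocab = build_vocab(train_docs)
test_vocab_alone = build_vocab(test_docs)
print(train_vocab == test_vocab_alone)

# Batch filtering: the training vocabulary is reused for the test set.
train_vectors = [vectorize(d, train_vocab) for d in train_docs]
test_vectors = [vectorize(d, train_vocab) for d in test_docs]
print(len(train_vectors[0]) == len(test_vectors[0]))
```

The first comparison is false because each set produces its own n-gram attributes; the second is true because the test document is expressed in terms of the training attributes only.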

Cheers, Peter

Re: Advice about n-grams and comparing vectors

by ab-18

Once again, thanks for helping out Peter.

I have looked up the FAQ and have used the following:

0) Made two directories, mydata and testing. mydata contains two subdirectories, active and lazy, with text files in each. testing also contains active and lazy subdirectories, with one file each. Both files in testing/active and testing/lazy are also present in the training (mydata) directories.

1) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir ./testing/ > testing1.arff

2) java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir ./mydata/ > training1.arff

3) java -classpath ./weka.jar weka.filters.unsupervised.attribute.StringToWordVector -tokenizer "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -b -i training1.arff -o training2.arff -r testing1.arff -s testing2.arff

Now I have a training set and a test set in the same format.
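
One way to double-check that claim is to compare the attribute declarations in the two ARFF headers. The following is a rough Python sketch under stated assumptions: the function names and the tiny sample headers are invented for illustration, and the comparison is purely textual, so it only approximates whatever compatibility check Weka itself performs.

```python
from itertools import zip_longest


def arff_header(arff_text):
    """Return the @attribute declaration lines of an ARFF file, in order."""
    declarations = []
    for line in arff_text.splitlines():
        stripped = line.strip()
        keyword = stripped.lower()
        if keyword.startswith("@data"):
            break  # the header ends where the data section starts
        if keyword.startswith("@attribute"):
            declarations.append(stripped)
    return declarations


def first_difference(train_text, test_text):
    """Return (position, train_decl, test_decl) of the first mismatch, or None."""
    pairs = zip_longest(arff_header(train_text), arff_header(test_text))
    for position, (a, b) in enumerate(pairs):
        if a != b:
            return position, a, b
    return None


# Made-up headers: one test set filtered on its own, one filtered in batch mode.
TRAIN = """@relation train
@attribute class {active,lazy}
@attribute 'eat walk' numeric
@attribute 'walk sleep' numeric
@data
"""
TEST_SEPARATE = """@relation test
@attribute class {active,lazy}
@attribute 'jump jump' numeric
@attribute 'jump walk' numeric
@data
"""
TEST_BATCH = TRAIN.replace("@relation train", "@relation test")

print(first_difference(TRAIN, TEST_SEPARATE))
print(first_difference(TRAIN, TEST_BATCH))
```

For real files, the text would be read from training2.arff and testing2.arff; a result of None means the declarations match line for line.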

I used the GUI to select a classifier, build a model, and save it.

4) java -classpath ./weka.jar weka.classifiers.lazy.LWL -l lazy.LWL_oct_30_2009.model -T testing2.arff -p 0

The output is as follows:
=== Predictions on test data ===

 inst#     actual  predicted      error
     1      0          0          0    
     2      0          0          0    

Did something go wrong?

The model summary reports:

Relative absolute error                 19.021  %
Root relative squared error             59.9506 %
Attributes:   1018
Test mode:    16-fold cross-validation
Instances:    18

Also, if I have a single new file and want to classify it as lazy or active, will I have to redo the whole procedure, steps 1 through 4? It takes quite a few minutes on my puny laptop to crunch out the tokenized result.

Thanks in advance,
-J



Re: Advice about n-grams and comparing vectors

by Peter Reutemann-3

> I have looked up the FAQ and have used the following:
>
> 0) Made 2 Directories mydata and testing. In mydata there are 2 directories,
> active and lazy, with text files in each. In testing, there are two
> directories, active and lazy which have one file each. Both the files in
> testing/active and testing/lazy are also present in the training (mydata)
> directories.
>
> 1)java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir
> ./testing/ > testing1.arff
>
> 2)java -classpath ./weka.jar weka.core.converters.TextDirectoryLoader -dir
> ./mydata/ > training1.arff
>
> 3)java -classpath ./weka.jar
> weka.filters.unsupervised.attribute.StringToWordVector -tokenizer
> "weka.core.tokenizers.NGramTokenizer -min 2 -max 3" -b -i training1.arff -o
> training2.arff -r testing1.arff -s testing2.arff
>
> Now I have a training and testing set in same format
>
> I used the GUI to select a classifier, build a model, and save it.
>
> 4)java -classpath ./weka.jar weka.classifiers.lazy.LWL -l
> lazy.LWL_oct_30_2009.model -T testing2.arff -p 0

Why do you use the GUI to generate the model? Why not just build the
classifier on the fly from the command line?
-t: training set
-T: test set

If you really want to save the model and reuse it later, use the
-d option.
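
As a rough illustration of the options just mentioned, this small Python sketch only assembles the java command line and prints it; it does not run Weka. The helper name classifier_command and the model file name lwl.model are hypothetical. The jar path, classifier class, and ARFF file names are taken from the commands quoted above.

```python
import shlex


def classifier_command(train_arff, test_arff=None, model_out=None,
                       classifier="weka.classifiers.lazy.LWL",
                       weka_jar="./weka.jar", extra_options=()):
    """Build a Weka classifier command line as a list of arguments.

    -t names the training set, -T the test set, and -d the file the
    trained model is saved to. Further options (for example "-p", "0")
    can be passed through extra_options.
    """
    cmd = ["java", "-classpath", weka_jar, classifier, "-t", train_arff]
    if test_arff is not None:
        cmd += ["-T", test_arff]
    if model_out is not None:
        cmd += ["-d", model_out]
    cmd += list(extra_options)
    return cmd


# Train and test in one go, without the GUI and without a saved model.
on_the_fly = classifier_command("training2.arff", "testing2.arff")

# Train and keep the model for later use (hypothetical file name).
train_and_save = classifier_command("training2.arff", model_out="lwl.model")

print(shlex.join(on_the_fly))
print(shlex.join(train_and_save))
```

The printed lines can be pasted into a shell, or the argument lists can be handed to subprocess.run.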

> The output is as follows:
> === Predictions on test data ===
>
>  inst#     actual  predicted      error
>      1      0          0          0
>      2      0          0          0
>
> did something go wrong?
>
> The model has
>
> Relative absolute error                 19.021  %
> Root relative squared error             59.9506 %
> Attributes:   1018
> Test mode:    16-fold cross-validation
> Instances:    18

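Since you loaded the data in the Explorer, you should have noticed
that StringToWordVector puts the class attribute in first position.
Unless you specify otherwise, the last attribute is used as the class
attribute (Explorer and command line), so here one of the numeric n-gram
attributes was most likely treated as the class. On the command line, set
the class index with -c (e.g. -c first).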
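
The effect of that default can be sketched in a few lines of Python. This only mimics the behaviour described above and is not Weka's own code; resolve_class_index and the attribute list are invented for illustration. It assumes that a class index given as a number is 1-based and that "first" and "last" are accepted as shortcuts.

```python
def resolve_class_index(spec, num_attributes):
    """Map a class-index setting to a 0-based attribute position.

    spec may be None (use the default, i.e. the last attribute),
    "first", "last", or a 1-based number given as int or string.
    """
    if spec is None or spec == "last":
        return num_attributes - 1
    if spec == "first":
        return 0
    index = int(spec)
    if not 1 <= index <= num_attributes:
        raise ValueError("class index out of range: %r" % (spec,))
    return index - 1


# Attribute order as described: class first, then the n-gram attributes.
attributes = ["class", "eat walk", "walk sleep", "sleep eat"]

default_class = attributes[resolve_class_index(None, len(attributes))]
explicit_class = attributes[resolve_class_index("first", len(attributes))]

print("default class attribute: ", default_class)
print("with class index 'first':", explicit_class)
```

With the default, a numeric n-gram attribute ends up as the prediction target, which would be consistent with the purely numeric actual/predicted/error columns in the quoted output.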

BTW 16-fold CV is rather odd...

> Also, consider the case where I have a file and want to test if it is lazy
> or active (classification), will I have to re-do the whole procedure again
> Steps 1 through 4 !! It takes quite a few minutes on my puny laptop to
> crunch out the tokenized result.

There's nothing you can do about that: machine learning and data
mining are computationally expensive and memory-intensive processes. But
instead of performing those steps manually, automate them by writing a
script (batch, bash, Groovy, Jython, Perl, ...).
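
As one possible shape for such a script, here is a hedged Python sketch that chains the four steps quoted above. The function names, the dry_run switch, and the choice of Python are assumptions made for illustration; the class names, options, and file names come from the quoted commands, with -c first added so that the first attribute is used as the class. By default it only prints the commands and runs nothing.

```python
import shlex
import subprocess

WEKA_JAR = "./weka.jar"  # assumed location, as in the quoted commands
TOKENIZER = "weka.core.tokenizers.NGramTokenizer -min 2 -max 3"


def java(main_class, *args):
    """Build a java command line for one Weka class."""
    return ["java", "-classpath", WEKA_JAR, main_class, *args]


def build_steps(train_dir, test_dir, classifier="weka.classifiers.lazy.LWL"):
    """Return (command, stdout_file_or_None) pairs for the whole pipeline."""
    loader = "weka.core.converters.TextDirectoryLoader"
    s2wv = "weka.filters.unsupervised.attribute.StringToWordVector"
    return [
        # Steps 1 and 2: the loader writes ARFF to stdout, so capture it.
        (java(loader, "-dir", train_dir), "training1.arff"),
        (java(loader, "-dir", test_dir), "testing1.arff"),
        # Step 3: batch filtering of the training and test sets together.
        (java(s2wv, "-tokenizer", TOKENIZER, "-b",
              "-i", "training1.arff", "-o", "training2.arff",
              "-r", "testing1.arff", "-s", "testing2.arff"), None),
        # Step 4: build the classifier on the fly and print predictions.
        (java(classifier, "-t", "training2.arff", "-T", "testing2.arff",
              "-c", "first", "-p", "0"), None),
    ]


def run_pipeline(train_dir, test_dir, dry_run=True):
    """Run all steps in order; with dry_run=True nothing is executed."""
    steps = build_steps(train_dir, test_dir)
    if dry_run:
        return steps
    for cmd, out_file in steps:
        if out_file is None:
            subprocess.run(cmd, check=True)
        else:
            with open(out_file, "w") as out:
                subprocess.run(cmd, stdout=out, check=True)
    return steps


for cmd, out_file in run_pipeline("./mydata/", "./testing/"):
    suffix = " > " + out_file if out_file else ""
    print(shlex.join(cmd) + suffix)
```

Calling run_pipeline with dry_run=False would execute the steps for real, which requires java and weka.jar to be available. This does not make the filtering any faster; it only saves retyping.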

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174


Re: Advice about n-grams and comparing vectors

by ab-18



Thanks Peter, much obliged :-)

