remove double string after the use of a stemming algorithm

Hi,

At the moment I am using the Dutch word stemmer separately. When I apply the stemmer to a dataset with a large number of single words (the output of the StringToWordVector filter, e.g. the attached file), the result has the same number of features. Is it possible to reduce the number of unique features and adapt the frequencies of the stemmed words?

For instance:

Nederlandse
Nederlands
Nederlanders
Nederlander
Nederland

becomes

Nederland
Nederland
Nederlander
Nederlander
Nederland

and should become

Nederland (freq 3)
Nederlander (freq 2)

Kind regards,
Paul Adriani Ba. sc. en ssc.
Jan van Riebeekstraat 14-3
1057ZX Amsterdam
tel 0644141917
Infocaster BV

_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: remove double string after the use of a stemming algorithm

> At the moment I am using the dutch wordstemmer separately. When applying the
> stemmer to a dataset with a large amount of single words, the output of the
> stringtowordvector filter, e.g. the attached file, the output has the same
> amount of features. Is it possible to reduce the number of unique features
> and adapting the frequencies of the stemmed words?
>
> For instance:
>
> Nederlandse
> Nederlands
> Nederlanders
> Nederlander
> Nederland
>
> Becomes
>
> Nederland
> Nederland
> Nederlander
> Nederlander
> Nederland
>
> And should become
>
> Nederland (freq 3)
> Nederlander (freq 2)

Instead of stemming the text outside Weka, I've converted your data into an ARFF file (see attached file) and performed all preprocessing within Weka using the StringToWordVector filter. As stemmer I'm using the "dutch" stemmer of the snowball stemmers (see the FAQ "The snowball stemmers don't work, what am I doing wrong?" if you haven't added them to Weka yet).

When not using word counts, the generated "Nederland" and "Nederlander" attributes contain only 1 (= binary attribute, word occurs or not). But if I turn on word counts (outputWordCounts/-C), then I get 3 for "Nederland" and 2 for "Nederlander".

Does that help?

Cheers, Peter

--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/ Ph. +64 (7) 858-5174
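To make the difference between the binary output and the word-count output concrete, here is a minimal sketch in plain Python. It does not call Weka or the snowball stemmer: the list of stems is simply the already-stemmed output quoted in this thread, treated as one document, and the function name `word_vector` and its `output_word_counts` parameter are made up for illustration.

```python
from collections import Counter


def word_vector(stems, output_word_counts=False):
    """Collapse a list of stemmed tokens into one feature per unique stem.

    With output_word_counts=False each feature is 1 (the stem occurs);
    with output_word_counts=True each feature holds the stem's frequency.
    """
    counts = Counter(stems)
    if output_word_counts:
        return dict(counts)
    return {stem: 1 for stem in counts}


# Stemmed words copied from the example in this thread (assumed input,
# not produced by this code).
stems = ["Nederland", "Nederland", "Nederlander", "Nederlander", "Nederland"]

binary = word_vector(stems)
counted = word_vector(stems, output_word_counts=True)

print(binary)   # {'Nederland': 1, 'Nederlander': 1}
print(counted)  # {'Nederland': 3, 'Nederlander': 2}
```

Either way the five tokens collapse into two features; only the stored value changes, which matches the 3 and 2 reported above once word counts are turned on.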
RE: remove double string after the use of a stemming algorithm

Hi Peter,

Thank you for the response. In Weka this method works fine and the result is good. However, I cannot apply the same method to the file I attached. For some reason it does not use the -R option to select only the first attribute. Can you help?

Kind regards,
Paul

-----Original message-----
From: wekalist-bounces@... [mailto:wekalist-bounces@...] On behalf of Peter Reutemann
Sent: Monday, 9 November 2009 8:57
To: Weka machine learning workbench list.
Subject: Re: [Wekalist] remove double string after the use of a stemming algorithm
Re: remove double string after the use of a stemming algorithm

> In weka this method works fine and the result is
> good. However, I cannot apply the same method to the file I attached. For
> some reason it does not use the -R option to only select the first
> attribute. Can you help?

The filter automatically works only on STRING attributes. If the dataset contains only a single STRING attribute, as in your case, you can just leave "first-last".

The more important question is: what did you use as class attribute? In your case, the last attribute is a numeric one. The filter behaves differently depending on whether you don't define a class attribute, choose a numeric one or choose a nominal one. By default, the filter operates on a per-class (label) basis.

I had no problem processing your dataset, using a post-3.7.0 release code base. What's the exact problem that you're experiencing?

BTW please don't top-post (does anybody bother reading the mailing list etiquette??).

Cheers, Peter
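As a rough illustration of why "first-last" is enough here, the sketch below picks the attributes that fall inside an index range and are of type string. This is plain Python, not Weka's code: the function `attributes_to_process`, the `(name, type)` schema format and the attribute names are all hypothetical, chosen only to mirror a dataset with one string attribute followed by a numeric one.

```python
def attributes_to_process(attributes, first=1, last=None):
    """Return the names of string attributes inside a 1-based index range.

    attributes: list of (name, type) pairs in dataset order.
    first, last: inclusive range; last=None means the final attribute.
    """
    if last is None:
        last = len(attributes)
    selected = []
    for index, (name, attr_type) in enumerate(attributes, start=1):
        in_range = first <= index <= last
        if in_range and attr_type == "string":
            selected.append(name)
    return selected


# Hypothetical schema: one string attribute, then a numeric one.
schema = [("text", "string"), ("score", "numeric")]

print(attributes_to_process(schema))        # range first-last -> ['text']
print(attributes_to_process(schema, 1, 1))  # range 1 only     -> ['text']
```

With a single string attribute, the full range and the range restricted to the first attribute select the same thing, so narrowing the range changes nothing in this sketch.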