Remove duplicate strings after using a stemming algorithm


by paul.adriani

Hi,

At the moment I am using the Dutch word stemmer separately. When I apply the stemmer to a dataset with a large number of single words (the output of the StringToWordVector filter, e.g. the attached file), the output still has the same number of features. Is it possible to reduce the number of unique features and adapt the frequencies of the stemmed words accordingly?

 

For instance:

Nederlandse
Nederlands
Nederlanders
Nederlander
Nederland

Becomes

Nederland
Nederland
Nederlander
Nederlander
Nederland

And should become

Nederland (freq 3)
Nederlander (freq 2)

 

 

Kind regards,

Paul

Paul Adriani Ba. sc. en ssc.
Jan van Riebeekstraat 14-3
1057ZX Amsterdam
tel 0644141917

Infocaster BV
paul@...

 


_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Remove duplicate strings after using a stemming algorithm

by Peter Reutemann-3

> At the moment I am using the dutch wordstemmer separately. When applying the
> stemmer to a dataset with a large amount of single words, the output of the
> stringtowordvector filter, e.g. the attached file, the output has the same
> amount of features. Is it possible to reduce the number of unique features
> and adapting the frequencies of the stemmed words?
>
> For instance:
>
> Nederlandse
> Nederlands
> Nederlanders
> Nederlander
> Nederland
>
> Becomes
>
> Nederland
> Nederland
> Nederlander
> Nederlander
> Nederland
>
> And should become
>
> Nederland (freq 3)
> Nederlander (freq 2)

Instead of stemming the text outside Weka, I've converted your data
into an ARFF file (see attached file) and performed all preprocessing
within Weka using the StringToWordVector filter.

As the stemmer, I'm using the "dutch" stemmer of the Snowball stemmers
(see the FAQ "The snowball stemmers don't work, what am I doing
wrong?" if you haven't added them to Weka yet). Without word counts,
the generated "Nederland" and "Nederlander" attributes contain only 1
(= binary attribute, word occurs or not). But if I turn on word counts
(outputWordCounts/-C), I get 3 for "Nederland" and 2 for
"Nederlander".
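
To make the difference between the two modes concrete, here is a small
sketch in plain Java. It does not use Weka or a real stemmer: it starts
from the already-stemmed words in the example above, and the names
StemCounts, countStems and presence are made up for this illustration.
It only mimics the effect described here (one attribute per distinct
stem, holding either a 0/1 flag or a count), not how the filter is
implemented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Weka.
class StemCounts {

    // One entry per distinct stem, holding how often it occurs
    // (comparable to turning word counts on).
    static Map<String, Integer> countStems(List<String> stems) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String stem : stems) {
            counts.merge(stem, 1, Integer::sum);
        }
        return counts;
    }

    // One entry per distinct stem, holding only 1 for "occurs"
    // (comparable to the binary output without word counts).
    static Map<String, Integer> presence(List<String> stems) {
        Map<String, Integer> flags = new LinkedHashMap<>();
        for (String stem : stems) {
            flags.put(stem, 1);
        }
        return flags;
    }

    public static void main(String[] args) {
        // The stemmed words from the example in this thread.
        List<String> stems = Arrays.asList(
                "Nederland", "Nederland", "Nederlander",
                "Nederlander", "Nederland");

        System.out.println(presence(stems));   // {Nederland=1, Nederlander=1}
        System.out.println(countStems(stems)); // {Nederland=3, Nederlander=2}
    }
}
```

Either way, the five input words collapse into two features; the
counting variant is the one that reproduces the "freq 3" and "freq 2"
values asked for.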

Does that help?

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174



Attachment: dutch.arff (192 bytes)

RE: Remove duplicate strings after using a stemming algorithm

by paul.adriani

Hi Peter,

Thank you for the response. In Weka this method works fine and the result is
good. However, I cannot apply the same method to the file I attached. For
some reason it does not use the -R option to select only the first
attribute. Can you help?

Kind regards,

Paul

-----Original message-----
From: wekalist-bounces@...
[mailto:wekalist-bounces@...] On behalf of Peter Reutemann
Sent: Monday, 9 November 2009 8:57
To: Weka machine learning workbench list.
Subject: Re: [Wekalist] remove dubble string after the use of a stemming
algorithm


Attachment: feature_selection_nominal_3696.arff (188K)

Re: Remove duplicate strings after using a stemming algorithm

by Peter Reutemann-3

> In weka this method works fine and the result is
> good. However, I cannot apply the same method to the file I attached. For
> some reason it does not use the -R option to only select the first
> attribute. Can you help?

The filter automatically works only on STRING attributes. If the
dataset contains only a single STRING attribute, as in your case,
you can just leave the range at "first-last". The more important
question is: what did you use as the class attribute? In your case,
the last attribute is a numeric one. The filter behaves differently
depending on whether you define no class attribute, choose a numeric
one, or choose a nominal one. By default, the filter operates on a
per-class (label) basis.

I had no problem processing your dataset using a code base from after
the 3.7.0 release. What exact problem are you experiencing?

BTW please don't top-post (does anybody bother reading the mailing
list etiquette??).

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174
