question on string to wordvector filter

by paul.adriani

LS,

 

At the moment I am running an experiment in which I have WEKA create word vectors. The input sets are sets of descriptions of news items. The number of descriptions per set varies from 10 to 200 items.

The smaller sets are all subsets of the larger sets.

 

My question is why the number of features does not grow with the number of items. I expected that as the number of items grows, the number of features would grow as well.

 

Thank you!

 

Regards,

 

Paul Adriani

Universiteit van Amsterdam


_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: question on string to wordvector filter

by Peter Reutemann-3

> At the moment I am performing an experiment in which I have WEKA select
> create wordvectors for an experiment.  The input sets are sets of
> descriptions of news items. The number of descriptions varies per set from
> 10 items to 200 items.
>
> The smaller sets are all subsets of the larger sets.
>
>
>
> My question is why the number of features is not growing positively related
> with the number of items, i.e. if the number of items grows the number of
> features grows as well.

You're not providing a lot of information, so I'm only guessing... The
StringToWordVector filter has an option (-W/wordsToKeep) that is an
upper limit of how many words the generated internal dictionary will
contain in the end. If you're already at the limit with your subsets,
adding more data only means that less frequent words will get
discarded.
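
To make the effect of such an upper limit concrete, here is a minimal, self-contained Java sketch. It does not call WEKA at all: the class CappedDictionaryDemo, the method keptWords and the sample sentences are hypothetical, and the cap is only a simplified stand-in for -W/wordsToKeep (WEKA's own tokenising, tie handling and per-class behaviour may differ). It counts words over a set of documents and keeps only the most frequent ones up to the cap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CappedDictionaryDemo {

    // Returns the words that survive a dictionary cap, most frequent first.
    // Simplified assumption: ties are broken alphabetically and the cap is exact.
    static List<String> keptWords(List<String> documents, int wordsToKeep) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        List<String> words = new ArrayList<>(counts.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        // Everything beyond the cap is discarded, i.e. the less frequent words.
        return new ArrayList<>(words.subList(0, Math.min(wordsToKeep, words.size())));
    }

    public static void main(String[] args) {
        // Hypothetical data: the small set is a subset of the large set.
        List<String> small = Arrays.asList(
                "stocks fall sharply",
                "stocks rise again",
                "rain expected today");
        List<String> large = new ArrayList<>(small);
        large.addAll(Arrays.asList(
                "stocks fall on rain",
                "election results today",
                "rain delays election"));

        for (int cap : new int[] {5, 1000}) {
            System.out.println("cap=" + cap
                    + " small=" + keptWords(small, cap).size()
                    + " large=" + keptWords(large, cap).size());
        }
    }
}
```

With a cap of 5, both the small and the large set end up with 5 features, even though the large set contains more distinct words. With a cap of 1000, which neither set reaches, the feature count grows from 8 to 12. If this is what is happening in the experiment, raising the limit above the vocabulary size of the largest set should make the growth visible again.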

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174
