Cluster data by giving vector input

Hi,

I am very new to Weka. I want to cluster my data so that similar items are grouped together, and I am thinking of using K-means. As far as I have seen, Weka takes a file (.arff or .csv) as input and runs the clustering on that. But the input I want to give is a vector: I want to supply a weight for every term in a weight vector. Can anybody please tell me how I should proceed?

Thanks & Regards,
Aseem Puri
_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
|
|
Re: Cluster data by giving vector input

If you are working with text, you can use the StringToWordVector class to convert each record into a "bag of words". It can also stem the words and assign tf-idf weights to them via the appropriate parameters. You can choose the sparse vector representation to save space (and memory).

I think the easiest way is to try it in the Explorer and then copy and paste the "commands" from the object editor into your code, since the commands can get pretty long.

If you are dealing with non-textual data, you should be able to use the Weka clustering functionality directly; each instance of your ARFF file, for example, is itself a vector.

For the weighting, you can use the tf-idf measure. Weka also allows you to assign different weights to different attributes, but I'm not sure how to do that.

ashwin
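To make the "bag of words" with tf-idf weights concrete, here is a minimal sketch in plain Python. It uses one common textbook variant (raw term count times log(N / document frequency)); this is an assumption for illustration, and the exact formula Weka's StringToWordVector applies depends on its options and may differ. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each tokenized document to a {term: tf-idf weight} dict.

    Textbook variant: weight = term count * log(N / document frequency).
    This is an illustrative assumption, not Weka's exact computation.
    """
    n = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


docs = [["weka", "cluster"], ["weka", "hbase"]]
for vector in tfidf_vectors(docs):
    print(vector)
```

Note that a term occurring in every document ("weka" above) gets weight 0 under this variant, so it contributes nothing to the distance between two records.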
|
|
Re: Cluster data by giving vector input

Actually, I want to group my HBase row data. HBase is a database built on top of the Hadoop file system. I created a Lucene index on my HBase table. The Lucene analyzer applies stemming and removes stop words, and I have also calculated tf-idf for each word in a row. So I now have a vector for each row, which holds the tf-idf weights.

Now I want to group similar rows together. Please tell me how I should proceed, and how I can give this input to a clustering algorithm.

Regards,
Aseem Puri
|
|
|
Re: Cluster data by giving vector input

Instead of generating files, you could just generate a weka.core.Instances object on the fly and feed that into the clusterer. For an example of how to generate weka.core.Instances objects, see the wiki article "Creating an ARFF file":
http://weka.wiki.sourceforge.net/Creating+an+ARFF+file

As each of your rows is weighted, you will have to set the weight of each weka.core.Instance object as well, either by passing the weight directly to the weka.core.Instance constructor or afterwards using the setWeight(double) method. See the Javadoc of that class for more information.

For general API usage, see the FAQ "How do I use WEKA's classes in my own code?". A link to the FAQs is available from the Weka homepage.

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/
Ph. +64 (7) 858-5174
|
|
Re: Cluster data by giving vector input

Thanks Peter and Ashwin.

I read the links you sent me. I used the StringToWordVector class and saw how it generates a sparse ARFF file as input. But instead of using it, I created the same kind of file on my own, because my Lucene analyzer already removes stop words, does stemming and gives a weight vector for every message. So I generate the file myself and give it as input to the K-means algorithm.

The problem is that for 100 messages in HBase, if I specify 4 clusters, around 90% of the messages end up in one cluster. So I cannot see that K-means is grouping the messages properly. Also, if I use EM, I get only one group.

Please tell me whether there is a problem with my input data or with something else. How can I improve the grouping of my messages?

Regards,
Aseem Puri
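For reference, a hand-written sparse ARFF file of the kind described above could be produced along these lines. This is only a sketch in plain Python, not the code used in this thread: the function name `to_sparse_arff`, the relation name and the sample terms are all hypothetical, and terms are assumed to be simple stemmed tokens without quotes or whitespace. In the sparse format each data row lists only its non-zero entries as "index value" pairs, with 0-based attribute indices in ascending order.

```python
def to_sparse_arff(relation, vocabulary, rows):
    """Return sparse ARFF text for rows given as {term: weight} dicts.

    vocabulary fixes the attribute order; every term used in rows
    must appear in it. All names here are illustrative assumptions.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    lines = ["@relation " + relation, ""]
    for term in vocabulary:
        # One numeric attribute per term in the vocabulary.
        lines.append("@attribute '%s' numeric" % term)
    lines += ["", "@data"]
    for row in rows:
        # Keep only non-zero weights, sorted by attribute index.
        cells = sorted((index[t], w) for t, w in row.items() if w != 0)
        body = ", ".join("%d %r" % (i, float(w)) for i, w in cells)
        lines.append("{" + body + "}")
    return "\n".join(lines) + "\n"


vocabulary = ["hbase", "index", "weka"]
rows = [{"weka": 0.5, "hbase": 1.25}, {"index": 2.0}]
print(to_sparse_arff("messages", vocabulary, rows))
```

Checking a few generated rows by eye against the vocabulary order is a cheap way to rule out an indexing mistake in the input before looking at the clusterer itself.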