Re: Cluster data by giving vector input

by Aseem Puri

Actually, I want to group my HBase row data. HBase is a database built on top of the Hadoop file system. I created a Lucene index on my HBase table. The Lucene analyzer applies stemming and removes stop words, and I have also calculated tf-idf for each word in a row. So I now have a vector for each row whose entries are tf-idf weights, and I want to group similar rows together.

Please tell me how I should proceed so that similar rows are grouped together. I also want to know how to give this input to a clustering algorithm.

Regards
Aseem Puri

Ashwin Ittoo wrote:
If you are working with text, you can use the StringToWordVector filter
to convert each record into a "bag of words".
It can also stem the words and assign tf-idf weights to them via the
appropriate parameters.
You can choose the sparse vector representation to save space (and memory).
I think the easiest way is to try it in the Explorer, and then copy and
paste the "commands" from the object editor into your code, since the
commands can get pretty long.
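
To make the "bag of words" idea concrete, here is a minimal sketch in plain Java of one common tf-idf variant (raw term count times log(N / df)). It does not call Weka and is not necessarily the formula StringToWordVector applies; the class and method names are made up for illustration, and the input is assumed to be already tokenised, stemmed and stop-word filtered.

```java
import java.util.*;

// Hypothetical helper, not part of Weka or Lucene.
final class TfIdfSketch {
    /** Returns one sparse term -> weight map per document. */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Term frequency: raw count of each term in this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vec = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                double weight = e.getValue() * idf;
                // Keep the vector sparse: terms with zero weight are dropped.
                if (weight != 0.0) {
                    vec.put(e.getKey(), weight);
                }
            }
            vectors.add(vec);
        }
        return vectors;
    }
}
```

A term that occurs in every document gets weight 0 under this variant, so it simply disappears from the sparse vectors.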

If you are dealing with non-textual data, then you should be able to use
the Weka clustering functionality directly; each instance in your ARFF
file, for example, is itself a vector.
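
If the weight vectors already exist outside Weka, one possible route is to write them out as a sparse ARFF file and load that like any other dataset. The sketch below is plain Java with made-up names (SparseArffSketch, toSparseArff); it assumes each row is a term -> weight map and that terms are simple tokens without spaces or quotes, so they can be used as attribute names without escaping.

```java
import java.util.*;

// Hypothetical helper, not part of Weka.
final class SparseArffSketch {
    static String toSparseArff(String relation, List<Map<String, Double>> rows) {
        // Every distinct term becomes one numeric attribute, in sorted order.
        SortedSet<String> vocab = new TreeSet<>();
        for (Map<String, Double> row : rows) {
            vocab.addAll(row.keySet());
        }
        Map<String, Integer> index = new HashMap<>();
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String term : vocab) {
            index.put(term, index.size());
            sb.append("@attribute ").append(term).append(" numeric\n");
        }
        sb.append("\n@data\n");
        for (Map<String, Double> row : rows) {
            // Sparse ARFF rows hold "index value" pairs with zero-based
            // attribute indices in ascending order; omitted values are 0.
            SortedMap<Integer, Double> byIndex = new TreeMap<>();
            for (Map.Entry<String, Double> e : row.entrySet()) {
                if (e.getValue() != 0.0) {
                    byIndex.put(index.get(e.getKey()), e.getValue());
                }
            }
            StringJoiner pairs = new StringJoiner(", ", "{", "}");
            for (Map.Entry<Integer, Double> e : byIndex.entrySet()) {
                pairs.add(e.getKey() + " " + e.getValue());
            }
            sb.append(pairs).append('\n');
        }
        return sb.toString();
    }
}
```

Each row of the result is one instance, i.e. one vector, and only the non-zero weights are written.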

For the weighting, you can use the tf-idf measure. Weka also allows
you to assign different weights to different attributes, but I'm not
sure how to do that.
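
Independently of whatever Weka offers for this, a generic way to weight attributes is to multiply each attribute by the square root of its weight before clustering: plain Euclidean distance on the scaled vectors then equals the weighted Euclidean distance on the originals. The sketch below is plain Java, not a Weka API, and its names are made up for illustration; weights are assumed to be non-negative.

```java
// Hypothetical helper, not part of Weka.
final class AttributeWeightSketch {
    /** Weighted Euclidean distance: sqrt(sum_i w[i] * (a[i] - b[i])^2). */
    static double weightedDistance(double[] a, double[] b, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += w[i] * d * d;
        }
        return Math.sqrt(sum);
    }

    /** Scales each attribute by sqrt(w[i]) so unweighted distance can be used. */
    static double[] scale(double[] x, double[] w) {
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = Math.sqrt(w[i]) * x[i];
        }
        return out;
    }
}
```

With this, any clusterer that uses Euclidean distance can be given the pre-scaled vectors and needs no weighting support of its own.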

ashwin


Puri, Aseem wrote:
>
> Hi
>
> I am very new to Weka. I want to cluster my data so that similar
> data is grouped together. I am thinking of using K-means. As far as
> I have seen, in Weka we give a file (.arff or .csv) as input, and
> clustering is done on that.
>
> But the input I want to give is vector input. I want to give a
> weight for every term in a weight vector. Can anybody please tell
> me how I should proceed?
>
>  
>
> Thanks & Regards
>
> Aseem Puri
>
>  
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Wekalist mailing list
> Send posts to: Wekalist@list.scms.waikato.ac.nz
> List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
> List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
>  
