Cluster data by giving vector input


Cluster data by giving vector input

by Aseem Puri

Hi,

I am very new to Weka. I want to cluster my data so that similar items are grouped together, and I am thinking of using k-means. As far as I have seen, Weka takes a file (.arff or .csv) as input and clusters that.

However, the input I want to provide is a vector: I want to supply a weight for every term in a weight vector. Can anybody please tell me how I should proceed?

 

Thanks & Regards

Aseem Puri

 


_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Cluster data by giving vector input

by Ashwin Ittoo

If you are working with text, you can use the StringToWordVector class
to convert each record into a "bag of words".
It is also possible to stem the words and assign tf-idf weights to
them via the appropriate parameters.
You can choose the sparse vector representation to save space (and memory).
I think the easiest way is to try it in the Explorer and then copy and
paste the "commands" from the object editor into your code, since the
commands can get pretty long.

If you are dealing with non-textual data, then you should be able to use the
Weka clustering functionality directly; each instance of your ARFF file, for
example, is itself a vector.

For the weighting, you can use the tf-idf measure. Weka also allows
you to assign different weights to different attributes, but I'm not
sure how to do that.
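
To make the tf-idf measure mentioned above concrete, here is a minimal sketch in plain Java. It uses the textbook form tf(t, d) * ln(N / df(t)); this is an assumption for illustration and not necessarily the exact formula that StringToWordVector or Lucene applies. The class and method names (TfIdfSketch, tfIdf) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Weka: textbook tf-idf over tokenized documents.
final class TfIdfSketch {

    /** Returns one term-to-weight map per document, using tf * ln(N / df). */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: the number of documents each term occurs in.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        int n = docs.size();
        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term frequency within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            // TreeMap keeps the terms in a stable, sorted order.
            Map<String, Double> weights = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                weights.put(e.getKey(), e.getValue() * idf);
            }
            result.add(weights);
        }
        return result;
    }
}
```

Note that a term occurring in every document gets weight 0 under this form, so it contributes nothing to the distance between documents.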

ashwin



Re: Cluster data by giving vector input

by Aseem Puri

Actually, I want to group my HBase row data. HBase is a database built on top of the Hadoop file system. I created a Lucene index on my HBase table. The Lucene analyzer performs stemming and removes stop words, and I have also calculated tf-idf for each word in a row. So I now have a vector for each row, whose entries are tf-idf weights, and I want to group similar rows together.

Please tell me how I should proceed so that similar rows are grouped together. I would also like to know how to give this input to a clustering algorithm.

Regards
Aseem Puri


Re: Cluster data by giving vector input

by Peter Reutemann-3


Instead of generating files, you could just generate a
weka.core.Instances object on the fly and feed that into the clusterer.
For an example of how to generate weka.core.Instances objects, see
the wiki article "Creating an ARFF file":
  http://weka.wiki.sourceforge.net/Creating+an+ARFF+file

As each of your rows is weighted, you will have to set the weight of each
weka.core.Instance object as well, either by passing the weight directly to
the weka.core.Instance constructor or afterwards by calling the
setWeight(double) method. See the Javadoc of that class for more
information.

For general API usage, see the FAQ "How do I use WEKA's classes in my own
code?". A link to the FAQs is available from the Weka homepage.

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174


Re: Cluster data by giving vector input

by Aseem Puri

Thanks, Peter and Ashwin.

I read the links you sent me. I tried the StringToWordVector class and saw how it generates a sparse ARFF file as input. Instead of using it, though, I created the same kind of file myself, because my Lucene analyzer already removes stop words, performs stemming, and gives a weight vector for every message. So I generate the file myself and give it as input to the k-means algorithm. The problem is that for 100 messages in HBase, if I specify 4 clusters, around 90% of the messages end up in one cluster, so I cannot tell whether k-means is grouping the messages properly. Also, if I use EM, I get only one group.

Please tell me whether there is a problem with my input data or with something else. How can I improve the grouping of my messages?
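
For reference, a sparse ARFF file of the kind described above could be produced along the following lines. This is only a sketch under stated assumptions: the class and method names (SparseArffSketch, toSparseArff) are hypothetical, every term is assumed to be a plain token with no spaces or quotes (so no attribute-name quoting is done), and each row is a term-to-weight map such as the tf-idf vector of one message. In the sparse format, each data line lists "index value" pairs with 0-based attribute indices in ascending order, and omitted attributes count as 0.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.StringJoiner;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Weka: renders term-weight maps as sparse ARFF text.
final class SparseArffSketch {

    static String toSparseArff(String relation, List<Map<String, Double>> rows) {
        // Collect the vocabulary in sorted order; a term's position is its attribute index.
        SortedSet<String> vocab = new TreeSet<>();
        for (Map<String, Double> row : rows) {
            vocab.addAll(row.keySet());
        }

        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");

        Map<String, Integer> index = new HashMap<>();
        for (String term : vocab) {
            index.put(term, index.size());
            sb.append("@attribute ").append(term).append(" numeric\n");
        }

        sb.append("\n@data\n");
        for (Map<String, Double> row : rows) {
            // Order the non-zero entries by attribute index, as the sparse format expects.
            SortedMap<Integer, Double> byIndex = new TreeMap<>();
            for (Map.Entry<String, Double> e : row.entrySet()) {
                if (e.getValue() != 0.0) {
                    byIndex.put(index.get(e.getKey()), e.getValue());
                }
            }
            StringJoiner line = new StringJoiner(", ", "{", "}");
            for (Map.Entry<Integer, Double> e : byIndex.entrySet()) {
                line.add(e.getKey() + " " + e.getValue());
            }
            sb.append(line.toString()).append('\n');
        }
        return sb.toString();
    }
}
```

Comparing a few lines of such output against what StringToWordVector writes for the same messages is one way to check that the hand-built file has the intended attribute indices and weights.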


