Using different distances for attributes, is it possible?

Using different distances for attributes, is it possible?

by nikita m.

I have a dataset with numeric and string attributes and want to use K-means, maybe later even X-means. Is there any possibility to use different metrics for the attributes, e.g. Euclidean for numeric and edit distance for strings? Thanks in advance.

nikita m.

Re: Using different distances for attributes, is it possible?

by Peter Reutemann-3

> I have a dataset with numeric and string attributes and want to use K-means,
> maybe later even X-means. Is there any possibility to use different metrics
> for the attributes, e.g. Euclidean for numeric and edit distance for strings?

No, you can only use a single distance function (though you can
specify which attribute ranges to use in the calculation). But you can
always implement your own distance function that uses different
sub-distance functions for the different attribute types. A distance
function has to implement the interface weka.core.DistanceFunction.
See the other distance functions for implementation examples.
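
As a rough illustration of that idea, here is a minimal, self-contained sketch in plain Java. It deliberately does not implement weka.core.DistanceFunction, whose exact method set depends on the Weka version. The class name MixedDistanceSketch, the Object[] row representation, and the choice to scale the edit distance by the length of the longer string are all assumptions made for this example. Numeric values are assumed to be scaled to [0, 1] already.

```java
/** Hypothetical sketch: one overall distance built from per-type sub-distances. */
final class MixedDistanceSketch {

    /** Levenshtein edit distance via the classic two-row dynamic program. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b costs j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" costs i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // swap rows so prev always holds the latest row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Sub-distance for strings: edit distance scaled into [0, 1] (an assumption). */
    static double stringDistance(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) editDistance(a, b) / longest;
    }

    /**
     * Overall distance: Euclidean-style combination of the per-attribute
     * sub-distances. Each cell must be either a Number or a String.
     */
    static double distance(Object[] x, Object[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("rows must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d;
            if (x[i] instanceof Number && y[i] instanceof Number) {
                d = ((Number) x[i]).doubleValue() - ((Number) y[i]).doubleValue();
            } else if (x[i] instanceof String && y[i] instanceof String) {
                d = stringDistance((String) x[i], (String) y[i]);
            } else {
                throw new IllegalArgumentException("attribute " + i + " has mismatched types");
            }
            sumOfSquares += d * d;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

In Weka itself, the same dispatch would sit inside a class implementing weka.core.DistanceFunction. How the string and numeric parts are weighted against each other is a modelling decision that this sketch does not settle.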

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174

_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Using different distances for attributes, is it possible?

by Mark Hall-9

On 6/30/09 5:14 AM, nikita m. wrote:
> I have a dataset with numeric and string attributes and want to use K-means,
> maybe later even X-means. Is there any possibility to use different metrics
> for the attributes, e.g. Euclidean for numeric and edit distance for strings?
> Thanks in advance.

Not without changing the code. Note that the way the centroids are
computed matters if you want K-means to be guaranteed to minimize the
within-cluster error for the chosen distance function. The component-wise
mean is the correct choice for squared error (Euclidean distance), while
the component-wise median minimizes the Manhattan distance.
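
The one-dimensional case makes this concrete. The following sketch is plain Java with hypothetical class and method names, not Weka code. It computes both candidate centres for a sample and the two error measures, so the claim can be checked numerically.

```java
import java.util.Arrays;

/** Hypothetical sketch: which centre minimizes which within-cluster error. */
final class CentroidSketch {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    static double median(double[] values) {
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Sum of squared deviations from the centre (squared Euclidean error). */
    static double squaredError(double[] values, double centre) {
        double sum = 0.0;
        for (double v : values) {
            sum += (v - centre) * (v - centre);
        }
        return sum;
    }

    /** Sum of absolute deviations from the centre (Manhattan error). */
    static double absoluteError(double[] values, double centre) {
        double sum = 0.0;
        for (double v : values) {
            sum += Math.abs(v - centre);
        }
        return sum;
    }
}
```

For the sample {1, 2, 3, 4, 100} the mean is 22 and the median is 3. The mean gives the smaller squared error (7610 versus 9415), while the median gives the smaller absolute error (101 versus 156).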

Cheers,
Mark.

--
Mark Hall
Senior Developer/Consultant, Pentaho Open Source Business Intelligence
Citadel International, Suite 340, 5950 Hazeltine National Dr., Orlando,
FL 32822, USA
+64 7 847-3537 office, +64 21 399-132 mobile, +1 815 550-8637 fax,
Skype: mark.andrew.hall, Yahoo: mark_andrew_hall

Re: Using different distances for attributes, is it possible?

by nikita m.

> No, you can only use a single distance function (though you can
> specify which attribute ranges to use in the calculation).

> A distance function has to implement the interface weka.core.DistanceFunction.

It seems like I have to implement a new distance function. I am using 3.6.1
and already tried to use my own distance function, but KMeans insists on Euclidean or
Manhattan distance, and as far as I know, XMeans is using KMeans as well.

So, what can I do?
- Rewrite the Euclidean distance with some extra functionality for Strings?
- Force Kmeans to accept my own distance?
- Something else?

What I'm trying to do is use Euclidean distance for numeric attributes and edit distance
for string attributes, then combine the values into an overall distance. This distance
should then be used by KMeans and/or XMeans.

Re: Using different distances for attributes, is it possible?

by Peter Reutemann-3

>> No, you can only use a single distance function (though you can
>> specify which attribute ranges to use in the calculation).
>
>> A distance function has to implement the interface
>> weka.core.DistanceFunction.
>
> It seems like I have to implement a new distance function. I am using 3.6.1
> and already tried to use my own distance function, but KMeans insists on
> Euclidean or Manhattan distance -

You might want to change the setDistanceFunction method so that it
accepts your distance function as well.
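
A sketch of what such a change could look like, written with stand-in types rather than the real Weka classes. The interface DistanceFn, the three distance classes and ClustererSketch are all hypothetical, and the actual check in KMeans 3.6.1 may be written differently, so treat this only as the general shape of the edit.

```java
/** Hypothetical stand-in for a distance-function interface. */
interface DistanceFn {
    double distance(double[] a, double[] b);
}

final class EuclideanFn implements DistanceFn {
    public double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }
}

final class ManhattanFn implements DistanceFn {
    public double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}

/** The user's own distance; the body is only a placeholder. */
final class MyMixedFn implements DistanceFn {
    public double distance(double[] a, double[] b) {
        // Placeholder: a real version would combine per-attribute sub-distances.
        return new EuclideanFn().distance(a, b);
    }
}

/** Hypothetical clusterer whose setter only accepts known distance classes. */
final class ClustererSketch {
    private DistanceFn distanceFn = new EuclideanFn();

    void setDistanceFunction(DistanceFn df) {
        boolean allowed = df instanceof EuclideanFn
                || df instanceof ManhattanFn
                || df instanceof MyMixedFn; // the added case
        if (!allowed) {
            throw new IllegalArgumentException(
                    "unsupported distance function: " + String.valueOf(df));
        }
        distanceFn = df;
    }

    DistanceFn getDistanceFunction() {
        return distanceFn;
    }
}
```

Allowing the extra class only gets the distance accepted. As Mark Hall's reply points out, the centroid computation also has to match the distance if the within-cluster error is meant to be minimized.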

> and as far as I know, XMeans is using KMeans as well.

XMeans isn't using KMeans, it's a completely separate implementation.
From a quick look at the code, there don't seem to be any restrictions
regarding distance functions.

[...]

Cheers, Peter

Re: Using different distances for attributes, is it possible?

by nikita m.

> You might want to change the setDistanceFunction method, adding your
> distance function to be allowed as well.
OK, I will try this.

> XMeans isn't using KMeans, it's a completely separate implementation.
You are right. I assumed this because I took too 'quick' a look at the code.

> From a quick look at the code, there don't seem to be any restrictions
> regarding distance functions.
I'm glad to read that there are no fundamental restrictions.

Thanks for the help
nikita