Re: RE: Wekalist Digest, Vol 77, Issue 6


Re: RE: Wekalist Digest, Vol 77, Issue 6

by Polczynski, Mark


>> I am using OneR on a version of the weather dataset (outlook/temp/humid/windy/play) which has numerical attributes for temperature and humidity. Is there a way to see the bin ranges that OneR uses to discretize these numerical attributes?

> If OneR chooses a numeric attribute to base its model on (the
> classifier uses only a single attribute!), then the output is as
> follows (UCI dataset "balance-scale"):
>
> left-weight:
>         < 2.5   -> R
>         >= 2.5  -> L
>
> For numeric attributes, the generated rule holds the breakpoints (=
> borders between bins). The above example has one breakpoint and
> therefore two bins.

Thank you Peter. I believe I am asking a different question. It seems that in order for OneR to select the single attribute, it must discretize the numerical attributes into nominal attributes first. I'm assuming that it does discretization using the scheme outlined in the paper by Nevill-Manning, Holmes and Witten. I know that I can specify the minBucketSize in the GenericObjectEditor that OneR will use when discretizing the values. What I would like to see is the bins, or perhaps the term is "buckets", that OneR put the numerical values into in order to select the single attribute. The goal is to compare this with the nominal attributes in the version of the weather dataset that has all nominal values to see what the difference is between the attribute values for temperature and humidity. I would then like to compare this with equal-width and equal-frequency discretization.

Thanks again,

Mark Polczynski

_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Re: RE: Wekalist Digest, Vol 77, Issue 6

by Peter Reutemann-3


>>> I am using OneR on a version of the weather dataset (outlook/temp/humid/windy/play) which has numerical attributes for temperature and humidity. Is there a way to see the bin ranges that OneR uses to discretize these numerical attributes?
>>
>> If OneR chooses a numeric attribute to base its model on (the
>> classifier uses only a single attribute!), then the output is as
>> follows (UCI dataset "balance-scale"):
>>
>> left-weight:
>>        < 2.5   -> R
>>        >= 2.5  -> L
>>
>> For numeric attributes, the generated rule holds the breakpoints (=
>> borders between bins). The above example has one breakpoint and
>> therefore two bins.
>
> Thank you Peter.  I believe I am asking a different question.

Nope, maybe I wasn't clear enough.

> It seems that in order for OneR to select the single attribute, it must discretize the numerical attributes into nominal attributes first.

The breakpoints of a rule are OneR's own form of discretization. It
doesn't use any of Weka's discretization filters.

Here is a short summary of how OneR computes the breakpoints for a
numeric rule (for a specific numeric attribute), based on what I can
gather from the code (method "newNumericRule"):
1. sort the instances in ascending order of the selected numeric
attribute and set the instance index to the start of the sorted list
2. reset the class label counters
3. increment the instance index and increment the counter of that
instance's class label
4. as soon as one class label's count meets the minBucketSize
requirement, that label becomes the candidate for a new breakpoint;
until then, go back to 3.
5. keep counting class labels (i.e., incrementing the instance index)
until the class label changes
6. if the candidate's class label is the same as that of the previous
breakpoint, merge the two; otherwise we have a new breakpoint at the
given position. In both cases, update the breakpoint value with the
current value from the sorted instance list
7. go back to 2. and continue going through the instances.
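
For anyone who wants to experiment outside of Weka, the steps above can be
sketched roughly as follows. This is only an illustration of the procedure
as summarized here, not Weka's actual code: the function name bucket_rule
is made up, and two details are assumptions of mine, namely that a bucket
is never split between two equal attribute values and that the breakpoint
is placed halfway between the last value of one bucket and the first value
of the next.

```python
import math

def bucket_rule(values, labels, min_bucket_size):
    """Return [(upper_bound, label), ...]; a value x belongs to the first
    bucket with x < upper_bound. Sketch only, not Weka's implementation."""
    # Step 1: sort by attribute value (stable, so ties keep input order).
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    rule = []  # each entry: [upper_bound, label]
    i = 0
    while i < n:
        counts = {}  # step 2: reset the class label counters
        # Steps 3-4: advance until one label reaches min_bucket_size.
        while i < n:
            label = pairs[i][1]
            counts[label] = counts.get(label, 0) + 1
            i += 1
            if counts[label] >= min_bucket_size:
                break
        # Step 5: keep going while the class label stays the same.
        while i < n and pairs[i][1] == label:
            counts[label] += 1
            i += 1
        # Assumption: never split between two equal attribute values.
        while i < n and pairs[i][0] == pairs[i - 1][0]:
            counts[pairs[i][1]] = counts.get(pairs[i][1], 0) + 1
            i += 1
        # The bucket predicts its most frequent label (first seen wins ties).
        candidate = max(counts, key=counts.get)
        # Assumption: the breakpoint is the midpoint to the next value.
        upper = (pairs[i - 1][0] + pairs[i][0]) / 2 if i < n else math.inf
        # Step 6: merge with the previous bucket if it predicts the same label.
        if rule and rule[-1][1] == candidate:
            rule[-1][0] = upper
        else:
            rule.append([upper, candidate])
    return [tuple(entry) for entry in rule]
```

Calling bucket_rule(temperatures, play_labels, 3) on a numeric column and
its class labels returns the buckets as (upper bound, predicted label)
pairs, which can then be compared against Weka's printed rule.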

After rules for all attributes apart from the class attribute have
been generated, the rule that gets the most class labels right is
selected as the only rule used in the model (OneR = one rule).
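
The selection step can be sketched in the same spirit. The snippet below
is a simplified illustration that handles nominal attributes only (a
numeric attribute would first need a bucket rule as described above); the
names nominal_rule and select_one_rule are hypothetical, and the tie
handling (first attribute wins) is an assumption rather than something
taken from Weka.

```python
from collections import Counter, defaultdict

def nominal_rule(column, labels):
    """Map each attribute value to its majority class and count the hits."""
    by_value = defaultdict(Counter)
    for value, label in zip(column, labels):
        by_value[value][label] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    correct = sum(c[rule[v]] for v, c in by_value.items())
    return rule, correct

def select_one_rule(columns, labels):
    """Return (attribute_name, rule, correct) for the best single attribute."""
    best = None
    for name, column in columns.items():
        rule, correct = nominal_rule(column, labels)
        # Assumption: on a tie, the attribute seen first is kept.
        if best is None or correct > best[2]:
            best = (name, rule, correct)
    return best
```

Here columns is a plain dict from attribute name to its list of values,
with labels holding the class value of each instance in the same order.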

> I'm assuming that it does discretization using the scheme outlined in the paper by Nevill-Manning, Holmes and Witten. I know that I can specify the minBucketSize in the GenericObjectEditor that OneR will use when discretizing the values. What I would like to see is the bins, or perhaps the term is "buckets", that OneR put the numerical values into in order to select the single attribute. The goal is to compare this with the nominal attributes in the version of the weather dataset that has all nominal values to see what the difference is between the attribute values for temperature and humidity. I would then like to compare this with equal-width and equal-frequency discretization.

The breakpoints are the borders between the bins. The example
from my previous post has exactly one breakpoint for the chosen
attribute "left-weight", which results in two bins:
- bin 1:
  (-Inf, 2.5)
- bin 2:
  [2.5, +Inf)
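
If it helps with the comparison against other discretization schemes, the
breakpoints printed in the rule can be turned into explicit bins with a
few lines of code. This is just a sketch; the helper names
bins_from_breakpoints and bin_index are made up, and each bin is treated
as closed on the left and open on the right, matching the "<" / ">="
output above.

```python
import bisect
import math

def bins_from_breakpoints(breakpoints):
    """Return the bins as (lower, upper) pairs, each meaning [lower, upper)."""
    edges = [-math.inf, *sorted(breakpoints), math.inf]
    return list(zip(edges[:-1], edges[1:]))

def bin_index(breakpoints, x):
    """Return the 0-based index of the bin that x falls into."""
    # bisect_right sends a value equal to a breakpoint into the upper bin.
    return bisect.bisect_right(sorted(breakpoints), x)
```

For the example above, bins_from_breakpoints([2.5]) gives the two bins
listed, and bin_index([2.5], 2.5) puts the value 2.5 into the second one.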

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174
