Classifiers


Classifiers

by Fernando Zuher-2

Hi people,

I have been testing a database with several classifiers, but I don't
know how to choose one.

Should I choose the one with the best "Correctly Classified Instances"?
But what about "Root relative squared error"? Should I disregard it?

In the examples below, the best classifier is meta.RandomSubSpace,
but its "Root relative squared error" is 20.1967 %. What is that?

***********************************************************************************
meta.RandomSubSpace
Correctly Classified Instances      172478               99.9988 %
Incorrectly Classified Instances         2                0.0012 %
Kappa statistic                          0.9993
Mean absolute error                      0.072
Root mean squared error                  0.0952
Relative absolute error                 16.2081 %
Root relative squared error             20.1967 %
Total Number of Instances           172480
***********************************************************************************

***********************************************************************************
trees.LADTree '-B 10' -4940716114518300302
Correctly Classified Instances      172477               99.9983 %
Incorrectly Classified Instances         3                0.0017 %
Kappa statistic                          0.9989
Mean absolute error                      0.0015
Root mean squared error                  0.0079
Relative absolute error                  0.3391 %
Root relative squared error              1.6848 %
Total Number of Instances           172480
***********************************************************************************

***********************************************************************************
rules.DTNB
Correctly Classified Instances      172477               99.9983 %
Incorrectly Classified Instances         3                0.0017 %
Kappa statistic                          0.9989
Mean absolute error                      0.0018
Root mean squared error                  0.0059
Relative absolute error                  0.3965 %
Root relative squared error              1.2611 %
Total Number of Instances           172480
***********************************************************************************

***********************************************************************************
functions.SMO
Correctly Classified Instances      172477               99.9983 %
Incorrectly Classified Instances         3                0.0017 %
Kappa statistic                          0.9989
Mean absolute error                      0.2222
Root mean squared error                  0.2722
Relative absolute error                 50.0009 %
Root relative squared error             57.7365 %
Total Number of Instances           172480
***********************************************************************************

***********************************************************************************
meta.RacedIncrementalLogitBoost
Correctly Classified Instances      172477               99.9983 %
Incorrectly Classified Instances         3                0.0017 %
Kappa statistic                          0.9989
Mean absolute error                      0.1378
Root mean squared error                  0.1555
Relative absolute error                 31.0121 %
Root relative squared error             32.9803 %
Total Number of Instances           172480
***********************************************************************************

Regards,
Fernando
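
On the "What is that?" question: as far as I understand Weka's evaluation output, the error figures are computed on the class probability estimates rather than on the hard predictions, and the two relative figures divide the classifier's error by the error of a baseline that always predicts the class prior. The sketch below is plain Python, not Weka code. The function names, the three-class setup, the uniform prior and the vote-like distribution (2/3, 1/3, 0) are all assumptions chosen for illustration. Under those assumptions every instance is classified correctly, yet the error figures come out close to the functions.SMO values above.

```python
import math


def probability_errors(actual, predicted):
    """Mean absolute error and root mean squared error over all
    per-class probability estimates (target is 1 for the true class, else 0)."""
    abs_sum = 0.0
    sq_sum = 0.0
    count = 0
    for true_class, dist in zip(actual, predicted):
        for k, p in enumerate(dist):
            target = 1.0 if k == true_class else 0.0
            abs_sum += abs(p - target)
            sq_sum += (p - target) ** 2
            count += 1
    return abs_sum / count, math.sqrt(sq_sum / count)


def relative_errors(actual, predicted, prior):
    """Errors as a percentage of the errors of a baseline that
    always predicts the prior class distribution (assumed baseline)."""
    mae, rmse = probability_errors(actual, predicted)
    base_mae, base_rmse = probability_errors(actual, [prior] * len(actual))
    return 100.0 * mae / base_mae, 100.0 * rmse / base_rmse


# Hypothetical data: three classes, one instance each, all classified
# correctly (the true class always has the highest probability).
actual = [0, 1, 2]
predicted = [[2 / 3, 1 / 3, 0.0], [0.0, 2 / 3, 1 / 3], [1 / 3, 0.0, 2 / 3]]
prior = [1 / 3, 1 / 3, 1 / 3]

mae, rmse = probability_errors(actual, predicted)
rae, rrse = relative_errors(actual, predicted, prior)
print(round(mae, 4), round(rmse, 4))  # 0.2222 0.2722
print(round(rae, 1), round(rrse, 2))  # 50.0 57.74
```

So a large "Root relative squared error" next to near-perfect accuracy would suggest probability estimates that are far from 0 and 1, not wrong class decisions.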

_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Classifiers

by NightlordTW



2009/10/26 Fernando Zuher <fernando.zuher@...>:
> Hi people,
>
> I have been testing a database with several classifiers, but I don't
> know how to choose one.
>
> Should I choose the one with the best "Correctly Classified Instances"?
> But what about "Root relative squared error"? Should I disregard it?
 
I would advise using the AUC (area under the ROC curve). Correctly Classified Instances can give a very misleading idea of performance, especially when your dataset is imbalanced. Root mean squared error is mainly of interest in regression; in classification it gives only a limited measure of performance (and a distorted one under class imbalance). Although sensitivity and specificity are usually good measures when considered together, they tell you nothing about classifier performance when the class probability threshold is altered. You can create infinitely many new classifiers from a single trained classifier just by changing this threshold. The AUC is a very good and well-known measure of a classifier's ability to discriminate between 2 classes.
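
A minimal sketch of why accuracy and AUC can disagree on imbalanced data. It uses the rank interpretation of the AUC: the probability that a randomly drawn positive instance gets a higher score than a randomly drawn negative one, with ties counting half. This is plain Python, not Weka code; the function names, the 95/5 class split and the scores are made up for illustration.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs that are
    ranked correctly; tied pairs count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def accuracy(pos_scores, neg_scores, threshold=0.5):
    """Fraction of instances classified correctly when a score at or
    above the threshold means 'positive'."""
    correct = sum(s >= threshold for s in pos_scores)
    correct += sum(s < threshold for s in neg_scores)
    return correct / (len(pos_scores) + len(neg_scores))


# Hypothetical data: 5 minority (positive) and 95 majority (negative) instances.
# Classifier A scores everything the same, i.e. it always predicts the majority.
a_pos, a_neg = [0.1] * 5, [0.1] * 95
# Classifier B ranks every positive above every negative,
# but all of its scores stay below the default 0.5 threshold.
b_pos, b_neg = [0.4] * 5, [0.2] * 95

print(accuracy(a_pos, a_neg), auc(a_pos, a_neg))  # 0.95 0.5
print(accuracy(b_pos, b_neg), auc(b_pos, b_neg))  # 0.95 1.0
```

Both hypothetical classifiers reach 95% accuracy at the default threshold, but only the second one carries any information about the minority class, and only the AUC shows that.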



--
Thomas Debray | Theoretical Epidemiology | Julius Center | Stratenum 6.131 | University Medical Center Utrecht  | P.O.Box 85500  | 3508 GA Utrecht | The Netherlands | www.juliuscenter.nl | www.thomasdebray.be | www.netstorm.be


Re: Classifiers

by Harri Saarikoski-2



2009/10/26 Thomas Debray <tw@...>:
> I would advise to use the AUC. [...] The AUC is a very good and well-known measure to represent the ability of a classifier to discriminate between 2 classes.


hi thomas

having browsed your thesis wrt naive bayes sampling, you definitely know more about imbalanced datasets than I do

what I wanted to ask is: to qualify as the best model in a binary-class situation,
should the AUC be highest for the minority class only, the majority class, or both classes?
(and if the latter, what sort of average over the two classes: weighted by prior probability or weighted equally?)

really appreciate, harri







--
-----------------
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland


Re: Classifiers

by NightlordTW

2009/10/28 Harri Saarikoski <harri.saarikoski@...>:
> hi thomas
>
> having browsed your thesis wrt naive bayes sampling, you definitely know more about imbalanced datasets than I do
>
> what I wanted to ask is: to qualify as the best model in a binary-class situation,
> should the AUC be highest for the minority class only, the majority class, or both classes?

The AUC is not defined per class, but for both classes at the same time. The closer it is to one, the better. If it is below 0.5, your classifier performs worse than random guessing (and you would do better by always choosing the opposite of its prediction). The AUC gives an impression of how well your classifier is able to separate the minority instances from the majority instances.

For example, assume your classifier has an AUC of 0.89, but your obtained sensitivity is 10% and specificity is 99%. At first sight, you might think that the classifier does not perform well on the minority class. However, the rather high AUC tells you that the classifier is in fact good at discriminating between both classes. This means that just by altering your probability threshold, you can change the resulting sensitivity and specificity into, let's assume for this example, 85% and 70%.

In other words, a classifier that yields a high AUC is generally able to obtain better sensitivity/specificity pairs than classifiers with a lower AUC. Although the sensitivity/specificity pair of a classifier with a good AUC may in some situations seem to favor the majority class (high specificity, low sensitivity), simply altering the probability threshold can result in a whole new classifier (e.g. high sensitivity, medium specificity).
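
The effect of moving the threshold can be sketched in a few lines of plain Python. The scores below are invented for illustration (5 minority/positive and 10 majority/negative instances), and the function name is hypothetical; nothing here is Weka code.

```python
def sensitivity_specificity(pos_scores, neg_scores, threshold):
    """Sensitivity and specificity when a score at or above the
    threshold is classified as positive (the minority class)."""
    true_pos = sum(s >= threshold for s in pos_scores)
    true_neg = sum(s < threshold for s in neg_scores)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)


# Hypothetical predicted probabilities for the minority class.
pos = [0.62, 0.41, 0.36, 0.33, 0.12]
neg = [0.38, 0.27, 0.22, 0.18, 0.15, 0.11, 0.09, 0.07, 0.05, 0.03]

# The same trained model, read at three different thresholds.
results = {t: sensitivity_specificity(pos, neg, t) for t in (0.5, 0.3, 0.1)}
for t, (sens, spec) in results.items():
    print(t, sens, spec)
# 0.5 0.2 1.0
# 0.3 0.8 0.9
# 0.1 1.0 0.4
```

At the default 0.5 threshold this made-up model looks poor on the minority class, yet lowering the threshold to 0.3 trades a little specificity for much higher sensitivity.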
 

Re: Classifiers

by Fernando Zuher-2

Thanks for the reply Thomas.

But you said  "The AUC is a very good and well-known measure to
represent the ability of a classifier to discriminate between 2
classes".
And how about measuring the performance among more than 2 classifiers?
How would that be possible?

Fernando


Re: Re: Classifiers

by NightlordTW



2009/10/28 Fernando Zuher <fernando.zuher@...>:
> Thanks for the reply Thomas.
>
> But you said  "The AUC is a very good and well-known measure to
> represent the ability of a classifier to discriminate between 2
> classes".
> And how about measuring the performance among more than 2 classifiers?
> How would that be possible?

I assume you mean how to calculate the AUC when more than 2 classes occur in the dataset. Some research has been done on that, but I don't know the details. Just google a bit and I'm sure you'll find answers,
e.g.: http://www.cs.bris.ac.uk/~flach/ICML04tutorial/ROCtutorialPartIII.pdf
You can compare the AUC of one classifier with the AUC of other classifiers simply by looking at which one is larger. The larger the AUC, the better.
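
One simple way to extend the AUC to more than 2 classes is to compute a one-vs-rest AUC for each class and average the results. The sketch below does that in plain Python. It is only one of several proposals and not necessarily the one covered in the tutorial linked above; the function names, labels and probability vectors are made up for illustration.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def one_vs_rest_aucs(labels, probabilities, num_classes):
    """For each class k, treat k as 'positive' and all other classes as
    'negative', using the predicted probability of class k as the score."""
    result = []
    for k in range(num_classes):
        pos = [p[k] for y, p in zip(labels, probabilities) if y == k]
        neg = [p[k] for y, p in zip(labels, probabilities) if y != k]
        result.append(auc(pos, neg))
    return result


# Hypothetical 3-class problem with two instances per class.
labels = [0, 0, 1, 1, 2, 2]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.3, 0.6],
    [0.5, 0.1, 0.4],
]

per_class = one_vs_rest_aucs(labels, probabilities, 3)
macro_auc = sum(per_class) / len(per_class)
print(per_class)            # [0.875, 0.875, 1.0]
print(round(macro_auc, 4))  # 0.9167
```

The unweighted mean treats all classes equally; weighting each class by its frequency is another option and gives a different number when the classes are imbalanced.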


Re: Classifiers

by Harri Saarikoski-2



2009/10/28 Thomas Debray <tw@...>

> In other words, when a classifier yields a high AUC, it means that it is able to obtain better sensitivity/specificity pairs than classifiers with a lower AUC. Although in some situations it is possible that the sensitivity/specificity pair of a classifier with a good AUC seems to favor the majority class (high specificity, low sensitivity),

yes, exactly, I have eval data below for three classifiers
which of these would you consider best as in 'least prone to overfit':

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.97      0.3        0.942     0.97      0.956      0.875    O
0.7       0.03       0.824     0.7       0.757      0.948    I
  a  b  c   <-- classified as
 97  3  0 |  a = O
  6 14  0 |  b = I

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.85      0          1         0.85      0.919      0.944    O
1         0.15       0.571     1         0.727      0.944    I
  a  b  c   <-- classified as
 85 15  0 |  a = O
  0 20  0 |  b = I

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.91      0.25       0.948     0.91      0.929      0.902    O
0.75      0.09       0.625     0.75      0.682      0.902    I
  a  b  c   <-- classified as
 91  9  0 |  a = O
  5 15  0 |  b = I

I would scrap the middle one for its over-recall of the "I" class, but how would you rank the first and third?
should the number of false predictions made for each class follow the class distribution in the training data?
i.e. the 9/5 ratio in the third vs. the 3/6 ratio in the first
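
For reference, all columns in the tables above except ROC Area follow directly from the confusion matrix. The sketch below, in plain Python with a hypothetical function name, recomputes them for the first matrix; dropping the empty third class "c" is an assumption made to keep the example short. ROC Area cannot be recovered this way because it needs the predicted scores, not just the counts.

```python
def per_class_metrics(matrix, labels):
    """TP rate (recall), FP rate, precision and F-measure per class.
    Rows of the matrix are actual classes, columns are predicted classes."""
    total = sum(sum(row) for row in matrix)
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # actual i, predicted other
        fp = sum(row[i] for row in matrix) - tp   # actual other, predicted i
        tn = total - tp - fn - fp
        tp_rate = tp / (tp + fn)
        fp_rate = fp / (fp + tn)
        precision = tp / (tp + fp)
        f_measure = 2 * precision * tp_rate / (precision + tp_rate)
        metrics[label] = tuple(
            round(v, 3) for v in (tp_rate, fp_rate, precision, f_measure)
        )
    return metrics


# First confusion matrix from the message, without the empty class "c".
first = per_class_metrics([[97, 3], [6, 14]], ["O", "I"])
print(first["O"])  # (0.97, 0.3, 0.942, 0.956)
print(first["I"])  # (0.7, 0.03, 0.824, 0.757)
```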

 
> simply altering the probability threshold might result into a whole new classifier (eg: high sensitivity, medium specificity)

by probability threshold, do you mean a prior under/oversampling of one class, or a threshold on the output class probability?
if the latter, how do you figure that?


 
> (and if the latter, what sort of average of the two classes, prior probability weighted or weighted equally ?)

btw, what I meant by per-class AUC was weka's standard output as above, which reports ROC Area per class
there's also a 'weighted avg' for ROC area (=AUC), though I'm not sure how they weight it...
whether that is the AUC you're talking about is probably a question for the weka people
 
really appreciate, harri




Feature representation

by Azizi Abdullah

Dear all,

I have a simple question pertaining to feature representation. I have a set
of Euclidean distances from an input set of n vectors. The distances are
computed between the n vectors and the cluster centers (centroids).
After that, I want to use all the distances as one of the features for indexing.
My question is: what is the best way to represent the Euclidean distances
for indexing? I know that one commonly used approach is a histogram,
i.e. a histogram of the Euclidean distances. Does anybody know of a better
approach than a histogram?

kind regards
azizisya.
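
A minimal sketch of the histogram representation described above, in plain Python (3.8+ for math.dist). The vectors, centroids, bin width and number of bins are made-up values, and the function name is hypothetical; this only illustrates the baseline being asked about, not a recommended alternative.

```python
import math


def distance_histogram(vectors, centroids, bin_width, num_bins):
    """Histogram of the Euclidean distances between every vector and
    every centroid. Distances beyond the last bin edge fall into the last bin."""
    counts = [0] * num_bins
    for v in vectors:
        for c in centroids:
            d = math.dist(v, c)
            index = min(int(d // bin_width), num_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical input: three 2-D vectors and two cluster centers.
vectors = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
centroids = [(0.0, 0.0), (3.0, 4.0)]

# Distances are 0, 5, 5, 0, 10 and 5, binned into [0,4), [4,8), [8,inf).
hist = distance_histogram(vectors, centroids, bin_width=4.0, num_bins=3)
print(hist)  # [2, 3, 1]
```

Note that a histogram like this discards which centroid each distance belongs to; whether that matters depends on the indexing task.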


Re: Classifiers

by NightlordTW


> yes, exactly, I have eval data below for three classifiers
> which of these would you consider best as in 'least prone to overfit':

You need the AUC. I do not know how to get the AUC for a classifier in the GUI, but in Java code it is very simple: just ask the Evaluation object for it (in the stock Weka API the call is evaluation.areaUnderROC(classIndex)). For 3 classifiers, you get 3 AUCs. I would prefer the classifier with the highest AUC. Note, by the way, that although you define 3 classes in your dataset, only 2 of them show up in your results. I guess there is no data for class 3.


> by probability threshold, do you mean prior operation of under/oversampling one class or the output class probability?

Many classifiers (such as Naive Bayes) produce a probability vector expressing how strongly they believe the presented instance belongs to each class. In a 2-class problem, the class with a probability higher than 50% is then chosen. However, you can alter this threshold to any other value, with the result that different classifications are made.

I suggest you look up some literature if you want to know more about it. I have a brief description of all the mentioned things in my thesis.
