Classifiers

Hi people,

I have been testing a database with several classifiers, but I don't know how to choose one. Should I pick the one with the highest "Correctly Classified Instances"? And what about "Root relative squared error"? Should I disregard it? In the examples below, the best classifier by accuracy is meta.RandomSubSpace, but its "Root relative squared error" is 20.1967 %. What is that?

***********************************************************************************
meta.RandomSubSpace

Correctly Classified Instances      172478      99.9988 %
Incorrectly Classified Instances         2       0.0012 %
Kappa statistic                          0.9993
Mean absolute error                      0.072
Root mean squared error                  0.0952
Relative absolute error                 16.2081 %
Root relative squared error             20.1967 %
Total Number of Instances           172480
***********************************************************************************
trees.LADTree '-B 10' -4940716114518300302

Correctly Classified Instances      172477      99.9983 %
Incorrectly Classified Instances         3       0.0017 %
Kappa statistic                          0.9989
Mean absolute error                      0.0015
Root mean squared error                  0.0079
Relative absolute error                  0.3391 %
Root relative squared error              1.6848 %
Total Number of Instances           172480
***********************************************************************************
rules.DTNB

Correctly Classified Instances      172477      99.9983 %
Incorrectly Classified Instances         3       0.0017 %
Kappa statistic                          0.9989
Mean absolute error                      0.0018
Root mean squared error                  0.0059
Relative absolute error                  0.3965 %
Root relative squared error              1.2611 %
Total Number of Instances           172480
***********************************************************************************
functions.SMO

Correctly Classified Instances      172477      99.9983 %
Incorrectly Classified Instances         3       0.0017 %
Kappa statistic                          0.9989
Mean absolute error                      0.2222
Root mean squared error                  0.2722
Relative absolute error                 50.0009 %
Root relative squared error             57.7365 %
Total Number of Instances           172480
***********************************************************************************
meta.RacedIncrementalLogitBoost

Correctly Classified Instances      172477      99.9983 %
Incorrectly Classified Instances         3       0.0017 %
Kappa statistic                          0.9989
Mean absolute error                      0.1378
Root mean squared error                  0.1555
Relative absolute error                 31.0121 %
Root relative squared error             32.9803 %
Total Number of Instances           172480
***********************************************************************************

Regards,
Fernando

_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
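On the "What is that?" part of the question, here is a minimal sketch of how a root relative squared error is usually defined: the squared error of the predictions divided by the squared error of a baseline that always predicts the mean of the actual values. This is plain Python, not Weka code; the data and the function name are made up for illustration, and the exact baseline Weka uses for nominal classes may differ from the simple mean assumed here. The point it illustrates is that two classifiers can make identical (all correct) decisions and still have very different relative errors, because the error is measured on the predicted class probabilities, not on the decisions.

```python
import math

def root_relative_squared_error(actual, predicted):
    """RRSE in percent: sqrt(sum (p - a)^2 / sum (a - mean(a))^2) * 100.

    Assumption: the baseline always predicts the mean of the actual values.
    """
    mean_actual = sum(actual) / len(actual)
    numerator = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    denominator = sum((a - mean_actual) ** 2 for a in actual)
    return 100.0 * math.sqrt(numerator / denominator)

# Hypothetical data: 1 = instance belongs to the class, 0 = it does not.
actual = [1, 1, 0, 0]
# Both classifiers are right on every instance at a 0.5 threshold,
# but the second one is much less confident in its probabilities.
confident = [0.99, 0.98, 0.01, 0.02]
hesitant = [0.70, 0.65, 0.30, 0.35]

rrse_confident = root_relative_squared_error(actual, confident)
rrse_hesitant = root_relative_squared_error(actual, hesitant)
print(f"confident: {rrse_confident:.2f} %")
print(f"hesitant:  {rrse_hesitant:.2f} %")
```

Under these assumptions, a high value next to a near-perfect accuracy suggests unconfident probability estimates, not wrong decisions.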
|
|
Re: Classifiers

2009/10/26 Fernando Zuher <fernando.zuher@...>
> Hi people,

I would advise using the AUC. Correctly Classified Instances can give a very misleading idea of performance, especially when your dataset is imbalanced. Root mean squared error is mainly of interest in regression. In classification it gives only a limited measure of performance (and a distorted one under class imbalance). Although sensitivity and specificity are usually good measures when considered together, they tell you nothing about classifier performance when the class probability threshold is altered. You can create arbitrarily many new classifiers from a single trained classifier just by changing this threshold. The AUC is a good and well-known measure of a classifier's ability to discriminate between 2 classes.

-- Thomas Debray | Theoretical Epidemiology | Julius Center | Stratenum 6.131 | University Medical Center Utrecht | P.O.Box 85500 | 3508 GA Utrecht | The Netherlands | www.juliuscenter.nl | www.thomasdebray.be | www.netstorm.be
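To make the accuracy versus AUC point concrete, here is a small sketch in plain Python (not Weka code). It assumes the common rank-based reading of the AUC: the fraction of (positive, negative) pairs in which the positive instance gets the higher score, with ties counted as one half. The scores and function names are invented for illustration.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def accuracy_at(scores_pos, scores_neg, threshold):
    """Accuracy when an instance is called positive if score >= threshold."""
    correct = sum(1 for p in scores_pos if p >= threshold)
    correct += sum(1 for n in scores_neg if n < threshold)
    return correct / (len(scores_pos) + len(scores_neg))

# Hypothetical imbalanced data: predicted probability of the minority class.
minority_scores = [0.45, 0.40, 0.30]
majority_scores = [0.35, 0.20, 0.15, 0.10, 0.10, 0.05]

auc_value = auc(minority_scores, majority_scores)
accuracy = accuracy_at(minority_scores, majority_scores, 0.5)
print(f"AUC = {auc_value:.3f}, accuracy at 0.5 = {accuracy:.3f}")
```

In this made-up case the classifier never predicts the minority class at the default threshold, so accuracy only reflects the class proportions, while the AUC shows that the scores rank the two classes almost perfectly.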
|
|
Re: Classifiers

2009/10/26 Thomas Debray <tw@...>

Hi Thomas,

Having browsed your thesis with respect to naive Bayes sampling, I think you definitely might know more about imbalanced datasets than I do. What I wanted to ask is this: to qualify as the best model in a binary class situation, should the AUC be highest for the minority class only, for the majority class, or for both classes? (And if the latter, what sort of average of the two classes: weighted by prior probability, or weighted equally?)

Really appreciate it,
Harri

--
-----------------
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland
|
|
Re: Classifiers

2009/10/28 Harri Saarikoski <harri.saarikoski@...>

The AUC is not defined per class, but for both classes at the same time. The closer it is to one, the better. If it is below 0.5, your classifier performs worse than random guessing (and you should always choose the opposite of the classifier's prediction). The AUC gives an impression of how well your classifier is able to separate the minority from the majority instances.

For example, assume your classifier has an AUC of 0.89, but the sensitivity you obtained is 10% and the specificity is 99%. At first glance, you might think the classifier performs poorly on the minority class. However, since the AUC is rather high, it tells you that the classifier is in fact good at discriminating between both classes. This means that just by altering your probability threshold, you can change the resulting sensitivity and specificity to, let's assume in this example, 85% and 70%.

In other words, a classifier with a high AUC is generally able to obtain better sensitivity/specificity pairs than classifiers with a lower AUC. Although in some situations the sensitivity/specificity pair of a classifier with a good AUC seems to favor the majority class (high specificity, low sensitivity), simply altering the probability threshold might result in a whole new classifier (e.g. high sensitivity, medium specificity).

-- Thomas Debray
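The effect of moving the probability threshold can be sketched as follows. This is plain Python with invented scores, and the rule "call an instance positive when its score is at least the threshold" is an assumption made for the example; the helper name is hypothetical.

```python
def sensitivity_specificity(scores_pos, scores_neg, threshold):
    """Return (sensitivity, specificity) for the rule: positive if score >= threshold."""
    true_pos = sum(1 for p in scores_pos if p >= threshold)
    true_neg = sum(1 for n in scores_neg if n < threshold)
    return true_pos / len(scores_pos), true_neg / len(scores_neg)

# Hypothetical predicted probabilities of the minority (positive) class.
minority_scores = [0.48, 0.42, 0.35, 0.30, 0.12]
majority_scores = [0.40, 0.25, 0.20, 0.15, 0.10, 0.08, 0.05, 0.05, 0.03, 0.02]

# The same trained model yields a different classifier at each threshold.
for threshold in (0.5, 0.28, 0.11):
    sens, spec = sensitivity_specificity(minority_scores, majority_scores, threshold)
    print(f"threshold {threshold:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these made-up scores, the default threshold of 0.5 finds no minority instance at all, while lower thresholds trade some specificity for much higher sensitivity without retraining anything.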
|
|
Re: Re: Classifiers

2009/10/28 Fernando Zuher <fernando.zuher@...>
> Thanks for the reply Thomas.

I assume you mean how to calculate the AUC when more than 2 classes occur in the dataset. To answer this question: some research has been done on that, but I don't know the details. Just google a bit and I'm sure you'll find answers, e.g. http://www.cs.bris.ac.uk/~flach/ICML04tutorial/ROCtutorialPartIII.pdf

You can compare the AUC of a classifier with the AUC of other classifiers just by looking at which one is larger. The larger the AUC, the better.

-- Thomas Debray
|
|
Re: Classifiers

2009/10/28 Thomas Debray <tw@...>

Yes, exactly. I have evaluation data below for three classifiers. Which of these would you consider best, as in 'least prone to overfit'?

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.97     0.3      0.942      0.97    0.956      0.875     O
0.7      0.03     0.824      0.7     0.757      0.948     I

  a   b   c   <-- classified as
 97   3   0 | a = O
  6  14   0 | b = I

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.85     0        1          0.85    0.919      0.944     O
1        0.15     0.571      1       0.727      0.944     I

  a   b   c   <-- classified as
 85  15   0 | a = O
  0  20   0 | b = I

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.91     0.25     0.948      0.91    0.929      0.902     O
0.75     0.09     0.625      0.75    0.682      0.902     I

  a   b   c   <-- classified as
 91   9   0 | a = O
  5  15   0 | b = I

I would scrap the middle one for its over-recall of the "I" class, but how would you rank the first and third? Should the number of false predictions made for each class follow the class distribution in the training data, i.e. the 9/5 ratio in the third versus the 3/6 ratio in the first?

By probability threshold, do you mean a prior operation of under/oversampling one class, or the output class probability? If the latter, how do you figure that?

Btw, what I meant by per-class AUC was Weka's standard output as above, per class. There is also a 'weighted avg' for ROC Area (= AUC), and I'm not sure how they weight it. Whether it's the AUC you're talking about, I should probably ask the Weka people.

-- Harri M.T. Saarikoski
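For anyone who wants to check how the per-class columns follow from a confusion matrix, here is a small sketch in plain Python (not Weka code). It uses the first confusion matrix from the message above and drops the empty column for class c, which is an assumption made to keep the example two-class; the function name is made up.

```python
def per_class_metrics(matrix, k):
    """TP rate, FP rate, precision and F-measure for class index k.

    matrix[i][j] = number of instances of actual class i classified as class j.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                    # actual k, predicted something else
    fp = sum(row[k] for row in matrix) - tp     # predicted k, actually something else
    tn = total - tp - fn - fp
    recall = tp / (tp + fn)                     # same as TP rate
    precision = tp / (tp + fp)
    return {
        "tp_rate": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# First confusion matrix from the message: rows = actual (O, I), columns = predicted.
first = [[97, 3],
         [6, 14]]

for index, name in enumerate(["O", "I"]):
    metrics = per_class_metrics(first, index)
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The rounded values agree with the TP Rate, FP Rate, Precision and F-Measure columns of the first table; ROC Area cannot be derived from the confusion matrix alone, since it needs the predicted probabilities.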
|
|
Feature representation

Dear all,

I have a simple question pertaining to feature representation. I have a set of Euclidean distances from an input set of n vectors. The distances are computed between the set of n vectors and the cluster centers (centroids). After that, I want to use all the distances as one of the features for indexing. My question is: what is the best way to represent the Euclidean distances for indexing? I know that one of the commonly used approaches is a histogram, i.e. a histogram of the Euclidean distances. Does anybody know of a better approach than a histogram?

Kind regards,
azizisya.
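To make the setup in the question concrete, here is a minimal sketch in plain Python of the two representations mentioned: the raw distances from each vector to each centroid, and a fixed-width histogram over all of them. The vectors, centroids, bin count and function names are invented for illustration; this is not a recommendation of which representation is best.

```python
import math

def distances_to_centroids(vectors, centroids):
    """One row per vector, one Euclidean distance per centroid."""
    return [[math.dist(v, c) for c in centroids] for v in vectors]

def histogram(values, num_bins, max_value):
    """Counts per fixed-width bin over [0, max_value]; max_value goes in the last bin."""
    counts = [0] * num_bins
    width = max_value / num_bins
    for value in values:
        index = min(int(value / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical input: three 2-D vectors and two cluster centers.
vectors = [(0, 0), (3, 4), (6, 8)]
centroids = [(0, 0), (6, 8)]

dist_rows = distances_to_centroids(vectors, centroids)
all_distances = [d for row in dist_rows for d in row]
hist = histogram(all_distances, num_bins=4, max_value=10)
print(dist_rows)
print(hist)
```

The distance rows keep which centroid each distance refers to, while the histogram discards that and keeps only the distribution of distances, so the two carry different information for indexing.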
|
|
Re: Classifiers
You need the AUC. I do not know how to get the AUC for a classifier in the GUI, but in Java code it is very simple: just call evaluation.getAUC(). For 3 classifiers, you get 3 AUCs. I would prefer the classifier with the highest AUC. Note, btw, that although you define 3 classes in your dataset, only 2 classes appear in the results. I guess there is no data for class 3.

Many classifiers (such as Naive Bayes) produce a probability vector that states, for each class Y, how strongly they believe the presented instance belongs to that class. In a 2-class problem, the class with a probability higher than 50% is then chosen. However, you can alter this threshold to any other value, with the result that different classifications are made. I suggest you look up some literature if you want to know more about it. I have a brief description of all the things mentioned in my thesis.

-- Thomas Debray