
Re: Assuring the completeness of training data set

by Harri Saarikoski


> What should I choose there for analysis (in comparison field)?
> Harri, I am not clear with the method and concept regarding your  
> suggestion. Kindly explain.

(OK, I can see you already did the runs; I misread your earlier
message.)

OK, so you get an MAE of 0.64 with a stdev of 0.04 across ten folds,
so most individual fold results fall roughly between 0.60 and 0.68
(mean plus or minus one stdev). If you follow the suggestion in my
earlier message and get the individual fold results, we can check this
more carefully.

So each of the ten training folds is representative of its
corresponding test fold to within about this amount: the training sets
are 'off', or underrepresentative, by 0.04 against an MAE of 0.64. So
I'm thinking: would not 0.04 / 0.64 = 1/16 = 6.25% also be a measure
of how far any RW set will on average be 'off' from the full training
set?
(To get a more accurate stdev, run multiple iterations (10) of 10-fold
CV; it won't change much but will be statistically more valid.)
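
The arithmetic above can be sketched in a few lines of Python. This is
only an illustration of the ratio I am proposing, not anything Weka
computes for you; the function name relative_spread is made up for the
example, and the two numbers are the ones from your quoted Experimenter
output.

```python
def relative_spread(mean_error: float, stdev_error: float) -> float:
    """Fold-to-fold standard deviation as a fraction of the mean error."""
    if mean_error <= 0:
        raise ValueError("mean_error must be positive")
    return stdev_error / mean_error


# Figures taken from the quoted output: MAE 0.64 with stdev 0.04.
mae_mean, mae_stdev = 0.64, 0.04

ratio = relative_spread(mae_mean, mae_stdev)
print(f"relative spread: {ratio:.2%}")

# Band of one standard deviation around the mean fold error.
low, high = mae_mean - mae_stdev, mae_mean + mae_stdev
print(f"one-stdev band: {low:.2f} to {high:.2f}")
```

Once you have the individual fold results, the same function can be
fed the mean and stdev recomputed from them.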

Since MAE is calculated from the absolute (predicted minus actual)
differences of your class attribute, we further need to know the value
range (min/max/mean/stdev) of that class attribute. You can get these
by opening the data in the Explorer and scrolling the attribute list
to the class attribute; min, max etc. are shown in the top right panel
of the Preprocess tab.
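
To show why the value range matters, here is a rough Python sketch of
how the MAE could be put in proportion to the class attribute. The
class statistics below (min 0, max 10, stdev 2) are purely hypothetical
placeholders, since I don't know your real values yet, and the two
function names are invented for the example.

```python
def mae_relative_to_range(mae: float, class_min: float, class_max: float) -> float:
    """MAE as a fraction of the class attribute's value range."""
    value_range = class_max - class_min
    if value_range <= 0:
        raise ValueError("class_max must be greater than class_min")
    return mae / value_range


def mae_relative_to_stdev(mae: float, class_stdev: float) -> float:
    """MAE expressed in units of the class attribute's standard deviation."""
    if class_stdev <= 0:
        raise ValueError("class_stdev must be positive")
    return mae / class_stdev


mae = 0.64  # from the quoted Experimenter output

# HYPOTHETICAL class statistics, to be replaced with the values
# shown in the Preprocess tab for the real class attribute.
class_min, class_max, class_stdev = 0.0, 10.0, 2.0

print(f"MAE / range: {mae_relative_to_range(mae, class_min, class_max):.1%}")
print(f"MAE / stdev: {mae_relative_to_stdev(mae, class_stdev):.2f}")
```

An MAE of 0.64 means something quite different on a class that spans
0 to 1 than on one that spans 0 to 1000, which is why I'm asking for
these figures.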

##
as to your other questions:
- you can use either MAE or RMSE as the quality measure; mean absolute
error (MAE) is fine. Choose one and stick with it: both measure the
size of the prediction error, though RMSE weights large errors more
heavily

- as to "explain method and concept": I think the answer to your
particular problem is either not in the data mining books or buried
very deep in them, so we can develop a custom procedure here and
still motivate it adequately
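
On the MAE versus RMSE point above, a small Python sketch may help
show how the two relate. The actual/predicted values are toy numbers
made up for the illustration, not from your data; the formulas are the
standard definitions of the two measures.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |predicted - actual| over all instances."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average of (predicted - actual)^2."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    mean_sq = sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_sq)


# Toy data: three perfect predictions and one that is off by 4.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]

print("MAE: ", mean_absolute_error(actual, predicted))
print("RMSE:", root_mean_squared_error(actual, predicted))
```

RMSE is never smaller than MAE, and the gap widens when a few
predictions are far off, which fits your runs giving 1.06 for RMSE
against 0.64 for MAE.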

>
> regards,
> Jitendra
> ============================================================
> Tester:     weka.experiment.PairedCorrectedTTester
> Analysing:  Mean_absolute_error
> Datasets:   1
> Resultsets: 1
> Confidence: 0.05 (two tailed)
> Sorted by:  -
> Date:       6/30/09 6:07 AM
>
>
> Dataset                   (1) functions.Linea
> ---------------------------------------------
> L2_MR                     (10)   0.64(0.04) |
> ---------------------------------------------
> (v/ /*)                                     |
>
>
> Key:
> (1) functions.LinearRegression '-S 0 -R 1.0E-8' -3364580862046573747
>
> ==============================================================
>
> Tester:     weka.experiment.PairedCorrectedTTester
> Analysing:  Root_mean_squared_error
> Datasets:   1
> Resultsets: 1
> Confidence: 0.05 (two tailed)
> Sorted by:  -
> Date:       6/30/09 6:08 AM
>
>
> Dataset                   (1) functions.Linea
> ---------------------------------------------
> L2_MR                     (10)   1.06(0.10) |
> ---------------------------------------------
> (v/ /*)                                     |
>
>
> Key:
> (1) functions.LinearRegression '-S 0 -R 1.0E-8' -3364580862046573747
> ======================================================================


--

kind regards,
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland


