Assuring the completeness of training data set


Assuring the completeness of training data set

by J K Rai


> This was also my first thought having read the question
> ("similar+different kind" = 2 classes, a class balance
> issue). But when I read it again, the question seems more
> about the spread/location of instances along some measure of
> outliership (e.g. euclidean space).
>
> The asker would need now to confirm this I think, it is
> also possible both were meant, as both class imbalance and
> proportion of outliers are known to have an effect on how
> much the trainset model will overfit, i.e. perform worse
> than trainset CV tests suggest, against RW set.
Thanks for the replies,

I am a new user and have been using Weka for numeric prediction (i.e., regression: predicting a numeric value). All attributes and the class value are numeric.

The generated data set of 3025 instances represents the behavior of programs running on processors, which in a real scenario is unbounded, both in the number of instances and in the numeric values of the (single) class and of the attributes.

Hence I am trying to generate a model that is as representative as possible.
I am not sure how representative my 3025-instance dataset is.
I still feel that the existing data could be used in a better way to generate a more representative model.

Harry, can you please explain what an RW set is?

From the discussion so far, it looks like I should use SMOTE or the filter suggested by wirefree (meant for non-numeric classification?). Both seem to add more instances. Can I work with the original dataset instead, that is, by reducing it to its most representative instances? (Please correct me if I am wrong; I feel that the added instances may not be representative.)
I was going to attach the Q-Q plot of the dataset, but removed it because my previous mail bounced for exceeding the size limit.

Looking forward to your suggestions.

Thanks again.
Regards,
Jitendra





_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Assuring the completeness of training data set

by Harri Saarikoski

Quoting "J K Rai" <jk.anurag@...>:

> Harry, can you please explain what is RW-set?
RW set = "real-world test set", as you meant it: the unseen data on which
the trainset model will eventually run. you want to estimate the model's
performance there by measuring the trainset's representativeness internally

>
> From the discussion which went on it looks like that I should use  
> smote or the filter suggested by wirefree (meant for non-numeric  
> classification?) . Both seem to add some more instances. Can I do  
> with the original dataset,

sure you can. forget about the adding part for now.

for original dataset I suggested running the following in experimenter:
- iterations: 1
- folds: 10
- classifier selection: select e.g. the following fast-running  
classifiers (doesn't matter much for our purposes which we choose but  
best have several of them though):
    bayes.NaiveBayes
    trees.J48
    lazy.IBk
(all classifiers in their default configurations)

after it finishes, go to results tab and select from those row/column  
buttons something reading "standard deviation" -> results analysis:  
stdev could be e.g. +/- 2.3% which under the above assumption of  
similarity between trainset and testset is your figure of  
representativeness of your trainset (the smaller this deviation is,  
the more representative your trainset is)
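
The procedure above (10-fold cross-validation, then looking at how much the error varies from fold to fold) can be sketched outside Weka as follows. This is only an illustration under stated assumptions: the data is synthetic, the model is a one-attribute least-squares line rather than any Weka classifier, and the names `fit_line` and `cv_fold_maes` are made up for this sketch.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single numeric attribute."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx

def cv_fold_maes(xs, ys, folds=10, seed=1):
    """Return the mean absolute error of each test fold in k-fold CV."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    maes = []
    for f in range(folds):
        test = idx[f::folds]              # every folds-th shuffled index
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errs = [abs(a * xs[i] + b - ys[i]) for i in test]
        maes.append(statistics.fmean(errs))
    return maes

# Synthetic stand-in for the real dataset (hypothetical values, not the
# processor data discussed in this thread).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(3025)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

fold_maes = cv_fold_maes(xs, ys)
print("mean MAE :", round(statistics.fmean(fold_maes), 3))
print("stdev MAE:", round(statistics.stdev(fold_maes), 3))
```

A small stdev relative to the mean says the folds behave alike; it does not by itself show that data collected later will look like any of them.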


> I mean by reducing the dataset to most representative one? (please  
> correct me) (I feel that the added instances may not be  
> representative ones.)

let's get back to that later, after you run the above

> I am attaching the Q-Q plot of the dataset. (removed it as previous  
> mail bounced  back due to exceeding the size limit)

ok, any sizeable data you have you can send offlist to  
my.name@..., including the dataset



--

kind regards,
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland




Re: Assuring the completeness of training data set

by J K Rai



> after it finishes, go to results tab and select from those
> row/column buttons something reading "standard deviation"
> -> results analysis: stdev could be e.g. +/- 2..3% which
> under the above assumption of similarity between trainset
> and testset is your figure of representativeness of your
> trainset (the smaller this deviation is, the more
> representative your trainset is)
If I choose a different value in the comparison field of the analyzer, I get a different stddev in the right-hand window: 0.04 for Mean_absolute_error and 0.10 for Root_mean_squared_error, as shown below.

Which one should I choose in the comparison field for the analysis?
Harri, I am not clear about the method and concept behind your suggestion. Kindly explain.

regards,
Jitendra
============================================================
Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Mean_absolute_error
Datasets:   1
Resultsets: 1
Confidence: 0.05 (two tailed)
Sorted by:  -
Date:       6/30/09 6:07 AM


Dataset                   (1) functions.Linea
---------------------------------------------
L2_MR                     (10)   0.64(0.04) |
---------------------------------------------
(v/ /*)                                     |


Key:
(1) functions.LinearRegression '-S 0 -R 1.0E-8' -3364580862046573747

==============================================================

Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Root_mean_squared_error
Datasets:   1
Resultsets: 1
Confidence: 0.05 (two tailed)
Sorted by:  -
Date:       6/30/09 6:08 AM


Dataset                   (1) functions.Linea
---------------------------------------------
L2_MR                     (10)   1.06(0.10) |
---------------------------------------------
(v/ /*)                                     |


Key:
(1) functions.LinearRegression '-S 0 -R 1.0E-8' -3364580862046573747
======================================================================



Re: Assuring the completeness of training data set

by Harri Saarikoski

Quoting "J K Rai" <jk.anurag@...>:

> If I choose a different value in comparison field (in analyzer I get  
> different stddev in right window), like .04 and .1 for  
> Mean_absolute_error and Root_mean_squared_error respectively, shown  
> below:
>
> What should I choose there for analysis (in comparison field)?
> Harri, I am not clear with the method and concept regarding your  
> suggestion. Kindly explain.
sorry, stdev is a tickbox in the Analyse tab, not a row/column selection.
tick that and rerun; then you get:

Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Percent_correct
Datasets:   1
Resultsets: 1
Confidence: 0.05 (two tailed)
Sorted by:  -
Date:       30.6.2009 7:50


Dataset                   (1) trees.REPTree '-
----------------------------------------------
                           (10)   94.00(5.84) |
----------------------------------------------
(v/ /*)                                      |


Key:
(1) trees.REPTree '-M 2 -V 0.0010 -N 3 -S 1 -L -1' -8562443428621539458

-> 5.84 is the stdev

(maybe observe the individual fold results too, get that by selecting  
Fold as one of the Rows there, and hit Perform test again)



--

kind regards,
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland




Re: Assuring the completeness of training data set

by Harri Saarikoski


> What should I choose there for analysis (in comparison field)?
> Harri, I am not clear with the method and concept regarding your  
> suggestion. Kindly explain.

(ok, I can see you already did the runs; I misread your earlier
message...)

ok so you get MAE 0.64 with stdev of 0.04 across ten folds, so most
individual fold results fall roughly between 0.60 and 0.68 (if you use my
earlier message's suggestion and get the individual fold results, we
can check this more carefully)

so each of the ten trainfolds is representative wrt its corresponding
testfold to within this amount: trainsets are 'off', or
underrepresentative, by 0.04, and the MAE is 0.64, so I'm thinking:
would not 0.04 / 0.64 = 1/16 = 6.25% also be the measure of how much
any RW set will on average be 'off' from the full trainset...
(to get a more accurate stdev, run multiple iterations (10) of 10-fold
cv; it won't change much but will be statistically more valid)
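
The arithmetic above can be written out as a short calculation. It uses only the mean and stdev values from the tester output quoted in this thread; the variable names are made up for the sketch, and treating stdev / mean as a figure of representativeness is the working assumption of this discussion, not an established Weka metric.

```python
# Mean and stdev across the ten folds, taken from the tester output.
results = {
    "Mean_absolute_error": (0.64, 0.04),
    "Root_mean_squared_error": (1.06, 0.10),
}

relative_spread = {}
for measure, (mean, stdev) in results.items():
    relative_spread[measure] = stdev / mean
    low, high = mean - stdev, mean + stdev
    print(f"{measure}: {relative_spread[measure]:.2%} "
          f"(one-stdev band {low:.2f} to {high:.2f})")
# Mean_absolute_error: 6.25% (one-stdev band 0.60 to 0.68)
# Root_mean_squared_error: 9.43% (one-stdev band 0.96 to 1.16)
```

Individual folds can fall outside the one-stdev band, which is why the per-fold results are worth checking.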

since MAE is calculated from the (predicted minus actual) values of
your class attribute, we further need to know the value range
(min/max/mean/stdev) of that class attribute. (you can get these by
opening the dataset in the Explorer and selecting the class attribute
in the attribute list; min, max etc. are shown on the right-hand side
of the Preprocess tab)

##
as to your other questions:
- you can use either MAE or RMSE as the quality measure; mean absolute
error (MAE) is fine. choose one and stick with it: both summarize the
same prediction errors, though RMSE weights large errors more heavily

- as to "explain method and concept": I think the answer to your
particular problem is either not in the data mining books or buried
very deep in them, so the procedure can be developed custom here and
still be adequately motivated
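
To make the relationship between the two error measures reported by the tester (Mean_absolute_error and Root_mean_squared_error) concrete, here is a minimal sketch using their standard definitions. The numbers are invented for illustration and the function names are not Weka's.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average of |predicted - actual|."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(
        sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

actual = [1.0, 2.0, 3.0, 4.0]
even = [2.0, 3.0, 4.0, 5.0]      # every prediction off by 1
one_big = [1.0, 2.0, 3.0, 8.0]   # a single prediction off by 4

print(mae(actual, even), rmse(actual, even))        # 1.0 1.0
print(mae(actual, one_big), rmse(actual, one_big))  # 1.0 2.0
```

Both sets of predictions have the same MAE, but the one with a single large miss has the larger RMSE, so whichever measure is chosen should be used consistently when comparing runs.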



--

kind regards,
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland


