Assuring the completeness of training data set

> This was also my first thought having read the question
> ("similar+different kind" = 2 classes, a class balance
> issue). But when I read it again, the question seems more
> about the spread/location of instances along some measure of
> outliership (e.g. in Euclidean space).
>
> The asker would now need to confirm this, I think. It is
> also possible both were meant, as both class imbalance and
> the proportion of outliers are known to have an effect on how
> much the trainset model will overfit, i.e. perform worse
> against the RW set than trainset CV tests suggest.

I am a new user and have been using Weka for numeric classification (i.e. regression, predicting some numeric value). All attributes and class values are numeric.

The generated data set of 3025 instances represents the behavior of programs running on processors, which in a real scenario is not bounded (neither in the number of instances nor in the numeric values of the single class and of the attributes).

Hence I am trying to generate a model that is as representative as possible. I am not sure how representative my 3025-instance dataset is. I still feel that the existing data can be used in a better way to generate the most representative model.

Harri, can you please explain what the RW set is?

From the discussion so far, it looks like I should use SMOTE or the filter suggested by wirefree (meant for non-numeric classification?). Both seem to add more instances. Can I make do with the original dataset, i.e. by reducing it to the most representative subset? (Please correct me.) I feel that the added instances may not be representative.

I meant to attach the Q-Q plot of the dataset (I removed it, as the previous mail bounced back for exceeding the size limit).

Looking forward to suggestions.

Thanks again.
Regards,
Jitendra

_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
Re: Assuring the completeness of training data set

Quoting "J K Rai" <jk.anurag@...>:
>> This was also my first thought having read the question
>> ("similar+different kind" = 2 classes, a class balance
>> issue). But when I read it again, the question seems more
>> about the spread/location of instances along some measure of
>> outliership (e.g. in Euclidean space).
>>
>> The asker would now need to confirm this, I think. It is
>> also possible both were meant, as both class imbalance and
>> the proportion of outliers are known to have an effect on how
>> much the trainset model will overfit, i.e. perform worse
>> against the RW set than trainset CV tests suggest.
>
> Thanks for the replies,
>
> I am a new user and have been using Weka for numeric classification
> (i.e. regression, predicting some numeric value). All
> attributes and class values are numeric.
>
> The generated data set of 3025 instances represents the behavior of
> programs running on processors, which in a real scenario is not
> bounded (neither in the number of instances nor in the numeric values
> of the single class and of the attributes).
>
> Hence I am trying to generate a model that is as representative as possible.
> I am not sure how representative my 3025-instance dataset is.
> I still feel that the existing data can be used in a better way to
> generate the most representative model.
>
> Harri, can you please explain what the RW set is?

model you here want to estimate by measuring its (trainset's) representativeness internally

> From the discussion so far, it looks like I should use
> SMOTE or the filter suggested by wirefree (meant for non-numeric
> classification?). Both seem to add more instances. Can I make do
> with the original dataset,

Sure you can. Forget about the adding part for now.

For the original dataset I suggested running the following in the Experimenter:
- iterations: 1
- folds: 10
- classifier selection: select e.g. the following fast-running classifiers (it doesn't matter much for our purposes which ones we choose, but it is best to have several of them):
    bayes.NaiveBayes
    trees.J48
    lazy.IBk
  (all classifiers in their default configurations)

After it finishes, go to the results tab and select from those row/column buttons something reading "standard deviation".

-> Results analysis: the stdev could be e.g. +/- 2.3%, which under the above assumption of similarity between trainset and testset is your figure of representativeness of your trainset (the smaller this deviation is, the more representative your trainset is).

> I mean by reducing the dataset to the most representative subset? (Please
> correct me.) I feel that the added instances may not be
> representative.

Let's get back to that later, after you have run the above.

> I meant to attach the Q-Q plot of the dataset (I removed it, as the previous
> mail bounced back for exceeding the size limit).

OK, any sizeable data you have you can send off-list to my.name@..., including the dataset.

> Looking forward to suggestions.
>
> Thanks again.
> Regards,
> Jitendra

--
kind regards,
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland
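The Experimenter run described above (one iteration of 10-fold cross-validation, then reading the standard deviation across folds) can be sketched outside Weka. This is only an illustration under stated assumptions: it uses plain Python rather than the Weka API, a one-feature least-squares line instead of the classifiers named in the thread, and synthetic data in place of the 3025-instance dataset. The function names `fit_line` and `cv_fold_maes` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cv_fold_maes(xs, ys, folds=10, seed=1):
    """Return the mean absolute error of each test fold in one k-fold run."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    maes = []
    for k in range(folds):
        test_idx = order[k::folds]  # every folds-th shuffled index
        test_set = set(test_idx)
        train_idx = [i for i in order if i not in test_set]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        errors = [abs(slope * xs[i] + intercept - ys[i]) for i in test_idx]
        maes.append(statistics.fmean(errors))
    return maes


# Synthetic stand-in data (an assumption, not the poster's dataset).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

fold_maes = cv_fold_maes(xs, ys)
mean_mae = statistics.fmean(fold_maes)
stdev_mae = statistics.stdev(fold_maes)  # sample stdev across the 10 folds
print(f"MAE {mean_mae:.2f} ({stdev_mae:.2f})")
```

The printed pair has the same shape as the mean(stdev) cell in the Experimenter's analysis output; a small second number relative to the first means the folds behave alike.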
Re: Assuring the completeness of training data set

Quoting "J K Rai" <jk.anurag@...>:
>> > From the discussion so far, it looks like I
>> > should use SMOTE or the filter suggested by wirefree (meant
>> > for non-numeric classification?). Both seem to add
>> > more instances. Can I make do with the original dataset,
>>
>> Sure you can. Forget about the adding part for now.
>>
>> For the original dataset I suggested running the following in
>> the Experimenter:
>> - iterations: 1
>> - folds: 10
>> - classifier selection: select e.g. the following
>>   fast-running classifiers (it doesn't matter much for our
>>   purposes which ones we choose, but it is best to have
>>   several of them):
>>     bayes.NaiveBayes
>>     trees.J48
>>     lazy.IBk
>>   (all classifiers in their default configurations)
>>
>> After it finishes, go to the results tab and select from those
>> row/column buttons something reading "standard deviation".
>> -> Results analysis: the stdev could be e.g. +/- 2..3%, which
>> under the above assumption of similarity between trainset
>> and testset is your figure of representativeness of your
>> trainset (the smaller this deviation is, the more
>> representative your trainset is).
>
> If I choose a different value in the comparison field (in the analyzer), I get
> a different stddev in the right window, like .04 and .1 for
> Mean_absolute_error and Root_mean_squared_error respectively, shown
> below:
>
> What should I choose there for analysis (in the comparison field)?
> Harri, I am not clear about the method and concept behind your
> suggestion. Kindly explain.

Tick that and rerun. Then you get:

    Tester:     weka.experiment.PairedCorrectedTTester
    Analysing:  Percent_correct
    Datasets:   1
    Resultsets: 1
    Confidence: 0.05 (two tailed)
    Sorted by:  -
    Date:       30.6.2009 7:50

    Dataset                   (1) trees.REPTree '-
    ----------------------------------------------
                             (10)   94.00(5.84) |
    ----------------------------------------------
                                  (v/ /*) |

    Key:
    (1) trees.REPTree '-M 2 -V 0.0010 -N 3 -S 1 -L -1' -8562443428621539458

-> 5.84 is the stdev.

(Maybe look at the individual fold results too: get them by selecting Fold as one of the Rows there, and hit Perform test again.)

> regards,
> Jitendra
> ============================================================
> Tester:     weka.experiment.PairedCorrectedTTester
> Analysing:  Mean_absolute_error
> Datasets:   1
> Resultsets: 1
> Confidence: 0.05 (two tailed)
> Sorted by:  -
> Date:       6/30/09 6:07 AM
>
> Dataset                   (1) functions.Linea
> ---------------------------------------------
> L2_MR                    (10)    0.64(0.04) |
> ---------------------------------------------
>                               (v/ /*) |
>
> Key:
> (1) functions.LinearRegression '-S 0 -R 1.0E-8' -3364580862046573747
>
> ============================================================
> Tester:     weka.experiment.PairedCorrectedTTester
> Analysing:  Root_mean_squared_error
> Datasets:   1
> Resultsets: 1
> Confidence: 0.05 (two tailed)
> Sorted by:  -
> Date:       6/30/09 6:08 AM
>
> Dataset                   (1) functions.Linea
> ---------------------------------------------
> L2_MR                    (10)    1.06(0.10) |
> ---------------------------------------------
>                               (v/ /*) |
>
> Key:
> (1) functions.LinearRegression '-S 0 -R 1.0E-8' -3364580862046573747
> ============================================================

--
kind regards,
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland
Re: Assuring the completeness of training data set

> What should I choose there for analysis (in the comparison field)?
> Harri, I am not clear about the method and concept behind your
> suggestion. Kindly explain.

(OK, I can see you already did the runs; I misread your earlier message...)

So you get an MAE of 0.64 with a stdev of 0.04 across ten folds, so the individual fold results lie roughly between 0.60 and 0.68, i.e. one standard deviation either side of the mean. (If you follow the suggestion in my earlier message and get the individual fold results, we can check this more carefully.)

So each of the ten trainfolds is representative of its corresponding testfold by this amount: the trainsets are 'off', or underrepresentative, by 0.04, and the MAE is 0.64. So I'm thinking: would not 0.04 / 0.64 = 1/16 = 6.25% also be a measure of how much any RW set will on average be 'off' from the full trainset?

(To get a more accurate stdev, run multiple iterations (10) of 10-fold CV. It won't change much but will be statistically more valid.)

Since MAE is calculated from the (predicted minus actual) values of your class attribute, we further need to know the value range (min/max/mean/stdev) of that class attribute. (You can get these by opening the dataset in the Explorer and scrolling the attribute list to the class attribute; min, max etc. are shown in the top right window of the Preprocess tab.)

## As to your other questions:

- You can use either MAE or RMSE as the quality measure. Mean absolute error (MAE) is fine. Choose one and stick with it; they measure much the same thing, though RMSE weights large errors more heavily.
- As to "explain method and concept": I think the answer to your special problem is either not in the books or buried very deep in those data mining books, so the procedure can be developed custom here and nonetheless be adequately motivated.

--
kind regards,
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland
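The arithmetic in this reply can be checked with a few lines of Python. This is a sketch only: the 0.64 and 0.04 figures are the ones quoted in the thread, while the `errors` list and the helper names `mae` and `rmse` are hypothetical, chosen to show how one large miss moves RMSE more than MAE.

```python
import math


def mae(errors):
    """Mean absolute error of a list of (predicted - actual) values."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of the same list."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Figures quoted in the thread for LinearRegression on L2_MR.
mean_mae, stdev_mae = 0.64, 0.04

relative_spread = stdev_mae / mean_mae  # 0.04 / 0.64 = 1/16
low, high = mean_mae - stdev_mae, mean_mae + stdev_mae
print(f"fold MAE roughly {low:.2f} to {high:.2f}, spread {relative_spread:.2%}")

# Hypothetical errors (an assumption) with one large miss.
errors = [1.0, -1.0, 1.0, -5.0]
print(f"MAE {mae(errors):.2f}, RMSE {rmse(errors):.2f}")
```

Whichever measure is chosen, dividing its stdev by its mean gives a unitless spread that can be compared across runs, which is why sticking with one measure matters more than which one it is.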