Assuring the completeness of training data set

by J K Rai


> This was also my first thought having read the question
> ("similar+different kind" = 2 classes, a class balance
> issue). But when I read it again, the question seems more
> about the spread/location of instances along some measure of
> outliership (e.g. euclidean space).
>
> The asker would need now to confirm this I think, it is
> also possible both were meant, as both class imbalance and
> proportion of outliers are known to have an effect on how
> much the trainset model will overfit, i.e. perform worse
> than trainset CV tests suggest, against RW set.
Thanks for the replies,

I am a new user and have been using Weka for numeric classification (i.e. regression-style prediction of a numeric value). All attributes and class values are numeric.

The generated data set of 3025 instances represents the behavior of programs running on processors, which in a real scenario is not bounded, either in the number of instances or in the numeric values of the attributes and of the (single) class.

Hence I am trying to generate a model that is as representative as possible.
I am not sure how representative my 3025-instance dataset is.
Still, I feel that the existing data could be used in a better way to generate the most representative model.

Harry, could you please explain what an RW set is?

From the discussion so far, it looks like I should use SMOTE or the filter suggested by wirefree (is that meant for non-numeric classification?). Both seem to add more instances. Can I work with the original dataset instead, that is, reduce it to its most representative subset? (Please correct me if I am wrong; I feel that the added instances may not be representative.)
I wanted to attach the Q-Q plot of the dataset, but removed it because my previous mail bounced back for exceeding the size limit.
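
To make the reduction idea concrete, here is a minimal sketch in plain Python (not Weka code, and not a filter anyone in this thread suggested). It uses greedy farthest-point selection, which is only one possible reading of "most representative": it keeps instances that are spread out over the attribute space, so outlying instances are kept rather than dropped. The function name farthest_point_subset and the toy rows are hypothetical, and min-max scaling of the attributes is an assumption.

```python
import math
import random


def farthest_point_subset(rows, k, seed=0):
    """Greedily pick up to k row indices spread out over the attribute space.

    rows: list of equal-length lists of numbers (one list per instance).
    Returns a list of distinct indices into rows.
    """
    if not 0 < k <= len(rows):
        raise ValueError("k must be between 1 and len(rows)")

    # Assumption: min-max scale each attribute so no column dominates.
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - lo) or 1.0 for c, lo in zip(cols, lows)]
    scaled = [[(v - lo) / s for v, lo, s in zip(r, lows, spans)] for r in rows]

    # Start from a random instance (fixed seed for repeatability).
    rng = random.Random(seed)
    chosen = [rng.randrange(len(rows))]

    # dist[i] = Euclidean distance from row i to its nearest chosen row.
    dist = [math.dist(p, scaled[chosen[0]]) for p in scaled]

    while len(chosen) < k:
        nxt = max(range(len(rows)), key=dist.__getitem__)
        if dist[nxt] == 0:
            # Every remaining row duplicates an already chosen one.
            break
        chosen.append(nxt)
        dist = [min(d, math.dist(p, scaled[nxt])) for d, p in zip(dist, scaled)]
    return chosen


if __name__ == "__main__":
    # Hypothetical toy data: two tight groups and one point in between.
    toy = [[0, 0], [0.1, 0], [10, 10], [10, 10.1], [5, 5]]
    print(farthest_point_subset(toy, 3))
```

Whether a subset chosen this way actually yields a better model would still have to be checked, for example by comparing cross-validation results on the full and the reduced data.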

Looking forward to your suggestions.

Thanks again.
Regards,
Jitendra


