Quoting "J K Rai" <jk.anurag@...>:
>
>> This was also my first thought having read the question
>> ("similar+different kind" = 2 classes, a class balance
>> issue). But when I read it again, the question seems more
>> about the spread/location of instances along some measure of
>> outliership (e.g. Euclidean distance).
>>
>> The asker would now need to confirm this, I think. It is
>> also possible both were meant, as both class imbalance and
>> the proportion of outliers are known to affect how much the
>> trainset model will overfit, i.e. perform worse against the
>> RW set than the trainset CV tests suggest.
>
> Thanks for the replies,
>
> I am a new user, and have been using Weka for numeric classification
> (i.e. regression, predicting some numeric value). All
> attributes and class values are numeric.
>
> The generated dataset of 3025 instances represents the behavior of
> programs running on processors, which in a real scenario is not
> bounded (in the number of instances, or in the numeric values of the
> single class and of the attributes).
>
> Hence I am trying to generate a model that is as representative as possible.
> I am not quite sure how representative my 3025-instance dataset is.
> Still, I feel that the existing data can be used in a better way to
> generate the most representative model.
>
> Harry, can you please explain what is RW-set?
"real-world testset", as meant by you: the data on which you want to
estimate the performance of the trainset model, here by measuring the
trainset's representativeness internally
>
> From the discussion so far, it looks like I should use
> SMOTE or the filter suggested by wirefree (meant for non-numeric
> classification?). Both seem to add more instances. Can I do
> with the original dataset,
sure you can. forget about the adding part for now.
for original dataset I suggested running the following in experimenter:
- iterations: 1
- folds: 10
- classifier selection: pick a few fast-running classifiers, e.g. the
following (which ones we choose doesn't matter much for our purposes,
but it is best to have several):
bayes.NaiveBayes
trees.J48
lazy.IBk
(all classifiers in their default configurations)
after it finishes, go to the results tab and, from the row/column
buttons, select something reading "standard deviation".
results analysis: the stdev could be e.g. +/- 2.3%, which, under the
above assumption of similarity between trainset and testset, is your
figure for the representativeness of your trainset (the smaller this
deviation, the more representative your trainset)
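
To make the idea concrete outside the Experimenter, here is a minimal sketch in plain Python (not Weka code). It runs one 10-fold cross-validation and reports the standard deviation of the per-fold error. Everything in it is an assumption made for illustration: the data is synthetic, the helper names `knn_predict` and `fold_errors` are made up, and a small nearest-neighbour regressor only stands in for lazy.IBk. Because the class in this thread is numeric, the sketch measures mean absolute error rather than percent correct.

```python
import random
import statistics


def knn_predict(train, x, k=3):
    """Predict a numeric target as the mean target of the k nearest training rows."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    return statistics.mean(y for _, y in nearest)


def fold_errors(data, folds=10, k=3, seed=1):
    """Return the mean absolute error measured on each cross-validation fold."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    errors = []
    for i in range(folds):
        # Every folds-th row (offset i) is held out; the rest is used for training.
        test = rows[i::folds]
        train = [r for j, r in enumerate(rows) if j % folds != i]
        errors.append(
            statistics.mean(abs(knn_predict(train, x, k) - y) for x, y in test)
        )
    return errors


# Synthetic stand-in data (an assumption): two numeric attributes, one numeric class.
rng = random.Random(0)
data = []
for _ in range(200):
    x = (rng.uniform(0, 10), rng.uniform(0, 10))
    y = 2 * x[0] + x[1] + rng.gauss(0, 0.5)
    data.append((x, y))

errors = fold_errors(data, folds=10, k=3)
spread = statistics.stdev(errors)
print(f"mean MAE = {statistics.mean(errors):.3f}, stdev across folds = {spread:.3f}")
```

A small `spread` means the folds behave alike, which is the sense of "representative" used above; a large one means the error depends strongly on which instances happen to be held out.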
> I mean by reducing the dataset to most representative one? (please
> correct me) (I feel that the added instances may not be
> representative ones.)
let's get back to that later, after you run the above
> I am attaching the Q-Q plot of the dataset. (removed it as previous
> mail bounced back due to exceeding the size limit)
ok, any sizeable data you have you can send offlist to
my.name@..., including the dataset
>
> Looking forward to suggestions.
>
> Thanks again.
> Regards,
> Jitendra
--
kind regards,
Harri M.T. Saarikoski
M.A, PhD graduate student
Helsinki University
Finland
_______________________________________________
Wekalist mailing list
Send posts to:
Wekalist@...
List info and subscription status:
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette:
http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html