Are we supposed to Mention CLASS in TESTING DATA

by All Good

Sorry for this almost stupid question, but just to confirm: in the testing file, are we supposed to give the class label after each instance?

_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: Are we supposed to Mention CLASS in TESTING DATA

by Peter Reutemann-3

> Sorry for this almost stupid question .. But it's just for confirmation that
> in the testing file, are we supposed to mention the class label after each
> instance??

If your test set doesn't contain any values for the class attribute,
then you can't evaluate the performance of a classifier (classifiers
aren't psychic yet). Even if you just want to output the predictions,
your test set still needs the same structure as the training set. In
that case, just use missing values (in ARFF files, missing values are
denoted by the question mark).
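
As a rough illustration of the advice above, the following plain-Java sketch writes a small ARFF test set in which the class attribute is declared as usual but every class value is `?`. The relation name, attribute names, and class labels are made up for the example; a real test file would repeat the header of the actual training file.

```java
import java.util.List;

// Builds a tiny ARFF test set whose class values are all missing ("?").
// Relation name, attribute names, and labels are hypothetical.
class UnlabeledArff {
    static String build(List<double[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation demo\n\n");
        sb.append("@attribute length numeric\n");
        sb.append("@attribute width numeric\n");
        // The class attribute is still declared, as it would be in the training file.
        sb.append("@attribute class {yes,no}\n\n");
        sb.append("@data\n");
        for (double[] r : rows) {
            // "?" marks the unknown class value of this instance.
            sb.append(r[0]).append(',').append(r[1]).append(",?\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build(List.of(new double[] {5.1, 3.5}, new double[] {6.2, 2.9})));
    }
}
```

Each data row then ends in `,?`, so the file keeps the same structure as the training set without claiming any class label.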

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174

Re: Are we supposed to Mention CLASS in TESTING DATA

by tetsu yatsu

Writing a text analysis program and generating test datasets with StringToWordVector to classify with a pre-trained classifier, I noticed that it classified even without the exact same format.

I am now under the impression that Weka matches up the columns with similar names and ignores the ones that don't exist. Maybe they're counted as 'missing' by the classifier? I don't really know! I just know that it works!

Is it just taking the first n attributes, and discarding n+1 and on? Incorrectly comparing attributes that aren't the same metric? (if so, is there a way to StringToWordVector a String field into the appropriate attribute set?)

On Sun, Nov 8, 2009 at 12:12 AM, Peter Reutemann <fracpete@...> wrote:
> [...]

Re: Are we supposed to Mention CLASS in TESTING DATA

by Peter Reutemann-3

Please no top-posting; see the mailing list etiquette for why
(http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html).

> Writing a text analysis program and generating test datasets with
> StringToWordVector to classify with a pre-trained classifier, I noticed that
> it classified even without the exact same format.

Use batch filtering (see FAQ "How do I generate compatible train and
test sets that get processed with a filter?") or the
FilteredClassifier meta-classifier (StringToWordVector and your choice
of base-classifier) to ensure compatible datasets.
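
The idea behind both suggestions can be sketched in plain Java: the word-to-column mapping is learned once from the training text and then reused unchanged for the test text. This is only an analogy, not StringToWordVector's implementation; the class `ToyWordVectorizer` and its methods are hypothetical, and dropping unseen words is a choice made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy bag-of-words vectorizer (hypothetical, not part of Weka).
class ToyWordVectorizer {
    private final Map<String, Integer> columnOf = new LinkedHashMap<>();

    // Learn the vocabulary (word -> column index) from the training documents only.
    void fit(List<String> trainDocs) {
        for (String doc : trainDocs) {
            for (String w : doc.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) {
                    columnOf.putIfAbsent(w, columnOf.size());
                }
            }
        }
    }

    // Map any document onto the learned columns; words not seen in training are dropped.
    double[] transform(String doc) {
        double[] v = new double[columnOf.size()];
        for (String w : doc.toLowerCase().split("\\s+")) {
            Integer col = columnOf.get(w);
            if (col != null) {
                v[col] += 1.0;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        ToyWordVectorizer vec = new ToyWordVectorizer();
        vec.fit(List.of("good movie", "bad movie"));
        // Test text is mapped with the training vocabulary, so the columns line up.
        System.out.println(java.util.Arrays.toString(vec.transform("bad bad plot")));
    }
}
```

Because the test document is transformed with the vocabulary learned from the training documents, its vector has the same length and column order as the training vectors, which is what "compatible datasets" requires.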

> I am now under the impression that Weka will match up the columns with
> similar names, and that the ones that don't exist are ignored. Maybe
> they're counted as 'missing' by the classifier? I don't really know! I just
> know that it works!

Weka *assumes* that the datasets have the same structure. There is no
magic happening: Weka's classifiers just use attribute indices when
making predictions. If the datasets differ, values from different
attributes (at the same location) will be picked. In a nutshell: the
results cannot be trusted.
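
A toy sketch of why index-based lookup gives untrustworthy results on a differently structured dataset. This is not Weka code; the class `IndexBasedRule`, the threshold, and the attribute layout are invented for illustration.

```java
// Toy illustration, not Weka code: the "model" keeps only the position of
// the attribute it was trained on, plus a threshold (both values made up).
class IndexBasedRule {
    private final int attributeIndex;
    private final double threshold;

    IndexBasedRule(int attributeIndex, double threshold) {
        this.attributeIndex = attributeIndex;
        this.threshold = threshold;
    }

    // Reads whatever value sits at the stored position; attribute names are never checked.
    String predict(double[] instance) {
        return instance[attributeIndex] > threshold ? "long" : "short";
    }

    public static void main(String[] args) {
        // Trained on the layout [length, width]: "long" if length > 4.0.
        IndexBasedRule rule = new IndexBasedRule(0, 4.0);
        double[] sameLayout = {5.0, 1.0};    // [length, width]
        double[] swappedLayout = {1.0, 5.0}; // [width, length], same object described
        System.out.println(rule.predict(sameLayout));    // prints "long"
        System.out.println(rule.predict(swappedLayout)); // prints "short"
    }
}
```

The second call raises no error. It simply compares the width against a threshold that was meant for the length, so the prediction is silently wrong.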

> Is it just taking the first n attributes, and discarding n+1 and on?

That depends on the classifier. A decision-tree-based classifier might
only need 5 attributes out of 1000.

> Incorrectly comparing attributes that aren't the same metric?

Weka's internal data format is just doubles (for nominal attributes,
the indices of the labels are stored), nothing more. Simple and fast
number comparisons happen internally. Computationally expensive
mappings between the original attribute space of the input and a
potentially different attribute space at prediction time would make
it unbearably slow.
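
To make the "indices of labels" point concrete, here is a minimal plain-Java sketch of storing a nominal value as the index of its label. The helper `NominalEncoding.encode` and the weather labels are hypothetical; the point is only that the same label becomes a different number when the declared label order differs.

```java
import java.util.List;

// Hypothetical helper: a nominal value is stored as the position of its
// label in the attribute's declared label list, as a double.
class NominalEncoding {
    static double encode(List<String> labels, String value) {
        int idx = labels.indexOf(value);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown label: " + value);
        }
        return idx;
    }

    public static void main(String[] args) {
        List<String> trainLabels = List.of("sunny", "overcast", "rainy");
        List<String> testLabels = List.of("rainy", "sunny", "overcast");
        System.out.println(encode(trainLabels, "rainy")); // prints 2.0
        System.out.println(encode(testLabels, "rainy"));  // prints 0.0
    }
}
```

A model trained with the first header would read the stored value 0.0 as "sunny", since only the number survives, not the label text.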

> (if so, is
> there a way to StringToWordVector a String field into the appropriate
> attribute set?)

Like I said above, use batch filtering or the FilteredClassifier approach.

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174

Re: Are we supposed to Mention CLASS in TESTING DATA

by tetsu yatsu

My test set may be huge. Can I serialize filters?
 
Gosh, that'd be convenient.

On Sun, Nov 8, 2009 at 5:34 AM, Peter Reutemann <fracpete@...> wrote:
> [...]

Re: Are we supposed to Mention CLASS in TESTING DATA

by Peter Reutemann-3

Hint: top-posting is frowned upon, and please delete irrelevant stuff from posts...

> My test set may be huge. Can I serialize filters?

Yes, like most classes in Weka.

BTW If you use the FilteredClassifier approach (with the classifier
and filter that you want to use), then you can just use the original
datasets and the FilteredClassifier takes care of the rest. The filter
and base-classifier are part of the FilteredClassifier object and will
get serialized with it.
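
The "serialized with it" behaviour can be sketched with standard Java serialization, assuming that is the mechanism meant here. `ToyPipeline` is a hypothetical stand-in, not a Weka class: its vocabulary plays the role of the fitted filter and its weights the role of the trained base classifier, and both travel together when the wrapper object is written out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical wrapper holding a fitted "filter" and a trained "classifier".
class ToyPipeline implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String[] vocabulary; // stands in for the fitted filter
    private final double[] weights;    // stands in for the trained classifier

    ToyPipeline(String[] vocabulary, double[] weights) {
        this.vocabulary = vocabulary;
        this.weights = weights;
    }

    // Sums the weight of every known word in the raw text.
    double score(String doc) {
        double sum = 0.0;
        for (String w : doc.toLowerCase().split("\\s+")) {
            for (int i = 0; i < vocabulary.length; i++) {
                if (vocabulary[i].equals(w)) {
                    sum += weights[i];
                }
            }
        }
        return sum;
    }

    // Serializes the whole object, including both member arrays.
    byte[] toBytes() {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(this);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Restores a pipeline from bytes produced by toBytes().
    static ToyPipeline fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (ToyPipeline) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A restored pipeline can score raw text directly, because the preprocessing state was saved alongside the model rather than in a separate file.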

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174
