A Linear Regression Problem

by All Good

Hi,

We have a set of 25 topics. We use 10 topics for training and 10 for testing, and we kept 5 for confirming the results, with 3 DIFFERENT combinations of FEATURES (so that we can show which features are more important).

The results are as we expected, with one strange exception. We expected the result of Comb2 to be larger than that of Comb1. This is what we see when testing on the 5 confirmation topics, but on the TESTING DATA it is smaller (i.e., Comb2 < Comb1).

Do you think it is a problem of too little training data, or something else?

Thanks in advance



_______________________________________________
Wekalist mailing list
Send posts to: Wekalist@...
List info and subscription status: https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
List etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

Re: A Linear Regression Problem

by Peter Reutemann-3

> We have a set of 25 topics. We use 10 topics for training and 10 for testing,
> and we kept 5 for confirming the results, with 3 DIFFERENT combinations of
> FEATURES (so that we can show which features are more important).
>
> The results are as we expected, with one strange exception. We expected the
> result of Comb2 to be larger than that of Comb1. This is what we see when
> testing on the 5 confirmation topics, but on the TESTING DATA it is smaller
> (i.e., Comb2 < Comb1).
>
> Do you think it is a problem of too little training data, or something else?

I'm not quite sure what you mean by 25 topics. Are these 25 data
samples, i.e., rows/instances in the dataset? If so, 25 samples is
very little data... I personally prefer working with thousands of
examples.

Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cs.waikato.ac.nz/~fracpete/           Ph. +64 (7) 858-5174

Re: A Linear Regression Problem

by All Good

Sorry that I did not explain in detail. Let me give the details:

They are not samples or instances. You can call them queries, for which we need to find the RELEVANT documents. So in the training data (10 queries) we have almost 7800 instances, in the testing data almost 8330 instances, and in the confirmation data (5 queries) 5150 instances (i.e., each instance has a class label of Relevant or NonRelevant).
So it's quite a lot, I think. As I explained, we are checking 3 different combinations of features (one feature in Comb1, 8 features in Comb2 and 10 features in Comb3). Theoretically, the results (Precision@5, calculated by our program from the regression output) should be ordered like

Comb1 < Comb2 < Comb3

It gives this ordering on the confirmation data, but on the testing data one exception occurs, where Comb1 > Comb2.

So it's kind of strange.

(The procedure I use with Linear Regression is: I give the training file to the regression, then give it the testing data, which produces some predictions; our program then uses those predictions to compute P@5 on the testing data. Similarly, I use the confirmation data to get its results.)

Maybe I am making a mistake somewhere; if so, please point it out.

Thanks in advance



On Mon, Nov 9, 2009 at 8:36 AM, Peter Reutemann <fracpete@...> wrote:
[...]

Re: A Linear Regression Problem

by Peter Reutemann-3

Please no top-posting; see the mailing list etiquette for why
(http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html).

> They are not samples or instances. You can call them queries, for which we
> need to find the RELEVANT documents. So in the training data (10 queries) we
> have almost 7800 instances, in the testing data almost 8330 instances, and
> in the confirmation data (5 queries) 5150 instances (i.e., each instance has
> a class label of Relevant or NonRelevant).

I thought you're using LinearRegression, but your dataset seems to
have a binary class attribute instead of a numeric one.

> So it's quite a lot, I think.

It's reasonably large, yes.

> As I explained, we are checking 3 different combinations of features (one
> feature in Comb1, 8 features in Comb2 and 10 features in Comb3).
> Theoretically, the results (Precision@5, calculated by our program from the
> regression output) should be ordered like

Not sure what you mean by "Precision@5".

> Comb1 < Comb2 < Comb3
>
> It gives this ordering on the confirmation data, but on the testing data one
> exception occurs, where Comb1 > Comb2.
>
> So it's kind of strange.
>
> (The procedure I use with Linear Regression is: I give the training file to
> the regression, then give it the testing data, which produces some
> predictions; our program then uses those predictions to compute P@5 on the
> testing data. Similarly, I use the confirmation data to get its results.)
>
> Maybe I am making a mistake somewhere; if so, please point it out.

Since you seem to perform some kind of post-processing of the
regression output, it's hard to figure out what's going wrong. One reason
could be that the choice of datasets (train, test, validation) is
unfortunate, resulting in the unexpected behavior that you described.

Have you thought about using the meta-classifier
ClassificationViaRegression, in case your dataset has a binary class
attribute? Then you could perform cross-validation (and in the
Experimenter even multiple runs of cross-validation) in order to
obtain more reliable statistics.
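
For data organised by query, as in this thread, one possible way to set up such a cross-validation is to build the folds from query IDs, so that all instances of one query end up on the same side of each split. The sketch below is plain Python and not a Weka API. The helper names are hypothetical, and splitting by query rather than by instance is an assumption made here because P@5 is computed per query.

```python
import random


def query_folds(query_ids, k, seed=0):
    """Distribute the distinct query IDs over k folds (hypothetical helper)."""
    unique = sorted(set(query_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique)
    # Take every k-th ID, starting at offset i, for fold i.
    return [unique[i::k] for i in range(k)]


def train_test_splits(query_ids, k, seed=0):
    """Yield (train_queries, test_queries) pairs, one per fold."""
    folds = query_folds(query_ids, k, seed)
    for i, held_out in enumerate(folds):
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield sorted(train), sorted(held_out)


# 25 queries as in the thread, numbered 1..25 purely for illustration.
splits = list(train_test_splits(range(1, 26), k=5))
for train_queries, test_queries in splits:
    print(len(train_queries), "train queries /", len(test_queries), "test queries")
```

Repeating this with different seeds gives several runs of cross-validation, which should show whether the ordering of Comb1 and Comb2 is stable or depends on which queries happen to be held out.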

Cheers, Peter

Re: A Linear Regression Problem

by All Good

> I thought you're using LinearRegression, but your dataset seems to
> have a binary class attribute instead of a numeric one.

I am using 1 for Relevant and 0 for Non-Relevant.

P@5 means we rank all results by the predictions GIVEN by the regression. For the top 5, we look at their actual CLASS and then calculate Precision@5. I will read about the meta-classifier, because with the current setup of the training data it is not enabled in the Explorer; maybe it requires the data in a different format.
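
The P@5 computation described above can be sketched roughly as follows. This is only an illustration: the function names and the example scores are made up, and the program actually used in this thread is not shown anywhere. It assumes labels are 1 for Relevant and 0 for Non-Relevant, and that P@5 is computed per query and then averaged over queries.

```python
def precision_at_k(scores, labels, k=5):
    """Fraction of relevant documents among the k highest-scored ones.

    scores: regression predictions, one per document of a single query
    labels: actual classes (1 = Relevant, 0 = Non-Relevant), same order
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / k


def mean_precision_at_k(per_query, k=5):
    """Average P@k over queries; per_query maps query id -> (scores, labels)."""
    values = [precision_at_k(s, l, k) for s, l in per_query.values()]
    return sum(values) / len(values)


# Made-up predictions for two hypothetical queries.
per_query = {
    "q1": ([0.9, 0.1, 0.8, 0.4, 0.7, 0.3, 0.6], [1, 0, 0, 1, 1, 0, 1]),
    "q2": ([0.2, 0.5, 0.1, 0.9, 0.3, 0.4], [0, 1, 0, 0, 0, 1]),
}
print(precision_at_k(*per_query["q1"]))  # 4 of the top 5 are relevant
print(mean_precision_at_k(per_query))
```

Note that only five documents per query contribute to the score. With 10 test queries the averaged P@5 moves in steps of 1/50, so a handful of documents changing rank can be enough to flip the ordering of two feature combinations.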





On Mon, Nov 9, 2009 at 10:15 AM, Peter Reutemann <fracpete@...> wrote:
[...]

Re: A Linear Regression Problem

by Peter Reutemann-3

> > I thought you're using LinearRegression, but your dataset seems to
> > have a binary class attribute instead of a numeric one.
>
> I am using 1 for Relevant and 0 for Non-Relevant.
>
> P@5 means we rank all results by the predictions GIVEN by the regression.
> For the top 5, we look at their actual CLASS and then calculate Precision@5.
> I will read about the meta-classifier, because with the current setup of the
> training data it is not enabled in the Explorer; maybe it requires the data
> in a different format.

The meta-classifier needs a nominal class attribute. It doesn't work
with numeric class attributes. You'd have to discretize the class
attribute first (using Weka's unsupervised Discretize filter, for
instance).

[...]

Cheers, Peter

Re: A Linear Regression Problem

by All Good

Yes, thanks. I just read it.

But I think you suggested this only because you thought I had a nominal class attribute. So it might give the same results as Linear Regression, if the two are effectively the same. I will try cross-validation and see the results. I am really thankful to you for all your kind help.

Thanks a lot and God bless you


On Mon, Nov 9, 2009 at 10:30 AM, Peter Reutemann <fracpete@...> wrote:
[...]

Re: A Linear Regression Problem

by All Good

One of the problems is that there is quite a big difference between the sizes of the positive and negative classes.

There are many more Non-Relevant instances than Relevant ones (more than twice as many NR as RV). So I think I should use cost-sensitive evaluation. I have read something here
http://wekadocs.com/node/15

but I still cannot understand how I should set the cost matrix, because in the matrix shown on that page I don't know which value corresponds to which label ...

% Rows Columns
2 2
% Matrix elements
0 2
1 0

For example, in the above matrix, where do I need to put the 2 (to tell the classifier that misclassifying a RELEVANT instance is twice as expensive as misclassifying an NR one)?

Also, could you tell me which Weka classes correspond to LogisticRegression and RegressionSVM?

Thanks


On Mon, Nov 9, 2009 at 10:36 AM, All Good <kimskams80@...> wrote:
[...]

Re: A Linear Regression Problem

by Peter Reutemann-3

Please no top-posting...

> One of the problems is that there is quite a big difference between the
> sizes of the positive and negative classes.
>
> There are many more Non-Relevant instances than Relevant ones (more than
> twice as many NR as RV). So I think I should use cost-sensitive evaluation.
> I have read something here
> http://wekadocs.com/node/15
>
> but I still cannot understand how I should set the cost matrix, because in
> the matrix shown on that page I don't know which value corresponds to which
> label ...
>
> % Rows Columns
> 2 2
> % Matrix elements
> 0 2
> 1 0
>
> For example, in the above matrix, where do I need to put the 2 (to tell the
> classifier that misclassifying a RELEVANT instance is twice as expensive as
> misclassifying an NR one)?

The matrix just holds the costs for the cells of the confusion matrix.
On the diagonal, you have the correctly classified instances, and all
other cells are for misclassified ones.

Here is an example of a confusion matrix (J48 on the UCI dataset "iris"):

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

The first row contains all the instances that have label "Iris-setosa"
(= a). 1 instance got misclassified as "Iris-versicolor" (= b).
The cost matrix merely places "costs" in the misclassification cells.
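
To make the correspondence concrete, here is a small sketch in plain Python (not Weka code). It assumes the cost matrix uses the same layout as the confusion matrix above, i.e. rows are the actual class and columns are the predicted class; it is worth checking this orientation against the documentation of your Weka version. The class order [NR, RV] and the binary confusion counts are made up for illustration; in practice the order follows the order of the class values declared in the dataset.

```python
def total_cost(confusion, costs):
    """Sum of count * cost over all cells.

    Both matrices are assumed to share one layout:
    rows = actual class, columns = predicted class.
    """
    return sum(
        count * cost
        for count_row, cost_row in zip(confusion, costs)
        for count, cost in zip(count_row, cost_row)
    )


# Confusion matrix from the iris example above.
iris_confusion = [
    [49, 1, 0],   # actual Iris-setosa
    [0, 47, 3],   # actual Iris-versicolor
    [0, 2, 48],   # actual Iris-virginica
]
# Cost 1 for every misclassification, 0 on the diagonal.
unit_costs = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
print(total_cost(iris_confusion, unit_costs))  # number of misclassified instances

# Binary case, assumed class order: index 0 = NR, index 1 = RV.
# Row RV, column NR is "actual Relevant, classified as Non-Relevant",
# so that is where the 2 goes under the layout assumed here.
relevance_costs = [
    [0, 1],   # actual NR: correct, classified as RV
    [2, 0],   # actual RV: classified as NR, correct
]
hypothetical_confusion = [
    [90, 10],  # made-up counts for actual NR
    [5, 45],   # made-up counts for actual RV
]
print(total_cost(hypothetical_confusion, relevance_costs))
```

If the class values were declared in the opposite order (RV first), the 2 would move to the first row, second column instead.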

See also the following wiki article:
  http://weka.wikispaces.com/CostSensitiveClassifier

> Also, could you tell me which Weka classes correspond to LogisticRegression
> and RegressionSVM?

weka.classifiers.functions.Logistic
weka.classifiers.functions.SVMreg or SMOreg (depends on your version of Weka)

Cheers, Peter
