Next step in covariance matrix

Next step in covariance matrix

by John Darrington-3

The code currently in the covariance branch should now correctly
produce a covariance matrix with an arbitrary number of categorical
variables subject to the following provisos:

* Missing values aren't properly handled.  
* No interactions are supported.

I don't think that either of these is going to be particularly
hard to add (if I correctly understand the issues). But I'd like
to get the current functionality thoroughly tested first.

I think the next step should be to get some sample procedures working
(ones which don't need interactions) so that we can solidly test what
we have so far.  After that we can fix up missing value handling, and
add support for interactions.

So what procedures would be best? ANOVA, MANOVA, UNIANOVA or a subset
of GLM? And are there any good texts on how to perform ANOVA from a covariance
matrix?  Most seem to assume that the sums of squares have been separately
calculated.

Comments?

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.




_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev


Re: Next step in covariance matrix

by Jason Stover-3

I haven't tested it yet, but it looks like the computation of the
dimension might be wrong when categorical variables are involved.
If a categorical variable has k categories, its contribution to the
dimension should be k-1. But this line in covariance.c:

cov->dim = cov->n_vars + categoricals_total (cov->categoricals);

...suggests the contribution to the dimension would be k.

The contribution to the dimension is k-1 because the range of possible
values of k categories is spanned by k-1 basis vectors.  The kth
vector is the origin, which corresponds to exactly one of the
categories.  Which is chosen as the origin is arbitrary (some software
chooses the first category seen, some the last).
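As a rough sketch of the counting rule above (the helper name and its arguments are hypothetical and are not taken from covariance.c), the dimension could be computed by letting each categorical variable contribute one column fewer than it has categories:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of covariance.c: the number of rows
   (and columns) of the covariance matrix when each categorical
   variable with k categories contributes k - 1 columns. */
size_t
covariance_dim (size_t n_numeric, const size_t *n_categories,
                size_t n_categoricals)
{
  size_t dim = n_numeric;
  for (size_t i = 0; i < n_categoricals; i++)
    {
      assert (n_categories[i] >= 1);  /* at least one category per variable */
      dim += n_categories[i] - 1;     /* one category acts as the origin */
    }
  return dim;
}
```

For instance, two numeric variables plus categorical variables with 3 and 2 categories would give 2 + 2 + 1 = 5.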

> So what procedures would be best? ANOVA, MANOVA, UNIANOVA or a subset
> of GLM? And are there any good texts on how to perform ANOVA from a covariance
> matrix?  Most seem to assume that the sums of squares have been separately
> calculated.

glm.q now uses the new routines, so that might be a good place to start.

There are plenty of pertinent texts that cover computation of least
squares estimates from a covariance matrix, but the ones I have seen
are aimed at interpreting results and establishing theory rather than
computations. I'll look up a few and send them along.

Golub and Van Loan has all of the necessary computations, but doesn't
mention "covariance matrices" by name. They just mention least squares.



Re: Next step in covariance matrix

by John Darrington-3

On Sun, Oct 25, 2009 at 11:54:16PM -0400, Jason Stover wrote:
     I haven't tested it yet, but it looks like the computation of the
     dimension might be wrong when categorical variables are involved.
     If a categorical variable has k categories, its contribution to the
     dimension should be k-1. But this line in covariance.c:
     
     cov->dim = cov->n_vars + categoricals_total (cov->categoricals);
     
     ...suggests the contribution to the dimension would be k.
     
     The contribution to the dimension is k-1 because the range of possible
     values of k categories is spanned by k-1 basis vectors.  The kth
     vector is the origin, which corresponds to exactly one of the
     categories.  Which is chosen as the origin is arbitrary (some software
     chooses the first category seen, some the last).


So I've probably done it wrong.

Just to make sure I understand things correctly, consider the following example,
where x and y are numeric variables and A and B are categorical ones:

x y A B
=======
3 4 x v
5 6 y v
7 8 z w

We replace the categorical variables with bit_vectors:

x y A_0 A_1 A_2  B_0 B_1
========================
3 4  1   0   0    1   0
5 6  0   1   0    1   0
7 8  0   0   1    0   1

and arbitrarily drop the (say zeroth) subscript:

x y  A_1 A_2   B_1
==================
3 4   0   0     0
5 6   1   0     0
7 8   0   1     1

That will produce a 5x5 matrix. 5 is calculated from n + m - p, where
n is the number of numeric variables, m is the total number of categories,
and p is the number of categorical variables.
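The encoding just described might be sketched as follows. The function names are made up for illustration and are not PSPP routines; category indices are assumed to be zero-based, with category 0 as the dropped one.

```c
#include <assert.h>
#include <stddef.h>

/* Write the k - 1 indicator columns for one categorical value into OUT.
   CATEGORY is the zero-based index of the observed category; category 0
   is the dropped reference level, so it is encoded as all zeros.
   Returns the number of columns written. */
size_t
encode_dropping_first (size_t category, size_t n_categories, double *out)
{
  assert (category < n_categories);
  for (size_t j = 1; j < n_categories; j++)
    out[j - 1] = (category == j) ? 1.0 : 0.0;
  return n_categories - 1;
}

/* Build one row: the numeric values followed by the indicator columns
   of each categorical variable.  Returns the row length, n + m - p. */
size_t
encode_row (const double *numeric, size_t n_numeric,
            const size_t *categories, const size_t *n_categories,
            size_t n_categoricals, double *row)
{
  size_t col = 0;
  for (size_t i = 0; i < n_numeric; i++)
    row[col++] = numeric[i];
  for (size_t v = 0; v < n_categoricals; v++)
    col += encode_dropping_first (categories[v], n_categories[v], row + col);
  return col;
}
```

With the third case above (x = 7, y = 8, A = z, B = w, i.e. category indices 2 and 1), encode_row fills the row 7 8 0 1 1 and returns 5.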

However I don't see how such a matrix can be very useful. A better one would involve
the products of the categorical and numeric variables:

x y  x*A_1 x*A_2  y*A_1 y*A_2   x*B_1 y*B_1
===========================================
3 4     0   0        0     0       0     0
5 6     5   0        6     0       0     0
7 8     0   7        0     8       7     8

This makes an 8x8 matrix, where 8 is calculated from n + n * (m - p),
which happens to be identical to n * (1 + m - p).  But this involves
a whole lot more calculations.
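For comparison, the product columns of this second scheme could be generated along these lines. This is only a sketch: encode_products is a hypothetical name, and the indicator columns are assumed to have been computed already with one category dropped per variable.

```c
#include <stddef.h>

/* NUMERIC holds the n numeric values of one case.  INDICATORS holds the
   0/1 columns of all categorical variables back to back, N_INDICATORS[v]
   of them (k - 1) for variable v.  OUT receives the numeric values
   followed, for each categorical variable, by numeric * indicator
   products.  Returns the number of columns, n + n * (m - p). */
size_t
encode_products (const double *numeric, size_t n_numeric,
                 const double *indicators, const size_t *n_indicators,
                 size_t n_categoricals, double *out)
{
  size_t col = 0;
  for (size_t i = 0; i < n_numeric; i++)
    out[col++] = numeric[i];

  const double *ind = indicators;
  for (size_t v = 0; v < n_categoricals; v++)
    {
      for (size_t i = 0; i < n_numeric; i++)
        for (size_t j = 0; j < n_indicators[v]; j++)
          out[col++] = numeric[i] * ind[j];
      ind += n_indicators[v];   /* move on to the next variable's columns */
    }
  return col;
}
```

The loop order reproduces the column order of the table above: x*A_1, x*A_2, y*A_1, y*A_2, then x*B_1, y*B_1.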

Which of these schemes is correct, if any?



J'
     

Re: Next step in covariance matrix

by Jason Stover-3

On Tue, Oct 27, 2009 at 06:38:19AM +0000, John Darrington wrote:

> Just to make sure I understand things correctly, consider the following example,
> where x and y are numeric variables and A and B are categorical ones:
>
> x y A B
> =======
> 3 4 x v
> 5 6 y v
> 7 8 z w
>
> We replace the categorical variables with bit_vectors:
>
> x y A_0 A_1 A_2  B_0 B_1
> ========================
> 3 4  1   0   0    1   0
> 5 6  0   1   0    1   0
> 7 8  0   0   1    0   1
>
> and arbitrarily drop the (say zeroth) subscript:
>
> x y  A_1 A_2   B_1
> ==================
> 3 4   0   0     0
> 5 6   1   0     0
> 7 8   0   1     1
>
> That will produce a 5x5 matrix. 5 is calculated from n + m - p,  where
> n is the number of numeric  variables, m is the total number of  categories,
> and p is the number of categorical variables.  

This is correct.

> However I don't see how such a matrix can be very useful. A better one would involve
> the  products of the categorical and numeric variables:
>
> x y  x*A_1 x*A_2  y*A_1 y*A_2   x*B_1 y*B_1
> ===========================================
> 3 4     0   0        0     0       0     0
> 5 6     5   0        6     0       0     0
> 7 8     0   7        0     8       7     8
>
> This makes an 8x8 matrix, where 8 is calculated from n + n * (m - p) ,
> which happens to be identical to n * (1 + m - p).  But this involves
> a whole lot more calculations.

This second choice would give you the covariance of x and y, and the
covariances of the *interactions* between x and A, x and B, y and A,
and y and B, but not the covariance between (say) x and A. The
covariance between x and A would be stored in the first matrix you
mentioned, in elements (0,2), (0,3), (2,0) and (3,0) assuming we kept
both upper and lower triangles.
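To make the element positions concrete, here is a small sketch that computes a full covariance matrix from an encoded data array. It assumes the usual sample covariance with divisor N - 1 and a plain two-pass computation; the function name is hypothetical and this is not how covariance.c accumulates its sums.

```c
#include <assert.h>
#include <stddef.h>

/* Sample covariance (divisor n_rows - 1) of the columns of DATA, an
   n_rows x n_cols array stored row by row.  The result is written to
   COV, an n_cols x n_cols array with both triangles filled. */
void
covariance_matrix (const double *data, size_t n_rows, size_t n_cols,
                   double *cov)
{
  assert (n_rows >= 2);
  for (size_t a = 0; a < n_cols; a++)
    for (size_t b = a; b < n_cols; b++)
      {
        double mean_a = 0.0, mean_b = 0.0, sum = 0.0;
        for (size_t r = 0; r < n_rows; r++)
          {
            mean_a += data[r * n_cols + a];
            mean_b += data[r * n_cols + b];
          }
        mean_a /= n_rows;
        mean_b /= n_rows;
        for (size_t r = 0; r < n_rows; r++)
          sum += (data[r * n_cols + a] - mean_a)
                 * (data[r * n_cols + b] - mean_b);
        /* Store the value in both the upper and the lower triangle. */
        cov[a * n_cols + b] = cov[b * n_cols + a] = sum / (n_rows - 1);
      }
}
```

Applied to the three encoded cases (columns x, y, A_1, A_2, B_1), the covariances between x and the two columns of A land in cov[0 * 5 + 2] and cov[0 * 5 + 3], mirrored in cov[2 * 5 + 0] and cov[3 * 5 + 0].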

You mention that matrix not being very useful, and in a sense it
isn't: No human would care about the covariance between x and the
column corresponding to the first bit vector of A. But in another
sense, that matrix is absolutely necessary: It's used to solve the
least squares problem, whose solution we use to tell us if A and our
dependent variable are related. That relation is shown via analysis of
variance, whose p-value is many computations away from the covariance
matrix, but depends on it nevertheless.
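One way to spell out that dependence, in notation that is mine rather than the thread's: let S be the sample covariance matrix (divisor N - 1) of the predictor columns together with the dependent variable y, and assume the model has an intercept and that S_xx is invertible, which is what dropping one column per categorical variable is meant to help ensure.

```latex
\[
S = \begin{pmatrix} S_{xx} & s_{xy} \\ s_{xy}^{\top} & s_{yy} \end{pmatrix},
\qquad
\hat\beta = S_{xx}^{-1} s_{xy},
\qquad
\hat\beta_0 = \bar y - \bar x^{\top} \hat\beta
\]
\[
\mathrm{SS}_{\mathrm{regression}} = (N-1)\, s_{xy}^{\top} S_{xx}^{-1} s_{xy},
\qquad
\mathrm{SS}_{\mathrm{error}} = (N-1)\,\bigl( s_{yy} - s_{xy}^{\top} S_{xx}^{-1} s_{xy} \bigr)
\]
```

The F statistics and p-values of the analysis of variance are then built from sums of squares of this kind, obtained by fitting the model with and without the columns belonging to A.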

This matrix is unnecessary for a one-way ANOVA, whose computations from
the matrix above can be simplified into the simple sums used in
oneway.q.  But for a bigger model, with many factors and interactions
and covariates, we need that first matrix because we can't reduce the
problem to a few easy-to-read summations.
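For the one-way case, the "simple sums" could look like the following sketch. It is not taken from oneway.q; the struct and function names are invented for illustration, and group labels are assumed to be zero-based indices.

```c
#include <assert.h>
#include <stddef.h>

struct oneway_sums
  {
    double between;   /* between-group sum of squares */
    double within;    /* within-group sum of squares */
  };

/* Between- and within-group sums of squares for the N observations Y,
   where GROUP[i] in [0, n_groups) is the group of observation i. */
struct oneway_sums
oneway_sums_of_squares (const double *y, const size_t *group,
                        size_t n, size_t n_groups)
{
  struct oneway_sums ss = { 0.0, 0.0 };
  double grand = 0.0;

  assert (n > 0);
  for (size_t i = 0; i < n; i++)
    grand += y[i];
  grand /= n;                       /* grand mean */

  for (size_t g = 0; g < n_groups; g++)
    {
      double sum = 0.0;
      size_t count = 0;
      for (size_t i = 0; i < n; i++)
        if (group[i] == g)
          {
            sum += y[i];
            count++;
          }
      if (count == 0)
        continue;                   /* an empty group contributes nothing */

      double mean = sum / count;
      ss.between += count * (mean - grand) * (mean - grand);
      for (size_t i = 0; i < n; i++)
        if (group[i] == g)
          ss.within += (y[i] - mean) * (y[i] - mean);
    }
  return ss;
}
```

No matrix is needed here: each group only requires its count, its mean, and its squared deviations.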



Re: Next step in covariance matrix

by John Darrington-3

So it sounds as if the next step is simply to drop one column per categorical variable.
That should be quite simple.

Will that be enough to allow a subset of GLM to be implemented?

J'

On Tue, Oct 27, 2009 at 11:47:23AM -0400, Jason Stover wrote:
     On Tue, Oct 27, 2009 at 06:38:19AM +0000, John Darrington wrote:
     > That will produce a 5x5 matrix. 5 is calculated from n + m - p,  where
     > n is the number of numeric  variables, m is the total number of  categories,
     > and p is the number of categorical variables.  
     
     This is correct.
     

Re: Next step in covariance matrix

by Jason Stover-3

On Tue, Oct 27, 2009 at 06:25:32PM +0000, John Darrington wrote:
> Will that be enough to allow a subset of GLM to be implemented?

Yes, except for the interactions.


Re: Next step in covariance matrix

by John Darrington-3

On Tue, Oct 27, 2009 at 03:20:20PM -0400, Jason Stover wrote:
     On Tue, Oct 27, 2009 at 06:25:32PM +0000, John Darrington wrote:
     > Will that be enough to allow a subset of GLM to be implemented?
     
     Yes, except for the interactions.

But some tasks can be done without interactions, can't they?

J'


Re: Next step in covariance matrix

by Jason Stover-3

On Tue, Oct 27, 2009 at 07:56:18PM +0000, John Darrington wrote:
> On Tue, Oct 27, 2009 at 03:20:20PM -0400, Jason Stover wrote:
>      On Tue, Oct 27, 2009 at 06:25:32PM +0000, John Darrington wrote:
>      > Will that be enough to allow a subset of GLM to be implemented?
>      
>      Yes, except for the interactions.
>
> But some tasks can be done without interactions, can't they?

Many.




Re: Next step in covariance matrix

by John Darrington-3

I've pushed a set of changes which drops the first category of each categorical
variable from the covariance matrix.

Implementation of GLM etc. may require some extra accessor functions, but I've
refrained from trying to predict what these might be.  Let me know what you
require.

J'
