Covariance Matrix

View: New views
13 Messages — Rating Filter:   Alert me  

Covariance Matrix

by John Darrington-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I've been trying to hack up a reliable CORRELATIONS command.
It seemed logical to me to use the existing covariance-matrix.c in
src/math.  Whilst this worked for simple examples, it gave completely
wrong answers (or at least different from the spss examples I could
find) when presented with either a) missing values; or b) non-unity
caseweights.

I've started digging into covariance-matrix.c in order to fix the
problems, but the exercise is slowly evolving into a complete rewrite
of the module, and in doing so, I'm not confident that I'm not
breaking some of the functionality which correlations doesn't use
(notably interactions).

Of course, we could just go with 2 implementations of covariance-matrix
- one which works with interactions, but not with missing values, and
the other works with missing values, but doesn't support interactions,
but I don't think this is a sensible way to proceed.

As things stand right now, I don't think there's anything in master which
actually uses interactions, but I don't know how far Jason has got with
GLM.

So I guess the question is what is the best way to proceed?  As I see it,
the options are:

a) Fork the covariance matrix implementation and have 2 different ones
in master - not a good idea I think.

b) Get a working CM implementation which properly handles missing values
and case-weights.   When this is working, we can add support for interactions
later.

c) Get the CM working properly with all features, including weights, missing
values and interactions, before proceeding further.


My opinion is that we should go for (b), but that's largely motivated by
personal interests, and I don't know how it's going to affect other
developers.

Comments ?


--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.




_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

signature.asc (196 bytes) Download Attachment

Re: Covariance Matrix

by Jason Stover-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Sun, Sep 27, 2009 at 02:44:35PM +0800, John Darrington wrote:
> I've been trying to hack up a reliable CORRELATIONS command.
> It seemed logical to me to use the existing covariance-matrix.c in
> src/math.  Whilst this worked for simple examples, it gave completely
> wrong answers (or at least different from the spss examples I could
> find) when presented with either a) missing values; or b) non-unity
> caseweights.

This is something I've been meaning to fix, but haven't gotten to it.

> I've started digging into covariance-matrix.c in order to fix the
> problems, but the exercise is slowly evolving into a complete rewrite
> of the module, and in doing so, I'm not confident that I'm not
> breaking some of the functionality which correlations doesn't use
> (notably interactions).
>
> Of course, we could just go with 2 implementations of covariance-matrix
> - one which works with interactions, but not with missing values, and
> the other works with missing values, but doesn't support interactions,
> but I don't think this is a sensible way to proceed.
>
> As things stand right now, I don't think there's anything in master which
> actually uses interactions, but I don't know how far Jason has got with
> GLM.

Still working on it, though slowly.

> So I guess the question is what is the best way to proceed?  As I see it,
> the options are:
>
> a) Fork the covariance matrix implementation and have 2 different ones
> in master - not a good idea I think.
>
> b) Get a working CM implementation which properly handles missing values
> and case-weights.   When this is working, we can add support for interactions
> later.
>
> c) Get the CM working properly with all features, including weights, missing
> values and interactions, before proceeding further.
>
>
> My opinion is that we should go for (b), but that's largely motivated by
> personal interests, and I don't know how it's going to affect other
> developers.
>
> Comments ?

I don't want to drop support for interactions entirely, but I think
the way it's done now is ugly and cumbersome. But it wouldn't be so
ugly if done in two data passes, instead of one. So how about this:
Start working (b), but with an algorithm that works with two data
passes. I'll add the interactions to the two-pass algorithm later.

In fact, the easiest way to handle interactions might be to use a
simple two-pass algorithm inside another module written specifically
for computing covariances with interactions. That would keep the
actual computation of the covariance cleaner, and push elsewhere
the ugly computations involving the interactions.

So, the way I'm thinking now is that we should have something like this:

struct interaction_covariance
{
        ...blah blah
        struct covariance_matrix;
};

instead of this (what we have now):

struct covariance_matrix
{
        ...blah blah
        const struct interaction_variable **interactions;
};

Then the code in covariance-matrix.c doesn't need to know anything
about interactions. But it will have to know about categorical
variables.  For a two-pass algorithm, I don't think that will be a big
deal.

Much of the current ugliness in covariance-matrix.c is due to its
single data pass. I thought the benefit of one data pass would offset the
ugliness. But one data pass isn't as stable as two, and the ugly hash
tables and the convoluted code make the current covariance-matrix.c a
wreck.

So I would say: Rewrite away. If possible, please keep support for
categorical variables. But as I said, I don't think that will be hard
to add later for a two-pass algorithm. Ugly hashing will be unnecessary.

I wish I hadn't been so stubborn about having a one-pass algorithm
in the first place.


_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

Re: Covariance Matrix

by Jason Stover-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Given that I like solution b, how should we proceed?  Do you want to
erase most or all of covariance-matrix.c?  Should I do it, or avoid
it?

Either way is OK with me. I think one of the first thing to do is
get rid of any hashing business. No more interactions, too.

I ask because I was the one who wrote all the crap in
covariance-matrix.c in the first place, so on one hand, I kind of feel
responsible, but on the other, writing crap doesn't exactly enhance
one's confidence or motivation.


_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

Re: Covariance Matrix

by John Darrington-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I'm happy to do it, but I'm not sure when I'll be able to find an
uninterrupted period of time which it requires.

So I suggest that if you have other development tasks to do, then do
those, otherwise feel free to rewrite covariance matrix yourself.

I'm not adverse to using hash tables in covariance matrix if they are
needed, but I don't think they are necessary for the simple case (without
interactions/categorical variables).

J'


On Mon, Sep 28, 2009 at 08:10:58PM -0400, Jason Stover wrote:
     Given that I like solution b, how should we proceed?  Do you want to
     erase most or all of covariance-matrix.c?  Should I do it, or avoid
     it?
     
     Either way is OK with me. I think one of the first thing to do is
     get rid of any hashing business. No more interactions, too.
     
     I ask because I was the one who wrote all the crap in
     covariance-matrix.c in the first place, so on one hand, I kind of feel
     responsible, but on the other, writing crap doesn't exactly enhance
     one's confidence or motivation.
     
     
     _______________________________________________
     pspp-dev mailing list
     pspp-dev@...
     http://lists.gnu.org/mailman/listinfo/pspp-dev

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.


_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

signature.asc (196 bytes) Download Attachment

Re: Covariance Matrix

by John Darrington-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I've pushed an implementation of CORRELATIONS, along with a new
covariance matrix implementation in src/math/covariance.[ch]

It actually calculates more than the covariance matrix: it also
calculates the 0th 1st and 2nd moments, since these are necessary
steps to obtain the cov.

It's currently implemented as a single pass algorithm, but it will
be straightforward to change that.  There's no categorical variables,
or interactions, at present.

I'm fairly confident that it's giving the right values and works good
with both missing values and with non-unity caseweights.  To test the
former, I tried the examples at http://www.ats.ucla.edu/stat/Spss/modules/missing.htm
and compared my results with those published. For the latter,  I have
a self-consistency test in tests/command/correlation.sh

I'd be interested in any comments and suggestions on how to proceed
with generalising the implementation to accept categorical variables.
Once that has been done, I think it would be a good idea to re-implement
T-TEST and ONEWAY using the new module - which I hope would vanquish
several thousand lines of rather messy code.

J'


On Mon, Sep 28, 2009 at 08:10:58PM -0400, Jason Stover wrote:
     Given that I like solution b, how should we proceed?  Do you want to
     erase most or all of covariance-matrix.c?  Should I do it, or avoid
     it?
     
     Either way is OK with me. I think one of the first thing to do is
     get rid of any hashing business. No more interactions, too.
     
     I ask because I was the one who wrote all the crap in
     covariance-matrix.c in the first place, so on one hand, I kind of feel
     responsible, but on the other, writing crap doesn't exactly enhance
     one's confidence or motivation.

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.




_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

signature.asc (196 bytes) Download Attachment

Re: Covariance Matrix

by Jason Stover-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Sun, Oct 04, 2009 at 03:21:51PM +0000, John Darrington wrote:
> It's currently implemented as a single pass algorithm, but it will
> be straightforward to change that.  There's no categorical variables,
> or interactions, at present.
>
...
> I'd be interested in any comments and suggestions on how to proceed
> with generalising the implementation to accept categorical variables.

I have a patch below to start this. Right now, it just allocates space. It does
not properly retrieve entries with categorical variables, but it is a start:

diff --git a/src/language/stats/correlations.c b/src/language/stats/correlations.c
index e397dae..65679f3 100644
--- a/src/language/stats/correlations.c
+++ b/src/language/stats/correlations.c
@@ -324,7 +324,7 @@ run_corr (struct casereader *r, const struct corr_opts *opts, const struct corr
   const gsl_matrix *var_matrix,  *samples_matrix, *mean_matrix;
   const gsl_matrix *cov_matrix;
   gsl_matrix *corr_matrix;
-  struct covariance *cov = covariance_create (corr->n_vars_total, corr->vars,
+  struct covariance *cov = covariance_create_pass1 (corr->n_vars_total, corr->vars,
       opts->wv, opts->exclude);
 
   for ( ; (c = casereader_read (r) ); case_unref (c))
diff --git a/src/math/covariance.c b/src/math/covariance.c
index ba0de0b..350cab4 100644
--- a/src/math/covariance.c
+++ b/src/math/covariance.c
@@ -31,6 +31,7 @@ struct covariance
 {
   /* The variables for which the covariance matrix is to be calculated */
   size_t n_vars;
+  size_t dim; /* This value equals n_vars if all variables are continuous. */
   const struct variable **vars;
   
   /* The weight variable (or NULL if none) */
@@ -64,11 +65,29 @@ covariance_moments (const struct covariance *cov, int m)
 }
 
 
+static void
+covariance_create_part2 (struct covariance *cov, enum mv_class exclude)
+{
+  size_t i;
+
+  cov->moments = xmalloc (sizeof *cov->moments * n_MOMENTS);
+
+  for (i = 0; i < n_MOMENTS; ++i)
+    cov->moments[i] = gsl_matrix_calloc (cov->dim, cov->dim);
 
-/* Create a covariance struct */
+  cov->exclude = exclude;
+
+  cov->n_cm = (cov->dim * (cov->dim - 1)  ) / 2;
+
+  cov->cm = xcalloc (sizeof *cov->cm, cov->n_cm);
+}
+  
+/* Create a covariance struct to be computed in one data pass.
+   No categorical variables are allowed.
+*/
 struct covariance *
-covariance_create (size_t n_vars, const struct variable **vars,
-   const struct variable *weight, enum mv_class exclude)
+covariance_create_pass1 (size_t n_vars, const struct variable **vars,
+ const struct variable *weight, enum mv_class exclude)
 {
   size_t i;
   struct covariance *cov = xmalloc (sizeof *cov);
@@ -76,20 +95,61 @@ covariance_create (size_t n_vars, const struct variable **vars,
 
   cov->wv = weight;
   cov->n_vars = n_vars;
+  cov->dim = n_vars; /* Only numeric variables are allowed in a single data pass,
+ so these values are equal.
+      */
 
   for (i = 0; i < n_vars; ++i)
-    cov->vars[i] = vars[i];
+    {
+      assert (var_is_numeric (vars[i]));
+      cov->vars[i] = vars[i];
+    }
 
-  cov->moments = xmalloc (sizeof *cov->moments * n_MOMENTS);
-  
-  for (i = 0; i < n_MOMENTS; ++i)
-    cov->moments[i] = gsl_matrix_calloc (n_vars, n_vars);
+  covariance_create_part2 (cov, exclude);
 
-  cov->exclude = exclude;
+  return cov;
+}
 
-  cov->n_cm = (n_vars * (n_vars - 1)  ) / 2;
+static size_t
+get_dim (size_t n_vars, struct variable **vars)
+{
+  size_t i;
+  size_t dim = 0;
 
-  cov->cm = xcalloc (sizeof *cov->cm, cov->n_cm);
+  for (i = 0; i < n_vars; i++)
+    {
+      if (var_is_numeric (vars[i]))
+ {
+  dim++;
+ }
+      else
+ {
+  dim += cat_get_n_categories (vars[i]);
+ }
+    }
+  return dim;
+}
+/* Create a covariance struct with categorical variables.
+   Call this function after the first data pass.
+*/
+struct covariance *
+covariance_create_pass2 (size_t n_vars, const struct variable **vars,
+ const struct variable *weight, enum mv_class exclude)
+{
+  size_t i;
+  struct covariance *cov = xmalloc (sizeof *cov);
+  cov->vars = xmalloc (sizeof *cov->vars * n_vars);
+
+  cov->wv = weight;
+  cov->n_vars = n_vars;
+
+  for (i = 0; i < n_vars; ++i)
+    cov->vars[i] = vars[i];
+
+  cov->dim = get_dim (n_vars, vars);
+
+  
+  covariance_create_part2 (cov, exclude);
 
   return cov;
 }
diff --git a/src/math/covariance.h b/src/math/covariance.h
index 8b8de88..7a13cd2 100644
--- a/src/math/covariance.h
+++ b/src/math/covariance.h
@@ -27,8 +27,14 @@ struct covariance;
 struct variable;
 struct ccase ;
 
-struct covariance * covariance_create (size_t n_vars, const struct variable **vars,
-       const struct variable *wv, enum mv_class excl);
+struct covariance * covariance_create_pass1 (size_t n_vars,
+     const struct variable **vars,
+     const struct variable *wv,
+     enum mv_class excl);
+struct covariance * covariance_create_pass2 (size_t n_vars,
+     const struct variable **vars,
+     const struct variable *wv,
+     enum mv_class excl);
 
 void covariance_accumulate (struct covariance *, const struct ccase *);
 


_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

Re: Covariance Matrix

by John Darrington-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Thanks for doing this.  I think I see what this is doing.  I have a few comments:

1.  It provokes some compiler warnings (which can be trivially fixed).

2.  I know we've discussed this before, but I'm still not comfortable
    with the implicit assumption that catagorical variables are always
    non-numeric variables.  Although I accept that using numeric
    variables as categorical ones is perhaps inadvisable, it is a legacy
    that we are stuck with.  Both the T-TEST and ONEWAY commands permit
    the categorical variable to be numeric (in fact, I think ONEWAY insists
    upon it).   So I think we have to support numerical categorical
    variables.

3.  Perhaps related to 2.  Our current practice of storing the categorical
    state in the variables' aux parameter is problematic, and the get_dim
    function reminds us of this.  The loop in get_dim should not be necessary,
    since the total number of categories has already been calculated.  I suggest
    that we change the interface and implementation of src/data/category.[ch]
    such that it stores its state in a separate object.  Although it means a
    little more work for the caller, I think the benefits will be worthwhile.


Anyway, I suggest that we create a new branch and check in Jason's patch into it
and work on a categorical covariance implementation within that branch.


J'
   

On Tue, Oct 06, 2009 at 11:37:40AM -0400, Jason Stover wrote:
     On Sun, Oct 04, 2009 at 03:21:51PM +0000, John Darrington wrote:
     > It's currently implemented as a single pass algorithm, but it will
     > be straightforward to change that.  There's no categorical variables,
     > or interactions, at present.
     >
     ...
     > I'd be interested in any comments and suggestions on how to proceed
     > with generalising the implementation to accept categorical variables.
     
     I have a patch below to start this. Right now, it just allocates space. It does
     not properly retrieve entries with categorical variables, but it is a start:
     
     diff --git a/src/language/stats/correlations.c b/src/language/stats/correlations.c
     index e397dae..65679f3 100644
     --- a/src/language/stats/correlations.c
     +++ b/src/language/stats/correlations.c
     @@ -324,7 +324,7 @@ run_corr (struct casereader *r, const struct corr_opts *opts, const struct corr
        const gsl_matrix *var_matrix,  *samples_matrix, *mean_matrix;
        const gsl_matrix *cov_matrix;
        gsl_matrix *corr_matrix;
     -  struct covariance *cov = covariance_create (corr->n_vars_total, corr->vars,
     +  struct covariance *cov = covariance_create_pass1 (corr->n_vars_total, corr->vars,
           opts->wv, opts->exclude);
     
        for ( ; (c = casereader_read (r) ); case_unref (c))
     diff --git a/src/math/covariance.c b/src/math/covariance.c
     index ba0de0b..350cab4 100644
     --- a/src/math/covariance.c
     +++ b/src/math/covariance.c
     @@ -31,6 +31,7 @@ struct covariance
      {
        /* The variables for which the covariance matrix is to be calculated */
        size_t n_vars;
     +  size_t dim; /* This value equals n_vars if all variables are continuous. */
        const struct variable **vars;
       
        /* The weight variable (or NULL if none) */
     @@ -64,11 +65,29 @@ covariance_moments (const struct covariance *cov, int m)
      }
     
     
     +static void
     +covariance_create_part2 (struct covariance *cov, enum mv_class exclude)
     +{
     +  size_t i;
     +
     +  cov->moments = xmalloc (sizeof *cov->moments * n_MOMENTS);
     +
     +  for (i = 0; i < n_MOMENTS; ++i)
     +    cov->moments[i] = gsl_matrix_calloc (cov->dim, cov->dim);
     
     -/* Create a covariance struct */
     +  cov->exclude = exclude;
     +
     +  cov->n_cm = (cov->dim * (cov->dim - 1)  ) / 2;
     +
     +  cov->cm = xcalloc (sizeof *cov->cm, cov->n_cm);
     +}
     +  
     +/* Create a covariance struct to be computed in one data pass.
     +   No categorical variables are allowed.
     +*/
      struct covariance *
     -covariance_create (size_t n_vars, const struct variable **vars,
     -   const struct variable *weight, enum mv_class exclude)
     +covariance_create_pass1 (size_t n_vars, const struct variable **vars,
     + const struct variable *weight, enum mv_class exclude)
      {
        size_t i;
        struct covariance *cov = xmalloc (sizeof *cov);
     @@ -76,20 +95,61 @@ covariance_create (size_t n_vars, const struct variable **vars,
     
        cov->wv = weight;
        cov->n_vars = n_vars;
     +  cov->dim = n_vars; /* Only numeric variables are allowed in a single data pass,
     + so these values are equal.
     +      */
     
        for (i = 0; i < n_vars; ++i)
     -    cov->vars[i] = vars[i];
     +    {
     +      assert (var_is_numeric (vars[i]));
     +      cov->vars[i] = vars[i];
     +    }
     
     -  cov->moments = xmalloc (sizeof *cov->moments * n_MOMENTS);
     -  
     -  for (i = 0; i < n_MOMENTS; ++i)
     -    cov->moments[i] = gsl_matrix_calloc (n_vars, n_vars);
     +  covariance_create_part2 (cov, exclude);
     
     -  cov->exclude = exclude;
     +  return cov;
     +}
     
     -  cov->n_cm = (n_vars * (n_vars - 1)  ) / 2;
     +static size_t
     +get_dim (size_t n_vars, struct variable **vars)
     +{
     +  size_t i;
     +  size_t dim = 0;
     
     -  cov->cm = xcalloc (sizeof *cov->cm, cov->n_cm);
     +  for (i = 0; i < n_vars; i++)
     +    {
     +      if (var_is_numeric (vars[i]))
     + {
     +  dim++;
     + }
     +      else
     + {
     +  dim += cat_get_n_categories (vars[i]);
     + }
     +    }
     +  return dim;
     +}
     +/* Create a covariance struct with categorical variables.
     +   Call this function after the first data pass.
     +*/
     +struct covariance *
     +covariance_create_pass2 (size_t n_vars, const struct variable **vars,
     + const struct variable *weight, enum mv_class exclude)
     +{
     +  size_t i;
     +  struct covariance *cov = xmalloc (sizeof *cov);
     +  cov->vars = xmalloc (sizeof *cov->vars * n_vars);
     +
     +  cov->wv = weight;
     +  cov->n_vars = n_vars;
     +
     +  for (i = 0; i < n_vars; ++i)
     +    cov->vars[i] = vars[i];
     +
     +  cov->dim = get_dim (n_vars, vars);
     +
     +  
     +  covariance_create_part2 (cov, exclude);
     
        return cov;
      }
     diff --git a/src/math/covariance.h b/src/math/covariance.h
     index 8b8de88..7a13cd2 100644
     --- a/src/math/covariance.h
     +++ b/src/math/covariance.h
     @@ -27,8 +27,14 @@ struct covariance;
      struct variable;
      struct ccase ;
     
     -struct covariance * covariance_create (size_t n_vars, const struct variable **vars,
     -       const struct variable *wv, enum mv_class excl);
     +struct covariance * covariance_create_pass1 (size_t n_vars,
     +     const struct variable **vars,
     +     const struct variable *wv,
     +     enum mv_class excl);
     +struct covariance * covariance_create_pass2 (size_t n_vars,
     +     const struct variable **vars,
     +     const struct variable *wv,
     +     enum mv_class excl);
     
      void covariance_accumulate (struct covariance *, const struct ccase *);
     
     
     
     _______________________________________________
     pspp-dev mailing list
     pspp-dev@...
     http://lists.gnu.org/mailman/listinfo/pspp-dev

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.




_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

signature.asc (196 bytes) Download Attachment

Re: Covariance Matrix

by Jason Stover-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue, Oct 06, 2009 at 07:43:06PM +0000, John Darrington wrote:
>     ....  The loop in get_dim should not be necessary,
>     since the total number of categories has already been calculated...

Where?


_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

Re: Covariance Matrix

by John Darrington-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed, Oct 07, 2009 at 08:54:40AM -0400, Jason Stover wrote:

     On Tue, Oct 06, 2009 at 07:43:06PM +0000, John Darrington wrote:
     >     ....  The loop in get_dim should not be necessary,
     >     since the total number of categories has already been calculated...
     
     Where?

I should have written this more precisely:

There exists the *opportunity* to pre-calculate the total number of categories (eg.
in line 141 of category.c) - This would make the loop in get_dim unnecessary and the
function a whole lot simpler.

J'

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.




_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

signature.asc (196 bytes) Download Attachment

Re: Covariance Matrix

by Jason Stover-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed, Oct 07, 2009 at 05:01:37PM +0000, John Darrington wrote:

> On Wed, Oct 07, 2009 at 08:54:40AM -0400, Jason Stover wrote:
>
>      On Tue, Oct 06, 2009 at 07:43:06PM +0000, John Darrington wrote:
>      >     ....  The loop in get_dim should not be necessary,
>      >     since the total number of categories has already been calculated...
>      
>      Where?
>
> I should have written this more precisely:
>
> There exists the *opportunity* to pre-calculate the total number of categories (eg.
> in line 141 of category.c) - This would make the loop in get_dim unnecessary and the
> function a whole lot simpler.

Sort-of: Line 141 of category.c updates the total number of
categories for a single variable: cv->n_categories++;

But that isn't the total number of categories that must be considered
for the covariance matrix. For example, if our dictionary had the
variables v1, v2 and v3, and v1 had n1 categories, v2 had n2
categories and v3 had n3 categories, then our covariance matrix would
need to have (n1 - 1)*(n2 - 1)*(n3 - 1) rows. I guess we could compute
this value in category.c, but what if someone only wanted a covariance
matrix for just v1 and v2? Then the the number of rows necessary would be
(n1 - 1) * (n2 - 1). Or maybe they would want v2 and v3, then we would
need (n2 - 1)*(n3 - 1) rows.

There is no way to know in advance how many rows we would need, unless
we just compute the covariance matrix for all the variables. Which could
be OK, I guess. Did you mean to always compute a covariance matrix
for all the variables in the dictionary?

BTW: Given what I just mentioned about the dimension necessary, that
loop would have to be:

  for (i = 0; i < n_vars; i++)
    {
      if (var_is_numeric (vars[i]))
        {
          dim++;
        }
      else
        {
          dim += cat_get_n_categories (vars[i]) - 1;
        }
    }


...instead of just dim += cat_get_n_categories (vars[i]);



_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

Re: Covariance Matrix

by Jason Stover-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed, Oct 07, 2009 at 03:17:55PM -0400, Jason Stover wrote:
> But that isn't the total number of categories that must be considered
> for the covariance matrix. For example, if our dictionary had the
> variables v1, v2 and v3, and v1 had n1 categories, v2 had n2
> categories and v3 had n3 categories, then our covariance matrix would
> need to have (n1 - 1)*(n2 - 1)*(n3 - 1) rows. I guess we could compute
> this value in category.c, but what if someone only wanted a covariance
> matrix for just v1 and v2? Then the the number of rows necessary would be
> (n1 - 1) * (n2 - 1). Or maybe they would want v2 and v3, then we would
> need (n2 - 1)*(n3 - 1) rows.

I made a mistake here: The dimensions would be (n1-1)+(n2-1)+(n3-1),
or (n1-1)+(n2-1), etc., but not (n1-1)*(n2-1). Oops.


_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

Re: Covariance Matrix

by John Darrington-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed, Oct 07, 2009 at 07:35:10PM -0400, Jason Stover wrote:
     On Wed, Oct 07, 2009 at 03:17:55PM -0400, Jason Stover wrote:
     > But that isn't the total number of categories that must be considered
     > for the covariance matrix. For example, if our dictionary had the
     > variables v1, v2 and v3, and v1 had n1 categories, v2 had n2
     > categories and v3 had n3 categories, then our covariance matrix would
     > need to have (n1 - 1)*(n2 - 1)*(n3 - 1) rows. I guess we could compute
     > this value in category.c, but what if someone only wanted a covariance
     > matrix for just v1 and v2? Then the the number of rows necessary would be
     > (n1 - 1) * (n2 - 1). Or maybe they would want v2 and v3, then we would
     > need (n2 - 1)*(n3 - 1) rows.
     
     I made a mistake here: The dimensions would be (n1-1)+(n2-1)+(n3-1),
     or (n1-1)+(n2-1), etc., but not (n1-1)*(n2-1). Oops.
     
My idea is to calculate the number of categories on each procedure (that needs it).
So instead of the categories data having global scope, they will be created on demand
(in this case inside the covariance matrix constructure).

So the constructor for the covariance matrix would be passed (in addition to its
current parameters) a set of categorical variables.  The total number of categories
can be calculated in the first pass.  Thus the code would become something like:

create_covariance (const struct variable **vars, int n_vars,
  const struct variable **catvars, int n_cat_vars)
{
 ...
 cov->categories = create_categories (catvars, n_cat_vars);
}

void
covariance_first_pass (struct covariance *cov, const struct ccase *c)
{
 categories_update (cov->categories, c);
 ...
}

int
get_dim (const struct covariance *cov)
{
 return cov->n_vars + categories_total_number_of_categories (cov->categories);
}


This involves rewriting some code, but I think it will be worthwhile.  We don't
need the total number of categories until AFTER the first pass of the procedure.

--
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.




_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev

signature.asc (196 bytes) Download Attachment

Re: Covariance Matrix

by Jason Stover-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


OK, I see your point now, and it makes sense.


On Thu, Oct 08, 2009 at 11:34:54AM +0000, John Darrington wrote:

> On Wed, Oct 07, 2009 at 07:35:10PM -0400, Jason Stover wrote:
>      On Wed, Oct 07, 2009 at 03:17:55PM -0400, Jason Stover wrote:
>      > But that isn't the total number of categories that must be considered
>      > for the covariance matrix. For example, if our dictionary had the
>      > variables v1, v2 and v3, and v1 had n1 categories, v2 had n2
>      > categories and v3 had n3 categories, then our covariance matrix would
>      > need to have (n1 - 1)*(n2 - 1)*(n3 - 1) rows. I guess we could compute
>      > this value in category.c, but what if someone only wanted a covariance
>      > matrix for just v1 and v2? Then the the number of rows necessary would be
>      > (n1 - 1) * (n2 - 1). Or maybe they would want v2 and v3, then we would
>      > need (n2 - 1)*(n3 - 1) rows.
>      
>      I made a mistake here: The dimensions would be (n1-1)+(n2-1)+(n3-1),
>      or (n1-1)+(n2-1), etc., but not (n1-1)*(n2-1). Oops.
>      
> My idea is to calculate the number of categories on each procedure (that needs it).
> So instead of the categories data having global scope, they will be created on demand
> (in this case inside the covariance matrix constructure).
>
> So the constructor for the covariance matrix would be passed (in addition to its
> current parameters) a set of categorical variables.  The total number of categories
> can be calculated in the first pass.  Thus the code would become something like:
>
> create_covariance (const struct variable **vars, int n_vars,
>   const struct variable **catvars, int n_cat_vars)
> {
>  ...
>  cov->categories = create_categories (catvars, n_cat_vars);
> }
>
> void
> covariance_first_pass (struct covariance *cov, const struct ccase *c)
> {
>  categories_update (cov->categories, c);
>  ...
> }
>
> int
> get_dim (const struct covariance *cov)
> {
>  return cov->n_vars + categories_total_number_of_categories (cov->categories);
> }
>
>
> This involves rewriting some code, but I think it will be worthwhile.  We don't
> need the total number of categories until AFTER the first pass of the procedure.
>
> --
> PGP Public key ID: 1024D/2DE827B3
> fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
> See http://pgp.mit.edu or any PGP keyserver for public key.
>
>




_______________________________________________
pspp-dev mailing list
pspp-dev@...
http://lists.gnu.org/mailman/listinfo/pspp-dev