>
> OK, it appears that the problem is the df.resid component of the biglm
> object. Everything else is being updated by the update function
> except the df.resid piece, so it is based solely on the initial fit
> and the chunksize used there. The df.resid piece is then used in the
> computation of the AIC and hence the differences that you see. There
> could also be a difference in the p-values and confidence intervals,
> but at those high of numbers, the differences are smaller than can be
> seen at the level of rounding done.
>
>
>
> This appears to be a bug/overlooked piece to me, Thomas is cc'd on
> this so he should be able to fix this.
>
>
>
> A work around in the meantime is to do something like:
>
> > fit$df.resid <- 10000-4
>
>
>
> Then compute the AIC.
>
>
>
> Also as an aside, if you change your seq to: seq(chunksize,
> 10000-chunksize, chunksize) then you won't get the error messages.
>
>
>
> Hope this helps,
>
>
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
>
greg.snow@...
>
> 801.408.8111
>
>
>
> *From:* utkarshsinghal [mailto:
utkarsh.singhal@...]
> *Sent:* Wednesday, July 08, 2009 2:24 AM
> *To:* Greg Snow
> *Cc:* Thomas Lumley; r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>
>
> Hi Greg,
>
> Many thanks for your precious time. Here is a workable code:
>
> set.seed(1)
> xx = data.frame(x1=runif(10000,0,10), x2=runif(10000,0,10),
> x3=runif(10000,0,10))
> xx$y = 3 + xx$x1 + 2*xx$x2 + 3*xx$x3 + rnorm(10000)
>
> chunksize = 500
> fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> for(i in seq(chunksize,10000,chunksize)) fit=update(fit,
> moredata=xx[(i+1):(i+chunksize),])
> AIC(fit)
> [1] 28956.91
>
> And the AIC for other chunksizes:
> chunksize AIC
> 500 28956.91
> 1000 27956.91
> 2000 25956.91
> 2500 24956.91
> 5000 19956.91
> 10000 9956.91
>
> Also I noted that the estimated coefficients are not dependent on
> chunksize and AIC is exactly a linear function of chunksize. So I
> guess it is some problem with the calculation of AIC, may be in some
> degree of freedom or adding some constant somewhere.
>
> And my comments below.
>
>
> Regards
> Utkarsh
>
>
> Greg Snow wrote:
>
> How many rows does xx have?
>
> Let's look at your example for chunksize 10000, you initially fit
> the first 10000 observations, then the seq results in just the
> value 10000 which means that you do the update based on vaues
> 10001 through 20000, if xx only has 10000 rows, then this should
> give at least one error. If xx has 20000 or more rows, then only
> chunksize 10000 will ever see the 20000^th value, the other
> chunksizes will use less of the data.
>
> Understood your point and apologize that you had to spend time going
> into the logic inside for loop. I definitely thought of that but my
> actual problem was the variation in AICs (which I was sure about), so
> to ignore this loop problem (temporarily), I deliberately chose the
> chunksizes such that the number of rows is a multiple of chunksize. I
> knew there is still one extra iteration happening and I checked that
> it was not causing any problem, the "moredata" in the last iteration
> will be all NA's and "update" does nothing in such a case.
>
> For example:
> Let's say chunksize=5000, even though "xx" has only 10000 rows, "fit2"
> and "fit3" below are exactly same.
>
> fit1 = biglm(y~x1+x2+x3, data=xx[1:5000,])
> fit2 = update(fit1, moredata=xx[5001:10000,])
> fit3 = update(fit2, moredata=xx[10001:15000,])
> AIC(fit1); AIC(fit2); AIC(fit3)
> [1] 5018.282
> [1] 19956.91
> [1] 19956.91
>
> (The AIC matches with the table above and no warnings at all)
>
> I checked all these things before sending my first mail and dropped
> the idea of refining the for loop as this will save me a few lines of
> code and also the loop looks good and easy to understand. Moreover it
> is neither taking any extra run time nor producing any warnings or errors.
>
>
>
>
> Also looking at the help for update.biglm, the 2^nd argument is
> "moredata" not "data", so if the code below is the code that you
> actually ran, then the new data chunks are going into the "..."
> argument (and being ignored as that is there for future expansion and
> does nothing yet) and the "moredata" argument is left empty, which
> should also be giving an error. For the code below, the model is only
> being fit to the initial chunk and never updated, so with different
> chunk sizes, there is different amounts of data per model. You can
> check this by doing summary(fit) and looking at the sample size in the
> 2^nd line.
>
> My fault in writing the mail. In the actual code, I gave "update(fit,
> xx[(i+1):(i+chunksize),])" ,i.e., I just passed the new chunk as the
> 2nd argument without mentioning the argument name, which is correct,
> but while writing the mail I added the argument name as "data" without
> checking what it is.
>
>
>
>
> It is easier for us to help you if you provide code that can be run by
> copying and pasting (we don't have xx, so we can't just run the code
> below, you could include a line to randomly generate an xx, or a link
> to where a copy of xx can be downloaded from). It also helps if you
> mention any errors or warnings that you receive in the process of
> running your code.
>
>
>
> Hope this helps,
>
>
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
>
greg.snow@... <mailto:
greg.snow@...>
>
> 801.408.8111
>
>
>
> *From:* utkarshsinghal [mailto:
utkarsh.singhal@...]
> *Sent:* Tuesday, July 07, 2009 12:10 AM
> *To:* Greg Snow
> *Cc:* Thomas Lumley; r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>
>
> Trust me, it is the same total data I am using, even the chunksizes
> are all equal. I also crosschecked by manually creating the chunks and
> updating as in example given on biglm help page.
> > ?biglm
>
>
> Regards
> Utkarsh
>
>
>
> Greg Snow wrote:
>
> Are you sure that you are fitting all the models on the same total
> data? A first glance looks like you may be including more data in
> some of the chunk sizes, or be producing an error that update does not
> know how to deal with.
>
>
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
>
greg.snow@... <mailto:
greg.snow@...>
>
> 801.408.8111
>
>
>
> *From:* utkarshsinghal [mailto:
utkarsh.singhal@...]
> *Sent:* Monday, July 06, 2009 8:58 AM
> *To:* Thomas Lumley; Greg Snow
> *Cc:* r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>
>
>
> The AIC of the biglm models is highly dependent on the size of chunks
> selected (example provided below). This I can somehow expect because
> the model error will increase with the number of chunks.
>
> It will be helpful if you can provide your opinion for comparing
> different models in such cases:
>
> * can I compare two models fitted with different chunksizes, or
> should I always use the same chunk size.
>
> * although I am not going to use AIC at all in my model selection,
> but I think any other model parameters will also vary in the
> same way. Am I right?
> * what would be the ideal chunksize? should it be the maximum
> possible size R and my system's RAM is able to handle?
>
> Any comments will be helpful.
>
>
> *Example of AIC variation with chunksize:*
>
> I ran the following code on my data which has 10000 observations and 3
> independent variables
>
> > chunksize = 500
> > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> > for(i in seq(chunksize,10000,chunksize)) fit=update(fit,
> data=xx[(i+1):(i+chunksize),])
> > AIC(fit)
> [1] 30647.79
>
> Here are the AIC for other chunksizes:
> chunksize AIC
> 500 30647.79
> 1000 29647.79
> 2000 27647.79
> 2500 26647.79
> 5000 21647.79
> 10000 11647.79
>
>
> Regards
> Utkarsh
>
>
>
>
> utkarshsinghal wrote:
>
> Thank you Mr. Lumley and Mr. Greg. That was helpful.
>
> Regards
> Utkarsh
>
>
>
> Thomas Lumley wrote:
>
>
>
> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
>
>
>
>
> Hi Sir,
>
> Thanks for making package available to us. I am facing few problems if
> you can give some hints:
>
> Problem-1:
> The model summary and residual deviance matched (in the mail below)
> but I didn't understand why AIC is still different.
>
>
>
>
> AIC(m1)
>
> [1] 532965
>
>
>
>
> AIC(m1big_longer)
>
> [1] 101442.9
>
>
> That's because AIC.default uses the unnormalized loglikelihood and
> AIC.biglm uses the deviance. Only differences in AIC between models
> are meaningful, not individual values.
>
>
>
>
>
> Problem-2:
> chunksize argument is there in bigglm but not in biglm, consequently,
> udate.biglm is there, but not update.bigglm
> Is my observation correct? If yes, why is this difference?
>
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires
> multiple passes through the data. Fitting a linear model requires only
> a single pass. update.biglm can take a fitted or partially fitted
> biglm and add more data. To do the same thing for a bigglm you would
> need to start over again from the beginning of the data set.
>
> To fit a glm, you need to specify a data source that bigglm() can
> iterate over. You do this with a function that can be called
> repeatedly to return the next chunk of data.
>
> -thomas
>
> Thomas Lumley Assoc. Professor, Biostatistics
>
tlumley@... <mailto:
tlumley@...>
> University of Washington, Seattle
>
>
>
>
>
>
> I don't know why the AIC is different, but remember that there are
> multiple definitions for AIC (generally differing in the constant
> added) and it may just be a difference in the constant, or it could be
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the
> data. This was the first function written for the package and the
> update mechanism was an easy way to write the function (and still
> works well).
>
> The bigglm function came later and the models other than Gaussian
> require multiple passes through the data so instead of the update
> mechanism that biglm uses, bigglm requires the data argument to be a
> function that returns the next chunk of data and can restart to the
> beginning of the dataset.
>
> Also note that the bigglm function usually only does a few passes
> through the data, usually this is good enough, but in some cases you
> may need to increase the number of passes.
>
> Hope this helps,
>
>
>
>
>
>
>