all equal. I also crosschecked by manually creating the chunks and
>
> Are you sure that you are fitting all the models on the same total
> data? A first glance looks like you may be including more data in
> some of the chunk sizes, or be producing an error that update does not
> know how to deal with.
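
One way to check the "more data" suspicion is to print the row indices that the loop quoted further down would request. This is only a sketch in base R: `n`, `starts`, `last_rows`, `chunk_starts` and `chunk_ends` are names made up for illustration, and the corrected boundaries are one possible fix, not necessarily what the original poster intended.

```r
n <- 10000
chunksize <- 500

# Loop values used in the quoted code: seq(chunksize, 10000, chunksize)
starts <- seq(chunksize, n, chunksize)

# Rows requested by the final update() call: (i+1):(i+chunksize) with i = 10000
last_rows <- range((max(starts) + 1):(max(starts) + chunksize))
# These indices lie past the end of a 10000-row data set.

# Indexing a data frame past its last row yields rows filled with NA
toy <- data.frame(a = 1:3)
past_end <- toy[4:5, , drop = FALSE]

# One possible set of boundaries that covers rows 1..n exactly once
chunk_starts <- seq(1, n, by = chunksize)
chunk_ends <- pmin(chunk_starts + chunksize - 1, n)
rows_covered <- sum(chunk_ends - chunk_starts + 1)
```

With boundaries like these, the first chunk would go to `biglm()` and the remaining ones to `update()`, so every chunk size sees the same 10000 rows.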
>
>
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
>
> greg.snow@...
>
> 801.408.8111
>
>
>
> *From:* utkarshsinghal [mailto:utkarsh.singhal@...]
> *Sent:* Monday, July 06, 2009 8:58 AM
> *To:* Thomas Lumley; Greg Snow
> *Cc:* r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>
>
>
> The AIC of the biglm models is highly dependent on the size of chunks
> selected (example provided below). This I can somehow expect because
> the model error will increase with the number of chunks.
>
> It will be helpful if you can provide your opinion for comparing
> different models in such cases:
>
> * can I compare two models fitted with different chunk sizes, or
> should I always use the same chunk size?
>
> * although I am not going to use AIC at all in my model selection,
> I think any other model parameters will also vary in the
> same way. Am I right?
>
> * what would be the ideal chunk size? Should it be the maximum
> size that R and my system's RAM are able to handle?
>
> Any comments will be helpful.
>
>
> *Example of AIC variation with chunksize:*
>
> I ran the following code on my data which has 10000 observations and 3
> independent variables
>
> > chunksize = 500
> > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> > for(i in seq(chunksize,10000,chunksize)) fit=update(fit,
> data=xx[(i+1):(i+chunksize),])
> > AIC(fit)
> [1] 30647.79
>
> Here are the AIC for other chunksizes:
> chunksize AIC
> 500 30647.79
> 1000 29647.79
> 2000 27647.79
> 2500 26647.79
> 5000 21647.79
> 10000 11647.79
>
>
> Regards
> Utkarsh
>
>
>
>
> utkarshsinghal wrote:
>
> Thank you Mr. Lumley and Mr. Greg. That was helpful.
>
> Regards
> Utkarsh
>
>
>
> Thomas Lumley wrote:
>
> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
>
>
> Hi Sir,
>
> Thanks for making the package available to us. I am facing a few
> problems and would be grateful for some hints:
>
> Problem-1:
> The model summary and residual deviance matched (in the mail below)
> but I didn't understand why AIC is still different.
>
>
> AIC(m1)
>
> [1] 532965
>
>
> AIC(m1big_longer)
>
> [1] 101442.9
>
>
> That's because AIC.default uses the unnormalized loglikelihood and
> AIC.biglm uses the deviance. Only differences in AIC between models
> are meaningful, not individual values.
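
The point about differences can be illustrated in base R without biglm. In this sketch, `dev_aic()` is a hypothetical helper that mimics a deviance-based AIC (deviance plus twice the number of coefficients); it is an assumption for illustration and not necessarily the exact formula inside `AIC.biglm`. The data are simulated Poisson counts.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rpois(n, exp(0.3 + 0.5 * x1))

m_small <- glm(y ~ x1, family = poisson())
m_large <- glm(y ~ x1 + x2, family = poisson())

# Hypothetical deviance-based AIC: deviance + 2 * number of coefficients
dev_aic <- function(m) deviance(m) + 2 * length(coef(m))

# The two definitions differ by an offset that depends only on the data
offset_small <- AIC(m_small) - dev_aic(m_small)
offset_large <- AIC(m_large) - dev_aic(m_large)

# so the difference between the two models is the same under either definition
diff_loglik <- AIC(m_large) - AIC(m_small)
diff_dev <- dev_aic(m_large) - dev_aic(m_small)
```

The individual values differ between the two definitions, but `diff_loglik` and `diff_dev` agree, which is why only comparisons between models fitted to the same data carry information.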
>
>
>
> Problem-2:
> The chunksize argument is there in bigglm but not in biglm; consequently,
> update.biglm is there, but not update.bigglm.
> Is my observation correct? If yes, why is this difference?
>
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires
> multiple passes through the data. Fitting a linear model requires only
> a single pass. update.biglm can take a fitted or partially fitted
> biglm and add more data. To do the same thing for a bigglm you would
> need to start over again from the beginning of the data set.
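
Why a single pass suffices for a linear model can be sketched in base R by accumulating the cross-products X'X and X'y chunk by chunk. This is an illustration only: the package's own updating algorithm is not shown here and may well differ (for instance, for numerical stability), and the simulated data and the names `XtX`, `Xty` and `beta_chunked` are assumptions.

```r
set.seed(42)
n <- 1000
xx <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
xx$y <- 1 + 2 * xx$x1 - xx$x2 + rnorm(n)

chunksize <- 250
XtX <- matrix(0, 3, 3)   # running sum of X'X
Xty <- numeric(3)        # running sum of X'y

# Each chunk is visited exactly once and can be discarded afterwards
for (s in seq(1, n, by = chunksize)) {
  rows <- s:min(s + chunksize - 1, n)
  X <- model.matrix(~ x1 + x2, data = xx[rows, ])
  XtX <- XtX + crossprod(X)
  Xty <- Xty + crossprod(X, xx$y[rows])
}

beta_chunked <- drop(solve(XtX, Xty))
beta_full <- coef(lm(y ~ x1 + x2, data = xx))
```

The chunked coefficients match the full-data `lm()` fit. A glm has no such fixed set of sums, because the weights in each iteration depend on the current coefficient estimates, so the data must be read again on every iteration.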
>
> To fit a glm, you need to specify a data source that bigglm() can
> iterate over. You do this with a function that can be called
> repeatedly to return the next chunk of data.
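
A minimal sketch of such a data function, written as a closure over an in-memory data frame. The name `make_chunk_reader` and the toy data are hypothetical; the sketch assumes the convention of a single `reset` argument, where calling with `reset = TRUE` rewinds to the start and a plain call returns the next chunk or `NULL` when the data are exhausted. Check the bigglm help page before relying on the details.

```r
make_chunk_reader <- function(df, chunksize) {
  pos <- 0  # index of the last row already returned
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0          # rewind so the next call starts from row 1
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)  # no more data in this pass
    rows <- (pos + 1):min(pos + chunksize, nrow(df))
    pos <<- max(rows)
    df[rows, , drop = FALSE]
  }
}

toy <- data.frame(x = 1:10, y = (1:10)^2)
reader <- make_chunk_reader(toy, chunksize = 4)

sizes <- c(nrow(reader()), nrow(reader()), nrow(reader()))
done <- is.null(reader())   # a fourth call finds nothing left
reader(reset = TRUE)
first_again <- reader()$x[1]
```

A reader like this would be passed as the `data` argument in place of a data frame. In practice the closure would usually read from a file or database connection instead of holding the whole data frame in memory.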
>
> -thomas
>
> Thomas Lumley Assoc. Professor, Biostatistics
>
> tlumley@... <mailto:tlumley@...>
> University of Washington, Seattle
>
>
>
>
> I don't know why the AIC is different, but remember that there are
> multiple definitions for AIC (generally differing in the constant
> added) and it may just be a difference in the constant, or it could be
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the
> data. This was the first function written for the package and the
> update mechanism was an easy way to write the function (and still
> works well).
>
> The bigglm function came later and the models other than Gaussian
> require multiple passes through the data so instead of the update
> mechanism that biglm uses, bigglm requires the data argument to be a
> function that returns the next chunk of data and can restart to the
> beginning of the dataset.
>
> Also note that the bigglm function by default does only a few passes
> through the data. This is usually good enough, but in some cases you
> may need to increase the number of passes.
>
> Hope this helps,
>
>
>