Re: bigglm() results different from glm()+Another question

by utkarshsinghal

Trust me, I am using the same total data in every case, and the chunks
are all of equal size. I also cross-checked by creating the chunks
manually and updating as in the example on the biglm help page:
 > ?biglm
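
One way to make this kind of cross-check explicit is to list the row
ranges that the loop in the quoted example passes to update(). The
sketch below assumes n = 10000 rows, as in that example, uses only base
R, and its variable names are illustrative. As written, the last
iteration of that loop asks for rows beyond n (indexing a data frame
past its last row returns rows of NA), which is worth ruling out before
comparing AIC values.

```r
# Hypothetical check: row ranges requested by the chunked-update loop
# from the quoted example, assuming a data frame with n = 10000 rows.
n <- 10000
chunksize <- 500

i      <- seq(chunksize, n, chunksize)   # loop index used in the example
ranges <- data.frame(start = i + 1,      # xx[(i+1):(i+chunksize), ]
                     end   = i + chunksize)

tail(ranges, 2)        # the final range runs past row n
sum(ranges$end > n)    # number of update() calls that reach beyond the data

# A bounded variant (for chunksize < n) stops one chunk earlier,
# so every requested range stays within 1..n.
i_ok      <- seq(chunksize, n - chunksize, chunksize)
ranges_ok <- data.frame(start = i_ok + 1, end = i_ok + chunksize)
max(ranges_ok$end) == n
```

With the bounded sequence, the initial fit on rows 1..chunksize plus the
updates cover each row exactly once.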


Regards
Utkarsh



Greg Snow wrote:

>
> Are you sure that you are fitting all the models on the same total
> data?  A first glance looks like you may be including more data in
> some of the chunk sizes, or be producing an error that update does not
> know how to deal with.
>
>  
>
> --
>
> Gregory (Greg) L. Snow Ph.D.
>
> Statistical Data Center
>
> Intermountain Healthcare
>
> greg.snow@...
>
> 801.408.8111
>
>  
>
> *From:* utkarshsinghal [mailto:utkarsh.singhal@...]
> *Sent:* Monday, July 06, 2009 8:58 AM
> *To:* Thomas Lumley; Greg Snow
> *Cc:* r help
> *Subject:* Re: [R] bigglm() results different from glm()+Another question
>
>  
>
>
> The AIC of the biglm models depends strongly on the chunk size
> selected (example provided below). I can partly understand this, since
> I would expect the model error to increase with the number of chunks.
>
> It would be helpful to have your opinion on how to compare
> different models in such cases:
>
>     * Can I compare two models fitted with different chunk sizes, or
>       should I always use the same chunk size?
>
>     * I am not going to use AIC at all in my model selection, but I
>       think any other model parameters will vary in the same way.
>       Am I right?
>
>     * What would be the ideal chunk size? Should it be the maximum
>       size that R and my system's RAM are able to handle?
>
> Any comments will be helpful.
>
>
> *Example of AIC variation with chunksize:*
>
> I ran the following code on my data, which has 10000 observations and 3
> independent variables:
>
> > chunksize = 500
> > fit = biglm(y~x1+x2+x3, data=xx[1:chunksize,])
> > for(i in seq(chunksize,10000,chunksize)) fit=update(fit, data=xx[(i+1):(i+chunksize),])
> > AIC(fit)
> [1] 30647.79
>
> Here are the AIC values for the other chunk sizes:
>
> chunksize    AIC
>       500    30647.79
>      1000    29647.79
>      2000    27647.79
>      2500    26647.79
>      5000    21647.79
>     10000    11647.79
>
>
> Regards
> Utkarsh
>
>
>
>
> utkarshsinghal wrote:
>
> Thank you Mr. Lumley and Mr. Greg. That was helpful.
>
> Regards
> Utkarsh
>
>
>
> Thomas Lumley wrote:
>
> On Fri, 3 Jul 2009, utkarshsinghal wrote:
>
>
>
> Hi Sir,
>
> Thanks for making the package available to us. I am facing a few
> problems and would be grateful for some hints:
>
> Problem-1:
> The model summary and residual deviance matched (in the mail below),
> but I don't understand why the AIC is still different.
>
>
> AIC(m1)
>
> [1] 532965
>
>
> AIC(m1big_longer)
>
> [1] 101442.9
>
>
> That's because AIC.default uses the unnormalized loglikelihood and
> AIC.biglm uses the deviance.  Only differences in AIC between models
> are meaningful, not individual values.
>
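
The point about AIC definitions can be checked on a small simulated
Poisson model using base R only. The data below are made up, and the
"deviance + 2 * number of coefficients" formula is an assumption used
to illustrate a deviance-based AIC, not code taken from biglm. For
families with fixed dispersion, such as Poisson or binomial, the two
definitions differ by a constant that is the same for every model
fitted to the same data.

```r
# Simulated data (hypothetical), used only to illustrate the point.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.5 + 0.3 * x1))

m_small <- glm(y ~ x1,      family = poisson())
m_large <- glm(y ~ x1 + x2, family = poisson())

# Likelihood-based AIC, as returned by AIC() for glm objects.
aic_loglik <- c(small = AIC(m_small), large = AIC(m_large))

# Deviance-based variant (assumed form): deviance + 2 * number of coefficients.
aic_dev <- c(small = deviance(m_small) + 2 * length(coef(m_small)),
             large = deviance(m_large) + 2 * length(coef(m_large)))

# The two definitions differ by the same constant for both models ...
aic_loglik - aic_dev

# ... so the difference between the two models is the same either way.
all.equal(unname(diff(aic_loglik)), unname(diff(aic_dev)))
```

The individual values are far apart, but the comparison between the two
models comes out the same under either definition.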
>
>
> Problem-2:
> The chunksize argument exists in bigglm but not in biglm; consequently,
> update.biglm exists, but update.bigglm does not.
> Is my observation correct? If yes, why is there this difference?
>
>
> Because update.bigglm is impossible.
>
> Fitting a glm requires iteration, which means that it requires
> multiple passes through the data. Fitting a linear model requires only
> a single pass. update.biglm can take a fitted or partially fitted
> biglm and add more data. To do the same thing for a bigglm you would
> need to start over again from the beginning of the data set.
>
> To fit a glm, you need to specify a data source that bigglm() can
> iterate over.  You do this with a function that can be called
> repeatedly to return the next chunk of data.
>
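
A minimal sketch of such a data function, assuming the data already sit
in an in-memory data frame. The helper name make_chunk_reader and the
simulated data are hypothetical. The function takes a single `reset`
argument: with reset = TRUE it goes back to the start of the data, and
otherwise it returns the next chunk, or NULL when the pass is finished.
The bigglm() call is guarded so the block also runs where the biglm
package is not installed.

```r
# Hypothetical helper: wrap an in-memory data frame in a chunk-reading function.
make_chunk_reader <- function(df, chunksize) {
  pos <- 0                                   # rows handed out so far
  function(reset = FALSE) {
    if (reset) {                             # start over for another pass
      pos <<- 0
      return(invisible(NULL))
    }
    if (pos >= nrow(df)) return(NULL)        # no more data in this pass
    rows <- (pos + 1):min(pos + chunksize, nrow(df))
    pos <<- max(rows)
    df[rows, , drop = FALSE]
  }
}

# Simulated data (hypothetical).
set.seed(42)
xx <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
xx$y <- rbinom(1000, 1, plogis(0.2 + 0.5 * xx$x1))

# Exercise the reader by hand: one full pass, recording the chunk sizes.
reader <- make_chunk_reader(xx, chunksize = 300)
sizes <- integer(0)
while (!is.null(chunk <- reader())) sizes <- c(sizes, nrow(chunk))
sizes                      # the last chunk is clipped at nrow(xx)
reader(reset = TRUE)       # rewind before handing the reader to a fitter

# If the biglm package is installed, the reader can be passed as `data`.
if (requireNamespace("biglm", quietly = TRUE)) {
  fit <- biglm::bigglm(y ~ x1 + x2, data = reader, family = binomial())
  print(coef(fit))
}
```

Because the closure clips the last chunk at nrow(df), no pass can
request rows beyond the end of the data.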
>       -thomas
>
> Thomas Lumley            Assoc. Professor, Biostatistics
> tlumley@...
> University of Washington, Seattle
>
>
>
>
> I don't know why the AIC is different, but remember that there are
> multiple definitions for AIC (generally differing in the constant
> added) and it may just be a difference in the constant, or it could be
> that you have not fit the whole dataset (based on your other question).
>
> For an lm model biglm only needs to make a single pass through the
> data.  This was the first function written for the package and the
> update mechanism was an easy way to write the function (and still
> works well).
>
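
Why a single pass is enough for a linear model can be sketched in base
R by accumulating the cross-products X'X and X'y chunk by chunk and
solving once at the end. This normal-equations version is a simpler
stand-in rather than biglm's own algorithm, and the data and variable
names are made up for illustration.

```r
# Simulated data (hypothetical).
set.seed(7)
n  <- 1000
xx <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
xx$y <- 1 + 2 * xx$x1 - xx$x2 + rnorm(n)

chunksize <- 250
XtX <- matrix(0, 4, 4)    # running sum of X'X (intercept + 3 predictors)
Xty <- matrix(0, 4, 1)    # running sum of X'y

# One pass: each chunk is visited once and can then be discarded.
for (start in seq(1, n, by = chunksize)) {
  chunk <- xx[start:min(start + chunksize - 1, n), ]
  X   <- model.matrix(~ x1 + x2 + x3, data = chunk)
  XtX <- XtX + crossprod(X)            # add this chunk's X'X
  Xty <- Xty + crossprod(X, chunk$y)   # add this chunk's X'y
}

beta_chunked <- drop(solve(XtX, Xty))
beta_full    <- coef(lm(y ~ x1 + x2 + x3, data = xx))

# The chunked estimate matches the all-at-once fit up to rounding.
all.equal(unname(beta_chunked), unname(beta_full))
```

A glm has no such shortcut: its working weights depend on the current
coefficient estimates, so every iteration needs another pass over all
chunks.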
> The bigglm function came later. Models other than Gaussian
> require multiple passes through the data, so instead of the update
> mechanism that biglm uses, bigglm requires the data argument to be a
> function that returns the next chunk of data and can restart from the
> beginning of the dataset.
>
> Also note that the bigglm function usually makes only a few passes
> through the data. Usually this is good enough, but in some cases you
> may need to increase the number of passes.
>
> Hope this helps,
>
>  
>

