Binning of integers with hist() function odd results (PR#14046)


by Gerald Guglielmo

Full_Name: Gerald Guglielmo
Version: 2.8.1 (2008-12-22)
OS: OSX Leopard
Submission from: (NULL) (131.225.103.35)


When I attempt to use the hist() function to bin integers the behavior seems
very odd as the bin boundary seems inconsistent across the various bins. For
some bins the upper boundary includes the next integer value, while in others it
does not. If I add 0.1 to every value, then the hist() binning behavior is what
I would normally expect.

> h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
> h1$mids
[1] 1.5 2.5 3.5 4.5
> h1$counts
[1] 3 3 4 5
> h2<-hist(c(1.1,2.1,2.1,3.1,3.1,3.1,4.1,4.1,4.1,4.1,5.1,5.1,5.1,5.1,5.1))
> h2$mids
[1] 1.5 2.5 3.5 4.5 5.5
> h2$counts
[1] 1 2 3 4 5

Naively I would have expected the same distribution of counts in the two cases,
but clearly that is not happening. This is a simple example to illustrate the
behavior; I originally noticed this while binning a large data sample where I
had set the breaks=c(0,24,1).

______________________________________________
R-devel@... mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: Binning of integers with hist() function odd results (P

by Ted.Harding-2

On 06-Nov-09 23:30:12, gug@... wrote:

> Full_Name: Gerald Guglielmo
> Version: 2.8.1 (2008-12-22)
> OS: OSX Leopard
> Submission from: (NULL) (131.225.103.35)
>
> When I attempt to use the hist() function to bin integers the behavior
> seems very odd as the bin boundary seems inconsistent across the various
> bins. For some bins the upper boundary includes the next integer value,
> while in others it does not. If I add 0.1 to every value, then the hist()
> binning behavior is what I would normally expect.
>
>> h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
>> h1$mids
> [1] 1.5 2.5 3.5 4.5
>> h1$counts
> [1] 3 3 4 5
>> h2<-hist(c(1.1,2.1,2.1,3.1,3.1,3.1,4.1,4.1,4.1,4.1,5.1,5.1,5.1,5.1,5.1))
>> h2$mids
> [1] 1.5 2.5 3.5 4.5 5.5
>> h2$counts
> [1] 1 2 3 4 5
>
> Naively I would have expected the same distribution of counts in the
> two cases, but clearly that is not happening. This is a simple example
> to illustrate the behavior, originally I noticed this while binning a
> large data sample where I had set the breaks=c(0,24,1).

This is the correct, intended behaviour. By default, a value which lies
exactly on the boundary between two bins is counted in the bin which
is just below the boundary value. The exception is the bottom-most break:
values on it are counted in the bin just above it.

Hence 1,2,2 all go into the [1,2] bin; 3,3,3 into (2,3];
4,4,4,4 into (3,4]; and 5,5,5,5,5 into (4,5]. Hence the counts 3,3,4,5.

Since you did not set breaks in
  h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5)),
they were set using the default method, and you can see what they are
with

  h1$breaks
  [1] 1 2 3 4 5

When you add 0.1 to each value, you push the values on the boundaries
up into the next bin. Now each value is inside its bin, and not on
any boundary. Hence 1.1 is in (1,2]; 2.1,2.1 in (2,3];
3.1,3.1,3.1 in (3,4]; 4.1,4.1,4.1,4.1 in (4,5]; and
5.1,5.1,5.1,5.1,5.1 in (5,6], giving counts 1,2,3,4,5 as you observe.
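
The assignment of values to bins described above can be checked with
cut(), which labels each interval explicitly. This is only a minimal
sketch: it assumes that cut(), given the same breaks and the same two
options, assigns values the way hist() does for this data, and the names
x and bins are introduced here purely for illustration.

```r
# Data from the original report.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)

# plot = FALSE returns the histogram object without drawing it.
h1 <- hist(x, plot = FALSE)

# cut() with the same breaks and the same two options labels each interval,
# so the closed and open ends become visible.
bins <- table(cut(x, breaks = h1$breaks, include.lowest = TRUE, right = TRUE))
print(bins)
# Interval labels: [1,2] (2,3] (3,4] (4,5], with counts 3 3 4 5.
```

The first label is closed at both ends, which is why the single 1 joins
the two 2s in the first bin.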

The default behaviour described above is defined by the default options

  include.lowest = TRUE, right = TRUE

where:

include.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'
          value will be included in the first (or last, for 'right =
          FALSE') bar.  This will be ignored (with a warning) unless
          'breaks' is a vector.

   right: logical; if 'TRUE', the histogram cells are right-closed
          (left open) intervals.

See '?hist'. You can change this behaviour by changing the options.
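
As a small illustration of changing the options (a sketch only, using the
data from the report and the default breaks 1,2,3,4,5), setting
right = FALSE makes the cells left-closed, and include.lowest = TRUE then
applies to the top-most break rather than the bottom-most one:

```r
# Data from the original report.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)

# Left-closed cells: [1,2) [2,3) [3,4) [4,5]. The last cell is closed at
# both ends because include.lowest = TRUE (the default) now acts on the
# highest break.
h <- hist(x, right = FALSE, plot = FALSE)
h$counts
# 1 2 3 9
```

With the default breaks the 4s and 5s now share the last cell, so getting
one integer per bin also needs a break above the largest value.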

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding@...>
Fax-to-email: +44 (0)870 094 0861
Date: 07-Nov-09                                       Time: 13:57:07
------------------------------ XFMail ------------------------------


Re: Binning of integers with hist() function odd results (P

by Gerald Guglielmo

Hi,
    Thank you for responding quickly and explaining the behavior. Adding
"include.lowest=TRUE,right=FALSE" and manually specifying the breaks
resolved the simple test case. Next I updated my more complex data set,
which already had manually defined breaks, and that resolved my issues
there too. I have now updated all my functions that use hist(), so
hopefully I won't forget this in the future.
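
For the simple test case, the change described above might look like the
following sketch. The message does not say which breaks were used, so the
vector seq(1, 6, by = 1) is an assumption made here for illustration.

```r
# Data from the original report.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)

# Hypothetical manual breaks: one unit-wide bin starting at each integer.
breaks <- seq(1, 6, by = 1)

# Left-closed cells [1,2) [2,3) [3,4) [4,5) [5,6], so each integer falls
# in the bin that starts at it.
h <- hist(x, breaks = breaks, include.lowest = TRUE, right = FALSE,
          plot = FALSE)
h$counts
# 1 2 3 4 5
h$mids
# 1.5 2.5 3.5 4.5 5.5
```

These are the same mids and counts that the report obtained after adding
0.1 to every value.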

--
-Jerry->
gug@...
Pepe's Theory of everything: "Under the right circumstances, things  
happen."

Re: Binning of integers with hist() function odd results (PR#14046)

by Peter Dalgaard

This is as documented. See the include.lowest argument. Annoying, but
not a bug.

(It is arguably a design error that hist() is looking for "pretty"
breakpoints rather than pretty midpoints, or maybe something more
advanced to handle cases where the data are effectively tied to a
lattice. It's been around "forever", though.)
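
One way to get "pretty" midpoints by hand for integer data is to place
the breaks halfway between the integers, so that no value can land on a
boundary. This is only a sketch of that idea, not something hist() does
on its own; the breaks seq(0.5, 5.5, by = 1) are chosen here for the data
in the report.

```r
# Data from the original report.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)

# Breaks at the half-integers put every integer at the centre of its bin,
# so the include.lowest and right options no longer affect the counts.
h <- hist(x, breaks = seq(0.5, 5.5, by = 1), plot = FALSE)
h$mids
# 1 2 3 4 5
h$counts
# 1 2 3 4 5
```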

--
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@...)              FAX: (+45) 35327907
