Advanced Filtering problem

View: New views
3 Messages — Rating Filter:   Alert me  

Advanced Filtering problem

by T.D.Rudolph :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

subdata.csv

I've attached 100 rows of a data frame I am working with.  
I have one factor, id, with 27 levels.  There are two columns of reference data, x and y (UTM coordinates), one column "date" in POSIXct format, and one column "diff" in times format (chron package).

What I am trying to do is as follows:
For each day of the year (date, irrespective of time), select that row for each id which contains the smallest "diff" value, resulting in an output containing in general one value per id per day.

"aggregate" has been suggested but it only produces the columns considered in the function and I need all columns intact.  My data frame contains almost 70,000 entries so manual sorting is not an option.  I know R is robust but my programming skills are elementary.  The only way I know to approach it is to first separate every id, then filter, then recombine somehow.  Is there not a more efficient way for this relatively straight-forward filtering exercise?

Tyler

Re: Advanced Filtering problem

by milton ruser :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Tyler,

I don“t know if I understood well.
Try this. Case not work I try again and again :-)



df<-read.csv("http://www.nabble.com/file/p18018170/subdata.csv")
df.min.diff<-aggregate(df["diff"], df[c("day")], min)
df.subset<-subset(df, paste(df$day, df$diff) %in% paste(df.min.diff$day,
df.min.diff$diff))




On 6/19/08, T.D.Rudolph <prairie.picker@...> wrote:

>
>
> http://www.nabble.com/file/p18018170/subdata.csv subdata.csv
>
> I've attached 100 rows of a data frame I am working with.
> I have one factor, id, with 27 levels.  There are two columns of reference
> data, x and y (UTM coordinates), one column "date" in POSIXct format, and
> one column "diff" in times format (chron package).
>
> What I am trying to do is as follows:
> For each day of the year (date, irrespective of time), select that row for
> each id which contains the smallest "diff" value, resulting in an output
> containing in general one value per id per day.
>
> "aggregate" has been suggested but it only produces the columns considered
> in the function and I need all columns intact.  My data frame contains
> almost 70,000 entries so manual sorting is not an option.  I know R is
> robust but my programming skills are elementary.  The only way I know to
> approach it is to first separate every id, then filter, then recombine
> somehow.  Is there not a more efficient way for this relatively
> straight-forward filtering exercise?
>
> Tyler
> --
> View this message in context:
> http://www.nabble.com/Advanced-Filtering-problem-tp18018170p18018170.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@... mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
        [[alternative HTML version deleted]]


______________________________________________
R-help@... mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Advanced Filtering problem

by hadley :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Tyler,

> I've attached 100 rows of a data frame I am working with.
> I have one factor, id, with 27 levels.  There are two columns of reference
> data, x and y (UTM coordinates), one column "date" in POSIXct format, and
> one column "diff" in times format (chron package).
>
> What I am trying to do is as follows:
> For each day of the year (date, irrespective of time), select that row for
> each id which contains the smallest "diff" value, resulting in an output
> containing in general one value per id per day.

There's a basic strategy that makes solving this type of problem much
easier.  I call it split-apply-combine.  The basic idea is that if you
had a single day, the problem would be pretty easy:

df <- read.csv("http://www.nabble.com/file/p18018170/subdata.csv")

oneday <- subset(df, day == "01-01-05")
oneday[which.min(oneday$diff), ]

# Let's make that into a function to make it easier to apply to all days

mindiff <- function(df) df[which.min(df$diff), ]

# Now we split up the data frame so that we have a data frame for
# each day

pieces <- split(df, df$day)

# And use lapply to apply that function to each piece:

results <- lapply(pieces, mindiff)

# Then finally join all the pieces back together

df_done <- do.call("rbind", results)

So we split the data frame into individual days, picked the correct
row for each day, and then joined all the pieces back together.  This
isn't the most efficient solution, but I think it's easy to see how
each part works, and how you can apply it to new situations.  If you
aren't familiar with lapply or do.call, it's worth having a look at
their examples to get a feel for how they work (although for this case
you can of course just copy and paste them without caring how they
work)

Hadley

--
http://had.co.nz/

______________________________________________
R-help@... mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.