textConnection performance quadratic (PR#14053)


by bill.hopkins

Full_Name: William E. Hopkins
Version: 2.9.0
OS: Windows XP
Submission from: (NULL) (209.244.4.106)


textConnection() has quadratic performance.

A function I wrote was taking an outrageous amount of time to execute on a large
character vector (a small test set was used for functional development). I created
a test harness to execute the function and gather timings (system.time()) for
various dataset sizes (datasets generated by sample() from a very large set). When I
used textConnection() to provide input to read.csv(), the run time was
quadratic in dataset size. However, when I had the function write the character
vector to a temp file and then read the data back in via read.csv(), the run time
was linear.
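
A harness along the lines described above might look like the following sketch.
The original test code was not posted, so the helper names (make_lines,
read_via_connection, read_via_tempfile), the two-column integer data, and the
dataset sizes are all assumptions made for illustration. No timings are shown
because they depend on the R version and the machine.

```r
# Hypothetical harness: time read.csv() over a text connection versus a
# temp file for growing inputs. All names and sizes here are illustrative.
make_lines <- function(n) {
  # n comma-separated lines with two integer fields each
  paste(sample.int(1000L, n, replace = TRUE),
        sample.int(1000L, n, replace = TRUE),
        sep = ",")
}

read_via_connection <- function(lines) {
  con <- textConnection(lines)
  on.exit(close(con))
  read.csv(con, header = FALSE)
}

read_via_tempfile <- function(lines) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))
  writeLines(lines, path)
  read.csv(path, header = FALSE)
}

set.seed(1)
sizes <- c(1000L, 2000L, 4000L)
timings <- t(sapply(sizes, function(n) {
  lines <- make_lines(n)
  c(n = n,
    connection = system.time(read_via_connection(lines))[["elapsed"]],
    tempfile   = system.time(read_via_tempfile(lines))[["elapsed"]])
}))
print(timings)
```

If the connection column roughly quadruples each time n doubles while the
tempfile column roughly doubles, that matches the quadratic versus linear
behaviour reported here.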

The reason for using textConnection() was that the character vector was within
a data frame read in via read.csv(). The character vector (URLs) needed to be
parsed into separate vectors, but no mechanism exists to do that directly (that
I know of). So I used sub() to extract the relevant pieces and put commas
between them, so that read.csv() could read the comma-separated strings directly
into vectors.
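
As a rough illustration of the sub() plus read.csv() approach, assuming a
simple scheme://host/path URL layout: the example URLs, the regular expression,
and the column names below are made up for this sketch and are not the code
from the report.

```r
# Illustrative URLs; the real data was not posted.
urls <- c("http://example.com/index.html",
          "https://stat.ethz.ch/mailman/listinfo/r-devel")

# Rewrite each URL as "scheme,host,path" (assumed layout; the path is optional).
pattern <- "^([a-z]+)://([^/]+)(/.*)?$"
csv_lines <- sub(pattern, "\\1,\\2,\\3", urls)

# Feed the comma-separated strings to read.csv() through a text connection.
con <- textConnection(csv_lines)
parts <- read.csv(con, header = FALSE,
                  col.names = c("scheme", "host", "path"),
                  stringsAsFactors = FALSE)
close(con)

print(parts)
```

This only works when the extracted pieces contain no commas or quote
characters themselves; otherwise read.csv() would split or merge fields
differently.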

______________________________________________
R-devel@... mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: textConnection performance quadratic (PR#14053)

by Gabor Grothendieck

strsplit() can split by separators, and strapply() in the gsubfn package
can split by content.
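
A minimal sketch of both ideas, using made-up URLs. The separator case uses
strsplit(). For the content case, base R's regexec()/regmatches() is swapped in
here so the example runs without extra packages; it is a stand-in for
strapply(), not a demonstration of the gsubfn API.

```r
# Illustrative URLs only.
urls <- c("http://example.com/index.html",
          "https://stat.ethz.ch/mailman/listinfo/r-devel")

# Split by separator: break each URL at the literal "://".
pieces <- strsplit(urls, "://", fixed = TRUE)
scheme <- vapply(pieces, `[`, character(1), 1L)
rest   <- vapply(pieces, `[`, character(1), 2L)

# Split by content: capture the pieces that match a pattern
# (base R stand-in for the strapply() approach).
m <- regmatches(urls, regexec("^([a-z]+)://([^/]+)(/.*)?$", urls))
parts <- do.call(rbind, lapply(m, `[`, -1L))  # drop the full match, keep groups
colnames(parts) <- c("scheme", "host", "path")

print(parts)
```

Either way the pieces end up in separate vectors or matrix columns without a
round trip through read.csv().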

On Mon, Nov 9, 2009 at 6:40 PM,  <bill.hopkins@...> wrote:

> textConnection() has quadratic performance.
> [...]
