|
View:
New views
20 Messages
—
Rating Filter:
Alert me
|
| < Prev | 1 - 2 | Next > |
|
|
Array Performance in 2.8Hello everyone,
we are designing a small scientific bioinformatics software suite and are considering using Scala, at least for the backend. Our first step is to decide, which data structure to use, to store sequence Data. We are working with DNA sequences, that may be several million bases long. Since we need at any timepoint fast and constant access time to any index of the data structure, I considered using Array[Char] as underlying structure and wrap the Array to make it immutable. You may know, that DNA sequences contain only the 4 "letters" A,T,C, and G. So the program needs to filter the input sequence, and remove every character, that doesn't fit. So, using 2.7.6, I chose a very simple way to do it (input is a String) val Sequence = { val seqArray = input.toCharArray seqArray.filter(c => c=='A' || c=='T' || c=='G' || c=='C') } While this is very fast in 2.7.6, In the latest build of 2.8 it takes nearly 3 times more time to parse the String. Looking into the 2.8 API, I noticed that Arrays do not support the filter method anymore, so I assume that the array is implicitly converted into another datatype, that kills performance. Is this correct? What alternative might be suitable? I liked the idea of using Arrays, althogh they are not a purely functional structure, in particular because of the underlying performance. A considerable loss in performance might be a reason for me to drop scala. Could someone comment on the performance of Arrays in 2.8 and possible fast Random Access alternatives? |
|
|
Re: [scala] Array Performance in 2.8On Thursday 15 October 2009 11:26:11 ScalaDan wrote:
> You may know, that DNA sequences contain only the 4 "letters" A,T,C, and G. [Off topic] Why do you use Char? Byte is significantly smaller :-) |
|
|
Re: [scala] Array Performance in 2.8What's wrong with:
scala> val input = "ASCDSFNCAMAFNACIXETGJCGACJCGACCAJCAGCAATATACCCGG" input: java.lang.String = ASCDSFNCAMAFNACIXETGJCGACJCGACCAJCAGCAATATACCCGG scala> val sequence = input.filter(c => c == 'A' || c == 'T' || c == 'G' || c == 'C') sequence: Seq[Char] = ArrayBuffer(A, C, C, A, A, A, C, T, G, C, G, A, C, C, G, A, C, C, A, C, A, G, C, A, A, T, A, T, A, C, C, C, G, G) scala> sequence.toArray res0: Array[Char] = Array(A, C, C, A, A, A, C, T, G, C, G, A, C, C, G, A, C, C, A, C, A, G, C, A, A, T, A, T, A, C, C, C, G, G) On Thu, Oct 15, 2009 at 9:26 AM, ScalaDan <DanoneD@...> wrote:
-- Viktor Klang Blog: klangism.blogspot.com Twttr: viktorklang Wave: viktor.klang@... Code: github.com/viktorklang AKKA Committer - akkasource.org Lift Committer - liftweb.com Atmosphere Committer - atmosphere.dev.java.net SoftPub founder: http://groups.google.com/group/softpub |
|
|
Re: [scala] Array Performance in 2.8Minor observation on filter:
Can use like this: input filter Set('A', 'T', 'G', 'C') Instead of this: input.filter(c => c == 'A' || c == 'T' || c == 'G' || c == 'C') On Thu, Oct 15, 2009 at 1:03 PM, Viktor Klang <viktor.klang@...> wrote: What's wrong with: |
|
|
Re: [scala] Array Performance in 2.8Hi,
On Thu, 2009-10-15 at 00:26 -0700, ScalaDan wrote: > val Sequence = { > val seqArray = input.toCharArray > seqArray.filter(c => c=='A' || c=='T' || c=='G' || c=='C') > } > > While this is very fast in 2.7.6, In the latest build of 2.8 it takes nearly > 3 times more time to parse the String. > Looking into the 2.8 API, I noticed that Arrays do not support the filter > method anymore, so I assume that the array is implicitly converted into > another datatype, that kills performance. Is this correct? Interesting. In 2.7.x, calls like filter would also cause boxing (unless the array was already boxed) since native JVM arrays don't support it. Given that we have very little data, my guess is that filter in the previous boxed representation for monomorphic arrays was faster than the filter in the new boxed representation. It would help if we had some profiling information or a runnable test case to play with. Alternatively, provide your own filter(array: Array[Char]) method and see how that performs. How does it compare to 2.8.x and 2.7.x? By the way, I assume that you're doing these measurements the right way? (e.g. giving the JVM time to warm-up, etc.) > What alternative might be suitable? There are a few, but the ideal one would be for arrays to be really fast out of the box. Let's find the bottlenecks to see if we can fix them. > I liked the idea of using Arrays, > althogh they are not a purely functional structure, in particular because of > the underlying performance. A considerable loss in performance might be a > reason for me to drop scala. Fast arrays are very important and a lot of work was done in 2.8.x to make their performance more predictable. Seems like there is still work to be done on the performance of the boxed representation. Best, Ismael |
|
|
Re: [scala] Array Performance in 2.8On Thu, Oct 15, 2009 at 9:26 AM, ScalaDan <DanoneD@...> wrote: While this is very fast in 2.7.6, In the latest build of 2.8 it takes nearly It looks to me, from my reading of the 2.8.x source, that arrays should not kill performance, but should actually perform better than in 2.7.x. The new array support just wraps the array once (at the call to filter) with wrapper that should perform better because it doesn't use reflection to access array elements. Check out Pref's genericArrayOps method and the ArrayOps.ofByte class, for instance. Bill |
|
|
Re: [scala] Array Performance in 2.8On Thu, 2009-10-15 at 10:36 +0200, Bill Burdick wrote:
> It looks to me, from my reading of the 2.8.x source, that arrays > should not kill performance, but should actually perform better than > in 2.7.x. The new array support just wraps the array once (at the > call to filter) with wrapper that should perform better because it > doesn't use reflection to access array elements. Reflection was not used in 2.7.x for the case outlined either. Ismael |
|
|
Re: [scala] Array Performance in 2.8A comment unrelated to scala: you should really be packing each base
as two consequtive bits, it's crazily wasteful not to. I'm not sure which underlying type would be faster, if anyone (i.e. byte, int, long). java.util.BitSet uses long. You could wrap this on a BitSet, it should be fine. Not only you'll need 4 times less memory than encoding each base as a byte, but you'll have much less cache misses. Dimitris 2009/10/15 ScalaDan <DanoneD@...>: > > Hello everyone, > > we are designing a small scientific bioinformatics software suite and are > considering using Scala, at least for the backend. > Our first step is to decide, which data structure to use, to store sequence > Data. We are working with DNA sequences, that may be several million bases > long. > Since we need at any timepoint fast and constant access time to any index of > the data structure, I considered using Array[Char] as underlying structure > and wrap the Array to make it immutable. > > You may know, that DNA sequences contain only the 4 "letters" A,T,C, and G. > So the program needs to filter the input sequence, and remove every > character, that doesn't fit. > > So, using 2.7.6, I chose a very simple way to do it (input is a String) > > val Sequence = { > val seqArray = input.toCharArray > seqArray.filter(c => c=='A' || c=='T' || c=='G' || c=='C') > } > > While this is very fast in 2.7.6, In the latest build of 2.8 it takes nearly > 3 times more time to parse the String. > Looking into the 2.8 API, I noticed that Arrays do not support the filter > method anymore, so I assume that the array is implicitly converted into > another datatype, that kills performance. Is this correct? > > What alternative might be suitable? I liked the idea of using Arrays, > althogh they are not a purely functional structure, in particular because of > the underlying performance. A considerable loss in performance might be a > reason for me to drop scala. > > Could someone comment on the performance of Arrays in 2.8 and possible fast > Random Access alternatives? > > -- > View this message in context: http://www.nabble.com/Array-Performance-in-2.8-tp25904108p25904108.html > Sent from the Scala mailing list archive at Nabble.com. > > |
|
|
Re: [scala] Array Performance in 2.8Still, the conversion should not kill performance because it only creates one wrapper for the call to filter.
Bill On Thu, Oct 15, 2009 at 10:44 AM, Ismael Juma <mlists@...> wrote:
|
|
|
Re: [scala] Array Performance in 2.8On Thu, Oct 15, 2009 at 9:26 AM, ScalaDan <DanoneD@...> wrote:
This should not happen. I can't reproduce it, what is the size of your array? and what is 'input'? Looking into the 2.8 API, I noticed that Arrays do not support the filter The previous design used array boxing, so in a way it is similar with regards to performance of 'filter'. iulian -- « Je déteste la montagne, ça cache le paysage » Alphonse Allais |
|
|
Re: [scala] Array Performance in 2.8On Oct 15, 2009, at 4:58 AM, Jim Andreou wrote: > A comment unrelated to scala: you should really be packing each base > as two consequtive bits, it's crazily wasteful not to. I'm not sure > which underlying type would be faster, if anyone (i.e. byte, int, > long). java.util.BitSet uses long. You could wrap this on a BitSet, it > should be fine. > > Not only you'll need 4 times less memory than encoding each base as a > byte, but you'll have much less cache misses. Indeed ... you might want to look at how biojava does it to get some inspiration (assuming you can look at GPL'd code(?)). I'm interested in your project, though, especially if it's open source. Please keep us (me :-) informed. Thanks, -steve > > Dimitris > > 2009/10/15 ScalaDan <DanoneD@...>: >> >> Hello everyone, >> >> we are designing a small scientific bioinformatics software suite >> and are >> considering using Scala, at least for the backend. >> Our first step is to decide, which data structure to use, to store >> sequence >> Data. We are working with DNA sequences, that may be several >> million bases >> long. >> Since we need at any timepoint fast and constant access time to any >> index of >> the data structure, I considered using Array[Char] as underlying >> structure >> and wrap the Array to make it immutable. >> >> You may know, that DNA sequences contain only the 4 "letters" >> A,T,C, and G. >> So the program needs to filter the input sequence, and remove every >> character, that doesn't fit. >> >> So, using 2.7.6, I chose a very simple way to do it (input is a >> String) >> >> val Sequence = { >> val seqArray = input.toCharArray >> seqArray.filter(c => c=='A' || c=='T' || c=='G' || c=='C') >> } >> >> While this is very fast in 2.7.6, In the latest build of 2.8 it >> takes nearly >> 3 times more time to parse the String. >> Looking into the 2.8 API, I noticed that Arrays do not support the >> filter >> method anymore, so I assume that the array is implicitly converted >> into >> another datatype, that kills performance. Is this correct? >> >> What alternative might be suitable? I liked the idea of using Arrays, >> althogh they are not a purely functional structure, in particular >> because of >> the underlying performance. A considerable loss in performance >> might be a >> reason for me to drop scala. >> >> Could someone comment on the performance of Arrays in 2.8 and >> possible fast >> Random Access alternatives? >> >> -- >> View this message in context: http://www.nabble.com/Array-Performance-in-2.8-tp25904108p25904108.html >> Sent from the Scala mailing list archive at Nabble.com. >> >> -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact |
|
|
Re: [scala] Array Performance in 2.8Ok, first, thanks for the many replies and the interest. Actually, I'm not a developer, but a biologist ;-) So I just try to do my best in providing the needed information, but I'm not an expert. I'm trying to convince a good friend of mine, who is a java developer, to use Scala for our project, so performance issues in general will be crucial for the language choice. I wrote a small benchmarking program. Interestingly, the differences here are not as dramatic, as in the actual program, but still visible. Here, the input String is generated at random and then used to generate a DNA Object. In the real program, the input String is generated by a Swing TextArea, where the user can paste the Sequence. The input of the TextArea is then used to generate the DNA object. import scala.compat.Platform object Main { val NUCL = Array('A','C','G','T','N','P','W','K','X','Z') def makeTestSequence(n: Int): String ={ val arr = new Array[Char](n) for (i <- 0 to (n-1)){ var nucl = (Math.random*10).asInstanceOf[Byte] arr(i) = NUCL(nucl) } arr.mkString } def main(args: Array[String]):Unit ={ //benchmarking makeTestSequence var start = Platform.currentTime val testSequenceString = makeTestSequence(10000000) var end = Platform.currentTime println("MakeTestSequence: "+ (end-start)+" ms") //warming up var i = 0 while(i < 100){ val TestDNA = new DNA(testSequenceString) i+=1 } //benchmarking Array.filter i = 0 start = Platform.currentTime while(i < 10){ val TestDNA = new DNA(testSequenceString) i+=1 } end = Platform.currentTime println("FilterTest: "+ (end-start)/10 +" ms") } } class DNA(input: String) { val Sequence = { val seqArray = input.toCharArray seqArray.filter(c => c=='A' || c== 'T' || c== 'C' || c == 'G' || c== 'N') } } My results: 2.7.6: MakeTestSequence: 1500ms FilterTest: 275 ms 2.8 MakeTestSequence: 1765 ms FilterTest: 321 ms As I said, in this case the differences are not as big as I observed before, but still measurable. I will provide a more "real-life" situation in the next days. Hope this helps. Dan |
|
|
Re: [scala] Array Performance in 2.8
Thanks for the good advice, I will look into it. Actually 2 bits won't be enough, I still need an 'N' for unknown nucleotides, but maybe 3 bits will do the job. As far as I remember, biojava uses singletons to represent each nucleotide, and constructs an Array with a sequence of references to the singletons. At least this was the design some years ago, I didn't look into it recently. I tried a similar approach in Scala, but its much slower that anything that's based on primitive types. |
|
|
Re: [scala] Array Performance in 2.8Sorry, much slower that my approach. Methods on Strings are just painfully slow, in Java as in Scala. |
|
|
Re: [scala] Array Performance in 2.8Sorry, also much slower than the longer version. Dan |
|
|
Re: [scala] Array Performance in 2.8On Oct 15, 2009, at 4:18 PM, ScalaDan wrote: > Steve Lianoglou-6 wrote: <snip> >> Indeed ... you might want to look at how biojava does it to get some >> inspiration (assuming you can look at GPL'd code(?)). <snip> > Thanks for the good advice, I will look into it. Actually 2 bits > won't be > enough, I still need an 'N' for unknown nucleotides, but maybe 3 > bits will > do the job. > > As far as I remember, biojava uses singletons to represent each > nucleotide, > and constructs an Array with a sequence of references to the > singletons. At > least this was the design some years ago, I didn't look into it > recently. > I tried a similar approach in Scala, but its much slower that anything > that's based on primitive types. Would it make sense to generate one Symbol per nucleotide, then use an array of those for your sequence data? I guess this will still be larger than a packed byte array, but I'm not sure if it'd by you any speed. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact |
|
|
Re: [scala] Array Performance in 2.8, more detailed look at methodsHi, I just wrote a very simple test Suite, just to run a few benchmarks on some methods of Arrays. I am aware of the fact that these are very artificial microbenchmarks that might not resemble an application runtime behavior, but possibly this might still be interesting: Here my results for the following methods applied to a Char Array with a length of 10,000,000 chars: Scala 2.7.6: Filter: 152 ms Map: 285 ms Foreach: 190 ms mkString: 828 ms toList: 1079 ms Scala 2.8: Filter: 243 ms Map: 364 ms Foreach: 160 ms mkString: 1084 ms toList: 1471 ms So, the foreach methods appears to be faster in 2.8, but all other seem slower, especially the filter an the toList method. I am not an expert and actually I don't know how to really design good benchmarks, so feel free to adjust the tests as needed. Here's the code: import scala.compat.Platform object Main { val NUCL = Array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z') def makeTestSequence(n: Int) ={ val arr = new Array[Char](n) var i = 0 while(i< n){ var nucl = (Math.random*26).asInstanceOf[Byte] arr(i) = NUCL(nucl) i+=1 } arr } def main(args: Array[String]):Unit ={ var d='A' val testSequence = makeTestSequence(10000000) //warming up var i = 0 while(i < 10){ val Filter = testSequence.filter(c=> c =='A') val Map = testSequence.map(c=> 'A') val ForEach = testSequence.foreach(c=> d=c ) val MkString = testSequence.mkString val ToList = testSequence.toList i+=1 } //benchmarking Array.filter i = 0 var start = Platform.currentTime while(i < 10){ val Filter = testSequence.filter(c=> c =='A') i+=1 } var end = Platform.currentTime println("Filter: "+ (end-start)/10 +" ms") //benchmarking Array.map i = 0 start = Platform.currentTime while(i < 10){ val Map = testSequence.map(c=> 'A') i+=1 } end = Platform.currentTime println("Map: "+ (end-start)/10 +" ms") //benchmarking Array.foreach i = 0 start = Platform.currentTime while(i < 10){ val ForEach = testSequence.foreach(c=> d=c ) i+=1 } end = Platform.currentTime println("Foreach: "+ (end-start)/10 +" ms") //benchmarking Array.mkString i = 0 start = Platform.currentTime while(i < 10){ val MkString = testSequence.mkString i+=1 } end = Platform.currentTime println("mkString: "+ (end-start)/10 +" ms") //benchmarking Array.toList i = 0 start = Platform.currentTime while(i < 10){ val ToList = testSequence.toList i+=1 } end = Platform.currentTime println("toList: "+ (end-start)/10 +" ms") } } Dan |
|
|
Re: [scala] Array Performance in 2.8On Thu, Oct 15, 2009 at 01:13:41PM -0700, ScalaDan wrote:
> Date: Thu, 15 Oct 2009 13:13:41 -0700 (PDT) > From: ScalaDan <DanoneD@...> > To: scala@... > Subject: Re: [scala] Array Performance in 2.8 > > > > Ismael Juma wrote: > > > > > > > > Interesting. In 2.7.x, calls like filter would also cause boxing (unless > > the array was already boxed) since native JVM arrays don't support it. > > > > Given that we have very little data, my guess is that filter in the > > previous boxed representation for monomorphic arrays was faster than the > > filter in the new boxed representation. > > > > It would help if we had some profiling information or a runnable test > > case to play with. Alternatively, provide your own filter(array: > > Array[Char]) method and see how that performs. How does it compare to > > 2.8.x and 2.7.x? > > > > > > Ok, first, thanks for the many replies and the interest. Actually, I'm not a > developer, but a biologist ;-) So I just try to do my best in providing the > needed information, but I'm not an expert. I'm trying to convince a good > friend of mine, who is a java developer, to use Scala for our project, so > performance issues in general will be crucial for the language choice. > > I wrote a small benchmarking program. Interestingly, the differences here > are not as dramatic, as in the actual program, but still visible. > > Here, the input String is generated at random and then used to generate a > DNA Object. In the real program, the input String is generated by a Swing > TextArea, where the user can paste the Sequence. The input of the TextArea > is then used to generate the DNA object. > > import scala.compat.Platform > > object Main { > val NUCL = Array('A','C','G','T','N','P','W','K','X','Z') > def makeTestSequence(n: Int): String ={ > val arr = new Array[Char](n) > for (i <- 0 to (n-1)){ > var nucl = (Math.random*10).asInstanceOf[Byte] > arr(i) = NUCL(nucl) > } > arr.mkString > } > def main(args: Array[String]):Unit ={ > //benchmarking makeTestSequence > var start = Platform.currentTime > val testSequenceString = makeTestSequence(10000000) > var end = Platform.currentTime > println("MakeTestSequence: "+ (end-start)+" ms") > //warming up > var i = 0 > while(i < 100){ > val TestDNA = new DNA(testSequenceString) > i+=1 > } > //benchmarking Array.filter > i = 0 > start = Platform.currentTime > while(i < 10){ > val TestDNA = new DNA(testSequenceString) > i+=1 > } > end = Platform.currentTime > println("FilterTest: "+ (end-start)/10 +" ms") > } > } > > class DNA(input: String) { > val Sequence = { > val seqArray = input.toCharArray > seqArray.filter(c => c=='A' || c== 'T' || c== 'C' || c == 'G' || c== > 'N') > } > } > > My results: > > 2.7.6: > > MakeTestSequence: 1500ms > FilterTest: 275 ms > > 2.8 > > MakeTestSequence: 1765 ms > FilterTest: 321 ms > > As I said, in this case the differences are not as big as I observed before, > but still measurable. I will provide a more "real-life" situation in the > next days. > > Hope this helps. > > Dan Dan, To paraphrase Andrew Gaydenko's question, why are you using a String? Each Char in a String is 2 bytes, which is 4 times as much memory as you need to represent your 10 choices in each location. Using Array[Byte] is half the size of using String, is a bit simpler because it avoids some conversions, and you can encode the values of your nucleotides (i.e. put them in a particular order, as you did) to simplify and speed up your common operations (in this case, the filter operation in class DNA. If you are concerned about having an immutable data structure, you can wrap the Array[Byte] in another class. Here's my version of your test program using Array[Byte] rather than String: import scala.compat.Platform object Main { val NUCL = Array('A','C','G','T','N','P','W','K','X','Z') def makeTestSequence(n: Int): Array[Byte] ={ val arr = new Array[Byte](n) for (i <- 0 to (n-1)){ var nucl = (Math.random*10).asInstanceOf[Byte] arr(i) = nucl } arr } def main(args: Array[String]):Unit ={ //benchmarking makeTestSequence var start = Platform.currentTime val testSequenceString = makeTestSequence(10000000) var end = Platform.currentTime println("MakeTestSequence: "+ (end-start)+" ms") //warming up var i = 0 while(i < 100){ val TestDNA = new DNA(testSequenceString) i+=1 } //benchmarking Array.filter i = 0 start = Platform.currentTime while(i < 10){ val TestDNA = new DNA(testSequenceString) i+=1 } end = Platform.currentTime println("FilterTest: "+ (end-start)/10 +" ms") } } class DNA(input: Array[Byte]) { val Sequence:Array[Byte] = { input.filter(_<5) } } My results with your original program, compiled and run under Scala 2.8.0.r18925: MakeTestSequence: 2234 ms FilterTest: 400 ms My results with my program, compiled and run under Scala 2.8.0.r18925: MakeTestSequence: 1120 ms FilterTest: 259 ms I also compiled and ran these under 2.7.4, and for both implementations they were about 20% slower under 2.7.4 than under 2.8. If you pack two 4-bit nucleotides into one byte it will take half again as much memory. As Jim Andreou (Dimitris) pointed out, you will likely have fewer cache misses, which might lead to a significant performance improvement over putting only one nucleotide into each byte. And if the memory difference tips you into swapping, that performance difference would well outstrip any of these other effects. This is really dependent on the relative sizes of your problem and the hardware you are running it on, and also what else is running at the same time. But as far as I can see, using Array[Byte] instead of String will always perform better. -- Jim |
|
|
Re: [scala] Array Performance in 2.8Thank you very much for these advices. I really wasn't thinking about bits, just because I thought Chars would be the most natural way to do it. But you are perfectly right. Something like this should work fine.
I'm still wondering why I am having so different results comparing 2.7.x with 2.8 than other people. On my machine most of the methods are slower with 2.8. Apparently your machine is just the other way around. Dan
|
|
|
Re: [scala] Array Performance in 2.8On Sat, Oct 17, 2009 at 6:45 PM, ScalaDan <DanoneD@...> wrote:
> > Thank you very much for these advices. I really wasn't thinking about bits, > just because I thought Chars would be the most natural way to do it. But you > are perfectly right. Something like this should work fine. > > I'm still wondering why I am having so different results comparing 2.7.x > with 2.8 than other people. On my machine most of the methods are slower > with 2.8. Apparently your machine is just the other way around. > It often depends on which VM you are running. What's your java version? On Sun's VMs -server is often,but not always, faster than -client. Also, newer Java 6 VMs do some optimizations not available on older VMs. But in any case no opimization was done yet for 2.8 collections. Hopefully, we'll see much better performance in the future. Cheers -- Martin |
| < Prev | 1 - 2 | Next > |
| Free embeddable forum powered by Nabble | Forum Help |