[scala] New collections (slice/drop) problem

View: New views
4 Messages — Rating Filter:   Alert me  

[scala] New collections (slice/drop) problem

by Ivan Tarasov :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I'm implementing some binary stream data parsing, and I approached it in the following way:

1. Data is read into Array[Byte] (when I have access to all the data I need).
2. The array is passed into a function deserialize(data: Seq[Byte]): Seq[Record], which parses the bytes into records.
3. deserialize function has a tail recursive function deserializeRecord(data: Seq[Byte], records: Queue[Record]): Seq[Record] inside which parses a single Record and enqueues it into the end of records queue (initially empty).
4. deserializeRecord tail-calls itself: deserializeRecord(data.drop(amountOfParsedBytes), records.enqueue(parsedRecord))

The initial data array size is around 256kb, each record is around 6 bytes. The first problem I found was that data.drop call was extremely slow, likely because a new array was created on each drop and the data was copied (I didn't expect that).

Then I started passing a view of my initial array: deserialize(dataBytesArray.view). That increased the performance quite a bit (although it's still bad), but introduced another problem: I started getting StackOverflowError on calls to data in deserializeRecord when a lot of drops have been done.

Example:

Welcome to Scala version 2.8.0.r17998-b20090605125720 (Java HotSpot(TM) Server VM, Java 1.6.0_12).
Type in expressions to have them evaluated.
Type :help for more information.

scala> var a = new Array[Byte](200000).view
a: java.lang.Object with scala.collection.generic.MutableVectorView[Byte,scala.collection.mutable.Vector[Byte]] = MutableVectorTemplate(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
scala> (1 to 10000).foreach { i => a = a.drop(6) }

scala> a
java.lang.StackOverflowError
    at scala.collection.generic.MutableVectorViewTemplate$$anon$2.stringPrefix(MutableVectorViewTemplate.scala:61)
    at scala.collection.generic.TraversableViewTemplate$Sliced$class.stringPrefix(TraversableViewTemplate.scala:54)
    at scala.collection.generic.MutableVectorViewTemplate$$anon$2.stringPrefix(MutableVectorViewTemplate.scala:61)
    at scala.collection....
scala> a.length
java.lang.StackOverflowError
    at scala.collection.generic.SequenceViewTemplate$Sliced$class.length(SequenceViewTemplate.scala:34)
    at scala.collection.generic.MutableVectorViewTemplate$$anon$2.length(MutableVectorViewTemplate.scala:61)
    at scala.collection.generic.SequenceViewTemplate$Sliced$class.length(SequenceViewTemplate.scala:34)
    at scala.collection.generic.MutableVectorViewTemp...
scala> a(1)
java.lang.StackOverflowError
    at scala.collection.generic.MutableVectorViewTemplate$$anon$2.apply(MutableVectorViewTemplate.scala:61)
    at scala.collection.generic.SequenceViewTemplate$Sliced$class.apply(SequenceViewTemplate.scala:36)
    at scala.collection.generic.MutableVectorViewTemplate$$anon$2.apply(MutableVectorViewTemplate.scala:61)
    at scala.collection.generic.SequenceViewTemplat...

As far as I understand, the slice and drop functions do not try to flatten the data access path, forwarding all the calls to the immediate parent slice instead of the original datastructure. This basically means that the slice is unusable and doesn't scale well (at least performance-wise).

Ivan

Re: [scala] New collections (slice/drop) problem

by Landei :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


Ivan Tarasov wrote:
I'm implementing some binary stream data parsing, and I approached it in the
following way:

1. Data is read into Array[Byte] (when I have access to all the data I
need).
2. The array is passed into a function deserialize(data: Seq[Byte]):
Seq[Record], which parses the bytes into records.
3. deserialize function has a tail recursive function
deserializeRecord(data: Seq[Byte], records: Queue[Record]): Seq[Record]
inside which parses a single Record and enqueues it into the end of records
queue (initially empty).
4. deserializeRecord tail-calls itself:
deserializeRecord(data.drop(amountOfParsedBytes),
records.enqueue(parsedRecord))

The initial data array size is around 256kb, each record is around 6 bytes.
The first problem I found was that data.drop call was extremely slow, likely
because a new array was created on each drop and the data was copied (I
didn't expect that).

Then I started passing a view of my initial array:
deserialize(dataBytesArray.view). That increased the performance quite a bit
(although it's still bad), but introduced another problem: I started getting
StackOverflowError on calls to data in deserializeRecord when a lot of drops
have been done.

Example:

Welcome to Scala version 2.8.0.r17998-b20090605125720 (Java HotSpot(TM)
> Server VM, Java 1.6.0_12).
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> var a = new Array[Byte](200000).view
> a: java.lang.Object with
> scala.collection.generic.MutableVectorView[Byte,scala.collection.mutable.Vector[Byte]]
> = MutableVectorTemplate(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
> scala> (1 to 10000).foreach { i => a = a.drop(6) }
>
> scala> a
> java.lang.StackOverflowError
>     at
> scala.collection.generic.MutableVectorViewTemplate$$anon$2.stringPrefix(MutableVectorViewTemplate.scala:61)
>     at
> scala.collection.generic.TraversableViewTemplate$Sliced$class.stringPrefix(TraversableViewTemplate.scala:54)
>     at
> scala.collection.generic.MutableVectorViewTemplate$$anon$2.stringPrefix(MutableVectorViewTemplate.scala:61)
>     at scala.collection....
> scala> a.length
> java.lang.StackOverflowError
>     at
> scala.collection.generic.SequenceViewTemplate$Sliced$class.length(SequenceViewTemplate.scala:34)
>     at
> scala.collection.generic.MutableVectorViewTemplate$$anon$2.length(MutableVectorViewTemplate.scala:61)
>     at
> scala.collection.generic.SequenceViewTemplate$Sliced$class.length(SequenceViewTemplate.scala:34)
>     at scala.collection.generic.MutableVectorViewTemp...
> scala> a(1)
> java.lang.StackOverflowError
>     at
> scala.collection.generic.MutableVectorViewTemplate$$anon$2.apply(MutableVectorViewTemplate.scala:61)
>     at
> scala.collection.generic.SequenceViewTemplate$Sliced$class.apply(SequenceViewTemplate.scala:36)
>     at
> scala.collection.generic.MutableVectorViewTemplate$$anon$2.apply(MutableVectorViewTemplate.scala:61)
>     at scala.collection.generic.SequenceViewTemplat...
>

As far as I understand, the slice and drop functions do not try to flatten
the data access path, forwarding all the calls to the immediate parent slice
instead of the original datastructure. This basically means that the slice
is unusable and doesn't scale well (at least performance-wise).

Ivan
Hi Ivan,

without more code it's hard to give you a hint. An obvious but ugly solution would be to drop nothing from the array but to add a running index. Or can you change the type of the initial data structure to something more "drop friendly" (e.g a queue)? Could you try out Scala 2.8 with the new collections (which might solve the problem)?

Privyet,
Daniel


Re: [scala] New collections (slice/drop) problem

by Ivan Tarasov :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I'm using one of the recent development builds of 2.8.0 (so I'm using the new collections API). Do you suggest upgrading to the latest?

I don't like the idea of "not dropping and passing the index" — that's quite ugly and defeats the purpose of the APIs.

To me it looks like a bug in the Scala collections classes, however I wanted to make sure that I'm not doing anything stupid (or missing some obvious not-ugly solution) before I submit a bug report.

Ivan

On Tue, Jun 16, 2009 at 12:38 AM, Landei <Daniel.Gronau@...> wrote:



Ivan Tarasov wrote:
>
> I'm implementing some binary stream data parsing, and I approached it in
> the
> following way:
>
> 1. Data is read into Array[Byte] (when I have access to all the data I
> need).
> 2. The array is passed into a function deserialize(data: Seq[Byte]):
> Seq[Record], which parses the bytes into records.
> 3. deserialize function has a tail recursive function
> deserializeRecord(data: Seq[Byte], records: Queue[Record]): Seq[Record]
> inside which parses a single Record and enqueues it into the end of
> records
> queue (initially empty).
> 4. deserializeRecord tail-calls itself:
> deserializeRecord(data.drop(amountOfParsedBytes),
> records.enqueue(parsedRecord))
>
> The initial data array size is around 256kb, each record is around 6
> bytes.
> The first problem I found was that data.drop call was extremely slow,
> likely
> because a new array was created on each drop and the data was copied (I
> didn't expect that).
>
> Then I started passing a view of my initial array:
> deserialize(dataBytesArray.view). That increased the performance quite a
> bit
> (although it's still bad), but introduced another problem: I started
> getting
> StackOverflowError on calls to data in deserializeRecord when a lot of
> drops
> have been done.
>
> Example:
>
> Welcome to Scala version 2.8.0.r17998-b20090605125720 (Java HotSpot(TM)
>> Server VM, Java 1.6.0_12).
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala> var a = new Array[Byte](200000).view
>> a: java.lang.Object with
>> scala.collection.generic.MutableVectorView[Byte,scala.collection.mutable.Vector[Byte]]
>> = MutableVectorTemplate(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>> 0,
>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>> 0,
>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>> 0,
>> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
>> scala> (1 to 10000).foreach { i => a = a.drop(6) }
>>
>> scala> a
>> java.lang.StackOverflowError
>>     at
>> scala.collection.generic.MutableVectorViewTemplate$$anon$2.stringPrefix(MutableVectorViewTemplate.scala:61)
>>     at
>> scala.collection.generic.TraversableViewTemplate$Sliced$class.stringPrefix(TraversableViewTemplate.scala:54)
>>     at
>> scala.collection.generic.MutableVectorViewTemplate$$anon$2.stringPrefix(MutableVectorViewTemplate.scala:61)
>>     at scala.collection....
>> scala> a.length
>> java.lang.StackOverflowError
>>     at
>> scala.collection.generic.SequenceViewTemplate$Sliced$class.length(SequenceViewTemplate.scala:34)
>>     at
>> scala.collection.generic.MutableVectorViewTemplate$$anon$2.length(MutableVectorViewTemplate.scala:61)
>>     at
>> scala.collection.generic.SequenceViewTemplate$Sliced$class.length(SequenceViewTemplate.scala:34)
>>     at scala.collection.generic.MutableVectorViewTemp...
>> scala> a(1)
>> java.lang.StackOverflowError
>>     at
>> scala.collection.generic.MutableVectorViewTemplate$$anon$2.apply(MutableVectorViewTemplate.scala:61)
>>     at
>> scala.collection.generic.SequenceViewTemplate$Sliced$class.apply(SequenceViewTemplate.scala:36)
>>     at
>> scala.collection.generic.MutableVectorViewTemplate$$anon$2.apply(MutableVectorViewTemplate.scala:61)
>>     at scala.collection.generic.SequenceViewTemplat...
>>
>
> As far as I understand, the slice and drop functions do not try to flatten
> the data access path, forwarding all the calls to the immediate parent
> slice
> instead of the original datastructure. This basically means that the slice
> is unusable and doesn't scale well (at least performance-wise).
>
> Ivan
>
>

Hi Ivan,

without more code it's hard to give you a hint. An obvious but ugly solution
would be to drop nothing from the array but to add a running index. Or can
you change the type of the initial data structure to something more "drop
friendly" (e.g a queue)? Could you try out Scala 2.8 with the new
collections (which might solve the problem)?

Privyet,
Daniel


--
View this message in context: http://www.nabble.com/-scala--New-collections-%28slice-drop%29-problem-tp24040606p24049217.html
Sent from the Scala mailing list archive at Nabble.com.



Re: [scala] New collections (slice/drop) problem

by Martin Odersky :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Ivan,

What you describe is a bug. View slices are supposed to compactify
their start and end indices, so that you should not see a stack
overflow. I'll look into it.

Cheers

 -- Martin

On Tue, Jun 23, 2009 at 8:29 PM, Ivan Tarasov<ivan.tarasov@...> wrote:

> I'm using one of the recent development builds of 2.8.0 (so I'm using the
> new collections API). Do you suggest upgrading to the latest?
>
> I don't like the idea of "not dropping and passing the index" — that's quite
> ugly and defeats the purpose of the APIs.
>
> To me it looks like a bug in the Scala collections classes, however I wanted
> to make sure that I'm not doing anything stupid (or missing some obvious
> not-ugly solution) before I submit a bug report.
>
> Ivan
>
> On Tue, Jun 16, 2009 at 12:38 AM, Landei <Daniel.Gronau@...> wrote:
>>
>>
>>
>> Ivan Tarasov wrote:
>> >
>> > I'm implementing some binary stream data parsing, and I approached it in
>> > the
>> > following way:
>> >
>> > 1. Data is read into Array[Byte] (when I have access to all the data I
>> > need).
>> > 2. The array is passed into a function deserialize(data: Seq[Byte]):
>> > Seq[Record], which parses the bytes into records.
>> > 3. deserialize function has a tail recursive function
>> > deserializeRecord(data: Seq[Byte], records: Queue[Record]): Seq[Record]
>> > inside which parses a single Record and enqueues it into the end of
>> > records
>> > queue (initially empty).
>> > 4. deserializeRecord tail-calls itself:
>> > deserializeRecord(data.drop(amountOfParsedBytes),
>> > records.enqueue(parsedRecord))
>> >
>> > The initial data array size is around 256kb, each record is around 6
>> > bytes.
>> > The first problem I found was that data.drop call was extremely slow,
>> > likely
>> > because a new array was created on each drop and the data was copied (I
>> > didn't expect that).
>> >
>> > Then I started passing a view of my initial array:
>> > deserialize(dataBytesArray.view). That increased the performance quite a
>> > bit
>> > (although it's still bad), but introduced another problem: I started
>> > getting
>> > StackOverflowError on calls to data in deserializeRecord when a lot of
>> > drops
>> > have been done.
>> >
>> > Example:
>> >
>> > Welcome to Scala version 2.8.0.r17998-b20090605125720 (Java HotSpot(TM)
>> >> Server VM, Java 1.6.0_12).
>> >> Type in expressions to have them evaluated.
>> >> Type :help for more information.
>> >>
>> >> scala> var a = new Array[Byte](200000).view
>> >> a: java.lang.Object with
>> >>
>> >> scala.collection.generic.MutableVectorView[Byte,scala.collection.mutable.Vector[Byte]]
>> >> = MutableVectorTemplate(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>> >> 0,
>> >> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>> >> 0,
>> >> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>> >> 0,
>> >> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
>> >> scala> (1 to 10000).foreach { i => a = a.drop(6) }
>> >>
>> >> scala> a
>> >> java.lang.StackOverflowError
>> >>     at
>> >>
>> >> scala.collection.generic.MutableVectorViewTemplate$$anon$2.stringPrefix(MutableVectorViewTemplate.scala:61)
>> >>     at
>> >>
>> >> scala.collection.generic.TraversableViewTemplate$Sliced$class.stringPrefix(TraversableViewTemplate.scala:54)
>> >>     at
>> >>
>> >> scala.collection.generic.MutableVectorViewTemplate$$anon$2.stringPrefix(MutableVectorViewTemplate.scala:61)
>> >>     at scala.collection....
>> >> scala> a.length
>> >> java.lang.StackOverflowError
>> >>     at
>> >>
>> >> scala.collection.generic.SequenceViewTemplate$Sliced$class.length(SequenceViewTemplate.scala:34)
>> >>     at
>> >>
>> >> scala.collection.generic.MutableVectorViewTemplate$$anon$2.length(MutableVectorViewTemplate.scala:61)
>> >>     at
>> >>
>> >> scala.collection.generic.SequenceViewTemplate$Sliced$class.length(SequenceViewTemplate.scala:34)
>> >>     at scala.collection.generic.MutableVectorViewTemp...
>> >> scala> a(1)
>> >> java.lang.StackOverflowError
>> >>     at
>> >>
>> >> scala.collection.generic.MutableVectorViewTemplate$$anon$2.apply(MutableVectorViewTemplate.scala:61)
>> >>     at
>> >>
>> >> scala.collection.generic.SequenceViewTemplate$Sliced$class.apply(SequenceViewTemplate.scala:36)
>> >>     at
>> >>
>> >> scala.collection.generic.MutableVectorViewTemplate$$anon$2.apply(MutableVectorViewTemplate.scala:61)
>> >>     at scala.collection.generic.SequenceViewTemplat...
>> >>
>> >
>> > As far as I understand, the slice and drop functions do not try to
>> > flatten
>> > the data access path, forwarding all the calls to the immediate parent
>> > slice
>> > instead of the original datastructure. This basically means that the
>> > slice
>> > is unusable and doesn't scale well (at least performance-wise).
>> >
>> > Ivan
>> >
>> >
>>
>> Hi Ivan,
>>
>> without more code it's hard to give you a hint. An obvious but ugly
>> solution
>> would be to drop nothing from the array but to add a running index. Or can
>> you change the type of the initial data structure to something more "drop
>> friendly" (e.g a queue)? Could you try out Scala 2.8 with the new
>> collections (which might solve the problem)?
>>
>> Privyet,
>> Daniel
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/-scala--New-collections-%28slice-drop%29-problem-tp24040606p24049217.html
>> Sent from the Scala mailing list archive at Nabble.com.
>>
>
>