lazy ByteStrings: toChunks


Ross Paterson

23 Jan 2007

7:12 p.m.

toChunks exposes the implementation, and so shouldn't be in the public interface, should it? There could be a function from lazy to ordinary ByteStrings (B.concat . toChunks), though.


dons＠cse.unsw.edu.au

23 Jan

8:32 p.m.

ross:

...

toChunks exposes the implementation, and so shouldn't be in the public interface, should it? There could be a function from lazy to ordinary ByteStrings (B.concat . toChunks), though.

That seems reasonable. All uses I've ever had for toChunks involve also concat'ing them. The idea originally was to avoid unnecessary strictness. -- Don

Duncan Coutts

9:02 p.m.

On Wed, 2007-01-24 at 07:32 +1100, Donald Bruce Stewart wrote:

...

...
toChunks exposes the implementation, and so shouldn't be in the public interface, should it? There could be a function from lazy to ordinary ByteStrings (B.concat . toChunks), though.

No, I don't think it exposes the implementation that much. In particular, we could change the internal representation from a list of chunks to a tree of chunks or an element-strict list of chunks without breaking the toChunks function. In the worst case of a representation change, toChunks could still return a single massive chunk.

...

That seems reasonable. All uses I've ever had for toChunks involve also concat'ing them.

This is indeed the most common use. However, libraries such as zlib/bzlib compression, charset conversion, encryption and (de)serialisation that need to work on contiguous chunks of memory need to be able to get at the chunks. The only other thing they can do is import the internal module and get at the LPS constructor, which is more evil and will break if we change the underlying representation (and I do intend to experiment with making the lazy byte string rep use element-strict lists to remove one indirection).

...

The idea originally was to avoid unnecessary strictness.

The other reason that we decided to include toChunks, and decided not to include a function that converts to a strict byte string, is that I didn't want to hide the expense of the operation from the user. toChunks is O(1) and should remain O(1) with any reasonable representation change that I can think of. toStrict, however, is O(n) and has to force the whole stream into memory and copy it. It's expensive. If the user writes (B.concat . toChunks) then this expense is explicit, since they already know that B.concat incurs that expense. So I vote for the status quo. Duncan
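The cost argument above can be illustrated with a small sketch. The names toStrictViaChunks and totalLength are hypothetical and used only for illustration; the first spells out the (B.concat . toChunks) composition from the thread, the second shows a chunk-wise traversal that never builds one large strict string.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L

-- Hypothetical name for the composition discussed in the thread.
-- B.concat has to copy every chunk into one contiguous buffer, so the
-- whole lazy string is forced into memory at once.
toStrictViaChunks :: L.ByteString -> B.ByteString
toStrictViaChunks = B.concat . L.toChunks

-- Hypothetical example of a consumer that only needs the chunks:
-- each chunk is inspected in turn and nothing is copied.
totalLength :: L.ByteString -> Int
totalLength = sum . map B.length . L.toChunks
```

Written this way, the copy is visible at the call site as B.concat, which is the point being made about keeping the expense explicit.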

Ross Paterson

24 Jan

11:58 a.m.

On Tue, Jan 23, 2007 at 09:02:31PM +0000, Duncan Coutts wrote:

...

On Wed, 2007-01-24 at 07:32 +1100, Donald Bruce Stewart wrote:

...
ross:

...
toChunks exposes the implementation, and so shouldn't be in the public interface, should it? There could be a function from lazy to ordinary ByteStrings (B.concat . toChunks), though.

No, I don't think it exposes the implementation that much. In particular, we could change the internal representation from a list of chunks to a tree of chunks or an element-strict list of chunks without breaking the toChunks function.

OK, but it does break the abstraction (list of bytes), while B.concat . toChunks doesn't. I think that qualifies it as an internal interface.

...

...
That seems reasonable. All uses I've ever had for toChunks involve also concat'ing them.

This is indeed the most common use. However, libraries such as zlib/bzlib compression, charset conversion, encryption and (de)serialisation that need to work on contiguous chunks of memory need to be able to get at the chunks.

Can you point at some examples? I had a quick look, but couldn't find any uses of toChunks not preceded by concat. I would expect most of those examples to operate on substrings that might span chunk boundaries.

...

The only other thing they can do is import the internal module and get at the LPS constructor, which is more evil and will break if we change the underlying representation (and I do intend to experiment with making the lazy byte string rep use element-strict lists to remove one indirection).

The internal module could offer different levels of interface, though.

Duncan Coutts

12:13 p.m.

On Wed, 2007-01-24 at 11:58 +0000, Ross Paterson wrote:

...

On Tue, Jan 23, 2007 at 09:02:31PM +0000, Duncan Coutts wrote:

...
On Wed, 2007-01-24 at 07:32 +1100, Donald Bruce Stewart wrote:

...
ross:

...
toChunks exposes the implementation, and so shouldn't be in the public interface, should it? There could be a function from lazy to ordinary ByteStrings (B.concat . toChunks), though.

No, I don't think it exposes the implementation that much. In particular, we could change the internal representation from a list of chunks to a tree of chunks or an element-strict list of chunks without breaking the toChunks function.

OK, but it does break the abstraction (list of bytes), while B.concat . toChunks doesn't. I think that qualifies it as an internal interface.

How about the other way around? fromChunks would be terrible if it had to go via a single strict chunk. It'd lose all the laziness and force everything into memory. Given a toStrict function, toChunks could be implemented anyway, just not as efficiently: you can unfold the lazy string with take/drop and apply toStrict to each piece. So you can implement one in terms of the other, and I'm claiming that toChunks is the better primitive.
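A minimal sketch of that construction, under stated assumptions: toStrictStandIn is a hypothetical stand-in for the proposed lazy-to-strict function (written with pack/unpack so that it does not itself rely on toChunks), toChunksVia is a hypothetical name, and the piece size n is an arbitrary parameter not mentioned in the thread. L.splitAt is used as the combined form of take and drop.

```haskell
import Data.Int (Int64)
import Data.List (unfoldr)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L

-- Hypothetical stand-in for a primitive toStrict; deliberately avoids
-- toChunks so the definition below is not circular.
toStrictStandIn :: L.ByteString -> B.ByteString
toStrictStandIn = B.pack . L.unpack

-- Rebuild a list of strict pieces using only take/drop (here splitAt)
-- and the strict conversion. The piece size n must be positive,
-- otherwise the unfold never makes progress.
toChunksVia :: Int64 -> L.ByteString -> [B.ByteString]
toChunksVia n = unfoldr step
  where
    step s
      | L.null s  = Nothing
      | otherwise =
          let (piece, rest) = L.splitAt n s
          in Just (toStrictStandIn piece, rest)
```

Each piece is copied once, which is the inefficiency the message refers to; the pieces also no longer line up with the original chunk boundaries.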

...

...
...
That seems reasonable. All uses I've ever had for toChunks involve also concat'ing them.

This is indeed the most common use. However, libraries such as zlib/bzlib compression, charset conversion, encryption and (de)serialisation that need to work on contiguous chunks of memory need to be able to get at the chunks.

Can you point at some examples? I had a quick look, but couldn't find any uses of toChunks not preceded by concat.

Well, at the moment most of these just import the internal Base module and the LPS constructor, but that doesn't mean they should!

...

I would expect most of those examples to operate on substrings that might span chunk boundaries.

No, because they need access to contiguous chunks of memory. Some need it because they call out to C libraries that expect contiguous chunks (e.g. zlib, bzlib, iconv); others, like the binary deserialisation, want contiguous chunks for efficiency, so that it becomes possible to common up bounds checks rather than doing a bounds check for each byte, as head/tail must do.
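As a rough illustration of why contiguous chunks matter for code that works on raw buffers, the sketch below hands each chunk to a buffer-based routine as a pointer and a length. sumBuffer is a hypothetical stand-in for a C function (it is plain Haskell here so the example stays self-contained), and sumLazy is likewise a made-up name; unsafeUseAsCStringLen comes from Data.ByteString.Unsafe.

```haskell
import Control.Monad (foldM)
import Data.Word (Word8)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peekByteOff)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)

-- Hypothetical stand-in for a C routine that expects one contiguous
-- buffer: sums the bytes in [p, p + len).
sumBuffer :: Ptr Word8 -> Int -> IO Int
sumBuffer p len = foldM step 0 [0 .. len - 1]
  where
    step acc i = do
      w <- peekByteOff p i :: IO Word8
      pure $! acc + fromIntegral w

-- Feed the lazy string to the buffer routine one chunk at a time.
-- The length is known once per chunk, so no per-byte check on the
-- lazy structure is needed.
sumLazy :: L.ByteString -> IO Int
sumLazy = foldM perChunk 0 . L.toChunks
  where
    perChunk :: Int -> B.ByteString -> IO Int
    perChunk acc chunk =
      unsafeUseAsCStringLen chunk $ \(ptr, len) -> do
        s <- sumBuffer (castPtr ptr) len
        pure $! acc + s
```

The buffer must not be modified or retained after the callback returns, which is the usual caveat for the unsafe variants.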

...

...
The only other thing they can do is import the internal module and get at the LPS constructor, which is more evil and will break if we change the underlying representation (and I do intend to experiment with making the lazy byte string rep use element-strict lists to remove one indirection).

The internal module could offer different levels of interface, though.

Yeah it could. Duncan

Einar Karttunen

8:13 p.m.

On 24.01 11:58, Ross Paterson wrote:

...

Can you point at some examples? I had a quick look, but couldn't find any uses of toChunks not preceded by concat. I would expect most of those examples to operate on substrings that might span chunk boundaries.

I am using toChunks for encryption and it is a very valuable tool. Basically the task is "process each chunk of this bytestring, which might be >4 GB in total". - Einar Karttunen
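A hedged sketch of that "process each chunk" pattern: mapChunks and xorWith are hypothetical names, and the XOR transform is only a toy placeholder, not a real cipher and not the code referred to in the message. Because toChunks and fromChunks are both lazy, only the chunks currently being consumed need to be in memory.

```haskell
import Data.Bits (xor)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L

-- Apply a strict, per-chunk transformation across a lazy ByteString
-- without ever concatenating the chunks.
mapChunks :: (B.ByteString -> B.ByteString) -> L.ByteString -> L.ByteString
mapChunks f = L.fromChunks . map f . L.toChunks

-- Toy placeholder for an encryption step: XOR every byte with a key.
xorWith :: Word8 -> L.ByteString -> L.ByteString
xorWith key = mapChunks (B.map (xor key))
```

A real block cipher would also have to carry state across chunk boundaries, which this sketch does not attempt.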
