memory usage of data types in the time package

Hi Ashley (and other people interested in the time lib),

I wonder if we could look more closely at the in-memory representations of the data types in the time library to see if we could make them a bit smaller.

There are various sorts of programs that deal with a large quantity of data, and often that includes some date/time data. It's a great shame that programs dealing with lots of time data might have to avoid using the time package and revert to things like newtype Seconds = Seconds Int32, simply because such types take less space and can be unpacked into other data structures.

Looking at the representations in the time package, I think there are a number of places where the size could be reduced without affecting the behaviour or significantly affecting the range of values that can be represented. For example:

data TimeOfDay = TimeOfDay { todHour :: Int, todMin :: Int, todSec :: Pico }

This uses 40 bytes (or 80 bytes on 64-bit), but it could be brought down to just 20 bytes (or 24 on 64-bit). We could use strict fields here for the Ints and build the package with -funbox-strict-fields; that would save 4 words. The most expensive field here is Pico, which uses an Integer underneath and thus cannot be unboxed. We could save a further two words if Pico were based on Int64 instead of Integer. In principle the hour and minute could be Int16, and perhaps in future GHC might be able to pack them into a single word inside the TimeOfDay record. If we find we cannot change Pico (since it's in the base package), then we could change the time package to use some equivalent fixed-point type.

newtype Day = ModifiedJulianDay { toModifiedJulianDay :: Integer }

This could use Int64 without significantly affecting the range. It would still let us represent dates trillions of years into the future or past. (Using Int32 would allow dates a mere few million years into the future.) Again, the point of using Int64 is to save an indirection and to allow the type to be unboxed into other constructors, such as:

data UTCTime = UTCTime { utctDay :: Day, utctDayTime :: DiffTime }

This takes 7 words, 28 or 56 bytes. It could be reduced to 20 or 24 bytes. This one is especially useful to make smaller, since it is used as a timestamp in many applications, so there tend to be a lot of them.

newtype DiffTime = MkDiffTime Pico

Again, a Pico based on Int64 rather than Integer would save space, save an indirection and allow further unboxing.

In general, do any of the date/time record types need to have lazy fields? My guess is that they do not.

Duncan
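For concreteness, here is a sketch of the kind of compact representations this is aiming at. Pico64 is a hypothetical Int64-backed stand-in for Pico, and these definitions only shadow the real types for illustration; they are not a worked-out proposal:

-- A sketch only: compact variants of the time types, assuming a
-- hypothetical Int64-backed fixed-point type in place of Pico.
module CompactTimeSketch where

import Data.Int (Int64)

-- Hypothetical replacement for Pico: picoseconds held directly in an Int64.
newtype Pico64 = Pico64 Int64

data TimeOfDay = TimeOfDay
  { todHour :: {-# UNPACK #-} !Int
  , todMin  :: {-# UNPACK #-} !Int
  , todSec  :: {-# UNPACK #-} !Pico64  -- unboxes, since the payload is a single Int64
  }

-- Int64 instead of Integer: no indirection, and the field can be unpacked.
newtype Day = ModifiedJulianDay { toModifiedJulianDay :: Int64 }

newtype DiffTime = MkDiffTime Pico64

data UTCTime = UTCTime
  { utctDay     :: {-# UNPACK #-} !Day
  , utctDayTime :: {-# UNPACK #-} !DiffTime
  }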

Duncan Coutts wrote:
There are various sorts of programs that deal with a large quantity of data and often that includes some date/time data. It's a great shame that programs dealing with lots of time data might have to avoid using the time package and revert to things like newtype Seconds = Seconds Int32 simply because they take less space, and can be unpacked into other data structures.
That is true. But on the other hand, my guess is that most applications don't need those optimizations.

One of the most common complaints heard about Haskell is that Data.Time is so much more difficult to use than the basic time API in most other languages. I believe that those complaints are totally unjustified. It's more difficult because, unlike the others, it is correct, and it uses the type system to prevent subtle intermittent bugs that almost always exist in applications using those other libraries. But in any case, I think we should be very careful not to make the interface even more complicated just to achieve what is a premature optimization in the vast majority of applications.

Many of the suggestions you are making would actually be transparent except when constructors are used explicitly. So perhaps we could achieve most of what you are suggesting without changing the interface if we provide polymorphic functions in place of each of the constructors, then move the actual constructors to "Internals" modules for use only by those who need them. We would then be free to change the internal types now and in the future without significant impact on the interface.

As for laziness, it's usually not correct to have strictness by default in a library as general as this one. For example, it's conceivable that someone will construct a UTCTime where the Day is easy to compute but the TimeOfDay results in heavy computation or even a bottom. That user would be unpleasantly surprised if we introduced strictness. Gratuitous strictness is also a premature optimization, especially in a general-purpose library. Haskell is lazy by default.

Perhaps we could introduce strict alternatives for some of the functions. That wouldn't help the size of the data types, though, unless we change some of them to type classes...

Regards,
Yitz
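As a rough illustration of the Internals idea above (makeTimeOfDay and the primed accessors are invented names, not the current API), the public module would keep the type abstract and export polymorphic functions in place of the record constructor, leaving the representation free to change:

module TimeOfDayAbstract
  ( TimeOfDay        -- exported without its constructor
  , makeTimeOfDay
  , todHour', todMin', todSec'
  ) where

-- The representation can now be changed (strict fields, Int64, ...)
-- without touching the public interface.
data TimeOfDay = TimeOfDay !Int !Int !Rational

makeTimeOfDay :: Real a => Int -> Int -> a -> TimeOfDay
makeTimeOfDay h m s = TimeOfDay h m (toRational s)

todHour' :: TimeOfDay -> Int
todHour' (TimeOfDay h _ _) = h

todMin' :: TimeOfDay -> Int
todMin' (TimeOfDay _ m _) = m

todSec' :: Fractional a => TimeOfDay -> a
todSec' (TimeOfDay _ _ s) = fromRational s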

On Fri, 2010-05-21 at 19:17 +0300, Yitzchak Gale wrote:
Duncan Coutts wrote:
There are various sorts of programs that deal with a large quantity of data and often that includes some date/time data. It's a great shame that programs dealing with lots of time data might have to avoid using the time package and revert to things like newtype Seconds = Seconds Int32 simply because they take less space, and can be unpacked into other data structures.
That is true. But on the other hand, my guess is that most applications don't need those optimizations.
One of the most common complaints heard about Haskell is that Data.Time is so much more difficult to use than the basic time API in most other languages. I believe that those complaints are totally unjustified. It's more difficult because, unlike the others, it is correct, and it uses the type system to prevent subtle intermittent bugs that almost always exist in applications using those other libraries.
Indeed. I appreciate that the interface helps us all to write correct time-handling code. That's why it's such a shame if people are tempted to go back to primitive stuff simply because of the memory profile.
But in any case, I think we should be very careful not to make the interface even more complicated
As you note below, it's not a real change or complication to the API.
just to achieve what is a premature optimization in the vast majority of applications.
Unfortunately the burden of a standard/common library is that people want to use it in a wide range of circumstances. So while it'd certainly be a premature optimisation in most circumstances, it's fairly important in others. I happened to be working recently on an unremarkable program, where the heap profiler told me that a very significant portion of the heap was storing time data structures.
Many of the suggestions you are making would actually be transparent except when constructors are used explicitly. So perhaps we could achieve most of what you are suggesting without changing the interface if we provide polymorphic functions in place of each of the constructors, then move the actual constructors to "Internals" modules for use only by those who need them. We would then be free to change the internal types now and in the future without significant impact on the interface.
I don't think we need to go that far in making the representations completely hidden, though I don't object to that if you think that's a design improvement. While technically it is an API change to switch the Pico fields to an equivalent numeric type, it is one that is likely not to break many uses, and where it does it's likely only to be a type signature or a fromIntegral. The point is it keeps the existing spirit of the interface and does not add interface complexity.
As for laziness, it's usually not correct to have strictness by default in a library as general as this one. For example, it's conceivable that someone will construct a UTCTime where the Day is easy to compute but the TimeOfDay results in heavy computation or even a bottom.
Honestly I find it a bit hard to conceive of. :-) It appears that pretty much all the functions that construct and consume TimeOfDay, LocalTime and UTCTime are strict [*]. So unless people are using the constructors directly and then not using other functions on them, it looks like one cannot really use time structures lazily anyway (though perhaps I missed some; I don't know the library that well).

[*] localTimeToUTC, utcToLocalTime and utcToZonedTime appear to be lazy in the time-of-day component, but strict in the day component.
That user would be unpleasantly surprised if we introduced strictness. Gratuitous strictness is also a premature optimization, especially in a general-purpose library. Haskell is lazy by default.
It's not a hard and fast rule that everything should be lazy. I certainly appreciate that we need laziness in the right places. I think it depends on whether we consider these values as units or not. I tend to think of time values as atomic units, like complex or rational numbers. The standard H98 complex and rational types are strict in their components because, conceptually, they're not compound types; they're atomic.
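For reference, those H98 types are defined with strict components in base, roughly along these lines:

data Complex a = !a :+ !a    -- Data.Complex
data Ratio   a = !a :% !a    -- Data.Ratio (the :% constructor is not exported)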
Perhaps we could introduce strict alternatives for some of the functions. That wouldn't help the size of the data types though, unless we change some of them to type classes...
It's not about the strictness of the functions, it's about the in-memory representation. We cannot achieve a compact representation if we use large branching records. We only get it from using strict fields, since that allows us to unbox them into flat compact records.

I admit I was assuming that the laziness in most of the time records is unnecessary. If we all conclude that the laziness of the existing representations is essential, then there are really few improvements we can make (probably even if we were to hide the representations).

In that case it might make more sense to add separate compact representations that can be converted to the standard representations, e.g. a reduced-resolution compact TimeStamp type that can be converted to/from UTCTime or LocalTime or something like that. The idea being that you store your hundreds of thousands of compact TimeStamps in data structures, but convert to the regular type for the calculations. The downside of course is that this does add interface complexity; it would be nicer if we could make the regular types reasonably compact.

Perhaps we can see what other people think about the balance of use cases between those that need the laziness vs those that need compact representations. I may well be overestimating how common the latter use case is.

Duncan
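To make the compact TimeStamp idea above a bit more concrete, here is a sketch of what the conversions could look like. The TimeStamp type, its microsecond resolution and the function names are all invented for illustration; only the UTCTime and Day constructors used here come from the time package:

module TimeStampSketch (TimeStamp, fromUTCTime, toUTCTime) where

import Data.Int (Int64)
import Data.Time (Day (..), UTCTime (..))

-- Microseconds since the Modified Julian Day epoch, packed into one Int64.
newtype TimeStamp = TimeStamp Int64
  deriving (Eq, Ord, Show)

usPerDay :: Int64
usPerDay = 86400 * 1000000

fromUTCTime :: UTCTime -> TimeStamp
fromUTCTime (UTCTime day dayTime) =
  TimeStamp (fromIntegral (toModifiedJulianDay day) * usPerDay
             + round (toRational dayTime * 1000000))

toUTCTime :: TimeStamp -> UTCTime
toUTCTime (TimeStamp us) =
  case us `divMod` usPerDay of
    (d, r) -> UTCTime (ModifiedJulianDay (fromIntegral d))
                      (fromRational (fromIntegral r / 1000000))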

Duncan Coutts wrote:
Yitzchak Gale wrote:
That user would be unpleasantly surprised if we introduced strictness. Gratuitous strictness is also a premature optimization, especially in a general-purpose library. Haskell is lazy by default.
It's not a hard and fast rule that everything should be lazy. I certainly appreciate that we need laziness in the right places. I think it depends on whether we consider these values as units or not.
I tend to think of time values as atomic units, like complex or rational numbers. The standard H98 complex and rational types are strict in their components, because conceptually they're not compound types, they're atomic.
I agree. While I haven't used this library (I've had no reason to yet), when dealing with times in other languages, my default conception is to treat them like atomic units and I see no reason to think of them differently in Haskell. While laziness by default is nice, too much laziness is just as bad as too much strictness. I can't really imagine a use for laziness in calculating the components of a time record. Even if there are some cases where it'd be desirable, they seem rare enough that they could be supported by a separate module that provides lazy variants of the default implementation (just as other libraries offer both strict and lazy variants). It'd be nice not to have to complicate the API, but I don't see a good argument for not having the default library use strict unboxed records.
Perhaps we can see what other people think about the balance of use cases between those that need the laziness vs those that need compact representations. I may well be overestimating how common the latter use case is.
I don't think the *need* for compact representations is terribly common, but this strikes me as the sort of optimization we should expect from base libraries. When the need does arise, it's much better to be pleasantly surprised that things just work, rather than lamenting the fact that Haskell isn't ready for prime time. Because it's a base library and is intended to be the primary library for time, this doesn't strike me as premature in the slightest.

--
Live well,
~wren

One of the most common complaints heard about Haskell is that Data.Time is so much more difficult to use than the basic time API in most other languages. I believe that those complaints are totally unjustified. It's
Getting off the subject, but: I think if the complaint is that it's hard because it doesn't have much documentation, as it is for me, then the complaint is totally justified.

Data.Time has no documentation, so right off you have to start guessing about which module to use. It took me a long time to figure out that a plain yyyy/mm/dd is modeled as a Data.Time.Calendar.Day. Or is it JulianDay? What's an OrdinalDate? MonthDay sounds promising but defines no data types. The haddock talks about Julian days and "proleptic Gregorian calendars" and never quite gets around to mentioning that this is not just for people wanting to calculate how many days there are between now and the assassination of Julius Caesar, but for anyone wanting to deal with a modern date. Most people, when looking for "a date", are not going to start looking for things that say "julian day" on them.

After wading through all this stuff that has no explanation, but by function name sounds like it's for scholars of Christian antiquity, the temptation to go back to System.Time, which provides only one module and a totally straightforward pair of types, is really strong. Especially when System.Time says "this is deprecated" but doesn't say why. It must have been written by a scholar of Christian antiquity for other scholars of Christian antiquity who are disappointed that System.Time lacks functions to find the Paschal full moon of the Easter feast day of the Western Orthodox Catholic tradition as standardized by Pope XXXIV but not put into effect until...

Say someone who understands Data.Time wrote some top-level haddock and some examples of how to create dates, create times, get the current date and time, and find the number of months and days between now and then, along with some of the things to take into account, etc. Perhaps the majority of the complaints would be satisfied. I don't know enough about dates to write the intro, but I'd be happy to proofread one.
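As a starting point, the sort of small example such an intro could open with, using functions that do exist in the package (fromGregorian and diffDays from Data.Time.Calendar, getCurrentTime and utctDay from Data.Time.Clock):

-- An ordinary calendar date is just a Data.Time.Calendar.Day.
import Data.Time.Calendar (Day, fromGregorian, diffDays)
import Data.Time.Clock (getCurrentTime, utctDay)

main :: IO ()
main = do
  let posted = fromGregorian 2010 5 21          -- year, month, day
  today <- fmap utctDay getCurrentTime          -- Day part of the current UTC time
  putStrLn ("Days since the post: " ++ show (diffDays today posted))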

On Fri, May 21, 2010 at 9:48 PM, Evan Laforge wrote:
Getting off the subject, but:
I think if the complaint is that it's hard because it doesn't have much documentation, as it is for me, then the complaint is totally justified.
This is true. I had an easier time wrapping my head around iteratees than around the time library. Whenever I need to use the time package, I end up screaming like a rampaging orc who has just seen an elf. Is there any documentation written at all, so I or others can have a go at injecting it into the haddock documentation? I am afraid I have too little grasp of time to start on the subject entirely by myself :/

--
J.

On 05/21/10 09:54, Duncan Coutts wrote:
This one is especially useful to make smaller since it is used as a timestamp in many applications, so there tend to be a lot of them.
newtype DiffTime = MkDiffTime Pico
Again, Pico based on Int64 rather than Integer would save space, save an indirection and allow further unboxing.
But is it acceptable for the maximum DiffTime to be on the order of 18 million years? (Maybe... but it's less obvious to be alright than trillions of years is...)

-Isaac

Isaac Dupree wrote:
On 05/21/10 09:54, Duncan Coutts wrote:
This one is especially useful to make smaller since it is used as a timestamp in many applications, so there tend to be a lot of them.
newtype DiffTime = MkDiffTime Pico
Again, Pico based on Int64 rather than Integer would save space, save an indirection and allow further unboxing.
But is it acceptable for the maximum DiffTime to be on the order of 18 million years? (Maybe... but it's less obvious to be alright than trillions of years is...)
I don't know how much it'd complicate the API, but this sounds like an argument for having two different types: one for a general difference in time (trillions of years), and another for a difference between timestamps (a few thousand years should suffice, I'd think).

--
Live well,
~wren

On Fri, 2010-05-21 at 14:54 +0100, Duncan Coutts wrote:
Hi Ashley (and other people interested in the time lib),
I wonder if we could look more closely at the in-memory representations of the data types in the time library to see if we could make them a bit smaller.
These are breaking changes. If we do them, we should bump the major version to 2. And this might open the door to other breaking changes... What should replace the Pico type? Is it worth generalising Fixed (in base) to allow any underlying integer type (also a breaking change)?

--
Ashley Yakeley
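One possible shape for that generalisation, purely as a sketch (base's Data.Fixed hard-codes Integer underneath, and none of the names below are a concrete proposal):

{-# LANGUAGE ScopedTypeVariables #-}
module GeneralisedFixed where

import Data.Int (Int64)
import Data.Ratio ((%))

class HasResolution r where
  resolution :: proxy r -> Integer

data E12 = E12
instance HasResolution E12 where
  resolution _ = 1000000000000

-- Fixed parameterised over its integer carrier i as well as its resolution r.
newtype Fixed i r = MkFixed i
  deriving (Eq, Ord)

type Pico   = Fixed Integer E12   -- today's behaviour
type Pico64 = Fixed Int64   E12   -- compact, unboxable alternative

toRat :: forall i r. (Integral i, HasResolution r) => Fixed i r -> Rational
toRat (MkFixed n) = toInteger n % resolution (Nothing :: Maybe r)

fromRat :: forall i r. (Integral i, HasResolution r) => Rational -> Fixed i r
fromRat x = MkFixed (fromInteger (floor (x * fromInteger (resolution (Nothing :: Maybe r)))))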
participants (7)
- Ashley Yakeley
- Duncan Coutts
- Evan Laforge
- Isaac Dupree
- Jesper Louis Andersen
- wren ng thornton
- Yitzchak Gale