Storing big datasets

Hi!

I'm using the ACID package as my main database -- it's simple and... ACID (which is cool, of course). I now need to store up to ~30GB of data in a single data structure (!) that will be constantly updated (some kind of huge tree set). So here's the question -- how do I operate on such a big structure?

1. It doesn't even fit in RAM.
2. It should be updated atomically and frequently (up to 500 elements out of 10^7 every 10 seconds).
3. What structures should I use? I'd like to store 10^6 to 10^7 fairly simple elements there, which will be gigabytes of data, so it seems to me I can't use Data.Set.

Thanks for any ideas!
-- Volkhov Mikhail M3#38 ITMO study group 17' Computer Technologies Department
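For reference, a minimal sketch of the kind of acid-state + Data.Set setup being described (assuming the package is acid-state, which is confirmed further down the thread; the DB type, event names, and Int elements are purely illustrative, not the actual code):

{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE TemplateHaskell    #-}
{-# LANGUAGE TypeFamilies       #-}
module Main where

import           Control.Monad.Reader (ask)
import           Control.Monad.State  (modify)
import           Data.Acid            (Query, Update, makeAcidic,
                                       openLocalState, query, update)
import           Data.SafeCopy        (base, deriveSafeCopy)
import qualified Data.Set             as Set
import           Data.Typeable        (Typeable)

-- The whole database is a single in-memory Data.Set that acid-state
-- journals and checkpoints to disk.
newtype DB = DB { dbSet :: Set.Set Int }
  deriving (Typeable)

$(deriveSafeCopy 0 'base ''DB)

-- Insert a batch of elements as one atomic acid-state transaction.
insertElems :: [Int] -> Update DB ()
insertElems xs = modify (\(DB s) -> DB (foldr Set.insert s xs))

memberElem :: Int -> Query DB Bool
memberElem x = fmap (Set.member x . dbSet) ask

$(makeAcidic ''DB ['insertElems, 'memberElem])

main :: IO ()
main = do
  st <- openLocalState (DB Set.empty)
  update st (InsertElems [1 .. 500])   -- one atomic batch of updates
  found <- query st (MemberElem 42)
  print found

The whole DB value above has to live in RAM, which is exactly the problem once it approaches 30GB.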

On 05/06/2016 10:28 PM, Mikhail Volkhov wrote:
Hi!
I'm using ACID package as main database -- it's simple and... ACID (which is cool of course).
Just for clarification: Do you mean acid-state?
So now I need to store up to ~30GB of data in a single data structure (!) that will be constantly updated (some kind of a huge tree set). Here's a question -- how to operate that big structure? 1. It doesn't even fit in RAM
Yes it does. My desktop machine has 32GB RAM. I'm not trying to brag or anything, and *you* may not have that amount of memory, but it *should* be cheap enough to buy enough RAM if it avoids radically redesigning an existing system (with all the associated risk, developer time, etc.). If this is a one-off or rare occurrence, you could get 160GB RAM on an Amazon M4 instance (m4.10xlarge) if you wanted to. (Amazon's pricing structure is incredibly opaque and I didn't do a thorough investigation, so I apologize if this is wrong.)
2. It should be updated atomically and frequently (every 10 seconds up to 500 elements out of 10^7).
This is a bit vague. Do you mean that an *average* of 500 (out of 10^7) will be updated every 10 seconds?
3. What structures should I use? I'd like to store up to 10^6~10^7 some simple elements there too, that will be gigabytes of data. So it seems to me I can't use Data.Set.
Well, if storing things in RAM is out of the question, so is using Data.Set. Regards,

Thank you for the answer!

1. acid-state, yes.
2. The application is supposed to work on a regular PC, without 32GB of RAM. I agree that buying 32GB of RAM doesn't cost much, but that's not an option here.
3. Come to think of it, it would be more like 10k elements per 10-20 seconds, on average.

On 05/06/2016 11:41 PM, Bardur Arantsson wrote:
On 05/06/2016 10:28 PM, Mikhail Volkhov wrote:
Hi!
I'm using ACID package as main database -- it's simple and... ACID (which is cool of course).
Just for clarification: Do you mean acid-state?
So now I need to store up to ~30GB of data in a single data structure (!) that will be constantly updated (some kind of a huge tree set). Here's a question -- how to operate that big structure? 1. It doesn't even fit in RAM
Yes it does. My desktop machine has 32GB RAM.
I'm not trying to brag or anything, and *you* may not have that amount of memory, but it *should* be cheap enough to buy enough RAM if it avoids radically redesigning an existing system (with all the associated risk, developer time, etc.).
If this is a one-off or rare occurrence, you could get 160GB RAM on an Amazon M4 instance (m4.10xlarge) if you wanted to. (Amazon's pricing structure is incredibly opaque and I didn't do a thorough investigation, so I apologize if this is wrong.)
2. It should be updated atomically and frequently (every 10 seconds up to 500 elements out of 10^7).
This is a bit vague. Do you mean that an *average* of 500 (out of 10^7) will be updated every 10 seconds?
3. What structures should I use? I'd like to store up to 10^6~10^7 some simple elements there too, that will be gigabytes of data. So it seems to me I can't use Data.Set.
Well, if storing things in RAM is out of the question, so is using Data.Set.
Regards,
-- Volkhov Mikhail M3#38 ITMO study group 17' Computer Technologies Department

Hi Mikhail,

Have you considered external database engines? I suppose you could benefit from using Redis in your case.

Original Message
From: Mikhail Volkhov
Sent: Friday, May 6, 2016 8:28 PM
To: haskell-cafe@haskell.org
Subject: [Haskell-cafe] Storing big datasets

Hi! I'm using the ACID package as my main database -- it's simple and... ACID (which is cool, of course). So now I need to store up to ~30GB of data in a single data structure (!) that will be constantly updated (some kind of huge tree set). Here's the question -- how do I operate on that big structure? 1. It doesn't even fit in RAM. 2. It should be updated atomically and frequently (every 10 seconds, up to 500 elements out of 10^7). 3. What structures should I use? I'd like to store 10^6 to 10^7 simple elements there too, which will be gigabytes of data. So it seems to me I can't use Data.Set. Thanks for any ideas!
-- Volkhov Mikhail M3#38 ITMO study group 17' Computer Technologies Department

On 07.05.2016 at 02:27, Lana Black wrote:
Hi Mikhail,
Have you considered external database engines? I suppose you could benefit from using Redis in your case.
Wikipedia says that while Redis can use the disk as "virtual memory", that feature is deprecated, so it definitely expects to keep the whole dataset in memory. Which kind of defeats the whole point for Mikhail.

SQLite comes to mind. HDBC claims to have a driver for it, and it's linked into the application, so there would be no additional setup required.

If Mikhail already has another database in his application, the setup cost is already paid for, and he might want to check whether that's fast enough for his purposes. I'd expect PostgreSQL or MySQL to be too slow, but H2 to work just fine. Of course that's just expectations, so testing would be needed.

Regards, Jo
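To make that concrete, a rough sketch of the SQLite-via-HDBC route (using the hdbc-sqlite3 backend; the table layout and the 500-element batch are made-up illustrations, not a recommendation of a schema):

import Database.HDBC         (commit, disconnect, executeMany, prepare,
                              quickQuery', run, toSql)
import Database.HDBC.Sqlite3 (connectSqlite3)

main :: IO ()
main = do
  conn <- connectSqlite3 "bigset.db"
  -- One row per element; the primary key gives on-disk B-tree lookups.
  _ <- run conn
         "CREATE TABLE IF NOT EXISTS elems (key INTEGER PRIMARY KEY, val BLOB)"
         []
  -- Batch a group of pending updates and commit them in one transaction.
  stmt <- prepare conn "INSERT OR REPLACE INTO elems (key, val) VALUES (?, ?)"
  executeMany stmt [ [toSql k, toSql ("payload" :: String)]
                   | k <- [1 .. 500 :: Int] ]
  commit conn
  -- Membership test without loading the whole structure into RAM.
  rows <- quickQuery' conn "SELECT 1 FROM elems WHERE key = ?"
                           [toSql (42 :: Int)]
  print (not (null rows))
  disconnect conn

SQLite keeps the table in an on-disk B-tree, so only the pages being touched need to be in RAM, and the batch of inserts above is committed as a single transaction.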

First, about external database engines -- I don't really understand how I should store my own set datatype (let's say it's a red-black tree with some extra information in the nodes, something like that) in a key-value DB. No idea. And yes, as for Redis, it's not supposed to work with datasets bigger than RAM can hold. Switching the application's DB is not a problem right now; I'll check out your suggestions (SQLite/HDBC/PostgreSQL/H2). Thanks!

On 05/07/2016 12:17 PM, Joachim Durchholz wrote:
On 07.05.2016 at 02:27, Lana Black wrote:
Hi Mikhail,
Have you considered external database engines? I suppose you could benefit from using Redis in your case.
Wikipedia says that while Redis can use the disk as "virtual memory", that feature is deprecated, so it definitely expects to keep the whole dataset in memory. Which kind of defeats the whole point for Mikhail.
sqlite comes to mind. HDBC claims to have a driver for it, and it's linked into the application so there would be no additional setup required.
If Mikhail already has another database in his application, the setup cost is already paid for, and he might want to check whether that's fast enough for his purposes. I'd expect postgresql or mysql to be too slow, but H2 to work just fine. Of course that's just expectations, so testing would be needed.
Regards, Jo
-- Volkhov Mikhail M3#38 ITMO study group 17' Computer Technologies Department

Neo4j may be a good fit for trees. It may be necessary to store large blobs of data in another DB.

On 11:17 Sat 07 May, Joachim Durchholz wrote:
On 07.05.2016 at 02:27, Lana Black wrote:
Hi Mikhail,
Have you considered external database engines? I suppose you could benefit from using Redis in your case.
Wikipedia says that while Redis can use the disk as "virtual memory", that feature is deprecated, so it definitely expects to keep the whole dataset in memory. Which kind of defeats the whole point for Mikhail.
I believe we're talking about different things here. This page [1] says that disk persistence is a standard feature in Redis. Furthermore, my employer uses Redis as a storage backend for raw video, and it works just fine. But yes, Redis might not be the best choice depending on what data Mikhail works with.
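To make that concrete, and to address the earlier question of how a custom set datatype would map onto a key-value store at all: one common approach is not to persist the tree shape, but to store one entry per element and let the store maintain its own index. A rough sketch with the hedis client (the key names, payload, and Int elements are purely illustrative assumptions):

{-# LANGUAGE OverloadedStrings #-}

import           Control.Monad.IO.Class (liftIO)
import qualified Data.ByteString.Char8  as B
import           Database.Redis

main :: IO ()
main = do
  conn <- connect defaultConnectInfo
  runRedis conn $ do
    -- The set lives server-side; a single SADD with many members is
    -- applied atomically by Redis.
    _ <- sadd "myset" (map (B.pack . show) [1 .. 500 :: Int])
    -- Extra per-node information can live under its own key.
    _ <- set "myset:meta:42" "some extra info"
    -- Membership test without ever holding the whole set in RAM.
    found <- sismember "myset" "42"
    liftIO (print found)

The red-black tree then becomes an implementation detail of the store rather than something you serialize yourself. Whether this is fast and compact enough at 10^7 elements is exactly the kind of thing that would need testing.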
sqlite comes to mind. HDBC claims to have a driver for it, and it's linked into the application so there would be no additional setup required.
If Mikhail already has another database in his application, the setup cost is already paid for, and he might want to check whether that's fast enough for his purposes. I'd expect postgresql or mysql to be too slow, but H2 to work just fine. Of course that's just expectations, so testing would be needed.
It is certainly possible to get 2k requests per second with PostgreSQL [2] on fairly limited hardware if network latency is taken out of the picture. Again, this heavily depends on the kind of data and other conditions.

[1] http://redis.io/topics/persistence
[2] https://gist.github.com/chanks/7585810

On 08.05.2016 at 03:34, Lana Black wrote:
On 11:17 Sat 07 May, Joachim Durchholz wrote:
On 07.05.2016 at 02:27, Lana Black wrote:
Hi Mikhail,
Have you considered external database engines? I suppose you could benefit from using Redis in your case.
Wikipedia says that while Redis can use the disk as "virtual memory", that feature is deprecated, so it definitely expects to keep the whole dataset in memory. Which kind of defeats the whole point for Mikhail.
I believe we're talking about different things here. This page [1] says that disk persistence is a standard feature in redis. [1] http://redis.io/topics/persistence
I see. Seems like Wikipedia got something wrong here. Actually, the claims look quite nice, though as usual with claims, I'd want to verify them before actually relying on them :-)
It is certainly possible to get 2k requests per second with postgresql [2] on a fairly limited hardware if network latency is taken out of the picture. Again this heavily depends on the kind of data and other conditions.
Ah, that's about scaling up under concurrent load. Mikhail's use case does not have multiple processes per data instance. It's an interesting data point, though.

I've been hearing figures of "a few hundred transactions per second, max, unless you have dedicated hardware", but now I'm wondering how things might work out in a single-threaded scenario. It could easily be faster, though it could also easily be slower because the engine can't parallelize work. I have to admit I have no idea how this would turn out.

Regards, Jo

B-trees are good for storing data on disk, and something like Postgres is an extremely efficient implementation of a B-tree supporting atomic updates and the like. I'd use that!

On 6 May 2016 21:29, "Mikhail Volkhov" wrote:
Hi!
I'm using ACID package as main database -- it's simple and... ACID (which is cool of course).
So now I need to store up to ~30GB of data in a single data structure (!) that will be constantly updated (some kind of a huge tree set). Here's a question -- how to operate that big structure? 1. It doesn't even fit in RAM 2. It should be updated atomically and frequently (every 10 seconds up to 500 elements out of 10^7). 3. What structures should I use? I'd like to store up to 10^6~10^7 some simple elements there too, that will be gigabytes of data. So it seems to me I can't use Data.Set.
Thanks for any ideas! -- Volkhov Mikhail M3#38 ITMO study group 17' Computer Technologies Department

On 07.05.2016 at 12:48, David Turner wrote:
Btrees are good for storing data on disk. And something like postgres is an extremely efficient implementation of a btree supporting atomic updates and the like. I'd use that!
The original question was about standard hardware (i.e. still including rotating rust) and ~50 updates/second. I'd assume that's doable with an ACID-compliant DB on standard hardware, though it does not leave much room for inefficiencies, so you need to know what you're doing in SQL.

On a later update, he corrected the specs to 1,000-2,000 updates/second, and I believe it's impossible to do that on a standard single HDD. I don't know whether Mikhail considers SSDs standard configuration.

Now, transaction rates aren't the same as write rates. If he can batch multiple writes into one transaction, PostgreSQL or any other RDBMS might actually work.
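For illustration, a rough sketch of that batching idea with the postgresql-simple package (the connection string, table layout, and 500-row batch are made-up assumptions; HDBC would look similar):

{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple

main :: IO ()
main = do
  conn <- connectPostgreSQL "dbname=bigset"
  _ <- execute_ conn
         "CREATE TABLE IF NOT EXISTS elems (key bigint PRIMARY KEY, val text)"
  -- All pending updates go to disk as ONE transaction, so the commit
  -- (and its fsync) is paid once per batch rather than once per element.
  withTransaction conn $ do
    -- A real version would also handle updates of existing keys (upsert);
    -- this just inserts a fresh batch.
    _ <- executeMany conn
           "INSERT INTO elems (key, val) VALUES (?, ?)"
           [ (k, "payload" :: String) | k <- [1 .. 500 :: Int] ]
    return ()
  close conn

The point is that the per-commit cost is paid once per batch of a few hundred writes, not once per element.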

On a later update, he corrected the specs to 1,000-2,000 updates/second, and I believe that it's impossible to do that on a standard single-HDD. I don't know whether Mikhail considers SSDs standard configuration.
Yes, I suppose an SSD is standard configuration -- it costs less than >32GB of RAM, and it can be installed on any device, whereas adding extra RAM is a problem on laptops.
Now transaction rates aren't the same as write rates. If he can batch multiple writes in one transaction, Postgresql or any other RDBMS might actually work.
Thank you, will look into these approaches. -- Volkhov Mikhail M3#38 ITMO study group 17', CT Department

On 7 May 2016 13:45, "Joachim Durchholz" wrote:
On 07.05.2016 at 12:48, David Turner wrote:
Btrees are good for storing data on disk. And something like postgres is an extremely efficient implementation of a btree supporting atomic updates and the like. I'd use that!
The original question was about standard hardware (i.e. still including rotating rust) and ~50 updates/second. I'd assume that that's doable with an ACID-compliant DB on standard hardware, though it does not leave room for inefficiencies so you need to know what you're doing in SQL.
On a later update, he corrected the specs to 1,000-2,000 updates/second, and I believe that it's impossible to do that on a standard single-HDD. I don't know whether Mikhail considers SSDs standard configuration.
Now transaction rates aren't the same as write rates. If he can batch multiple writes in one transaction, Postgresql or any other RDBMS might actually work.
Thousands of transactions per second is getting into needs-clever-optimisation territory if it can't be done in RAM, but it's not that tricky. Postgres will batch transactions for you: see for instance http://pgeoghegan.blogspot.co.uk/2012/06/towards-14000-write-transactions-on...

Cheers,

On 07.05.2016 at 16:42, David Turner wrote:
Thousands of transactions per second is getting into needs-clever-optimisation territory if it can't be done in RAM, but it's not that tricky. Postgres will batch transactions for you: see for instance http://pgeoghegan.blogspot.co.uk/2012/06/towards-14000-write-transactions-on...
Well, actually it is - that particular author sustains 14,000 transactions per second, but with benchmarks that may or may not transfer to Mikhail's scenario, and with a quite high-end 7,200 RPM HDD which may or may not fit his definition of standard hardware. That said, these results are still pretty impressive.

Regards, Jo

BTW, please "reply to list"; replying directly and CC'ing the list means I have to manually copy/paste the list address to answer.
Participants (6):
- Bardur Arantsson
- David Turner
- Imants Cekusins
- Joachim Durchholz
- Lana Black
- Mikhail Volkhov