Property checks and PostgreSQL?

Note: I posted this [elsewhere] — I hope it is permissible to also post here. There seems to be no single right venue for Haskell questions at this time.

[elsewhere]: https://discourse.haskell.org/t/property-checks-and-postgresql/2078

So, I have a three layer stack: PostgreSQL data base, Haskell back end, PureScript front end. And there is a fault somewhere in this stack: sometimes some items present in the data base do not make it to the front end. A small fortune in programmer hours has been spent searching for the fault, but there is no progress. No one can figure out even roughly where the fault is. Our last hope is to property check the whole thing, end to end, verifying that anything put into the data base can be retrieved by the front end.

We have in place a fixture that creates a new PostgreSQL cluster, runs the back end on it and makes some scripted HTTP requests. We use it to run unit tests, such as _«if I make this `PUT` request and then that `GET` request to the same end point, I should get back the same thing as I put in»_. In principle it would not be a problem to turn this property into a property check. In practice, tearing down the cluster, building a new pristine one and initializing the data base anew takes seconds, so even a trivial property check would take minutes to run.

In theory, we could carefully reset the data base to its initial state after each run of the property check. Given that only a small amount of data is modified, it would take imperceptibly little time. But it is hard to know exactly what should be reset, and it is easy to get wrong if done by hand!

One trick we do use is SQL transactions. We take a single connexion, start a transaction, perform a check and then roll the transaction back. Unfortunately, even this is not completely safe: it looks as though the sequences that primary keys are drawn from are not reset to their previous values! _(Or maybe we are doing something wrong — I am not really a PostgreSQL guru, so I am not too sure of myself.)_ But even ignoring this problem _(it is irrelevant to most checks)_, there is another, more severe problem: transactions guarantee that uncommitted data is visible only inside the same connexion, so there is no way to request it through the HTTP API. This trick is suitable for fuzzing the data base – back end layer, but nothing beyond that.

Another thing I have heard of being done is to selectively drop and re-create exactly the tables affected by a run of a property check. This seems challenging to automate, though. How can I know, in a general way, whether a table has been touched? And then, how can I re-create exactly the affected tables, given that the data base initialization code is an opaque blob?

I wonder if this problem is already solved. In any case, any advice is welcome!
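A minimal sketch of the rollback trick described above, assuming the postgresql-simple library; the connexion string and the `items` table are made up for illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (bracket_)
import Database.PostgreSQL.Simple

-- Run one check inside a transaction that is always rolled back, so
-- the data base ends up untouched.  Caveat from the post: sequence
-- values consumed inside the transaction are NOT restored on rollback.
withRolledBack :: Connection -> IO a -> IO a
withRolledBack conn = bracket_ (begin conn) (rollback conn)

main :: IO ()
main = do
  conn <- connectPostgreSQL "dbname=testdb"  -- hypothetical connexion string
  n <- withRolledBack conn $ do
    _ <- execute_ conn "INSERT INTO items (name) VALUES ('widget')"
    [Only k] <- query_ conn "SELECT count(*)::int FROM items"
    pure (k :: Int)
  print n  -- the inserted row is gone once the transaction rolls back
```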

On Sat, Mar 13, 2021 at 04:39:40PM +0500, Ignat Insarov wrote:
So, I have a three layer stack: PostgreSQL data base, Haskell back end, PureScript front end. And there is a fault somewhere in this stack [...] Our last hope is to property check the whole thing, from end to end, verifying that any things put into the data base can be retrieved by the front end.
It's not clear to me why you need to reset anything. Can you attach a unique value to each entry you make and then subsequently check that the unique value is present in the data you read back?

Yes, usually I can ignore extraneous values and this would work. But this is not a local fault. We already check that individual items can always be retrieved. It must be some relation between the items that determines whether an item will be broken. It also depends on the query — any single item can always be retrieved, but some items disappear when a set of items is requested. This is why I want to ensure that the runs of the property check are completely isolated. I am not looking for an example of a single broken item — I am looking for an example of a broken data base as a whole.

On Sat, Mar 13, 2021 at 04:39:40PM +0500, Ignat Insarov wrote:
So, I have a three layer stack: PostgreSQL data base, Haskell back end, PureScript front end. And there is a fault somewhere in this stack [...]
We have in place a fixture that creates a new PostgreSQL cluster, runs the back end on it and makes some scripted HTTP requests. [...] Tearing down the cluster, building a new pristine one and initializing the data base anew takes seconds, so even a trivial property check would take minutes to run.
Do you know where the majority of the time is spent? I.e. can you just drop the schema without re-creating the cluster? I'd guess it'd be fast but I don't know your setup.
Another thing I heard being done is to selectively drop and re-create exactly the tables affected by a run of a property check. [...]
At $WORK we do this: property tests on PSQL, dropping data between tests. But we know which tables are modified, so we can just truncate the affected ones.

Tom

On Sat, Mar 13, 2021 at 02:22:25PM -0700, amindfv@mailbox.org wrote:
At $WORK we do this: property tests on PSQL, dropping data between tests. But we know which tables are modified, so we can just truncate the affected ones.
I should add, too: it's not always desirable to truncate or start from a pristine database. Sometimes having data lying around that shouldn't affect your tests is a good way of verifying that it truly doesn't affect the thing being tested. I.e., it's another method of fuzzing.
Tom

Do you know where the majority of the time is spent? I.e. can you just drop the schema without re-creating the cluster? I'd guess it'd be fast but I don't know your setup.
The first step would be to keep the cluster, yes. Creating one takes a few seconds, although we can put it on a `tmpfs` in RAM to cut this down to about one second. But initializing the data base from prepared `*.sql` files also takes a few seconds.
At $WORK we do this: property tests on PSQL, dropping data between tests. But we know which tables are modified, so we can just truncate the affected ones.
So how exactly do you do that? Like, do you write the SQL statements needed to reset the data base for every check by hand? And what sort of statements do you use? Do you drop the table or delete the rows? It would be ideal if I could somehow detect the tables that were touched and reset them automatically. But I cannot think of a simple way to do that. I cannot simply erase all the data because some data needs to be in place for the application to even start.

On Sun, Mar 14, 2021 at 04:06:41AM +0500, Ignat Insarov wrote:
Do you know where the majority of the time is spent? I.e. can you just drop the schema without re-creating the cluster? I'd guess it'd be fast but I don't know your setup.
The first step would be to keep the cluster, yes. Creating one takes a few seconds, although we can put it on a `tmpfs` in RAM to cut this down to about one second. But initializing the data base from prepared `*.sql` files also takes a few seconds.
I don't know your setup, but if the problem is not with the creation of your schema, it does feel like a waste to tear it down and re-create it from a tmpfs each time.
At $WORK we do this: property tests on PSQL, dropping data between tests. But we know which tables are modified, so we can just truncate the affected ones.
So how exactly do you do that? Like, do you write the SQL statements needed to reset the data base for every check by hand? And what sort of statements do you use? Do you drop the table or delete the rows?
It depends on the test, but it's a lot more common to just call `truncate table foo` when necessary.
It would be ideal if I could somehow detect the tables that were touched and reset them automatically. But I cannot think of a simple way to do that. I cannot simply erase all the data because some data needs to be in place for the application to even start.
It's possible I'm answering an X/Y question[0] right now, but:

- Can you create a function that populates this essential data and only call that function after truncating? (See the sketch after this message.)
- If your data is timestamped, you could always delete all "new" rows?

Tom

[0] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
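To make the first suggestion concrete, a minimal sketch of truncate-then-reseed, again assuming postgresql-simple; the table names and the `settings` row are made up:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple

-- Between checks: wipe the tables the checks may have touched, then
-- restore the minimal data the application needs in order to start.
-- RESTART IDENTITY also resets the sequences owned by these tables,
-- sidestepping the rollback caveat; CASCADE follows foreign keys.
resetDb :: Connection -> IO ()
resetDb conn = do
  _ <- execute_ conn "TRUNCATE TABLE items, orders RESTART IDENTITY CASCADE"
  -- Re-insert the essential rows.  Following the suggestion above, this
  -- could instead live in a single data base function called here.
  _ <- execute_ conn "INSERT INTO settings (key, value) VALUES ('schema_version', '1')"
  pure ()
```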

On Mar 13, 2021, at 9:39 AM, Ignat Insarov wrote:
Another thing I heard being done is to selectively drop and re-create exactly the tables affected by a run of a property check. This seems challenging to automate though. How can I know in a general way if a table has been touched? And then, how can I re-create exactly the affected tables given that the data base initialization code is an opaque blob?
I wonder if this problem is already solved. In any case, any advice is welcome!
The template test database should be immutable during the tests. Just clone the template database, and run tests against the clone. When done, drop the clone.

https://www.postgresql.org/docs/12/manage-ag-templatedbs.html
https://www.postgresql.org/docs/12/manage-ag-dropdb.html
https://www.postgresqltutorial.com/postgresql-copy-database/
https://www.postgresqltutorial.com/postgresql-drop-database/

--
Viktor.
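A minimal sketch of this clone-and-drop cycle, assuming postgresql-simple; the data base names are made up, and the administrative connexion must point at some other data base (e.g. `postgres`):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (bracket)
import Database.PostgreSQL.Simple

-- Clone the template data base, run an action, then drop the clone.
-- CREATE DATABASE and DROP DATABASE must run outside a transaction
-- block, so plain execute_ statements are used.  Note that cloning
-- requires no open connexions to the template, and the drop requires
-- no remaining connexions to the clone.
withClonedDb :: Connection -> IO a -> IO a
withClonedDb admin act =
  bracket
    (execute_ admin "CREATE DATABASE test_run TEMPLATE test_template")
    (\_ -> execute_ admin "DROP DATABASE test_run")
    (\_ -> act)
```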

The template test database should be immutable during the tests. Just clone the template database, and run tests against the clone. When done, drop the clone.
Is it significantly faster than importing from prepared `*.sql` files? The problem here is that, as I understand it, we would have to restart the application every time so that it connects to the new data base. That takes about a second.

I was hoping that there is some way to clone a single schema, but I gather it is not possible. So now I have to either drop the schema and import it again from `*.sql` files, or drop the data base, clone it from the template and restart the application. It does not seem like a clear win.

Note that I need the application to be running so that I may observe the HTTP API. Perhaps I could factor the API out and write a simple wrapper for it that would start faster, but that is extra work and would increase the complexity of the project, so I am not sure I should.

So, I have a three layer stack: PostgreSQL data base, Haskell back end, PureScript front end. And there is a fault somewhere in this stack: sometimes some items present in the data base do not make it to the front end. [...]
Are you running at the standard isolation level? If yes, your code may be affected by nonrepeatable reads, phantom reads, or serialization anomalies. In a nutshell, these are what happen when you expect the same result from running a query twice, but the rows that affect the result are somehow not covered by the transaction. (There are edge cases that can cause this; the isolation levels progressively exclude them, at the expense of performance.) More details are at https://www.postgresql.org/docs/current/transaction-iso.html .
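If this is a suspect, one cheap experiment is to re-run the misbehaving query at a stricter level. A minimal sketch, assuming postgresql-simple and a hypothetical `items` table:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Transaction (withTransactionSerializable)

-- Re-run the query at SERIALIZABLE isolation.  If the missing items
-- reappear, the default READ COMMITTED level (and one of the anomalies
-- above) is implicated.  The helper retries the whole transaction
-- automatically on serialization failures.
fetchAllItems :: Connection -> IO [Only Int]
fetchAllItems conn =
  withTransactionSerializable conn $
    query_ conn "SELECT id FROM items"
```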
Theoretically, we can carefully reset the data base back to the initial state after each run of the property check. Given that only a small amount of data is modified, it would take imperceptibly little time. But it is hard to know what exactly should be reset, and it is easy to get it wrong if done by hand!
Have you tried nested transactions? I have no PSQL experience myself, but it looks like they should give the minimal-effort rollback you're after: start a transaction, let the unit tests run (including their own transactions, which are simply nested now), then roll back the overarching initial transaction.
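One caveat worth adding: PostgreSQL has no true nested transactions; the closest equivalent is savepoints inside a single open transaction. A minimal sketch of that pattern, assuming postgresql-simple; `property_check` is an illustrative savepoint name, and the connexion must already be inside a transaction:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (finally)
import Database.PostgreSQL.Simple

-- Emulate a nested transaction with a savepoint: run the check, then
-- roll back to the savepoint so its effects never become visible to
-- later checks, while the outer transaction stays open.
withCheckSavepoint :: Connection -> IO a -> IO a
withCheckSavepoint conn check = do
  _ <- execute_ conn "SAVEPOINT property_check"
  check `finally` execute_ conn "ROLLBACK TO SAVEPOINT property_check"
```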
One trick we do use is SQL transactions. We take a single connexion, start a transaction, perform a check and then roll the transaction back. Unfortunately, even this is not completely safe: it looks as though sequences used to draw primary keys from are not reset to their previous values! _(Or maybe we are doing something wrong — I am not really a PostgreSQL guru so I am not too sure of myself.)_
No, this is documented (if surprising) behaviour. But TBH, if your code is affected by the values you get from a sequence, you are using sequences for something they are not meant for: they guarantee uniqueness, not any particular numerical properties. See https://www.postgresql.org/message-id/501B1494.9040502@ringerc.id.au for more details.
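This is easy to observe directly. A small sketch, again assuming postgresql-simple; the connexion string and sequence name are made up:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple

-- Demonstrate that nextval() is not undone by rollback: the second
-- value is higher than the first even though the transaction aborted.
main :: IO ()
main = do
  conn <- connectPostgreSQL "dbname=testdb"  -- hypothetical connexion string
  _ <- execute_ conn "CREATE SEQUENCE IF NOT EXISTS demo_seq"
  begin conn
  [Only a] <- query_ conn "SELECT nextval('demo_seq')::int"
  rollback conn
  [Only b] <- query_ conn "SELECT nextval('demo_seq')::int"
  print (a :: Int, b :: Int)  -- b > a: the rollback did not reset the sequence
```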
But even ignoring this problem _(it is irrelevant to most checks)_, there is another, more severe problem: transactions guarantee that uncommitted data is visible only inside the same connexion, so there is no way to request it through the HTTP API. This trick is suitable for fuzzing the data base – back end layer, but nothing beyond that.
Yes, PostgreSQL gives you no way to read dirty data; this is pretty typical of MVCC databases. But even if there were a way, you would never get a guarantee that you are actually getting the data that another transaction sees. If you wish to see what another transaction sees, you will have to instrument the code that runs the transaction. This should be doable in the backend.
Another thing I heard being done is to selectively drop and re-create exactly the tables affected by a run of a property check. [...]
I wonder if this problem is already solved. In any case, any advice is welcome!
I doubt it's a useful way forward. It's just not a use case that databases are built for, and you have a high risk of hitting nonfunctional problems (performance, possibly from cold query optimizer caches and such) or functional problems (sequencing of dropping/creating DB objects that depend on each other, for example; and you already saw the sequence number issue).

Background is that databases are such complex beasts that the developer teams struggle to get the core goals done; nice-to-have goals get dropped pretty quickly. PGSQL is exceptional in that it tries to keep as many nice-to-have goals intact as possible, but I wouldn't count on that.

Regards,
Jo

P.S.: This is concentrating on the database layer, which is unrelated to Haskell. That might be because your problem is database-related, or because I happen to know more about databases than about Haskell, or because you described your problems in terms of database issues. If it's really database-related, you may get more concrete advice on a PostgreSQL-related forum than here.