Request for Feedback on COPY IN/OUT support for postgresql-simple

Although the postgresql-libpq binding has supported both COPY IN and COPY OUT for some time now, this functionality wasn't available from postgresql-simple without mucking around in the Internal module and calling libpq functions on your own. I just now fixed that. The interface I went with ultimately mimics the libpq interface rather closely, albeit with many simplifications. I did consider offering fold-like combinators, and while they would certainly be much more convenient for some use cases, they also introduce a number of complications, such as dealing with exceptions.

If anybody's interested in reviewing these latest patches, I'd appreciate it. The one thing I'm most unsure about is exactly which exceptions should be raised under various conditions.

https://github.com/lpsmith/postgresql-simple/commit/491247f687fcbfd30c8b0f50...
https://github.com/lpsmith/postgresql-simple/commit/4cfcdbee22a0cb1fb7985988...

Best,
Leon
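[For a sense of the interface's shape, here is a minimal COPY IN sketch, assuming the names from the patches above (copy_, putCopyData, putCopyEnd); the widgets table is hypothetical:]

{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Copy

-- Feed two rows of text-format COPY data to the server, then end the
-- transfer; putCopyEnd returns the server's count of rows copied.
-- (The widgets table here is hypothetical.)
copyInExample :: Connection -> IO ()
copyInExample conn = do
    copy_ conn "COPY widgets (name, weight) FROM STDIN"
    putCopyData conn "screw\t0.01\n"
    putCopyData conn "bolt\t0.05\n"
    n <- putCopyEnd conn
    print n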

I wrote a package for escaping data for COPY FROM [1]. Unfortunately, integration with ToField is problematic, since the Plain constructor has no corresponding COPY representation (a Plain can contain an expression). Also, threadWaitWrite doesn't work on Windows, so it needs a mingw32_HOST_OS guard too.

[1]: http://hackage.haskell.org/package/postgresql-copy-escape
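[For reference, the text format's escaping rules are small: tab separates fields, newline separates rows, backslash introduces escapes, and \N spells NULL. A minimal sketch of the idea — illustrative only, not the package's actual API:]

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as B

-- Escape one field for text-format COPY: backslash-escape the field
-- delimiter, the row terminator, carriage return, and backslash itself.
-- (Illustrative only; not the actual postgresql-copy-escape API.)
escapeCopyField :: B.ByteString -> B.ByteString
escapeCopyField = B.concatMap escape
  where
    escape '\t' = "\\t"
    escape '\n' = "\\n"
    escape '\r' = "\\r"
    escape '\\' = "\\\\"
    escape c    = B.singleton c

-- Render one row, spelling NULL as \N and joining fields with tabs.
encodeCopyRow :: [Maybe B.ByteString] -> B.ByteString
encodeCopyRow fields =
    B.intercalate "\t" (map (maybe "\\N" escapeCopyField) fields)
    `B.snoc` '\n'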

Yeah, eventually adding some level of understanding of the formats would be nice, but honestly my immediate need is a higher-performance (and somewhat specialized) replication solution; I only need a tiny bit of understanding of the data I'm shipping from server to server. But at the same time, a complete solution would need to deal with all three formats and the numerous flavors thereof, not just CSV. I don't doubt that the current FromField/ToField interface isn't particularly well suited to dealing with copy data, though. =)

As for threadWaitWrite, I understand that, but the code is actually dead code at the moment, because postgresql-simple doesn't use libpq connections in asynchronous mode; it's just there for when postgresql-simple starts using async libpq calls. That is something I would like to do, but it would also be a disaster on Windows at the moment. I'm thinking perhaps the most sensible thing would be to reimplement the blocking libpq calls in Haskell using nonblocking libpq calls, and then only use this module on Unix systems. Hopefully that would cut down on the CPP hackery.
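[A rough sketch of that idea, built from the async entry points postgresql-libpq already exposes (sendQuery, consumeInput, isBusy, getResult); error handling is mostly elided:]

import Control.Concurrent (threadWaitRead)
import qualified Data.ByteString as B
import qualified Database.PostgreSQL.LibPQ as PQ

-- A blocking exec rebuilt from libpq's nonblocking primitives: send
-- the query, then alternate between waiting on the socket and
-- consuming input until all results have arrived.
execNonBlocking :: PQ.Connection -> B.ByteString -> IO [PQ.Result]
execNonBlocking conn sql = do
    ok <- PQ.sendQuery conn sql
    if ok then drain [] else fail "sendQuery failed"
  where
    drain acc = do
        busy <- PQ.isBusy conn
        if busy
          then do
            mfd <- PQ.socket conn
            case mfd of
              Nothing -> fail "connection socket lost"
              Just fd -> threadWaitRead fd
            -- a real version would check consumeInput for errors
            _ <- PQ.consumeInput conn
            drain acc
          else do
            mres <- PQ.getResult conn
            case mres of
              Nothing  -> return (reverse acc)
              Just res -> drain (res : acc)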

In response to a proposal to add COPY escaping functionality to libpq, Tom Lane pointed out that "one of the key requirements for anything dealing with COPY is that it be fast" [1]. postgresql-copy-escape as-is provides a faster way to insert data than multi-row INSERT. We don't need to support all of the formats for this. I didn't even implement COPY TO support, since PostgreSQL returns result sets pretty efficiently through the normal interface.

Your case is even simpler: send COPY data from one server to another. This shouldn't require packing/unpacking fields.

[1]: http://www.postgresql.org/message-id/19641.1331821069@sss.pgh.pa.us
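[In that pass-through case, the loop could be as small as this sketch, again assuming the copy interface from the patches; the foo table is hypothetical:]

{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Copy

-- Ship COPY data from one server to another without ever parsing it:
-- whatever chunk COPY OUT hands us is fed verbatim to COPY IN.
-- (The foo table here is hypothetical.)
relay :: Connection -> Connection -> IO ()
relay src dst = do
    copy_ src "COPY foo TO STDOUT"
    copy_ dst "COPY foo FROM STDIN"
    let loop = do
          chunk <- getCopyData src
          case chunk of
            CopyOutRow row -> putCopyData dst row >> loop
            CopyOutDone _  -> () <$ putCopyEnd dst
    loop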
The libpq source may be helpful. Take a look at PQexecFinish in fe-exec.c, which reveals some interesting considerations:

* When PQexec is given multiple SQL commands, it discards all but the last PGresult, but it combines error messages from these results.

* When PQexec encounters a result status of PGRES_COPY_*, it stops returning results, to allow the application to perform the transfer.

One thing to bear in mind is that threadWait* can be interrupted on Unix (that's one motivation for using the async API in the first place). You'll have to figure out what 'exec' should do when interrupted while waiting for a result. Perhaps it should use PQcancel to cancel the request, then continue waiting for results. If interrupted yet again, a subsequent 'exec' will need to call PQgetResult to discard results from the interrupted query before calling PQsendQuery again.
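[A sketch of the second consideration on the Haskell side — drain results but stop at a COPY state; the error-message merging PQexecFinish also does is omitted:]

import qualified Database.PostgreSQL.LibPQ as PQ

-- Mimic PQexecFinish's stop condition: keep collecting results, but
-- stop as soon as the server enters a COPY state, so the application
-- can take over the data transfer.  (Sketch only; assumes input has
-- already been consumed, and omits error-message merging.)
drainUntilCopy :: PQ.Connection -> IO [PQ.Result]
drainUntilCopy conn = go []
  where
    go acc = do
        mres <- PQ.getResult conn
        case mres of
          Nothing  -> return (reverse acc)
          Just res -> do
            st <- PQ.resultStatus res
            case st of
              PQ.CopyIn  -> return (reverse (res : acc))
              PQ.CopyOut -> return (reverse (res : acc))
              _          -> go (res : acc)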

Ok, I stubbed out an approach to implementing an async-based binding to libpq, sans implementation of course:

https://github.com/lpsmith/postgresql-libpq/commit/96c82197b558019de44cc4b5e...

What do you think?
Yeah, I did plan to read the libpq source; in any case, thanks for the comments, they are helpful.

Async exceptions do present a thorny semantic problem, but at least my use cases would basically assume the connection is dead. Didn't we have this same conversation with respect to (timeout (readMVar foo)) and (timeout (readChan foo))? Namely, that you should then treat the resource as dead, because there is some small possibility that you were interrupted after you took the resource but before you actually returned it.

Calling PQcancel and waiting it out seems problematic from the point of view of my use cases as well: namely, detecting a flaky network, and detecting in a timely fashion that the backend was restarted. (Although, come to think of it, the latter case should be quickly detectable without timeouts, so perhaps I should investigate more deeply why my clients are taking so long to reconnect to a restarted server.)

However, perhaps there are other good justifications for attempting more complex behavior along the lines you suggest.
Best,
Leon

Hmm, upon further thought regarding exec's behavior on async exceptions: surely there are use cases both for simple, timely abandonment of the connection, and for attempting to cancel a long-running query while trying to save the connection. For the latter I don't have anything too concrete in mind, but surely it exists.

One obvious solution would be to use two different async exceptions: perhaps a ThreadKilled exception would simply abort, while a Timeout exception would trigger the cancel behavior. But reviewing the source of System.Timeout, the Timeout exception is not exported from the module, which would seem to rule out that behavior... so maybe we need to define a separate Cancel exception.

One question I'd have is whether this type of behavior would compose well: e.g., could you lift it through postgresql-simple with a minimum of fuss, or would the cancel behavior of query_ and execute_ be a little off? Another possible solution would be to define a combinator that deals with Cancel exceptions.

Best,
Leon
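[A minimal sketch of that separate Cancel exception and combinator; all names here are made up, and catching async exceptions this way glosses over masking subtleties:]

import Control.Exception

-- A made-up, distinguished exception meaning "cancel the query but
-- try to keep the connection", as opposed to ThreadKilled's
-- treat-the-connection-as-dead semantics.
data CancelQuery = CancelQuery
  deriving Show

instance Exception CancelQuery

-- A combinator mapping CancelQuery onto some cancel-and-drain action
-- (left abstract here) before re-raising; all other exceptions
-- propagate untouched.
onCancelQuery :: IO () -> IO a -> IO a
onCancelQuery cancelAndDrain action =
    action `catch` \CancelQuery -> do
        cancelAndDrain
        throwIO CancelQuery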
participants (2)
- Joey Adams
- Leon Smith