Request for Feedback on COPY IN/OUT support for postgresql-simple

Although the postgresql-libpq binding has supported both COPY IN and COPY OUT for some time now, this functionality wasn't available from postgresql-simple without mucking around in the Internal module and calling libpq functions on your own. I just now fixed that. The interface I went with ultimately mimics the libpq interface rather closely, albeit with many simplifications. I did consider offering fold-like combinators, and while they would certainly be much more convenient for some use cases, they also introduce a number of complications, such as dealing with exceptions.

If anybody's interested in reviewing these latest patches, I'd appreciate it. The one thing I'm most unsure about is exactly which exceptions should be raised under various conditions.

https://github.com/lpsmith/postgresql-simple/commit/491247f687fcbfd30c8b0f50...
https://github.com/lpsmith/postgresql-simple/commit/4cfcdbee22a0cb1fb7985988...

Best,
Leon
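[For a sense of the interface's shape, here is a minimal COPY IN sketch, assuming the names from the patches above (copy_, putCopyData, putCopyEnd); the widgets table is hypothetical:]

{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Copy

-- Feed two rows of text-format COPY data to the server, then end the
-- transfer; putCopyEnd returns the server's count of rows copied.
-- (The widgets table here is hypothetical.)
copyInExample :: Connection -> IO ()
copyInExample conn = do
    copy_ conn "COPY widgets (name, weight) FROM STDIN"
    putCopyData conn "screw\t0.01\n"
    putCopyData conn "bolt\t0.05\n"
    n <- putCopyEnd conn
    print n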

I wrote a package for escaping data for COPY FROM [1]. Unfortunately, integration with ToField is problematic, since the Plain constructor has no corresponding COPY representation (a Plain can contain an expression). Also, threadWaitWrite doesn't work on Windows, so it needs a mingw32_HOST_OS guard too.

[1]: http://hackage.haskell.org/package/postgresql-copy-escape
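[For reference, the text format's escaping rules are small: tab separates fields, newline separates rows, backslash introduces escapes, and \N spells NULL. A minimal sketch of the idea — illustrative only, not the package's actual API:]

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as B

-- Escape one field for text-format COPY: backslash-escape the field
-- delimiter, the row terminator, carriage return, and backslash itself.
-- (Illustrative only; not the actual postgresql-copy-escape API.)
escapeCopyField :: B.ByteString -> B.ByteString
escapeCopyField = B.concatMap escape
  where
    escape '\t' = "\\t"
    escape '\n' = "\\n"
    escape '\r' = "\\r"
    escape '\\' = "\\\\"
    escape c    = B.singleton c

-- Render one row, spelling NULL as \N and joining fields with tabs.
encodeCopyRow :: [Maybe B.ByteString] -> B.ByteString
encodeCopyRow fields =
    B.intercalate "\t" (map (maybe "\\N" escapeCopyField) fields)
    `B.snoc` '\n'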

Yeah, eventually adding some level of understanding of the formats would be nice, but honestly my immediate need is a higher-performance (and somewhat specialized) replication solution; I only need a tiny bit of understanding of the data I'm shipping from server to server. But at the same time, a complete solution would need to deal with all three formats and the numerous flavors thereof, not just CSV. I don't doubt that the current FromField/ToField interface isn't particularly well suited to dealing with copy data, though. =)

As for threadWaitWrite, I understand that, but the code is actually dead code at the moment, because postgresql-simple doesn't use libpq connections in asynchronous mode; it's just there for when postgresql-simple starts using async libpq calls. That is something I would like to do, but it would also be a disaster on Windows at the moment. I'm thinking perhaps the most sensible thing would be to reimplement the blocking libpq calls in Haskell using nonblocking libpq calls, and then only use this module on Unix systems. Hopefully that would cut down on the CPP hackery.
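[A rough sketch of that idea, built from the async entry points postgresql-libpq already exposes (sendQuery, consumeInput, isBusy, getResult); error handling is mostly elided:]

import Control.Concurrent (threadWaitRead)
import qualified Data.ByteString as B
import qualified Database.PostgreSQL.LibPQ as PQ

-- A blocking exec rebuilt from libpq's nonblocking primitives: send
-- the query, then alternate between waiting on the socket and
-- consuming input until all results have arrived.
execNonBlocking :: PQ.Connection -> B.ByteString -> IO [PQ.Result]
execNonBlocking conn sql = do
    ok <- PQ.sendQuery conn sql
    if ok then drain [] else fail "sendQuery failed"
  where
    drain acc = do
        busy <- PQ.isBusy conn
        if busy
          then do
            mfd <- PQ.socket conn
            case mfd of
              Nothing -> fail "connection socket lost"
              Just fd -> threadWaitRead fd
            -- a real version would check consumeInput for errors
            _ <- PQ.consumeInput conn
            drain acc
          else do
            mres <- PQ.getResult conn
            case mres of
              Nothing  -> return (reverse acc)
              Just res -> drain (res : acc)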

In response to a proposal to add COPY escaping functionality to libpq, Tom Lane pointed out that "one of the key requirements for anything dealing with COPY is that it be fast" [1]. postgresql-copy-escape as-is provides a faster way to insert data than multi-row INSERT. We don't need to support all of the formats for this. I didn't even implement COPY TO support, since PostgreSQL returns result sets pretty efficiently through the normal interface.

Your case is even simpler: send COPY data from one server to another. This shouldn't require packing/unpacking fields.

[1]: http://www.postgresql.org/message-id/19641.1331821069@sss.pgh.pa.us
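[In that pass-through case, the loop could be as small as this sketch, again assuming the copy interface from the patches; the foo table is hypothetical:]

{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Copy

-- Ship COPY data from one server to another without ever parsing it:
-- whatever chunk COPY OUT hands us is fed verbatim to COPY IN.
-- (The foo table here is hypothetical.)
relay :: Connection -> Connection -> IO ()
relay src dst = do
    copy_ src "COPY foo TO STDOUT"
    copy_ dst "COPY foo FROM STDIN"
    let loop = do
          chunk <- getCopyData src
          case chunk of
            CopyOutRow row -> putCopyData dst row >> loop
            CopyOutDone _  -> () <$ putCopyEnd dst
    loop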
The libpq source may be helpful. Take a look at PQexecFinish in fe-exec.c, which reveals some interesting considerations:

* When PQexec is given multiple SQL commands, it discards all but the last PGresult, but it combines error messages from these results.

* When PQexec encounters a result status of PGRES_COPY_*, it stops returning results, to allow the application to perform the transfer.

One thing to bear in mind is that threadWait* can be interrupted on Unix (that's one motivation for using the async API in the first place). You'll have to figure out what 'exec' should do when interrupted while waiting for a result. Perhaps it should use PQcancel to cancel the request, then continue waiting for results. If interrupted yet again, a subsequent 'exec' will need to call PQgetResult to discard results from the interrupted query before calling PQsendQuery again.
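[A sketch of the second consideration on the Haskell side — drain results but stop at a COPY state; the error-message merging PQexecFinish also does is omitted:]

import qualified Database.PostgreSQL.LibPQ as PQ

-- Mimic PQexecFinish's stop condition: keep collecting results, but
-- stop as soon as the server enters a COPY state, so the application
-- can take over the data transfer.  (Sketch only; assumes input has
-- already been consumed, and omits error-message merging.)
drainUntilCopy :: PQ.Connection -> IO [PQ.Result]
drainUntilCopy conn = go []
  where
    go acc = do
        mres <- PQ.getResult conn
        case mres of
          Nothing  -> return (reverse acc)
          Just res -> do
            st <- PQ.resultStatus res
            case st of
              PQ.CopyIn  -> return (reverse (res : acc))
              PQ.CopyOut -> return (reverse (res : acc))
              _          -> go (res : acc)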

Ok, I stubbed out an approach to implementing an async-based binding to libpq, sans implementation of course:

https://github.com/lpsmith/postgresql-libpq/commit/96c82197b558019de44cc4b5e...

What do you think?
Yeah, I did plan to read the libpq source; in any case, thanks for the comments, they are helpful.

Async exceptions do present a thorny semantic problem, but at least my use cases would basically assume the connection is dead. Didn't we have this same conversation with respect to (timeout (readMVar foo)) and (timeout (readChan foo))? Namely, that you should then treat the resource as dead, because there is some small possibility that you were interrupted after you took the resource but before you actually returned it.

Calling PQcancel and waiting it out seems problematic from the point of view of my use cases as well: namely, detecting a flaky network, and detecting in a timely fashion that the backend was restarted. (Although, come to think of it, the latter case should be quickly detectable without timeouts, so perhaps I should investigate more deeply why my clients are taking so long to reconnect to a restarted server.)

However, perhaps there are other good justifications for attempting more complex behavior along the lines you suggest.
Best,
Leon

Hmm, upon further thought regarding exec's behavior on async exceptions: surely there are use cases both for simple, timely abandonment of the connection, and for attempting to cancel a long-running query while trying to save the connection. For the latter I don't have anything too concrete in mind, but surely it exists.

One obvious solution would be to use two different async exceptions: perhaps a ThreadKilled exception would simply abort, while a Timeout exception would trigger the cancel behavior. But reviewing the source of System.Timeout, the Timeout exception is not exported from the module, which would seem to rule out that behavior... so maybe we need to define a separate Cancel exception.

One question I'd have is whether this type of behavior would compose well: e.g., could you lift it through postgresql-simple with a minimum of fuss, or would the cancel behavior of query_ and execute_ be a little off? Another possible solution would be to define a combinator that deals with Cancel exceptions.

Best,
Leon
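[A minimal sketch of that separate Cancel exception and combinator; all names here are made up, and catching async exceptions this way glosses over masking subtleties:]

import Control.Exception

-- A made-up, distinguished exception meaning "cancel the query but
-- try to keep the connection", as opposed to ThreadKilled's
-- treat-the-connection-as-dead semantics.
data CancelQuery = CancelQuery
  deriving Show

instance Exception CancelQuery

-- A combinator mapping CancelQuery onto some cancel-and-drain action
-- (left abstract here) before re-raising; all other exceptions
-- propagate untouched.
onCancelQuery :: IO () -> IO a -> IO a
onCancelQuery cancelAndDrain action =
    action `catch` \CancelQuery -> do
        cancelAndDrain
        throwIO CancelQuery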
participants (2)
- Joey Adams
- Leon Smith