streaming translation using monads

I am working on a query language translator, and although I feel that a monadic formulation would work well for this application, I've stumbled on a number of questions and difficulties that I thought the knowledgeable people here might be able to help me with. As a translator, there's a source language and a target language, and both of these are specified as a grammar of nested expressions. My first thought was to formulate the target language as an embedded language where each clause expression is represented as a monad operation, so that the bind operator can join the pieces together, e.g.: (clause1, clause2 ...) could be specified as an embedded language as: clause1 >>= \ v1 -> clause2 >>= \ v2 -> ... However, each of the clauses is actually an output routine to send the expression that it denotes to a remote server, and a parser for receiving the results. Since a clause is really a pair of operations, it doesn't seem possible to formulate a monad that will compose all the output routines together and compose all the input routines together in one shot. (Note that the variables in the above code (v1, v2) represent inputs to be received from the remote server -- all outputs are packaged into the clause expressions themselves and are effectively literals.) A naive formulation of a monad to implement the above as "output -> input v" might appear to work, but has the ill-effect of interleaving the output and input statements for each clause rather than composing something that can send the entire request, and then receive the entire result. This forces me to use "output * input v" as the type of each clause expression, but now it is not possible to write a monad with a bind operation that will compose pieces in terms of input variables. Instead I have had to resort to using a set of combinators that thread a continuation function through each clause and accumulate inputs as they are received: clause1 ==> (\ k v1 -> k (trans1 v1)) ++ clause2 ==> (\ k v2 -> k (trans2 v2)) ++ ... This threading is necessary in that I want to stream the translation back to the client requesting the translation rather than building up the (possibly large) results in memory. This formulation has proven to be quite cumbersome in practice, as the resulting continuation types reflect the depth-first traversal of the nested query expressions, and type errors can be quite unintuitive. (It is quite interesting though that each continuation/transformation function can receive not only receive the input from the immediately preceding clause, but from any of the preceding clauses, and also return more or fewer results. However getting anything wrong can be very problematic in that it can lead to either downstream *or* upstream errors depending on how the clauses are composed into an overall query expression.) An alternative to all this would be to use an algebraic datatype to specify the target language (with separate routines for the output and input operations), but that would appear to require another sum type to express the values to be received. I'd like to avoid that if possible since the projection of those values back into my program could lead to dynamic type errors, and also causes seemingly needless memory allocations. There must be another technique for this sort of streaming translation out there... I welcome any suggestions you might have! Warren Harris

On 2008 Nov 18, at 21:23, Warren Harris wrote:
However, each of the clauses is actually an output routine to send the expression that it denotes to a remote server, and a parser for receiving the results. Since a clause is really a pair of operations, it doesn't seem possible to formulate a monad that will compose all the output routines together and compose all the input routines together in one shot. (Note that the variables in the above code (v1, v2) represent inputs to be received from the remote server -- all outputs are packaged into the clause expressions themselves and are effectively literals.)
Have you considered using arrows instead? -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

On Nov 18, 2008, at 6:53 PM, Brandon S. Allbery KF8NH wrote:
On 2008 Nov 18, at 21:23, Warren Harris wrote:
However, each of the clauses is actually an output routine to send the expression that it denotes to a remote server, and a parser for receiving the results. Since a clause is really a pair of operations, it doesn't seem possible to formulate a monad that will compose all the output routines together and compose all the input routines together in one shot. (Note that the variables in the above code (v1, v2) represent inputs to be received from the remote server -- all outputs are packaged into the clause expressions themselves and are effectively literals.)
Have you considered using arrows instead?
I'm not that familiar with arrows, but have just looked at some papers by Hughes and Paterson. It appears that unlike monads, the right-hand side of the arrow operator is not a function, and as such could be used to define a pairwise stream sequencing operator for my particular case. Are there any specific references that show something like this, e.g. for implementing network protocols? (I'm still trying to get my head around the basics of arrows.) Thanks, Warren

On Tue, Nov 18, 2008 at 6:23 PM, Warren Harris
I am working on a query language translator, and although I feel that a monadic formulation would work well for this application, I've stumbled on a number of questions and difficulties that I thought the knowledgeable people here might be able to help me with.
HaskellDB takes a similar approach. It's "Query" monad allows you to build queries which are then translated to SQL by a "runQuery" function. Could your bind operation collect the 'input' expressions and then output them all at once via a "runTranslation" function? Do you have to do in-place translation? Justin

On Nov 19, 2008, at 9:11 AM, Justin Bailey wrote:
On Tue, Nov 18, 2008 at 6:23 PM, Warren Harris
wrote: I am working on a query language translator, and although I feel that a monadic formulation would work well for this application, I've stumbled on a number of questions and difficulties that I thought the knowledgeable people here might be able to help me with.
HaskellDB takes a similar approach. It's "Query" monad allows you to build queries which are then translated to SQL by a "runQuery" function. Could your bind operation collect the 'input' expressions and then output them all at once via a "runTranslation" function?
Thanks for pointing this out. I had looked at Leigen & Meijer's paper a while back which describes this, and in one branch of my code taken a very similar approach by using an abstract datatype with phantom types to represent the target language. However, I had concluded (perhaps incorrectly) that the technique was not amenable to streaming the results back to the client without first constructing an in-memory list with a universal type representing the values returned. Now perhaps the in-memory list part was a bad conclusion since the queries can be decorated with translation functions capable of streaming the results out to another channel. However, the use of a universal type for the values would still seem to be required since there is no way to implement type-indexed values when the queries themselves are expressed as an abstract datatype rather than as functions. Am I overlooking something? As an aside, maybe I've been unduly concerned with wrapping the raw values in a universal type since I've found that in most cases for my target language I must deal with the possibility of null results, and must wrap most values in a Maybe type anyway. So I must pay the allocation cost in most cases... and with phantom types perhaps I can avoid the possibility of dynamic type errors due to unexpected data or ill-formed target queries.
Do you have to do in-place translation?
Not sure I follow. Warren

Warren Harris
However, the use of a universal type for the values would still seem to be required since there is no way to implement type-indexed values when the queries themselves are expressed as an abstract datatype rather than as functions. Am I overlooking something?
You might find inspiration in the fact that printf and scanf can be expressed in ML/Haskell without any fancy type-system features. http://www.brics.dk/RS/98/12/ http://cs.nyu.edu/zheyang/papers/YangZ--ICFP98.html http://article.gmane.org/gmane.comp.lang.haskell.general/16409 http://www.itu.dk/people/mir/typesafepatterns.pdf Credits to: Olivier Danvy, Zhe Yang, Kenichi Asai, Oleg Kiselyov, Morten Rhiger. -- Edit this signature at http://www.digitas.harvard.edu/cgi-bin/ken/sig 2008-11-20 Universal Children's Day http://unicef.org/ 1948-12-10 Universal Declaration of Human Rights http://everyhumanhasrights.org

On Wed, Nov 19, 2008 at 11:50 AM, Warren Harris
Now perhaps the in-memory list part was a bad conclusion since the queries can be decorated with translation functions capable of streaming the results out to another channel. However, the use of a universal type for the values would still seem to be required since there is no way to implement type-indexed values when the queries themselves are expressed as an abstract datatype rather than as functions. Am I overlooking something?
If the type of the input determines the type of the output, and the type of the input can be determined statically, then I think you are overlooking this technique. But if your application requires that each input expression be sent to the remote client before type information can be determined, then you are right. In the case of HaskellDB, each operator/term in the Query monad enriches the type information available. Of course, that enrichment happens at compile time and occurs via type inference. For example, assuming two tables T1 and T2, with columns T1col and T2col, this expression: simpleQuery = do t1 <- table T1 t2 <- table T2 restrict ( .. some expression ...) project (t1 ! T1col1 # t2 ! T2col1) Gives a type similar to Query (RecCons T1Col1 Int (RecCons T2 Col2 Int RecNil)). RecCons and RecNil are types that allow a list to be built at the type level. The list carries the column names and types in the resulting projection. The type information is used to ensure queries only refer to columns that exist, datatypes are compared sensibly, etc. If in your case you can build that kind of structure based purely on the "input language" operators/terms, then it seems you could build the entire "output expression" in one shot. But again, if input and output have to be interleaved then I think you are stuck. Justin

On Nov 19, 2008, at 12:19 PM, Justin Bailey wrote:
On Wed, Nov 19, 2008 at 11:50 AM, Warren Harris
wrote: Now perhaps the in-memory list part was a bad conclusion since the queries can be decorated with translation functions capable of streaming the results out to another channel. However, the use of a universal type for the values would still seem to be required since there is no way to implement type-indexed values when the queries themselves are expressed as an abstract datatype rather than as functions. Am I overlooking something?
If the type of the input determines the type of the output, and the type of the input can be determined statically, then I think you are overlooking this technique. But if your application requires that each input expression be sent to the remote client before type information can be determined, then you are right.
In the case of HaskellDB, each operator/term in the Query monad enriches the type information available. Of course, that enrichment happens at compile time and occurs via type inference. For example, assuming two tables T1 and T2, with columns T1col and T2col, this expression:
simpleQuery = do t1 <- table T1 t2 <- table T2 restrict ( .. some expression ...) project (t1 ! T1col1 # t2 ! T2col1)
Gives a type similar to Query (RecCons T1Col1 Int (RecCons T2 Col2 Int RecNil)). RecCons and RecNil are types that allow a list to be built at the type level. The list carries the column names and types in the resulting projection. The type information is used to ensure queries only refer to columns that exist, datatypes are compared sensibly, etc. If in your case you can build that kind of structure based purely on the "input language" operators/terms, then it seems you could build the entire "output expression" in one shot. But again, if input and output have to be interleaved then I think you are stuck.
What you describe is very similar to my current implementation, except I'm using a continuation function to receive the results input from the remote server rather than creating a list. The continuation's type (a curried function) is analogous to a list at the type level, if I understand correctly. However, since what I would like to do is stream the results in from one server and out to another (translating as I go), I really need some way to hook my translation in between each primitive field read from the input. So inside the body of the 'project' call you've given above, I might like to say something like: v1 <- t1 ! T1col1; v2 <- t2 ! T2col1; send client (trans v1 v2); return () However, since this input-receiving expression is now expressed as monad, it doesn't seem possible to use it to simultaneously formulate the output expressions that must be sent to the remote server to cause T1col1 and T2col1 to be requested. In my particular application, the remote server not only handles simple directives to retrieve sequences of primitives like T1col1, T2col1, but also higher-order formulations of them as lists and tuples. So the server may be sent a request like "(T1col1, (list (T2col1, T2col2)))" meaning "request the tuple of T1col1 followed by a (possibly long) sequence of T2col1, T2col2 pairs" and I want to be able to translate the incoming result stream without holding the entire thing in memory. Obviously I can do this by having one operation to generate the outbound query from a target language expression (implemented with an algebraic datatype), and another operation to handle the results based on the same expression, but it seems like some sort of formulation should be possible that allows both operations to be expressed simultaneously. In fact, I get that with the continuation-oriented combinators I mentioned earlier, but in practice I've found them to be very hard to work with. Just to make it more concrete, the above example might be expressed with my continuation-based combinators as: T1col1 ==> (\ k t1 -> send client (trans1 t1); k) ++ list (T2col1 ++ T2col2 ==> (\ k t2c1 t2c2 -> send client (trans2 t2c1 t2c2); k))) + + ... Here, the first function receives the value read from T1col1, translates it and send it to the originator of the request. This function is substituted for the an overall continuation -- receives the overall continuation as its first argument, k, and returns it at the end without passing any values to it. This effectively consumes the value t1. The second function is similar, and is called for each pair received from the list. The combinator '++', primitive directives like T1col1, and higher-order directives like 'list' are responsible for dealing with the overall expression grammar (i/o of delimiters, etc) and folding the overall continuation through the results as they are received. Sorry for being so long-winded here... I really appreciate your input. Warren
participants (4)
-
Brandon S. Allbery KF8NH
-
Chung-chieh Shan
-
Justin Bailey
-
Warren Harris