
In summary, I think that the dependencies on the pagemaster are not adequate: it mixes too many concerns that should be separated.
True, but then that's even more miscellaneous bits and pieces to carry around. I guess what makes me uncomfortable is that when I'm writing down a function like process1 (not its real name, as you might imagine), I want to concentrate on the high-level data flow and the steps of the transformation. I don't want to have to expose all of the little bits and pieces that aren't really relevant to the high-level picture. Obviously, in the definitions of the functions that make up process1, those details become important, but all of that should be internal to those function definitions.
Yes, we want to get rid of the bits and pieces. Your actual code is between two extremes that both manage to get rid of them. One extreme is the "universal" structure like you already noted:
Alternatively, I can wrap all of the state up into a single universal structure that holds everything I will ever need at every step, but doing so seems to me to fly in the face of strong typing; at the early stages of processing, the structure will have "holes" in it that don't contain useful values and shouldn't be accessed.
Currently, (pagemaster) has tendencies to become such a universal beast. The other extreme is the one I favor: the whole pipeline is expressible as a chain of function compositions via (.). One should be able to write

    process = rectangles2pages . questions2rectangles

This means that (rectangles2pages) comes from a (self-written) layout library and that (questions2rectangles) comes from a question formatting library, and both concerns are completely separated from each other. If such a factorization can be achieved, you get clear semantics, bug reduction and code reuse for free.

Of course, the main problem is: the factorization does not arise by coding, only by thinking. Often the situation is as follows, and I encounter it myself again and again: one starts with an abstraction along function composition, but it quickly turns out, as you noted, that "there are some complicated reasons why that doesn't work". To get working code, one creates some miniature "universal structure" that incorporates all the missing data that makes the thing work. After some time, the different concerns get more and more intertwined, and soon every piece of data depends on everything else until the code finally becomes unmaintainable: it has become "monolithic".

What can be done? The original problem was that the solutions to the originally separated concerns (layout library and questions2rectangles) simply were not powerful enough, not general enough. The remedy is to separately increase the power and expressiveness of both libraries until the intended result can be achieved by plugging them together. Admittedly, this is not an easy task. But the outcome is rewarding: by thinking about the often ill-specified problems, one understands them much better, and it most often turns out that some implementation details were wrong and so on.
In contrast, the ad-hoc approach that introduces miniature "universal structures" does not make the libraries more general, but tries to fit them together by appealing to the special case, the special problem at hand. In my experience, this only makes things worse. The point is: you have to implement the functionality anyway, so you may as well grab some free generalizations and implement it once and for all in an independent and reusable library.

I think that the following toy example (inspired by a discussion from this mailing list) shows how to break intertwined data dependencies:

    foo :: Keyvalue -> (Blueprint, Map') -> (Blueprint', Map)
    foo x (bp,m') = (insert x bp, uninsert x bp m')

The type for (foo) is much too general: it says that foo may mix the (Blueprint) and the (Map') to generate (Blueprint'). But this is not the case; the type for foo introduces data dependencies that are not present at all. A better version would be

    foo' :: Keyvalue -> Blueprint -> (Blueprint', Map' -> Map)
    foo' x bp = (insert x bp, \m' -> uninsert x bp m')

Here, it is clear that the resulting (Map) depends on (Blueprint) and (Map'), but that the resulting (Blueprint') does not depend on (Map'). The point relevant to your problem is that one can use (foo') in more compositional ways than (foo) simply because the type allows it. For instance, you can recover (insert) from (foo'):

    insert :: Keyvalue -> Blueprint -> Blueprint'
    insert x bp = fst $ foo' x bp

but this is impossible with (foo).*

In the original problem, the type signature for (foo') was the best one could get. But here, the best type signature is of course

    foo'' :: ( Keyvalue -> Blueprint -> Blueprint'
             , Keyvalue -> Blueprint -> Map' -> Map )
    foo'' = (insert, uninsert)

because in essence, (foo) is just the pair (insert, uninsert).
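To make the toy example concrete, here is a minimal sketch that compiles on its own. The type synonyms and the bodies of (insert) and (uninsert) are hypothetical placeholders chosen only so that the code runs; only the shape of the signature of (foo') is taken from the discussion above.

```haskell
import qualified Data.Map as M

-- Hypothetical concrete types, assumed only for illustration.
type Keyvalue   = (String, Int)
type Blueprint  = [String]
type Blueprint' = [String]
type Map'       = M.Map String Int
type Map        = M.Map String Int

-- Placeholder: record the key in the blueprint.
insert :: Keyvalue -> Blueprint -> Blueprint'
insert (k, _) bp = k : bp

-- Placeholder: add the binding unless the old blueprint already had the key.
uninsert :: Keyvalue -> Blueprint -> Map' -> Map
uninsert (k, v) bp m'
  | k `elem` bp = m'
  | otherwise   = M.insert k v m'

-- The result type states that Blueprint' cannot depend on Map'.
foo' :: Keyvalue -> Blueprint -> (Blueprint', Map' -> Map)
foo' x bp = (insert x bp, uninsert x bp)

-- The blueprint is available before any Map' exists ...
blueprintOnly :: Keyvalue -> Blueprint -> Blueprint'
blueprintOnly x bp = fst (foo' x bp)

-- ... and the map transformation can be applied whenever a Map' turns up.
finishLater :: Keyvalue -> Blueprint -> (Map' -> Map)
finishLater x bp = snd (foo' x bp)
```

With the tupled (foo), neither (blueprintOnly) nor (finishLater) could be written without supplying a dummy for the missing half of the argument.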
One moral of the above example is that functions returned as results (as in the signature of (foo')) are your friends when tackling the problem of making libraries more expressive while keeping them independent.

In summary, I think that your question about the style of pipelines is rooted in far deeper questions, and I think that the "high level only" wish is an illusion: you simply have to write down every dependency you introduce, there is no way around this law of nature. But IMHO and compared to imperative languages, Haskell is the first programming language that really offers the possibility to specify data dependencies exactly as they are, because Haskell is pure, higher order and has a powerful type system.

Concerning your code, I wish to thank you for its detailed explanation. The post already got quite long, so I'm adding only some remarks. Of course, they are my personal opinion and you don't need to incorporate or comment on them, because it's your code after all.
process :: Item -> MediaKind -> MediaSize -> Language -> SFO The reason it's just "Item" is that it can be a number of different things. It can be a full-blown questionnaire, composed of a number of questions, but it could also be just one question (sometimes the users want to see what a question layout looks like before okaying its inclusion into the questionnaire stream). The functions are overloaded to handle the various different kinds of Items.
If there are only the cases of some single question or a full questionnaire, you could always do

    blowup :: SingleQuestion -> FullQuestionaire

    preview = process (blowup a_question) ...

In general, I think that it's the task of (process) to inspect (Item) and to plug together the right steps. For instance, a single question does not need page breaks or similar. I would avoid overloading the (load*) functions and (paginate) on (Item).
A pagemaster defines the sizes and locations of the various parts of the page (top and bottom margins, left and right sidebars, body region), as well as the content of everything except the body region (which is where the questions go). [...]
The pagemaster also contains a couple of other bits of information that don't fit neatly anywhere else (discussed below).
As you may guess, I'd throw out these other bits from (pagemaster) and reserve it for arranging rectangles on a page only. I suspect that it can be fully absorbed by (paginate) afterwards, since (buildLayout) does not use it (?).
Maybe one should write filter willBeDisplayedQuestion $ instead, but I think the name 'stripUndisplayedQuestions' says it all.
Sure. "stripUndisplayedQuestions" is indeed just a simple filter.
Writing (filter willBeDisplayedQuestion) has the minor advantage that it is absolutely clear that this step in the pipeline will only filter stuff. The name (stripUndisplayedQuestions) suggests that, too, but names are no proofs and the type does not prove it either in this case.
appendEndQuestions :: Item -> Pagemaster -> [Question] -> [Question] End questions are questions that are inserted automagically at the end of (almost) every questionnaire. [...] It may seem like it would be better stored in the questionnaire itself, but there are some complicated reasons why that doesn't work. Obviously, it would be possible to rearrange the data after it is retrieved from the database, although I'm not sure that there would be a net simplification.
I'd go for a rearrange because my experience is that while taking over foreign data structures eases import, it most often makes the actual algorithm extremely cumbersome. The algorithm dictates the data structure.

Btw, the special place "end" suggests that the "question markup language" does not incorporate all of: "conditional questions", "question groups", "group templates"? Otherwise, I'd just let the user insert

    <if media="print">
      <template-instance ref="endquestions.xml" />
    </if>

at the end of every questionnaire. If you use such a tiny macro language (preferably with sane and simple semantics), you can actually merge (stripUndisplayedQuestions) and (appendEndQuestions) into a function (evalMacros) without much fuss. I think that this will even make the code simpler. Numbering and cross-references could be implemented as macro expansion, too. Perhaps it is also advisable to do (validateQuestionContent) before macro expansion. And, best of all, the macro language is completely independent of the question formatting task; you can easily outsource this into a library.
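A minimal sketch of what such an (evalMacros) could look like, assuming a deliberately tiny macro language. The types (Media), (Markup) and (Templates) and the constructor names are hypothetical and not taken from the actual code; (Question) is reduced to a placeholder.

```haskell
-- Hypothetical media kinds and a placeholder question type.
data Media = Print | Screen
  deriving (Eq, Show)

newtype Question = Question String
  deriving (Eq, Show)

-- A tiny macro language: plain questions, media conditionals, template references.
data Markup
  = Plain Question          -- an ordinary question
  | IfMedia Media [Markup]  -- contents are kept only for the given media kind
  | Template String         -- reference to a named group of markup
  deriving (Eq, Show)

-- Named templates, e.g. the end questions (assumed to be non-recursive).
type Templates = [(String, [Markup])]

-- Expand conditionals and template references into a flat question list.
-- Unknown template names expand to nothing in this sketch.
evalMacros :: Templates -> Media -> [Markup] -> [Question]
evalMacros templates media = concatMap eval
  where
    eval (Plain q) = [q]
    eval (IfMedia m items)
      | m == media = concatMap eval items
      | otherwise  = []
    eval (Template name) = maybe [] (concatMap eval) (lookup name templates)
```

Here stripping undisplayed questions and appending end questions are both just cases of (eval), and nothing in the module knows about layout.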
coalesceParentedQuestions :: [Question] -> [Question] [...] Some questions are composed of multiple sub-questions that are treated as separate questions in the database. Because the people who created and maintain the database have difficulty fully grasping the concept of trees (or hierarchies in general, actually), I have to jump through a few hoops here and there to massage the data into something meaningful.
While it's true that a parent question looks superficially like a tree of child questions, there's more to it than that; the visual layout of the parent question is not generated by a simple traversal over its children, for example. So, for all of the processing that follows, a parent question (one with child questions) looks just like any other question, and any parent question-specific details remain hidden inside.
Again, I'd say that the algorithm and now more than ever the meaning dictates the data structure. Assuming that processing children of different parents is independent and that processing children of the same parent is *not* independent, I'd group "families" right together in a data structure. Whether it's a simple traversal (I interpret this as "independent"?) or not, at some point you have to mess with the whole group at once anyway, so you can put it together right now.
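As a rough sketch of grouping "families" right after import, assuming the database delivers flat rows with an optional parent id. The record (Row), its fields and (Family) are hypothetical names invented for this illustration.

```haskell
-- Hypothetical flat row as it might come out of the database.
data Row = Row
  { rowId     :: Int
  , rowParent :: Maybe Int  -- Nothing for top-level questions
  , rowText   :: String
  } deriving (Eq, Show)

-- A parent question together with all of its children.
data Family = Family
  { parent   :: Row
  , children :: [Row]
  } deriving (Eq, Show)

-- Collect each top-level row with the rows that name it as parent.
-- Order of parents and of children follows the input order.
-- Quadratic in the number of rows, which is fine for a sketch.
groupFamilies :: [Row] -> [Family]
groupFamilies rows =
  [ Family p [ c | c <- rows, rowParent c == Just (rowId p) ]
  | p <- rows
  , rowParent p == Nothing
  ]
```

Once the data has this shape, every later step that has to treat a family as a whole can be a plain (map) over [Family].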
validateQuestionContent :: [Question] -> [Question] Uh, I think the type is plain wrong. Doesn't the name suggest 'Question -> Bool' and a fatal error when a question content is invalid?
No. The idea is to never fail to assemble the questionnaire. If there is a question with invalid content, then it is replaced by a dummy question [...]
Ah, of course you are right, I didn't think of enhanced error processing. I guess that (validateQuestionContent) is not a filter, because you have to check "non-local" parent-child relations as well? If so, then I suggest grouping them beforehand to make it a filter.
(numberedQuestions,questionCategories) = numberQuestions pagemaster questions;
Another piece of miscellaneous information contained within the pagemaster is the starting question number.
You can still automatically "number" questions relative to a first number by overloading the (Num) class:

    newtype RelativeInteger = RI { unRI :: Integer -> Integer }

    instance Num RelativeInteger where ...

    mkAbsolute :: Integer -> RelativeInteger -> Integer
    mkAbsolute pointOfReference relint = unRI relint pointOfReference
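One possible way to fill in the elided instance is to lift all operations pointwise over the point of reference. This is an assumption on my part for the sake of a runnable sketch; (firstNumber) and (questionNumbers) are hypothetical helper names.

```haskell
newtype RelativeInteger = RI { unRI :: Integer -> Integer }

-- Pointwise lifting: every operation waits for the point of reference.
instance Num RelativeInteger where
  RI f + RI g   = RI (\r -> f r + g r)
  RI f - RI g   = RI (\r -> f r - g r)
  RI f * RI g   = RI (\r -> f r * g r)
  negate (RI f) = RI (negate . f)
  abs (RI f)    = RI (abs . f)
  signum (RI f) = RI (signum . f)
  fromInteger n = RI (const n)  -- literals are plain offsets

mkAbsolute :: Integer -> RelativeInteger -> Integer
mkAbsolute pointOfReference relint = unRI relint pointOfReference

-- The number of the first question, whatever it will turn out to be.
firstNumber :: RelativeInteger
firstNumber = RI id

-- Question numbers expressed relative to the first one.
questionNumbers :: [RelativeInteger]
questionNumbers = [ firstNumber + fromInteger k | k <- [0 ..] ]
```

The numbering step then no longer needs the pagemaster; the starting number is supplied once at the very end, e.g. `map (mkAbsolute start) numbers`.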
(Some questionnaires start with a question number other than 1 because there is a post-processing step where various "front ends" are pasted onto variable "back ends"--another example of where a hierarchical approach would have made more sense, but couldn't be adopted because the database people couldn't cope.)
Uh, that doesn't sound good. I assume that the post-processing is not implemented in Haskell? Otherwise, you could incorporate this stuff into (process) and choose suitable interfaces. IMHO, dealing with some modestly expressive interface which still only offers medium abstraction (like object orientation) is a pain in a type system as powerful as Haskell's.
bands' = resolveCrossReferences bands questionCategories;
Questions are cross-referenced by question number. For example, question 4 might be in the "Sales" category, while question 22 might be "Detailed Sales." The last item of question 22 might be "Total; should equal the value reported in (4)." In order to make the layouts as reusable as possible, rather than hard-coding "(4)" in that last item in (22), there is a tag that looks something like this:
<text>Total; should equal the value reported in <question-ref category="Sales"/>.</text>
Fine, though I don't see exactly why this is done after the questions have been transformed to printable things but before they are distributed across pages. So the references cannot refer to page numbers, yet must be processed after transforming questions to rectangles?
groupedBands = groupBands bands';
(can't guess on that)
In order to implement widow/orphan control, not every band is allowed to start a new page ("keep with previous" and "keep with next," in effect). Before being handed off to the paginator, the bands are grouped so that each group of bands begins with a band that _is_ allowed to start a page, followed by the next n bands that aren't allowed to start a page. Each grouped band is then treated by the paginator as an indivisible entity. (At this point, the grouped bands could be coalesced into single bands, but doing so adds a bit of unnecessary overhead to the rendering phase.)
Maybe (paginate) can be given a type along the lines of

    paginate :: Rectangle a => [a] -> Pages a

and perhaps you could merge several bands into a single rectangle simply by saying

    instance Rectangle [Band] where ...

To conclude, I think that (process) can be roughly factorized as follows:

    process = buildPages . questions2rectangles . expandMacros

Now, you get 2/3 of TeX or another desktop publishing system for free; you only have to replace (questions2rectangles) by (text2rectangles).

Regards,
apfelmus

Footnote:
* Well, it is possible to "recover" insert, but only by introducing a contradiction into the logic of types with the help of (undefined):

    insert x bp = fst $ foo x (bp, undefined :: Map')

This is clearly unsafe and heavily depends on the implicit knowledge that the returned (Blueprint') ignores the (Map') argument.