Re: [Haskell-cafe] Seeking advice on a style question

On Tue, 26 Dec 2006 20:21:45 -0800, you wrote:
How would this example look if you named only multiply-used expressions? I'd like to see it in a more conventional pointful style with nested expressions. I'm still wondering whether the awkwardness results from your writing style or is more inherent. Showing the real variable names may also help also.
This is what it looks like "for real":
process :: Item -> MediaKind -> MediaSize -> Language -> SFO process item mediaKind mediaSize language = let pagemaster = loadPagemaster item mediaKind mediaSize; questions = stripUndisplayedQuestions mediaKind $ appendEndQuestions item pagemaster $ coalesceParentedQuestions $ validateQuestionContent $ loadQuestions item; (numberedQuestions,questionCategories) = numberQuestions pagemaster questions; numberedQuestions' = coalesceNAQuestions numberedQuestions; (bands,sequenceLayouts) = buildLayout mediaKind language numberedQuestions'; bands' = resolveCrossReferences bands; groupedBands = groupBands bands'; pages = paginate item mediaKind mediaSize pagemaster groupedBands; pages' = combineRows pages; sfo = pages' sequenceLayouts; in sfo
These are the function signatures:
loadPagemaster :: Item -> MediaKind -> MediaSize -> Pagemaster loadQuestions :: Item -> [Question] validateQuestionContent :: [Question] -> [Question] coalesceParentedQuestions :: [Question] -> [Question] appendEndQuestions :: Item -> Pagemaster -> [Question] -> [Question] stripUndisplayedQuestions :: MediaKind -> [Question] -> [Question] numberQuestions :: Pagemaster -> [Question] -> ([NumberedQuestion],[QuestionCategory]) coalesceNAQuestions :: [NumberedQuestion] -> [NumberedQuestion] buildLayout :: MediaKind -> Language -> [NumberedQuestion] -> ([Band],[SequenceLayout]) resolveCrossReferences :: [Band] -> [Band] groupBands :: [Band] -> [[Band]] paginate :: Item -> MediaKind -> MediaSize -> Pagemaster -> [[Band]] -> [Page] combineRows :: [Page] -> [Page] createSFO :: [Page] -> [SequenceLayout] -> SFO
MediaKind, MediaSize and Language are simple enumerations; everything else is a complex structure. Steve Schafer Fenestra Technologies Corp. http://www.fenestra.com/

To get another perspective, let's eliminate some unnecessary naming and see
what linear pipelines emerge:
process item mediaKind mediaSize language =
let (numberedQuestions,questionCategories) =
numberQuestions pagemaster $
stripUndisplayedQuestions mediaKind $
appendEndQuestions item
(loadPagemaster item mediaKind mediaSize) $
coalesceParentedQuestions $
validateQuestionContent $
loadQuestions item
(bands,sequenceLayouts) =
buildLayout mediaKind language $
coalesceNAQuestions $
numberedQuestions
in
flip combineRows sequenceLayouts $
paginate item mediaKind mediaSize pagemaster $
groupBands $
resolveCrossReferences $
bands
Warning: I haven't tried to type-check and may have made a clerical error.
Since questionCategories isn't used, use "fst" & eliminate another let.
Then, for my personal preference, and just to mix things up, switch to
"where" style:
process item mediaKind mediaSize language =
flip combineRows sequenceLayouts $
paginate item mediaKind mediaSize pagemaster $
groupBands $
resolveCrossReferences $
bands
where
(bands,sequenceLayouts) =
buildLayout mediaKind language $
coalesceNAQuestions $
fst $
numberQuestions pagemaster $
stripUndisplayedQuestions mediaKind $
appendEndQuestions item
(loadPagemaster item mediaKind mediaSize) $
coalesceParentedQuestions $
validateQuestionContent $
loadQuestions item
Not quite a work of art yet, but the structure is getting clearer to me.
On 12/28/06, Steve Schafer
On Tue, 26 Dec 2006 20:21:45 -0800, you wrote:
How would this example look if you named only multiply-used expressions? I'd like to see it in a more conventional pointful style with nested expressions. I'm still wondering whether the awkwardness results from your writing style or is more inherent. Showing the real variable names may also help also.
This is what it looks like "for real":
process :: Item -> MediaKind -> MediaSize -> Language -> SFO process item mediaKind mediaSize language = let pagemaster = loadPagemaster item mediaKind mediaSize; questions = stripUndisplayedQuestions mediaKind $ appendEndQuestions item pagemaster $ coalesceParentedQuestions $ validateQuestionContent $ loadQuestions item; (numberedQuestions,questionCategories) = numberQuestions pagemaster questions; numberedQuestions' = coalesceNAQuestions numberedQuestions; (bands,sequenceLayouts) = buildLayout mediaKind language numberedQuestions'; bands' = resolveCrossReferences bands; groupedBands = groupBands bands'; pages = paginate item mediaKind mediaSize pagemaster groupedBands; pages' = combineRows pages; sfo = pages' sequenceLayouts; in sfo
These are the function signatures:
loadPagemaster :: Item -> MediaKind -> MediaSize -> Pagemaster loadQuestions :: Item -> [Question] validateQuestionContent :: [Question] -> [Question] coalesceParentedQuestions :: [Question] -> [Question] appendEndQuestions :: Item -> Pagemaster -> [Question] -> [Question] stripUndisplayedQuestions :: MediaKind -> [Question] -> [Question] numberQuestions :: Pagemaster -> [Question] -> ([NumberedQuestion],[QuestionCategory]) coalesceNAQuestions :: [NumberedQuestion] -> [NumberedQuestion] buildLayout :: MediaKind -> Language -> [NumberedQuestion] -> ([Band],[SequenceLayout]) resolveCrossReferences :: [Band] -> [Band] groupBands :: [Band] -> [[Band]] paginate :: Item -> MediaKind -> MediaSize -> Pagemaster -> [[Band]] -> [Page] combineRows :: [Page] -> [Page] createSFO :: [Page] -> [SequenceLayout] -> SFO
MediaKind, MediaSize and Language are simple enumerations; everything else is a complex structure.
Steve Schafer Fenestra Technologies Corp. http://www.fenestra.com/ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Conal Elliott wrote:
Warning: I haven't tried to type-check and may have made a clerical error. Since questionCategories isn't used, use "fst" & eliminate another let. Then, for my personal preference, and just to mix things up, switch to "where" style:
process item mediaKind mediaSize language = flip combineRows sequenceLayouts $ paginate item mediaKind mediaSize pagemaster $ groupBands $ resolveCrossReferences $ bands where (bands,sequenceLayouts) = buildLayout mediaKind language $ coalesceNAQuestions $ fst $ numberQuestions pagemaster $ stripUndisplayedQuestions mediaKind $ appendEndQuestions item (loadPagemaster item mediaKind mediaSize) $ coalesceParentedQuestions $ validateQuestionContent $ loadQuestions item
And just for the heck of it, trading parenthesis and layout for dollar signs... process item mediaKind mediaSize language = combineRows (paginate item mediaKind mediaSize pagemaster (groupBands (resolveCrossReferences bands))) sequenceLayouts where (bands,sequenceLayouts) = buildLayout mediaKind language (coalesceNAQuestions (fst (numberQuestions pagemaster (stripUndisplayedQuestions mediaKind (appendEndQuestions item (loadPagemaster item mediaKind mediaSize) (coalesceParentedQuestions (validateQuestionContent (loadQuestions item))))))))

I assume that your proper goal is not to structure pipeline processes in full generality, but to simplify the current one at hand.
No, I'm looking for full generality. ;) I have dozens of these kinds of "quasi-pipelines," all similar in overall appearance, but different in detail.
Ah, the names help a lot and they confirm my uneasy feeling about the quasi-pipeline: I think it's ad-hoc. What I want to say is that I suggest refactoring the pipeline to become expressible as a point-free function concatenation by decoupling data dependencies instead of trying to find a way to express arbitrary interlinked quasi-pipelines. So I'd eliminate the problem by switching to a different one :) Of course, this decoupling needs the concrete names and types, that's why I wanted to know them. So, let's have look on the data dependencies, taking the information from your web-site into account. I'll formulate some guesses and questions, you don't need to comment on them; they're just meant as hints. The point is that its *type* and not so much its name should guide what a functions does. This way, dependencies on unnecessary parameters automatically go away.
process :: Item -> MediaKind -> MediaSize -> Language -> SFO "Item" doesn't tell me anything. Seems to be an XML-File containing the questions and such.
process item mediaKind mediaSize language = let pagemaster = loadPagemaster item mediaKind mediaSize;
Mh, I cannot guess what a "pagemaster" might do, but from its arguments, it looks like the "ink guy" responsible for actual printing (or on-screen display). So he might know about graphics, colors and inches but not about content.
validateQuestionContent :: [Question] -> [Question] questions = stripUndisplayedQuestions mediaKind $
Ok, mediaKind is indispensable because on-screen and print forms are utterly different. Maybe one should write filter willBeDisplayedQuestion $ instead, but I think the name 'stripUndisplayedQuestions' says it all.
appendEndQuestions item pagemaster $
Uh, why do questions depend on pagemaster and thus on mediaSize? Are these some floating questions appearing on every page, like the name of the guy to be questioned? Those should be treated somewhere else. Here, one could also write (++ endquestions ...) $ so that everybody sees what's going on, but again 'appendEnd' is adequate. The dependency on the full item is too much, I think.
coalesceParentedQuestions :: [Question] -> [Question] coalesceParentedQuestions $
This makes me suspicious whether [Question] is the right type. Apparently, data Question = GroupedQuestions [String] or something like that, so that a Question may well be a tree of questions. Guessing that this function resolves some tree structure that got specified by explicitly naming nodes, I'd suggest a type '[QuestionTaggedWithLevel] -> Tree Question' instead. Note that one also has fold and filter on Trees for further processing.
validateQuestionContent :: [Question] -> [Question] validateQuestionContent $
Uh, I think the type is plain wrong. Doesn't the name suggest 'Question -> Bool' and a fatal error when a question content is invalid?
loadQuestions item;
'loadQuestions' is a strange name (too imperative) but that's personal taste and maybe a hint to the fact that 'item' is stored inside a file.
(numberedQuestions,questionCategories) = numberQuestions pagemaster questions;
Yet again the pagemaster. I don't think that mere numbering should depend on mediaSize, not even implicitly. Why must questionCategories be collected? Aren't they inherent in 'Tree Question', so that every Branch has a unique category? Automatic numbering is fine, though.
numberedQuestions' = coalesceNAQuestions numberedQuestions;
Does 'NA' mean not answered? Isn't that either a fatal error or a Maybe-Answer? 'coalesce' makes me suspicious, I could live with a 'filter'.
(bands,sequenceLayouts) = buildLayout mediaKind language numberedQuestions';
Ah, there's no pagemaster, only mediaKind and language, although the pagemaster would be tempting here. I guess that layout builds for 'endless paper' (band).
bands' = resolveCrossReferences bands;
Mh, cross reference for thing x on page y? But there aren't any pages yet. Likely that I just don't know what bands are.
groupedBands = groupBands bands';
(can't guess on that)
pages = paginate item mediaKind mediaSize pagemaster groupedBands;
Now, the dependence on mediaSize is fine. But it's duplicated by pagemaster.
pages' = combineRows pages; sfo = createSFO pages' sequenceLayouts; in sfo
(can't guess on that)
In summary, I think that the dependencies on the pagemaster are not adequate, he mixes too many concerns that should be separated. If he goes, there's much more function concatenation possible. If really necessary, the MediaKind stuff can go into a MonadReader. Btw, IMHO the explicitely named version is much clearer than any diagram with boxes and arrows. At least, you should keep names like 'questions', 'bands' and 'pages'. Only cumbersome derivations like 'groupedBands' or names with additional ticks are really redundant. Regards, apfelmus

On Fri, 29 Dec 2006 21:01:31 +0100, you wrote:
process :: Item -> MediaKind -> MediaSize -> Language -> SFO "Item" doesn't tell me anything. Seems to be an XML-File containing the questions and such.
The reason it's just "Item" is that it can be a number of different things. It can be a full-blown questionnaire, composed of a number of questions, but it could also be just one question (sometimes the users want to see what a question layout looks like before okaying its inclusion into the questionnaire stream). The functions are overloaded to handle the various different kinds of Items.
Mh, I cannot guess what a "pagemaster" might do, but from its arguments, it looks like the "ink guy" responsible for actual printing (or on-screen display). So he might know about graphics, colors and inches but not about content.
A pagemaster defines the sizes and locations of the various parts of the page (top and bottom margins, left and right sidebars, body region), as well as the content of everything except the body region (which is where the questions go). There are four different page definitions in the pagemaster: first page, last page, even page and odd page. The pagemaster also contains a couple of other bits of information that don't fit neatly anywhere else (discussed below).
Maybe one should write filter willBeDisplayedQuestion $ instead, but I think the name 'stripUndisplayedQuestions' says it all.
Sure. "stripUndisplayedQuestions" is indeed just a simple filter.
appendEndQuestions item pagemaster $
Uh, why do questions depend on pagemaster and thus on mediaSize? Are these some floating questions appearing on every page, like the name of the guy to be questioned? Those should be treated somewhere else.
End questions are questions that are inserted automagically at the end of (almost) every questionnaire. They depend on the Item because only questionnaires get them, and they depend on the pagemaster because not every questionnaire gets them. (This is one of those additional bits of information that is stored in the pagemaster. It may seem like it would be better stored in the questionnaire itself, but there are some complicated reasons why that doesn't work. Obviously, it would be possible to rearrange the data after it is retrieved from the database, although I'm not sure that there would be a net simplification.)
coalesceParentedQuestions :: [Question] -> [Question] coalesceParentedQuestions $
This makes me suspicious whether [Question] is the right type. Apparently, data Question = GroupedQuestions [String] or something like that, so that a Question may well be a tree of questions. Guessing that this function resolves some tree structure that got specified by explicitly naming nodes, I'd suggest a type '[QuestionTaggedWithLevel] -> Tree Question' instead. Note that one also has fold and filter on Trees for further processing.
Some questions are composed of multiple sub-questions that are treated as separate questions in the database. Because the people who created and maintain the database have difficulty fully grasping the concept of trees (or hierarchies in general, actually), I have to jump through a few hoops here and there to massage the data into something meaningful. While it's true that a parent question looks superficially like a tree of child questions, there's more to it than that; the visual layout of the parent question is not generated by a simple traversal over its children, for example. So, for all of the processing that follows, a parent question (one with child questions) looks just like any other question, and any parent question-specific details remain hidden inside.
validateQuestionContent :: [Question] -> [Question] validateQuestionContent $
Uh, I think the type is plain wrong. Doesn't the name suggest 'Question -> Bool' and a fatal error when a question content is invalid?
No. The idea is to never fail to assemble the questionnaire. If there is a question with invalid content, then it is replaced by a dummy question that contains some descriptive text explaining the problem. So "validateQuestionContent" might more loquaciously be called "inspectTheQuestionsAndReplaceAnyThatDontLookRightWithAnErrorMessageShapedLikeAQuestion." I haven't shown it here, but there is an accompanying Writer that accumulates a log of errors and warnings as well. The final step generates and prepends a "job ticket" page onto the output; the errors and warnings are listed on that page.
loadQuestions item;
'loadQuestions' is a strange name (too imperative) but that's personal taste and maybe a hint to the fact that 'item' is stored inside a file.
A database, actually. First, the item's details are retrieved, and depending on what kind of item it is, a list of questions associated with that item is retrieved. For example, if the item is a questionnaire, things like the questionnaire title, etc. are retrieved along with the list of questions contained within the questionnaire.
(numberedQuestions,questionCategories) = numberQuestions pagemaster questions;
Yet again the pagemaster. I don't think that mere numbering should depend on mediaSize, not even implicitly. Why must questionCategories be collected? Aren't they inherent in 'Tree Question', so that every Branch has a unique category? Automatic numbering is fine, though.
Another piece of miscellaneous information contained within the pagemaster is the starting question number. (Some questionnaires start with a question number other than 1 because there is a post-processing step where various "front ends" are pasted onto variable "back ends"--another example of where a hierarchical approach would have made more sense, but couldn't be adopted because the database people couldn't cope.) "Question categories" is a slight misnomer; it should be "Question category/question number associations"; they're used in cross-reference resolution, described below. It would be possible to separate the question numbering and question category association generation into two separate passes, but while doing so would eliminate the need to return a tuple, it wouldn't significantly "linearize" the data flow.
numberedQuestions' = coalesceNAQuestions numberedQuestions;
Does 'NA' mean not answered? Isn't that either a fatal error or a Maybe-Answer? 'coalesce' makes me suspicious, I could live with a 'filter'.
NA means "not applicable." In order to maintain parallel question numbers across a range of related questionnaires, some questions are marked "not applicable" in some questionnaires. The idea of NA question coalescence is that if there are two or more NA questions in a row, they are replaced by a single combined NA question. Thus, instead of (16) Not applicable (17) Not applicable (18) Not applicable we have (16)-(18) Not applicable
(bands,sequenceLayouts) = buildLayout mediaKind language numberedQuestions';
Ah, there's no pagemaster, only mediaKind and language, although the pagemaster would be tempting here. I guess that layout builds for 'endless paper' (band).
At this point, questions lose their identities as questions, and are replaced by their bands, which are the page body-wide rectangles that are painted, one after another, into the body regions of the pages. Each question consits of one or more bands (some questions go for many pages, and so contain dozens of bands). Each band also contains information that is used to automatically insert continuation headers and footers (e.g., "Question (45) continued on next page") whenever a page break occurs in the middle of a question. This is where "language" comes in, by the way. Although the item and pagemaster implicitly contain language information, this is the only place where the questionnaire assembler itself needs to know what language is being used, because it has to decide between "Continued" and "Continuación," etc.
bands' = resolveCrossReferences bands;
Mh, cross reference for thing x on page y? But there aren't any pages yet. Likely that I just don't know what bands are.
That's a typo, by the way; it should have been: bands' = resolveCrossReferences bands questionCategories; Questions are cross-referenced by question number. For example, question 4 might be in the "Sales" category, while question 22 might be "Detailed Sales." The last item of question 22 might be "Total; should equal the value reported in (4)." In order to make the layouts as reusable as possible, rather than hard-coding "(4)" in that last item in (22), there is a tag that looks something like this:
<text>Total; should equal the value reported in <question-ref category="Sales"/>.</text>
groupedBands = groupBands bands';
(can't guess on that)
In order to implement widow/orphan control, not every band is allowed to start a new page ("keep with previous" and "keep with next," in effect). Before being handed off to the paginator, the bands are grouped so that each group of bands begins with a band that _is_ allowed to start a page, followed by the next n bands that aren't allowed to start a page. Each grouped band is then treated by the paginator as an indivisible entity. (At this point, the grouped bands could be coalesced into single bands, but doing so adds a bit of unnecessary overhead to the rendering phase.)
pages = paginate item mediaKind mediaSize pagemaster groupedBands;
Now, the dependence on mediaSize is fine. But it's duplicated by pagemaster.
That's correct; it would be possible to simply copy some of the information that needs to be propagated into a "long-lived" structure like the pagemaster.
pages' = combineRows pages; sfo = createSFO pages' sequenceLayouts; in sfo
(can't guess on that)
SFO is the final XML-format document description that is then rendered to a variety of output devices (screen, printer, PS or PDF file, etc.).
In summary, I think that the dependencies on the pagemaster are not adequate, he mixes too many concerns that should be separated.
True, but then that's even more miscellaneous bits and pieces to carry around. I guess what makes me uncomfortable is that when I'm writing down a function like process1 (not its real name, as you might imagine), I want to concentrate on the high-level data flow and the steps of the transformation. I don't want to have to exposes all of the little bits and pieces that aren't really relevant to the high-level picture. Obviously, in the definitions of the functions that make up process1, those details become important, but all of that should be internal to those function definitions. Steve Schafer Fenestra Technologies Corp. http://www.fenestra.com/

In summary, I think that the dependencies on the pagemaster are not adequate, he mixes too many concerns that should be separated.
True, but then that's even more miscellaneous bits and pieces to carry around. I guess what makes me uncomfortable is that when I'm writing down a function like process1 (not its real name, as you might imagine), I want to concentrate on the high-level data flow and the steps of the transformation. I don't want to have to exposes all of the little bits and pieces that aren't really relevant to the high-level picture. Obviously, in the definitions of the functions that make up process1, those details become important, but all of that should be internal to those function definitions.
Yes, we want to get rid of the bits and pieces. Your actual code is between two extremes that both manage to get rid of them. One extreme is the "universal" structure like you already noted:
Alternatively, I can wrap all of the state up into a single universal structure that holds everything I will ever need at every step, but doing so seems to me to fly in the face of strong typing; at the early stages of processing, the structure will have "holes" in it that don't contain useful values and shouldn't be accessed.
Currently, (pagemaster) has tendencies to become such a universal beast. The other extreme is the one I favor: the whole pipeline is expressible as a chain of function compositions via (.). One should be able to write process = rectangles2pages . questions2rectangles This means that (rectangles2pages) comes from a (self written) layout library and that (questions2rectangles) comes from a question formatting library and both concern are completely separated from each other. If such a factorization can be achieved, you get clear semantics, bug reduction and code reuse for free. Of course, the main problem is: the factorization does not arise by coding, only by thinking. Often the situation is as following and I for myself encounter it again and again: one starts with an abstraction along function composition but it quickly turns out, as you noted, that "there are some complicated reasons why that doesn't work". To get working code, one creates some miniature "universal structure" that incorporates all the missing data that makes the thing work. After some time, the different concerns get more and more intertwined and soon, every data depends on everything else until the code finally gets unmaintainable, it became "monolithic". What can be done? The original problem was that the solutions to the originally separated concerns (layout library and questions2rectangles) simply were not powerful, not general enough. The remedy is to separately increase the power and expressiveness of both libraries until the intended result can be achieved by plugging them together. Admittedly, this is not an easy task. But the outcome is rewarding: by thinking about the often ill-specified problems, one understands them much better and it most often turns out that some implementation details were wrong and so on. In contrast, the ad-hoc approach that introduces miniature "universal structures" does not make the libraries more general, but tries to fit them together by appealing to the special case, the special problem at hand. In my experience, this only makes things worse. The point is: you have to implement the functionality anyway, so you may as well grab some free generalizations and implement it once and for all in an independent and reusable library. I think that the following toy example (inspired by a discussion from this mailing list) shows how to break intertwined data dependencies: foo :: Keyvalue -> (Blueprint, Map') -> (Blueprint', Map) foo x (bp,m') = (insert x bp, uninsert x bp m') The type for (foo) is much too general: it says that foo may mix the (Blueprint) and the (Map') to generate (Blueprint'). But this is not the case, the type for foo introduces data dependencies that are not present at all. A better version would be foo' :: Keyvalue -> Blueprint -> (Blueprint', Map' -> Map) foo' x bp = (insert x bp, \m' -> uninsert x bp m') Here, it is clear that the resulting (Map) depends on (blueprint) and (Map'), but that the resulting (Blueprint') does not depend on (map'). The point relevant to your problem is that one can use (foo') in more compositional ways than (foo) simply because the type allows it. For instance, you can recover (insert) from (foo'): insert :: Keyvalue -> Blueprint -> Blueprint' insert x bp = fst $ foo' x bp but this is impossible with (foo).* In the original problem, the type signature for (foo') was that best one could get. But here, the best type signature is of course foo'' :: ( Keyvalue -> Blueprint -> Blueprint' , Keyvalue -> Blueprint -> Map' -> Map ) foo'' = (insert, uninsert) because in essence, (foo) is just the pair (insert, uninsert). One morale from the above example is that functions returned as result (as in the signature of (foo')) are your friends when tackling the problem of making libraries more expressive while keeping them independent. In summary, I think that your question about style of pipelines roots in questions far deeper and I think that the "high level only" wish is an illusion: you simply have to write down every dependency you introduce, there is no way around this law of nature. But IMHO and compared to imperative languages, Haskell is the first programming language that really offers the possibility to specify data dependencies exactly as they are because Haskell is pure, higher order and has a powerful type system. Concerning your code, I wish to thank you for its detailed explanation. The post already got quite long, so I'm adding only some remarks. Of course, they are my personal opinion and you don't need to incorporate or comment on them, because it's your code after all.
process :: Item -> MediaKind -> MediaSize -> Language -> SFO The reason it's just "Item" is that it can be a number of different things. It can be a full-blown questionnaire, composed of a number of questions, but it could also be just one question (sometimes the users want to see what a question layout looks like before okaying its inclusion into the questionnaire stream). The functions are overloaded to handle the various different kinds of Items.
If there are only the cases of some single question or a full questionnaire, you could always do blowup :: SingleQuestion -> FullQuestionaire preview = process (blowup a_question) ... In general, I think that it's the task of (process) to inspect (Item) and to plug together the right steps. For instance, a single question does not need page breaks or similar. I would avoid overloading the (load*) functions and (paginate) on (Item).
A pagemaster defines the sizes and locations of the various parts of the page (top and bottom margins, left and right sidebars, body region), as well as the content of everything except the body region (which is where the questions go). [...]
The pagemaster also contains a couple of other bits of information that don't fit neatly anywhere else (discussed below).
As you may guess, I'd throw out these other bits from (pagemaster) and reserve him for arranging rectangles on a page only. I suspect that he can be fully absorbed by (paginate) afterwards for (buildLayout) does not use it (?).
Maybe one should write filter willBeDisplayedQuestion $ instead, but I think the name 'stripUndisplayedQuestions' says it all.
Sure. "stripUndisplayedQuestions" is indeed just a simple filter.
Writing (filter willBeDisplayedQuestion) has the minor advantage that it is absolutely clear that this step in the pipeline will only filter stuff. The name (stripUndisplayedQuestions) suggests that, too, but names are no proofs and the type does not prove it either in this case.
appendEndQuestions :: Item -> Pagemaster -> [Question] -> [Question] End questions are questions that are inserted automagically at the end of (almost) every questionnaire. [...] It may seem like it would be better stored in the questionnaire itself, but there are some complicated reasons why that doesn't work. Obviously, it would be possible to rearrange the data after it is retrieved from the database, although I'm not sure that there would be a net simplification.
I'd go for a rearrange because my experience is that while taking over foreign data structures eases import, it most often makes the actual algorithm extremely cumbersome. The algorithm dictates the data structure. Btw, the special place "end" suggests that the "question markup language" does not incorporate all of: "conditional questions", "question groups", "group templates"? Otherwise, I'd just let the user insert <if media="print"> <template-instance ref="endquestions.xml" /> </if> at the end of every questionnaire. If you use such a tiny macro language (preferably with sane and simple semantics), you can actually merge (stripUndisplayedQuestions) and (appendEndQuestions) into a function (evalMacros) without much fuss. I think that this will even make the code simpler. Numbering and cross-references could be implemented as macro expansion, too. Perhaps it is also advisable to do (validateQuestionContent) before macro expansion. And, best of all, the macro language is completely independent of the question formatting task, you can easily outsource this into a library.
coalesceParentedQuestions :: [Question] -> [Question] [...] Some questions are composed of multiple sub-questions that are treated as separate questions in the database. Because the people who created and maintain the database have difficulty fully grasping the concept of trees (or hierarchies in general, actually), I have to jump through a few hoops here and there to massage the data into something meaningful.
While it's true that a parent question looks superficially like a tree of child questions, there's more to it than that; the visual layout of the parent question is not generated by a simple traversal over its children, for example. So, for all of the processing that follows, a parent question (one with child questions) looks just like any other question, and any parent question-specific details remain hidden inside.
Again, I'd say that the algorithm and now more than ever the meaning dictates the data structure. Assuming that processing children of different parents is independent and that processing children of the same parent is *not* independent, I'd group "families" right together in a data structure. Whether it's a simple traversal (I interpret this as "independent"?) or not, at some point you have to mess with the whole group at once anyway, so you can put it together right now.
validateQuestionContent :: [Question] -> [Question] Uh, I think the type is plain wrong. Doesn't the name suggest 'Question -> Bool' and a fatal error when a question content is invalid?
No. The idea is to never fail to assemble the questionnaire. If there is a question with invalid content, then it is replaced by a dummy question > [...]
Ah, of course you are right, I didn't think of enhanced error processing. I guess that (validateQuestionContent) is not a filter, because you have to check "non-local" parent-child relations as well? If so, then I suggest grouping them beforehand to make it a filter.
(numberedQuestions,questionCategories) = numberQuestions pagemaster questions;
Another piece of miscellaneous information contained within the pagemaster is the starting question number.
You can still automatically "number" questions in dependence of a first number by overloading the (Num) class: newtype RelativeInteger = RI { unRI :: Integer -> Integer } instance (Num RelativeInteger) where ... mkAbsolute :: Integer -> RelativeInteger -> Integer mkAbsolute pointOfReference relint = unRI relint pointOfReference
(Some questionnaires start with a question number other than 1 because there is a post-processing step where various "front ends" are pasted onto variable "back ends"--another example of where a hierarchical approach would have made more sense, but couldn't be adopted because the database people couldn't cope.)
Uh, that doesn't sound good. I assume that the post-processing is not implemented in Haskell? Otherwise, you could incorporate this stuff into (process) and choose suitable interfaces. IMHO, dealing with some modestly expressive interface which still only offers medium abstraction (like object orientation) is a pain in a type system as powerful as Haskell's.
bands' = resolveCrossReferences bands questionCategories;
Questions are cross-referenced by question number. For example, question 4 might be in the "Sales" category, while question 22 might be "Detailed Sales." The last item of question 22 might be "Total; should equal the value reported in (4)." In order to make the layouts as reusable as possible, rather than hard-coding "(4)" in that last item in (22), there is a tag that looks something like this:
<text>Total; should equal the value reported in <question-ref category="Sales"/>.</text>
Fine, though I don't see exactly why this isn't done before after the questions have been transformed to printable things but before there are distributed across pages. So the references cannot refer to page numbers, yet must be processed after transforming questions to rectangles?
groupedBands = groupBands bands';
(can't guess on that)
In order to implement widow/orphan control, not every band is allowed to start a new page ("keep with previous" and "keep with next," in effect). Before being handed off to the paginator, the bands are grouped so that each group of bands begins with a band that _is_ allowed to start a page, followed by the next n bands that aren't allowed to start a page. Each grouped band is then treated by the paginator as an indivisible entity. (At this point, the grouped bands could be coalesced into single bands, but doing so adds a bit of unnecessary overhead to the rendering phase.)
Maybe (paginate) can be given a type along the lines of paginate :: Rectangle a => [a] -> Pages a and perhaps you could merge several bands into a single rectangle simply by saying instance Rectangle [Band] where ... To conclude, I think that (process) can be roughly factorized as follows: process = buildPages . questions2rectangles . expandMacros Now, you get 2/3 of TeX or another desktop publishing system for free, you only have to replace (questions2rectangles) by (text2rectangles). Regards, apfelmus Footnote: * Well, it is possible to "recover" insert, but only by introducing a contradiction into the logic of types with the help of (undefined): insert x bp = foo x (bp, (undefined :: map')) This is clearly unsafe and heavily depends on the implicit knowledge that the returned (BluePrint') ignores its arguments.

[Apologies for the long delay in replying; I've been traveling, etc.] On Sun, 31 Dec 2006 20:11:47 +0100, you wrote:
The other extreme is the one I favor: the whole pipeline is expressible as a chain of function compositions via (.). One should be able to write
process = rectangles2pages . questions2rectangles
This means that (rectangles2pages) comes from a (self written) layout library and that (questions2rectangles) comes from a question formatting library and both concern are completely separated from each other. If such a factorization can be achieved, you get clear semantics, bug reduction and code reuse for free.
I favor that approach, too. ;) The problem is that when there is a multi-step process, and various bits of information get propagated throughout, as required by the various steps in the process, the overall decomposition into a series of steps a . b . c . ... can become brittle in the face of changing requirements. Let's say, for example, a change request comes in that now requires step 13 to access information that had previously been discarded back at step 3. The simple approach is to propagate that information in the data structures that are passed among the intervening steps. But that means that all of the steps are "touched" by the change--because the relevant data structures are redefined--even though they're just passing the new data along. The less simple (and not always feasible) approach is to essentially start over again and re-jigger all of the data structures and subprocesses to handle the new requirement. But this can obviously become quite a task.
If there are only the cases of some single question or a full questionnaire, you could always do
blowup :: SingleQuestion -> FullQuestionaire preview = process (blowup a_question) ...
In general, I think that it's the task of (process) to inspect (Item) and to plug together the right steps. For instance, a single question does not need page breaks or similar. I would avoid overloading the (load*) functions and (paginate) on (Item).
A single question can be several pages long, so it does need to be paginated. The reason for the decomposition as it now stands is that any item (and there are more kinds of items than just questions and questionnaires) can be decomposed into a pagemaster and a list of questions. Once that has occurred, all items acquire essentially the same "shape." That's why loading the pagemaster and loading the questions are the first two steps in the process.
Btw, the special place "end" suggests that the "question markup language" does not incorporate all of: "conditional questions", "question groups", "group templates"? Otherwise, I'd just let the user insert
<if media="print"> <template-instance ref="endquestions.xml" /> </if>
at the end of every questionnaire. If you use such a tiny macro language (preferably with sane and simple semantics), you can actually merge (stripUndisplayedQuestions) and (appendEndQuestions) into a function (evalMacros) without much fuss.
If only I had the power to impose those kinds of changes.... Unfortunately, I have little control over the logical organization of questions, questionnaires and all of the other little bits and pieces. (I assure you I would have done it quite differently if I could.) Instead, I have to deal with an ad hoc pseudo-hierarchical quasi-relational database structure, and to settle for occasional extra columns to be added to the tables in order to specify information that I can't synthesize any other way.
Uh, that doesn't sound good. I assume that the post-processing is not implemented in Haskell?
Not even remotely so. ;) In the paper world, post-processing consists of semi-automated collation and stapling of the actual printed pages. In the electronic world, during previous survey periods, an analogous process was used (a "front" questionnaire and a "back" questionnaire would be figuratively stapled together); we're looking to make the merging a bit smoother and more automatic this time around. As is often the case, the motivation for the rather arcane post-processing is human, rather than technical. Let's say I have ten different questionnaires, where the first five pages of each questionnaire are identical, and these are followed by six additional pages that differ from one questionnaire to another. That's a total of 10 * 11 = 110 pages, but only 5 + 10 * 6 = 65 _distinct_ pages. As hard as it may be to believe, the people who are responsible for approving the questionnaires see it like this: If the system produces one 5-page "front" questionnaire and ten 6-page "back" questionnaires, then that's 65 pages that they have to inspect. But if the system were to produce ten 11-page questionnaires, even though the first five pages of each questionnaire are generated from exactly the same data using exactly the same software, that's 110 pages that they have to inspect.
Fine, though I don't see exactly why this isn't done before after the questions have been transformed to printable things but before there are distributed across pages. So the references cannot refer to page numbers, yet must be processed after transforming questions to rectangles?
It's not until you get to the "rectangles" level that you can see the text and tokens that need to be replaced. -------- Thanks for all of the discussion. I think I have a lot to ponder.... Steve Schafer Fenestra Technologies Corp. http://www.fenestra.com/

Steve Schafer wrote:
[Apologies for the long delay in replying; I've been traveling, etc.] [never mind]
The other extreme is the one I favor: the whole pipeline is expressible as a chain of function compositions via (.). One should be able to write
process = rectangles2pages . questions2rectangles
This means that (rectangles2pages) comes from a (self written) layout library and that (questions2rectangles) comes from a question formatting library and both concern are completely separated from each other. If such a factorization can be achieved, you get clear semantics, bug reduction and code reuse for free.
I favor that approach, too. ;) The problem is that when there is a multi-step process, and various bits of information get propagated throughout, as required by the various steps in the process, the overall decomposition into a series of steps a . b . c . ... can become brittle in the face of changing requirements.
Let's say, for example, a change request comes in that now requires step 13 to access information that had previously been discarded back at step 3. The simple approach is to propagate that information in the data structures that are passed among the intervening steps. But that means that all of the steps are "touched" by the change--because the relevant data structures are redefined--even though they're just passing the new data along.
Ah, I forgot to point it out, polymorphism is your dear friend, of course. For example, 'rectangles2pages' should be fully polymorphic in the stuff that's inside the rectangles, just like for instance 'nub :: Eq a => [a] -> [a]' is polymorphic in the list elements. One possibility is something like class Rectangle a where width, height :: a -> Integer type Pagenumber = Integer data Rectangle a => Pages a = Pages { stickyboxes :: [(Position, a)] -- appear on every page , pagenumberpos :: Position -- absolute numbering later , pages :: Data.Map Pagenumber [(Position, a)] -- actual contents } data Position = Position { x :: Integer, y :: Integer } data Footer a = Footer { content :: a, position :: HAlign } data HAlign = Left | Center | Right rectangles2pages :: Rectangle a => Footer a -> [a] -> Pages a but there are many others. The type of 'rectangles2pages' dictates that it can only rearrange but not alter the data inside 'a' (this is due to _parametric_ polymorphism). Now, you may use it for normal text processing via instance Rectangle Paragraph where Or you can abuse the pagination algorithm to align bread, buns and cookies on several tablets for baking in the stove: instance Rectangle Cookie where -- in the sense of bounding box however you like it :) The point is that you will have minimal trouble when requirements change if you somehow managed to keep 'rectangles2pages' as general as possible. The code above does not need to be changed if you need to carry extra information around in 'a'. That's what I meant with "(self written) layout library": for me, a "library" is necessarily polymorphic. Later on, you can specialize the types. So printing will be print :: Pages Paragraph -> Graphic and it is clear that 'Pages Cookie' cannot be printed. A pipeline could look like process = print . rectangles2pages footer . ... where footer = loadFooter item Note that I chose to still plumb a footer around, which makes sense in case it is fully specified in the item. Of course, it can also be put in pair '(Footer a, [a])': process = print . uncurry rectangles2pages . ... which is the way to go if it is generated on the fly by the previous step in the pipeline. Given suitable generality, the right choice is often natural and satisfying. And who says that there isn't an even better generalization that incorporates the footer more elegantly? Of course, the difficult thing is to discover the right generalization. This can be quite an art. Oh, I can spent days of thinking without writing a single line of code and I have never, ever encountered a situation where it wasn't worth the effort. The point is: you have to implement the corresponding general functionality anyway because often, even the special case needs the full power in some way or another. Implementing it for the special case only is like coding with a blindfold. Still, the generalization can turn out to be inadequate, but because we don't need to worry about the type 'a', things will be easier. And, at some point, the interface doesn't need to change anymore: every change makes it more general, there is a maximum generality and we know that monotone sequences converge :)
Btw, the special place "end" suggests that the "question markup language" does not incorporate all of: "conditional questions", "question groups", "group templates"? Otherwise, I'd just let the user insert
<if media="print"> <template-instance ref="endquestions.xml" /> </if>
at the end of every questionnaire. If you use such a tiny macro language (preferably with sane and simple semantics), you can actually merge (stripUndisplayedQuestions) and (appendEndQuestions) into a function (evalMacros) without much fuss.
If only I had the power to impose those kinds of changes....
Unfortunately, I have little control over the logical organization of questions, questionnaires and all of the other little bits and pieces. (I assure you I would have done it quite differently if I could.) Instead, I have to deal with an ad hoc pseudo-hierarchical quasi-relational database structure, and to settle for occasional extra columns to be added to the tables in order to specify information that I can't synthesize any other way.
Argh. Dealing with badly designed external interfaces is a nightmare. But there may be a way out: while you unfortunately cannot change their database design, they fortunately cannot control how your code represents their database. Who says that you import the database as it is? Who forbids to rearrange it internally? I think I'd interleave a processing step saneImport :: RawDatabase -> (QuestionHierarchy,PageTemplate) that fetches all present information and groups and massages the data until it becomes organized. This is much like your 'questions' and 'pagemaster' combined, but it only depends on the data base content, not on parameters like 'mediaKind' or 'Item' which determine how and what content is to be processed. This way, all troubles arising from a badly designed database are banned into 'saneImport' and don't affect the actual processing step.
Fine, though I don't see exactly why this isn't done before after the questions have been transformed to printable things but before there are distributed across pages. So the references cannot refer to page numbers, yet must be processed after transforming questions to rectangles?
It's not until you get to the "rectangles" level that you can see the text and tokens that need to be replaced.
I don't quite understand why, but let's have another example of polymorphism that may be used before the "rectangle" level but also after. We tackle macros (the end questions can be transformed to them when fetching them from the database) and cross references inside some other data structure. One possibility to transform questions with macros and references to plain text is -- very simple macros data Macro a = Return a | If (MediaKind -> Bool) a execute1 :: MediaKind -> Macro a -> Maybe a execute1 _ (Return x) = Just x execute1 media (If b x) = if b media then Just x else Nothing -- execute can be formulated with an ApplicativeFunctor -- but we'll keep it easy here executeMacros :: MediaKind -> [Macro a] -> [a] executeMacros media xs = concatMap (maybeToList . execute1 media) -- references data Ref a = Here a | Ref CrossRef resolve :: Data.Map CrossRef a -> Ref a -> a resolve = ... -- for simplicity, the lookup always succeeds -- questions whose question text is of type a data Question a = FreeForm a (Answer String) | SingleChoice (Answer Index) [a] instance Functor Question where fmap f (FreeForm x a) = FreeForm (f x) a fmap f (SingleChoice a xs) = SingleChoice a (map f xs) -- now comes the interesting type type RawQuestion = Macro (Question (Ref String)) -- and we want to flatten it to (Question String) textonly :: MediaKind -> Data.Map CrossRef String -> [RawQuestion] -> [Question String] textonly media refs = fmap (resolve refs) . executeMacros media The basic idea is the following: it is clear how to resolve a single cross reference, that's what resolve :: Data.Map CrossRef a -> Ref a -> a does. Now, we simply want to lift this into other data structures: resolveQs :: Data.Map CrossRef a -> Question (Ref String) -> Question String resolveQs refs = fmap (resolve refs) The lifting itself can be made polymorphic in the sourrounding structure: resolveInside :: Functor f => Data.Map CrossRef a -> f (Ref String) -> f String resolveInside refs = fmap (resolve refs) with the help of type constructors and functors. In more complicated cases, one may need other type classes like those from 'Data.Foldable', 'Data.Applicative', 'Data.Traversable'. Depending on the functions that need to be lifted, custom (multi parameter) type classes may quickly show up. In general, the discipline of lifting functions systematically to many data types is called "generic programming". In a sense, not only the pipeline's functionality needs to be assembled from smaller functions, but their types need to be assembled from smaller pieces as well!
As hard as it may be to believe, the people who are responsible for approving the questionnaires see it like this: If the system produces one 5-page "front" questionnaire and ten 6-page "back" questionnaires, then that's 65 pages that they have to inspect. But if the system were to produce ten 11-page questionnaires, even though the first five pages of each questionnaire are generated from exactly the same data using exactly the same software, that's 110 pages that they have to inspect.
X-)
-------- Thanks for all of the discussion. I think I have a lot to ponder....
May the λ guide your path ;) And of course, you can always outsource some pondering to the mailing list. Regards, apfelmus
participants (4)
-
apfelmus@quantentunnel.de
-
Conal Elliott
-
Greg Buchholz
-
Steve Schafer