[Haskell-cafe] Literate haskell format unclear (implementation and specification inconsistencies)

28 Feb 2007


While trying to implement literate Haskell[*], I found several
ways in which the correct behavior for unliterating (especially with
regard to errors) is unclear.  I have several cases on which ghc, hugs,
nhc98 and the Haskell 98 Report disagree!  The Report as it stands
is far from a clear and complete specification (and I didn't find
anything in the Haskell' wiki/trac about literate Haskell).

[*](particularly, to make DrIFT able to deal with TeX-style lhs
      - there's unfinished work in darcs repo
     http://isaac.cedarswampstudios.org/2007/DrIFT/ )

testing with:
ghc: 6.4.2, 6.6(some)
hugs: Hugs Version 20050308
nhc98: recent darcs (1.19)
report: Haskell 98 (The Revised Report: December 2002), section 9.4,
  http://www.haskell.org/onlinereport/syntax-iso.html#sect9.4

A full set of .lhs test files for all the issues:
darcs get http://isaac.cedarswampstudios.org/2007/LiterateHaskellTests
or download
http://isaac.cedarswampstudios.org/2007/LiterateHaskellTests-1.tar.gz
or you can try just prefixing all examples with
\begin{code}
module Main where
main = print str
\end{code}
or
...
module Main where
main = print str
as appropriate... (please don't get mangled by mail programs,
                     initial '>'s... ):

1.[UnmatchedBegin]
If a \begin{code} starts a section of code, is \end{code}
_required_ before the end of the file?
       report: unclear
          ghc: required
  hugs, nhc98: not required
The report says "entirely enclosed between", but then goes on to say
"More precisely:" and gives a description that does not settle this
question at all.
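
To make the two readings concrete, here is a minimal sketch (not taken
from ghc, hugs or nhc98) of a TeX-style unliterator with the choice
made explicit.  The names EofPolicy and unlitTeX are hypothetical, and
delimiter lines are recognised by prefix only, which is itself one of
the open questions in point 2.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical policy switch; not part of any of the tools discussed.
data EofPolicy = RequireEnd | AllowMissingEnd
  deriving (Eq, Show)

-- Extract the program lines from TeX-style literate source.
-- Returns Left with a message if the chosen policy rejects the input.
unlitTeX :: EofPolicy -> [String] -> Either String [String]
unlitTeX policy = outside
  where
    -- Outside a code block: skip lines until a \begin{code} line.
    outside [] = Right []
    outside (l : ls)
      | "\\begin{code}" `isPrefixOf` l = inside ls
      | otherwise                      = outside ls

    -- Inside a code block: keep lines until an \end{code} line.
    inside []
      | policy == RequireEnd = Left "unterminated \\begin{code}"
      | otherwise            = Right []
    inside (l : ls)
      | "\\end{code}" `isPrefixOf` l = outside ls
      | otherwise                    = (l :) <$> inside ls
```

With RequireEnd the sketch rejects a file that ends inside a code
block (the ghc answer above); with AllowMissingEnd it accepts it (the
hugs and nhc98 answer).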

2.[AfterBeginOrEnd/{BeginWhite,EndWhite,BeginPrint,EndPrint}]
Can a line beginning \begin{code} or \end{code} have additional
stuff on the end, where the directive is understood and the
additional stuff is ignored?
  report:[yes]
  hugs:[yesIffAdditionalStuffIsInvisible]
  ghc:[case beginningOfLine of
        "\end{code}" -> yes
        "\begin{code}" -> yesIffAdditionalStuffIsInvisible]
  nhc98:[UNLIT_IGNORED]
   where
    yesIffAdditionalStuffIsInvisible =
      if (all isSpace additionalStuff) then yes else UNLIT_IGNORED
    UNLIT_IGNORED means that if it was inside a code block then
      the line is treated as program text (so it's probably
      a syntax error) and if it was in a literate comment section
      it is treated as a non-empty literate comment line.
Note that this takes a careful reading of the report: for \begin{code},
program code only begins on the _following_ line, so anything after the
directive on the same line is not program text.  Most seem to agree
that it shouldn't mess up your program to have trailing whitespace
on such a line (but at least nhc98 doesn't currently implement this).
Is there any reason to allow NON-whitespace in that location?
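
The three policies in the table above can be written out as ordinary
Haskell.  This is only a sketch of the behaviours as I observed them,
not code from any of the implementations; Delim, Verdict, splitDelim,
strict, lenient and ghcLike are hypothetical names.

```haskell
import Data.Char (isSpace)
import Data.List (stripPrefix)

data Delim = Begin | End
  deriving (Eq, Show)

-- How a line is treated by the unliterator.
data Verdict
  = Directive Delim  -- understood as \begin{code} or \end{code}
  | NotDirective     -- UNLIT_IGNORED: ordinary code or comment text
  deriving (Eq, Show)

-- If the line starts with a delimiter, return it with the trailing text.
splitDelim :: String -> Maybe (Delim, String)
splitDelim l =
  case (stripPrefix "\\begin{code}" l, stripPrefix "\\end{code}" l) of
    (Just rest, _) -> Just (Begin, rest)
    (_, Just rest) -> Just (End, rest)
    _              -> Nothing

-- yesIffAdditionalStuffIsInvisible: only trailing whitespace is accepted.
strict :: String -> Verdict
strict l = case splitDelim l of
  Just (d, rest) | all isSpace rest -> Directive d
  _                                 -> NotDirective

-- The "yes" reading: any trailing text is ignored.
lenient :: String -> Verdict
lenient l = maybe NotDirective (Directive . fst) (splitDelim l)

-- The mixed behaviour listed for ghc: lenient for end, strict for begin.
ghcLike :: String -> Verdict
ghcLike l = case splitDelim l of
  Just (End, _)                         -> Directive End
  Just (Begin, rest) | all isSpace rest -> Directive Begin
  _                                     -> NotDirective
```

For example, strict applied to "\end{code} % done" gives NotDirective,
while lenient and ghcLike both give Directive End.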

3.[IgnoringStringLiterals/{A,B}]
What does "(ignoring string literals, of course)" mean?
Does the following make str = "string gap:end{code}" and leave the
code block unended (A), or does the \end{code} line end the code
block (B)?
(A)---------
\begin{code}
str = "string gap:\
\end{code}"
---------
report:unclear, hugs:A, ghc:B, nhc98:A
This works for ghc, the result being "string gap:string gap ends":
(B)---------
\begin{code}
str = "string gap:\
\end{code}"

\begin{code}
\string gap ends"
\end{code}
-----------
Note that behavior A requires detailed knowledge of Haskell's syntax
in order to unliterate a file, for a dubious benefit (if a string literal
with string gaps is used like that, the programmer could just indent
the second line!)
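
A small illustration of the indentation workaround, shown as plain
Haskell (the contents of the code block only).  The assumption is that
delimiters are only recognised at the very start of a line, as in all
the cases above.

```haskell
-- The continuation line starts with spaces, so an unliterator that
-- looks for \end{code} at the start of a line does not see a delimiter.
-- The string gap (backslash, whitespace, backslash) contributes nothing
-- to the value of the literal.
str :: String
str = "string gap:\
      \end{code}"
```

The value of str is still "string gap:end{code}", and no knowledge of
string literals is needed to unliterate the file.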

4.[ExtraBeginEnd/{ExtraBegin,ExtraEnd}]
What happens if \begin{code} appears after another \begin{code}
before an \end{code}; and what happens if an \end{code} appears
without a code block previously having been started by a \begin{code}?
stray end:
   ghc, nhc98:[UNLIT_IGNORED (-> probable successful compile)]
         hugs:[error "\end{code} encountered outside code block"]
stray begin:
    ghc, nhc98:[UNLIT_IGNORED (-> probable syntax error)]
          hugs:[error "\begin{code} encountered inside code block"]
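
For comparison, a sketch of the strict (hugs-like) treatment as a
small state machine.  It is not the hugs source; Mode and checkDelims
are hypothetical names, the messages are the ones quoted above with a
line number added, and delimiters are matched by prefix only.

```haskell
import Data.List (isPrefixOf)

data Mode = InComment | InCode
  deriving (Eq, Show)

-- Walk over the lines, tracking whether we are inside a code block,
-- and report the first stray delimiter with its 1-based line number.
checkDelims :: [String] -> Either String ()
checkDelims = go InComment . zip [1 :: Int ..]
  where
    go _ [] = Right ()
    go mode ((n, l) : rest)
      | isBegin l, mode == InCode =
          Left (at n "\\begin{code} encountered inside code block")
      | isEnd l, mode == InComment =
          Left (at n "\\end{code} encountered outside code block")
      | isBegin l = go InCode rest
      | isEnd l   = go InComment rest
      | otherwise = go mode rest

    isBegin = ("\\begin{code}" `isPrefixOf`)
    isEnd   = ("\\end{code}" `isPrefixOf`)
    at n msg = "line " ++ show n ++ ": " ++ msg
```

The UNLIT_IGNORED behaviour of ghc and nhc98 corresponds to dropping
the two error cases and letting the stray line fall through as
ordinary text.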

5.[LexicalUnitAcrossLiterateComment/{StringGap,BlockComment}]
Can lexical units jump across literate comment gaps?
report, ghc, hugs, nhc98: yes...
Note that the Report specifies it by removing all non-program lines,
rather than converting them to blank lines, but an additional blank
line in the middle of a Haskell program NEVER makes a difference
(except for line numbering, of course).
----------
...
str = "string gap:\
This might be a literate comment.
...
\ends here"

ghc, hugs, nhc98: "string gap:ends here"
or
--------
...
str = "string"
{- a comment
This might be a literate comment -} with weird character sequences.
...
ends here -}

ghc, hugs, nhc98: think it's a fine comment
I mention this because allowing these makes it complicated to preserve
literate comments in a translation to .hs, because, other than cases
like these, prefixing literate comment lines with "--  " works fine.[*]
However, banning these could make processing that wants to report errors
end up more complicated.  Maybe the report could/should say that it
is "not advisable", as it does for mixing '>' and {code} styles?
(Also it's confusing to the programmer - I wondered
   "can I (and should I) really do that?!" sometimes..)

[*]Haddock style is a nuisance too, which is why there are two spaces
added -- Haddock seems not to recognize such comments then, as desired.
Or would it be better to take the other approach and say those should
count as haddock comments?
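
The comment-preserving translation I have in mind looks roughly like
this for TeX-style files.  It is a sketch only: toHs is a hypothetical
name, delimiters are matched by prefix, and it deliberately does
nothing about lexical units that span a literate comment, which is
exactly the case that breaks it.

```haskell
import Data.List (isPrefixOf)

-- Translate TeX-style literate lines into plain .hs lines.
-- Delimiter lines become blank lines so that line numbers are kept;
-- non-empty literate comment lines get the prefix "--  " (two spaces,
-- so that Haddock does not pick them up).
toHs :: [String] -> [String]
toHs = go False
  where
    go _ [] = []
    go inCode (l : ls)
      | "\\begin{code}" `isPrefixOf` l = "" : go True ls
      | "\\end{code}" `isPrefixOf` l   = "" : go False ls
      | inCode                         = l : go True ls
      | null l                         = "" : go False ls
      | otherwise                      = ("--  " ++ l) : go False ls
```

If a string gap or a {- -} comment is open when a literate comment
line is reached, the inserted "--  " ends up inside that string gap or
comment, so the output is no longer equivalent to the input.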

6.[TeXBirdtrack/]
I understand that
"It is not advisable to mix these two styles in the same file."
and the report doesn't even talk about how they mix, but now that
I've gotten started on the implementation inconsistencies...
Actually, despite the Report's advice against it, there seems to be
a consensus on what the meaning of mixing the two styles is, which
I'll describe below:

Sensibly, ghc, hugs and nhc98 treat begin/end{code} lines as blank
for the purposes of '>'-style comment checking (which is that
a code line and a non-blank literate comment line can't be adjacent);
this works:
[TeXBirdtrack/NoLayout]------------
...
module Main where
{main = print str
\begin{code}
;str = "string"}
\end{code}
ok

Note I didn't rely on the layout rule. This should work:
[TeXBirdtrack/AlignedLayout]------------
...
module Main where
main = print str
\begin{code}
  str = "string"
\end{code}
ok

It does in hugs and nhc98, and according to
http://hackage.haskell.org/trac/ghc/ticket/210
it does in GHC HEAD now (6.7) as well.
As another example, this doesn't work, for the same reason
that you can't start a line with '>' in a .hs file:
[TeXBirdtrack/Wrong]------------
...
module Main where
main = print str
\begin{code}
str = "string"
\end{code}
ok
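
The adjacency check described at the start of this point, with
begin/end{code} lines counted as blank, can be sketched as follows.
This is my reading of the consensus behaviour, not code from any of
the three implementations; Kind, kindsOf and adjacencyErrors are
hypothetical names.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

data Kind = Program | Blank | Comment
  deriving (Eq, Show)

-- Classify every line of a file that mixes '>' and {code} styles.
-- Assumption: delimiter lines count as blank, and every line inside
-- a \begin{code} block counts as program text.
kindsOf :: [String] -> [Kind]
kindsOf = go False
  where
    go _ [] = []
    go inCode (l : ls)
      | "\\begin{code}" `isPrefixOf` l = Blank : go True ls
      | "\\end{code}" `isPrefixOf` l   = Blank : go False ls
      | inCode                         = Program : go True ls
      | ">" `isPrefixOf` l             = Program : go False ls
      | all isSpace l                  = Blank : go False ls
      | otherwise                      = Comment : go False ls

-- 1-based numbers of lines whose successor clashes with them, i.e.
-- a program line directly next to a non-blank literate comment line.
adjacencyErrors :: [String] -> [Int]
adjacencyErrors ls =
  [ n | (n, a, b) <- zip3 [1 ..] kinds (drop 1 kinds), clash a b ]
  where
    kinds = kindsOf ls
    clash Program Comment = True
    clash Comment Program = True
    clash _ _             = False
```

Under these assumptions a '>' line directly followed by \begin{code},
and an \end{code} directly followed by comment text, both pass the
check.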

Hoping to start some discussion,
Isaac


Isaac Dupree