Proposal for a Unicode-safe layout rule

I propose that Haskell's layout rule be changed in the following simple way: * We identify a set of "layout-unsafe" Unicode characters which may occupy something other than one column in some fixed-width fonts. This would include (among other things) combining characters and full-width CJK characters. Explicit Unicode escape sequences, if any, should also count as layout-unsafe. Anything doubtful should be layout-unsafe. * A special unknown-column value is added to the set of possible column positions. * All characters following any layout-unsafe character on a source line are taken to be at position unknown-column. * Any time a layout decision requires comparing two column positions and one or both of them is unknown-column, the lexer will abort with a helpful error message. If TAB is treated as layout-unsafe (as it should be) then this rule change will break some existing code, but only code that deserves to be broken. If TAB is treated specially as it currently is, this change should not break any existing code. More importantly, the change is safe in the sense that any program which is correct under the new rule has the same meaning as it did under the old rule. This is true regardless of what characters end up in the set layout-unsafe. -- Ben

If TAB is treated as layout-unsafe (as it should be) then this rule change will break some existing code, but only code that deserves to be broken. If TAB is treated specially as it currently is, this change should not break any existing code.
Alternative: make the column value a partial order: identical white space is equal indent, additional white space increases the indent and less decreases it, but different white space is incomparable. In other words, you can use whatever mix of spaces and tabs you like, as long as you use it consistently. Eight spaces, four spaces, and one tab are all different, but if you always use one tab you're fine. Like this: -- suggested whitespace partial order for layout import Char import List -- |Opaque indent value type data Indent = Indent String deriving Show -- |Given a line, compute the indent of that line indentOf :: String -> Indent indentOf s = Indent (takeWhile isSpace s) -- |Compare two indents (NB: partial order) cmpIndent :: Indent -> Indent -> Maybe Ordering cmpIndent (Indent s1) (Indent s2) | s1 == s2 = Just EQ | isPrefixOf s1 s2 = Just LT | isPrefixOf s2 s1 = Just GT | otherwise = Nothing This way, it doesn't matter how wide any whitespace character is, or even what tab width you use. The system doesn't even need to know - and any Unicode whitespace character can be used, not just the Latin-1 "isSpace" ones. Comments? --KW 8-)

-- |Opaque indent value type data Indent = Indent String deriving Show
Oops... please delete "deriving Show"; that was debugging code I left in by accident.
--KW 8-)
--
Keith Wansbrough

Keith Wansbrough wrote:
If TAB is treated as layout-unsafe (as it should be) then this rule change will break some existing code, but only code that deserves to be broken. If TAB is treated specially as it currently is, this change should not break any existing code.
Alternative: make the column value a partial order: identical white space is equal indent, additional white space increases the indent and less decreases it, but different white space is incomparable. In other words, you can use whatever mix of spaces and tabs you like, as long as you use it consistently. Eight spaces, four spaces, and one tab are all different, but if you always use one tab you're fine. Like this:
-- suggested whitespace partial order for layout
import Char import List
-- |Opaque indent value type data Indent = Indent String deriving Show
-- |Given a line, compute the indent of that line indentOf :: String -> Indent indentOf s = Indent (takeWhile isSpace s)
-- |Compare two indents (NB: partial order) cmpIndent :: Indent -> Indent -> Maybe Ordering cmpIndent (Indent s1) (Indent s2) | s1 == s2 = Just EQ | isPrefixOf s1 s2 = Just LT | isPrefixOf s2 s1 = Just GT | otherwise = Nothing
This way, it doesn't matter how wide any whitespace character is, or even what tab width you use. The system doesn't even need to know - and any Unicode whitespace character can be used, not just the Latin-1 "isSpace" ones.
But note that your algorithm doesn't specify what would happen if you
had used tabs instead of spaces on the last three lines above. If you
allow tabs, you need to know exactly what they represent.
There is nothing wrong with the current layout rules[1]; they don't
need to change.
[1] Regarding tabs, that is; there *is* a problem when you start
dealing with characters which aren't necessarily "one character cell"
wide (e.g. CJK).
--
Glynn Clements

But note that your algorithm doesn't specify what would happen if you had used tabs instead of spaces on the last three lines above. If you allow tabs, you need to know exactly what they represent.
You're right. I was confused...
--KW 8-)
--
Keith Wansbrough

On Mon, 2003-08-04 at 02:16, Keith Wansbrough wrote:
If TAB is treated as layout-unsafe (as it should be) then this rule change will break some existing code, but only code that deserves to be broken. If TAB is treated specially as it currently is, this change should not break any existing code.
Alternative: make the column value a partial order: identical white space is equal indent, additional white space increases the indent and less decreases it, but different white space is incomparable. In other words, you can use whatever mix of spaces and tabs you like, as long as you use it consistently. Eight spaces, four spaces, and one tab are all different, but if you always use one tab you're fine. Like this:
-- suggested whitespace partial order for layout
import Char import List
-- |Opaque indent value type data Indent = Indent String deriving Show
-- |Given a line, compute the indent of that line indentOf :: String -> Indent indentOf s = Indent (takeWhile isSpace s)
-- |Compare two indents (NB: partial order) cmpIndent :: Indent -> Indent -> Maybe Ordering cmpIndent (Indent s1) (Indent s2) | s1 == s2 = Just EQ | isPrefixOf s1 s2 = Just LT | isPrefixOf s2 s1 = Just GT | otherwise = Nothing
This way, it doesn't matter how wide any whitespace character is, or even what tab width you use. The system doesn't even need to know - and any Unicode whitespace character can be used, not just the Latin-1 "isSpace" ones.
Comments?
Would f x = let y = 1 {- a -} z = 2 in 3 be legal (I believe it is currently)? What if that 'a' were a Unicode character? Carl Witty
participants (4)
-
Ben Rudiak-Gould
-
Carl Witty
-
Glynn Clements
-
Keith Wansbrough