
Greetings, When looking at the GHC lexer (Lexer.x), there's:
$unispace    = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
$whitechar   = [\ \n\r\f\v $unispace]
$white_no_nl = $whitechar # \n
$tab         = \t
Scrolling down to alexGetChar and alexGetChar', we see the comments:
-- backwards compatibility for Alex 2.x
alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
-- This version does not squash unicode characters, it is used when
-- lexing strings.
alexGetChar' :: AlexInput -> Maybe (Char,AlexInput)
What's the reason for these? I was under the impression that since 3.0, Alex has natively supported Unicode. Is it just dead code? Could all the hex $uni* macros be removed? If not, why not?

--
Mateusz K.
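For context, the trick that Lexer.x comment alludes to can be sketched roughly as follows. This is illustrative code, not GHC's actual implementation: the function name `squash` and the specific class codes are hypothetical stand-ins. The idea is that Alex 2.x only saw 8-bit input, so each non-ASCII character is replaced by a single stand-in byte chosen by its character class, and fake macros such as `$unispace = \x05` match those stand-in bytes.

```haskell
import Data.Char (isAlpha, isNumber, isSpace, isSymbol)

-- Illustrative sketch of the "squashing" trick behind $unispace = \x05.
-- Every non-ASCII character is collapsed to one of a handful of
-- stand-in bytes, so the DFA only ever sees a small alphabet.
squash :: Char -> Char
squash c
  | c <= '\x7f' = c      -- ASCII passes through unchanged
  | isSpace c   = '\x05' -- any Unicode space matches $unispace
  | isNumber c  = '\x03'
  | isAlpha c   = '\x01'
  | isSymbol c  = '\x04'
  | otherwise   = '\x02'
```

This also explains why a second, non-squashing `alexGetChar'` is needed for string literals: inside a string the lexer must keep the real character, not the class code.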

You're probably right; this could be regarded as dead code for GHC 7.8, especially since recent versions of Alex and Happy are required even to build GHC HEAD!
On Tue, Jan 7, 2014 at 2:25 AM, Mateusz Kowalczyk
Greetings, When looking at the GHC lexer (Lexer.x), there's [...]

_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://www.haskell.org/mailman/listinfo/ghc-devs

Hi,
I was recently looking at this code to see how the lexer decides that a
character is a letter, space, etc. The problem is that with Unicode
there are hundreds of thousands of characters that are declared to be
alphanumeric. Even if they are compressed into a regular expression
with a list of ranges, there would still be ~390 ranges. The GHC lexer
avoids hardcoding these ranges by calling isSpace, isAlpha, etc. and
then converting the result to a code. Ideally, Alex would have
predefined macros corresponding to the Unicode categories, but for now
you have to either hardcode the ranges with huge regular expressions
or use the workaround used in GHC. Is there any other solution?
Regards,
Krasimir
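Krasimir's "~390 ranges" estimate can be checked with a small script like the one below. This is a rough sketch, not anything from GHC: it scans the whole code space with Data.Char and counts maximal runs of alphanumeric characters. The exact count depends on the Unicode tables shipped with your version of base.

```haskell
import Data.Char (isAlphaNum)

-- Count maximal contiguous ranges of alphanumeric characters
-- across the entire Unicode code space ('\NUL' .. '\x10FFFF').
-- Each False->True transition of the predicate starts a new range.
alphaNumRanges :: Int
alphaNumRanges =
  length [ () | (prev, cur) <- zip (False : bits) bits
              , not prev && cur ]
  where
    bits = map isAlphaNum ['\NUL' .. maxBound]

main :: IO ()
main = print alphaNumRanges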
2014/1/7 Carter Schonwald
you're probably right, this could be regarded as dead code for ghc 7.8 [...]

Krasimir is right, it would be hard to use Alex's built-in Unicode support because we have to automatically generate the character classes from the Unicode spec somehow. Probably Alex ought to include these as built-in macros, but right now it doesn't.

Even if we did have access to the right regular expressions, I'm slightly concerned that the generated state machine might be enormous.

Cheers,
Simon

On 07/01/2014 08:26, Krasimir Angelov wrote:
Hi, I was recently looking at this code to see how the lexer decides that a character is a letter, space, etc. [...]

On 07/01/14 14:38, Simon Marlow wrote:
Krasimir is right, it would be hard to use Alex's built-in Unicode support because we have to automatically generate the character classes from the Unicode spec somehow. [...]
Ah, I think I understand now. If this is the case, at least ‘alexGetChar’ could be removed, right? Is Alex 2.x compatibility necessary for any reason whatsoever?

--
Mateusz K.

On 07/01/2014 18:18, Mateusz Kowalczyk wrote:
Ah, I think I understand now. If this is the case, at least the ‘alexGetChar’ could be removed, right? Is Alex 2.x compatibility necessary for any reason whatsoever?
Yes, the backwards compatibility could be removed now that we require a very recent version of Alex.

Cheers,
Simon
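For readers wondering what removing the shim leaves behind: Alex 3.x generates lexers that consume bytes via an `alexGetByte` function rather than characters via `alexGetChar`. A minimal sketch of that interface is below; the `AlexInput` type here is a hypothetical stand-in (GHC's real one carries a StringBuffer and a source location), and the class code `0x05` is illustrative. Only the Char-based Alex 2.x shim becomes dead code; the class-squashing itself survives inside the byte-producing function.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical stand-in for GHC's real AlexInput:
-- (previous character, remaining input).
type AlexInput = (Char, [Char])

-- The interface Alex 3.x requires: produce one byte at a time.
-- ASCII is passed through; non-ASCII is squashed to a class byte
-- (0x05 here stands in for the full classification in Lexer.x).
alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte (_, [])       = Nothing
alexGetByte (_, c : rest) = Just (byteOf c, (c, rest))
  where
    byteOf ch
      | ch <= '\x7f' = fromIntegral (ord ch)
      | otherwise    = 0x05
```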
participants (4)

- Carter Schonwald
- Krasimir Angelov
- Mateusz Kowalczyk
- Simon Marlow