[GHC] #15553: GHC.IO.Encoding not flushing partially converted input

#15553: GHC.IO.Encoding not flushing partially converted input
-------------------------------------+-------------------------------------
           Reporter:  msakai         |             Owner:  (none)
               Type:  bug            |            Status:  new
           Priority:  normal         |         Milestone:  8.6.1
          Component:  Core           |           Version:  8.4.3
  Libraries                          |
           Keywords:                 |  Operating System:  Linux
       Architecture:                 |   Type of failure:  Incorrect result
  Unknown/Multiple                   |  at runtime
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 Conversion by `GHC.IO.Encoding` produces incomplete output for some
 encodings because it does not flush ''partially converted input'' at the
 end of the string.

 [https://manpages.debian.org/stretch/manpages-dev/iconv.3 iconv(3)]
 provides an API for this flushing:

   In each series of calls to iconv(), the last should be one with inbuf
   or *inbuf equal to NULL, in order to flush out any partially converted
   input.
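
 To make the quoted rule concrete, here is a minimal sketch that calls
 iconv(3) directly through the FFI and finishes with the flushing call.
 It is an illustration only, not code from `GHC.IO.Encoding`. It assumes a
 glibc system where `iconv_open`, `iconv` and `iconv_close` are linkable
 under exactly those names and where the EUC-JISX0213 converter is
 installed. The names `IConv`, `c_iconv_open`, `c_iconv`, `c_iconv_close`,
 `outCap` and `written` are made up for this example. The sample input is
 U+3051 encoded as UTF-8.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main (main) where

import Foreign
import Foreign.C
import Numeric (showHex)

-- Opaque stand-in for the C type iconv_t (hypothetical name).
type IConv = Ptr ()

-- Direct bindings to the C functions; assumes glibc symbol names.
foreign import ccall unsafe "iconv_open"
  c_iconv_open :: CString -> CString -> IO IConv

foreign import ccall unsafe "iconv"
  c_iconv :: IConv -> Ptr CString -> Ptr CSize -> Ptr CString -> Ptr CSize -> IO CSize

foreign import ccall unsafe "iconv_close"
  c_iconv_close :: IConv -> IO CInt

-- Size of the output buffer in bytes.
outCap :: Int
outCap = 16

main :: IO ()
main =
  withCString "EUC-JISX0213" $ \toCode ->
  withCString "UTF-8" $ \fromCode -> do
    -- iconv_open reports failure as (iconv_t)-1.
    cd <- throwErrnoIf (== intPtrToPtr (-1)) "iconv_open"
            (c_iconv_open toCode fromCode)
    let input = [0xe3, 0x81, 0x91] :: [Word8]  -- U+3051 encoded as UTF-8
    withArrayLen input $ \n inBuf ->
      allocaBytes outCap $ \outBuf ->
      with (castPtr inBuf) $ \inPtr ->
      with (fromIntegral n) $ \inLeft ->
      with outBuf $ \outPtr ->
      with (fromIntegral outCap) $ \outLeft -> do
        -- Ordinary conversion call; (size_t)-1 signals an error.
        _ <- throwErrnoIf (== maxBound) "iconv"
               (c_iconv cd inPtr inLeft outPtr outLeft)
        beforeFlush <- written outLeft
        -- Flushing call: inbuf is NULL, outbuf is a valid buffer.
        _ <- throwErrnoIf (== maxBound) "iconv (flush)"
               (c_iconv cd nullPtr nullPtr outPtr outLeft)
        afterFlush <- written outLeft
        bytes <- peekArray afterFlush (castPtr outBuf :: Ptr Word8)
        putStrLn ("bytes written before flush: " ++ show beforeFlush)
        putStrLn ("bytes written after flush:  " ++ show afterFlush)
        putStrLn ("output: " ++ unwords (map (\b -> showHex b "") bytes))
    _ <- c_iconv_close cd
    return ()
  where
    -- Number of output bytes produced so far, derived from outbytesleft.
    written :: Ptr CSize -> IO Int
    written p = (\left -> outCap - fromIntegral left) <$> peek p
```

 If the converter really does hold back U+3051 until it knows what
 follows, the first count should be smaller than the second. The exact
 numbers depend on the iconv implementation, so none are claimed here.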

 `GHC.IO.Encoding`, however, does not perform this flush, which can leave
 the conversion result incomplete. I found two cases in which it actually
 produces incomplete output, but there might be more.

 = Case 1: EUC-JISX0213

 For example, the following code is expected to output the two bytes
 `0xa4 0xb1`, but it outputs nothing.

 {{{#!hs
 import System.IO
 main = do
   enc <- mkTextEncoding "EUC-JISX0213"
   withFile "test.txt" WriteMode $ \h -> hSetEncoding h enc >> hPutStr h "\x3051"
 }}}

 The problem arises from the following mapping between Unicode and
 EUC-JISX0213.

 ||Unicode||EUC-JISX0213||
 ||U+3051 U+309A||0xa4 0xfa||
 ||U+3051||0xa4 0xb1||

 After seeing the code point U+3051, the converter cannot determine which
 of the two byte sequences to output until it sees either the next
 character or ''the end of the string''. Because `GHC.IO.Encoding` does not
 call the above-mentioned ''flushing'' API, the converter never learns that
 the string has ended.

 = Case 2: ISO-2022-JP

 Similarly, the following code is expected to output the byte sequence
 `0x1b 0x24 0x42` `0x24 0x22` `0x1b 0x28 0x42`, but the last three bytes
 `0x1b 0x28 0x42` are not produced.

 {{{#!hs
 import System.IO
 main = do
   enc <- mkTextEncoding "ISO-2022-JP"
   withFile "test.txt" WriteMode $ \h -> hSetEncoding h enc >> hPutStr h "\x3042"
 }}}

 ISO-2022-JP is a stateful encoding, and
 [https://www.ietf.org/rfc/rfc1468.txt RFC 1468] requires that the state be
 reset to the initial state at the end of the string. The missing three
 bytes `0x1b 0x28 0x42` are the escape sequence for that purpose. Again,
 because `GHC.IO.Encoding` does not call the above-mentioned ''flushing''
 API, the converter cannot recognize the end of the string and cannot
 reset the state.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/15553
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler