Low level problem with Data.Text.IO

Hello,
I have a strange (low level) problem with the Data.Text library.
Running the simple program below on a certain text file causes a low level
error.
runghc ./ReadFiles.hs testfile
ghc(16402,0xb0103000) malloc: *** error for object 0x2501710: pointer being
freed was not allocated
*** set a breakpoint in malloc_error_break to debug
ghc(16402,0xb0103000) malloc: *** error for object 0x2501710: pointer being
freed was not allocated
*** set a breakpoint in malloc_error_break to debug
ReadFiles.hs: testfile: hGetContents: invalid argument (Illegal byte
sequence)

module Main where

import qualified Data.Text.IO as TI
import qualified Data.Text as T
import System                     -- provides getArgs on the GHC of that era

main :: IO ()
main = do
  args <- getArgs
  let fileName:_ = args
  txt <- TI.readFile fileName
  putStrLn $ show txt

Unfortunately I can't post the specific data file. But according to the
file(1) program, its encoding is:

Non-ISO extended-ASCII text, with very long lines, with CRLF, LF line terminators

How can I debug this problem? What would you guys do? Try gdb?
thanks in advance,
Pieter
--
Pieter Laeremans

On Wed, Aug 25, 2010 at 1:34 PM, Pieter Laeremans wrote:
> I have a strange (low level) problem with the Data.Text library.

Thanks for bringing this up.

> Running the simple program below on a certain text file causes a low level error.

This only happens on a single file? Can you mail it to me directly, perhaps? I might be able to reproduce it based on the "Illegal byte sequence" error you are seeing, which makes me think it's not a UTF-8-encoded file, but that low-level crash should of course not happen. What version of GHC and the text library are you using?
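
In the meantime, one possible workaround (untested, and only a sketch): hGetContents decodes the file using your locale's encoding, so you can open the handle yourself and force an 8-bit encoding such as latin1, which accepts every byte. Assuming GHC 6.12 or later, where System.IO exports hSetEncoding and latin1:

module Main where

import System.Environment (getArgs)
import System.IO (IOMode(ReadMode), hSetEncoding, latin1, openFile)
import qualified Data.Text.IO as TI

main :: IO ()
main = do
  fileName:_ <- getArgs
  h <- openFile fileName ReadMode
  -- Force a total 8-bit decoding instead of the locale's (often UTF-8),
  -- so "Illegal byte sequence" cannot occur; each byte maps to U+00..U+FF.
  hSetEncoding h latin1
  txt <- TI.hGetContents h
  putStrLn $ show txt

This only sidesteps the decoding error, of course; it doesn't explain the malloc message.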

For debugging the error, we'll need to know what your locale's encoding is. You can see this by echoing the $LANG environment variable. For example:

$ echo $LANG
en_US.UTF-8

means my encoding is UTF-8. Haskell doesn't currently have any decoding libraries with good error handling (that I know of), so you might need to use an external library or program. My preference is Python, since it has very descriptive errors. I'll load a file, attempt to decode it with my locale encoding, and then see what errors pop up:

$ python
>>> content = open("testfile", "rb").read()
>>> text = content.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9d in position 1: unexpected code byte
The exact error will help us generate a test file to reproduce the problem.
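
If you'd rather do the same check without leaving Haskell: later releases of the text package export a pure decoder that returns the failure as a value instead of throwing. A minimal sketch, assuming Data.Text.Encoding.decodeUtf8' is available and using "testfile" as a placeholder path:

module Main where

import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  bytes <- B.readFile "testfile"   -- raw bytes, no decoding yet
  case TE.decodeUtf8' bytes of     -- pure UTF-8 decode: Left on invalid input
    Left err -> putStrLn ("Not valid UTF-8: " ++ show err)
    Right _  -> putStrLn "File decodes cleanly as UTF-8."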
If you don't see any error, then the bug will be more difficult to track down. Compile your program into a binary (ghc --make ReadFiles.hs) and then run it with gdb, setting a breakpoint in the malloc_error_break procedure:
$ ghc --make ReadFiles.hs
$ gdb ./ReadFiles
(gdb) break malloc_error_break
(gdb) run testfile
... program runs ...
BREAKPOINT
(gdb) bt

On Wed, Aug 25, 2010 at 1:34 PM, Pieter Laeremans wrote:
> I have a strange (low level) problem with the Data.Text library.

Thanks again for your bug report, Pieter. I've reproduced it, written a HUnit test for it, fixed it (the HUnit test is to ensure that it stays fixed), and pushed the fix to the main darcs repo (http://code.haskell.org/text). The next version of text is almost ready to release. Until then, building and running from darcs should be fine.
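
As a rough illustration of the shape such an HUnit regression test can take (this is not the actual test in the text repository, and the 0x9d byte is only a guess taken from the Python traceback above):

module Main where

import Control.Exception (IOException, try)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import System.IO (IOMode(ReadMode), hSetEncoding, openFile, utf8)
import Test.HUnit

-- Bytes that are not valid UTF-8.
badBytes :: B.ByteString
badBytes = B.pack [0x68, 0x65, 0x9d, 0x6c, 0x6f]

-- Reading those bytes as UTF-8 should surface an ordinary IOException
-- ("invalid byte sequence"), never a low-level malloc abort.
readInvalidUtf8 :: Test
readInvalidUtf8 = TestCase $ do
  let path = "hunit-invalid-utf8.tmp"
  B.writeFile path badBytes
  h <- openFile path ReadMode
  hSetEncoding h utf8              -- force UTF-8 regardless of the locale
  result <- try (TI.hGetContents h) :: IO (Either IOException T.Text)
  case result of
    Left _  -> return ()           -- expected: a clean decoding error
    Right t -> assertFailure ("unexpectedly decoded: " ++ show t)

main :: IO ()
main = do
  _ <- runTestTT (TestList [readInvalidUtf8])
  return ()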