
hi, i have files with lots (millions of rows) of data like this:

    chr6 chr10 96.96 3392 101 2 79030508 79033899 4160024 4163413 0.0 5894
    chr6 chr10 93.19 4098 228 13 117152751 117156826 11355389 11359457 0.0 5886
    chr6 chr10 95.82 3445 130 5 112422073 112425513 7785396 7788830 0.0 5666

and i'd like to read it into a type like this:

    data Blast = Blast { query    :: S.ByteString
                       , subject  :: S.ByteString
                       , hitlen   :: Int
                       , mismatch :: Int
                       , gaps     :: Int
                       , qstart   :: Int
                       , qstop    :: Int
                       , sstart   :: Int
                       , sstop    :: Int
                       , pctid    :: Double
                       , evalue   :: Double
                       , bitscore :: Double
                       } deriving (Show)

where each of those fields corresponds to a column in the file. in python, i do something like:

    line = [fn(col) for fn, col in zip([str, str, int, int, int, int, int, int, int, float, float], sline.split("\t"))]

what's a fast, simple way to do this in haskell? is it something like:

    instance Read Blast where
        readsPrec s = ?????

any pointers on where to look for simple examples of this type of parsing would be much appreciated.

thanks,
-brentp

My suggestion would be to look into writing a parser (via parsec) to handle this. Parsec is fairly easy to learn, and since your data is in a pretty simple format, the parser won't be hard to write. Parsec will then give you a parser which you can run on the file; it'll catch parse errors, and it's all-around very lovely to use.

There is a chapter of Real World Haskell on the subject, and I'm sure we'll be happy to help with whatever isn't covered.

/Joe

On Dec 7, 2009, at 10:43 PM, Brent Pedersen wrote:
what's a fast, simple way to do this in haskell?
is it something like:
instance Read Blast where readsPrec s = ?????
any pointers on where to look for simple examples of this type of parsing would be much appreciated.
thanks, -brentp
_______________________________________________
Beginners mailing list
Beginners@haskell.org
http://www.haskell.org/mailman/listinfo/beginners
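
As a rough illustration of the Parsec route suggested above, here is a minimal sketch rather than a definitive implementation. It makes several assumptions that are not in the thread: input is a plain String instead of a ByteString, the record fields are listed in file-column order (unlike the declaration in the question), exponent notation such as 1e-50 is not handled, and the names str, int, number, blast, blasts and parseBlasts are made up for this example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumption: fields are listed in file-column order, and String
-- stands in for S.ByteString to keep the sketch short.
data Blast = Blast
  { query, subject :: String
  , pctid :: Double
  , hitlen, mismatch, gaps, qstart, qstop, sstart, sstop :: Int
  , evalue, bitscore :: Double
  } deriving (Show, Eq)

-- A text column: everything up to the next tab or newline.
str :: Parser String
str = many1 (noneOf "\t\n")

-- An unsigned integer column.
int :: Parser Int
int = read <$> many1 digit

-- An unsigned decimal column such as 96.96 or 5894 (no exponent support).
number :: Parser Double
number = do
  whole <- many1 digit
  frac  <- option "" ((:) <$> char '.' <*> many1 digit)
  return (read (whole ++ frac))

-- One row: twelve columns separated by single tabs.
blast :: Parser Blast
blast = Blast
  <$> str <* tab <*> str <* tab <*> number <* tab
  <*> int <* tab <*> int <* tab <*> int <* tab
  <*> int <* tab <*> int <* tab <*> int <* tab <*> int <* tab
  <*> number <* tab <*> number

-- Whole input: rows separated (and optionally terminated) by newlines.
blasts :: Parser [Blast]
blasts = blast `sepEndBy` newline <* eof

-- "blast-input" is only a label used in error messages.
parseBlasts :: String -> Either ParseError [Blast]
parseBlasts = parse blasts "blast-input"
```

Calling parseBlasts on the contents of a file gives either a ParseError that points at the offending line and column, or the list of rows. Whether this copes with millions of rows depends on how the input is read; the sketch makes no attempt at streaming.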

Joe Fredette wrote:
My suggestion would be to look into writing a parser (via parsec) to handle this. Parsec is fairly easy to learn, and since your data is a pretty simple format, the parser won't be hard to write.
While I'm all for using a proper parser, Brent Pedersen notes that his data will have millions of rows, so that Parsec is likely to run into memory problems. I think something along the lines of

    import Data.ByteString.Lazy.Char8 as B

    parse = map (zipWith ($) formats . B.split '\t') . B.lines
        where
        formats = [str, str, int, int, int, int, int, int, int, float, float]
        int     = fst . fromJust . readInt
        float   = \s -> read (unpack s) :: Double
        str     = id

will do just fine. (The implementation of float is a kludge, I think there's something on hackage for that, though?)

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

Hi all,
On Tue, Dec 8, 2009 at 5:29 AM, Heinrich Apfelmus wrote:
I think something along the lines of

    import Data.ByteString.Lazy.Char8 as B

    parse = map (zipWith ($) formats . B.split '\t') . B.lines
        where
        formats = [str, str, int, int, int, int, int, int, int, float, float]
        int     = fst . fromJust . readInt
        float   = \s -> read (unpack s) :: Double
        str     = id

I've been looking at this example and I can't figure out how it works. Seems to me that "formats" is a list of functions that return different types. How does this work?

Patrick
--
=====================
Patrick LeBoutillier
Rosemère, Québec, Canada
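
To make the problem Patrick points out concrete: every element of a Haskell list must have the same type, so functions returning ByteString, Int and Double cannot share one list. One common workaround, shown here only as a sketch and not something proposed in the thread, is to wrap the results in a single sum type. The names Field, S, I, D and parseRow are hypothetical, and only the first four columns are covered to keep it short.

```haskell
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Maybe (fromJust)

-- Hypothetical wrapper type: one constructor per column type.
data Field = S B.ByteString | I Int | D Double
  deriving (Show, Eq)

-- Every converter now returns Field, so the list is homogeneous.
formats :: [B.ByteString -> Field]
formats = [str, str, float, int]
  where
    str   = S
    int   = I . fst . fromJust . B.readInt   -- partial: errors on non-numeric input
    float = D . read . B.unpack              -- the same String-based kludge as in the thread

-- Apply the i-th converter to the i-th tab-separated column.
parseRow :: B.ByteString -> [Field]
parseRow = zipWith ($) formats . B.split '\t'
```

The price of this design is that the column types are no longer visible in the result type: each consumer has to pattern match on Field again.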

Patrick LeBoutillier wrote:
I've been looking at this example and I can't figure out how it works. Seems to me that "formats" is a list of functions that return different types. How does this work?
Oops. It doesn't. :D Mea culpa, the functions having different types is indeed a problem.

Fortunately, the trick from Olivier Danvy's "Functional Unparsing"

    http://www.brics.dk/RS/98/12/BRICS-RS-98-12.pdf

applies. Here's the code:

    import qualified Data.ByteString.Lazy.Char8 as B
    import Data.Maybe

    parse = map (convert format . B.split '\t') . B.lines
        where
        format = str . str . int . int . int . int . int . int . int . float . float
        int    = lift $ fst . fromJust . B.readInt
        float  = lift $ \s -> read (B.unpack s) :: Double
        str    = lift $ id

    -- apply f to the first column, let the continuation k handle the rest
    lift :: (a -> b) -> ([a] -> c) -> ([a] -> (b,c))
    lift f k (x:xs) = (f x, k xs)

    -- run a format by supplying the final continuation
    convert k = k nil
        where
        nil [] = ()

This time, I've also tested it in GHCi. Try

    :type parse

to see the magic type.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
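
The nested-tuple result of the code above still has to be turned into a Blast value. A more pedestrian alternative, which is not the continuation trick from this message, is to pattern match on the twelve columns and build the record directly, returning Nothing for malformed rows. The sketch below assumes the lazy Char8 ByteString in place of S.ByteString and takes the columns in the order of the sample data (pctid third); readIntField, readDoubleField, parseLine and parseFile are names invented for the example.

```haskell
import qualified Data.ByteString.Lazy.Char8 as B

-- Record from the original question, with the lazy Char8 ByteString
-- standing in for S.ByteString.
data Blast = Blast
  { query, subject :: B.ByteString
  , hitlen, mismatch, gaps, qstart, qstop, sstart, sstop :: Int
  , pctid, evalue, bitscore :: Double
  } deriving (Show, Eq)

-- Accept an Int only if the whole field is consumed.
readIntField :: B.ByteString -> Maybe Int
readIntField s = case B.readInt s of
  Just (n, rest) | B.null rest -> Just n
  _                            -> Nothing

-- Same kludge as in the thread: go through String, here via reads.
readDoubleField :: B.ByteString -> Maybe Double
readDoubleField s = case reads (B.unpack s) of
  [(x, "")] -> Just x
  _         -> Nothing

-- One line of input; Nothing if the column count or any field is wrong.
parseLine :: B.ByteString -> Maybe Blast
parseLine line = case B.split '\t' line of
  [q, s, pid, len, mm, g, qs, qe, ss, se, ev, bs] -> do
    [len', mm', g', qs', qe', ss', se'] <-
      mapM readIntField [len, mm, g, qs, qe, ss, se]
    [pid', ev', bs'] <- mapM readDoubleField [pid, ev, bs]
    return (Blast { query = q, subject = s
                  , hitlen = len', mismatch = mm', gaps = g'
                  , qstart = qs', qstop = qe', sstart = ss', sstop = se'
                  , pctid = pid', evalue = ev', bitscore = bs' })
  _ -> Nothing

-- Lazily parse every line of a file's contents.
parseFile :: B.ByteString -> [Maybe Blast]
parseFile = map parseLine . B.lines
```

Something like `fmap parseFile (B.readFile path)` would then process a file lazily, line by line; `path` here is a placeholder. Going through String for the Double columns is likely to be the slow part on millions of rows.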
participants (4)
- Brent Pedersen
- Heinrich Apfelmus
- Joe Fredette
- Patrick LeBoutillier