
The other day at work an opportunity arose where I was hoping to sneak some Haskell into the pipeline of tools used to process call detail records (CDRs). In the telecommunications industry, CDRs are used for billing. Each CDR is a single line record of 30 comma-separated values. Each line is approximately 240 characters in length. The task at hand is to replace field number 10 if a new value can be found in a hashmap which is keyed using the contents of the field. My colleague was going to write a C program (that's all he knows), but I whipped up a trivial python program instead. I was curious if a haskell version could be faster and more elegant , but I have not been able to beat my python version in either case. So, I'm curious as to how you would go about this task in Haskell. The input files are generally 300-400MB, and the hashmap will contain perhaps 20-30 items. For those that know python, here is a very simple implementation that happens to be very fast compared to my Haskell version and very short: for line in sys.stdin: fields = line.split(',') fields[9] = tgmap.get(fields[9], fields[9]) print ",".join(fields), For each line in standard input: - Splits the string on the comma: "field0,field1,...,field29" => ["field0", "field1", ..., "field29"] to obtain a list of strings. - Gets the value associated with the key of field9 from tgmap, if it does not exist, it returns a default value which is the original value. I.e., if it's not in the map, then don't replace the field. - Joins the list of fields with a comma to yield a string again which is printed out to standard output. The join method on the string is a bit odd: ",".join([1,2,3]) => "1,2,3" Here is my first Haskell attempt: import Data.ByteString.Lazy.Char8 as B hiding (map,foldr) import Data.List (map) import Data.Map as M hiding (map) -- This is just a placeholder until I actually populate the map tgmap = M.singleton (B.pack "Pete") (B.pack "Kazmier") main = B.interact $ B.unlines . map doline . B.lines where doline = B.join comma . mapIndex fixup . B.split ',' fixup i s = if i==9 then M.findWithDefault s s tgmap else s comma = B.pack "," -- f is supplied the index of the current element being processed mapIndex f xs = m f 0 xs where m f i [] = [] m f i (x:xs') = f i x : m f (i+1) xs' After talking with dons on #haskell, he cleaned my version up and produced this version which gets rid of 'if' statement and makes mapIndex stricter: import Data.ByteString.Lazy.Char8 as B hiding (map,foldr) import Data.List (map) import Data.Map as M hiding (map) -- This will be populated from a file dict = M.singleton (B.pack "Pete") (B.pack "Kazmier") main = B.interact $ B.unlines . map doline . B.lines where doline = B.join comma . mapIndex fixup . B.split ',' comma = B.singleton ',' fixup 3 s = M.findWithDefault s s dict fixup n s = s -- f is supplied the index of the current element being processed mapIndex :: (Int -> ByteString -> ByteString) -> [ByteString] -> [ByteString] mapIndex f xs = m xs 0 where m [] _ = [] m (x:xs') i = f i x : (m xs' $! i+1) That helped things a bit, but I must confess I don't understand how the strictness improved things as I had assumed things were going to be evaluated in a reasonable amount of time due to the printing of output. I thought IO was interlaced with the execution and thus I wasn't going to have to concern myself over laziness. In addition, the function is able to generate new elements of the list on demand so I thought it was a good citizen in the lazy world. Could anyone help explain? And then he came up with another version to avoid the 'unlines', but that did not that really speed things up significantly. So, with all that said, is there a better approach to this problem? Perhaps a more elegant Haskell solution? Thanks, Pete

Pete Kazmier wrote:
import Data.ByteString.Lazy.Char8 as B hiding (map,foldr) import Data.List (map) import Data.Map as M hiding (map)
-- This will be populated from a file dict = M.singleton (B.pack "Pete") (B.pack "Kazmier")
main = B.interact $ B.unlines . map doline . B.lines where doline = B.join comma . mapIndex fixup . B.split ',' comma = B.singleton ',' fixup 3 s = M.findWithDefault s s dict fixup n s = s
-- f is supplied the index of the current element being processed mapIndex :: (Int -> ByteString -> ByteString) -> [ByteString] -> [ByteString] mapIndex f xs = m xs 0 where m [] _ = [] m (x:xs') i = f i x : (m xs' $! i+1)
How about import Data.ByteString.Lazy.Char8 as B hiding (map,foldr) import Data.List (map) import Data.Map as M hiding (map) dict = M.singleton (B.pack "Pete") (B.pack "Kazmier") main = B.interact $ B.unlines . map doline . B.lines where doline = B.join comma . zipWith ($) fixup9 . B.split ',' fixup9 = fixup 9 fixup n = replicate n id ++ [\s -> M.findWithDefault s s dict] ++ repeat id Note that fixup9 is shared intentionally across different invocations of doline. The index n starts at 0. Also note that because (compare :: (Byte)String -> ..) takes time proportional to the string length, the use of Map will inevitably introduce a constant factor. But I'm still happy you didn't use arrays or hash tables (urgh!) :) In any case, tries are *the* natural data structure for (in)finite maps in functional languages, see also Ralf Hinze. Generalizing generalized tries. Journal of Functional Programming, 10(4):327-351, July 2000 http://www.informatik.uni-bonn.de/~ralf/publications/GGTries.ps.gz Regards, apfelmus

On 2006-10-01, Pete Kazmier
For those that know python, here is a very simple implementation that happens to be very fast compared to my Haskell version and very short:
for line in sys.stdin: fields = line.split(',')
Of course, this doesn't handle quoted values that contain commas, which are among the many joys* of parsing CSV files. I might just point out the existance of MissingH.Str.CSV to you, which can read and write CSV files. See http://gopher.quux.org:70/devel/missingh/html/MissingH-Str-CSV.html for details. A one-line call gives you a [[String]]. This uses Parsec, so it will not be a top-notch performer, but it is elegant ;-) * I mean "joy" sarcastically, in case it wasn't obvious.
participants (3)
-
apfelmus@quantentunnel.de
-
John Goerzen
-
Pete Kazmier