
On Tue, Jan 31, 2012 at 6:05 AM, Marc Weber wrote:
I didn't say that I tried your code. I gave the enumerator package a try, counting lines, which I expected to behave similarly to conduits because both serve a similar purpose. Then I hit the "sourceFile returns chunked lines" issue (reported it, got fixed) - ....
Anyway: My log files are a json dictionary on each line:
    { id : "foo", ... }
    { id : "bar", ... }
Now how do I use the conduit package to split a "chunked" file into lines? Or should I create a new parser "many json >> newline" ?
Currently there are two solutions. The first one is what I wrote earlier on this thread:

    jsonLines :: C.Resource m => C.Conduit B.ByteString m Value
    jsonLines = C.sequenceSink () $ do
      val <- CA.sinkParser json'
      CB.dropWhile isSpace_w8
      return $ C.Emit () [val]

This conduit will run the json' parser (from aeson) and then drop any whitespace after that. Note that it will correctly parse all of your files but will also parse some files that don't conform to your specification. I assume that's fine.

The other solution is going to be released with conduit 0.2, probably today. There's a lines conduit that splits the file into lines, so you could write jsonLines above as:

    mapJson :: C.Resource m => C.Conduit B.ByteString m Value
    mapJson = C.sequenceSink () $ do
      val <- CA.sinkParser json'
      return $ C.Emit () [val]

which doesn't need to care about newlines, and then change main to

    main = do
      ...
      ret <- forM fileList $ \fp -> do
        C.runResourceT $
          CB.sourceFile fp C.$=
          CB.lines C.$=      -- new line is here
          mapJson C.$=
          CL.mapM processJson C.$$
          CL.consume
      print ret

I don't know which solution would be faster. Either way, both solutions will probably be faster with the new conduit 0.2.
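To make the "chunked file versus lines" point concrete, here is a minimal pure sketch of the re-splitting that a lines conduit has to perform. It is not the conduit implementation: chunksToLines is a hypothetical helper, it works on String rather than ByteString, and the sample chunks are made up. It only assumes that chunk boundaries can fall anywhere, including in the middle of a line.

```haskell
module Main where

-- | Re-split arbitrarily chunked input into complete lines.
-- The first argument is the unfinished tail carried over from the
-- previous chunk (hypothetical helper, not part of conduit).
chunksToLines :: String -> [String] -> [String]
chunksToLines leftover [] = [leftover | not (null leftover)]
chunksToLines leftover (c : cs) =
  case break (== '\n') (leftover ++ c) of
    -- No newline yet: keep accumulating into the next chunk.
    (partial, []) -> chunksToLines partial cs
    -- Found a newline: emit the line, re-examine the remainder.
    (line, _ : rest) -> line : chunksToLines "" (rest : cs)

main :: IO ()
main = do
  -- Made-up chunks that cut each JSON dictionary in the middle.
  let chunks = ["{ \"id\": \"fo", "o\" }\n{ \"id\"", ": \"bar\" }\n"]
  mapM_ putStrLn (chunksToLines "" chunks)
```

Each emitted line can then be handed to a parser that no longer has to know about newlines, which is the division of labour between CB.lines and mapJson described above.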
Except that I think my processJson for this test should look like this, because I want to count how often the clients queried the server. Probably I should also be using CL.fold as shown in the test cases of conduit. If you tell me how you'd cope with the "one json dict on each line" issue I'll try to benchmark this solution as well.
This issue was already being coped with in my previous e-mail =).
    -- probably existing library functions can be used here ..
    processJson :: (M.Map T.Text Int) -> Value -> (M.Map T.Text Int)
    processJson m value = case value of
      Ae.Object hash_map ->
        case HMS.lookup (T.pack "id") hash_map of
          Just id_o -> case id_o of
            Ae.String id -> M.insertWith' (+) id 1 m
            _ -> m
          _ -> m
      _ -> m
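processJson takes the accumulator first and the element second, which is the shape of a left-fold step function. A minimal sketch of the same counting idea on plain lists, assuming the ids have already been extracted as Strings (countId and the sample ids are made up for illustration, and Data.Map.Strict's insertWith stands in for insertWith'):

```haskell
module Main where

import Data.List (foldl')
import qualified Data.Map.Strict as M

-- | Fold step: bump the counter for one id. Same argument order as
-- processJson (accumulator first, element second).
countId :: M.Map String Int -> String -> M.Map String Int
countId m ident = M.insertWith (+) ident 1 m

main :: IO ()
main = do
  let ids = ["foo", "bar", "foo"]          -- made-up ids
      counts = foldl' countId M.empty ids  -- start from an empty map
  print (M.toList counts)  -- prints [("bar",1),("foo",2)]
```

A fold over a stream works the same way: it needs the step function and an initial accumulator, here M.empty.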
Looks like the perfect job for CL.fold. Just change the last three lines in main from

    ... C.$=
    CL.mapM processJson C.$$
    CL.consume

into

    ... C.$$
    CL.fold processJson M.empty

and you should be ready to go.

Cheers!

-- Felipe.