
Hi. I'm still working on the Netflix Prize; now I have managed to implement a parsing function for the qualifying data set (or quiz set). The quiz set is in the format: 1: 10 20 30 2: 100 200 3: 1000 2000 3000 4000 5000 Where 1, 2, 3 are movie ids, and the others are user ids. The parsing program is at: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2300 The program reads the file using lazy IO. One of the feature I want is, for the quiz function, to be a *good producer*. I'm quite satisfied with the result (this is the first "complex" parsing function I have written in Haskell), and I have also managed to avoid the use of an accumulator. However I'm interested to know it there is a more efficient solution. The qualifying.txt file is 51MB, 2834601 lines. On my laptop, the function performance are: real 1m14.117s user 0m2.496s sys 0m0.136s CPU usage is about 3%, system load is about 0.20, memory usage is 4956 KB. What I'm worried about is: quiz' ((id, ":") : l) = (id, quiz'' l) : quiz' l quiz' ((id, _) : l) = quiz' l the problem here is that the same elements in the list are processed multiple times. I have written alternate versions. The first using foldl http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2303 (that, however, builds the entire data structure in memory, and in reverse order) The latter using foldr http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2304 (that, however, is incorrect and I'm unable to fix). The performances of the foldr version are very similar to the performances of the first implementation (it make use, however, of 3704 KB, and it is about 3 seconds faster). P.S: the expected result for the sample quiz set I have posted is: [(1,[10,20,30]),(2,[100,200]),(3,[1000,2000,3000,4000,5000])] The foldl version produces: [(3,[5000,4000,3000,2000,1000]),(2,[200,100]),(1,[30,20,10])] The foldr version produces: [(1,[]),(2,[10,20,30]),(3,[100,200]),(5000,[1000,2000,3000,4000])] Thanks Manlio Perillo

2009/3/11 Manlio Perillo
Hi.
I'm still working on the Netflix Prize; now I have managed to implement a parsing function for the qualifying data set (or quiz set).
The quiz set is in the format:
1: 10 20 30 2: 100 200 3: 1000 2000 3000 4000 5000
Where 1, 2, 3 are movie ids, and the others are user ids.
The parsing program is at: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2300
The program reads the file using lazy IO. One of the feature I want is, for the quiz function, to be a *good producer*.
I'm quite satisfied with the result (this is the first "complex" parsing function I have written in Haskell), and I have also managed to avoid the use of an accumulator.
However I'm interested to know it there is a more efficient solution.
The qualifying.txt file is 51MB, 2834601 lines.
On my laptop, the function performance are:
real 1m14.117s user 0m2.496s sys 0m0.136s
CPU usage is about 3%, system load is about 0.20, memory usage is 4956 KB.
What I'm worried about is:
quiz' ((id, ":") : l) = (id, quiz'' l) : quiz' l quiz' ((id, _) : l) = quiz' l
the problem here is that the same elements in the list are processed multiple times.
I have written alternate versions. The first using foldl http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2303 (that, however, builds the entire data structure in memory, and in reverse order)
The latter using foldr http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2304 (that, however, is incorrect and I'm unable to fix).
Hi, I suggest you try an alternative strategy. That altenrative strategy is twofold, just like you have quiz' and quiz'. This alternate strategy avoid pattern matching on strings (which would be cumbersome for a bit more complex syntax). So you have to write two functions : one try to parse an 'id:' from the string. It either succeed and returns the Int value of the id, or it fails. You can use Either to encode success and failure. Furthermore that function has to return the remaining string (which is the same as received upon failure). The second function do a similar thing, this time failing on 'id:'. (And it still returns alos the remainstring). If you're familiar with the State monad, you can write the above two functions by using the string as the state. Now, given those two functions, try to apply them on your input string, feeding the next function application with the resulting string of the current application. What I proposed here is the very basics of a natural and very used parsing technique (google for parser combinators). That technic will scale well for more complex inputs, and, I believe, should provide you with sufficient performance. Cheers, Thu

minh thu ha scritto:
[...] I suggest you try an alternative strategy. That altenrative strategy is twofold, just like you have quiz' and quiz'. This alternate strategy avoid pattern matching on strings (which would be cumbersome for a bit more complex syntax).
But for this specific case it is very compact and elegant (IMHO).
[...] Now, given those two functions, try to apply them on your input string, feeding the next function application with the resulting string of the current application.
So, I should not split the string into lines? An useful feature of my program is that it parses both an input like: 1: 1046323,2005-12-19 and 1: 1046323 If I write a parser from scratch I need to implement two separate functions. I will give it a try, just to check if it has better performances. Thanks Manlio

2009/3/11 Manlio Perillo
minh thu ha scritto:
[...] I suggest you try an alternative strategy. That altenrative strategy is twofold, just like you have quiz' and quiz'. This alternate strategy avoid pattern matching on strings (which would be cumbersome for a bit more complex syntax).
But for this specific case it is very compact and elegant (IMHO).
I would say it is difficult to see what you're doing in the code without the desciption you gave in the mail. But you're right, it's not the string pattern matching which is the problem. It is more the pair (Int, rest of the bytestring which can begin or not with ':')... Why not have quiz' accepting just the bytestring (and not the id value), and returning the (Int,[Int]) ?
[...] Now, given those two functions, try to apply them on your input string, feeding the next function application with the resulting string of the current application.
So, I should not split the string into lines?
See below.
An useful feature of my program is that it parses both an input like:
1: 1046323,2005-12-19
and 1: 1046323
If I write a parser from scratch I need to implement two separate functions.
I didn't think to that but nothing prevent you to write the second function I suggested to account for that case, or for an end of line (if can 'eat' the ':' from the input, you can also eat a newline). Thu

2009/3/11 minh thu
2009/3/11 Manlio Perillo
: minh thu ha scritto:
[...] I suggest you try an alternative strategy. That altenrative strategy is twofold, just like you have quiz' and quiz'. This alternate strategy avoid pattern matching on strings (which would be cumbersome for a bit more complex syntax).
But for this specific case it is very compact and elegant (IMHO).
I would say it is difficult to see what you're doing in the code without the desciption you gave in the mail. But you're right, it's not the string pattern matching which is the problem.
It is more the pair (Int, rest of the bytestring which can begin or not with ':')...
Why not have quiz' accepting just the bytestring (and not the id value), and returning the (Int,[Int]) ?
[...] Now, given those two functions, try to apply them on your input string, feeding the next function application with the resulting string of the current application.
So, I should not split the string into lines?
See below.
An useful feature of my program is that it parses both an input like:
1: 1046323,2005-12-19
and 1: 1046323
If I write a parser from scratch I need to implement two separate functions.
I didn't think to that but nothing prevent you to write the second function I suggested to account for that case, or for an end of line (if can 'eat' the ':' from the input, you can also eat a newline).
Thu
Ok, The approach I suggested is a bit overkill. You can indeed use L.lines to split the input into lines then work on that. But still, avoid the pair (Int, Bytestring). Instead, you can basically map on each line the unsafeReadInt modified to : - return the id - return if it is one kind of id or the other kind. so : type UserId = Int type MovieId = Int unsafeReadInt :: Line -> Either MovieId UserId Now you have a nice list [Either MovieId UserId] that you need to transform into (MovieId, [UserId]). Sorry, for the previous response. Thu

minh thu ha scritto:
[...] The approach I suggested is a bit overkill. You can indeed use L.lines to split the input into lines then work on that.
But still, avoid the pair (Int, Bytestring). Instead, you can basically map on each line the unsafeReadInt modified to : - return the id - return if it is one kind of id or the other kind.
so : type UserId = Int type MovieId = Int unsafeReadInt :: Line -> Either MovieId UserId
Now you have a nice list [Either MovieId UserId] that you need to transform into (MovieId, [UserId]).
Thanks, this seems a much better solution. Manlio

Manlio Perillo ha scritto:
minh thu ha scritto:
[...] The approach I suggested is a bit overkill. You can indeed use L.lines to split the input into lines then work on that.
But still, avoid the pair (Int, Bytestring). Instead, you can basically map on each line the unsafeReadInt modified to : - return the id - return if it is one kind of id or the other kind.
so : type UserId = Int type MovieId = Int unsafeReadInt :: Line -> Either MovieId UserId
Now you have a nice list [Either MovieId UserId] that you need to transform into (MovieId, [UserId]).
Thanks, this seems a much better solution.
Done: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2309 real 1m15.220s user 0m4.816s sys 0m0.308s 3084 KB memory usage Previous version required 4956 KB of memory. Thanks again for the suggestion, Minh. Manlio

2009/3/11 Manlio Perillo
Manlio Perillo ha scritto:
minh thu ha scritto:
[...] The approach I suggested is a bit overkill. You can indeed use L.lines to split the input into lines then work on that.
But still, avoid the pair (Int, Bytestring). Instead, you can basically map on each line the unsafeReadInt modified to : - return the id - return if it is one kind of id or the other kind.
so : type UserId = Int type MovieId = Int unsafeReadInt :: Line -> Either MovieId UserId
Now you have a nice list [Either MovieId UserId] that you need to transform into (MovieId, [UserId]).
Thanks, this seems a much better solution.
One improvement you can do : In the line quiz' (Left id : l) = (id, quiz'' l) : quiz' l notice you use two times the 'l', and the next line of code pass through the Right case. Change your code so that quiz'' has a return type ([UserId],Bytestring). The above line becomes quiz' (Left id : l) = (id, ids) : quiz' rest where (ids,rest) = quiz'' l Thu
participants (2)
-
Manlio Perillo
-
minh thu