please critique my first stab at systems programming

The idea is to walk the disk looking for a signature, say NTFS or EXT. Since we do not know where the block containing this identifier is, we read the blocks in one at a time. Long term I would like to support command line arguments for the name of the file and the offset to start looking.

Comments on style, idioms, etc. welcomed. I am particularly interested in the hIsEOF check and if there is a better way to handle that.

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import System.IO

chunkSize = 512

searchForPattern handle pat = searchForPattern' 0 handle pat

searchForPattern' index handle pat = do
    eof <- hIsEOF handle
    if eof
      then return Nothing
      else do
        bytes <- B.hGet handle chunkSize
        case BC.breakSubstring pat bytes of
          (x, y) | BC.null y -> searchForPattern' (index + 1) handle pat
                 | otherwise -> return (Just index)

main = do
    fromHandle <- openBinaryFile "blocks" ReadMode
    result <- searchForPattern fromHandle (BC.pack "PART")
    case result of
      Nothing -> putStr "Not Found.\n"
      Just n  -> putStr $ "Found at " ++ show n ++ ".\n"
    hClose fromHandle
    putStr "Done.\n"

On Thursday 14 April 2011 12:14:59, Sean Perry wrote:
The idea is to walk the disk looking for a signature, say NTFS or EXT. Since we do not know where the block containing this identifier is, we read the blocks in one at a time.
It may be more efficient to read the entire thing into a lazy ByteString instead of reading in very small strict chunks. That also makes the eof check unnecessary.
Long term I would like to support command line arguments for the name of the file and the offset to start looking.
Comments on style, idioms, etc. welcomed. I am particularly interested in the hIsEOF check and if there is a better way to handle that.
$ cabal install stringsearch
-- faster searches for a pattern than provided by bytestring

import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Search as L

blockSize = 512

main = do
    stuff <- L.readFile "blocks"
    case L.indices (BC.pack "PART") stuff of
      []    -> putStrLn "Not Found."
      (i:_) -> putStrLn $ "Found at " ++ show (i `quot` blockSize) ++ "."
    putStrLn "Done."

On Thu, Apr 14, 2011 at 8:50 AM, Daniel Fischer wrote:
[snip]
This solves a different problem. The original looked for PART only at the beginning of each block, while your solution will find everything that has PART in it. Because this is I/O bound, the real time taken probably won't be very different, but the program will surely use more CPU time and report more false positives.

Cheers,

-- Felipe.

On Thursday 14 April 2011 14:30:59, Felipe Almeida Lessa wrote:
On Thu, Apr 14, 2011 at 8:50 AM, Daniel Fischer wrote:
[snip]
This solves a different problem. The original looked for PART only at the beginning of each block,
Well, I would have expected that, but
searchForPattern' index handle pat = do
    eof <- hIsEOF handle
    if eof
      then return Nothing
      else do
        bytes <- B.hGet handle chunkSize
        case BC.breakSubstring pat bytes of
          (x, y) | BC.null y -> searchForPattern' (index + 1) handle pat
                 | otherwise -> return (Just index)
searches the entire block and checks whether the second component is empty to see whether there's any match at all. So I thought that'd be the desired behaviour.

If one only wants to look at the start of the blocks, isPrefixOf would be a much more efficient way than breakSubstring. It would probably be more efficient still to read larger chunks than 512 bytes and step through them in 512-byte strides to check for a "PART" there (unless the "PART" appears in the first few blocks); see the sketch after this message.
while your solution will find everything that has PART in it. Because this is I/O bound, the real time taken probably won't be very different, but the program will surely use more CPU time and report more false positives.
Cheers,
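A minimal sketch of the chunked, prefix-only approach described above (not code from the thread; searchPrefix, sectorSize and bigChunk are illustrative names, and the read size is assumed to be a multiple of 512):

import qualified Data.ByteString as B
import System.IO

sectorSize, bigChunk :: Int
sectorSize = 512
bigChunk   = 64 * 1024  -- read 64 KiB per call; a multiple of sectorSize

-- Return the index of the first sector whose initial bytes match pat.
searchPrefix :: Handle -> B.ByteString -> IO (Maybe Int)
searchPrefix h pat = go 0
  where
    go base = do
        bytes <- B.hGet h bigChunk
        if B.null bytes
          then return Nothing
          else case scan bytes 0 of
                 Just off -> return (Just (base + off))
                 Nothing  -> go $! base + B.length bytes `quot` sectorSize
    -- Step through the chunk in sector-sized strides, checking prefixes only.
    scan bytes i
      | i * sectorSize >= B.length bytes                 = Nothing
      | pat `B.isPrefixOf` B.drop (i * sectorSize) bytes = Just i
      | otherwise                                        = scan bytes (i + 1)

Reading 64 KiB at a time trades a little memory for far fewer read calls; any multiple of the sector size would do.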

searchForPattern' (and therefore searchForPattern) will read the entire file, indeed. To check only the first 'n' chunks in a file you'd have to either 1) receive a lazy bytestring and check the first n*chunkSize bytes, or 2) receive the handle and loop over n chunks.

On Thu, 14 Apr 2011 at 14:50 +0200, Daniel Fischer wrote:
[snip]
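A minimal sketch of option 1 (illustrative only; searchFirstN is a hypothetical name, reusing the stringsearch-based search suggested earlier in the thread):

import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.Search as S

chunkSize :: Int
chunkSize = 512

-- Search only the first n chunks of the file for the pattern.
searchFirstN :: Int -> FilePath -> BC.ByteString -> IO (Maybe Int)
searchFirstN n path pat = do
    stuff <- L.readFile path
    -- L.take bounds how much of the lazy ByteString is ever demanded,
    -- so only about the first n*chunkSize bytes are actually read.
    let limited = L.take (fromIntegral (n * chunkSize)) stuff
    case S.indices pat limited of
      []    -> return Nothing
      (i:_) -> return (Just (fromIntegral i `quot` chunkSize))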

On Thu, Apr 14, 2011 at 9:50 AM, Daniel Fischer wrote:
Well, I would have expected that, but [snip] searches the entire block and checks whether the second component is empty to see whether there's any match at all. So I thought that'd be the desired behaviour.
Oops, you're right =). I wonder what the OP's intended behaviour was.
On Thu, Apr 14, 2011 at 7:14 AM, Sean Perry wrote:
The idea is to walk the disk looking for a signature, say NTFS or EXT. Since we do not know where the block containing this identifier is, we read the blocks in one at a time.
You may use an iteratee as well, which gives clean, efficient code that doesn't depend on IO (among other things). Using the "enumerator" package [1] (and assuming that you just want to check the beginning of each block), you could write:

import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB

searchForPattern :: Monad m => B.ByteString -> E.Iteratee B.ByteString m (Maybe Int)
searchForPattern pat
    | restLen >= 0 = go 0
    | otherwise    = return Nothing
  where
    patL    = L.fromChunks [pat]
    patLen  = B.length pat
    restLen = chunkSize - patLen

    go i = do
        str <- EB.take (fromIntegral patLen)
        case (L.length str < fromIntegral patLen, str == patL) of
          (True, _) -> return Nothing
          (_, True) -> return (Just i)
          _         -> do EB.drop (fromIntegral restLen)
                          go $! i+1

Using lazy bytestrings could make your code leak memory, leak handles and/or choke with exceptions. Using handles directly in IO (as in your original approach) makes code harder to test and leaves all the gritty details up to you (such as reading blocks larger than 512 bytes). This iteratee is easily testable with pure code, does not leak, handles exceptions gracefully, and you don't have to worry about how the file is being read.

I've attached a full working example. In the example I read in blocks of 16 KiB, but you may easily adjust that (without compromising correctness).

Cheers! =)

[1] http://hackage.haskell.org/package/enumerator

-- Felipe.
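The attached example did not survive in the archive; a minimal driver along the following lines should run the iteratee (a sketch, assuming the enumerator package's run_, ($$) and Data.Enumerator.Binary.enumFile, plus the B, L and chunkSize bindings the iteratee expects):

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as L
import qualified Data.Enumerator as E
import qualified Data.Enumerator.Binary as EB

chunkSize :: Int
chunkSize = 512

main :: IO ()
main = do
    -- enumFile feeds the file to the iteratee in whatever buffer size it
    -- likes; EB.take/EB.drop inside searchForPattern keep the 512-byte
    -- sector arithmetic correct regardless.
    result <- E.run_ (EB.enumFile "blocks" E.$$ searchForPattern (BC.pack "PART"))
    case result of
      Nothing -> putStrLn "Not Found."
      Just n  -> putStrLn $ "Found at " ++ show n ++ "."
    putStrLn "Done."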

On Apr 14, 2011, at 6:17 AM, Felipe Almeida Lessa wrote:
On Thu, Apr 14, 2011 at 9:50 AM, Daniel Fischer wrote:
Well, I would have expected that, but [snip] searches the entire block and checks whether the second component is empty to see whether there's any match at all. So I thought that'd be the desired behaviour.
Oops, you're right =). I wonder what the OP's intended behaviour was.
Well, I started with B.findSubstring, but the compiler told me that was deprecated and breakSubstring was the preferred replacement. Looking at the docs, findSubstring was shown to be the equivalent of the case statement which you see in my code. I do not care how many times the pattern appears in a sector, just whether or not it exists at all.
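For reference, the bytestring documentation's deprecation note gives roughly this equivalence (paraphrased, so treat it as a sketch rather than the library's exact definition):

import qualified Data.ByteString.Char8 as BC

-- findSubstring pat str: index of the first occurrence of pat in str, if any.
findSubstring :: BC.ByteString -> BC.ByteString -> Maybe Int
findSubstring pat str =
    case BC.breakSubstring pat str of
      (before, rest) | BC.null rest -> Nothing              -- no match anywhere
                     | otherwise    -> Just (BC.length before)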
On Thu, Apr 14, 2011 at 7:14 AM, Sean Perry wrote:
The idea is to walk the disk looking for a signature, say NTFS or EXT. Since we do not know where the block containing this identifier is, we read the blocks in one at a time.
You may use an iteratee as well, which gives clean, efficient code that doesn't depend on IO (among other things). Using the "enumerator" package [1], (and assuming that you just want to check the beginning of each block), you could write:
Now that I am a little more awake, let me state my requirements more clearly to ensure this works as I expect. I am hunting through the raw sectors of the disk looking for a sector with a specific pattern. "PART" is just a stand-in for the real thing. Since disks are obviously bigger than a meg or two, I need to worry about space usage. Time performance is important, but not as important as not reading all 256G of a disk into memory (-:

I need to output the sector number when found. This is why I started with the very simple sector-by-sector search instead of a larger chunk-based one. In my experience the OS tends to read in 4k multiples even when you just ask for 512, so there is a buffer somewhere helping with performance.

I like your approach, Felipe. Thanks. Definitely not something that would have been obvious from reading the library docs. I knew there was a better way hiding somewhere though. Had I been coding in Python or some other language I typically use, I would have let Handle be the point of abstraction. Anything that looked like a File would work for me.

The definition of 'go' looks nice and elegant. What does the '$!' in 'go $! i+1' do?

On Thu, Apr 14, 2011 at 12:29 PM, Sean Perry
I like your approach, Felipe. Thanks. Definitely not something that would have been obvious from reading the library docs. I knew there was a better way hiding somewhere though. Had I been coding in Python or some other language I typically use, I would have let Handle be the point of abstraction. Anything that looked like a File would work for me.
The Yesod Book contains a chapter about the enumerator package [1], although it isn't finished yet.

[1] http://www.yesodweb.com/book/enumerator
The definition of 'go' looks nice and elegant. What does the '$!' in 'go $! i+1' do?
As you know, f $ x is just f x. The $! operator, however, evaluates the second argument to weak head normal form (WHNF) before calling the function. In other words,

  f $! x = x `seq` f x

If we just called "go (i+1)", a huge thunk would be created. For example, if the string was found in sector 1024, then it would return

  Just (((((···(1+1)···+1)+1)+1)+1)+1)

instead of simply Just 1024. But when we call "go $! i+1", we force the sum to be done *now*, so the function would return Just (1023+1), which doesn't cause any problems =).

Cheers,

-- Felipe.
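A tiny self-contained illustration of the same point (not from the thread; most instructive when compiled without optimisation, since GHC's strictness analysis can otherwise rescue the lazy version):

-- Count from 0 to k by repeated (+1), forcing the accumulator each step.
count :: Int -> Int -> Int
count i 0 = i
count i k = (count $! i + 1) (k - 1)  -- ($!) evaluates i + 1 before the call
-- With plain application, count (i + 1) (k - 1), the accumulator would stay
-- an unevaluated ((...(0+1)...)+1) chain until the final result is demanded.

main :: IO ()
main = print (count 0 1000000)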

On Thursday 14 April 2011 17:29:25, Sean Perry wrote:
Now that I am a little more awake, let me state my requirements more clearly to ensure this works as I expect.
I am hunting through the raw sectors of the disk looking for a sector with a specific pattern. "PART" is just a stand-in for the real thing. Since disks are obviously bigger than a meg or two, I need to worry about space usage. Time performance is important, but not as important as not reading all 256G of a disk into memory (-:
That's why you'd use a *lazy* ByteString or an Enumerator/Iteratee. You can have a space leak or leak a file handle with lazy ByteStrings if you hold on to parts of the ByteString unnecessarily, or if the compiler doesn't detect that the file handle is no longer referenced, but in simple situations that is unlikely. So in simple situations, using the simpler code you get from lazy ByteStrings is reasonable. I suspect you can also cause leaks with Enumerators and Iteratees, but that'll be harder, since avoiding such leaks is one of the main points of Enumerators and Iteratees.
I need to output the sector number when found. This is why I started with the very simple sector-by-sector search instead of a larger chunk-based one. In my experience the OS tends to read in 4k multiples even when you just ask for 512, so there is a buffer somewhere helping with performance.
But I think you'd still have a system call with a context switch every 512 bytes, while if you read in larger chunks you'll have fewer context switches and better performance.
I like your approach, Felipe. Thanks. Definitely not something that would have been obvious from reading the library docs. I knew there was a better way hiding somewhere though. Had I been coding in Python or some other language I typically use, I would have let Handle be the point of abstraction. Anything that looked like a File would work for me.
The definition of 'go' looks nice and elegant. What does the '$!' in 'go $! i+1' do?
($!) is strict application: 'go $! i+1' evaluates (i+1) before doing the work in go itself, thus preventing the build-up of a huge thunk go ((...(0+1)...+1)+1).
participants (4)
- Daniel Fischer
- Felipe Almeida Lessa
- MAN
- Sean Perry