
Hello,

I just uploaded a new version of the pdf-toolbox suite. It now supports
text extraction; see
http://hackage.haskell.org/package/pdf-toolbox-document-0.0.2.0/docs/Pdf-Too...

A new library, pdf-toolbox-content, contains low-level tools for text
extraction. For example, one can extract glyphs with their exact
positions. This can be used, e.g., to implement text selection in a PDF
viewer (see the screenshots below). Is anybody interested in that
functionality?

I tested it on all the PDF files in my ~/Downloads, but there are a
number of corner cases that are not handled because I have never seen
them in the wild. So, if you are interested, please try it out and
report any issues.

The easiest way is to install pdf-toolbox-viewer (not on Hackage, see
https://github.com/Yuras/pdf-toolbox/tree/master/viewer ; it depends on
gtk2hs) and run it with the path to a PDF file as an argument.

Or you can just use the pageExtractText function directly:

    import System.IO
    import Pdf.Toolbox.Document

    main = withBinaryFile "input.pdf" ReadMode $ \handle ->
      runPdfWithHandle handle knownFilters $ do
        pdf <- document
        catalog <- documentCatalog pdf
        rootNode <- catalogPageNode catalog
        count <- pageNodeNKids rootNode
        liftIO $ print count
        -- the first page of the document
        page <- pageNodePageByNum rootNode 0
        txt <- pageExtractText page
        liftIO $ print txt

A few screenshots (please let me know if you can't access them):
- render via ImageMagick:
  https://docs.google.com/file/d/0B0K_fl2fc1ZgcnVtZXhFTUx5ekE/edit?usp=sharing
- render extracted text with correct positions:
  https://docs.google.com/file/d/0B0K_fl2fc1ZgZE52X0hMcVNnaG8/edit?usp=sharing
- combined image:
  https://docs.google.com/file/d/0B0K_fl2fc1ZgaUE5Qkt6S19VQlE/edit?usp=sharing

On Hackage:
- pdf-toolbox-document: http://hackage.haskell.org/package/pdf-toolbox-document
- pdf-toolbox-core: http://hackage.haskell.org/package/pdf-toolbox-core
- pdf-toolbox-content: http://hackage.haskell.org/package/pdf-toolbox-content

Thanks,
Yuras