
Hi,

I would like to reuse the module Distribution.Client.Tar from the cabal-install package. Are you willing to split up your package? Is it worthwhile to have single-module packages?

Currently, Distribution.Client.Tar depends on Distribution.Client.Utils.writeFileAtomic, but this final call could be left to the user. Maybe writeFileAtomic should be in another package, too?

Trying to avoid copying source code,
Christian
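P.S. For context, writeFileAtomic amounts to roughly the following (a sketch; the real code in Distribution.Client.Utils may differ, e.g. in exception handling):

  import qualified Data.ByteString.Lazy as BS
  import System.Directory (renameFile)
  import System.FilePath (takeDirectory)
  import System.IO (hClose, openTempFile)

  -- Write to a temporary file in the target's directory, then rename it
  -- over the target, so readers never observe a half-written file.
  writeFileAtomic :: FilePath -> BS.ByteString -> IO ()
  writeFileAtomic targetFile content = do
    let dir = takeDirectory targetFile
    (tmpFile, tmpHandle) <- openTempFile dir "tmp"
    BS.hPut tmpHandle content
    hClose tmpHandle
    renameFile tmpFile targetFile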

On Wed, Feb 18, 2009 at 5:20 AM, Christian Maeder wrote:
Hi,
I would like to reuse the module Distribution.Client.Tar from the cabal-install package.
Are you willing to split up your package? Is it worthwhile to have single-module packages?
Currently, Distribution.Client.Tar depends on Distribution.Client.Utils.writeFileAtomic, but this final call could be left to the user. Maybe writeFileAtomic should be in another package, too?
There's a 'tar' package on Hackage: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar

I've had good luck using it before, but I'm not sure how it's different from what cabal-install uses.

Antoine

Antoine Latter wrote:
On Wed, Feb 18, 2009 at 5:20 AM, Christian Maeder wrote:
Hi,
I would like to reuse the module Distribution.Client.Tar from the cabal-install package. [...]
There's a 'tar' package on Hackage: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar
Thanks for pointing this out. This package depends (in contrast to cabal-install) on the additional packages binary and unix-compat, which isn't a real problem, but just seems unnecessary.

The tar package also does not support compression (via zlib), but maybe this could/should be a separate package. (There's also an older package "htar" with "compression".)

The sources in cabal-install seem the most up-to-date (because of cabal-install-0.6.2), and it would make sense to take these sources and replace those in the tar package. Does the tar package have any advantage (e.g. speed or portability) over Distribution.Client.Tar?

The module structure Codec.Archive.Tar looks a bit nicer, but re-exporting the internal data structures seems unnecessary to me. Where are the actual repositories with the most recent sources?

Cheers Christian

On Mon, 2009-02-23 at 12:03 +0100, Christian Maeder wrote:
Antoine Latter wrote:
On Wed, Feb 18, 2009 at 5:20 AM, Christian Maeder wrote:
Hi,
I would like to reuse the module Distribution.Client.Tar from the cabal-install package. [...]
There's a 'tar' package on Hackage: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/tar
Thanks for pointing this out. This package depends (in contrast to cabal-install) on the additional packages binary and unix-compat, which isn't a real problem, but just seems unnecessary.
Yes, I agree, I've already removed them.
The tar package also does not support compression (via zlib), but maybe this could/should be a separate package.
Yes, it's easy to compose. The docs in the new code give examples. I don't think it's right to have the package depend on zlib and bzlib and lzma and ... etc. when it is trivial to just compose (de)compression in the pipeline:

  Tar.unpack dir . Tar.read . GZip.decompress =<< BS.readFile tar

or

  BS.writeFile tar . GZip.compress . Tar.write =<< Tar.pack base dir
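Spelled out as compilable code, those two pipelines become the following (module names as used here; note that Tar.pack takes a base directory and one path in these examples, which may not match the final released API):

  import qualified Codec.Archive.Tar as Tar
  import qualified Codec.Compression.GZip as GZip
  import qualified Data.ByteString.Lazy as BS

  -- Unpack a .tar.gz below dir: decompress lazily, parse, write files.
  extractTarGz :: FilePath -> FilePath -> IO ()
  extractTarGz dir tarball =
      Tar.unpack dir . Tar.read . GZip.decompress =<< BS.readFile tarball

  -- Create a .tar.gz from dir (relative to base): pack, serialise, compress.
  createTarGz :: FilePath -> FilePath -> FilePath -> IO ()
  createTarGz tarball base dir =
      BS.writeFile tarball . GZip.compress . Tar.write =<< Tar.pack base dir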
(There's also an older package "htar" with "compression".)
Oh, that's really just a test program for the tar package. The current one uses the zlib and bzlib packages for compression.
The sources in cabal-install seem the most up-to-date (because of cabal-install-0.6.2), and it would make sense to take these sources and replace those in the tar package.
Yes, that's what I was doing over the weekend.
Does the tar package have any advantage (e.g. speed or portability) over Distribution.Client.Tar?
It does now! :-) It's now better structured, has better error checking and is better documented.
The module structure Codec.Archive.Tar looks a bit nicer, but re-exporting the internal data structures seems unnecessary to me.
Where are the actual repositories with the most recent sources?
darcs get http://code.haskell.org/tar/

Let me know what you think about the API and documentation. You mention above about exporting internal data structures. As far as I can see everything that is exported in the current code is needed. Let me know if you think it is too much or too little.

Currently I get round-trip byte-for-byte compatibility with about 50% of the .tar.gz packages on my system (I'm on gentoo so there's lots of those). The ones that are not byte-for-byte equal after reading/writing are still readable by other tools (and probably normalised and closer to standard compliant) but it needs investigating in more detail.

The checking API is incomplete (security, tarbombs, portability) and there are no tests for the lazy streaming properties yet (ie that we can process arbitrarily large archives in constant space).

Duncan

Duncan Coutts wrote: [...]
Tar.unpack dir . Tar.read . GZip.decompress =<< BS.readFile tar
or
BS.writeFile tar . GZip.compress . Tar.write =<< Tar.pack base dir [...]
The sources in cabal-install seem the most up-to-date (because of cabal-install-0.6.2), and it would make sense to take these sources and replace those in the tar package.
Yes, that's what I was doing over the weekend.
Thanks a lot!
darcs get http://code.haskell.org/tar/
Let me know what you think about the API and documentation. You mention above about exporting internal data structures. As far as I can see everything that is exported in the current code is needed. Let me know if you think it is too much or too little.
Ok, I think the API is too big (for a casual user). I don't want to know anything about the internals of an "Entry" or about a "TarPath". For refactoring cabal-install (using your tar package) the following interface was enough (a usage sketch follows below):

  create  :: FilePath -> FilePath -> FilePath -> IO ()
  extract :: FilePath -> FilePath -> IO ()
  read    :: ByteString -> Entries
  write   :: [Entry] -> ByteString
  pack    :: FilePath -> FilePath -> IO [Entry]
  unpack  :: FilePath -> Entries -> IO ()

  data Entry
  fileName    :: Entry -> FilePath
  fileContent :: Entry -> ByteString

  data Entries = Next Entry Entries | Done | Fail String

Maybe only an "isNormalFile" test function for an Entry is missing. checkSecurity is not needed in the API, because it is done by unpack. (checkTarBomb does nothing currently). Tar entries should (usually will) not be constructed by the user. getDirectoryContentsRecursive does not really belong in this tar package.

I would be happy if the existence of TarPath (and all the other funny entry fields) could be hidden from the user. Manipulating Entries is also not a typical user task. (Maybe the type Entries should just be "[Either String Entry]", but the given type is fine, as it only allows a final failure string.)

So rather than re-exporting almost everything from the other modules in the top module, I suggest my API above and simply expose all other modules in case someone wants the internals.
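To illustrate, with this interface the cabal-install code boils down to calls like the following (the argument order here is only my guess):

  import qualified Codec.Archive.Tar as Tar

  -- Pack directory "mypackage" (relative to base dir ".") into pkg.tar,
  -- then extract that archive again below "dest".
  demo :: IO ()
  demo = do
    Tar.create "pkg.tar" "." "mypackage"
    Tar.extract "dest" "pkg.tar"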
Currently I get round-trip byte-for-byte compatibility with about 50% of the .tar.gz packages on my system (I'm on gentoo so there's lots of those). The ones that are not byte-for-byte equal after reading/writing are still readable by other tools (and probably normalised and closer to standard compliant) but it needs investigating in more detail.
The checking API is incomplete (security, tarbombs, portability) and there are no tests for the lazy streaming properties yet (ie that we can process arbitrarily large archives in constant space).
I can only suggest releasing it soon, using it for cabal-install, and making a new release of cabal-install for ghc-6.10.2.

Thank you, Duncan.

Christian

P.S. I could (darcs) send you my (humble) changes to cabal-install and tar.

On Tue, 2009-02-24 at 10:34 +0100, Christian Maeder wrote:
Let me know what you think about the API and documentation. You mention above about exporting internal data structures. As far as I can see everything that is exported in the current code is needed. Let me know if you think it is too much or too little.
Ok, I think the API is too big (for a casual user). I don't want to know anything about the internals of an "Entry" or about a "TarPath". For refactoring cabal-install (using your tar package) the following interface was enough:
create  :: FilePath -> FilePath -> FilePath -> IO ()
extract :: FilePath -> FilePath -> IO ()
read    :: ByteString -> Entries
write   :: [Entry] -> ByteString
pack    :: FilePath -> FilePath -> IO [Entry]
unpack  :: FilePath -> Entries -> IO ()

data Entry
fileName    :: Entry -> FilePath
fileContent :: Entry -> ByteString

data Entries = Next Entry Entries | Done | Fail String
Maybe only an "isNormalFile" test function for an Entry is missing.
  isNormalFile entry = fileType entry == NormalFile

Do we really want one of these for every file type? isDirectory, isHardLink? I hope not. So perhaps that means you just want to expose fileType and FileType too in the top-level API.
checkSecurity is not needed in the API, because it is done by unpack. (checkTarBomb does nothing currently).
It's needed if you're checking a tar file now because you expect to unpack it later, eg on hackage.
Tar entries should (usually will) not be constructed by the user.
I've got a use case where we do.
getDirectoryContentsRecursive does not really belong into this tar package.
True, but it's so useful you'd have to implement it yourself if it were not provided. We can't get it into the System.Directory package very quickly.
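For anyone following along, it amounts to roughly this (a sketch: no symlink handling, and the package's version may differ in ordering and laziness):

  import System.Directory (doesDirectoryExist, getDirectoryContents)
  import System.FilePath ((</>))

  -- List all files below 'top', as paths relative to 'top'.
  getDirectoryContentsRecursive :: FilePath -> IO [FilePath]
  getDirectoryContentsRecursive top = go ""
    where
      go rel = do
        names <- getDirectoryContents (top </> rel)
        let names' = [ rel </> name | name <- names
                     , name /= ".", name /= ".." ]
        pathss <- mapM visit names'
        return (concat pathss)
      visit path = do
        isDir <- doesDirectoryExist (top </> path)
        if isDir then go path else return [path]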
I would be happy, if the existence of TarPath (and all the other funny entry fields) could be hidden from the user.
You'll have to trade that for some cases where you get an error instead. Translating file paths to tar paths does not work in all cases. We can hide that in an IO error for functions that are in IO, like pack.
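The underlying reason is the ustar header layout; a hypothetical illustration of the constraint (not the package's actual code, and counting characters as a stand-in for bytes):

  -- Plain ustar headers store a name in a 100-byte field plus an
  -- optional 155-byte prefix, so a path only fits if it is short
  -- enough or can be split at a '/' into prefix <= 155 and name <= 100.
  fitsInUstar :: FilePath -> Bool
  fitsInUstar path =
         length path <= 100
      || or [ length prefix <= 155 && length name <= 100
            | (i, '/') <- zip [0 ..] path
            , let (prefix, rest) = splitAt i path
            , let name = drop 1 rest ]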
Manipulating Entries is also not a typical user task. (Maybe the type Entries should just be "[Either String Entry]", but the given type is fine, as it only allows a final failure string)
Yeah, I think only one failure is good. Manipulating entries is more common than you think, if you are working with a tar file in memory and never extracting files to disk. For example, the hackage-server both constructs and dissects tar files but never packs or unpacks them.
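As a flavour of that in-memory style, pulling a single .cabal file out of an archive without touching the disk is just a scan over the entry stream (a sketch, using the fileName/fileContent accessors from your proposed API):

  import qualified Data.ByteString.Lazy as BS
  import Data.List (isSuffixOf)

  -- Return the contents of the first .cabal file in the archive, if any;
  -- nothing is ever written to disk.
  findCabalFile :: Entries -> Maybe BS.ByteString
  findCabalFile (Next e es)
    | ".cabal" `isSuffixOf` fileName e = Just (fileContent e)
    | otherwise                        = findCabalFile es
  findCabalFile _ = Nothing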
So rather than re-exporting almost everything from the other modules in the top module, I suggest my API above and simply expose all other modules in case someone wants the internals.
Ok, so perhaps we should split the API into two, in the way you suggest above. I'll try that and see how it looks.
Currently I get round-trip byte-for-byte compatibility with about 50% of the .tar.gz packages on my system (I'm on gentoo so there's lots of those). The ones that are not byte-for-byte equal after reading/writing are still readable by other tools (and probably normalised and closer to standard compliant) but it needs investigating in more detail.
The checking API is incomplete (security, tarbombs, portability) and there are no tests for the lazy streaming properties yet (ie that we can process arbitrarily large archives in constant space).
I can only suggest releasing it soon, using it for cabal-install, and making a new release of cabal-install for ghc-6.10.2.
I'll not be using the tar package for cabal-install immediately. Keeping the number of dependencies low while we work out installation and packaging issues is fairly important. People already complain that cabal-install has any dependencies at all. Once it's bundled in the platform, that restriction will be lifted.
P.S. I could (darcs) send you my (humble) changes to cabal-install and tar
Please.

Duncan

On Tue, 2009-02-24 at 12:00 +0000, Duncan Coutts wrote:
On Tue, 2009-02-24 at 10:34 +0100, Christian Maeder wrote:
checkSecurity is not needed in the API, because it is done by unpack. (checkTarBomb does nothing currently).
It's needed if you're checking a tar file now because you expect to unpack it later, eg on hackage.
Tar entries should (usually will) not be constructed by the user.
I've got a use case where we do.
So I should say what these use cases are.

One is cabal-install of course. That's fairly easy: it just needs to create and extract tar files.

The other case is hackage. For that we want to upload and check the contents of tar files, without ever unpacking them to local files. We want to check the tar file itself to make sure it is a portable format (ie not containing any funky extensions that not all tar readers will grok) or things that are not portable between platforms, like file names that would be invalid on Windows. We also want to extract a single file in memory (the .cabal file).

Another case within hackage is constructing the 00-index.tar.gz file. That is built in memory from another internal representation of the package index. For that we really are constructing each entry ourselves, supplying all the appropriate info including file modification time, ownership etc.

A final case in hackage is serving the contents of .tar files. This is to let users browse the contents of packages, eg to read the README without having to download the whole .tar.gz. It should also make all the code more easily googleable. It's also the method we want to use for serving haddock docs: bots or package owners will upload .tar.gz bundles of documentation and we'll serve the contents directly without unpacking them. We'll do that by gunzipping and storing the .tar file on disk, scanning it once to generate a file name -> (offset, length) index, and then when we service a request we open the .tar file, seek to the offset and return that length.

I think that's it.

Duncan
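P.S. To make that last scheme concrete, the one-off scan could look roughly like this, using the fileName/fileContent accessors proposed earlier in the thread and computing offsets from tar's 512-byte block structure (a sketch that ignores extended headers and long-name entries):

  import qualified Data.ByteString.Lazy as BS
  import qualified Data.Map as Map

  -- file name -> (offset of content in the .tar file, content length)
  type Index = Map.Map FilePath (Integer, Integer)

  -- Each entry is a 512-byte header followed by its content padded up
  -- to a multiple of 512 bytes, so we can track offsets while streaming.
  buildIndex :: Entries -> Either String Index
  buildIndex = go 0 Map.empty
    where
      go _   _  (Fail err)  = Left err
      go _   ix Done        = Right ix
      go off ix (Next e es) =
        let len    = fromIntegral (BS.length (fileContent e))
            padded = ((len + 511) `div` 512) * 512
            ix'    = Map.insert (fileName e) (off + 512, len) ix
        in  go (off + 512 + padded) ix' es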

Duncan Coutts wrote: [...]
data Entry
fileName    :: Entry -> FilePath
fileContent :: Entry -> ByteString
[...]
Maybe only an "isNormalFile" test function for an Entry is missing.
isNormalFile entry = fileType entry == NormalFile
Do we really want one of these for every file type? isDirectory, isHardLink? I hope not. So perhaps that means you just want to expose fileType and FileType too in the top-level API.
I only suggested isNormalFile in order to test whether fileContent can reasonably be used. Directly working with FileType and Entry should be for experts only. Surely the file types Directory and SymbolicLink/HardLink (a single "isLink" test) may be of some user interest too, but for these types fileContent has no meaning, and additionally the linkTarget should be exported. For directories the entry does not reveal the contents. Therefore I think fileName, fileContent, and isNormalFile are just enough for a top-level API.
checkSecurity is not needed in the API, because it is done by unpack. (checkTarBomb does nothing currently).
It's needed if you're checking a tar file now because you expect to unpack it later, eg on hackage.
It's perfect in Codec.Archive.Tar.Check.
Tar entries should (usually will) not be constructed by the user.
I've got a use case where we do.
Ok, importing Codec.Archive.Tar.Types and Codec.Archive.Tar.Pack should be fine then.
getDirectoryContentsRecursive does not really belong into this tar package.
True but it's so useful you'd have to implement it yourself if it was not provided. We can't get it into the System.Directory package very quickly.
Keeping it for a while in Codec.Archive.Tar.Pack only should be fine. (A separate module is overkill.) [...]
So rather than re-exporting almost everything from the other modules in the top module, I suggest my API above and simply expose all other modules in case someone wants the internals.
Ok, so perhaps we should split the API into two, in the way you suggest above. I'll try that and see how it looks. [...]
Thanks again, Christian

On Tue, 2009-02-24 at 13:42 +0100, Christian Maeder wrote:
Ok, so perhaps we should split the API into two, in the way you suggest above. I'll try that and see how it looks. [...]
Ok, I've applied your patch with a slight variation. See the haddocks:

http://haskell.org/~duncan/tar/docs/

Basically it's split into Codec.Archive.Tar for the high-level bits of the API you suggested, and Codec.Archive.Tar.Entry for the details of what an Entry is, how to construct one etc. Hopefully the documentation makes that split clear.

I've exported

  fileType :: Entry -> FileType
  data FileType = ...

from the high-level API, rather than adding

  isNormalFile :: Entry -> Bool

Thanks again for the feedback.

Duncan
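P.S. So on the user side the test is a one-liner, e.g. (a sketch, assuming FileType has an Eq instance):

  import qualified Codec.Archive.Tar as Tar

  -- Keep only the ordinary-file entries of an archive.
  onlyNormalFiles :: [Tar.Entry] -> [Tar.Entry]
  onlyNormalFiles = filter (\e -> Tar.fileType e == Tar.NormalFile)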

Duncan Coutts wrote:
Manipulating Entries is also not a typical user task. (Maybe the type Entries should just be "[Either String Entry]", but the given type is fine, as it only allows a final failure string)
Yeah I think only one failure is good.
  data Entries = Entries [Entry] (Maybe String)

or directly using "([Entry], Maybe String)" is an alternative, where "Nothing" indicates no failure. This would avoid the extra folding and mapping on Entries, and maybe "unpack" could then work on [Entry] only.

Cheers Christian
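P.S. The conversion from the streaming type would be the obvious one (a sketch; note that inspecting the final Maybe String forces the whole list):

  -- Flatten the incremental Entries into a plain list plus an optional
  -- final error message.
  flatten :: Entries -> ([Entry], Maybe String)
  flatten (Next e es) = let (rest, err) = flatten es in (e : rest, err)
  flatten Done        = ([], Nothing)
  flatten (Fail err)  = ([], Just err)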

On Tue, 2009-02-24 at 13:57 +0100, Christian Maeder wrote:
Duncan Coutts wrote:
Manipulating Entries is also not a typical user task. (Maybe the type Entries should just be "[Either String Entry]", but the given type is fine, as it only allows a final failure string)
Yeah I think only one failure is good.
  data Entries = Entries [Entry] (Maybe String)

or directly using "([Entry], Maybe String)"
Sadly that does not account for streaming. We may not discover the error until we've processed 100 megabytes of tar file. So we cannot give top level access to an error that might happen at the end.
is an alternative, where "Nothing" indicates no failure. This would avoid the extra folding and mapping on Entries
The nice thing about the current Entries is that it models the failure modes and works lazily.
and maybe "unpack" could then work on [Entry] only.
True, though it's also convenient that unpack just raises any entry errors as IO errors. Converting Entries to [Entry] requires reading the whole thing (to check there are no errors), again defeating laziness/streaming, which would mean we could not work with large tar files in constant space.

Duncan

On Tue, Feb 24, 2009 at 01:19:49PM +0000, Duncan Coutts wrote:
On Tue, 2009-02-24 at 13:57 +0100, Christian Maeder wrote:
Duncan Coutts wrote:
Manipulating Entries is also not a typical user task. (Maybe the type Entries should just be "[Either String Entry]", but the given type is fine, as it only allows a final failure string)
Yeah I think only one failure is good.
  data Entries = Entries [Entry] (Maybe String)

or directly using "([Entry], Maybe String)"
Sadly that does not account for streaming. We may not discover the error until we've processed 100 megabytes of tar file. So we cannot give top level access to an error that might happen at the end.
Something that does account for streaming is MList (Either String) Entry, where MList is from "ListT done right" [1]. This would have the advantage of being able to use all the standard Monad/MList machinery. It would also have the disadvantage that MList isn't in any standard library, and adds some extra constructor (un)packing.

Cheers, Remi

[1] http://www.haskell.org/haskellwiki/ListT_done_right
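P.S. Concretely, the shape of MList specialised as above corresponds to the Entries type directly (a sketch; the wiki's definition may differ, e.g. by wrapping the synonym in a newtype):

  -- A list whose tail lives in a monad; with m = Either String, every
  -- step of the list can fail with an error message.
  data MList' m a = MNil | MCons a (MList m a)
  type MList m a  = m (MList' m a)

  -- The Entries type from this thread maps onto it one-to-one.
  toMList :: Entries -> MList (Either String) Entry
  toMList (Next e es) = Right (MCons e (toMList es))
  toMList Done        = Right MNil
  toMList (Fail err)  = Left err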