RE: [Haskell-cafe] File path programme

[ moving to libraries@haskell.org ] On 26 January 2005 12:22, Malcolm Wallace wrote:
Could we just punt this library for this release? After all, we can add libraries in a later point release (e.g. 6.4.1); you just can't change existing APIs.
FWIW, I agree with Duncan, Ben, and Peter, that the new System.FilePath interface is broken, and the implementation more so. It would be better to redesign FilePaths as an algebraic datatype.
Ok, I'll go with the consensus. System.FilePath in its current state will be removed from the base package. Isaac or Krasimir: can you make the required changes to Cabal? (The code is already duplicated in Distribution.Compat.FilePath, so just remove the import for GHC.)

So let's aim for 6.6 to do it right, and design an abstract type for path names. We have to think about the migration path: Haskell 98 programs must continue to work, so at least IO.FilePath must still be a String, and IO.openFile must still take a String. We can therefore:

  (a) make System.IO.FilePath be the new type, which is different from, and incompatible with, IO.FilePath. Similarly for System.IO.openFile, System.Directory.removeFile, and so on.

  (b) or just define a new type, and force you to insert read/show to convert where necessary, e.g. before calling openFile.

(a) is kind of the right thing, but (b) is a lot less painful in the short term. Since we'll be migrating to a new IO library at some point, (b) is probably fine (the new IO library can use the new FilePaths exclusively), but we'll need to migrate System.Directory too.

Would someone like to take up the reins on the design for the new library?

Cheers, Simon

On Wed, Jan 26, 2005 at 01:34:39PM -0000, Simon Marlow wrote:
... We can therefore:
(a) make System.IO.FilePath be the new type, which is different from, and incompatible with, IO.FilePath. Similarly for System.IO.openFile, System.Directory.removeFile, and so on.
(b) or just define a new type, and force you to insert read/show to convert where necessary, eg. before calling openFile.
(a) is kind of the right thing, but (b) is a lot less painful in the short term. Since we'll be migrating to a new IO library at some point, (b) is probably fine (the new IO library can use the new FilePaths exclusively), but we'll need to migrate System.Directory too.
One thing I've been wishing for some time (as long as we're discussing a replacement for FilePath) was to have a FilePath class, which would allow me to use my FastPackedStrings with the IO routines. It seems silly to have a byte-oriented filepath, convert it into a String and then have the IO library convert back again to a byte-oriented string to call the C library. I've wished there were a class

    class FilePath f where
        toStringFilePath    :: f -> String
        withCStringFilePath :: f -> (CString -> IO a) -> IO a

or something like that (where withCStringFilePath could have a default written in terms of toStringFilePath). It's a shame when I use the FFI for some of my IO (which of course always requires CStrings) and Haskell IO libraries which always require Strings to keep having to convert back and forth. Alas, darcs does enough "quick" calls to stat (doesFileExist, etc.) that the cost isn't negligible. Eventually I'll rewrite a lot of this to just use the FFI (since I want to use lstat anyway, or its Windows equivalent), but it would be nice (eventually) not to have to do this. But how painful would it be for the System.IO functions to have types such as

    readFile :: FilePath a => a -> String

? -- David Roundy http://www.darcs.net
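[A minimal sketch of the class David describes. It is renamed FilePathLike here to avoid clashing with the existing FilePath type synonym; the default method, the String instance and the access(2)-based existence check are illustrative assumptions, not an agreed design.]

    {-# LANGUAGE ForeignFunctionInterface, TypeSynonymInstances, FlexibleInstances #-}
    import Foreign.C.String (CString, withCString)
    import Foreign.C.Types (CInt)

    class FilePathLike f where
        toStringFilePath    :: f -> String
        withCStringFilePath :: f -> (CString -> IO a) -> IO a
        -- default written in terms of toStringFilePath, as suggested above
        withCStringFilePath p act = withCString (toStringFilePath p) act

    instance FilePathLike String where
        toStringFilePath = id

    -- access(2); 0 is F_OK on common systems (a real binding would use a proper constant)
    foreign import ccall unsafe "access" c_access :: CString -> CInt -> IO CInt

    -- A "quick" existence check that never builds an intermediate String for
    -- instances that override withCStringFilePath.
    doesFileExistP :: FilePathLike f => f -> IO Bool
    doesFileExistP p = do
        r <- withCStringFilePath p (\cs -> c_access cs 0)
        return (r == 0)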

After the discussion about file paths over the last several days I went home and put together a quick trial implementation for unix file paths, with the idea of adding windows, SMB and maybe VMS (why not?) paths. It is based on a Path class. I'll post it later when I get home. However, I will attempt to recreate the class definition from memory now to see if we can get some discussion going about the methods needed/desired, and whether or not this is a good approach.

    data PathRoot
        = UnixFilesystemRoot
        | WindowsDrive Char
        | SMBShare String String
        | VMSDevice String
        | ...                       -- whatever else we need

    class (Show p) => Path p where
        isAbsolute    :: p -> Bool
        isRelative    :: p -> Bool
        isRelative    = not . isAbsolute
        basename      :: p -> String
        parent        :: p -> p
        pathAppend    :: p -> p -> p    -- 2nd arg must be a relative path
        pathSeparator :: p -> Char
        pathParser    :: Parser p       -- parsec parser
        parsePath     :: String -> p
        parsePath str = case parse pathParser "" str of
                            Left e  -> error (show e)
                            Right p -> p

Other operations I think might be worthwhile:

    pathRoot        :: p -> Maybe PathRoot  -- relative paths return Nothing
    pathCleanup     :: p -> p               -- remove .. and suchlike
    commonAncestors :: p -> p -> p
    pathFrom        :: p -> p -> p          -- returns a relative path from the 1st to the 2nd
    pathCompare     :: p -> p -> ???        -- not sure what this should mean or return
    hasExtension    :: p -> String -> Bool
    pathToForeign   :: p -> IO (Ptr CChar)
    pathFromForeign :: Ptr CChar -> IO p
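[For concreteness, a rough sketch of what a Unix instance of the class above might look like, assuming the Path class and PathRoot type from this message are in scope. The UnixPath representation and the parser details are illustrative assumptions only.]

    import Text.ParserCombinators.Parsec
    import Data.List (intersperse)
    import Data.Maybe (isJust)

    -- Assumed representation: an optional root plus a list of segments.
    data UnixPath = UnixPath (Maybe PathRoot) [String]

    instance Show UnixPath where
        show (UnixPath root segs) =
            maybe "" (const "/") root ++ concat (intersperse "/" segs)

    instance Path UnixPath where
        isAbsolute (UnixPath root _) = isJust root
        basename   (UnixPath _ segs) = if null segs then "" else last segs
        parent     (UnixPath r segs) = UnixPath r (if null segs then [".."] else init segs)
        pathAppend (UnixPath r xs) (UnixPath _ ys) = UnixPath r (xs ++ ys)
        pathSeparator _ = '/'
        pathParser = do
            root <- option Nothing (char '/' >> return (Just UnixFilesystemRoot))
            segs <- many1 (noneOf "/") `sepEndBy` char '/'
            return (UnixPath root segs)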

robert dockins wrote:

After the discussion about file paths over the last several days I went home and put together a quick trial implementation for unix file paths, with the idea of adding windows, SMB and maybe VMS (why not?) paths.

This is great. Comments below.
    data PathRoot
        = UnixFilesystemRoot
        | WindowsDrive Char
        | SMBShare String String
        | VMSDevice String
        | ...                -- whatever else we need
I would say that all paths are relative to something, whether it's the Unix root, or the current directory, or whatever. Therefore I would call this something like PathStart, and add:

        | CurrentDirectory
        | CurrentDirectoryOfWindowsDrive Char
        | RootOfCurrentWindowsDrive

What is a pathname, broadly speaking? Answer: it's a description of a path in a directed graph with labeled edges. It consists of a single node designator (the starting point) and a sequence of edge designators, i.e.

    data Pathname = Pathname { pathStart :: PathStart, pathEdges :: [String] }

Most of the time all we care about is either the final node or the final edge that we reach by following the path. The only reason we specify the rest of the path is that there are only a few nodes that we can name directly; to refer to any other location on the graph we have to give "driving directions" from one of those nodes. There's no reason the OS couldn't make nodes and edges first-class entities--it would solve a multitude of problems--but most don't, so forget that.

On Unix, there are two nodes we can name directly, the "root" and the "current directory". On Windows, there are 26 roots and 26 current directories which we can name directly; additionally we can name the root or current directory of the current drive, which is one of those 26, and there are an arbitrary number of network share roots, and \\.\, and perhaps some other stuff I don't know about. Symbolic links complicate things a bit, since they are followed like edges but are actually paths (so they may be affected by seemingly unrelated changes to the graph). They're rather like VPNs, actually, though I'm not sure how far I want to push that analogy.

Whether we're talking about the final node or the final edge depends on the OS call; this is the usual pointer-vs-pointee confusion that's also found in most programming languages outside the ML family. Probably we can ignore it, with the exception of the "/foo" vs "/foo/" distinction, which we must preserve. This can probably be handled by parsing the latter as

    Pathname { pathStart = UnixFilesystemRoot, pathEdges = ["foo","."] }
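[A small illustration -- my own sketch, not part of Ben's proposal -- of parsing a Unix-style string into this Pathname form while preserving the "/foo" vs "/foo/" distinction. It assumes the PathStart constructors and the Pathname record above.]

    parseUnix :: String -> Pathname
    parseUnix ('/':rest) = Pathname UnixFilesystemRoot (edges rest)
    parseUnix s          = Pathname CurrentDirectory   (edges s)

    edges :: String -> [String]
    edges str
        | not (null segs) && last segs == "" =
            filter (not . null) (init segs) ++ ["."]   -- trailing '/' becomes a "." edge
        | otherwise = filter (not . null) segs
      where
        segs = splitOn '/' str
        splitOn c xs = case break (== c) xs of
            (seg, [])      -> [seg]
            (seg, _:rest') -> seg : splitOn c rest'

    -- parseUnix "/foo"  == Pathname UnixFilesystemRoot ["foo"]
    -- parseUnix "/foo/" == Pathname UnixFilesystemRoot ["foo","."]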
class (Show p) => Path p where
Okay, I'm not convinced that a Path class is the right approach. For the reasons given above, I think I'd rather have a single Path datatype, probably with its data constructors exported. What do we gain with the class approach? Well...

  (A) Functions that accept paths can be polymorphic on the path type (where String is a path type).
  (B) We can have different datatypes for the paths of different operating systems.

It seems like these are two very different problems which should be solved with different typeclasses, if they're to be solved with typeclasses at all. I think (A) can be solved very simply, and independently of the specification of a path-handling library:

    class IsPath a where
        withCPath :: a -> (Ptr CChar -> IO b) -> IO b

    instance IsPath String where
        withCPath = withCString          -- tricky i18n issues!

    instance IsPath [CChar] where
        withCPath = withArray0 0

    instance IsPath PathADT where
        withCPath = withCString . pathToString

    instance IsPath (Ptr CChar) where
        withCPath = flip ($)

    openFile :: (IsPath p) => p -> ...

I'm tentatively opposed to (B), since I think that the only interesting difference between Win32 and Posix paths is in the set of starting points you can name. (The path separator isn't very interesting.) But maybe it does make sense to have separate starting-point ADTs for each operating system. Then of course there's the issue that Win32 edge labels are Unicode, while Posix edge labels are [Word8]. Hmm.
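[As a usage sketch for the IsPath idea -- an assumption layered on the class above, not part of any proposal -- a file-removing wrapper could go straight from any path representation to the C call.]

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.String (CString)
    import Foreign.C.Types (CInt)

    foreign import ccall unsafe "unlink" c_unlink :: CString -> IO CInt

    -- Assumes the IsPath class above; any instance reaches unlink(2)
    -- without a forced detour through String.
    removeFileP :: IsPath p => p -> IO ()
    removeFileP p = do
        r <- withCPath p c_unlink
        if r /= 0
            then ioError (userError "removeFileP: unlink failed")
            else return ()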
isAbsolute :: p -> Bool
Definition: a path is absolute if its meaning is independent of (Posix: the current directory) (Win32: all current directories and the current drive).
pathCleanup :: p -> p -- remove .. and suchlike
This can't be done safely except in a few special cases (e.g. "/.." -> "/"). I'm not sure it should be here.
hasExtension :: p -> String -> Bool
This is really an operation on a single component of the path. I think it would make more sense to make it an ordinary function with type String -> String -> Bool and use the basename method to get the appropriate path component.
    pathToForeign   :: p -> IO (Ptr CChar)
    pathFromForeign :: Ptr CChar -> IO p
This interface is problematic. Is the pointer returned by pathToForeign a heap pointer which the caller is supposed to free? If so, a Ptr CChar instance would have to copy the pathname every time. And I don't understand exactly what pathFromForeign is supposed to do. -- Ben

I would say that all paths are relative to something, whether it's the Unix root, or the current directory, or whatever. Therefore I would call this something like PathStart, and add:
| CurrentDirectory | CurrentDirectoryOfWindowsDrive Char | RootOfCurrentWindowsDrive
This is true in a sense, but I think making the distinction explicit is helpful for a number of the operations we want to do. For example, what is the parent of the relative path "."? Answer is "..". What is the parent of "/." on unix? Answer is "/.". I would also argue that it only makes sense to append a relative path on the right (ie, we can't append "/tmp/foo" onto "/usr/local", but we can append "tmp/foo"). Relative paths can refer to different things in the filesystem depending on process-local state, whereas absolute paths will always refer to the same thing (until the filesystem changes, or if you do something esoteric like "chroot"). Relative paths are really "path fragments."
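[One way to make "only relative paths may be appended on the right" a static guarantee rather than a runtime precondition is to split the types. This is just a sketch of the idea with assumed minimal representations, not a concrete proposal.]

    newtype AbsPath = AbsPath [String]   -- anchored at a root
    newtype RelPath = RelPath [String]   -- a "path fragment"

    appendAbs :: AbsPath -> RelPath -> AbsPath
    appendAbs (AbsPath xs) (RelPath ys) = AbsPath (xs ++ ys)

    appendRel :: RelPath -> RelPath -> RelPath
    appendRel (RelPath xs) (RelPath ys) = RelPath (xs ++ ys)

    -- appendAbs (AbsPath ["usr","local"]) (RelPath ["tmp","foo"]) is well typed;
    -- there is deliberately no function that puts an absolute path on the right.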
On Unix, there are two nodes we can name directly, the "root" and the "current directory". On Windows, there are 26 roots and 26 current directories which we can name directly; additionally we can name the root or current directory of the current drive, which is one of those 26, and there are an arbitrary number of network share roots, and \\.\, and perhaps some other stuff I don't know about.
There are a few others. I took a look at MSDN earlier and was astounded.
Whether we're talking about the final node or the final edge depends on the OS call; this is the usual pointer-vs-pointee confusion that's also found in most programming languages outside the ML family. Probably we can ignore it, with the exception of the "/foo" vs "/foo/" distinction, which we must preserve.
I've solved that as you suggested where "foo/" goes to "foo/."
    class (Show p) => Path p where

Okay, I'm not convinced that a Path class is the right approach.
I'm not convinced either, but it feels natural to me.
I'm tentatively opposed to (B), since I think that the only interesting difference between Win32 and Posix paths is in the set of starting points you can name. (The path separator isn't very interesting.) But maybe it does make sense to have separate starting-point ADTs for each operating system. Then of course there's the issue that Win32 edge labels are Unicode, while Posix edge labels are [Word8]. Hmm.
I think these differences make separate implementations worthwhile. The question then is whether to abstract them via a type class, or with a datatype like:

    data FilePath = POSIXFilePath POSIXPath | WinFilePath WinPath

Disadvantage here is that the datatype is closed. Advantage is that pattern matching tells you what kind of path you have statically.
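[A sketch of that closed-datatype alternative; the concrete POSIXPath and WinPath representations below are placeholders, and the type is renamed to avoid clashing with the Prelude's FilePath synonym. Pattern matching recovers the platform statically, but adding a new platform means touching the datatype.]

    type POSIXPath = [String]   -- placeholder representation
    type WinPath   = [String]   -- placeholder representation

    data AnyFilePath = POSIXFilePath POSIXPath | WinFilePath WinPath

    separatorOf :: AnyFilePath -> Char
    separatorOf (POSIXFilePath _) = '/'
    separatorOf (WinFilePath   _) = '\\'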
pathCleanup :: p -> p -- remove .. and suchlike
This can't be done safely except in a few special cases (e.g. "/.." -> "/"). I'm not sure it should be here.
More than you would think, if you follow the conventions of modern unix shells: e.g., "foo/.." is always equal to ".", "foo/bar/../../.." is equal to "..", and "foo///bar" is equal to "foo/bar". This is the behavior that "cd" gives on modern posix shells (rather than doing a chdir on the ".." hardlink, which does strange things in the presence of symlinks). The operation is sufficiently useful that I think it should be included. It lets us know, for example, that "/bar/../foo/tmp" and "/foo/tmp" refer to the same file, without resorting to any IO operations.
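[A sketch of that shell-style cleanup on a relative path's segments (segment-list representation assumed for illustration): "." and empty segments drop, a name followed by ".." cancels, and leading ".."s are kept.]

    cleanup :: [String] -> [String]
    cleanup = reverse . foldl step [] . filter (`notElem` ["", "."])
      where
        -- the accumulator holds cleaned segments in reverse order
        step (s:rest) ".." | s /= ".." = rest        -- "foo/.." collapses
        step acc      ".."             = ".." : acc  -- keep leading ".."s
        step acc      seg              = seg : acc

    -- cleanup ["foo",".."]                 == []        (i.e. ".")
    -- cleanup ["foo","bar","..","..",".."] == [".."]
    -- cleanup ["foo","","","bar"]          == ["foo","bar"]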
hasExtension :: p -> String -> Bool
This is really an operation on a single component of the path. I think it would make more sense to make it an ordinary function with type String -> String -> Bool and use the basename method to get the appropriate path component.
The problem is that String doesn't faithfully capture the representation of path edges. For POSIX it is a sequence of Word8 (except for 0x2F). In my implementation of UnixPaths, each path carries along an encoding component, which (theoretically) tells you how to do [Word8] <-> [Char] translations. Eventually we will get a real IO layer complete with character encodings and this will be meaningful. The comparison needs to be done with encodings in mind.
    pathToForeign   :: p -> IO (Ptr CChar)
    pathFromForeign :: Ptr CChar -> IO p
This interface is problematic. Is the pointer returned by pathToForeign a heap pointer which the caller is supposed to free? If so, a Ptr CChar instance would have to copy the pathname every time. And I don't understand exactly what pathFromForeign is supposed to do.
Agree, I like the withCPath interface better. pathFromForeign takes a path representation directly from C land, without going through String first (again with encoding issues in mind). Although it should perhaps be: pathFromForeign :: Ptr () -> IO p instead (might be wide chars).

Here is my first cut at this. The unix implementation mostly works, the windows one just has some datatypes sketched out, but it typechecks.

On Wed, Jan 26, 2005 at 10:20:12PM -0500, Robert Dockins wrote:
    class (Show p) => Path p where
        isAbsolute    :: p -> Bool
        isRelative    :: p -> Bool
        isRelative    = not . isAbsolute
        basename      :: p -> String
        parent        :: p -> p
        pathAppend    :: p -> p -> p
        pathExtend    :: p -> String -> p
        pathSeparator :: p -> Char
        pathParser    :: Parser p
        parsePath     :: String -> p
        parsePath x   = case parse pathParser "" x of
                            Left e  -> error $ show e
                            Right x -> x
Warning: I'm not interested in a path parsing/combining library, so my criticisms are perhaps unrelated to your goals.

One thing that I'd be interested in seeing for any Path class would be a simple instance for FilePath (or String, if you want to imagine FilePath will be changed). Not everyone will want the overhead of a massively heavyweight Path datatype. I'd actually rather have something lighter-weight than String (think PackedString), since FilePaths can take up a good chunk of darcs' memory.

Another thing to consider is that any Path class *needs* to have a conversion to C string (probably of the "with" variety). Even on Windows, where apparently a FilePath is not a sequence of bytes, we'd like to be able to use the FFI to call the C standard library, and it would be nice to be able to access the same file both via the FFI and also via the haskell standard libraries. Of course, this means we'd want a similar conversion the other way.

I guess it's just that I'm more concerned with making possible what is currently impossible (according to the library standards)--that is, using FFI and IO on the same file--rather than just adding utility features that application developers could have written themselves. I suppose we don't need a class for this, all we need is a couple of functions to convert between FilePath and CString.

-- David Roundy http://www.darcs.net

I guess it's just that I'm more concerned with making possible what is currently impossible (according to the library standards)--that is, using FFI and IO on the same file--rather than just adding utility features that application developers could have written themselves. I suppose we don't need a class for this, all we need is a couple of functions to convert between FilePath and CString.
Except paths are different on different platforms... for example:

    /a/b/../c/hello\ there/test

and:

    A:\a\b\

notice how the backslash is used to 'escape' a space or meta-character on unix, but is the path separator for windows. If you want to write portable applications, then you don't want to hard code the platform type. So converting from the datatype to a string is not simple:

    string = pathToString ...

one way of doing this is to have pathToString call a function to determine the system type and construct the string accordingly. The problem here is that it is not extensible by the user, the supported platforms are determined by the library. By using a class we can let the user define translations for new platforms...

Keean.

On Thu, Jan 27, 2005 at 11:33:21AM +0000, Keean Schupke wrote:
I guess it's just that I'm more concerned with making possible what is currently impossible (according to the library standards)--that is, using FFI and IO on the same file--rather than just adding utility features that application developers could have written themselves. I suppose we don't need a class for this, all we need is a couple of functions to convert between FilePath and CString.
Except paths are different on different platforms... for example:
/a/b/../c/hello\ there/test
and:
A:\a\b\
notice how the backslash is used to 'escape' a space or meta-character on unix, but is the path separator for windows.
The \ in your first example is not part of the filepath. If this was typed in a shell, the arguments passed to the executable wouldn't contain the backslash. User interfaces certainly do require "interesting" behavior, but the point I was trying to make is that a minimum requirement should be that a program *without* a user interface can work portably. And for me the FFI is sufficiently important that this means that we should be able to mix FFI access to files with the IO (System.IO or whatever) routines.

It should be possible to write a program that reads the contents of the current directory via the haskell libraries, and then uses the FFI to access the contents of those files... perhaps to count the number of symbolic links in the current directory. With the current API, this isn't possible, since technically getDirectoryContents returns a unicode string that withCString can't reliably turn into a C string describing the same file as the FilePath describes.

It's not a trivial question, since on posix systems the file names are *not* unicode strings, so it's not clear how to implement a getDirectoryContents or the Haskell 98 IO library in general. But whatever is done with FilePaths and unicode, it would be nice to have the ability to mix FFI and System.IO file and directory access routines. At least on most platforms, System.IO must convert FilePaths to CStrings internally anyways, so it's mostly just a question of exposing that conversion.

-- David Roundy http://www.darcs.net
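[The symlink-counting program described above might look like the sketch below. To keep it short it uses the unix package's lstat binding rather than a raw foreign import, but both getDirectoryContents and getSymbolicLinkStatus take a String FilePath, so the String-to-bytes round trip it relies on is exactly the step the current API cannot guarantee.]

    import System.Directory (getDirectoryContents)
    import System.Posix.Files (getSymbolicLinkStatus, isSymbolicLink)

    countSymlinks :: IO Int
    countSymlinks = do
        names <- getDirectoryContents "."
        let entries = [ n | n <- names, n /= ".", n /= ".." ]
        stats <- mapM getSymbolicLinkStatus entries
        return (length (filter isSymbolicLink stats))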

On 27 Jan 2005, at 11:33, Keean Schupke wrote:
Except paths are different on different platforms... for example:
/a/b/../c/hello\ there/test
and:
A:\a\b\
notice how the backslash is used to 'escape' a space or meta-character on
only it isn't. That's a property of a shell, the underlying OS allows spaces in file names with no need for an escaping mechanism. Jules

Jules Bean wrote:
only it isn't. That's a property of a shell, the underlying OS allows spaces in file names with no need for an escaping mechanism.
Okay, that was a mistake... but it does not change the point: pathToString needs to work out what platform it is on, and doing it without typeclasses makes it not extensible. We need a way of allowing people to define new path printers (as members of a class)... whilst having the program determine which platform it is on and choosing the correct instance (at compile time). Keean.

Keean Schupke wrote:
I guess it's just that I'm more concerned with making possible what is currently impossible (according to the library standards)--that is, using FFI and IO on the same file--rather than just adding utility features that application developers could have written themselves. I suppose we don't need a class for this, all we need is a couple of functions to convert between FilePath and CString.
Except paths are different on different platforms... for example:
/a/b/../c/hello\ there/test
and:
A:\a\b\
notice how the backslash is used to 'escape' a space or meta-character on unix,
That's Bourne-shell syntax, not Unix API syntax. So far as open() etc
are concerned, a backslash is just another character.
Also, Windows accepts both slash and backslash equally in most
situations. It's only really command-line parsing (where slash is
normally used to denote switches) where there's an issue.
--
Glynn Clements

Warning: I'm not interested in a path parsing/combining library, so my criticisms are perhaps unrelated to your goals.
One thing that I'd be interested in seeing for any Path class would be a simple instance for FilePath (or String, if you want to imagine FilePath will be changed). Not everyone will want the overhead of a massively heavyweight Path datatype.
I'm not convinced that this is massively heavyweight, but your criticism is heard. Perhaps we should realize that there are two separate things going on here. One is the ability to pass around path names as black boxes to the various IO routines without examining the path. The other is the ability to examine and manipulate the path. So perhaps we want something like this:

    class PathStorage p where
        fromBytes :: Int -> Ptr () -> IO p
        withBytes :: p -> (Ptr () -> IO a) -> IO a

    class (Show p, PathStorage p) => Path p where
        -- path manipulation routines...
        -- path parsing routines...
        -- etc...

Then we could have things like:

    instance PathStorage (Ptr ()) where
        fromBytes _ ptr = return ptr
        withBytes p f   = f p

or

    instance PathStorage (ForeignPtr ()) where
        fromBytes _ ptr = newForeignPtr finalizerFree ptr
        withBytes p f   = withForeignPtr p f

or

    instance (Storable p) => PathStorage [p] where
        fromBytes n ptr = peekArray n (castPtr ptr)
        withBytes p f   = withArray p f

as well as the full ADT implementations.

Hello Guys,

Let me propose another solution which is simpler (at least from my point of view) and will not break the existing code. When I designed the API of the original System.FilePath library I looked at the OS.Path modules from Python and ML. They both use a plain string to keep the file path but do precise parsing/printing of the path when it is manipulated. I haven't ever heard of any language that uses a special FilePath type instead of a plain string. I don't want to do any parsing/printing each time I need to open a file or create a directory. In most cases the file path is passed as a string from the outside world to the application, and if we have a special FilePath then we need to parse it each time. What I propose is the following:

  - Keep the existing System.IO API the same. openFile, createDirectory ... will take the file path as a string.

  - Introduce two new modules System.Posix.FilePath and System.Win32.FilePath. Each of them will provide functions for parsing/printing of paths to/from some platform specific type: PosixFilePath and Win32FilePath. As you can see from Robert Dockins' examples, these types can be completely different.

  - Introduce a third module System.FilePath which will do some basic operations on paths through parsing/printing. The API of this module can be similar to the one I wrote, but its implementation can be more accurate if it works on some ADT instead of a string. The module will use #ifdef in order to import the right one of the above two modules.

In most cases we do only simple manipulations on a path, and I don't think it is required (or easy) to explicitly parse/print the path only in order to change its extension. I prefer to invoke changeFileExt and not care how the function does its job. If someone would like to perform more complicated operations on a file path he can import the system specific module and use PosixFilePath or Win32FilePath. This is basically the way in which OS.Path is implemented in ML.

The type class solution doesn't work very well. As Simon said, it may require #ifdef-s in some places in order to choose the right type. Another disadvantage is that this will complicate some data types. Example:

    data FilePath a => MyFileInfo a = MyFileInfo { path :: a, size :: Integer }

I don't want to add extra type parameters here only in order to specify the right FilePath type.

Cheers, Krasimir
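[An illustrative sketch of the module layout Krasimir describes. The module names follow his proposal, but the exported functions and the CPP symbol are assumptions, and the file must be compiled with CPP enabled.]

    -- compile with CPP enabled (e.g. -cpp)
    module System.FilePath (changeFileExt, joinPaths) where

    #ifdef mingw32_HOST_OS
    import qualified System.Win32.FilePath as P
    #else
    import qualified System.Posix.FilePath as P
    #endif

    -- Thin re-exports: parsing/printing happens inside the platform module.
    changeFileExt :: FilePath -> String -> FilePath
    changeFileExt = P.changeFileExt

    joinPaths :: FilePath -> FilePath -> FilePath
    joinPaths = P.joinPaths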

Hello, On Jan 27, 2005, at 10:46 AM, Krasimir Angelov wrote:
Hello Guys,
Let me propose another solution which is simpler (at least from my point of view) and will not break the existing code. When I designed the API of the original System.FilePath library I looked at the OS.Path modules from Python and ML. They both use a plain string to keep the file path but do precise parsing/printing of the path when it is manipulated. I haven't ever heard of any language that uses a special FilePath type instead of a plain string. I don't want to do any parsing/printing each time I need to open a file or create a directory. In most cases the file path is passed as a string from the outside world to the application, and if we have a special FilePath then we need to parse it each time. What I propose is the following:
Actually, Common Lisp specifies a special data type to handle logical filepaths, which are distinct from file path strings. Having had to debug common lisp code that uses this (written by other people) I've observed that this attempt to do the "Right Thing" almost certainly has caused more trouble than it has solved. While an abstract filepath isolates you from having to deal with the syntax of file paths on different systems, it does not provide an abstract view of the filesystem hierarchy. These differ greatly, even among unix-like systems. Handling differences in the file system hierarchy inevitably results in a lot of system specific code, for any program that has to use files scattered across a system.
- Keep the existing System.IO API the same. openFile, createDirectory ... will take the file path as a string.
- Introduce two new modules System.Posix.FilePath and System.Win32.FilePath. Each of them will provide functions for parsing/printing of paths to/from some platform specific type: PosixFilePath and Win32FilePath. As you can see from Robert Dockins' examples, these types can be completely different.
- Introduce a third module System.FilePath which will do some basic operations on paths through parsing/printing. The API of this module can be similar to the one I wrote, but its implementation can be more accurate if it works on some ADT instead of a string. The module will use #ifdef in order to import the right one of the above two modules.
In most cases we do only simple manipulations on path and I don't think it is required and easy to explicitly parse/print the path only in order to change its extension. I prefer to invoke changeFileExt and don't care how the function will do its job. If someone would like to perform any more complicated operations on file path he can import the system specific module and use PosixFilePath or Win32FilePath. This is basically the way in which OS.Path is implemented in ML.
Your proposal above for a lightweight solution seems the right way to go. If there is really a need for a higher layer it could be built upon something like you suggest. One thing that the library shouldn't exclude is the manipulation of non-native file paths. For example, on my unix system I may want to generate a win32 file path as part of some code that will be executed on a Windows machine. The underlying os-specific modules should always be available, even if there is a module for file path manipulations specific to the host OS. (If I understand correctly, this is what you've proposed with the System.FilePath.)
The type class solution doesn't work very well. As Simon said it may require to have #ifdef-s in some places in order to choice the right type. Another disadvantage is that this will complicate some data types. Example:
    data FilePath a => MyFileInfo a = MyFileInfo { path :: a, size :: Integer }
I don't want to add extra type parameters here only in order to specify the right FilePath type.
Cheers, Krasimir
Best Wishes, Greg

Gregory Wright
Actually, Common Lisp specifies a special data type to handle logical filepaths, which are distinct from file path strings. Having had to debug common lisp code that uses this (written by other people) I've observed that this attempt to do the "Right Thing" almost certainly has caused more trouble than it has solved.
See also http://www.gigamonkeys.com/book/practical-a-portable-pathname-library.pdf -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

- Keep the existing System.IO API the same. openFile, createDirectory ... will take the file path as string.
The problem is that "string" means different things in haskell and in C. A C "string" is really just a contiguous sequence of octets in memory. A haskell string has a particular interpretation, that of a list of unicode characters. Depending on how strings come into and leave the haskell world, there may OR MAY NOT be a one-to-one mapping between C strings and haskell strings, if non-trivial character encodings are involved (they will be eventually). Decoding may fail (no haskell representation for that string), or it might be that (deocde . encode) /= id, which is also bad (file name returned from a directory listing gives file not found error). The sad truth is that FilePath = String is BROKEN. FilePath = [Word8] would at least preserve filenames as they move across the boundaries of the haskell world, but then simple questions like "does this file have a .gz ending" become difficult (because they depend on the encoding). We need something else. Maybe ADTs aren't it, but String certainly isn't. I don't think "mostly works, if you only use ASCII" is good enough for something as basic as file IO.
In most cases we do only simple manipulations on path
Even simple manipulations break in the presence of encoding issues, or even just of unusual paths. What is the extension of "\\.\TAPE0"? It's not "\TAPE0". BTW this is a valid path on Windows 2000 upwards. If you don't care about corner cases, then you have no worries. It would be nice to have correct handling for all valid paths on each supported OS though.

Even simple manipulations break in the presence of encoding issues, or even just of unusual paths. What is the extension of "\\.\TAPE0" ? Its not "\TAPE0". BTW this is a valid path on Windows 2000 upwards. If you don't care about corner cases, then you have no worries. It would be nice to have correct handling for all valid paths on each supported OS though.
Urk. This is a terrible example, sorry. Still, the point is that unusual paths can break simple seeming string manipulations.

I don't pretend to fully understand the various unicode standards, but it seems to me that these problems are deeper than the file path library. The equation (decode . encode) /= id seems confusing to me. Can you give me an example of when this happens? What can we do when the file name is passed as a command line argument to a program? We need to convert String to FilePath after all.
Even simple manipulations break in the presence of encoding issues, or even just of unusual paths. What is the extension of "\\.\TAPE0" ? Its not "\TAPE0". BTW this is a valid path on Windows 2000 upwards. If you don't care about corner cases, then you have no worries. It would be nice to have correct handling for all valid paths on each supported OS though.
Yes, it isn't. If the library does proper parsing it will return "no extension". Krasimir

I don't pretend to fully understand various unicode standard but it seems to me that these problems are deeper than file path library. The equation (decode . encode) /= id seems confusing for me. Can you give me an example when this happen?
I am pretty sure that ISO 2022 encoded strings can have multiple ways to express the same unicode glyphs. This means that any sensible relation between ISO 2022 strings and unicode strings maps more than one ISO 2022 string onto the same unicode string. The inverse is therefore not a function. To make it a function, one of the possibly several encodings of the unicode string will have to be chosen. So you have an ISO 2022 string A which is decoded to a unicode string U. We reencode U to an ISO 2022 string B. It may be that A /= B. That is the problem.

The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a unicode string. However, decoding is still a partial function and can fail. A discussion about this problem floated around on this list several months ago.
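[The property under discussion, written out for an arbitrary byte/Char codec pair. The codec functions themselves are assumptions; only the shape of the check matters here.]

    import Data.Word (Word8)

    -- True only if the bytes decode and re-encode to exactly the same bytes.
    -- For stateful encodings like ISO 2022 this can be False even for valid
    -- input; for the UTF encodings the decode step can still return Nothing.
    roundTripsExactly :: ([Word8] -> Maybe String) -> (String -> [Word8]) -> [Word8] -> Bool
    roundTripsExactly decode encode bytes =
        case decode bytes of
            Nothing -> False
            Just s  -> encode s == bytes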
What can we do when the file name is passed as command line argument to program? We need to convert String to FilePath after all.
Then we can parse the unicode and hope that nothing bad happens; the majority of the time, we will be OK. Or we can make the RTS allow access to the raw bytes of the program arguments, env variables, etc, and actually do the right thing.

On Thu, 27 Jan 2005 16:31:21 -0500, robert dockins
I don't pretend to fully understand various unicode standard but it seems to me that these problems are deeper than file path library. The equation (decode . encode) /= id seems confusing for me. Can you give me an example when this happen?
I am pretty sure that ISO 2022 encoded strings can have multiple ways to express the same unicode glyphs. This means that any sensible relation between ISO 2022 strings and unicode strings maps more than one ISO 2022 string onto the same unicode string. The inverse is therefore not a function. To make it a function, one of the possibly several encodings of the unicode string will have to be chosen. So you have an ISO 2022 string A which is decoded to a unicode string U. We reencode U to an ISO 2022 string B. It may be that A /= B. That is the problem.
The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a unicode string. However, decoding is still a partial function and can fail.
A discussion about this problem floated around on this list several months ago.
What can we do when the file name is passed as command line argument to program? We need to convert String to FilePath after all.
Then we can parse the unicode and hope that nothing bad happens; the majority of the time, we will be OK. Or we can make the RTS allow access to the raw bytes of the program arguments, env variables, etc, and actually do the right thing.
This means that all unicode languages I have used before (Java, C#) are broken too. In this case I agree that a special data type might be better. The development of the new FilePath should come together with the new unicode aware I/O library. I agree with David Roundy that the internal representation of FilePath should be as compact as possible. PackedString uses UArray Int Char to store strings, and we can use UArray Int Word8 or even ByteArray#. Under Windows nearly all API functions have two versions: ANSI and Unicode (16-bit). Under WinNT+ each ANSI function is just a wrapper around its Unicode friend, and the wrapper simply converts the passed strings. It was said that paths under Windows are [Word16] while in Posix they are [Word8]. This is true, but in order to take advantage of this we need to use the native Windows API in the new I/O library. Another advantage of this is that in this way we can use native non-blocking I/O under Windows. Cheers, Krasimir

The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a unicode string. However, decoding is still a partial function and can fail.
And while it is partly true, it is qualified by the problems relative to canonicalization (an "é" in Unicode can both be represented as "é" or as two chars (an e and an accent) and they should (ideally) compare equal). Stefan
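[The two spellings Stefan mentions, written as Haskell string literals: a precomposed U+00E9 versus an 'e' followed by the combining acute accent U+0301.]

    composed, decomposed :: String
    composed   = "\x00E9"    -- one code point: LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\x0301"   -- 'e' followed by COMBINING ACUTE ACCENT

    -- composed == decomposed is False: (==) on String compares code points,
    -- so without a normalization pass the two never compare equal.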

Stefan Monnier
The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a unicode string. However, decoding is still a partial function and can fail.
And while it is partly true, it is qualified by the problems relative to canonicalization (an "é" in Unicode can both be represented as "é" or as two chars (an e and an accent) and they should (ideally) compare equal).
In what sense "equal"? They are supposed to be equivalent as far as the semantics of the text is concerned, but representations are clearly different and most programs distinguish them. In particular they are different filenames on both Unix and Windows. AFAIK MacOS normalizes filenames, but using a slightly different algorithm than Unicode (perhaps just an older version). IMHO it makes no sense to pretend that they are exactly the same when strings consist of code points or lower level units (and I don't believe another choice for the default string type would be practical). -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk wrote:
The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a unicode string. However, decoding is still a partial function and can fail.
And while it is partly true, it is qualified by the problems relative to canonicalization (an "é" in Unicode can both be represented as "é" or as two chars (an e and an accent) and they should (ideally) compare equal).
In what sense "equal"? They are supposed to be equivalent as far as the semantics of the text is concerned, but representations are clearly different and most programs distinguish them. In particular they are different filenames on both Unix and Windows. AFAIK MacOS normalizes filenames, but using a slightly different algorithm than Unicode (perhaps just an older version).
IMHO it makes no sense to pretend that they are exactly the same when strings consist of code points or lower level units (and I don't believe another choice for the default string type would be practical).
Well, at least you and I agree on that.
Once you start down the "semantic equivalence" route, you will quickly
run into issues like "ß" == "ss", and it only gets worse from there
on.
--
Glynn Clements

Marcin 'Qrczak' Kowalczyk writes:
AFAIK MacOS normalizes filenames, but using a slightly different algorithm than Unicode (perhaps just an older version).
According to http://developer.apple.com/qa/qa2001/qa1173.html, Mac OS
X uses different forms depending on the file system.
| For example, HFS Plus uses a variant of Normal Form D in which
| U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through
| U+2FAFF are not decomposed (this avoids problems with round trip
| conversions from old Mac text encodings).
The big catch to watch out for is that Mac OS X supports UFS, which is
case sensitive, and HFS+, which is not. I've had at least one Haskell
program that didn't work properly because it tried to create two files
named "tags" and "TAGS" in the same directory.
--
David Menendez

robert dockins wrote:
I don't pretend to fully understand various unicode standard but it seems to me that these problems are deeper than file path library. The equation (decode . encode) /= id seems confusing for me. Can you give me an example when this happen?
I am pretty sure that ISO 2022 encoded strings can have multiple ways to express the same unicode glyphs. This means that any sensible relation between ISO 2022 strings and unicode strings maps more than one ISO 2022 string onto the same unicode string. The inverse is therefore not a function. To make it a function, one of the possibly several encodings of the unicode string will have to be chosen. So you have an ISO 2022 string A which is decoded to a unicode string U. We reencode U to an ISO 2022 string B. It may be that A /= B. That is the problem.
Exactly. And it isn't a theoretical issue. E.g. in an environment where EUC-JP is used, filenames may begin with <ESC>$)B (designate JISX0208 to G1), or they may not (because G1 is assumed to contain JISX0208 initally). More generally, ISO-2022 strings frequently contain redundant character-set switching sequences, so conversion to unicode and back again typically won't yield the original sequence of bytes.
The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a unicode string.
Except that there are some ad-hoc extensions, e.g. the UTF-8 variant
used by both Java and Tcl permits NUL characters to be embedded in
NUL-terminated UTF-8 strings by encoding them as a two-byte sequence
(which is invalid in UTF-8 proper).
--
Glynn Clements

Glynn Clements
And it isn't a theoretical issue. E.g. in an environment where EUC-JP is used, filenames may begin with <ESC>$)B (designate JISX0208 to G1), or they may not (because G1 is assumed to contain JISX0208 initally).
I think such encodings are never used as default encodings of a Unix locale.
The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a unicode string.
BOM is a problem. Unfortunately Unicode mandates that FEFF at the start of a UTF-8 text stream is a mark which doesn't belong to the text. It provides variants of UTF-16/32 with and without a BOM, but UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful encoding. Unix ignores this: it doesn't use a BOM in UTF-8, except in individual applications for individual file formats. iconv() on Linux and in libiconv doesn't process a BOM in UTF-8 (although in libiconv this is because it's old, based on an old RFC with 31-bit code points which didn't include a BOM). -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

On 2005-01-30, Marcin 'Qrczak' Kowalczyk
Glynn Clements
writes: And it isn't a theoretical issue. E.g. in an environment where EUC-JP is used, filenames may begin with <ESC>$)B (designate JISX0208 to G1), or they may not (because G1 is assumed to contain JISX0208 initally).
I think such encodings are never used as default encodings of a Unix locale.
The various UTF encodings do not have this particular problem; if a UTF string is valid, then it is a unique representation of a unicode string.
BOM is a problem. Unfortunately Unicode mandates that FEFF at the start of a UTF-8 text stream is a mark which doesn't belong to the text.
Right
It provides variants of UTF-16/32 with and without a BOM, but UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful encoding.
I think you mean "UTF-8 only has the variant without a BOM". Otherwise I'd like to see a citation in the standard for this. Because that's not the reading I get from http://www.unicode.org/faq/utf_bom.html. Instead, it seems that whether the BOM is included or not is a function of the protocol, and that the UTF-8 streams themselves do not include the BOM. -- Aaron Denney -><-

Aaron Denney
It provides variants of UTF-16/32 with and without a BOM, but UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful encoding.
I think you mean "UTF-8 only has the variant without a BOM".
No, unfortunately. Unicode standard section 3.10 defines these encoding schemes:

  - UTF-8 (with a BOM)
  - UTF-16BE (without a BOM)
  - UTF-16LE (without a BOM)
  - UTF-16 (with a BOM)
  - UTF-32BE (without a BOM)
  - UTF-32LE (without a BOM)
  - UTF-32 (with a BOM)

It says about the UTF-8 BOM: "Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme." IMHO it would be fair if it had two variants of the UTF-8 encoding scheme, just like it has three variants of UTF-16/32, so it would be unambiguous whether "UTF-8" in a particular context allows a BOM or not.

-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/
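[For reference, the UTF-8 encoding of U+FEFF is the byte sequence EF BB BF; a reader that follows the "neither required nor recommended" wording typically just strips it when present. A sketch:]

    import Data.Word (Word8)

    stripUtf8Bom :: [Word8] -> [Word8]
    stripUtf8Bom (0xEF:0xBB:0xBF:rest) = rest
    stripUtf8Bom bytes                 = bytes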

On 2005-01-30, Marcin 'Qrczak' Kowalczyk
Aaron Denney
writes: It provides variants of UTF-16/32 with and without a BOM, but UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful encoding.
I think you mean "UTF-8 only has the variant without a BOM".
...
IMHO it would be fair if it had two variants of UTF-8 encoding scheme, just like it has three variants of UTF-16/32, so it would be unambiguous whether "UTF-8" in a particular context allows BOM or not.
Ah. Okay. It's not that the BOM is always to be there, but that it's always ambiguous, which was not clear from your initial description. Better yet would be to have the standard never allow the BOM. Since some things can't handle it, on output we should never emit it, but still must handle it on input. Bah. -- Aaron Denney -><-

Aaron Denney
Better yet would be to have the standard never allow the BOM.
If I could decide, I would ban the BOM in UTF-8 altogether, but I'm afraid the Unicode Consortium doesn't want to do this. Microsoft Notepad puts a BOM in UTF-8 encoded files. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

I have been ruminating on the various responses my attempted file path implementation has generated. I have a design beginning to form in the back of my head which attempts to address the file path problem as I lay out below. Before I develop it any further, are there any important considerations I am missing? Here is my conception of the file name problem:

1) File names are abstract entities. There are a number of ways one might concretely represent a filename. Among these ways are:
   a) A contiguous sequence of octets in memory (C style string on most modern hardware)
   b) A sequence of unicode codepoints (Haskell style string)
   c) Algebraic datatypes supporting path manipulations (yet to be developed)

2) We would like these three representations to be isomorphic. Unfortunately, this cannot be. In particular, there are major issues with the translations between the (a) and (b) forms given above. One could imagine that translation issues involving the (c) form are also possible.

3) Translations between (a) and (b) must be parameterized by a character encoding. Translations to and from (c) will require some manner of description of the path syntax, which differs by OS.

4) In practice, the vast majority of file paths are portable between the various forms; the forms are "nearly" isomorphic, with corner cases being fairly rare.

5) Translations between the various forms cost compute cycles and memory, and are not necessarily bijective. Therefore, translations should occur _only_ if absolutely necessary. In particular, if a file name passes through a program as a black box (it is not examined or manipulated) it should undergo no transformation.

6) Different OSes handle file names differently. These differences should be accounted for, transparently where possible. These differences, however, should be exposed to developers for whom the differences matter.

7) Using simple file names should be easy. We don't want developers to have to worry too much about character encodings, path separators, and generally bizarre path syntax just to open files. The complexities of correct file name handling should be hidden from the casual programmer. However, developers interested in serious portability/internationalization should be able to get down into the muck if they need to.

Robert Dockins writes:
1) File names are abstract entities. There are a number of ways one might concretely represent a filename. Among these ways are:
   a) A contiguous sequence of octets in memory (C style string on most modern hardware)
   b) A sequence of unicode codepoints (Haskell style string)
   c) Algebraic datatypes supporting path manipulations (yet to be developed)
The solution I have in mind uses algebraic data types which are parameterized over the actual representation. Thus, you can use them to represent any type of path (in any kind of representation). In the spirit of release early, release often:

    http://cryp.to/pathspec/PathSpec.hs
    darcs get http://cryp.to/pathspec

The module currently knows only _relative_ paths. I am still experimenting with absolute paths because I have recently learned that on Windows something like "C:foo.txt" is actually relative -- not absolute. Very weird.

There also is a function which changes a path specification into its canonic form, meaning that all redundant segments are stripped. So although two paths which designate the same target may not be equal, they can be tested for equivalence.

Suggestions for enhancement are welcome, of course.

Peter
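[Not Peter's actual code (see the darcs URL above for that), but a sketch of the shape such a representation-parameterized path type can take, showing how the type system keeps Posix and Windows paths apart.]

    newtype Posix   = Posix   String deriving (Eq, Show)
    newtype Windows = Windows String deriving (Eq, Show)

    newtype RelPath seg = RelPath [seg] deriving (Eq, Show)

    -- Illustrative parser for the Posix flavour only; the segment type in the
    -- result is what distinguishes RelPath Posix from RelPath Windows.
    parsePosix :: String -> RelPath Posix
    parsePosix = RelPath . map Posix . filter (not . null) . splitOn '/'
      where
        splitOn c s = case break (== c) s of
            (seg, [])     -> [seg]
            (seg, _:rest) -> seg : splitOn c rest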

Peter Simons wrote:
[...] There also is a function which changes a path specification into its canonic form, meaning that all redundant segments are stripped. So although two paths which designate the same target may not be equal, they can be tested for equivalence.
Hmmm, I'm not really sure what "equivalence" for file paths should mean in the presence of hard/symbolic links, (NFS-)mounted file systems, etc. Haskell's stateless (==) function doesn't really make sense IMHO, but perhaps I've missed something in this epic discussion... :-] Cheers, S.

Sven Panne writes:
Hmmm, I'm not really sure what "equivalence" for file paths should mean in the presence of hard/symbolic links, (NFS-)mounted file systems, etc.
Well, there is a sort-of canonic version for every path; on most Unix systems the function realpath(3) will find it. My interpretation is that two paths are equivalent iff they point to the same target. You (and the others who pointed it out) are correct, though, that the current 'canon' function doesn't accomplish that. I guess, I'll have to move it into the IO monad to get it right. And I should probably rename it, too. ;-) Ben Rudiak-Gould writes:
The Read and Show instances aren't inverses of each other. I don't think we should be using Read for path parsing, for this reason.
That's fine with me; I can change that.
I don't understand why the path ADT is parameterized by segment representation, but then the Posix and Windows parameter types are both wrappers for String.
No particular reason. I just wanted to make the library work with a simple internal representation before doing the more advanced stuff. It is experimental code.
It seems artificial to distinguish read :: String -> RelPath Windows from read :: String -> RelPath Posix in this way.
I think it's pretty neat, actually. You have a way to specify what kind of path you have -- and the type system distinguishes it, not a run-time error. Peter

Peter Simons wrote:
Sven Panne writes:
Hmmm, I'm not really sure what "equivalence" for file paths should mean in the presence of hard/symbolic links, (NFS-)mounted file systems, etc.
Well, there is a sort-of canonic version for every path; on most Unix systems the function realpath(3) will find it.
OK, but even paths which realpath normalizes to different things might be the same (hard links!). This might be OK for some uses, but not for all.
My interpretation is that two paths are equivalent iff they point to the same target. [...]
This would mean that they are equal iff stat(2) returns the same device/inode pair for them. But this leaves other questions open:

  * Do we have something stat-like on every platform?
  * What does this mean for network file systems, e.g. in the presence of the same files/directories exported under different NFS mounts? I don't have enough books/manual pages at hand to answer this currently...
  * What does this mean if the file path doesn't refer to an existing file/directory?

IMHO we can provide something like realpath in the IO monad, but shouldn't define any equality via it.

Cheers, S.

Sven Panne writes:
OK, but even paths which realpath normalizes to different things might be the same (hard links!).
Sure, but paths it normalizes to the same thing almost certainly _are_ the same. ;-) That's all I am looking for. In general, I think path normalization is a nice-to-have feature, not a must-have.
IMHO we can provide something like realpath in the IO monad, but shouldn't define any equality via it.
You are right; Eq shouldn't be defined on top of that. And couldn't even, if normalization needs the IO monad anyway. Peter

Well, there is a sort-of canonic version for every path; on most Unix systems the function realpath(3) will find it.
Here is the BUGS listing from 'man realpath' on my system:

    Never use this function. It is broken by design since it is impossible to
    determine a suitable size for the output buffer. According to POSIX a
    buffer of size PATH_MAX suffices, but PATH_MAX need not be a defined
    constant, and may have to be obtained using pathconf(). And asking
    pathconf() does not really help, since on the one hand POSIX warns that
    the result of pathconf() may be huge and unsuitable for mallocing memory.
    And on the other hand pathconf() may return -1 to signify that PATH_MAX
    is not bounded.
My interpretation is that two paths are equivalent iff they point to the same target.
You might do better (on *nix) to check if two paths terminate in the same filesystem and then see if the inode numbers match (with some stat variant). Even that may break down for networked filesystems or FAT wrappers or other things that may lie about the inode number. You could also unravel the path manually, but that seems error-prone and unportable. This strikes me as yet another case of a simple-seeming operation that simply cannot be implemented correctly on file names.
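[A sketch of the stat-based check described above, using the unix package (Posix only): two paths are taken to refer to the same file when their device and inode numbers agree, with the caveats about networked and emulated file systems already noted.]

    import System.Posix.Files (getFileStatus, deviceID, fileID)

    sameFile :: FilePath -> FilePath -> IO Bool
    sameFile a b = do
        sa <- getFileStatus a
        sb <- getFileStatus b
        return (deviceID sa == deviceID sb && fileID sa == fileID sb)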

Peter Simons wrote:
Hmmm, I'm not really sure what "equivalence" for file paths should mean in the presence of hard/symbolic links, (NFS-)mounted file systems, etc.
Well, there is a sort-of canonic version for every path; on most Unix systems the function realpath(3) will find it. My interpretation is that two paths are equivalent iff they point to the same target.
I think that any definition which includes an "iff" is likely to be overly optimistic. More likely, you will have to settle for a definition such that, if two paths are considered equal, they refer to the same "file", but without the converse (i.e. even if they aren't equal, they might still refer to the same file). Even so, you will need to make certain assumptions. E.g. older Unices would allow root to replace the "." and ".." entries; you probably want to assume that can't happen.
You (and the others who pointed it out) are correct, though, that the current 'canon' function doesn't accomplish that. I guess, I'll have to move it into the IO monad to get it right. And I should probably rename it, too. ;-)
A version in the IO monad would allow for a "tighter" definition (i.e.
more likely to correctly identify that two different path values
actually refer to the same file).
[Certainly, you have to use the IO monad if you want to allow for case
sensitivity, as that depends upon which filesystems are mounted
where.]
Within the IO monad, the obvious approach is to stat() both pathnames
and check whether their targets have the same device/inode pairs.
That's reasonably simple, and probably about as good as you can get.
That still won't handle the case where you mount a single remote
filesystem via both NFS and SMB though. I doubt that anything can
achieve that.
There are also issues of definition, e.g. is "/dev/tty" considered
"equivalent" to the specific "/dev/ttyXX" device for the current
process?
--
Glynn Clements

Glynn Clements writes:
Well, there is a sort-of canonic version for every path; on most Unix systems the function realpath(3) will find it. My interpretation is that two paths are equivalent iff they point to the same target.
I think that any definition which includes an "iff" is likely to be overly optimistic.
I see your point. I guess it comes down to how much effort is put into implementing a realpath() derivative in Haskell.
Even so, you will need to make certain assumptions. E.g. older Unices would allow root to replace the "." and ".." entries; you probably want to assume that can't happen.
My take on things is that it is hopeless to even try and cover all this weird behavior. I'd like to treat paths as something abstract. What I'm aiming for is that my library can be used to manipulate file paths as well as URLs, namespaces, and whatnot else; so I'll necessarily lose some functionality that an implementation specifically designed for file paths could provide. If you want to be portable, you cannot use any esoteric functionality anyway.
There are also issues of definition, e.g. is "/dev/tty" considered "equivalent" to the specific "/dev/ttyXX" device for the current process?
No, because the paths differ. ;-) Peter

Peter Simons wrote:
The module currently knows only _relative_ paths. I am still experimenting with absolute paths because I have recently learned that on Windows something like "C:foo.txt" is actually relative -- not absolute. Very weird.
"\foo.txt" is also relative on Win32. And "con.txt" is absolute.
There also is a function which changes a path specification into its canonic form, meaning that all redundant segments are stripped. So although two paths which designate the same target may not be equal, they can be tested for equivalence.
Again, while this transformation may be useful in some cases, it is not a canonicalization operation. "foo/../bar" and "bar" do not in general refer to the same file, and "foo" and "foo/." are not in general equivalent. We shouldn't encourage these misconceptions in the library, even if we do provide a path-collapsing transformation along these lines.
Other comments: The Read and Show instances aren't inverses of each other. I don't think we should be using Read for path parsing, for this reason.
I don't understand why the path ADT is parameterized by segment representation, but then the Posix and Windows parameter types are both wrappers for String. It seems artificial to distinguish read :: String -> RelPath Windows from read :: String -> RelPath Posix in this way.
In general, this library doesn't seem to deal with any of the hard cases. The devil's in the details.
-- Ben

Peter Simons
There also is a function which changes a path specification into its canonic form, meaning that all redundant segments are stripped.
It's incorrect: canon (read "x/y/.." :: RelPath Posix) gives "x", yet on Unix they aren't equivalent when y is a non-local symlink or doesn't exist. Also, "x/." is not equivalent to "x": rmdir can be used with "x" but not with "x/.". -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

On 2005-01-31, Marcin 'Qrczak' Kowalczyk wrote:
Peter Simons writes:
There also is a function which changes a path specification into its canonic form, meaning that all redundant segments are stripped.
It's incorrect: canon (read "x/y/.." :: RelPath Posix) gives "x", yet on Unix they aren't equivalent when y is a non-local symlink or doesn't exist.
True, but most people want "x" when they construct "x/y/..", in makefiles, install scripts, etc. It's not "what the OS thinks is the same", and it shouldn't be marketed as such, but it is useful as "what people generally want to refer to". -- Aaron Denney -><-
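[Roughly what such a "what people generally want" collapse looks like on a list of relative segments -- a sketch of mine, not Peter's actual canon code. It drops "." and cancels "seg/.." purely syntactically, so as Marcin points out it does not preserve filesystem meaning when the cancelled segment is a symlink or does not exist:]

-- Purely syntactic: useful for makefiles and install scripts,
-- but NOT an equivalence on what the filesystem sees.
collapse :: [String] -> [String]
collapse = reverse . foldl step []
  where
    step (s:rest) ".." | s /= ".." = rest      -- cancel "seg/.."
    step acc "."                   = acc       -- drop "."
    step acc seg                   = seg : acc

-- e.g. collapse ["x","y",".."] == ["x"]
--      collapse ["..","x"]     == ["..","x"]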

This is a very good summary, and I'm interested to see what you come up with. robert dockins wrote:
1) File names are abstract entities. There are a number of ways one might concretely represent a filename. Among these ways are:
a) A contiguous sequence of octets in memory (C style string on most modern hardware) b) A sequence of unicode codepoints (Haskell style string)
b') A sequence of octets (Haskell style string, in real life)
4) In practice, the vast majority of file paths are portable between the various forms; the forms are "nearly" isomorphic, with corner cases being fairly rare.
I don't think they're so rare. I have files on my XP laptop which can't be represented in the system code page. It's easy for me to tell which programs are Unicode-aware and which aren't. -- Ben

I have not been following this thread too closely, but I have looked at the various proposed implementations floating around and have a few comments.
I noticed some have #ifdefs for platforms, which seems not very useful. I write not just cross-platform, but cross-platform-at-the-same-time programs, i.e. a linux program which might be controlling slaves running on windows or linux, meaning it will have to deal with paths from either operating system in a common way. We should make sure all functionality is available on all systems as much as is feasible. Another common use is a program running under cygwin, where you generally want to let the user work with unix or windows style paths.
We need to support paths as black boxes: what is returned from directory routines should be able to be passed back to system routines without modification or canonicalization. This will allow people to write file-chooser type apps, so even if the user-visible display name of a file has been changed by charset conversion or whatnot, we can still pass the black box gotten out of the directory listing back to an open call and be assured of getting the right file.
I don't think this is a good use of typeclasses, mainly because I don't think we should use different types for different platforms. I would like a single abstract Path type which can represent paths from any platform or an encapsulated black box from the system. I liked the root-relative formulation someone mentioned:

data FilePath = Path Root [Relative]
data Root     = WindowsDrive Char | UnixRoot | CurrentDir | RootBB BlackBox
data Relative = Dot | DotDot | Sub String | SubBB BlackBox
type BlackBox = UArray Int Word8

Note that black boxes can be used as path components as well as the entire path. Examples of where these would be used: getting the current directory would return an opaque black box representing the whole path, while a directory listing would return relative black boxen. Or imagine implementing cp -r; you'd want to create the same file names in a different directory.
just some comments... sorry for the hit-n-run. John -- John Meacham - ⑆repetae.net⑆john⑈
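[A small sketch of how the proposed types might be used -- my code, not John's. The path type is renamed Path to avoid clashing with the Prelude's FilePath, and fromRawBytes stands in for whatever primitive would hand back untranslated bytes from a directory listing:]

import Data.Array.Unboxed (UArray, listArray)
import Data.Word (Word8)

data Root     = WindowsDrive Char | UnixRoot | CurrentDir | RootBB BlackBox
data Relative = Dot | DotDot | Sub String | SubBB BlackBox
data Path     = Path Root [Relative]
type BlackBox = UArray Int Word8

-- Append a component (possibly an opaque one straight from the OS)
-- without inspecting or decoding it.
appendComponent :: Path -> Relative -> Path
appendComponent (Path root segs) seg = Path root (segs ++ [seg])

-- Wrap raw bytes as an opaque path component.
fromRawBytes :: [Word8] -> Relative
fromRawBytes bs = SubBB (listArray (0, length bs - 1) bs)

-- e.g. appendComponent (Path UnixRoot [Sub "home", Sub "user"])
--                      (fromRawBytes [0x66, 0x6f, 0x6f])   -- the bytes of "foo"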

At 23:39 30/01/05 +0100, Marcin 'Qrczak' Kowalczyk wrote:
Aaron Denney writes:
It provides variants of UTF-16/32 with and without a BOM, but UTF-8 only has the variant with a BOM. This makes UTF-8 a stateful encoding.
I think you mean "UTF-8 only has the variant without a BOM".
No, unfortunately. Unicode standard section 3.10 defines encoding schemes:
- UTF-8 (with a BOM)
- UTF-16BE (without a BOM)
- UTF-16LE (without a BOM)
- UTF-16 (with a BOM)
- UTF-32BE (without a BOM)
- UTF-32LE (without a BOM)
- UTF-32 (with a BOM)
It says about UTF-8 BOM: "Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence does not affect conformance to the UTF-8 encoding scheme."
IMHO it would be fair if it had two variants of UTF-8 encoding scheme, just like it has three variants of UTF-16/32, so it would be unambiguous whether "UTF-8" in a particular context allows BOM or not.
I haven't been following this thread in detail, so I may be missing something, but... How can it make sense to have a BOM in UTF-8? UTF-8 is a sequence of octets (bytes); what ordering is there here that can sensibly be varied? #g ------------ Graham Klyne For email: http://www.ninebynine.org/#Contact

Graham Klyne
How can it make sense to have a BOM in UTF-8? UTF-8 is a sequence of octets (bytes); what ordering is there here that can sensibly be varied?
The *name* "BOM" doesn't make sense when applied to UTF-8, but some software uses UTF-8 encoded U+FEFF it as a marker that the file is encoded in UTF-8 rather than some other encoding. And Unicode seems to support this usage, even if it doesn't recommend it. I know only of Microsoft Notepad, and suspect other Microsoft tools (Notepad assumes UTF-8 with the marker and the current Windows codepage without). The HTML at http://www.microsoft.com/ begins with a BOM, but other pages linked from there do not. I think XML used to be silent about this, but later got amended to explicitly say that optional U+FEFF at the beginning is allowed and not treated as a part of document contents. OTOH various other sofrware, in particular generic Unix tools, don't treat UTF-8 BOM specially, and de facto implement the "non-standard" UTF-8 without a BOM. Technically in UTF-16/32 the BOM is handled in the translation between encoding form (sequence of 16- or 32-bit code units) and encoding scheme (these words serialized into bytes). I think it's supposed to be the same in UTF-8, i.e. the analogous translation is *almost* trivial - it translates bytes to the same bytes - except that initial BOM must be stripped on decoding, and it must be added on encoding when the first character of the contents is U+FEFF (and optionally in other cases). I mean that it is supposed to happen on decoding UTF-8 on the level of bytes, not after decoding on the level of code points. Anyway, on Unix it just doesn't happen at all, except in software which explicitly handles it. iconv() doesn't handle UTF-8 BOM. If I could decide about it, I would ban UTF-8 BOM at all. But perhaps Unicode Consortium can be at least persuaded to recognize that some software doesn't accept BOM in UTF-8, and could be conforming to the variant of UTF-8 without the BOM rather than non-conforming at all. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

On 2005 January 31 Monday 04:56, Graham Klyne wrote:
How can it make sense to have a BOM in UTF-8? UTF-8 is a sequence of octets (bytes); what ordering is there here that can sensibly be varied?
Correct. There is no order to be varied. A BOM came to be permitted because it uses the identical code as NBSP (non-breaking space). Earlier versions of Unicode permit NBSP just about anywhere in the character sequence. Unicode 4 deprecates this use of NBSP. If I read it correctly, Unicode 4 says that a BOM at the beginning of a UTF-8 encoded stream is not to be taken as part of the text. The BOM has no effect. The rationale for this is that some applications put out a BOM at the beginning of the output regardless of the encoding. Other occurrences of NBSP in a UTF-8 encoded stream are significant.

On Mon, 31 Jan 2005 10:30:58 -0500, you wrote:
A BOM came to be permitted because it uses the identical code as NBSP (non-breaking space).
Not quite. It's the same code (U+FEFF) as ZERO WIDTH NO-BREAK SPACE. This is _not_ the same thing as NO-BREAK SPACE (U+00A0), which is what you frequently see in HTML as &nbsp;. Steve Schafer Fenestra Technologies Corp http://www.fenestra.com/

Robert Dockins wrote:
This is true in a sense, but I think making the distinction explicit is helpful for a number of the operations we want to do. For example, what is the parent of the relative path "."? Answer is "..". What is the parent of "/." on unix? Answer is "/.".
While true, I don't see what this has to do with the choice between PathStart and Maybe PathRoot. The types are isomorphic; we can detect and simplify the /.. case either way.
Relative paths can refer to different things in the filesystem depending on process-local state, whereas absolute paths will always refer to the same thing (until the filesystem changes, or if you do something esoteric like "chroot"). Relative paths are really "path fragments."
Okay, this is a good point. There is a difference between a path fragment (i.e. a path with no starting point specified) and a path which explicitly starts in the process's default directory. You're right that the pathname "foo/bar" can only sensibly be put in the former category.
The problem is that where Posix has just

    Absolute | Relative

Win32 has

         Absolute
         /      \
        /        \
   Rel:Abs      Abs:Rel
        \        /
         \      /
         Relative

where "Rel:Abs" means something like "\foo" and "Abs:Rel" means something like "c:foo". I never realized before what a nightmare it is to handle this sensibly. The problem is that pathAppend "c:\foo\bar" "d:." == "d:.", but pathAppend "d:\foo\bar" "d:." == "d:\foo\bar\.". Therefore, pathAppend "\foo\bar" "d:." doesn't have a value at all, since its meaning depends on the current drive in a way that can't be expressed in the Win32 path syntax.
Because of the above problem, I'm willing to treat path fragments (Relative in both lattices) as a special case. But we still need to be able to round-trip rel:abs and abs:rel pathnames, meaning that the PathRoot type won't necessarily be a genuine cwd-independent root any more.
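[The lattice and the append problem can be made concrete with a toy model -- my sketch, not Ben's code. The constructor names are invented, and the Maybe result encodes his "doesn't have a value at all":]

data Win32Start
  = Absolute Char   -- "c:\foo"  : drive and path both absolute
  | AbsRel Char     -- "c:foo"   : drive given, path relative ("Abs:Rel")
  | RelAbs          -- "\foo"    : drive unknown, path absolute ("Rel:Abs")
  | Relative        -- "foo\bar" : plain path fragment
  deriving (Eq, Show)

data Win32Path = Win32Path Win32Start [String]
  deriving (Eq, Show)

pathAppend :: Win32Path -> Win32Path -> Maybe Win32Path
-- a plain fragment extends whatever it is appended to
pathAppend (Win32Path s bsegs) (Win32Path Relative segs) =
  Just (Win32Path s (bsegs ++ segs))
-- a fully absolute path replaces the base outright
pathAppend _ p@(Win32Path (Absolute _) _) = Just p
-- "\baz" keeps the base's drive when the base names one
pathAppend (Win32Path (Absolute c) _) (Win32Path RelAbs segs) =
  Just (Win32Path (Absolute c) segs)
pathAppend _ p@(Win32Path RelAbs _) = Just p
-- "d:." keeps the base only when the drives match ...
pathAppend (Win32Path (Absolute c) bsegs) (Win32Path (AbsRel d) segs)
  | c == d    = Just (Win32Path (Absolute c) (bsegs ++ segs))
  | otherwise = Just (Win32Path (AbsRel d) segs)
-- ... and against a drive-less base the answer depends on the current
-- drive, which the syntax cannot express (Ben's "\foo\bar" ++ "d:.")
pathAppend _ (Win32Path (AbsRel _) _) = Nothing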
There are a few others. I took a look at MSDN earlier and was astounded.
Is there an MSDN page that actually gives a grammar, or at least a taxonomy, of Win32 pathnames? That would be useful. Incidentally, NT doesn't do a perfect job of parsing its own pathnames. While experimenting I managed to create a file named "..", different from the directory ".." (both show up in the directory listing), which I was subsequently unable to read or delete. The command was something like "cat > ..:foo". I doubt that this behavior is by design.
pathCleanup :: p -> p -- remove .. and suchlike
This can't be done safely except in a few special cases (e.g. "/.." -> "/"). I'm not sure it should be here.
More than you would think, if you follow the conventions of modern unix shells.
It's not a general convention even in the shell, just a peculiarity of the cd builtin:

GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
# mkdir /bar
# mkdir /bar/baz
# ln -s /bar/baz /foo
# echo /file > /file
# echo /bar/file > /bar/file
# cat /foo/../file
/bar/file
# ls /foo/..
baz  file
# cd /foo
# cat ../file
/bar/file

In the vast majority of cases it's not safe to collapse "/foo/.." to "/". On reflection I think the function should be provided, but with a name like pathCancel and a usage warning in the documentation. If it's not provided people will write it themselves without realizing that it's unsafe.
-- Ben

While true, I don't see what this has to do with the choice between PathStart and Maybe PathRoot. The types are isomorphic; we can detect and simplify the /.. case either way.
True
Because of the above problem, I'm willing to treat path fragments (Relative in both lattices) as a special case. But we still need to be able to round-trip rel:abs and abs:rel pathnames, meaning that the PathRoot type won't necessarily be a genuine cwd-independent root any more.
So we'd like to treat path fragments as a special case (eg, one can only append a path fragment), but path fragments are ambiguous with paths rooted at the CWD (of the current drive, in windows). I'm not sure how best to deal with this. It would be nice to have a separate type for path fragments so you would not have to resort to 'error' or such to prevent invalid appends, but you want relative paths to have the same type as path fragments because they can't be distinguished....
There are a few others. I took a look at MSDN earlier and was astounded.
Is there an MSDN page that actually gives a grammar, or at least a taxonomy, of Win32 pathnames? That would be useful.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base... http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base... The descriptions in CreateFile are as close as I could find.
Incidentally, NT doesn't do a perfect job of parsing its own pathnames. While experimenting I managed to create a file named "..", different from the directory ".." (both show up in the directory listing), which I was subsequently unable to read or delete. The command was something like "cat > ..:foo". I doubt that this behavior is by design.
That doesn't surprise me; it does make things painful, however.

Ben Rudiak-Gould wrote:
Is there an MSDN page that actually gives a grammar, or at least a taxonomy, of Win32 pathnames? That would be useful.
It would also be longer than "War and Peace", once you start allowing for MS-DOS "8.3" pathnames, codepages, the fact that anything anywhere which contains "aux", "con", "lpt" etc refers to a device (sometimes), the fact that "..." == "../.." (sometimes), the handling of incomplete multibyte characters, ...
Search the BugTraq archives for issues related to IIS access-control lists to discover the myriad different names which can be used to refer to a given file for which the administrator is (unsuccessfully) trying to restrict access.
--
Glynn Clements

Robert Dockins
More than you would think, if you follow the conventions of modern unix shells. eg, "foo/.." is always equal to ".",
For the OS it's not the same if foo is a non-local symlink. Shells tend to resolve symlinks themselves on cd, and "cd .." means to remove the last component of the unexpanded current directory, which may be different from the directory listed by "ls ..".
(rather than doing a chdir on the ".." hardlink, which does strange things in the presence of symlinks). The operation is sufficiently useful that I think it should be included. It lets us know, for example, that "/bar/../foo/tmp" and "/foo/tmp" refer to the same file, without resorting to any IO operations.
I disagree. The point is they are *not* the same file. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Ben Rudiak-Gould wrote:
I'm tentatively opposed to (B), since I think that the only interesting difference between Win32 and Posix paths is in the set of starting points you can name. (The path separator isn't very interesting.) But maybe it does make sense to have separate starting-point ADTs for each operating system. Then of course there's the issue that Win32 edge labels are Unicode, while Posix edge labels are [Word8]. Hmm.
Several assumptions here... We might want more platforms than windows/unix. The separator for these systems is different (\ for windows, / for unix - who knows what other obscure systems may use). It seems to me a type class would allow the user to add definitions for their platform (i.e. it is extensible)... datatypes tend to be hard to extend, as you have to find every use in the code and modify it.
For code to be portable it has to use a different path parser depending on the platform, but the code must not be different... One way of doing this would be to use a class:

data Windows
data Unix
type System = Unix

class ParsePath a where
    parsePath' :: a -> String -> Path

instance ParsePath Windows where
    parsePath' _ a = ...

instance ParsePath Unix where
    parsePath' _ a = ...

If all paths can be expressed in a single type, it seems different path parsers and printers are required. All the other functions could operate on the standard datatype. This still leaves the problem of determining what system you are compiling on... I guess I still don't see the problem with having:

#ifdef Unix
type System = Unix
#endif
#ifdef Windows
type System = Windows
#endif

in some library somewhere... In fact it's the only way I can see of selecting the correct instance at compile time... and using classes is the only way I can think of making the system easily extensible (even if we use a single datatype for all paths).
Keean.
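[To make the sketch above compile, the elided parsePath' bodies have to be filled in somehow; here is a toy version of mine where each instance merely splits on its separator, which is of course far short of a real parser. The Windows/Unix types are given dummy constructors so a value can be passed for dispatch:]

data Windows = Windows
data Unix    = Unix

newtype Path = Path [String] deriving Show

class ParsePath a where
  parsePath' :: a -> String -> Path

splitOn :: Char -> String -> [String]
splitOn sep = foldr step [[]]
  where
    step c acc@(cur:rest)
      | c == sep  = [] : acc
      | otherwise = (c : cur) : rest
    step _ []     = [[]]          -- unreachable; keeps the function total

instance ParsePath Unix where
  parsePath' _ = Path . filter (not . null) . splitOn '/'

instance ParsePath Windows where
  parsePath' _ = Path . filter (not . null) . splitOn '\\'

-- parsePath' Unix "/usr/local/bin"    ==> Path ["usr","local","bin"]
-- parsePath' Windows "C:\\temp\\foo"  ==> Path ["C:","temp","foo"]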

Ben Rudiak-Gould wrote:
Symbolic links complicate things a bit, since they are followed like edges but are actually paths.
Actually, they're essentially arbitrary byte strings. The OS doesn't automatically require them to be valid paths. Most of the API functions to which you might pass a symbolic link will fail if they don't contain a valid path. But not all; e.g. lstat() and readlink() work regardless. So far as invalid symlinks are concerned, the main issue is that programs don't die horribly just because they "encounter" an invalid symlink (as opposed to trying to actually "use" one).
Then of course there's the issue that Win32 edge labels are Unicode, while Posix edge labels are [Word8]. Hmm.
Strictly speaking, they're [CChar], but I doubt that anyone will ever implement Haskell on a platform where a byte isn't 8 bits wide.
pathToForeign :: p -> IO (Ptr CChar) pathFromForeign :: Ptr CChar -> IO p
This interface is problematic. Is the pointer returned by pathToForeign a heap pointer which the caller is supposed to free? If so, a Ptr CChar instance would have to copy the pathname every time. And I don't understand exactly what pathFromForeign is supposed to do.
I presume that it's meant to be the fundamental un-marshalling function for the Path class. But it seems Unix-specific; on Windows, filenames wouldn't normally be converted to CChars.
--
Glynn Clements

On Sun, Jan 30, 2005 at 02:17:01PM +0000, Glynn Clements wrote:
Ben Rudiak-Gould wrote:
pathToForeign :: p -> IO (Ptr CChar) pathFromForeign :: Ptr CChar -> IO p
This interface is problematic. Is the pointer returned by pathToForeign a heap pointer which the caller is supposed to free? If so, a Ptr CChar instance would have to copy the pathname every time. And I don't understand exactly what pathFromForeign is supposed to do.
I presume that it's meant to be the fundamental un-marshalling function for the Path class. But it seems Unix-specific; on Windows, filenames wouldn't normally be converted to CChars.
No, it's not Unix-specific, it's portable. If you want to write portable C code, you have to use the standard library, which means that file names are represented as Ptr CChar. Even on Windows, one can program in C, so there does exist a mapping from windows file names to Ptr CChar. It may be that this mapping sometimes fails (i.e. some files may not be accessible via the C standard library), but if this happens one can just throw an exception. -- David Roundy http://www.darcs.net
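[One way around Ben's ownership question -- who frees the Ptr CChar? -- is a bracket-style method in the spirit of withCString, where the library allocates, runs the caller's action, and frees the buffer itself. A sketch only, not a proposal from the thread; the class and method names are made up:]

import Foreign.C.String (CString, withCString)

class AbstractPath p where
  pathToString :: p -> String
  -- Marshal for a C call without handing out ownership of the buffer.
  withCPath    :: p -> (CString -> IO a) -> IO a
  withCPath p act = withCString (pathToString p) act   -- default via String

instance AbstractPath String where
  pathToString = id
  -- A byte-oriented instance could override withCPath and skip the round
  -- trip through String entirely; on Windows a wide-character (CWString)
  -- variant would be wanted instead, per Marcin's point below.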

David Roundy
No, it's not Unix-specific, it's portable. If you want to write portable C code, you have to use the standard library, which means that file names are represented as Ptr CChar.
I disagree. We are talking about portable Haskell, not portable C. The native Windows filename encoding (in WinNT) is UTF-16 (without validation that surrogates are correctly paired). It provides an API which works with such filenames. It also provides a compatibility layer on top of that, which tries to translate filenames to some 8-bit encoding, but it should only be used when porting C programs which can't be adapted to use Windows filenames through their native type. Especially as Haskell already uses Unicode - it's pointless to convert it to an 8-bit encoding and back. I don't know what these 8-bit WinAPI functions do when they encounter filenames unrepresentable in the current 8-bit encoding. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

On Sun, Jan 30, 2005 at 09:25:00PM +0100, Marcin 'Qrczak' Kowalczyk wrote:
David Roundy writes:
No, it's not Unix-specific, it's portable. If you want to write portable C code, you have to use the standard library, which means that file names are represented as Ptr CChar.
I disagree. We are talking about portable Haskell, not portable C.
I guess I'd just like a portable way to interface portable Haskell code with portable C code. -- David Roundy http://www.darcs.net

Glynn Clements
Then of course there's the issue that Win32 edge labels are Unicode, while Posix edge labels are [Word8]. Hmm.
Strictly speaking, they're [CChar], but I doubt that anyone will ever implement Haskell on a platform where a byte isn't 8 bits wide.
On POSIX it's the same. It's not the same when only the C standard is concerned. There existed platforms with 9 bit chars (Unisys 1100). I don't know whether they are still in use. I doubt that any Haskell compiler will run on such a system. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/
participants (20)
- Aaron Denney
- Ben Rudiak-Gould
- David Menendez
- David Roundy
- Glynn Clements
- Graham Klyne
- Gregory Wright
- John Meacham
- Jules Bean
- Keean Schupke
- Krasimir Angelov
- Marcin 'Qrczak' Kowalczyk
- Peter Simons
- robert dockins
- Robert Dockins
- Scott Turner
- Simon Marlow
- Stefan Monnier
- Steve Schafer
- Sven Panne