Proposal #3456: Add FilePath -> String decoder

Currently, FilePaths on POSIX systems are represented as raw bytes in a String. When this last came up on the mailing list, the general consensus was to make FilePath an abstract datatype representing paths as Strings on Windows and raw bytes on POSIX systems. However, such a change is sure to break some existing code.

As a small step towards that goal, I propose adding the following two functions to the System.IO module:

    filePathToString :: FilePath -> IO String
    getFilePathToStringFunc :: IO (FilePath -> String)

I've implemented those functions in a patch attached to the trac ticket. Haddock docs are here:
http://code.haskell.org/~judah/new-io-docs/System-IO.html#v%3AfilePathToStri...

On POSIX those functions decode according to the current locale, so they ought to be in the IO monad. I think that the latter function will be easier to integrate into existing pure code.

On Windows, those functions are just `return` and `return id`, respectively, since GHC already treats FilePaths as Strings on that platform.

Discussion deadline: September 9
Ticket: http://hackage.haskell.org/trac/ghc/ticket/3456

Best,
-Judah
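A minimal sketch of how the second function might be threaded into pure code: fetch the decoder once in IO, then pass it around as an ordinary argument. The definition of getFilePathToStringFunc below is a hypothetical stand-in, not the code from the patch; it only mirrors the Windows behaviour described above (`return id`). The helper names displayNames and listing are likewise invented for illustration.

```haskell
-- Hypothetical stand-in for the proposed function. This stub mirrors the
-- Windows behaviour described in the proposal (return id); the POSIX
-- version in the patch would decode according to the current locale.
getFilePathToStringFunc :: IO (FilePath -> String)
getFilePathToStringFunc = return id

-- Pure code receives the decoder as a plain function argument.
displayNames :: (FilePath -> String) -> [FilePath] -> [String]
displayNames decode = map decode

-- The decoder is fetched once in IO and then handed to the pure code.
listing :: [FilePath] -> IO [String]
listing paths = do
  decode <- getFilePathToStringFunc
  return (displayNames decode paths)
```

The point of the design is that only the single lookup lives in IO; everything downstream stays pure.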

Hi Judah,

A few comments:

- I would spell 'filePathToString' as just 'toString' and use the module system to provide a namespace (i.e. stick it in System.FilePath).

- A function that takes the encoding as a parameter instead of fetching it from the current locale seems more useful. However, creating such a function is a bit problematic since passing an explicit encoding doesn't really make sense on Windows where a path is already represented as Unicode. Perhaps the only solution is to have System.FilePath.Posix.toString and System.FilePath.Windows.toString with different type signatures.

- As long as FilePath is just a type synonym the function is unsafe as it's equivalent to

      decodeWithCurrentLocale :: String -> IO String

  which fails if the argument string is actually used to store Unicode data (rather than bytes stored as code points < 256). This is probably intentional but it might be worth mentioning it in the Haddock documentation.

Cheers,
Johan
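The last point can be illustrated with a small check. This is only a sketch: the names looksLikeBytes and guardBytes are hypothetical, and the test is necessary rather than sufficient, since a String whose code points are all below 256 might still be already-decoded Latin-1 text rather than raw bytes.

```haskell
import Data.Char (ord)

-- True when every Char is in the range 0..255, i.e. the String could be
-- a sequence of raw bytes stored one per Char.
looksLikeBytes :: String -> Bool
looksLikeBytes = all ((< 256) . ord)

-- Hypothetical guard: refuse to decode a String that cannot be raw bytes.
guardBytes :: String -> Either String String
guardBytes s
  | looksLikeBytes s = Right s
  | otherwise        = Left "string already contains code points >= 256"
```

A check like this can only catch the obvious misuse; the type synonym gives no way to tell the two kinds of String apart in general.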

Johan Tibell wrote:
Perhaps the only solution is to have System.FilePath.Posix.toString and System.FilePath.Windows.toString with different type signatures.
I'm not sure there's any point. As Duncan pointed out, we are not just talking about the file system, we are talking about interaction between the file system and a user interface - how file paths should appear to users. So it also depends on what UI you are using.

For example, GTK2 on Unix always uses UTF-8 to display file paths no matter what the current locale - unless you've set a certain environment variable. Most X terminals display file paths using the current locale. I'm not sure what the current situation is in Qt.

On Mac OS X, HFS+ stores file names as UTF-16, and file paths in POSIX calls are interpreted as UTF-8. But canonical Unicode is used, so the actual file path might not be the same as what you provided if it includes combining characters. I think that Windows also converts the file path to (some kind of) canonical Unicode in the presence of combining characters.

So we should probably add stringToFilePath as well - encode on vanilla POSIX, canonicalize and encode on Mac OS X, canonicalize on Windows. We need to research exactly which canonical form is used on each platform. Unfortunately, that may depend upon the file system. Also, based on past experience, I fear that on Windows "canonical" may mean something different than anything published.

I am now beginning to lean towards Ketil's suggestion that on POSIX platforms we should always use UTF-8. We then need a prominent warning in the documentation that if you need something else, like the current locale, decode it yourself.

Note that it is becoming increasingly rare for people to use non-UTF-8 locales anywhere in the world, and even then it's likely ignored by many UIs. So I'm inclined against cluttering the API with convenience functions for other encodings, as Johan is suggesting.

As a way forward - I propose:

1. Accept Judah's patch, modified always to use UTF-8.

2. Add strident warnings in the documentation that:

   o If you need a different encoding on POSIX, do it yourself.

   o If FilePath does not come from the file system, it may not match the actual file path used in the file system due to Unicode canonicalization.

3. Open a feature request for stringToFilePath as described above.

Regards,
Yitz

On Wed, 2009-08-26 at 16:14 +0300, Yitzchak Gale wrote:
Johan Tibell wrote:
Perhaps the only solution is to have System.FilePath.Posix.toString and System.FilePath.Windows.toString with different type signatures.
I'm not sure there's any point. As Duncan pointed out, we are not just talking about the file system, we are talking about interaction between the file system and a user interface - how file paths should appear to users. So it also depends on what UI you are using.
Mmm, this stuff is complex :-( In general I like the idea of the proposal that we have functions for converting between String and FilePath. As it says in the proposal, it gets us closer to being able to treat FilePath as abstract. Of course the devil is in the detail. Getting it right, and making it portable and usable is hard.
I am now beginning to lean towards Ketil's suggestion that on POSIX platforms we should always use UTF-8. We then need a prominent warning in the documentation that if you need something else, like the current locale, decode it yourself.
That's nice in that it makes the function pure, or equivalently so that it does not need a locale parameter.
Note that it is becoming increasingly rare for people to use non-UTF-8 locales anywhere in the world, and even then it's likely ignored by many UIs. So I'm inclined against cluttering the API with convenience functions for other encodings, as Johan is suggesting.
As a way forward - I propose:
1. Accept Judah's patch, modified always to use UTF-8.
If we don't have the locale stuff then doesn't the API become a lot simpler? Instead of:

    filePathToString :: FilePath -> IO String
    getFilePathToStringFunc :: IO (FilePath -> String)

We'd have:

    filePathToString :: FilePath -> String

Presumably on POSIX we will follow the glib approach of using '?' replacement chars, since the conversion to string is aimed at human consumption. Doing this makes the function total but lossy.

And I didn't notice anything in the proposal about the other direction, converting String to FilePath. Surely we need both.

    stringToFilePath :: String -> FilePath

A nice thing about using UTF8 on POSIX is we know this function cannot fail, unlike conversions into a locale encoding. Presumably on POSIX this does not do any kind of Unicode canonicalisation, while on OSX and Windows it would do the appropriate kind.

At this point I expect Johan to jump up and down and say these should be:

    import qualified System.FilePath as FilePath

    FilePath.toString :: FilePath -> String
    FilePath.fromString :: String -> FilePath

In principle I guess it'd be ok to add versions in the System.FilePath.Posix module that take an extra encoding parameter, but it can't be the portable version since the encoding is fixed for OSX and Windows. It's also jolly inconvenient, and as you've pointed out, of diminishing importance.
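To make the "total but lossy" idea concrete, here is a rough sketch of what the POSIX pair could look like. It is not the patch and not an existing library API: it assumes a FilePath holds one raw byte per Char, decodes UTF-8 by hand, and substitutes '?' for each byte that does not decode. No Unicode canonicalisation is attempted.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- Assumption: the FilePath stores one raw byte per Char (code points 0..255).

-- Decode UTF-8, replacing each undecodable byte with '?'.
-- Total (never fails) but lossy (bad bytes cannot be recovered).
toString :: FilePath -> String
toString [] = []
toString (c:cs)
  | b < 0x80               = c : toString cs
  | b >= 0xC2 && b <= 0xDF = multi 1 (b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (b .&. 0x07) 0x10000
  | otherwise              = '?' : toString cs
  where
    b = ord c
    -- n continuation bytes expected; minCode rejects overlong encodings.
    multi n lead minCode =
      let (conts, rest) = splitAt n cs
          isCont x = ord x >= 0x80 && ord x <= 0xBF
          code = foldl (\acc x -> (acc `shiftL` 6) .|. (ord x .&. 0x3F)) lead conts
          valid = length conts == n && all isCont conts
                  && code >= minCode && code <= 0x10FFFF
                  && not (code >= 0xD800 && code <= 0xDFFF)
      in if valid then chr code : toString rest else '?' : toString cs

-- Encode as UTF-8, one byte per Char. Every Char has an encoding, so this
-- never fails (a lone surrogate is encoded as-is in this sketch).
fromString :: String -> FilePath
fromString = concatMap encodeChar
  where
    encodeChar ch
      | n < 0x80    = [ch]
      | n < 0x800   = map chr [0xC0 .|. (n `shiftR` 6), cont n]
      | n < 0x10000 = map chr [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
      | otherwise   = map chr [0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12),
                               cont (n `shiftR` 6), cont n]
      where n = ord ch
    cont x = 0x80 .|. (x .&. 0x3F)
```

With these definitions, toString . fromString is the identity on ordinary text, while fromString . toString loses information whenever the original bytes were not valid UTF-8.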
2. Add strident warnings in the documentation that:
o If you need a different encoding on POSIX, do it yourself.
o If FilePath does not come from the file system, it may not match the actual file path used in the file system due to Unicode canonicalization.
Similar points apply to trying to round-trip via

    toString . fromString :: String -> String
    fromString . toString :: FilePath -> FilePath

The String -> String transform would do some Unicode canonicalisation on Windows and OSX. The FilePath -> FilePath would be identity on Windows and OSX for strings coming from the file system. On POSIX however we can get utf8 decoding errors which will give us replacement chars.

So the advice in this section of the documentation should probably be similar to the glib docs, where it says that you should keep both forms in some circumstances: one so that you can present the file name to the user through a graphical or command line ui, and the other so that you can still access the same file later (eg to save it). Especially in document-oriented GUI apps, it's very annoying if you open, edit and save, but saving either fails because it cannot re-encode, or ends up writing a different file (different in Unicode canonicalisation or having replacement chars).

Duncan
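The "keep both forms" advice could be sketched as a small record. Everything here is hypothetical (the FileName type, mkFileName, and the toy asciiOnly decoder are invented for illustration); the idea is simply that the raw path is what gets passed back to file operations, while the decoded name is only ever shown to the user.

```haskell
-- Hypothetical record: the raw path is used for file operations,
-- the decoded name only for display.
data FileName = FileName
  { rawPath     :: FilePath  -- exactly as returned by the file system
  , displayName :: String    -- possibly lossy, for human consumption only
  } deriving (Eq, Show)

-- Build both forms at once from whatever decoder is in use.
mkFileName :: (FilePath -> String) -> FilePath -> FileName
mkFileName decode p = FileName { rawPath = p, displayName = decode p }

-- Toy lossy decoder for demonstration: non-ASCII becomes '?'.
asciiOnly :: FilePath -> String
asciiOnly = map (\c -> if c < '\128' then c else '?')
```

Saving then goes through rawPath, so a lossy displayName can never cause the wrong file to be written.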

On Fri, Aug 28, 2009 at 3:50 PM, Duncan Coutts wrote:
On Wed, 2009-08-26 at 16:14 +0300, Yitzchak Gale wrote:
I am now beginning to lean towards Ketil's suggestion that on POSIX platforms we should always use UTF-8. We then need a prominent warning in the documentation that if you need something else, like the current locale, decode it yourself.
That's nice in that it makes the function pure, or equivalently so that it does not need a locale parameter.
Note that it is becoming increasingly rare for people to use non-UTF-8 locales anywhere in the world, and even then it's likely ignored by many UIs. So I'm inclined against cluttering the API with convenience functions for other encodings, as Johan is suggesting.
I agree that this would make the API much simpler; but I'm wary of broad statements like the above. My (very vague) impression was that many Japanese users, for example, still use non-Unicode encodings.

I think that glib is an interesting example. Its developers advocate pretty strongly for everyone to use utf-8 filenames; but even they provide a simple way for the user of any glib program to override that behavior by setting G_FILENAME_ENCODING=@locale.

As another example, Python v.3, which recently redesigned its Unicode interface, also still uses the locale for filenames rather than solely utf-8. The following interview with Guido from January has a good take on why they did that (about halfway through the article):
http://broadcast.oreilly.com/2009/01/the-evolution-of-python-3.html

If we really want a pure FilePath->String conversion, then perhaps we could make the rts check the locale once at the start of the program, and have every subsequent conversion use that locale. This would be safe from order-of-operation changes; though it would be possible for the same pure code to behave differently in two different program runs...so I'm unsure about that solution.

Best,
-Judah
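The "check once" idea can be approximated in library code, without touching the rts, using a top-level value behind unsafePerformIO. This is only a sketch under simplifying assumptions: it looks at the LANG environment variable alone (real locale resolution also consults LC_ALL and LC_CTYPE), the value is read on first use rather than at program start, and the names startupLang, isUtf8Locale and usesUtf8 are hypothetical.

```haskell
import Data.Char (toLower)
import Data.List (isInfixOf)
import Data.Maybe (fromMaybe)
import System.Environment (lookupEnv)
import System.IO.Unsafe (unsafePerformIO)

-- Read once, on first use, and cached for the rest of the run.
-- NOINLINE keeps GHC from duplicating the lookup.
{-# NOINLINE startupLang #-}
startupLang :: String
startupLang = unsafePerformIO (fmap (fromMaybe "C") (lookupEnv "LANG"))

-- Crude test on a locale name such as "en_US.UTF-8".
isUtf8Locale :: String -> Bool
isUtf8Locale name = any (`isInfixOf` lower) ["utf-8", "utf8"]
  where lower = map toLower name

-- A pure value whose result is fixed within one program run,
-- but may differ between runs.
usesUtf8 :: Bool
usesUtf8 = isUtf8Locale startupLang
```

This exhibits exactly the caveat raised above: usesUtf8 is stable within one run, yet the same pure code can give a different answer in the next.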

Judah Jacobson wrote:
Duncan Coutts wrote:
Note that it is becoming increasingly rare for people to use non-UTF-8 locales anywhere in the world, and even then it's likely ignored by many UIs. So I'm inclined against cluttering the API with convenience functions for other encodings, as Johan is suggesting.
I agree that this would make the API much simpler; but I'm wary of broad statements like the above. My (very vague) impression was that many Japanese users, for example, still use non-Unicode encodings.
Indeed, non-Unicode encodings are still commonplace in Japan (and, last I checked, China). FWIW. -- Live well, ~wren

That ticket is a duplicate of #3307; please reference your patch there. There has already been some discussion of this proposal in that bug, and in the referenced thread on Haskell Cafe. -Yitz

On Sun, Aug 23, 2009 at 09:27:15AM -0700, Judah Jacobson wrote:
Currently, FilePaths on POSIX systems are represented as raw bytes in a String. When this last came up on the mailing list, the general consensus was to make FilePath an abstract datatype representing paths as Strings on Windows and raw bytes on POSIX systems. However, such a change is sure to break some existing code.
Indeed, it would break an enormous amount of existing code. A more feasible migration path would be to introduce a new abstract type, say NativeString, with corresponding variants of the file operations, getArgs and getEnv, beside the Haskell 98 functions. That could be done now. Using Char to hold raw bytes (as GHC is doing, and this proposal would implicitly condone) is a really bad idea. It forces the programmer to keep track of whether a particular String has been converted or not. That's a job for the type system.
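A rough sketch of what such a type might look like on POSIX, so that the type system rather than the programmer tracks whether a value is still raw bytes. All names here are hypothetical; only the name NativeString comes from the message above. The representation as a list of Word8 is an assumption for POSIX, and a real module would export the type without its constructor to keep it abstract.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical abstract type for undecoded, platform-native strings.
-- A real module would hide the constructor in its export list.
newtype NativeString = NativeString [Word8]
  deriving (Eq, Show)

-- Wrap a legacy byte-per-Char String; Nothing if some Char is not a byte.
fromByteChars :: String -> Maybe NativeString
fromByteChars s
  | all ((< 256) . ord) s = Just (NativeString (map (fromIntegral . ord) s))
  | otherwise             = Nothing

-- Convert back to the legacy byte-per-Char representation.
toByteChars :: NativeString -> String
toByteChars (NativeString ws) = map (chr . fromIntegral) ws
```

Variants of the file operations, getArgs and getEnv would then take or return NativeString, leaving the Haskell 98 functions untouched.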

participants (6)
- Duncan Coutts
- Johan Tibell
- Judah Jacobson
- Ross Paterson
- wren ng thornton
- Yitzchak Gale