RE: Proposal for a new I/O library design

On Mon, 28 Jul 2003, Simon Marlow wrote:
I'm concerned about one implementation difficulty. Your File type is independent of the filesystem. That is, on Unix it corresponds to an inode. Creating a File must correspond to "opening" it (in Unix speak). Creating a stream corresponds to duplicating the file descriptor (you could probably avoid too many unnecessary dups by being clever). There's a potential implementation difficulty, though: lookupFileByPathname must open the file, without knowing whether the file will be used for reading or writing in the future.
I know; I'm hoping against hope that this isn't an insurmountable problem. If the OS provides a "reopen" function which is like open except that it takes a file handle instead of a pathname, then I think implementation is straightforward: a File contains a handle with minimal access permissions and maximal sharing permissions, and when a read or write operation is attempted we open a second handle based on the first with additional permissions. There's a Win32 function called "ReOpenFile" with this functionality, but it's only in Windows Server 2003. Sigh. If there's a way to open files by unique ID instead of pathname, that would also work. I think the NT API might provide something like this, but looking through the online documentation just now I can't find anything of the sort. If it's not possible to provide a guarantee of File identity then we should probably drop the whole idea of File values. See my comments under Directory below.
So I would suggest that operations which create a value of type File take a read/write flag too.
This would break the conceptual identity between a File value and a file, since read-only access is not a property of a file. (Well, it can be, but it isn't in this case.) Functions which allowed access rights to be specified would have to return a FileAccessPath instead of a File, and a FileAccessPath is basically a handle, so we're back where we started. All we need here is a way to change the access and sharing rights on an already-open handle. I find it hard to believe that after decades of use by millions of people, the UNIX file API provides no way to do this safely. Maybe there's an fcntl or something?
type FilePos = Word64
type BlockLength = Int
FilePos should be Integer.
Seems reasonable.
fCheckRead :: File -> FilePos -> BlockLength -> IO Bool
fCheckWrite :: File -> FilePos -> BlockLength -> IO Bool
What do these do? If they're supposed to return True if the required data can be read/written without blocking, then I suspect that they are not useful.
They're supposed to return True if the data can be read/written successfully, the idea being that this is how you check whether you have read/write access to the file. Probably I should have omitted the second and third arguments.
I'd use the traditional 'isEOF' way of detecting end of file.
Seems reasonable. (Should be "EOS" though, I think.)
On naming: it's probably not a good idea to use the 'is' prefix, since it is already used for predicates (meaning literally 'is' rather than an abbreviation for 'InputStream').
I agree completely. Come up with something better and I'll second it. :-) (How about renaming Streams to Channels? Then we could use "ic" and "oc".)
You will also want a way to get back from an InputStream to the underlying object, eg. the (File,FilePos) pair if one exists.
Agreed.
It's not pretty, but you certainly want a way to close a stream. Finalizers aren't reliable enough.
What are the practical problems with relying on finalizers? As far as I can see, the "no more filehandles available" problem is completely solved by forcing a major GC and trying again when it occurs. The only other issue I see is leaving other processes unable to access the file for an indeterminate period of time. The right solution to this, if it can be implemented, is something like withExclusiveWriteAccess :: File -> IO a -> IO a, with write access being non-exclusive (or even disallowed?) otherwise.
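The "force a major GC and try again" idea can be sketched roughly as follows. This is only an illustration: it assumes that running out of file handles surfaces as an IOError for which isFullError holds, and that the finalizers triggered by the collection have released handles by the time of the retry, neither of which is guaranteed. The names retryAfterMajorGC and flakyOpen are hypothetical and not part of the proposal.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import System.IO.Error (catchIOError, fullErrorType, isFullError, mkIOError)
import System.Mem (performMajorGC)

-- Run an action; if it fails with a "resource exhausted" style error,
-- force a major GC (so finalizers get a chance to run) and retry once.
-- Any other IOError is rethrown unchanged.
retryAfterMajorGC :: IO a -> IO a
retryAfterMajorGC action =
  action `catchIOError` \err ->
    if isFullError err
      then performMajorGC >> action
      else ioError err

-- Hypothetical stand-in for an "open" call: the first attempt fails as if
-- no handles were left, later attempts succeed and return the attempt number.
flakyOpen :: IORef Int -> IO Int
flakyOpen attempts = do
  modifyIORef' attempts (+ 1)
  n <- readIORef attempts
  if n == 1
    then ioError (mkIOError fullErrorType "flakyOpen" Nothing Nothing)
    else return n
```

The retry happens only once, so a genuinely exhausted system still reports the error to the caller.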
How did you intend text encodings to work? I see several possibilities:
textDecode :: TextEncoding -> [Octet] -> [Char]
or
decodeInputStream :: TextEncoding -> InputStream -> TextInputStream
getChar :: TextInputStream -> IO Char
etc.
or
setInputStreamCoding :: InputStream -> TextEncoding -> IO ()
getChar :: InputStream -> IO Char
I was thinking of the second. It could easily be implemented as the third under the hood. But I was hoping someone else would worry about it. :-)
data Directory -- abstract
I don't see a reason for changing the existing Directory support (System.Directory). Could you give some motivation here? Is the idea to abstract away from the syntax of pathnames on the platform (eg. directory separator characters)? If so, I'm not sure it's worthwhile. There are lots of differences between pathname conventions: case sensitivity, arbitrary limits on the length of filenames, filename extensions, and so on.
Basically, the usual interface encourages programmers to treat pathnames as file/directory identifiers, even though they aren't. This is the root cause of a whole class of security vulnerabilities (not to mention some everyday annoyances). I want to avoid those vulnerabilities in the Haskell model by providing values that *really are* file and directory identifiers.

Pathnames have one good property: they're human-readable and -writable. That's their only good property. Within an application, they should be converted immediately to a more secure internal representation. (And the conversion should be done exactly once -- any more and you're opening yourself to security exploits.)

This is why I really don't want to use the File concept unless we can guarantee file-File identity. A system that appears to be secure but actually isn't is even worse than one which is obviously insecure.

This idea isn't complete unless the model also supports persistence of File and Directory values, but I didn't even bother drawing up an API for this because I'm sure it's impossible to implement. Any sane OS would provide support for this, but I don't think any widespread OS does. There should also be a DirectoryEntry type, but, again, I'm pretty sure that this can't be implemented.

On reflection I think the (Directory, Maybe String) return value is a mistake. The intent was to support creating a new file or directory by pathname, but that's probably better done by functions like those you propose below. (The return value was originally supposed to be Either DirectoryEntry (Directory,String), which made more sense.)
lookupFileByPathname :: String -> IO File
Here, I suggest we need
lookupFileByPathname :: FilePath -> IOMode -> IO File
If so, it should be called something like "newFileAccessPath" or at least "lookupFileAccessPath".
lookupInputStreamByPathname :: String -> IO InputStream -- at least as likely to succeed as lookupFileByPathname
and similarly
createFileOutputStream :: FilePath -> IO OutputStream
appendFile :: FilePath -> IO OutputStream
Definitely.

-- Ben

If it's not possible to provide a guarantee of File identity then we should probably drop the whole idea of File values.
I haven't been following in enough detail to know if File values are a good idea or not, but it's maybe worth mentioning that if you're on NFS, there's no way to test file identity. IIRC, the problem is that a machine might nfs-export overlapping parts of its filesystem, so you might have two names for the same file.

-- Alastair Reid

ps (At least in FreeBSD) mmapping an NFS-mounted file can also lead to unhappiness. If you mmap the same NFS-mounted file multiple times, you can end up with multiple copies of the same page, which leads to chaos when you start writing to the pages. (At least, that was the conclusion we came to when trying to understand why libelf produced incorrect results when used with NFS-mounted files if compiled with mmapping turned on, but produced correct results for NFS-mounted files with mmapping turned off and with local files (mmapping on or off).)

On Tue, 29 Jul 2003, Alastair Reid wrote:
Ben Rudiak-Gould wrote:
If it's not possible to provide a guarantee of File identity then we should probably drop the whole idea of File values.
I haven't been following in enough detail to know if File values are a good idea or not, but it's maybe worth mentioning that if you're on NFS, there's no way to test file identity. IIRC, the problem is that a machine might nfs-export overlapping parts of its filesystem, so you might have two names for the same file.
I don't think this is fatal. The important part of File identity is that two File values which compare equal necessarily denote the same file, not the converse. I don't think there's any observable difference between File values which denote different files and File values which denote the same file but don't compare equal, since there could be a demon (not daemon -- I'm thinking Maxwell's demon here) watching everything you do to file x/y and immediately doing the same thing to file y/x. Now, if NFS doesn't even guarantee the forward implication (e.g. for filehandle comparison by value), then that's bad. But not bad for us. Bad for anyone foolish enough to use NFS.
ps (At least in FreeBSD) mmapping an NFS-mounted file can also lead to unhappiness. If you mmap the same NFS-mounted file multiple times, you can end up with multiple copies of the same page which leads to chaos when you start writing to the pages.
That's bad too. See previous paragraph. :-)

-- Ben

Ben Rudiak-Gould wrote:
I don't think this is fatal. The important part of File identity is that two File values which compare equal necessarily denote the same file, not the converse. [...]
Huh? I thought it was the other way round. What is this identity of yours good for?

Cheers, S.

On Tue, 29 Jul 2003, Sven Panne wrote:
Ben Rudiak-Gould wrote:
I don't think this is fatal. The important part of File identity is that two File values which compare equal necessarily denote the same file, not the converse. [...]
Huh? I thought it was the other way round. What is this identity of yours good for?
What I wrote above doesn't make much sense. Here's what I was trying to say.

The keystone of this whole model is that File values are not file handles: they are files. It's essential that the library implementation maintain the link from the File value to the file by something more reliable than a pathname. For example, in the following code:

    do f1 <- lookupFileByPathname "/mydir/myfile"
       fRead f1 ...
       f2 <- lookupFileByPathname "/mydir/myfile"
       putStrLn (if f1 == f2 then "equal" else "inequal")
       fRead f2 ...

f1 and f2 may compare equal or they may compare inequal (the latter being the case if someone has renamed the original file and put a new one in its place, for example), but if they compare equal then it *must* be the case that they refer to the same actual file. This would not hold true if File values were implemented internally as pathnames.

Why does this matter? Because people use the File abstraction in their code whether the system provides it or not, and if the system doesn't provide it, the way they implement it is as, you guessed it, pathnames. This opens them up to various security exploits that involve renaming files or directories in between successive opens of the same pathname.

In Haskell, I thought we could avoid this by implementing the File abstraction correctly once and for all. People would have to use the system-supplied abstraction rather than their own, but that would just be among the things you'd learn when learning Haskell programming, along with strong typing, side-effect isolation, pattern matching, and so on. Like those features, the File abstraction not only prevents a certain class of bugs, it's also conceptually more elegant than the alternatives.

But the upshot of the discussion here is that there's no way to implement a well-behaved File abstraction on Win32 or Posix, so people are just going to have to continue writing insecure programs on a less-elegant API. Stop me if I start to sound bitter. :-)

-- Ben

(All of this also applies to the Directory abstraction, of course. Another example from the proposal: dCreateFileEntry must fail if an entry by that name already exists, and it must return IO File and not IO (). Anything else is exploitable. Even this can't be implemented on some systems, I think.)
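The identity requirement can be made concrete with a toy model. The sketch below is not the proposed library: it assumes a filesystem modelled as a list of (pathname, file id) bindings, and the names ToyFS, lookupFileToy and replaceFileToy are hypothetical. It shows only that a File value resolved once keeps denoting the original file, while a second lookup of the same pathname may yield a different, unequal File.

```haskell
-- Toy model: a filesystem is a list of (pathname, file id) bindings.
type FileId = Int
type ToyFS = [(FilePath, FileId)]

-- A File value stands for the file itself, not for the name used to find it.
newtype File = File FileId
  deriving (Eq, Show)

-- Resolve a pathname once; afterwards the File, not the path, is the identifier.
lookupFileToy :: ToyFS -> FilePath -> Maybe File
lookupFileToy fs path = File <$> lookup path fs

-- Model "someone renamed the original file and put a new one in its place":
-- the pathname stays the same but now points at a different file id.
replaceFileToy :: FilePath -> FileId -> ToyFS -> ToyFS
replaceFileToy path newId fs = (path, newId) : filter ((/= path) . fst) fs
```

In this model, two lookups of "/mydir/myfile" before and after a replacement give File values that compare unequal, which is exactly the case a pathname-based representation cannot detect.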

Ben Rudiak-Gould wrote:
On Mon, 28 Jul 2003, Simon Marlow wrote:
[...] lookupFileByPathname must open the file, without knowing whether the file will be used for reading or writing in the future.
I know; I'm hoping against hope that this isn't an insurmountable problem.
Well, I fear it is, at least on POSIX...
If the OS provides a "reopen" function which is like open except that it takes a file handle instead of a pathname,
On POSIX, I'm not aware of anything like that, only dup/dup2, but you can't change the access mode after duplicating the fd (at least fcntl on Linux is not capable of doing it).
[...] a File contains a handle with minimal access permissions and maximal sharing permissions,
The next problem: How should one get a file descriptor on POSIX without knowing the access mode in advance? If the file is not readable O_RDONLY will fail, if it is only writeable O_WRONLY will fail, O_RDWR is even worse... OK, we could stat the file first, but there is no guarantee that the file permissions are still the same when we later want to "reopen" it.
[...] If there's a way to open files by unique ID instead of pathname, that would also work.
I'm not aware of this on POSIX (open a file by inode/fs?).
[...] All we need here is a way to change the access and sharing rights on an already-open handle. I find it hard to believe that after decades of use by millions of people, the UNIX file API provides no way to do this safely.
Personally, I think this is a sign that one is heading towards the wrong direction... :-)
[...] What are the practical problems with relying on finalizers? As far as I can see, the "no more filehandles available" problem is completely solved by forcing a major GC and trying again when it occurs.
But on quite a few systems there is an upper limit on the *global* number of open files, so you would be a "bad citizen" for such a system.
How did you intend text encodings to work? I see several possibilities:
textDecode :: TextEncoding -> [Octet] -> [Char]
or
decodeInputStream :: TextEncoding -> InputStream -> TextInputStream
getChar :: TextInputStream -> IO Char
etc.
or
setInputStreamCoding :: InputStream -> TextEncoding -> IO ()
getChar :: InputStream -> IO Char
I was thinking of the second. It could easily be implemented as the third under the hood. But I was hoping someone else would worry about it. :-)
In the non-IO versions you have a problem if the encoder/decoder encounters an error because of a malformed InputStream. In the IO case one can simply raise an IO exception. And using "Maybe TextInputStream" won't help, because this would essentially make the encoder/decoder strict in its InputStream argument. Cheers, S.
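The strictness point can be illustrated with a deliberately tiny decoder. This is a sketch under stated assumptions: it handles 7-bit ASCII only, Octet is taken to be Word8, and decodeAsciiLazy and decodeAsciiMaybe are hypothetical names rather than anything from the proposal. The lazy version can start producing characters immediately but can only signal malformed input by failing when the bad octet is reached; the Maybe version is total but has to inspect every octet before it can return anything.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

type Octet = Word8

-- Lazy decoder: output is produced incrementally, so it works on unbounded
-- input, but a malformed octet is reported only when the consumer reaches it.
decodeAsciiLazy :: [Octet] -> [Char]
decodeAsciiLazy = map step
  where
    step o
      | o < 128 = chr (fromIntegral o)
      | otherwise = error ("malformed octet: " ++ show o)

-- Total decoder: returns Nothing on malformed input, but must traverse the
-- whole input before yielding Just, so it never returns on unbounded input.
decodeAsciiMaybe :: [Octet] -> Maybe [Char]
decodeAsciiMaybe = traverse step
  where
    step o
      | o < 128 = Just (chr (fromIntegral o))
      | otherwise = Nothing
```

For example, take 2 (decodeAsciiLazy (cycle [104, 105])) terminates, whereas decodeAsciiMaybe applied to the same infinite list would not.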

On Tue, Jul 29, 2003 at 10:19:21AM +0200, Sven Panne wrote:
Ben Rudiak-Gould wrote:
On Mon, 28 Jul 2003, Simon Marlow wrote:
[...] lookupFileByPathname must open the file, without knowing whether the file will be used for reading or writing in the future. I know; I'm hoping against hope that this isn't an insurmountable problem.
Well, I fear it is, at least on POSIX...
If the OS provides a "reopen" function which is like open except that it takes a file handle instead of a pathname,
On POSIX, I'm not aware of anything like that, only dup/dup2, but you can't change the access mode after duplicating the fd (at least fcntl on Linux is not capable of doing it).
fcntl(2) (a wonderful catch-all for everything files) can change the flags on an open file. however, it cannot change access modes on all systems.
[...] a File contains a handle with minimal access permissions and maximal sharing permissions,
The next problem: How should one get a file descriptor on POSIX without knowing the access mode in advance? If the file is not readable O_RDONLY will fail, if it is only writeable O_WRONLY will fail, O_RDWR is even worse... OK, we could stat the file first, but there is no guarantee that the file permissions are still the same when we later want to "reopen" it.
[...] If there's a way to open files by unique ID instead of pathname, that would also work.
I'm not aware of this on POSIX (open a file by inode/fs?).
yeah. this is not possible in POSIX and even if it did exist, i would imagine it would interact oddly with non-unixy and network filesystems.
[...] All we need here is a way to change the access and sharing rights on an already-open handle. I find it hard to believe that after decades of use by millions of people, the UNIX file API provides no way to do this safely.
Personally, I think this is a sign that one is heading towards the wrong direction... :-)
yeah. a part of the problem is that with some filesystems, access rights are not a quality of the file itself, but of its name. for instance /foo and /bar might point to the same file but one is read-only. this is generally not the case on traditional unix filesystems, but is sometimes the preferred method of access control on some systems. in fact, in EROS it was the ONLY method of access control. knowing something's name gave you power over it and things could have several names :). what fun.
[...] What are the practical problems with relying on finalizers? As far as I can see, the "no more filehandles available" problem is completely solved by forcing a major GC and trying again when it occurs.
But on quite a few systems there is an upper limit on the *global* number of open files, so you would be a "bad citizen" for such a system.
also, with many OS architectures, operations involving fds get slower as the number of fds increases. they have to be looked up in some sort of data structure on the kernel side and searching for the lowest free one when allocating can be slow. treat fd's as precious. never hold something open longer than necessary. if nothing else it clutters up the lsof output and obfuscates what your program is doing when viewed by system tools.
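One way to hold a descriptor no longer than necessary with the existing System.IO handles is to scope it with bracket, roughly as below. This is a minimal sketch against today's Handle API, not the proposed stream API; withInputFile and firstLine are hypothetical names.

```haskell
import Control.Exception (bracket)
import System.IO (Handle, IOMode (ReadMode), hClose, hGetLine, openFile)

-- Open a file, run the action, and close the handle as soon as the action
-- finishes, whether it returns normally or throws an exception.
withInputFile :: FilePath -> (Handle -> IO a) -> IO a
withInputFile path = bracket (openFile path ReadMode) hClose

-- Example use: hGetLine reads strictly, so the result is fully available
-- before the handle is closed.
firstLine :: FilePath -> IO String
firstLine path = withInputFile path hGetLine
```

The descriptor's lifetime is then tied to the dynamic extent of the action rather than to the garbage collector.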
How did you intend text encodings to work? I see several possibilities:
textDecode :: TextEncoding -> [Octet] -> [Char]
or
decodeInputStream :: TextEncoding -> InputStream -> TextInputStream
getChar :: TextInputStream -> IO Char
etc.
or
setInputStreamCoding :: InputStream -> TextEncoding -> IO ()
getChar :: InputStream -> IO Char
I was thinking of the second. It could easily be implemented as the third under the hood. But I was hoping someone else would worry about it. :-)
In the non-IO versions you have a problem if the encoder/decoder encounters an error because of a malformed InputStream. In the IO case one can simply raise an IO exception. And using "Maybe TextInputStream" won't help, because this would essentially make the encoder/decoder strict in its InputStream argument.
please can we figure out portable binary IO before worrying about i18n? the problems are relatively orthogonal, but the 'right thing to do' for i18n is not as clear, and arguably not as important since with portable binary IO one can implement any sort of character processing on top of it. it is my opinion that haskell dropped the ball big big time by specifying IO in terms of undefined OS character set encodings. always raw binary would have been so much more useful. you can always write a bit more code to do \r\n <-> \n or convert utf8 to Chars yourself, but there is no way to ever turn an undefined operation into anything useful. just some thoughts and a (hopefully small and somewhat relevant) rant. :)

-John

--
---------------------------------------------------------------------------
John Meacham - California Institute of Technology, Alum. - john@foo.net
---------------------------------------------------------------------------
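As a small illustration of layering text handling on top of raw octets, the \r\n to \n direction might look like the sketch below. It assumes the octets have already been read in binary mode as a list of Word8; crlfToLf is a hypothetical name.

```haskell
import Data.Word (Word8)

-- Replace each CR LF pair (13, 10) with a single LF (10).
-- A lone CR that is not followed by LF is passed through unchanged.
crlfToLf :: [Word8] -> [Word8]
crlfToLf (13 : 10 : rest) = 10 : crlfToLf rest
crlfToLf (b : rest) = b : crlfToLf rest
crlfToLf [] = []
```

Because the function is an ordinary lazy list transformation, it composes with whatever decoder is applied afterwards.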
participants (4): Alastair Reid, Ben Rudiak-Gould, John Meacham, Sven Panne