RE: [Haskell-cafe] Re: I/O interface

On 12 January 2005 01:27, Ben Rudiak-Gould wrote:
First of all, I don't think any OS shares file pointers between processes. Otherwise it would be practically impossible to safely use an inherited filehandle via any API. Different threads using the same filehandle do share a file pointer (which is a major nuisance in my experience, because of the lack of an atomic seek-read/write), but a Posix fork duplicates the file pointer along with all other state. I can't believe I'm wrong about this, but someone please correct me if I am.
I'm afraid you're wrong. Believe me, I'm as surprised as you. See the definition of "Open File Description" in POSIX: http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_253
This limits the problem to a single process. If you're only using GHC's lightweight threads, there's no problem at all. If you're using OS threads, the worst thing that could happen is that you might have to protect handle access with a critical section. I don't think this would lead to a noticeable performance hit when combined with the other overhead of file read/write operations (or lazy evaluation for that matter).
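The critical-section idea above can be sketched in Haskell with an MVar guarding the handle, so that seek and read happen as one atomic step with respect to other Haskell threads. This is a minimal sketch: LockedHandle and preadLocked are made-up names, and the lock only protects against other threads in the same process, not against other processes sharing the descriptor.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Data.Word (Word8)
import Foreign (allocaBytes, peekArray)
import System.IO

-- Made-up names for this sketch: a Handle paired with a lock, so that
-- seek and read run as one critical section.
data LockedHandle = LockedHandle (MVar ()) Handle

newLockedHandle :: Handle -> IO LockedHandle
newLockedHandle h = do
  lock <- newMVar ()
  return (LockedHandle lock h)

-- Positioned read, atomic with respect to other Haskell threads using the
-- same LockedHandle: no other thread can move the file pointer between
-- the seek and the read. (Other processes sharing the open file
-- description are not covered.)
preadLocked :: LockedHandle -> Integer -> Int -> IO [Word8]
preadLocked (LockedHandle lock h) offset count =
  withMVar lock $ \_ ->
    allocaBytes count $ \buf -> do
      hSeek h AbsoluteSeek offset
      n <- hGetBuf h buf count
      peekArray n buf
```

A real pread would avoid the seek entirely; this is the "atomic within our process" fallback.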
pread requires that the file is seekable, which means that it can't be used for all file handles: not for pipes, sockets, terminals nor various other devices.
The file interface in this library is only used for files, which are always seekable (by definition). If you want to treat a file as a stream, you create an InputStream or OutputStream backed by the file. Such streams maintain internal (per-stream) file pointers.
Unfortunately, they don't (at least in my prototype implementation). I assumed that dup()'ing file descriptors would be enough to produce separate file pointers, but no. So you can only safely make a single stream from a File. Making multiple streams would require re-opening the file for each subsequent one, or keeping a cached copy of the file pointer and seeking when necessary.

Cheers, Simon

Simon Marlow wrote:
I assumed that dup()'ing file descriptors would be enough to produce separate file pointers, but no.
Question (for qrczak or the group at large): is there *any* way to get, without an exploitable race condition, two filehandles to the same file which don't share a file pointer? Is there any way to pass a filehandle as stdin to an untrusted/uncooperative child process in such a way that the child can't interfere with your attempts to (say) append to the same file?
So you can only safely make a single stream from a File.
I think we just need more kinds of streams. With regard to file-backed streams, there are three cases:

1. We open a file and use it in-process.
2. We open a file and share it with child processes.
3. We get a handle at process startup which happens to be a file.

In case 1 there are no problems, and we should support multiple streams on such files.

In case 2 we could avoid OS problems by creating a pipe and managing our end in-process. This would allow attaching child processes to arbitrary streams (e.g. one with a gzip filter on it, if we ever implement such a thing). In certain cases it might be possible to rely on OS support, but it seems fragile (if we create two child processes tied to two streams on the same file).

Case 3 is the most interesting. In an ideal world I would argue for treating stdin/out/err simply as streams, but that's not practical. Failing that, if we have pread and pwrite, we should provide two versions of stdin/out/err, one of type InputStream/OutputStream and the other of type Maybe File. We can safely layer other streams on top of these files (if they exist) without interfering with the stream operation. The only thing we can't do with this interface is screw up the parent process by seeking the inherited handles. Can anyone come up with a case for allowing that in the high-level library? It can always be done through System.Posix.

If we don't have pread and pwrite, we're screwed, but so is every other application on this badly broken OS. If we punt on the interference problem, we can implement a pread and pwrite that are atomic within our process, and go from there. We're no worse off than anyone else here. Unfortunately, Win9x lacks pread and pwrite. But anyone running Win9x is clearly willing to deal with much worse problems than this.
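The "Maybe File" view of an inherited handle could look something like this, using hIsSeekable to decide whether the handle really is a file. This is a sketch: File, handleAsFile and stdinAsFile are hypothetical names, not part of any actual proposal's API.

```haskell
import System.IO (Handle, IOMode(ReadMode), hIsSeekable, openFile, stdin)

-- Hypothetical wrapper: a File is a handle known to be backed by a
-- seekable file.
newtype File = File Handle

-- Classify an inherited handle: Just a File when it is seekable,
-- Nothing for pipes, sockets and terminals.
handleAsFile :: Handle -> IO (Maybe File)
handleAsFile h = do
  seekable <- hIsSeekable h
  return (if seekable then Just (File h) else Nothing)

-- The "Maybe File" view of stdin suggested above.
stdinAsFile :: IO (Maybe File)
stdinAsFile = handleAsFile stdin
```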
Making multiple streams would require re-opening the file for each subsequent one,
Windows Server 2003 has ReOpenFile, but no prior version of Win32 can do this, as far as I know. I don't know the *ix situation. With ReOpenFile we could implement a lot more of my original proposal, including a File type that really denoted a file (instead of a file access path).

-- Ben

Ben Rudiak-Gould
is there *any* way to get, without an exploitable race condition, two filehandles to the same file which don't share a file pointer?
AFAIK it's not possible if the only thing you know is one of the descriptors. Of course independent open() calls which refer to the same file have separate file pointers (I mean the true filename, not /proc/*/fd/*).

On Linux the current file position is stored in struct file in the kernel. struct file includes "void *private_data" whose internals depend on the nature of the file, in particular they can be reference counted. Among polymorphic operations on files in struct file_operations there is nothing which clones the struct file. This means that a device driver would have no means to specify how private_data of its files should be duplicated (e.g. by bumping the reference count). If my understanding is correct, it implies that the kernel has no way to clone an arbitrary struct file.

Just don't use the current position of seekable files if you don't like it: use pread/pwrite.
Is there any way to pass a filehandle as stdin to an untrusted/ uncooperative child process in such a way that the child can't interfere with your attempts to (say) append to the same file?
You can set O_APPEND flag to force each write to happen at the end of file. It doesn't prevent the process from clearing the flag. If it's untrusted, how do you know that it won't truncate the file or just write garbage to it where you would have written something? If the file is seekable, you can use pread/pwrite. If it's not seekable, the concept of concurrent but non-interfering reads or writes is meaningless.
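On the Haskell side, AppendMode already gives the O_APPEND behaviour on POSIX systems: the kernel positions every write at the current end of file, so concurrent appenders (e.g. parent and child) cannot overwrite each other. A sketch; appendLine is a made-up helper, and as noted above nothing stops an uncooperative process from clearing the flag on its copy of the descriptor.

```haskell
import System.IO

-- AppendMode opens the file with O_APPEND on POSIX systems, so each
-- write is atomically positioned at end-of-file by the kernel.
-- appendLine is a made-up helper for illustration.
appendLine :: FilePath -> String -> IO ()
appendLine path msg = withFile path AppendMode (\h -> hPutStrLn h msg)
```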
I think we just need more kinds of streams. With regard to file-backed streams, there are three cases:
1. We open a file and use it in-process. 2. We open a file and share it with child processes. 3. We get a handle at process startup which happens to be a file.
I disagree. IMHO the only distinction is whether we want to perform I/O at the current position (shared between processes) or explicitly specified position (possible only in case of seekable files). Neither can be emulated in terms of the other.
In case 2 we could avoid OS problems by creating a pipe and managing our end in-process.
It's not transparent: it translates only read and write, but not sendto/recvfrom, setsockopt, ioctl, lseek etc., and obviously it will stop working when our process finishes but the other does not. A pipe can be created when the program really wants this, but it should not be created automatically whenever we redirect stdin/stdout/stderr of another program to a file we have opened.
Case 3 is the most interesting. In an ideal world I would argue for treating stdin/out/err simply as streams, but that's not practical. Failing that, if we have pread and pwrite, we should provide two versions of stdin/out/err, one of type InputStream/OutputStream and the other of type Maybe File. We can safely layer other streams on top of these files (if they exist) without interfering with the stream operation.
I'm not sure what you mean. Haskell should not use pread/pwrite for functions like putStr, even if stdout is seekable. The current file position *should* be shared between processes by default, otherwise redirection of stdout to a file will break if the program delegates some work with corresponding output to other programs it runs.
Indeed, file positions are exactly as evil as indices into shared memory arrays, which is to say not evil at all. But suppose each shared memory array came with a shared "current index", and there was no way to create additional ones.
Bad analogy: if you open() the file independently, the position is not shared. The position is not tied to a file with its shared contents but to the given *open* file structure. And there is pread/pwrite (on some OSes at least). It's not suitable as the basic API of all reads and writes though.

-- Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> http://qrnik.knm.org.pl/~qrczak/

Ben Rudiak-Gould
writes: is there *any* way to get, without an exploitable race condition, two filehandles to the same file which don't share a file pointer?
In unix the traditional way to do this is to use a directory. Each process/thread opens its own file... and you have some kind of master index/ordering file to keep track of which file is doing what (for example, highly parallel mail software).

At the end of the day IO is serial by nature (to one device anyway), so the way to do this into one file is to have one thread that reads and writes, and to 'send' read and write requests over channels from the threads that need the work done... Effectively the channels serialise the requests. This has the added advantage that it guarantees the transactional integrity of the IO (for example, database software).

Keean.

Keean Schupke
At the end of the day IO is serial by nature (to one device anyway), so the way to do this into one file is to have one thread that reads and writes, and to 'send' read and write requests over channels from the threads that need the work done
Would the stream proposal make this possible and easy? I.e. could the IO thread provide (say) output streams to the other threads, and pass writes on to its own output stream?

-kzm
-- If I haven't seen further, it is by standing in the footprints of giants

No, I meant Channels (from Data.Concurrent)... you can use a structure like:

    data Command = Read FileAddr (MVar MyData)
                 | Write FileAddr MyData

So to write you just do:

    writeChan iochan (Write address myData)  -- returns immediately;
                                             -- the write happens asynchronously later

and to read:

    result <- newEmptyMVar
    writeChan iochan (Read address result)   -- read not performed yet
    myData <- readMVar result                -- blocks until the read completes

(Note `data` is a Haskell keyword, so the MVar and its contents need other names.) The forked thread (with forkIO) just reads the commands from the "iochan" and processes them one at a time.

Keean

Ketil Malde wrote:
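Keean's sketch, filled out into something self-contained: the IORef-backed Map stands in for real file IO, the Maybe in the reply signals a missing address, and demo is a made-up entry point. All of these are assumptions for illustration, not Keean's actual code.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever)
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')
import qualified Data.Map as Map

-- Assumed stand-ins: addresses are Ints, data items are Strings, and an
-- IORef-backed Map plays the role of the real file.
type FileAddr = Int
type MyData   = String

data Command
  = Read  FileAddr (MVar (Maybe MyData))  -- the reply goes into the MVar
  | Write FileAddr MyData

-- The single IO thread: it reads commands from the channel and performs
-- them one at a time, so the channel serialises all requests.
ioServer :: IORef (Map.Map FileAddr MyData) -> Chan Command -> IO ()
ioServer store chan = forever $ do
  cmd <- readChan chan
  case cmd of
    Write addr d  -> modifyIORef' store (Map.insert addr d)
    Read addr out -> readIORef store >>= putMVar out . Map.lookup addr

-- Made-up entry point showing the request/reply pattern.
demo :: IO (Maybe MyData)
demo = do
  store  <- newIORef Map.empty
  iochan <- newChan
  _ <- forkIO (ioServer store iochan)
  writeChan iochan (Write 1 "hello")  -- returns immediately
  out <- newEmptyMVar
  writeChan iochan (Read 1 out)       -- read not performed yet
  takeMVar out                        -- blocks until the server replies
```

Because the channel is a FIFO, the Read is guaranteed to be processed after the Write, so demo returns Just "hello".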
Keean Schupke
writes: At the end of the day IO is serial by nature (to one device anyway), so the way to do this into one file is to have one thread that reads and writes, and to 'send' read and write requests over channels from the threads that need the work done
Would the stream proposal make this possible and easy? I.e. could the IO thread provide (say) output streams to the other threads, and pass writes on to its own output stream?
-kzm

Keean Schupke
No I meant Channels (from Data.Concurrent)... you can use a structure like:
Yes, I realize that (although I haven't yet used Data.Concurrent). It seemed to me, though, that streams are related to channels, and that it may be possible to use the same (or a slightly more generalized) abstraction? (I should perhaps experiment a bit with concurrent programming and streams, and it'll surely become apparent how and why I'm mistaken :-)

-kzm
-- If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde wrote:
Keean Schupke
writes: No I meant Channels (from Data.Concurrent)... you can use a structure like:
Yes, I realize that (although I haven't yet used Data.Concurrent). It seemed to me, though, that streams are related to channels, and that it may be possible to use the same (or a slightly more generalized) abstraction? (I should perhaps experiment a bit with concurrent programming and streams, and it'll surely become apparent how and why I'm mistaken :-)
-kzm
I don't necessarily think you are mistaken, but why re-invent the wheel when channels are almost ideal for the job (inter-thread FIFOs)... At the end of the day streams between processes are channels... in effect (non-seekable) streams are extending channels to IO.

Keean.

Ketil Malde
It seemed to me, though, that streams are related to channels,
I'm not sure what exactly you mean by streams (because they are only being designed), but the differences are:

- A stream is either an input stream or an output stream, while a single channel supports reading from one end and writing to the other end.
- A stream passes around bytes, which are usually grouped in blocks for efficiency. A channel is polymorphic wrt. the element type and elements are always processed one by one.
- A stream may be backed by an OS file, pipe, socket etc., while a channel exists purely in Haskell.
- A channel is never closed. Reading more data than have been put blocks until someone puts more data. A stream can reach its end, which is a condition a reader can detect. A stream backed by a pipe is similar to a channel of bytes in that the reader blocks until someone puts more data, but it can be closed too, which causes the reader to observe end of file. A writer to a stream can block too when the internal buffer in the kernel is full.
- A stream can be inherited by child processes, and it generally continues to work by being linked to the same data sink or source as before. A channel is inherited as a whole: there is no communication between the two versions of the channel in the two processes.

-- Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> http://qrnik.knm.org.pl/~qrczak/
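The "a channel is never closed" difference is often worked around in practice by sending Maybe values on the channel, with Nothing playing the role of end-of-file. This is an idiom, not a library API; writeEnd and readToEOF are made-up names.

```haskell
import Control.Concurrent.Chan

-- An idiom, not a library API: simulate a stream's end-of-file on a
-- channel by wrapping elements in Maybe, with Nothing marking EOF.
writeEnd :: Chan (Maybe a) -> IO ()
writeEnd ch = writeChan ch Nothing

-- Drain the channel until EOF, like reading a stream to its end.
readToEOF :: Chan (Maybe a) -> IO [a]
readToEOF ch = do
  mx <- readChan ch
  case mx of
    Nothing -> return []
    Just x  -> (x :) <$> readToEOF ch
```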

Marcin 'Qrczak' Kowalczyk
Ketil Malde
writes:
It seemed to me, though, that streams are related to channels,
I'm not sure what exactly you mean by streams (because they are only being designed), but the differences are:
Sorry for being unclear, I was thinking in relation to the new-io proposal Simon M. recently posted on the lists (I put all Haskell mail in the same folder, it could have been a ghc list or haskell@).
- A stream passes around bytes, which are usually grouped in blocks for efficiency. A channel is polymorphic wrt. the element type and elements are always processed one by one.
Perhaps I'm confused, but while Stream.StreamInputStream is a stream of Word8, Text.TextInputStream provides a stream of Chars. Thanks for the explanation!

-kzm
-- If I haven't seen further, it is by standing in the footprints of giants
participants (5)
-
Ben Rudiak-Gould
-
Keean Schupke
-
Ketil Malde
-
Marcin 'Qrczak' Kowalczyk
-
Simon Marlow