RE: [Haskell-cafe] Re: Hugs vs GHC (again) was: Re: Some random newbie questions

On 11 January 2005 14:15, Gracjan Polak wrote:
Simon Marlow wrote:
There's a big lock on File. If you want to do truly concurrent reading, you can make multiple FileInputStreams, each of which has its own file descriptor (the Unix implementation uses dup(2)).
Original and descriptor returned by dup or dup2 share file pointer.
*blink*

You're right! Serves me right for assuming that POSIX would have sensible semantics. Perhaps this API isn't implementable, in its current state.

Cheers,
Simon
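To make the shared file pointer concrete, here is a minimal sketch using the unix package (the exact openFd/fdRead signatures have varied across unix versions, so treat this as illustrative rather than definitive):

    import System.IO (SeekMode(AbsoluteSeek))
    import System.Posix.IO

    main :: IO ()
    main = do
      writeFile "/tmp/dup-demo" "abcdef"
      fd1 <- openFd "/tmp/dup-demo" ReadOnly Nothing defaultFileFlags
      (a, _) <- fdRead fd1 3           -- reads "abc"; shared offset is now 3
      fd2 <- dup fd1                   -- dup(2): a new descriptor, same offset
      (b, _) <- fdRead fd2 3           -- continues at offset 3: reads "def"
      _ <- fdSeek fd1 AbsoluteSeek 0   -- seeking on fd1 ...
      (c, _) <- fdRead fd2 3           -- ... moves fd2 too: reads "abc" again
      putStrLn (a ++ b ++ c)
      closeFd fd1 >> closeFd fd2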

On 2005-01-11, Simon Marlow wrote:
On 11 January 2005 14:15, Gracjan Polak wrote:
Simon Marlow wrote:
There's a big lock on File. If you want to do truly concurrent reading, you can make multiple FileInputStreams, each of which has its own file descriptor (the Unix implementation uses dup(2)).
Original and descriptor returned by dup or dup2 share file pointer.
*blink*
You're right! Serves me right for assuming that POSIX would have sensible semantics. Perhaps this API isn't implementable, in its current state.
Others have pointed out pread() and pwrite(). Perhaps we can stick this function in an extension package. (Though it is required for Unix 98 conformance, so anything reasonable will have it. Hmm. Does open("/dev/fd/n") or open("/proc/self/fd/n") act as dup() or as a fresh open() of the underlying file?)

I actually don't see the problem with interacting with other processes that we've forked. Other processes using our image and runtime can arrange to use something sharable for locking. Other processes not doing this can change the bytes in the file, so that we're not guaranteed to get what we expect anyway, unless we have full control over that process. In which case we can have it do whatever we want.

-- Aaron Denney -><-

Aaron Denney writes:
Does open("/dev/fd/n") or ("/proc/self/fd/n") act as dup() or a fresh open() of the underlying file?
As a dup(), with a side effect of resetting the file pointer to the beginning. It would not help anyway: if it's a terminal or pipe, it *has* to act as a dup() (in this case the file pointer is obviously not reset), it's not seekable, and thus pread/pwrite fail.

"wc -c" tries to measure the size in bytes by seeking, and proceeds to actually read the contents only if the file is not seekable (or if it's also told to count words or lines, obviously). This applies to stdin too. In this case it must remember to subtract the starting position from the size (and to bump the result to 0 if it would be negative), in order to give consistent results whether it manages to skip reading the contents or not.
I actually don't see the problem with interacting with other processes that we've forked.
File positions are not evil. They allow files and devices to be treated in a uniform way. If you run a program which

- writes some opening words to stdout,
- forks a subprocess, which executes another program, which writes something to stdout,
- (the parent) waits for the subprocess to finish,
- writes closing words to stdout,

then all output will appear in order: the opening words, then the part from the other program, and finally the closing words. This applies *also* when stdout has been redirected to a file. In this case it's essential that the file position is shared between all processes! So even if stdout is a regular file, output should be done using write(), not pwrite().

If processes don't synchronize their writes and write at the same time, then output will be intermixed. They can choose to synchronize themselves; in particular one process can avoid writing anything while the other process is running. So it's not true that allowing them to share a file position will necessarily be unsafe and will risk mangled data. Unix doesn't guarantee that programs will behave well, but it allows well-behaving programs to cooperate.

-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/
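The scenario above as a runnable sketch (assuming the unix package; note that forkProcess has well-known caveats under GHC's threaded runtime):

    import System.IO (hFlush, stdout)
    import System.Posix.Process (forkProcess, getProcessStatus)

    main :: IO ()
    main = do
      putStr "opening words\n" >> hFlush stdout
      -- the child inherits stdout, including its shared file position
      pid <- forkProcess (putStr "from the child\n" >> hFlush stdout)
      _ <- getProcessStatus True False pid   -- wait for the subprocess
      putStr "closing words\n" >> hFlush stdout

Redirect this program's output to a file and the three lines still appear in order, precisely because the write position is shared.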

Marcin 'Qrczak' Kowalczyk wrote:
File positions are not evil. They allow to treat files and devices in a uniform way.
Indeed, file positions are exactly as evil as indices into shared memory arrays, which is to say not evil at all.

But suppose each shared memory array came with a shared "current index", and there was no way to create additional ones. Suppose you couldn't index the array by a local variable: instead, you had to store the local variable into the shared index register first, overwriting whatever was there before. If you only wanted to use the array as a source or sink for a single stream, that would be fine. In every other case, it would be awful. Even read-only sharing would require the invention of some sort of cooperative locking discipline, and if some process didn't respect the locking and couldn't be changed, read-only sharing would become impossible. That's just silly.

The way to solve this problem is to decouple the index from the shared memory array. You can easily simulate the single-index behavior if that's what you want, but you also get a lot of additional functionality.

-- Ben
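Ben's analogy, decoupled, in code: a sketch (all names are illustrative) where each reader of a shared array owns its own cursor, so no shared index register is needed:

    import Data.Array (Array, bounds, (!))
    import Data.IORef

    -- Each call creates an independent cursor over the same shared array.
    newReader :: Array Int Char -> IO (IO (Maybe Char))
    newReader arr = do
      ix <- newIORef (fst (bounds arr))
      return $ do
        i <- readIORef ix
        if i > snd (bounds arr)
          then return Nothing
          else do writeIORef ix (i + 1)
                  return (Just (arr ! i))

Two readers obtained from newReader over the same array advance independently; the single-index behavior can be recovered by simply sharing one reader.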

On Wed, Jan 12, 2005 at 12:21:25AM +0000, Aaron Denney wrote:
On 2005-01-11, Simon Marlow wrote:
On 11 January 2005 14:15, Gracjan Polak wrote:
Simon Marlow wrote:
There's a big lock on File. If you want to do truly concurrent reading, you can make multiple FileInputStreams, each of which has its own file descriptor (the Unix implementation uses dup(2)).
Original and descriptor returned by dup or dup2 share file pointer.
*blink*
You're right! Serves me right for assuming that POSIX would have sensible semantics. Perhaps this API isn't implementable, in its current state.
Others have pointed out pread() and pwrite(). Perhaps we can stick this function in an extension package. (Though it is required for Unix 98 conformance, so anything reasonable will have it. Hmm. Does open("/dev/fd/n") or open("/proc/self/fd/n") act as dup() or as a fresh open() of the underlying file?)
Actually, if I were writing new Haskell libraries, I would use mmap whenever I could for accessing files. Not only does it make the file pointer problem go away, but it can be drastically more efficient. Of course, this can only be done on a limited type of file on some architectures, so it should be an optimization under the hood rather than an exposed interface.

John
-- John Meacham - ⑆repetae.net⑆john⑈
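For concreteness, here is what mmap-backed reading can look like from Haskell, assuming the third-party mmap package (System.IO.MMap), which is not part of the standard libraries:

    import qualified Data.ByteString as B
    import System.IO.MMap (mmapFileByteString)

    main :: IO ()
    main = do
      -- map the whole file read-only; no file pointer is involved
      bs <- mmapFileByteString "input.dat" Nothing
      print (B.length bs)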

I'm jumping into this discussion half way through, so my apologies if I mention something that's already been talked about!

John Meacham wrote:
of course, [mmap] can only be done on a limited type of file on some architectures, so it should be an optimization under the hood rather than an exposed interface.
In particular, you have to be careful not to run out of address space on 32-bit architectures. If you try to do file access by mapping the whole file into virtual memory, you won't be able to handle files larger than 2G or so. You also have to be careful not to map a number of file chunks, which in total exceed the address space available.

Because of these constraints, building a general purpose I/O system around mmap would be quite difficult, wouldn't it? You would have to deal with situations where, for example, a chunk of a file is mapped and the user writes off the end of the chunk.

Pete

Pete Chown <1@234.cx> writes:
of course, [mmap] can only be done on a limited type of file on some architectures, so it should be an optimization under the hood rather than an exposed interface.
In particular, you have to be careful not to run out of address space on 32-bit architectures. If you try to do file access by mapping the whole file into virtual memory, you won't be able to handle files larger than 2G or so. You also have to be careful not to map a number of file chunks, which in total exceed the address space available.
Which means that it *should* be explicit, rather than done under the hood. It's a tradeoff: the programmer should be aware of when to apply it, and the runtime should not impose a policy which would be hard to reverse.

-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/
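Making the tradeoff explicit might look like this (again assuming the third-party mmap package): the caller maps only a bounded window, so address-space consumption stays under the programmer's control:

    import Data.Int (Int64)
    import qualified Data.ByteString as B
    import System.IO.MMap (mmapFileByteString)

    -- Map only the requested (offset, length) window of the file.
    readWindow :: FilePath -> Int64 -> Int -> IO B.ByteString
    readWindow path off len = mmapFileByteString path (Just (off, len))

The caller decides how much address space to spend; nothing stops a stream layer from using such windows internally, as the "under the hood" optimization John described.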

On Sat, Jan 15, 2005 at 07:56:03PM +0100, Marcin 'Qrczak' Kowalczyk wrote:
Pete Chown <1@234.cx> writes:
of course, [mmap] can only be done on a limited type of file on some architectures, so it should be an optimization under the hood rather than an exposed interface.
In particular, you have to be careful not to run out of address space on 32-bit architectures. If you try to do file access by mapping the whole file into virtual memory, you won't be able to handle files larger than 2G or so. You also have to be careful not to map a number of file chunks, which in total exceed the address space available.
Which means that it *should* be explicit, rather than done under the hood. It's a tradeoff, the programmer should be aware when to apply it, the runtime should not impose a policy which would be hard to reverse.
I was thinking of it as a better implementation of a stream interface (when available). A user-visible mmap facility would also be good, but unrelated. In any case it is purely an implementation issue and not very relevant to the discussion (as long as we don't mandate anything that would rule out alternate implementations, which would be tough to do since we have to marshal between Haskell and something else anyway).

Some papers on the subject, for those that are interested:

Exploiting the advantages of mapped files for stream I/O
http://www.cs.toronto.edu/pub/reports/csrg/267/267.ps

IO-Lite: A Unified I/O Buffering and Caching System
http://www.cs.rice.edu/CS/Systems/ScalaServer/io-lite-tocs.pdf
(this one talks about an alternate OS interface to files which we might want to take advantage of if available)

John
-- John Meacham - ⑆repetae.net⑆john⑈

John Meacham writes:
I was thinking of it as a better implementation of a stream interface (when available).
I'm not convinced that the stream interface (http://www.haskell.org/~simonmar/io/System.IO.html) works at all, i.e. whether it's complete, implementable and convenient.

Convenience. I'm worried that it uses separate types for various kinds of streams: files, pipes, arrays (private memory), and sockets. Haskell is statically typed and lacks subsumption. This means that even though streams are unified by using a class, code which uses a stream of an unknown kind must be either polymorphic or use existential quantification.

Completeness. Unless File{Input,Output}Stream uses {read,write}() rather than file{Read,Write}, openFile provides only a subset of the functionality of open(): it works only with seekable files, e.g. not with "/dev/tty".

What is the type of stdin/stdout? They may be devices or pipes (not seekable), regular files (seekable), sockets... Note that even when they are regular files, emulating stream I/O in terms of either pread/pwrite or mmap does not yield the correct semantics of sharing the file pointer between processes. If we have a shell script which runs Haskell programs which write to stdout, it should be possible to redirect the output of the script as a whole.
Exploiting the advantages of mapped files for stream I/O http://www.cs.toronto.edu/pub/reports/csrg/267/267.ps
The advantage of reducing copying between buffers is lost in Haskell: file{Read,Write} use a buffer provided by the caller instead of giving a buffer for the caller to examine or fill. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk wrote:
Convenience. I'm worried that it uses separate types for various kinds of streams: files, pipes, arrays (private memory), and sockets. Haskell is statically typed and lacks subsumption. This means that even though streams are unified by using a class, code which uses a stream of an unknown kind must be either polymorphic or use existential quantification.
Or uses specialise pragmas to provide concrete implementations for a polymorphic function.
Exploiting the advantages of mapped files for stream I/O http://www.cs.toronto.edu/pub/reports/csrg/267/267.ps
The advantage of reducing copying between buffers is lost in Haskell: file{Read,Write} use a buffer provided by the caller instead of giving a buffer for the caller to examine or fill.
Eh? Surely that just depends on the API. The BlockIO library does exactly this (it passes the blocks to a user-provided callback function).

Keean.

Marcin 'Qrczak' Kowalczyk wrote:
Convenience. I'm worried that it uses separate types for various kinds of streams: files, pipes, arrays (private memory), and sockets. Haskell is statically typed and lacks subsumption. This means that even though streams are unified by using a class, code which uses a stream of an unknown kind must be either polymorphic or use existential quantification.
Yes, this is a problem. In my original proposal InputStream and OutputStream were types, but I enthusiastically embraced Simon M's idea of turning them into classes. As you say, it's not without its disadvantages.

I see several possibilities here.

* We could adopt Avery Lee's suggestion (from the discussion in 2003) to use field labels instead of methods. Advantages: InputStream and OutputStream behave more like their OOP equivalents, with no loss of extensibility. Disadvantages: potentially less efficient (no specialization possible); loses some static type information.

* We could use a single type for all input and output streams in the standard library, but retain the type classes also.

* We could provide existential wrappers:

    data IStream = forall a. (InputStream a) => MkIStream !a
    instance InputStream IStream where ...

A nice thing about the last approach is that it supports dynamic downcasting:

    case (x :: IStream) of
      MkIStream x ->
        case (Data.Dynamic.cast x :: Maybe UArrayInputStream) of
          Just x  -> (getUArray x, getCurrentIndex x)
          Nothing -> ...
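A self-contained sketch of the existential-wrapper option (the stream class and its method here are simplified placeholders, not the proposed library API):

    {-# LANGUAGE ExistentialQuantification #-}
    import Data.IORef
    import Data.Typeable (Typeable, cast)

    class Typeable s => InputStream s where
      readByte :: s -> IO (Maybe Char)

    data IStream = forall s. InputStream s => MkIStream !s

    instance InputStream IStream where
      readByte (MkIStream s) = readByte s

    -- Controlled downcasting: recover the concrete stream if the guess is right.
    downcast :: InputStream s => IStream -> Maybe s
    downcast (MkIStream s) = cast s

    -- One concrete instance, to make the wrapper usable.
    newtype StringStream = StringStream (IORef String)

    instance InputStream StringStream where
      readByte (StringStream r) = do
        s <- readIORef r
        case s of
          []     -> return Nothing
          (c:cs) -> writeIORef r cs >> return (Just c)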
Completeness. Unless File{Input,Output}Stream uses {read,write}() rather than file{Read,Write}, openFile provides only a subset of the functionality of open(): it works only with seekable files, e.g. not with "/dev/tty".
What is the type of stdin/stdout? They may be devices or pipes (not seekable), regular files (seekable), sockets...
Simon M's current interface is incomplete, but the concept is fine.

Again, to try to avoid confusion: what you call a "seekable file" the library calls a "file", and what you call a "file" I would call a "Posix filehandle". Roughly. It's hard to be precise because "file" is such a heavily overloaded term. (For example, is "/dev/tty" a file? Is the (major,minor) device number it might correspond to on a particular filesystem at a particular moment a file? Is the integer that's returned from open("/dev/tty", ...) a file? Is the tty device itself a file? I think you've used "file" in all four senses.)

When I talk about a stream, I mean one end of a unidirectional pneumatic tube. If it's the ingoing end, you stick some data in the tube and it's carried away. If it's the outgoing end, you wait for some data to arrive and then take it. Tubes all look the same. No pneumatic tube is a storage device, but you may happen to know that it leads to a Frobozz Magic Storage Device at the other end.

By the same token, stdin is never a file, but the data which appears through stdin may ultimately be coming from a file, and it's sometimes useful, in that case, to bypass stdin and access the file directly. The way to handle this is to have a separate stdinFile :: Maybe File.

As for openFile: in the context of a certain filesystem at a certain time, a certain pathname may refer to

* Nothing
* A directory
* A file (in the library sense); this might include things like /dev/hda and /dev/kmem
* Both ends of a (named) pipe
* A data source and a data sink which are related in some qualitative way (for example, keyboard and screen, or stdin and stdout)
* A data source only
* A data sink only
* ...

How to provide an interface to this zoo? The dynamic-typing approach is to return some sort of Thing with a complicated interface which is approximately the union of the interfaces for each thing in the above list. Unsupported methods fail when called. This is roughly what Posix does, except that directories are a special case, and Nothing is very special (as perhaps it should be, but I'm not sure).

The Haskell approach is, I guess, to use an algebraic datatype, e.g.

    data FilesystemObject
      = Directory Directory
      | File File
      | InputOutput PosixInputStream PosixOutputStream
      | Input PosixInputStream
      | Output PosixOutputStream

Here I'm using "Posix*Stream" for all streams backed by Posix filehandles. I'm unsure whether NoSuchPath should be in there too.

You might say that this is annoyingly complicated. My first reaction is "tough--it's exactly as complicated as the reality it models". But there should presumably be helper functions of types FilesystemObject->IStream and FilesystemObject->OStream.

The other complication is that Posix makes you specify access rights when you look up a path in the filesystem. This makes no sense, but it's something we have to live with. So I'd argue for replacing openFile with something like

    data FilesystemObject = ...

    openPath :: FilePath -> IOMode -> IO FilesystemObject

    filesystemInputStream :: FilesystemObject -> (IO?) IStream

    data OutputMode = Append | Overstrike | Replace

    filesystemOutputStream :: FilesystemObject -> OutputMode -> (IO?) OStream
Note that even when they are regular files, emulating stream I/O in terms of either pread/pwrite or mmap does not yield the correct semantics of sharing the file pointer between processes.
You're right, and the solution is to have two kinds of file I/O streams, one based on File (File*Stream) and one based on the Posix file pointer (Posix*Stream). -- Ben

On Mon, 2005-01-17 at 16:27 -0800, Ben Rudiak-Gould wrote:
Marcin 'Qrczak' Kowalczyk wrote:
Convenience. I'm worried that it uses separate types for various kinds of streams: files, pipes, arrays (private memory), and sockets. Haskell is statically typed and lacks subsumption. This means that even though streams are unified by using a class, code which uses a stream of an unknown kind must be either polymorphic or use existential quantification.
Yes, this is a problem. In my original proposal InputStream and OutputStream were types, but I enthusiastically embraced Simon M's idea of turning them into classes. As you say, it's not without its disadvantages.
I see several possibilities here.
* We could adopt Avery Lee's suggestion (from the discussion in 2003) to use field labels instead of methods. Advantages: InputStream and OutputStream behave more like their OOP equivalents, with no loss of extensibility. Disadvantages: potentially less efficient (no specialization possible); loses some static type information.
I've often thought it would be nice to have a class and its most general instance: a record with the same fields as the class has methods. It would be even better if they could share the same name, e.g.:

    class IStream s where
      read :: s -> ...

    data IStream = IStream { read :: ... }

    instance IStream IStream where
      read s = read s  -- the field selector, not the class method

Obviously each instance of the IStream class can be converted to an IStream record (losing type information), which is useful for heterogeneous collections of streams and other "interface programming" techniques.

This technique is perhaps a middle ground: it's a tad more complex than just having a single type for streams, but it allows code which does not want to know to use a single type, while allowing for static typing in other cases where it is desired for safety, or for better performance by specialising.

A downside (apart from naming issues) is that while there is an automatic conversion IStream data type -> IStream class instance, there is no automatic conversion the other way round. Compare this with Java interfaces, for example: a Java IStream interface is like our IStream data type, but there is automatic conversion from the types implementing the interface to the interface type itself.

In Haskell we normally go for the more strongly typed interfaces (Haskell classes) rather than the more dynamic interfaces (record of functions), so the language supports the former more naturally than the latter (e.g. automatic 'conversion' when accessing an object through a class interface).

Duncan
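A sketch of this middle ground (the names differ from the pseudo-code above because a class and a data type cannot actually share a name in today's Haskell):

    -- The strongly typed interface.
    class Stream s where
      sRead :: s -> IO (Maybe Char)

    -- Its most general "instance": a record of functions.
    data StreamRec = StreamRec { recRead :: IO (Maybe Char) }

    instance Stream StreamRec where
      sRead = recRead   -- the field selector, not the class method

    -- The one-way conversion: any instance becomes a record, losing
    -- type information but enabling heterogeneous collections.
    toRec :: Stream s => s -> StreamRec
    toRec s = StreamRec { recRead = sRead s }

A heterogeneous collection is then just [StreamRec], built by applying toRec to each differently-typed stream.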

Ben Rudiak-Gould writes:
Yes, this is a problem. In my original proposal InputStream and OutputStream were types, but I enthusiastically embraced Simon M's idea of turning them into classes. As you say, it's not without its disadvantages.
This is my greatest single complaint about Haskell: it doesn't support embedding either OO-style abstract supertypes, or dynamic typing with the ability to use polymorphic operations on objects whose exact type we don't know. The Dynamic type doesn't count for the latter because you must guess the concrete type before using the object. You can't say "it should be something implementing class Foo, I don't care what, and I only want to use Foo's methods with it".

Haskell provides only:

- algebraic types (must specify all "subtypes" in one place),
- classes (require foralls, which limits applicability: no heterogeneous lists, and I guess no implicit parameters),
- classes wrapped in existentials, or records of functions (these two approaches don't support controlled downcasting, i.e. "if this is a regular file, do something, otherwise do something else").

The problem manifests itself more when we add more kinds of streams: transparent compression/decompression, character recoding, newline conversion, buffering, a userspace /dev/null, concatenation of several input streams, making a copy of data as it's passed, automatic flushing of a related output stream when an input stream is read, etc.

A case similar to streams which would benefit from this is a DB interface. Should it use separate types for separate backends? Then it's awkward to write code which works with multiple backends. Should it use a record of functions? Then we must decide at the beginning on the complete set of supported operations, and if one backend provides something that another doesn't, it's impossible to write code which requires the first backend and uses that capability (unless we decide at the beginning about all possible extensions and make stubs which throw exceptions where something is not supported).

I would like to mix these two approaches: if some code uses only operations supported by all backends, then it's fully polymorphic, and when it starts using specific operations, it becomes limited. Without two completely different designs for these cases. I don't know how to fit it into Haskell's type system. This has led me to exploring dynamic typing.
Again, to try to avoid confusion, what you call a "seekable file" the library calls a "file", and what you call a "file" I would call a "Posix filehandle".
So the incompleteness problem can be rephrased: the interface doesn't provide the functionality of open(), which returns an arbitrary POSIX filehandle.
By the same token, stdin is never a file, but the data which appears through stdin may ultimately be coming from a file, and it's sometimes useful, in that case, to bypass stdin and access the file directly. The way to handle this is to have a separate stdinFile :: Maybe File.
And a third stdin, as POSIX filehandle, to be used e.g. for I/O redirection for a process.
As for openFile: in the context of a certain filesystem at a certain time, a certain pathname may refer to
* Nothing
* A directory
* A file (in the library sense); this might include things like /dev/hda and /dev/kmem
* Both ends of a (named) pipe
* A data source and a data sink which are related in some qualitative way (for example, keyboard and screen, or stdin and stdout)
* A data source only
* A data sink only
* ...
How to provide an interface to this zoo?
In such cases I tend to just expose the OS interface, without trying to be smart. This way I can be sure I don't make anything worse than it already is. Yes, it probably makes portability harder.

Suitability of this approach depends on our goals: either we want to provide a nice and portable abstraction over the basic functionality of all systems, or we want to make everything implementable in C also implementable in Haskell, including a Unix shell. Perhaps Haskell is in the first group. Maybe its goal is to invent an ideal interface to the computer's world, even if this means doing things differently than everyone else. It's hard to predict beforehand how far in being different we can go without alienating users.

For my language I'm trying to do the second thing. I currently concentrate on Unix because there are enough Windows-inspired interfaces in .NET, while only Perl and Python seem to care about providing rich access to the Unix API from a language other than C. I try to separate interfaces which should be portable from interfaces to Unix-specific things. Unfortunately I have never programmed for Windows, and I can make mistakes about which things are common to various systems and which are not. Time will tell and will fix this.

Obviously I'm not copying the Unix interface literally. A file is distinguished from an integer, and an integer is distinguished from a Unix signal, even though my language is dynamically typed. But when Unix makes some objects interchangeable, I retain this. So I'm using a single RAW_FILE type, which wraps an arbitrary POSIX file handle.

Then there are various kinds of streams: mostly wrappers for other streams which perform transparent conversion, buffering etc. RAW_FILE is a stream itself. I distinguish input streams from output streams (RAW_FILE is both) and byte streams from character streams. Functions {Text,Binary}{Reader,Writer} set up a default stack of stream converters, with default or provided parameters like the encoding. The last part in all of them is a buffering layer; only buffered streams provide reading by line. Input buffers provide unlimited lookahead and putback; output buffers provide automatic flushing (after every operation or after full lines).

One controversial design decision is that read/write operations on blocks *move* data between the stream and the buffer (a flexible array). That is, WriteBlock appends at the end, while ReadBlock cuts from the beginning. It makes passing data between streams less error-prone, at the cost of unnecessary copying of memory if we want to retain in memory what we have just written.
The Haskell approach is, I guess, to use an algebraic datatype, e.g.
    data FilesystemObject
      = Directory Directory
      | File File
      | InputOutput PosixInputStream PosixOutputStream
      | Input PosixInputStream
      | Output PosixOutputStream
If openFile were to try to infer what kind of object it just opened, I'm worried that the kinds are not disjoint and that determining this is often unnecessary. For example, if I intend only to read from a file sequentially, it doesn't matter whether it is really InputOutput (the path named a character device), Input (Haskell somehow determined that only reading will succeed), or a File on which an input stream can be wrapped. A Haskell runtime would try to determine what it is, even though it could just blindly use read()/write() and let the OS dispatch these calls to the appropriate implementation.

-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Have you read the OOHaskell paper?

http://homepages.cwi.nl/~ralf/OOHaskell/

This shows how to encode many OO idioms in Haskell, without any extensions (beyond those that GHC already supports)... Here's some sample code (from the Shapes.hs example) to give you a flavor of it.

A constructor function:

    rectangle x y width height self = do
       super <- shape x y self
       w <- newIORef width
       h <- newIORef height
       returnIO $
            getWidth  .=. readIORef w
        .*. getHeight .=. readIORef h
        .*. setWidth  .=. (\neww -> writeIORef w neww)
        .*. setHeight .=. (\newh -> writeIORef h newh)
        .*. draw .=. (do putStr "Drawing a Rectangle at:("
                           << self # getX << ls "," << self # getY
                           << ls "), width "  << self # getWidth
                           << ls ", height " << self # getHeight
                           << ls "\n")
        .*. super

And an example of some objects in use:

    myShapesOOP = do
       -- set up array of shapes
       s1 <- mfix (rectangle (10::Int) (20::Int) 5 6)
       s2 <- mfix (circle (15::Int) 25 8)
       let scribble :: [Shape Int]
           scribble = [narrow s1, narrow s2]
       -- iterate through the array
       -- and handle shapes polymorphically
       mapM_ (\shape -> do
                 shape # draw
                 (shape # rMoveTo) 100 100
                 shape # draw)
             scribble
       -- call a rectangle specific function
       arec <- mfix (rectangle (0::Int) (0::Int) 15 15)
       arec # setWidth $ 30
       arec # draw

Regards,
Keean.

Marcin 'Qrczak' Kowalczyk wrote:
Haskell provides only:
- algebraic types (must specify all "subtypes" in one place),
- classes (requires foralls which limits applicability: no heterogeneous lists, I guess no implicit parameters),
- classes wrapped in existentials, or records of functions (these two approaches don't support controlled downcasting, i.e. "if this is a regular file, do something, otherwise do something else").

On 20/01/2005, at 3:42 AM, Keean Schupke wrote:
Have you read the OOHaskell paper?
http://homepages.cwi.nl/~ralf/OOHaskell/
This shows how to encode many OO idioms in Haskell, without any extensions (beyond those that GHC already supports)... Here's some sample code (from the Shapes.hs example) to give you a flavor of it: [..]
Just because you can encode the OO idioms in Haskell doesn't mean it's particularly straightforward to use them. As your example shows, getting the syntax right for these OOish constructs isn't easy (not to mention verbose), and even so, the type errors you face when you get things wrong are, well, long :). I guess my point is that in theory, Haskell can support OO right now. In practice, it's something that isn't very tasty. -- % Andre Pang : trust.in.love.to.save http://www.algorithm.com.au/

Andre Pang wrote:
Just because you can encode the OO idioms in Haskell doesn't mean it's particularly straightforward to use them. As your example shows, getting the syntax right for these OOish constructs isn't easy (not to mention verbose), and even so, the type errors you face when you get things wrong are, well, long :).
This is true enough... but it really isn't as difficult as it looks. Once you get used to the style it is really quite easy - and notice how you don't need class definitions, or types for the objects - it is all derived by GHC. This is an advance over current OO languages.
I guess my point is that in theory, Haskell can support OO right now. In practice, it's something that isn't very tasty.
I find it no harder than writing with monads for example... certainly there are some tricky things going on in both... but that doesn't stop people using monads for IO, state etc.

Syntactic sugar over the top for instance and implementation definitions is something we are working on (using template-haskell) - so that end of things can certainly be made neater for the user.

The big problem I guess is error messages - and that would require some user defined way of throwing compile time errors.

Keean.

On 20/01/2005, at 11:06 PM, Keean Schupke wrote:
I find it no harder than writing with monads for example... certainly there are some tricky things going on in both... but that doesn't stop people using monads for IO, state etc.
Syntactic sugar over the top for instance and implementation definitions is something we are working on (using template-haskell) - so that end of things can certainly be made neater for the user.
The syntactic sugar is the killer. (Using monads is really no fun if you don't have do notation, for example. Doable: yes. Pretty: definitely not!) Even if you use Template Haskell to try to implement the syntactic sugar, you're very limited by the splice $(...) notation at the call site. I've always argued that Haskell really should have a full-blown macro system: it would really help with Haskell and EDSLs, and of course for integrating these kinds of libraries. TH is 90% of the way there, and with a bit more thought, those pesky splices could just magically disappear ... ;)
The big problem I guess is error messages - and that would require some user defined way of throwing compile time errors.
Yes, also agreed. I did some similar Haskell<->OO integration work, and the type errors which appeared when something went wrong are quite awesome. User-defined compile-time errors would be fantastic, but that would require quite a lot of effort. -- % Andre Pang : trust.in.love.to.save http://www.algorithm.com.au/

Andre Pang wrote:
The syntactic sugar is the killer. (Using monads is really no fun if you don't have do notation, for example. Doable: yes. Pretty: definitely not!) Even if you use Template Haskell to try to implement the syntactic sugar, you're very limited by the splice $(...) notation at the call site. I've always argued that Haskell really should have a full-blown macro system: it would really help with Haskell and EDSLs, and of course for integrating these kinds of libraries. TH is 90% of the way there, and with a bit more thought, those pesky splices could just magically disappear ... ;)
It's not that bad... the trick I am using is lifting existing Haskell syntax, so an interface definition looks like:

    $(interface [d|
        data MyInterface = MyInterface {
           method1 :: ...,
           method2 :: ...,
           method3 :: ... }
      |])

So we define a normal Haskell 98 record, and the TH lifts it to an interface definition using extensible records. An implementation would possibly look like:

    $(implementation [MyInterface] [d|
        method1 = ...
        method2 = ...
        method3 = ...
      |])
Yes, also agreed. I did some similar Haskell<->OO integration work, and the type errors which appeared when something went wrong are quite awesome. User-defined compile-time errors would be fantastic, but that would require quite a lot of effort.
We can do something better than what we have at the moment. For a start, TH can generate user-defined compile time errors - but we don't want to have to implement our own typechecking, so we can supplement this with a class with no instances and empty types:

    class Fail a

    data Some_user_defined_error

    instance Fail Some_user_defined_error => Test ...

So the compiler will report an undefined instance of Fail for your error type, but you can at least get some readable text, which is better than nothing.

Keean.
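A self-contained sketch of the trick (all names here are illustrative; FlexibleContexts and UndecidableInstances are assumed so GHC accepts the instance context):

    {-# LANGUAGE FlexibleContexts, UndecidableInstances #-}

    class Fail a                      -- deliberately has no instances

    data SeekingAPipeIsNotSupported   -- empty type, exists only to be named in errors

    class Seekable s where
      seek :: s -> Integer -> IO ()

    data Pipe = Pipe

    -- Any use of 'seek' at type Pipe forces GHC to look for the missing
    -- Fail instance, so the error message names the descriptive type.
    instance Fail SeekingAPipeIsNotSupported => Seekable Pipe where
      seek _ _ = error "unreachable"

Attempting seek Pipe 0 then fails at compile time with "No instance for (Fail SeekingAPipeIsNotSupported)".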

John Meacham wrote:
Actually, If I were writing new haskell libraries, I would use mmap whenever I could for accessing files. not only does it make the file pointer problem go away, but it can be drastically more efficient.
I'm not sure this is a good idea, because GHC really needs non-blocking I/O to support its thread model, and memory-mapped I/O always blocks. In fact this is a problem even if we only memory-map files at the programmer's request. -- Ben

Can't GHC do this using the threaded RTS?

Keean.
John Meacham wrote:
Actually, If I were writing new haskell libraries, I would use mmap whenever I could for accessing files. not only does it make the file pointer problem go away, but it can be drastically more efficient.
I'm not sure this is a good idea, because GHC really needs non-blocking I/O to support its thread model, and memory-mapped I/O always blocks. In fact this is a problem even if we only memory-map files at the programmer's request.

On Mon, 2005-01-17 at 13:44 -0800, Ben Rudiak-Gould wrote:
John Meacham wrote:
Actually, If I were writing new haskell libraries, I would use mmap whenever I could for accessing files. not only does it make the file pointer problem go away, but it can be drastically more efficient.
I'm not sure this is a good idea, because GHC really needs non-blocking I/O to support its thread model, and memory-mapped I/O always blocks. In fact this is a problem even if we only memory-map files at the programmer's request.
Indeed, a new IO system that could transparently take advantage of the system's native asynchronous IO features might be a good fit to GHC's thread model. Note that just using lots of system threads and blocking IO is usually not as efficient as single(/few) threaded multiplexed IO like GHC uses currently. Some OSs that do not have kernel native async IO implement the POSIX async IO API using OS threads and tend to get poor performance and few users. Duncan
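One way to get the effect without kernel async IO, sketched with plain forkIO and an MVar (under GHC this keeps other Haskell threads runnable, and with -threaded even blocking foreign calls can be shunted off this way):

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar
    import qualified Data.ByteString as B

    -- Start a read in the background; the caller keeps running.
    readFileAsync :: FilePath -> IO (MVar B.ByteString)
    readFileAsync path = do
      box <- newEmptyMVar
      _ <- forkIO (B.readFile path >>= putMVar box)
      return box

    main :: IO ()
    main = do
      box <- readFileAsync "big.dat"
      putStrLn "other work proceeds while the read is in flight"
      bs <- takeMVar box
      print (B.length bs)

Caveat, per Glynn's point elsewhere in this thread: on POSIX systems a disk read may still block the whole runtime unless the RTS hands it to a separate OS thread.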

Ben Rudiak-Gould wrote:
Actually, If I were writing new haskell libraries, I would use mmap whenever I could for accessing files. not only does it make the file pointer problem go away, but it can be drastically more efficient.
I'm not sure this is a good idea, because GHC really needs non-blocking I/O to support its thread model, and memory-mapped I/O always blocks.
If, by "blocks", you mean that execution will be suspended until the
data has been read from the device into the buffer cache, then Unix
non-blocking I/O (i.e. O_NONBLOCK) also blocks.
Calling read() on a descriptor which corresponds to a regular file
won't report EAGAIN, even if the file is on a particularly slow device
(floppy, CD-ROM, NFS etc).
Essentially, reading data from regular files is always deemed to occur
"soon", so the usual mechanisms for dealing with "slow" I/O (i.e.
pipes, FIFOs, character devices, sockets) don't work. This applies
equally to non-blocking I/O (O_NONBLOCK), asynchronous I/O (O_ASYNC),
select(), poll() etc.
In that regard, mmap()ed I/O is no better or worse than conventional
non-blocking I/O.
--
Glynn Clements

On Tue, 2005-01-18 at 22:52 +0000, Glynn Clements wrote:
Ben Rudiak-Gould wrote:
Essentially, reading data from regular files is always deemed to occur "soon", so the usual mechanisms for dealing with "slow" I/O (i.e. pipes, FIFOs, character devices, sockets) don't work. This applies equally to non-blocking I/O (O_NONBLOCK), asynchronous I/O (O_ASYNC), select(), poll() etc.
Yes, there are very few systems that do genuine async IO. I believe that Solaris and the Win NT kernel are supposed to be able to do this, though the interfaces to these features are apparently rather hard to use. I've heard that Linux is going to gain proper async IO for ordinary buffered (as opposed to direct IO) files some time in the not too distant future. Duncan

Glynn Clements wrote:
Ben Rudiak-Gould wrote:
GHC really needs non-blocking I/O to support its thread model, and memory-mapped I/O always blocks.
If, by "blocks", you mean that execution will be suspended until the data has been read from the device into the buffer cache, then Unix non-blocking I/O (i.e. O_NONBLOCK) also blocks.
Okay, my ignorance of Posix is showing again. Is it currently the case, then, that every GHC thread will stop running while a disk read is in progress in any thread? Is this true on all platforms?

There are two ways of reading from a file/stream in Win32 on NT. One is asynchronous: the call returns immediately and you receive a notification later that the read has completed. The other is synchronous but almost-nonblocking: it returns as much data as is "available", and the entire contents of a file is considered always available. But it always returns at least one byte, and may spend an arbitrary amount of time waiting for that first byte. You can avoid this by waiting for the handle to become signalled; if it's signalled then a subsequent ReadFile will not block indefinitely.

Win32's synchronous ReadFile is basically the same as Posix's (blocking) read. For some reason I thought that Win32's asynchronous ReadFile was similar to Posix's non-blocking read, but I gather from [1] that they're completely different.

(By the way, are the GHC folks aware that the description of Win32 I/O at [2] is wrong? It seems to assume that ReadFile doesn't return until the buffer is full.)

-- Ben

[1] http://www.opengroup.org/onlinepubs/007908799/xsh/read.html
[2] http://www.cse.unsw.edu.au/~chak/haskell/ghc/comm/rts-libs/non-blocking.html

Nonblocking (asynchronous) IO is only one solution. Another is to use blocking IO and multiple threads. Each thread represents an IO dependency; in other words, there is no point in a thread doing IO continuing, because it needs the data it is waiting for. Unassociated threads do not get blocked and carry on.

Threading has the advantage of using multiple CPUs if available - asynchronous IO does not do this. With the advent of multi-core CPUs (due to heat problems CPUs will not be getting much faster in terms of raw GHz), threading is obviously the way forward.

GHC already supports multiple threads (with the -threaded flag I think), where FFI actions are started in their own thread... Of course there are overheads with starting threads, so normally applications maintain a 'thread pool' of ready-to-use, pre-initialised threads... I don't know if GHC does this.

Keean.

Ben Rudiak-Gould wrote:
Glynn Clements wrote:
Ben Rudiak-Gould wrote:
GHC really needs non-blocking I/O to support its thread model, and memory-mapped I/O always blocks.
If, by "blocks", you mean that execution will be suspended until the data has been read from the device into the buffer cache, then Unix non-blocking I/O (i.e. O_NONBLOCK) also blocks.
Okay, my ignorance of Posix is showing again. Is it currently the case, then, that every GHC thread will stop running while a disk read is in progress in any thread? Is this true on all platforms?
There are two ways of reading from a file/stream in Win32 on NT. One is asynchronous: the call returns immediately and you receive a notification later that the read has completed. The other is synchronous but almost-nonblocking: it returns as much data as is "available", and the entire contents of a file is considered always available. But it always returns at least one byte, and may spend an arbitrary amount of time waiting for that first byte. You can avoid this by waiting for the handle to become signalled; if it's signalled then a subsequent ReadFile will not block indefinitely.
Win32's synchronous ReadFile is basically the same as Posix's (blocking) read. For some reason I thought that Win32's asynchronous ReadFile was similar to Posix's non-blocking read, but I gather from [1] that they're completely different.
(By the way, are the GHC folks aware that the description of Win32 I/O at [2] is wrong? It seems to assume that ReadFile doesn't return until the buffer is full.)
-- Ben
[1] http://www.opengroup.org/onlinepubs/007908799/xsh/read.html [2] http://www.cse.unsw.edu.au/~chak/haskell/ghc/comm/rts-libs/non-blocking.html

Ben Rudiak-Gould wrote:
GHC really needs non-blocking I/O to support its thread model, and memory-mapped I/O always blocks.
If, by "blocks", you mean that execution will be suspended until the data has been read from the device into the buffer cache, then Unix non-blocking I/O (i.e. O_NONBLOCK) also blocks.
Okay, my ignorance of Posix is showing again. Is it currently the case, then, that every GHC thread will stop running while a disk read is in progress in any thread?
The kernel thread which called read() will be blocked. If GHC threads are userspace threads running within a single kernel thread, then they will all block. If GHC uses multiple kernel threads, the other kernel threads will continue to run.
Is this true on all platforms?
Some platforms (but, AFAIK, not linux) allow asynchronous I/O on regular files. NT has overlapped I/O, which is essentially the same thing.
There are two ways of reading from a file/stream in Win32 on NT. One is asynchronous: the call returns immediately and you receive a notification later that the read has completed. The other is synchronous but almost-nonblocking: it returns as much data as is "available", and the entire contents of a file is considered always available. But it always returns at least one byte, and may spend an arbitrary amount of time waiting for that first byte. You can avoid this by waiting for the handle to become signalled; if it's signalled then a subsequent ReadFile will not block indefinitely.
Win32's synchronous ReadFile is basically the same as Posix's (blocking) read. For some reason I thought that Win32's asynchronous ReadFile was similar to Posix's non-blocking read, but I gather from [1] that they're completely different.
They're similar, but not identical. Traditionally, Unix non-blocking
I/O (along with asynchronous I/O, select() and poll()) were designed
for "slow" streams such as pipes, terminals, sockets etc. Regular
files and block devices are assumed to return the data "immediately".
Essentially, for slow streams, you have to wait for the data to arrive
before it can be read, so waiting may take an indefinite amount of
time. For "fast" streams, the data is always "available", you just
have to wait for the system call to give it to you.
IOW, the time taken to read from a block device is amortised into the
execution time of the system call, rather than being treated as a
delay.
Also, even with blocking I/O, slow streams only block if no data is
available. If less data is available than was requested, they will
usually return whatever is available rather than waiting until they
have the requested amount. Non-blocking I/O only affects the case
where no data is available.
--
Glynn Clements

Glynn Clements writes:
They're similar, but not identical. Traditionally, Unix non-blocking I/O (along with asynchronous I/O, select() and poll()) were designed for "slow" streams such as pipes, terminals, sockets etc. Regular files and block devices are assumed to return the data "immediately".
Indeed. Reading from a slow block device is also not interruptible by a signal; a signal usually causes reading from a pipe/socket/terminal to fail with EINTR.

There is no non-blocking interface to various functions like readdir, mkdir, stat etc. (OTOH close() is interruptible.) It seems that the only way to parallelize them is to use a separate OS thread.

gethostbyname, gethostbyaddr, getservbyname and getservbyport are mostly superseded by getaddrinfo and getnameinfo. They are all blocking and non-interruptible by signals (they restart their loops on receiving EINTR from low-level calls). Glibc provides getaddrinfo_a, which is non-blocking (implemented using pthreads). Contrary to its documentation it's not interruptible by a signal (its implementation expects pthread_cond_wait to fail with EINTR, which is not possible), and it's not cancellable in a useful way (the interface allows for cancellation, which may nevertheless answer that it cannot be cancelled, and the glibc implementation is able to cancel a request only if it hasn't yet started being processed by the thread pool). There is no non-blocking counterpart of getnameinfo.

Since asynchronous name resolution is quite important, the implementation of my language uses pthreads and getaddrinfo / getnameinfo, if pthreads are available. For simplicity I just make one thread per request.

A tricky API to parallelize is waitpid. Pthreads are supposed to be able to wait for child processes started by any thread, but according to man pages this was broken in Linux before version 2.4. Fortunately it's easy to avoid blocking other threads indefinitely without OS threads, if we agree to waste CPU time (not CPU cycles), such that a thread waiting for a process takes as much time as if it was doing some useful work: waitpid *is* interruptible by signals, so it will either finish, or the timer signal will interrupt it and control can be passed to other threads. Leaving the timer signal interrupting syscalls can break libraries which don't expect EINTR; for example the Python runtime doesn't handle EINTR specially, and it is translated to a Python exception.

-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/
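The one-thread-per-request approach translates directly to Haskell; a sketch assuming the network package's getAddrInfo, with the blocking lookup confined to its own thread:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar
    import Control.Exception (IOException, try)
    import Network.Socket (AddrInfo, getAddrInfo)

    -- Fire off a lookup; the caller takes the result (or failure) later.
    lookupAsync :: String -> IO (MVar (Either IOException [AddrInfo]))
    lookupAsync host = do
      box <- newEmptyMVar
      _ <- forkIO (try (getAddrInfo Nothing (Just host) Nothing) >>= putMVar box)
      return box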
participants (10):

- Aaron Denney
- Andre Pang
- Ben Rudiak-Gould
- Duncan Coutts
- Glynn Clements
- John Meacham
- Keean Schupke
- Marcin 'Qrczak' Kowalczyk
- Pete Chown
- Simon Marlow