Re: [Haskell-cafe] Help on syntactic sugar for combining lazy & strict monads?

> However, a lazy interpreter causes problems when trying to introduce *observation* statements (aka conditioning statements) into the monad [3]. For example,
>
>     run_lazy $ do
>       x <- normal 0 1
>       y <- normal x 1
>       z <- normal y 1
>       2.0 `observe_from` normal z 1
>       return y
>
> In the above code fragment, y will be forced because it is returned, and y will force x. However, the "observe_from" statement will never be forced, because it does not produce a result that is demanded.
I'm very confused. If the observe_from statement is never demanded, then what semantics should it have? What is the type of observe_from? It seems it is

    observe_from :: a -> m a -> m ()

for whatever monad m you are using. But conditioning usually is a function

    Observation a -> Dist a -> Dist a

so you must use the result of the conditioning somehow. And isn't the principle of Monte Carlo to approximate the posterior by sampling from it? I tend to agree with your suggestion that observations and sampling cannot be mixed (in the same do-notation): the latter have to be collected in a prior, which is then conditioned by an observation.

What is the semantic connection between your sample and observation monads? What is the connection between both and the semantic probability distributions? I claim that once you have typed everything, it becomes clear where the problem is.

Olaf

P.S. It has always bugged me that probabilists use elements and events interchangeably, although this can only be done on discrete spaces. So above I would rather like to write

    (2.0==) `observe_from` (normal 0 1)

which is still a nonsensical statement if (normal 0 1) is a continuous distribution, where each point set has probability zero.
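To make the two candidate types concrete, here is a minimal discrete sketch. Everything in it (the weighted Dist, Observation, and both functions) is a toy stand-in invented for illustration, not code from either side of this thread:

    -- a toy weighted discrete distribution, purely for illustration
    newtype Dist a = Dist { runDist :: [(a, Double)] }

    instance Functor Dist where
      fmap f (Dist xs) = Dist [ (f x, w) | (x, w) <- xs ]

    instance Applicative Dist where
      pure x = Dist [(x, 1)]
      Dist fs <*> Dist xs = Dist [ (f x, v * w) | (f, v) <- fs, (x, w) <- xs ]

    instance Monad Dist where
      Dist xs >>= k = Dist [ (y, w * v) | (x, w) <- xs, (y, v) <- runDist (k x) ]

    -- an observable event rather than a point
    type Observation a = a -> Bool

    -- the type in the quoted code: the () carries no information, but
    -- in a weighted monad the surviving weight is threaded through (>>=)
    observe_from :: Eq a => a -> Dist a -> Dist ()
    observe_from x d = Dist [ ((), w) | (y, w) <- runDist d, y == x ]

    -- the type conditioning "usually" has: the caller must use the result
    condition :: Observation a -> Dist a -> Dist a
    condition ok (Dist xs) = Dist [ (x, w) | (x, w) <- xs, ok x ]

In a weighted monad like this, even the `m ()` version is not a no-op: the factor it contributes survives through bind, which is essentially the importance-sampling reading that comes up later in the thread.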

Hi Olaf,

I think you need to look at two things:

1. The Giry monad, and how it deals with continuous spaces.
2. The paper "Practical Probabilistic Programming with Monads" - https://doi.org/10.1145/2804302.2804317

Also, observing 2.0 from a continuous distribution is not nonsensical.

-BenRI

The idea of changing observation to look like `Observation a -> Dist a -> Dist a` is interesting, but I am not sure it works in practice. Generally you cannot actually produce an exact sample from a distribution plus an observation. MCMC, for example, produces collections of samples that you can average against, and the error decreases as the number of samples increases. But you can't generate a single point that is a sample from the posterior.

Maybe it would be possible to use separate types for distributions from which you cannot directly sample? Something like `Observation a -> SampleableDist a -> NonsampleableDist a`. I will think about whether this would solve the problem with laziness...

-BenRI
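One rough sketch of how such a split could look; all names here (Sampleable, Conditioned, sample, observe) are hypothetical, and a likelihood function stands in for `Observation a`:

    {-# LANGUAGE GADTs #-}
    import System.Random (randomRIO)

    -- distributions we can draw exact samples from
    data Sampleable a where
      Uniform01 :: Sampleable Double
      Pure      :: a -> Sampleable a
      Bind      :: Sampleable b -> (b -> Sampleable a) -> Sampleable a

    sample :: Sampleable a -> IO a
    sample Uniform01  = randomRIO (0, 1)
    sample (Pure x)   = return x
    sample (Bind m k) = sample m >>= sample . k

    -- a prior plus observations: no exact sampler any more, only
    -- weighted draws from the underlying prior
    data Conditioned a = Conditioned
      { prior      :: Sampleable a
      , likelihood :: a -> Double
      }

    observe :: (a -> Double) -> Sampleable a -> Conditioned a
    observe lik p = Conditioned p lik

The "tainting" worry Olaf raises below is visible in the types: nothing maps a Conditioned a back to a Sampleable a.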

On Thu, 29 Jul 2021, Benjamin Redelings wrote:
> The idea of changing observation to look like `Observation a -> Dist a -> Dist a` is interesting, but I am not sure it works in practice. Generally you cannot actually produce an exact sample from a distribution plus an observation. MCMC, for example, produces collections of samples that you can average against, and the error decreases as the number of samples increases. But you can't generate a single point that is a sample from the posterior.
Isn't that a problem of the concrete implementation of the probability monad (MCMC)? Certainly not something that I, as the modeller, would like to be concerned with. Other implementations might not have this limitation. Are we talking about the same thing, namely mathematical conditional probability? That has the type I described, so it should not be an unusual design choice. Moreover, it is a partial function. What happens if I try to condition on an impossible observation in BAli-Phy?
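For a toy discrete distribution, that partiality can be made explicit with Maybe; this is illustrative code only, not BAli-Phy's API:

    newtype Dist a = Dist [(a, Double)]

    -- conditioning as a partial function made total: an impossible
    -- observation leaves nothing to renormalize
    condition :: (a -> Bool) -> Dist a -> Maybe (Dist a)
    condition ok (Dist xs)
      | total == 0 = Nothing    -- impossible observation
      | otherwise  = Just (Dist [ (x, w / total) | (x, w) <- kept ])
      where
        kept  = [ (x, w) | (x, w) <- xs, ok x ]
        total = sum (map snd kept)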
> Maybe it would be possible to use separate types for distributions from which you cannot directly sample? Something like `Observation a -> SampleableDist a -> NonsampleableDist a`.
I haven't seen types of much else in your system, so I cannot provide meaningful insights here. My concern would be that parts of models become tainted as Nonsampleable too easily, a trapdoor operation. But this is just guesswork; I don't grasp sampling well enough, apparently.

Here is my attempt, please correct me if this is wrong. Sampling is a monad morphism from a "true" monad of probability distributions (e.g. the Giry monad) to a state monad transforming a source of randomness, in such a way that any infinite stream of samples has the following property: for any measurable set U, the probability measure of U equals the limiting fraction of samples that fall inside U, as the prefix of the infinite stream grows to infinity. Conditioning on an observation O should translate to the state monad in such a way that the sample producer is now forbidden to output anything outside O. Hence conditioning on an impossible observation must produce a state transformer that never outputs anything: the bottom function.
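A minimal rendering of that translation, under the stated assumptions: a state monad over a StdGen, with conditioning implemented by rejection (Sampler, uniform01, condition are made-up names):

    import Control.Monad (ap)
    import System.Random (StdGen, randomR)

    -- sampling as a state transformer on the source of randomness
    newtype Sampler a = Sampler { runSampler :: StdGen -> (a, StdGen) }

    instance Functor Sampler where
      fmap f (Sampler g) = Sampler (\s -> let (x, s') = g s in (f x, s'))

    instance Applicative Sampler where
      pure x = Sampler (\s -> (x, s))
      (<*>)  = ap

    instance Monad Sampler where
      Sampler g >>= k = Sampler (\s ->
        let (x, s') = g s in runSampler (k x) s')

    uniform01 :: Sampler Double
    uniform01 = Sampler (randomR (0, 1))

    -- rejection: redraw until the observation holds; if the observation
    -- is impossible, this never produces anything -- the bottom
    -- function, exactly as claimed above
    condition :: (a -> Bool) -> Sampler a -> Sampler a
    condition ok m = do
      x <- m
      if ok x then return x else condition ok m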

Hi Olaf,

If you recall, I actually wrote to this list asking for help. Adam Scibior's response was quite helpful, but this conversation, as far as I can tell, has not provided any help, so I will regretfully have to stop replying at some point. Unfortunately, I don't have enough time to respond to everything - my apologies! I may be able to clarify some of the questions that you asked, but I doubt that I can actually satisfy all of your concerns. So, perhaps these responses will mostly clarify our points of disagreement.

1. It seems like you are assuming that the semantics of a probabilistic program is determined by running it under a rejection-sampling interpreter. Such an interpreter would:

(a) interpret statements like "x <- normal 0 1" by performing a (pseudo)random sample from the distribution;

(b) interpret statements like "2.0 `observe_from` normal z 1" by first sampling from "normal z 1" and then rejecting the program run if the sample does not satisfy (2.0==).

That is a rejection sampler, and it would work if the observations did not have probability 0. But it is only one kind of sampler. There are other kinds of samplers that you (maybe?) have not considered:

* Importance sampling: the paper that I recommended uses the observation statements to weight the samples by the probability DENSITY of the observations. We generate samples from one distribution but reweight them to approximate a different one (a toy version is sketched below).

* MCMC: my approach is based on MCMC, and so performs a random walk over the space of program traces. In this case, running the program must compute the probability DENSITY of samples from the prior (like "x <- normal 0 1") as well as of observations (like observing 2.0 from "normal z 1").

These three interpreters are alternative ways of sampling from the posterior distribution; however, none of them constitutes the semantics of the probabilistic program. I mentioned the Giry monad (which I am no expert on) because it gives a semantics that is not based on any interpreter. I do not think that the semantics of probabilistic programs should be given by an interpreter. Perhaps we disagree on this.
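Here is the toy version of the importance-sampling reading mentioned above. It is my own minimal sketch, not code from the paper or from BAli-Phy; the observed distribution is represented directly by its density, which is the only part this interpreter needs:

    import Control.Monad (ap)
    import System.Random (StdGen, randomR)

    -- a sampler that also accumulates an importance weight
    newtype Weighted a = Weighted { runWeighted :: StdGen -> (a, Double, StdGen) }

    instance Functor Weighted where
      fmap f (Weighted g) = Weighted (\s -> let (x, w, s') = g s in (f x, w, s'))

    instance Applicative Weighted where
      pure x = Weighted (\s -> (x, 1, s))
      (<*>)  = ap

    instance Monad Weighted where
      Weighted g >>= k = Weighted (\s ->
        let (x, w, s')  = g s
            (y, v, s'') = runWeighted (k x) s'
        in (y, w * v, s''))

    normal_pdf :: Double -> Double -> Double -> Double
    normal_pdf mu sigma x =
      exp (negate ((x - mu) ^ 2) / (2 * sigma ^ 2)) / (sigma * sqrt (2 * pi))

    -- a crude Box-Muller draw from the prior, with weight 1
    -- (ignoring the measure-zero corner case u1 == 0)
    normal :: Double -> Double -> Weighted Double
    normal mu sigma = Weighted (\s ->
      let (u1, s1) = randomR (0, 1) s
          (u2, s2) = randomR (0, 1) s1
          z        = sqrt (-2 * log u1) * cos (2 * pi * u2)
      in (mu + sigma * z, 1, s2))

    -- observing a value multiplies the trace weight by its density
    observe_from :: Double -> (Double -> Double) -> Weighted ()
    observe_from x pdf = Weighted (\s -> ((), pdf x, s))

    -- the program from this thread, under that reading
    example :: Weighted Double
    example = do
      x <- normal 0 1
      y <- normal x 1
      z <- normal y 1
      2.0 `observe_from` normal_pdf z 1
      return y

Because (>>=) threads the weight through every step, the observe_from line contributes its factor even though its () result is never demanded.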
2. You wrote, "If the observe_from statement is never demanded, then what semantics should it have?" The program in question was:

> run_lazy $ do
>   x <- normal 0 1
>   y <- normal x 1
>   z <- normal y 1
>   2.0 `observe_from` normal z 1
>   return y
The semantics that it should have seems fairly obvious? This program leads to the following (un-normalized!) probability density on (x,y,z):

    Pr(x) * Pr(y|x) * Pr(z|y) * Pr(2.0|z)
      = normal_pdf(x,0,1) * normal_pdf(y,x,1) * normal_pdf(z,y,1) * normal_pdf(2.0,z,1)

Therefore, the program leads to both (i) a distribution on (x,y,z) and (ii) a distribution on the return value y. I don't think this has anything to do with whether there is a result that is demanded, but perhaps we disagree on that.

3. You wrote, "But conditioning usually is a function Observation a -> Dist a -> Dist a, so you must use the result of the conditioning somehow." I was initially very confused by this, because it sounded like you were saying that because people USUALLY do something, I HAVE AN OBLIGATION to do the same. But now I realize that you mean "IT MUST BE THE CASE that you use the result of the conditioning somehow." It is indeed the case that the "observe_from" command affects the interpreter somehow. I think this is pretty simple... the `observe_from` statement has a side effect. For both (i) importance-sampling interpreters and (ii) MCMC interpreters, the side effect is to weight the current trace by the probability (or probability density) of the observation. However, like the "put" command in the state monad, the result "()" of the observation is not demanded. So in that sense, I am not "using the result". It is perhaps interesting that the lazy state monad does not have the problem that "put" commands are ignored (see the small example after point 5 below).

4. You wrote, "And isn't the principle of Monte Carlo to approximate the posterior by sampling from it?" I am not really sure what this means. Sometimes you can generate samples directly from the posterior, and sometimes you can't. In most situations where you have observations, you CANNOT generate samples from the posterior. For example, in MCMC, we talk about the distribution of the sampled points CONVERGING to the posterior. If an MCMC chain has points X[t] for t = 0, 1, 2, ..., there is no t where X[t] is distributed according to the posterior distribution.

5. You wrote, "I claim that once you have typed everything, it becomes clear where the problem is." That is an interesting claim. I don't think that there is actually a "problem" per se, so I am not sure that typing everything can reveal where it is. Also, this exact same issue comes up in the Practical Probabilistic Programming with Monads paper, and they have typed everything. So I will disbelieve this claim until I see evidence to the contrary. I guess we will see, after I finish programming the type system.
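The remark in point 3 about the lazy state monad, as a concrete (if tiny) example using mtl's Control.Monad.State.Lazy:

    import Control.Monad.State.Lazy (State, put, execState)

    -- the () result of put is never demanded, but the effect is not
    -- lost: (>>=) threads the state itself, so demanding the final
    -- state forces both puts
    example :: State Int ()
    example = do
      put 1
      put 2

    -- execState example 0  ==>  2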
6. You wrote:

> I believe I understand the Giry monad well, and it is my measuring stick for functional probabilistic programming. Even better suited for programming ought to be the valuation monad, because that is a monad on domains, the semantic spaces [*] of the lambda calculus. Unfortunately, to my knowledge all attempts so far to find a cartesian closed category of domains which is also closed under this probabilistic powerdomain construction have been unsuccessful.
>
> [*] There are of course other semantics, domains being one option.
Given your expertise in this area, I doubt that I can shed any light on this. I presume that you have read https://arxiv.org/pdf/1811.04196.pdf and similar papers. This is getting pretty far afield from my original question.

7. You wrote:
>> 2. The paper "Practical Probabilistic Programming with Monads" - https://doi.org/10.1145/2804302.2804317
> Wasn't that what you linked to in your original post? As said above, measure spaces are the wrong target category, in my opinion. There is too much non-constructive stuff in there. See the work of Culbertson and Sturtz, which is categorically nice but not very constructive.
The link was indeed in my original post, but it did not seem like you had read it, since you did not consider the possibility of probabilistic programs generating WEIGHTED samples. Also, the idea that probabilistic programs should not yield measures seems weird, partly because I am not sure what your alternative is, and partly because everybody else seems to disagree with you. For example, the following paper assumes measures: https://link.springer.com/chapter/10.1007/978-3-030-72019-3_16 Again, this is getting pretty far away from my original question.

8. You wrote:
> Perhaps I am being too much of a point-free topologist here. Call me pedantic. Or I don't understand sampling at all. To me, a point is an idealised object; only open sets really exist and are observable. If the space is discrete, points are open sets. But on the real line you cannot measure with infinite precision, so any observation must contain an interval. That aligns very nicely with functional programming, where only finite parts of infinite lazy structures are ever observable, and these finite parts are the open sets in the domain semantics. So please explain how observing 2.0 from a continuous distribution is not nonsensical.
You haven't actually shown that observing 2.0 from a continuous distribution is nonsensical. What you have done is state that you don't like it, and that you would like people to represent observations as intervals instead of points. However, it's not my job to convince you that point observations make sense, so I think I'll just leave that where it is.

OK, that's enough for now.

-BenRI