Understanding Stein's Paradox (2021) (joe-antognini.github.io) | Hacker News
103 points | 13 days ago | 10 comments
rssoconnor
13 days ago
[-]
I think the part on "How arbitrary is the origin, really?" is not correct. The origin is arbitrary. As the Wikipedia article points out, you can pick any point, whether or not it is the origin, and use the James-Stein estimator to push your estimate towards that point, and it will improve your mean squared error.

If you pick a point to the left of your sample, then moving your estimate to the left will improve your mean squared error on average. If you pick a point to the right of your sample, then moving your estimate to the right will improve your mean squared error as well.

I'm still trying to come to grips with this, and below is conjecture on my part. Imagine sampling many points from a 3-D Gaussian distribution (with identity covariance), making a nice cloud of points. Next choose any point P. P could be close to the cloud or far away; it doesn't matter. No matter which point P you pick, if you adjust all the points from your cloud of samples in accordance with this James-Stein formula, moving them all towards your chosen point P by various amounts, then, on average, they will move closer to the center of your Gaussian distribution. This happens no matter where P is.

The cloud is, of course, centered around the center of the Gaussian distribution. As the points are pulled towards this arbitrary point P, some will be pulled away from the center of the Gaussian, some are pulled towards the center, and some are pulled away from the center in the parallel direction but squeezed closer in the perpendicular direction. Anyhow, apparently everything ends up, on average, closer to the center of the Gaussian in the end.

I'm not entirely sure what to make of this result. Perhaps it means that mean squared error is a silly error metric?
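
A minimal simulation sketch of that thought experiment (assuming NumPy; the shrinkage rule is the standard positive-part James-Stein formula with known unit variance, and the particular values of mu and P below are arbitrary):

  import numpy as np

  rng = np.random.default_rng(0)
  d = 3
  mu = np.array([1.0, -2.0, 0.5])    # true mean of the Gaussian (arbitrary choice)
  P = np.array([10.0, 10.0, 10.0])   # arbitrary shrinkage target, far from the cloud
  n = 1_000_000

  # Cloud of samples from a 3-D Gaussian with identity covariance
  x = rng.normal(loc=mu, scale=1.0, size=(n, d))

  # Pull every sample towards P by the positive-part James-Stein amount
  diff = x - P
  factor = np.maximum(0.0, 1.0 - (d - 2) / np.sum(diff**2, axis=1, keepdims=True))
  x_js = P + factor * diff

  # Average squared distance to the true mean, before and after shrinking
  print(np.mean(np.sum((x - mu) ** 2, axis=1)))     # ~ 3.0
  print(np.mean(np.sum((x_js - mu) ** 2, axis=1)))  # slightly but consistently smaller

The improvement is tiny when P is far from the cloud (roughly like 1/|mu - P|^2), but its sign does not depend on where P is.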

reply
titanomachy
13 days ago
[-]
Your visualization helped me understand this! If the center of the distribution is far from P, then all the lines from P to the points in your cluster are basically parallel, and you just shift your point cluster which doesn’t help your estimate. But if P is close to the mean, then it sits near the middle of your cluster, so pulling all points towards P is “shrinking” the cluster more than “shifting” it.
reply
m16ghost
12 days ago
[-]
Here are some links that might help visualize what is going on:

https://www.naftaliharris.com/blog/steinviz/

https://www.youtube.com/watch?v=cUqoHQDinCM (this video actually references the original post)

My takeaway is that the points which get worse as they are pulled towards point P lie in some region R. As the number of dimensions increases, region R's volume shrinks as a percentage of the total cloud volume, making it much less likely that a sample falls in that region. In other words, you are more likely to sample points which move closer to the center than away from it, which is why the estimator is an improvement on average.

reply
toth
12 days ago
[-]
You make a valid point, but I feel there is something in the direction the article is gesturing at...

The mean of the n-dimensional Gaussian is an element of R^n, an unbounded space. There's no uninformed prior over this space, so there is always a choice of origin implicit in some way...

As you say, you can shrink towards any point and you get a valid James-Stein estimator that is strictly better than the naive estimator. But if you send the point you are shrinking towards to infinity you get the naive estimator again. So it feels like the fact you are implicitly selecting a finite chunk of R^n around an origin plays a role in the paradox...
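
For reference, the usual positive-part form with an explicit shrinkage target \nu (notation mine, known variance \sigma^2) is

  \hat{\mu}_\nu(x) = \nu + \left(1 - \frac{(d-2)\sigma^2}{\|x - \nu\|^2}\right)_+ (x - \nu)

Outside the truncation region this equals x - (d-2)\sigma^2 (x - \nu)/\|x - \nu\|^2, a correction of size (d-2)\sigma^2/\|x - \nu\|, which vanishes as \|\nu\| \to \infty, recovering the naive estimator only in the limit.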

reply
kgwgk
12 days ago
[-]
> But if you send the point you are shrinking towards to infinity you get the naive estimator again.

You get close to it but strictly speaking wouldn’t it always be better than the naive estimator?

reply
toth
12 days ago
[-]
Right, it's a limit at infinity
reply
rssoconnor
12 days ago
[-]
> There's no uninformed prior over this space, so there is always a choice of origin implicit in some way...

You could use an uninformed improper prior.

reply
kgwgk
12 days ago
[-]
You would just need to come up with a way to pick a point at random uniformly from an unbounded space.
reply
rssoconnor
12 days ago
[-]
You can just use the function that is constantly 1 everywhere as your improper prior.

Improper priors are not distributions so they don't need to integrate to 1. You cannot sample from them. However, you can still apply Bayes' rule using improper priors and you usually get a posterior distribution that is proper.
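
For example, with the flat improper prior on the mean of a Gaussian (a standard textbook case):

  \pi(\mu) \propto 1, \qquad x \mid \mu \sim N(\mu, \sigma^2 I)
  \implies p(\mu \mid x) \propto \exp\left(-\|x - \mu\|^2 / 2\sigma^2\right)

so the posterior is a proper N(x, \sigma^2 I) distribution, and its mean is just the naive estimator x.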

reply
kgwgk
12 days ago
[-]
Sure.

The point is that you wrote that « you can pick any point […] » and when toth pointed out that « there is always a choice of origin implicit in some way » you replied that « you could use an uninformed improper prior. »

However, it seems that we agree that you cannot pick a point using an uninformed improper prior - and in any method for picking a point there will be an implicit departure from that (improper) uniform distribution.

reply
rssoconnor
12 days ago
[-]
Oh.

When I said "you can pick any point P", I meant universal quantification, i.e "for all points P", rather than a randomly chosen P.

I did say "choose P", which was pretty bad phrasing on my part.

reply
mitthrowaway2
13 days ago
[-]
Sorry, I'm siding with the physicists here. If you're going to declare that your seemingly arbitrary choice of coordinate system is actually not arbitrary and part of your prior information about where the mean of the distribution is suspected to be, you have to put that in the initial problem statement.
reply
kgwgk
13 days ago
[-]
There is nothing magical about the origin: the shrinkage can be done towards any point, and in fact when estimating multiple means it's customary to move each point closer to their average.

https://www.math.drexel.edu/~tolya/EfronMorris.pdf

reply
mitthrowaway2
13 days ago
[-]
There is something magical about the origin when the result does not respect translational symmetry.

In fact, in a real world setting I would probably use my first measurement to define the origin, having no other reference to reach for.

reply
kgwgk
13 days ago
[-]
What does not respect translational symmetry?

You have an estimator. If you apply shrinkage towards the origin you have another estimator. If you apply shrinkage towards [42, 42, ..., 42] you have yet another estimator. Etc. Is it a problem that different estimators produce different results?

reply
skinner_
12 days ago
[-]
That's my understanding as well, FWIW. This is how I would phrase it:

Shrinking helps. In R^d there's no such thing as shrinking generally, only shrinking in the direction of some point. (The point that's the fixed point of the shrinking.) Regardless of what that point is, it's a good idea to shrink.

reply
mitthrowaway2
13 days ago
[-]
The James-Stein estimator does not respect translational symmetry. If I do a change of variables x2 = (x - offset), for an arbitrary offset, it gives me a different result! Whereas an estimator that just says I should guess that the mean is x is unaffected by a change of coordinate system.

This is a big problem if the coordinate system itself is not intended to contain information about the location of the mean.

This makes sense if "zero" is physically meaningful, for example if negative values are not allowed in the problem domain (number of spectators at Wimbledon stadium, etc). Although in that case, my distribution probably shouldn't be Gaussian!

reply
kgwgk
13 days ago
[-]
This is what the original paper from Stein says:

"We choose an arbitrary point in the sample space independent of the outcome of the experiment and call it the origin. Of course, in the way we have expressed the problem this choice has already been made, but in a correct coordinate-free presentation, it would appear as an arbitrary choice of one point in an affine space."

The James-Stein estimator in its general form is about shrinking towards an arbitrary point (which usually is not the origin). It respects translational symmetry if you transform that arbitrary point like everything else.
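
In symbols (notation mine), the general form shrinking towards \nu is

  \hat{\mu}_\nu(x) = \nu + c(\|x - \nu\|)\,(x - \nu)

for some scalar function c, and it satisfies \hat{\mu}_{\nu + a}(x + a) = \hat{\mu}_\nu(x) + a for any offset a. The family of estimators is translation-equivariant; it is only a single member of the family, with \nu held fixed, that singles out a point.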

reply
mitthrowaway2
13 days ago
[-]
That just means that it's assuming arbitrary additional prior information about the problem, which is different than zero information.
reply
kgwgk
13 days ago
[-]
I don't understand what you mean. Who assumes what?

Take any point and shrink your least-squares estimator in that direction. You get an estimator that is strictly better - in some technical sense - which renders the original estimator inadmissible - in some technical sense.

That's a mathematical fact, it has nothing to do with prior information about the problem.

reply
mitthrowaway2
13 days ago
[-]
The article's presentation of the James-Stein estimator sets the arbitrary point at the origin. (My previous comments should be read in this context). Of course, we could set it anywhere, including [42,...]. Let's call it p. Regardless of where you set it, the estimator suggests that your best estimate û, of the mean μ, should be nudged a little away from x and towards p.

My point is that the choice of 'p' (or, in the article's presentation, the choice of origin) cannot truly be arbitrary because if it reduces the expected squared difference between μ and û, then it necessarily contains information about μ. If all you truly know about μ is x and σ, then you will have no way to guess in which direction you should even shift your estimate û to reduce that error.

If you do have some additional information about μ, beyond just x alone, then sure, take advantage of it! But then don't call it a paradox.

reply
kgwgk
12 days ago
[-]
(I cannot speak for the original article; I've not put in the effort to fully understand it, so I won't categorically say it's wrong, but it didn't seem right to me.)

The “paradox” is that it can truly be arbitrary! Pick a random point. Shrink your least-squares estimator. You got yourself a “better” estimator - without having any additional information.

That’s why the “Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution” paper had the impact that it had.

reply
mitthrowaway2
12 days ago
[-]
Then you'll have to clarify what you mean by "random" when you say "pick a random point".

Unless you mean that every point on a spherical surface centered on x would have a lower expected squared error than x itself?

reply
kgwgk
12 days ago
[-]
We may be talking about different things.

Let's say that you have a standard multivariate normal with unknown mean mu = [a, b, c].

The usual maximum-likelihood estimator of the unknown mean when you get an observation is to take the observed value as the estimate. If you observe [x, y, z], the "naive" estimator gives you the estimate mû = [x, y, z].

For any arbitrary point [p, q, r] you can define another estimator. If you observe [x, y, z] this "shrinkage" estimator gives you an estimate which is no longer precisely at [x, y, z] but is displaced in the direction of [p, q, r]. For simplicity let's say the resulting estimate is mû' = [x', y', z'].

Whatever the choice you make for [p, q, r] the "shrinkage" estimator has lower mean squared error than the "naive" estimator. The expected value of (x'-a)²+(y'-b)²+(z'-c)² is lower than the expected value of (x-a)²+(y-b)²+(z-c)².

reply
Sharlin
13 days ago
[-]
You can put the origin anywhere and for almost all choices the adjustment is almost zero. But if the choice happens to be very close to the sample point, against all (prior) probabilities, then that fact affects the prior.
reply
mitthrowaway2
13 days ago
[-]
Not quite: If the origin is within a standard deviation or so of x (up to a factor depending on D), then the term inside the ReLU is negative, and the adjustment is exactly zero. If the origin is moderately far away from x, then the adjustment is large. If the origin is a vast distance from x, then the adjustment becomes small again: the shrinkage coefficient approaches zero for large |x|, and the displacement between x and û, of size roughly (D-2)σ²/|x|, shrinks toward zero as well.

Either way, this is absurd unless we have some additional background information about μ other than our sample x itself. But it's easy to resolve the paradox: since the choice of origin is arbitrary (unless it isn't!), we can select our coordinate system such that x = 0; then the adjustment is also zero, and the James-Stein estimator agrees that û = x = 0.

reply
credit_guy
13 days ago
[-]
Stein's paradox is bogus. Somebody needs to say that.

Here's one wikipedia example:

  > Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements. 
Here's what's bogus about this: the "better estimate (on average)" is mathematically true ... for a certain definition of "better estimate". But whatever that definition is, it is irrelevant to the real world. If you believe you get a better estimate of the US wheat yield by estimating also the number of Wimbledon spectators and the weight of a candy bar in a shop, then you probably believe in telepathy and astrology too.
reply
wavemode
13 days ago
[-]
(Disclaimer, stats noob here) - I thought the point was that you have a better chance of being -overall- closer to the mean (i.e., the 3D Euclidean distance between your guess and the mean would be the smallest, on average), even though you may not necessarily have improved your odds of guessing any of the single individual means.

So it's not that "you get a better estimate of the US wheat yield by estimating also the number of Wimbledon spectators and the weight of a candy bar in a shop", it's simply that you get a better estimate for the combined vector of the three means. (Which, in this case, the vector of the three means is probably meaningless, since the three data sets are entirely unrelated. But we could also imagine scenarios where that vector is meaningful.)

Am I misunderstanding something?

reply
credit_guy
13 days ago
[-]
You are most likely right.

I am personally bothered by the way it is presented as a "paradox", with the implication that it would have real world applications.

I have zero doubts that you can't improve the estimate of the US wheat yields by looking at some other unrelated things, like candy bars. Presenting the result as if it were a real "improvement" is false advertising.

On the other hand, if we look at related observations, then the improvement is not a paradox at all. Let's say I want to estimate the average temperature in the US and in Europe. They are related, and combining the estimates will lead to a better result, to nobody's surprise.

reply
falseprofit
13 days ago
[-]
Since when does “paradox” imply real world application?

In your last paragraph, what you’re describing is just inference based on correlation, which is unrelated to this topic.

reply
zeroonetwothree
13 days ago
[-]
You are correct in that the combined estimator is actually worse at estimating an individual value. It's only better if you specifically care about the combination (which you probably don't in this contrived example).
reply
gweinberg
13 days ago
[-]
Right. The question is when (if ever) you would actually want to be minimizing the rms of the vector error. For most of us, the answer is "never".

I remember back in 7th or 8th grade I asked my math teacher why we want to minimize the rms error rather than the sum of the absolute values of the errors. She couldn't give me a good answer, but the book All of Statistics does answer why (and under what circumstances) that is the right thing to do.

reply
mmmmpancakes
13 days ago
[-]
So this is just showing a bit of your ignorance of stats.

The general notion of compound risk is not specific to MSE loss. You can formulate it for any loss function, including L1 loss which you seem to prefer.

Stein's paradox and the James-Stein estimator are just a special case, for normal random variables and MSE loss, of the more general theory of compound estimation, which tries to find an estimator that can leverage all the data to reduce overall error.

This idea, compound estimation and James-Stein, is by now outdated. Later came the invention of empirical Bayes estimation and, eventually, once we had the compute for it, modern Bayesian hierarchical modelling.

One thing you can recover from EB is the James-Stein estimator, as a special case; in fact, you can design much better families of estimators that are optimal with respect to Bayes risk in compound estimation settings.
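
A sketch of that recovery, under the usual empirical-Bayes assumptions (known \sigma^2, shrinking towards 0, along the lines of Efron and Morris):

  \mu_i \sim N(0, \tau^2), \quad x_i \mid \mu_i \sim N(\mu_i, \sigma^2)
  \implies E[\mu_i \mid x] = \left(1 - \frac{\sigma^2}{\sigma^2 + \tau^2}\right) x_i

Marginally \|x\|^2/(\sigma^2 + \tau^2) \sim \chi^2_d, so (d-2)\sigma^2/\|x\|^2 is an unbiased estimate of the unknown weight \sigma^2/(\sigma^2 + \tau^2); plugging it in gives exactly the James-Stein estimator.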

This is broadly useful in pretty much any situation where you have a large-scale experiment in which many small samples are drawn and similar statistics are computed in parallel, or where the data has a natural hierarchical structure. For example, biostatistics, but also various internet data applications.

So yeah, I suggest being a bit more open to ideas you don't know anything about. @zeroonetwothree is not agreeing with you here; they're pointing out that you cooked up an irrelevant "example" and then claim the technique doesn't make sense there. Of course it doesn't, but that's not because the idea of JS isn't broadly useful.

----

Another thing is that the JS estimator can be viewed as an example of improving the overall bias-variance tradeoff by regularization, although the connection to regularization as most people in ML use it is maybe less obvious. If you think regularization isn't broadly applicable and very important... I've got some news for you.

reply
mmmmpancakes
13 days ago
[-]
No it is not bogus, you just don't know much stats apparently.
reply
jprete
13 days ago
[-]
My intuition is that the problem is in using squares for the error. The volume of space available for a given distance of error in 3-space is O(N^3) in the magnitude of the error, so an error term of O(N^2) doesn't grow fast enough compared to the volume that can contain that magnitude of error.

But I really don't know, it's just an intuition with no formalism behind it.

reply
hyperbovine
13 days ago
[-]
Had a bit of a chuckle at the very-2024 definition of the Stein shrinkage estimator:

\hat{\mu} = ReLU(…)

reply
toth
12 days ago
[-]
Ditto.

I think that ship has sailed, but I think it's unfortunate that "ReLU(x)" became a popular notation for "max(0,x)". And using the name "rectified linear unit" for basically "positive part" seems like a parody, like insisting on calling water "dihydrogen monoxide".

reply
BlueTemplar
12 days ago
[-]
It hasn't "sailed" as long as they want to communicate with non-machine-learning people.
reply
pfortuny
13 days ago
[-]
I do not get it: if the variance is too large, a random sample is not very representative of the mean. Is it as simple as that?

Now the specific formula may be complicated. But otherwise I do not understand the “paradox”? Or am I missing something?

reply
hcks
13 days ago
[-]
In the 1D case the single point will be the best estimator for the mean no matter what the variance is
reply
pfortuny
13 days ago
[-]
OK, this is a good reply, and very informative, many thanks. Now I understand why the thing is paradoxical.
reply
zeroonetwothree
13 days ago
[-]
I don’t understand the picture with the shaded circle. Sure the area to the left is smaller, but it also is more likely to be chosen because in a Gaussian values closer to the mean are more likely. So the picture alone doesn’t prove anything.
reply
rssoconnor
13 days ago
[-]
In the diagram the mean of the distribution is the center of the circle.

Of the set of samples a fixed distance d from the mean of the distribution, strictly less than half of them will be closer to the origin than the mean is, and strictly greater than half of them will be further from the origin than the mean is. This is true for all values of d > 0, so the result holds for all samples.
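
A quick numerical check of that claim (assuming NumPy; the mean and the radii below are arbitrary choices):

  import numpy as np

  rng = np.random.default_rng(1)
  dim = 3
  mu = np.array([2.0, 0.0, 0.0])   # mean of the distribution (arbitrary, nonzero)
  n = 1_000_000

  for r in (0.5, 2.0, 3.5):        # a few fixed distances from the mean
      v = rng.normal(size=(n, dim))
      v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform random directions
      z = mu + r * v                                 # samples at distance r from mu
      frac = np.mean(np.linalg.norm(z, axis=1) < np.linalg.norm(mu))
      print(r, frac)               # each fraction comes out strictly below 0.5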

reply
fromMars
13 days ago
[-]
Can someone confirm the validity of the section called "Can we derive the James-Stein estimator rigorously?"?

The claim that the best estimator must be smooth seemed surprising to me.

reply
moi2388
11 days ago
[-]
What a great read!
reply
TibbityFlanders
13 days ago
[-]
I'm horrible at stats, but is this saying that if I have 5 jars of pennies and I guess the amount in each one, then find the average of all my guesses and the variance between the guesses, I can adjust each guess to a more likely answer with this method?
reply
kgwgk
13 days ago
[-]
Not necessarily "more likely" but "better" in some "loss" sense.

It could be "more likely" in the jars example where estimates may convey some relevant information for each other. But consider this example from wikipedia:

"Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements."

https://en.wikipedia.org/wiki/Stein%27s_example#Example

reply
eru
13 days ago
[-]
No, I don't think these problems are related.
reply