
What does “Scientists rise up against statistical significance” mean? (Comment in Nature)


The Comment in Nature, "Scientists rise up against statistical significance", begins with:




Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.




and later contains statements like:




Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.




I think I can grasp that the image below does not say that the two studies disagree because one "rules out" no effect while the other does not. But the article seems to go into much more depth than I can understand.



Towards the end there seems to be a summary in four points. Is it possible to summarize these in even simpler terms for those of us who read statistics rather than write it?




When talking about compatibility intervals, bear in mind four things.



  • First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible...


  • Second, not all values inside are equally compatible with the data, given the assumptions...


  • Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention...


  • Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval...





Nature: Scientists rise up against statistical significance










Tags: statistical-significance, p-value, bias

asked Mar 21 at 1:19 by uhoh, edited Mar 22 at 22:14

  • Basically, they want to fill research papers with even more false positives! – asdf, Mar 21 at 7:23

  • See the discussion on Gelman's blog: statmodeling.stat.columbia.edu/2019/03/20/…. Obviously the article raises some valid points, but see comments raised by Ioannidis against this article (and also, separately, against the "petition" aspect of it), as quoted by Gelman. – amoeba, Mar 21 at 8:52

  • This isn't a new concept though. Meta-analysis has been a thing for the better part of 50 years, and Cochrane have been doing meta-analyses of medical/healthcare studies (where it's easier to standardise objectives and outcomes) for the last 25 years. – Graham, Mar 21 at 15:11

  • Fundamentally the problem is trying to reduce "uncertainty" which is a multidimensional problem to a single number. – MaxW, Mar 22 at 19:06

  • Basically if people stated "we found no evidence of an association between X and Y" instead of "X and Y are not related" when finding $p > \alpha$ this article wouldn't likely exist. – Firebug, Mar 22 at 19:35










10 Answers

Answer (62 votes)

The first three points, as far as I can tell, are a variation on a single argument.



Scientists often treat uncertainty measurements ($12 \pm 1$, for instance) as probability distributions that look like this:



[image: a uniform (flat-topped) probability distribution]



When actually, they are much more likely to look like this:
[image: a peaked (roughly bell-shaped) probability distribution]



As a former chemist, I can confirm that many scientists with non-mathematical backgrounds (primarily non-physical chemists and biologists) don't really understand how uncertainty (or error, as they call it) is supposed to work. They recall a time in undergrad physics where they maybe had to use them, possibly even having to calculate a compound error through several different measurements, but they never really understood them. I too was guilty of this, and assumed all measurements had to come within the $\pm$ interval. Only recently (and outside academia) did I find out that error measurements usually refer to a certain standard deviation, not an absolute limit.
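
To make the "one standard deviation" reading concrete, here is a minimal numerical sketch (Python with SciPy). It assumes the quoted $\pm$ is one standard deviation of a roughly normal error model, which is exactly the reading described above:

```python
from scipy.stats import norm

# Assumption: "12 ± 1" means mean 12 with standard deviation 1 and roughly normal errors.
mu, sigma = 12.0, 1.0

outside_1sd = 2 * norm.sf(1)   # P(|X - mu| > 1 * sigma)
outside_2sd = 2 * norm.sf(2)   # P(|X - mu| > 2 * sigma)

print(f"outside +/-1 sd: {outside_1sd:.1%}")    # ~31.7%, the "32% chance" in point 1 below
print(f"outside +/-2 sd: {outside_2sd:.1%}")    # ~4.6%, close to the conventional 5% cutoff
print(f"95% half-width: {norm.ppf(0.975) * sigma:.2f}")   # ~1.96 sigma (point 3 below)
```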



So to break down the numbered points in the article:



  1. Measurements outside the CI still have a chance of happening, because the real (likely Gaussian) probability is non-zero there (or anywhere, for that matter, although it becomes vanishingly small far out). If the values after the $\pm$ do indeed represent one s.d., then there is still a 32% chance of a data point falling outside of them.


  2. The distribution is not uniform (flat topped, as in the first graph), it is peaked. You are more likely to get a value in the middle than you are at the edges. It's like rolling a bunch of dice, rather than a single die.


  3. 95% is an arbitrary cutoff, and coincides almost exactly with two standard deviations.


  4. This point is more of a comment on academic honesty in general. A realisation I had during my PhD is that science isn't some abstract force, it is the cumulative efforts of people attempting to do science. These are people who are trying to discover new things about the universe, but at the same time are also trying to keep their kids fed and keep their jobs, which unfortunately in modern times means some form of publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications.


Arbitrary thresholds such as $p < 0.05$ can often be self-perpetuating, especially among those who don't fully understand statistics and just need a pass/fail stamp on their results. As such, people do sometimes half-jokingly talk about 'running the test again until you get $p < 0.05$'. It can be very tempting, especially if a PhD/grant/employment is riding on the outcome, for these marginal results to be jiggled around until the desired $p = 0.0498$ shows up in the analysis.



Such practices can be detrimental to science as a whole, especially if done widely, all in the pursuit of a number which is, in the eyes of nature, meaningless. This part, in effect, exhorts scientists to be honest about their data and work, even when that honesty is to their detriment.














  • +1 for "... publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications." There was an interesting paper that came out years back that talks about how this "publish or perish" leads to compounding error/bias throughout academia: Why Most Published Research Findings Are False (Ioannidis, 2005) – J. Taylor, Mar 21 at 8:03

  • I don't agree with "the real (likely Gaussian) uncertainty..." – Gaussian is another oversimplification. It's somewhat more justified than the hard-limits model thanks to the Central Limit Theorem, but the real distribution is generally something different still. – leftaroundabout, Mar 21 at 8:43

  • @leftaroundabout The real distribution is likely different still, but unless the value is physically impossible, the probability is likely still mathematically nonzero. – gerrit, Mar 21 at 9:38

  • @leftaroundabout saying that the uncertainty is likely Gaussian is not inherently a simplification. It describes a prior distribution, which is justified by the CLT as the best prior in the absence of other supporting data, but by expressing uncertainty over the distribution the acknowledgement that the distribution could well not be Gaussian is already there. – Will, Mar 21 at 10:42

  • @inisfree you are very, very mistaken. Many scientific disciplines (like chemistry and biology, as I stated earlier) use almost zero maths, beside basic arithmetic. There are otherwise brilliant scientists out there who are almost math illiterate, and I've met a few of them. – Ingolifs, Mar 22 at 6:58


















Answer (19 votes)

I'll try.



  1. The confidence interval (which they rename compatibility interval) shows the values of the parameter that are most compatible with the data. But that doesn't mean the values outside the interval are absolutely incompatible with the data.

  2. Values near the middle of the confidence (compatibility) interval are more compatible with the data than values near the ends of the interval.

  3. 95% is just a convention. You can compute 90% or 99% or any% intervals (see the sketch after this list).

  4. The confidence/compatibility intervals are only helpful if the experiment was done properly, if the analysis was done according to a preset plan, and the data conform with the assumption of the analysis methods. If you've got bad data analyzed badly, the compatibility interval is not meaningful or helpful.
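
As a small illustration of point 3, a sketch with made-up numbers (Python with SciPy): the same point estimate and standard error yield wider or narrower intervals purely as a function of the chosen level.

```python
from scipy.stats import norm

# Hypothetical estimate and standard error, chosen only for illustration.
est, se = 2.0, 1.0

for level in (0.90, 0.95, 0.99):
    z = norm.ppf(0.5 + level / 2)               # ~1.64, 1.96, 2.58
    print(f"{level:.0%} interval: ({est - z * se:.2f}, {est + z * se:.2f})")
```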

























Answer (17 votes)

    Much of the article and the figure you include make a very simple point:




    Lack of evidence for an effect is not evidence that it does not exist.




    For example,




    "In our study, mice given cyanide did not die at statistically-significantly higher rates" is not evidence for the claim "cyanide has no effect on mouse deaths".




    Suppose we give two mice a dose of cyanide and one of them dies. In the control group of two mice, neither dies. Since the sample size was so small, this result is not statistically significant ($p > 0.05$). So this experiment does not show a statistically significant effect of cyanide on mouse lifespan. Should we conclude that cyanide has no effect on mice? Obviously not.
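
    To make the two-mouse example concrete, here is a minimal sketch (Python with SciPy). The answer does not name a particular test, so Fisher's exact test is used here only as one reasonable choice for this 2x2 table:

```python
from scipy.stats import fisher_exact

# Two-mouse example: treated group 1 dead / 1 alive, control group 0 dead / 2 alive.
table = [[1, 1],
         [0, 2]]
odds_ratio, p_value = fisher_exact(table)   # two-sided by default
print(p_value)   # 1.0, far above 0.05, yet cyanide obviously has an effect
```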



    But this is the mistake the authors claim scientists routinely make.



    For example in your figure, the red line could arise from a study on very few mice, while the blue line could arise from the exact same study, but on many mice.



    The authors suggest that, instead of reporting effect sizes and p-values, scientists describe the range of possibilities that are more or less compatible with their findings. In our two-mouse experiment, we would have to write that our findings are both compatible with cyanide being very poisonous and with it not being poisonous at all. In a 100-mouse experiment, we might find a confidence interval range of $[60\%, 70\%]$ fatality with a point estimate of $65\%$. Then we should write that our results would be most compatible with an assumption that this dose kills 65% of mice, but our results would also be somewhat compatible with percentages as low as 60 or as high as 70, and that our results would be less compatible with a truth outside that range. (We should also describe what statistical assumptions we make to compute these numbers.)














    • I disagree with the blanket statement that "absence of evidence is not evidence of absence". Power calculations allow you determine the likelihood of deeming an effect of a particular size significant, given a particular sample size. Large effect sizes require less data to deem them significantly different from zero, while small effects require a larger sample size. If your study is properly powered, and you are still not seeing significant effects, then you can reasonably conclude that the effect does not exist. If you have sufficient data, non-significance can indeed indicate no effect. – Nuclear Wang, Mar 21 at 13:23

    • @NuclearWang True, but only if the power analysis is done ahead of time and only if it is done with correct assumptions and then correct interpretations (i.e., your power is only relevant to the magnitude of the effect size that you predict; "80% power" does not mean you have 80% probability to correctly detect zero effect). Additionally, in my experience the use of "non-significant" to mean "no effect" is often applied to secondary outcomes or rare events, which the study is (appropriately) not powered for anyways. Finally, beta is typically >> alpha. – Bryan Krause, Mar 21 at 15:05

    • @NuclearWang, I don't think anyone is arguing "absence of evidence is NEVER evidence of absence", I think they are arguing it should not be automatically interpreted as such, and that this is the mistake they see people making. – usul, Mar 21 at 17:28

    • It's almost like people are not trained in tests for equivalence or something. – Alexis, 2 days ago


















Answer (9 votes)

    The great XKCD did this cartoon a while ago, illustrating the problem. If results with $P \gt 0.05$ are simplistically treated as proving a hypothesis - and all too often they are - then 1 in 20 hypotheses so proven will actually be false. Similarly, if $P \lt 0.05$ is taken as disproving a hypothesis then 1 in 20 true hypotheses will be wrongly rejected. P-values don't tell you whether a hypothesis is true or false, they tell you whether a hypothesis is probably true or false. It seems the referenced article is kicking back against the all-too-common naïve interpretation.
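
    One part of this that is easy to check numerically is what the 0.05 threshold does guarantee: when a null hypothesis is exactly true, about 1 in 20 tests of it will still come out "significant" purely by chance. A small simulation sketch (Python with NumPy and SciPy; the setup is hypothetical, not taken from this answer):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# 10,000 experiments in which the null is exactly true (both groups identical).
n_sig = 0
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, 30)
    b = rng.normal(0.0, 1.0, 30)
    if ttest_ind(a, b).pvalue < 0.05:
        n_sig += 1

print(n_sig / 10_000)   # ~0.05: about 1 in 20 true nulls still looks "significant"
```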






    – digitig (new contributor)








    • (-1) P-values don't show you whether a hypothesis is probably true or false. You need a prior distribution for that. See this xkcd, for example. The problematic hand-waving that leads to this confusion is that if we have similar priors for a large number of hypotheses, then the p-value will be proportional to the probability it is true or false. But before seeing any data, some hypotheses are much more probable than others! – Cliff AB, Mar 21 at 17:54

    • While this effect is something that shouldn't be discounted, it is far from being a significant point of the referenced article. – R.M., Mar 21 at 18:10


















Answer (6 votes)

    tl;dr- It's fundamentally impossible to prove that things are unrelated; statistics can only be used to show when things are related. Despite this well-established fact, people frequently misinterpret a lack of statistical significance to imply a lack of relationship.




    A good encryption method should generate a ciphertext that, as far as an attacker can tell, doesn't bear any statistical relationship to the protected message. Because if an attacker can determine some sort of relationship, then they can get information about your protected messages by just looking at the ciphertexts – which is a Bad Thing™.



    However, the ciphertext and its corresponding plaintext 100% determine each other. So even if the world's very best mathematicians can't find any significant relationship no matter how hard they try, we still obviously know that the relationship isn't just there, but that it's completely and fully deterministic. This determinism can exist even when we know that it's impossible to find a relationship.



    Despite this, we still get people who'll do stuff like:



    1. Pick some relationship they want to "disprove".


    2. Do some study on it that's inadequate to detect the alleged relationship.


    3. Report the lack of a statistically significant relationship.


    4. Twist this into a lack of relationship.


    This leads to all sorts of "scientific studies" that the media will (falsely) report as disproving the existence of some relationship.



    If you want to design your own study around this, there're a bunch of ways you can do it:



    1. Lazy research:
      The easiest way, by far, is to just be incredibly lazy about it. It's just like the figure linked in the question: you can easily get that "'Non-significant' study (high P value)" result by simply having small sample sizes, allowing a lot of noise, and other various lazy things. In fact, if you're so lazy as to not collect any data, then you're already done!


    2. Lazy analysis:
      For some silly reason, some people think a Pearson correlation coefficient of $0$ means "no correlation". Which is true, in a very limited sense. But here are a few cases to observe:
      [image: scatter plots with zero Pearson correlation but an obvious non-linear relationship]
      That is, there may not be a "linear" relationship, but obviously there can be a more complex one. And it doesn't need to be "encryption"-level complex, but rather "it's actually just a bit of a squiggly line" or "there're two correlations" or whatever (see the sketch after this list).


    3. Lazy answering:
      In the spirit of the above, I'm going to stop here. To, ya know, be lazy!
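
    Here is the sketch mentioned in point 2 (Python with NumPy, hypothetical data): a relationship can be perfectly deterministic and still have a Pearson correlation of essentially zero.

```python
import numpy as np

x = np.linspace(-3, 3, 1001)
y = x**2                        # y is completely determined by x, just not linearly

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))              # ~0.0: Pearson r reports no *linear* correlation
```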


    But, seriously, the article sums it up well in:




    Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero.



















    • +1 cause what you write is both true and thought provoking. However, in my humble opinion, you can prove that two quantities are reasonably uncorrelated under certain assumptions. You have to of course first start by e.g. supposing a certain distribution about them, but this can be based on the laws of physics, or statistics (e.g. the speed of molecules of a gas in a container are expected to be Gaussian or so on). – ntg, Mar 22 at 5:21

    • @ntg Yeah, it's hard to know how to word some of this stuff, so I left a lot out. I mean, the general truth is that we can't disprove that some relationship exists, though we can generally demonstrate that a specific relationship doesn't exist. Sorta like, we can't establish that two data series are unrelated, but we can establish that they don't appear to be reliably related by a simple linear function. – Nat, Mar 22 at 8:05

    • -1 "tl;dr- It's fundamentally impossible to prove that things are unrelated": Equivalence tests provide evidence of absence of an effect within an arbitrary effect size. – Alexis, 2 days ago

    • @Alexis I think you misunderstand equivalence testing; you can use equivalence testing to evidence the absence of a certain relationship holding, e.g. a linear relationship, but not evidence the absence of any relationship. – Nat, yesterday

    • @Alexis Statistical inference can provide you as much evidence of the absence of an effect larger than a specific effect size within the context of some model. Perhaps you're assuming that the model will always be known? – Nat, yesterday


















Answer (4 votes)

    For me, the most important part was:




    ...[We] urge authors to
    discuss the point estimate, even when they have a large P value or a
    wide interval, as well as discussing the limits of that interval.




    In other words: Place a higher emphasis on discussing estimates (center and confidence interval), and
    a lower emphasis on "Null-hypothesis testing".



    How does this work in practice? A lot of research boils down to measuring effect sizes, for example
    "We measured a risk ratio of 1.20, with a 95% C.I.
    ranging from 0.97 to 1.33". This is a suitable summary of a study. You can
    immediately see the most probable effect size and the uncertainty of the measurement. Using this
    summary, you can quickly compare this study to other studies like it,
    and ideally you can combine all the findings in a weighted average.
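
    As a sketch of the "weighted average" idea (Python with NumPy; fixed-effect, inverse-variance pooling on the log scale; the first row of numbers echoes the example above and the others are made up):

```python
import numpy as np

# Hypothetical study summaries: (risk ratio, 95% CI low, 95% CI high).
studies = [(1.20, 0.97, 1.33), (1.10, 0.90, 1.35), (1.35, 1.05, 1.74)]

log_rr = np.log([rr for rr, lo, hi in studies])
se = np.array([(np.log(hi) - np.log(lo)) / (2 * 1.96) for rr, lo, hi in studies])

w = 1.0 / se**2                                  # inverse-variance weights
pooled = np.sum(w * log_rr) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))

print("pooled RR:", round(float(np.exp(pooled)), 3))
print("95% CI:", np.exp([pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se]))
```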



    Unfortunately, such studies are often summarized as "We did not find a statistically significant
    increase of the risk ratio". This is a valid conclusion of the study above. But it is not a
    suitable summary of the study, because you can't easily compare studies using these kinds of summaries. You don't
    know which study had the most precise measurement, and you can't intuit what the finding of a meta-study might be.
    And you don't immediately spot when studies claim "non-significant risk ratio increase" by having confidence
    intervals that are so large you can hide an elephant in them.


















    • That depends on one's null hypothesis. For example, rejecting $H_0: |\theta| \ge \Delta$ provides evidence of an absence of effect larger than an arbitrarily small $\Delta$. – Alexis, 2 days ago



















Answer (4 votes)

    For a didactic introduction to the problem, Alex Reinhart wrote a book, Statistics Done Wrong, that is fully available online and also published by No Starch Press (with more content):
    https://www.statisticsdonewrong.com



    It explains the root of the problem without sophisticated maths and has specific chapters with examples from simulated data sets:



    https://www.statisticsdonewrong.com/p-value.html



    https://www.statisticsdonewrong.com/regression.html



    In the second link, a graphical example illustrates the p-value problem. The p-value is often used as a single indicator of statistical difference between datasets, but it is clearly not enough on its own.



    Edit for a more detailed answer:



    In many cases, studies aim to reproduce a precise type of data, either physical measurements (say, the number of particles in an accelerator during a specific experiment) or quantitative indicators (like the number of patients developing specific symptoms during drug tests). In either situation, many factors can interfere with the measurement process, like human error or system variations (people reacting differently to the same drug). This is the reason experiments are, if possible, repeated hundreds of times, and drug testing is ideally done on cohorts of thousands of patients.



    The data set is then reduced to its simplest summary values using statistics: means, standard deviations and so on. The problem in comparing models through their means is that the measured values are only indicators of the true values, and they also vary statistically depending on the number and precision of the individual measurements. We have ways to make a good guess about which measures are likely to be the same and which are not, but only with limited certainty. The usual convention is that if we have less than a one-in-twenty chance of being wrong when saying two values are different, we consider them "statistically different" (that's the meaning of $P<0.05$); otherwise we do not conclude.



    This leads to the odd conclusions illustrated in Nature's article, where two studies give the same mean values but the researchers' conclusions differ because of the sample sizes. This, and other traps of statistical vocabulary and habits, is becoming more and more important in the sciences. Another side of the problem is that people tend to forget that they are using statistical tools and draw conclusions about an effect without properly verifying the statistical power of their samples.
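
    A sketch of that "same effect, different conclusion" pattern (Python with NumPy and SciPy; the effect size, noise level and sample sizes are all made up):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
true_shift = 0.3              # the same underlying effect in both hypothetical studies

for n in (20, 2000):          # one small study, one large study
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_shift, 1.0, n)
    print(f"n = {n:4d}: p = {ttest_ind(a, b).pvalue:.4f}")

# Typically the small study comes out "non-significant" and the large one
# "significant", even though both estimate the same underlying effect.
```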



    For another illustration, the social and life sciences are currently going through a true replication crisis, partly because many effects were taken for granted by people who didn't check the statistical power of famous studies (while others falsified data, but that is another problem).


















    • Oh that looks really helpful for the uninitiated or those whose initiation has expired decades ago. Thanks! – uhoh, Mar 21 at 13:12

    • While not just a link, this answer has all the salient characteristics of a "link only answer". To improve this answer, please incorporate the key points into the answer itself. Ideally, your answer should be useful as an answer even if the content of the links disappears. – R.M., Mar 21 at 18:22

    • About p-values and the base rate fallacy (mentioned in your link), Veritasium published this video called the bayesian trap. – jjmontes, Mar 21 at 20:42

    • Sorry then, I'll try to improve and develop the answer as soon as possible. My idea was also to provide useful material for the curious reader. – G.Clavier, Mar 22 at 16:43

    • @G.Clavier and the self-described statistics newbie and curious reader appreciates it! – uhoh, Mar 22 at 22:17


















Answer (3 votes)

    It is "significant" that statisticians, not just scientists, are rising up and objecting to the loose use of "significance" and $P$ values. The most recent issue of The American Statistician is devoted entirely to this matter. See especially the lead editorial by Wasserman, Schirm, and Lazar.


























Answer (2 votes)

      It is a fact that for several reasons, p-values have indeed become a problem.



      However, despite their weaknesses, they have important advantages such as simplicity and intuitive theory. Therefore, while overall I agree with the Comment in Nature, I do think that rather than ditching statistical significance completely, a more balanced solution is needed. Here are a few options:



      1. "Changing the default P-value threshold for
      statistical significance from 0.05 to 0.005 for claims of new
      discoveries".
      In my view, Benjamin et al addressed very well the
      most compelling arguments against adopting a higher standard of
      evidence.



      2. Adopting the second-generation p-values. These seem
      to be a reasonable solution to most of the problems affecting
      classical p-values. As Blume et al say here, second-generation
      p-values could help "improve rigor, reproducibility, & transparency in statistical analyses."



      3. Redefining p-value as "a quantitative measure of certainty — a
      “confidence index” — that an observed relationship, or claim,
      is true."
      This could help change analysis goal from achieving significance to appropriately estimating this confidence.



      Importantly, "results that do not reach the threshold for statistical significance or “confidence” (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods."



      I think that could help mitigate the obsession with p-values by leading journals, which is behind the misuse of p-values.
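
      For a rough sense of what option 1 asks for in practice, here is a small sketch (Python with SciPy, two-sided tests under a normal approximation) comparing the evidence required at the two thresholds:

```python
from scipy.stats import norm

# How much stricter is p < 0.005 than the usual p < 0.05?  (two-sided, normal approximation)
for alpha in (0.05, 0.005):
    z = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha}: need |z| > {z:.2f}")

# alpha = 0.05  -> |z| > 1.96
# alpha = 0.005 -> |z| > 2.81, i.e. roughly (2.81/1.96)**2 ~ 2x the sample size
#                  for the same true effect, since |z| grows like sqrt(n)
```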


























Answer (0 votes)

        One thing that has not been mentioned is that error or significance are statistical estimates, not actual physical measurements: they depend heavily on the data you have available and how you process it. You can only provide a precise value of error and significance if you have measured every possible event. This is usually not the case, far from it!



        Therefore, every estimate of error or significance, in this case any given P-value, is by definition inaccurate and should not be trusted to describe the underlying research – let alone phenomena! – accurately. In fact, it should not be trusted to convey anything about results WITHOUT knowledge of what is being represented, how the error was estimated and what was done to quality control the data. For example, one way to reduce estimated error is to remove outliers. If this removal is also done statistically, then how can you actually know the outliers were real errors instead of unlikely real measurements that should be included in the error? How could the reduced error improve the significance of the results? What about erroneous measurements near the estimates? They improve the error and can impact statistical significance but can lead to wrong conclusions!
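
        A tiny illustration of the outlier point (Python with NumPy; the numbers are made up): dropping a point labelled an "outlier" shrinks the estimated error, whether or not that point was a genuine mistake.

```python
import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 14.0])   # one suspicious-looking point

print(np.std(x, ddof=1))        # ~1.64 with the suspicious point kept
print(np.std(x[:-1], ddof=1))   # ~0.16 after dropping it: the estimated error "improves",
                                # whether or not 14.0 was a genuine measurement
```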



        For that matter, I do physical modeling and have created models myself where the 3-sigma error is completely unphysical. That is, statistically there's around one event in a thousand (well... more often than that, but I digress) that would result in a completely ridiculous value. The magnitude of a 3-sigma error in my field is roughly equivalent to a best possible estimate of 1 cm turning out to be a meter every now and then. However, this is indeed an accepted result when providing a statistical +/- interval calculated from physical, empirical data in my field. Sure, the narrowness of the uncertainty interval is respected, but often the value of the best-guess estimate is the more useful result, even when the nominal error interval would be larger.



        As a side note, I was once personally responsible for one of those one-in-a-thousand outliers. I was in the process of calibrating an instrument when an event happened that we were supposed to measure. Alas, that data point would have been exactly one of those 100-fold outliers, so in a sense, they DO happen and are included in the modeling error!






        – Geenimetsuri (new contributor)












        • "You can only provide accurate measure, if you have measured every possible event." Hmm. So, accuracy is hopeless? And also irrelevant? Please expand on the difference between accuracy and bias. Are the inaccurate estimates biased or unbiased? If they are unbiased, then aren't they a little bit useful? "For example, one way to reduce error is to remove outliers." Hmm. That will reduce sample variance, but "error"? "...often the value of best guess estimate is more useful result even when nominal error interval would be larger" I don't deny that a good prior is better than a bad experiment. – Peter Leopold, Mar 21 at 16:27

        • Modified the text a bit based on your comment. What I meant was that statistical measure of error is always an estimate unless you have all the possible individual tests, so to speak, available. This rarely happens, except when e.g. polling a set number of people (n.b. not as samples from larger crowd or general population). – Geenimetsuri, Mar 21 at 17:49

        • I am a practitioner who uses statistics rather than a statistician. I think a basic problem with p values is that many who are not familiar with what they are confuse them with substantive significance. Thus I have been asked to determine which slopes are important by using p values regardless of whether the slopes are large or not. A similar problem is using them to determine relative impact of variables (which is critical to me, but which gets surprisingly little attention in the regression literature). – user54285, Mar 22 at 23:10









        protected by gung Mar 21 at 13:31



        Thank you for your interest in this question.
        Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



        Would you like to answer one of these unanswered questions instead?














        10 Answers
        10






        active

        oldest

        votes








        10 Answers
        10






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        62












        $begingroup$

        The first three points, as far as I can tell, are a variation on a single argument.



        Scientists often treat uncertainty measurements ($12 pm 1 $, for instance) as probability distributions that look like this:



        uniform probability distribution



        When actually, they are much more likely to look like this:
        enter image description here



        As a former chemist, I can confirm that many scientists with non-mathematical backgrounds (primarily non-physical chemists and biologists) don't really understand how uncertainty (or error, as they call it) is supposed to work. They recall a time in undergrad physics where they maybe had to use them, possibly even having to calculate a compound error through several different measurements, but they never really understood them. I too was guilty of this, and assumed all measurements had to come within the $pm$ interval. Only recently (and outside academia), did I find out that error measurements usually refer to a certain standard deviation, not an absolute limit.



        So to break down the numbered points in the article:



        1. Measurements outside the CI still have a chance of happening, because the real (likely gaussian) probability is non-zero there (or anywhere for that matter, although they become vanishingly small when you get far out). If the values after the $pm$ do indeed represent one s.d., then there is still a 32% chance of a data point falling outside of them.


        2. The distribution is not uniform (flat topped, as in the first graph), it is peaked. You are more likely to get a value in the middle than you are at the edges. It's like rolling a bunch of dice, rather than a single die.


        3. 95% is an arbitrary cutoff, and coincides almost exactly with two standard deviations.


        4. This point is more of a comment on academic honesty in general. A realisation I had during my PhD is that science isn't some abstract force, it is the cumulative efforts of people attempting to do science. These are people who are trying to discover new things about the universe, but at the same time are also trying to keep their kids fed and keep their jobs, which unfortunately in modern times means some form of publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications.


        Arbitrary thresholds such as $p < 0.05$ can often be self-perpetuating, especially among those who don't fully understand statistics and just need a pass/fail stamp on their results. As such, people do sometimes half-jokingly talk about 'running the test again until you get $p < 0.05$'. It can be very tempting, especially if a Ph.D/grant/employment is riding on the outcome, for these marginal results to be, jiggled around until the desired $p = 0.0498$ shows up in the analysis.



        Such practices can be detrimental to the science as a whole, especially if it is done widely, all in the pursuit of a number which is in the eyes of nature, meaningless. This part in effect is exhorting scientists to be honest about their data and work, even when that honesty is to their detriment.






        share|cite|improve this answer











        $endgroup$








        • 25




          $begingroup$
          +1 for "... publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications." There was an interesting paper that came out years back that talks about how this "publish or perish" leads to compounding error/bias throughout academia: Why Most Published Research Findings Are False (Ioannidis, 2005)
          $endgroup$
          – J. Taylor
          Mar 21 at 8:03






        • 4




          $begingroup$
          I don't agree with “the real (likely Gaussian) uncertainty...” – Gaussian is another oversimplification. It's somewhat more justified than the hard-limits model thanks to the Central Limit Theorem, but the real distribution is generally something different still.
          $endgroup$
          – leftaroundabout
          Mar 21 at 8:43






        • 1




          $begingroup$
          @leftaroundabout The real distribution is likely different still, but unless the value is physically impossible, the probability is likely still mathematically nonzero.
          $endgroup$
          – gerrit
          Mar 21 at 9:38







        • 3




          $begingroup$
          @leftaroundabout saying that the uncertainty is likely Gaussian is not inherently a simplification. It describes a prior distribution, which is justified by the CLT as the best prior in the absence of other supporting data, but by expressing uncertainty over the distribution the acknowledgement that the distribution could well not be Gaussian is already there.
          $endgroup$
          – Will
          Mar 21 at 10:42






        • 7




          $begingroup$
          @inisfree you are very, very mistaken. Many scientific disciplines (like chemistry and biology, as I stated earlier) use almost zero maths, beside basic arithmetic. There are otherwise brilliant scientists out there who are almost math illiterate, and I've met a few of them.
          $endgroup$
          – Ingolifs
          Mar 22 at 6:58















        62












        $begingroup$

        The first three points, as far as I can tell, are a variation on a single argument.



        Scientists often treat uncertainty measurements ($12 pm 1 $, for instance) as probability distributions that look like this:



        uniform probability distribution



        When actually, they are much more likely to look like this:
        enter image description here



        As a former chemist, I can confirm that many scientists with non-mathematical backgrounds (primarily non-physical chemists and biologists) don't really understand how uncertainty (or error, as they call it) is supposed to work. They recall a time in undergrad physics where they maybe had to use them, possibly even having to calculate a compound error through several different measurements, but they never really understood them. I too was guilty of this, and assumed all measurements had to come within the $pm$ interval. Only recently (and outside academia), did I find out that error measurements usually refer to a certain standard deviation, not an absolute limit.



        So to break down the numbered points in the article:



        1. Measurements outside the CI still have a chance of happening, because the real (likely gaussian) probability is non-zero there (or anywhere for that matter, although they become vanishingly small when you get far out). If the values after the $pm$ do indeed represent one s.d., then there is still a 32% chance of a data point falling outside of them.


        2. The distribution is not uniform (flat topped, as in the first graph), it is peaked. You are more likely to get a value in the middle than you are at the edges. It's like rolling a bunch of dice, rather than a single die.


        3. 95% is an arbitrary cutoff, and coincides almost exactly with two standard deviations.


        4. This point is more of a comment on academic honesty in general. A realisation I had during my PhD is that science isn't some abstract force, it is the cumulative efforts of people attempting to do science. These are people who are trying to discover new things about the universe, but at the same time are also trying to keep their kids fed and keep their jobs, which unfortunately in modern times means some form of publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications.


        Arbitrary thresholds such as $p < 0.05$ can often be self-perpetuating, especially among those who don't fully understand statistics and just need a pass/fail stamp on their results. As such, people do sometimes half-jokingly talk about 'running the test again until you get $p < 0.05$'. It can be very tempting, especially if a Ph.D/grant/employment is riding on the outcome, for these marginal results to be, jiggled around until the desired $p = 0.0498$ shows up in the analysis.



        Such practices can be detrimental to the science as a whole, especially if it is done widely, all in the pursuit of a number which is in the eyes of nature, meaningless. This part in effect is exhorting scientists to be honest about their data and work, even when that honesty is to their detriment.






answered Mar 21 at 3:06 (edited Mar 21 at 9:15) – Ingolifs

– J. Taylor (Mar 21 at 8:03): +1 for "... publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications." There was an interesting paper that came out years back that talks about how this "publish or perish" leads to compounding error/bias throughout academia: Why Most Published Research Findings Are False (Ioannidis, 2005)

– leftaroundabout (Mar 21 at 8:43): I don't agree with “the real (likely Gaussian) uncertainty...” – Gaussian is another oversimplification. It's somewhat more justified than the hard-limits model thanks to the Central Limit Theorem, but the real distribution is generally something different still.

– gerrit (Mar 21 at 9:38): @leftaroundabout The real distribution is likely different still, but unless the value is physically impossible, the probability is likely still mathematically nonzero.

– Will (Mar 21 at 10:42): @leftaroundabout Saying that the uncertainty is likely Gaussian is not inherently a simplification. It describes a prior distribution, which is justified by the CLT as the best prior in the absence of other supporting data, but by expressing uncertainty over the distribution the acknowledgement that the distribution could well not be Gaussian is already there.

– Ingolifs (Mar 22 at 6:58): @inisfree You are very, very mistaken. Many scientific disciplines (like chemistry and biology, as I stated earlier) use almost zero maths besides basic arithmetic. There are otherwise brilliant scientists out there who are almost math illiterate, and I've met a few of them.



        I'll try.



1. The confidence interval (which they rename the compatibility interval) shows the values of the parameter that are most compatible with the data. But that doesn't mean the values outside the interval are absolutely incompatible with the data.

2. Values near the middle of the confidence (compatibility) interval are more compatible with the data than values near the ends of the interval.

3. 95% is just a convention. You can compute 90% or 99% or any% intervals (see the sketch below).

4. The confidence/compatibility intervals are only helpful if the experiment was done properly, if the analysis was done according to a preset plan, and the data conform with the assumptions of the analysis methods. If you've got bad data analyzed badly, the compatibility interval is not meaningful or helpful.
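
To make point 3 concrete, here is a minimal sketch (my own, with made-up numbers) showing that the same sample gives a wider or narrower interval purely depending on which conventional level you pick:

    import numpy as np
    from scipy import stats

    # Hypothetical measurements, invented purely for illustration.
    sample = np.array([11.2, 12.1, 12.6, 11.8, 12.4, 11.9, 12.3, 12.0])
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(len(sample))
    df = len(sample) - 1

    for level in (0.90, 0.95, 0.99):
        half_width = stats.t.ppf((1 + level) / 2, df) * sem
        print(f"{level:.0%} interval: {mean - half_width:.2f} to {mean + half_width:.2f}")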





answered Mar 21 at 1:32 – Harvey Motulsky (edited 2 days ago by ttnphns)






















                Much of the article and the figure you include make a very simple point:




                Lack of evidence for an effect is not evidence that it does not exist.




                For example,




                "In our study, mice given cyanide did not die at statistically-significantly higher rates" is not evidence for the claim "cyanide has no effect on mouse deaths".




                Suppose we give two mice a dose of cyanide and one of them dies. In the control group of two mice, neither dies. Since the sample size was so small, this result is not statistically significant ($p > 0.05$). So this experiment does not show a statistically significant effect of cyanide on mouse lifespan. Should we conclude that cyanide has no effect on mice? Obviously not.



                But this is the mistake the authors claim scientists routinely make.



                For example in your figure, the red line could arise from a study on very few mice, while the blue line could arise from the exact same study, but on many mice.



The authors suggest that, instead of using effect sizes and p-values, scientists instead describe the range of possibilities that are more or less compatible with their findings. In our two-mouse experiment, we would have to write that our findings are both compatible with cyanide being very poisonous and with it not being poisonous at all. In a 100-mouse experiment, we might find a confidence interval of $[60\%, 70\%]$ fatality with a point estimate of $65\%$. Then we should write that our results would be most compatible with an assumption that this dose kills 65% of mice, but that our results would also be somewhat compatible with percentages as low as 60% or as high as 70%, and that our results would be less compatible with a truth outside that range. (We should also describe the statistical assumptions we make to compute these numbers.)
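
A rough sketch of both experiments in Python (my own illustration; the 60-70% interval above was only an example, so the numbers below will not match it exactly):

    import numpy as np
    from scipy import stats

    # Two-mouse experiment: 1 of 2 treated mice died, 0 of 2 control mice died.
    table = [[1, 1],   # treated: died, survived
             [0, 2]]   # control: died, survived
    _, p = stats.fisher_exact(table)
    print(f"two-mouse experiment: p = {p:.2f}")   # p = 1.0, nowhere near 'significant'

    # 100-mouse experiment: 65 of 100 treated mice died.
    deaths, n = 65, 100
    p_hat = deaths / n
    half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # normal-approximation 95% CI
    print(f"100-mouse experiment: {p_hat:.0%} fatality, "
          f"95% CI about {p_hat - half_width:.0%} to {p_hat + half_width:.0%}")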






answered Mar 21 at 4:32 – usul







– Nuclear Wang (Mar 21 at 13:23): I disagree with the blanket statement that "absence of evidence is not evidence of absence". Power calculations allow you to determine the likelihood of deeming an effect of a particular size significant, given a particular sample size. Large effect sizes require less data to deem them significantly different from zero, while small effects require a larger sample size. If your study is properly powered, and you are still not seeing significant effects, then you can reasonably conclude that the effect does not exist. If you have sufficient data, non-significance can indeed indicate no effect.

– Bryan Krause (Mar 21 at 15:05): @NuclearWang True, but only if the power analysis is done ahead of time and only if it is done with correct assumptions and then correct interpretations (i.e., your power is only relevant to the magnitude of the effect size that you predict; "80% power" does not mean you have an 80% probability to correctly detect zero effect). Additionally, in my experience the use of "non-significant" to mean "no effect" is often applied to secondary outcomes or rare events, which the study is (appropriately) not powered for anyway. Finally, beta is typically >> alpha.

– usul (Mar 21 at 17:28): @NuclearWang, I don't think anyone is arguing "absence of evidence is NEVER evidence of absence"; I think they are arguing it should not be automatically interpreted as such, and that this is the mistake they see people making.

– Alexis (2 days ago): It's almost like people are not trained in tests for equivalence or something.













The great XKCD did this cartoon a while ago, illustrating the problem. If results with $P > 0.05$ are simplistically treated as proving a hypothesis - and all too often they are - then 1 in 20 hypotheses so proven will actually be false. Similarly, if $P < 0.05$ is taken as disproving a hypothesis, then 1 in 20 true hypotheses will be wrongly rejected. P-values don't tell you whether a hypothesis is true or false; they tell you whether a hypothesis is probably true or false. It seems the referenced article is kicking back against the all-too-common naïve interpretation.
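
A small simulation of the '1 in 20' intuition (my own sketch, in the spirit of the cartoon rather than taken from it): when the null hypothesis is true for every test, about 5% of tests still come out 'significant' at the 0.05 level.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(42)
    n_tests, n = 10_000, 50
    significant = 0

    for _ in range(n_tests):
        # No real difference: both groups are drawn from the same distribution.
        a, b = rng.normal(size=n), rng.normal(size=n)
        if ttest_ind(a, b).pvalue < 0.05:
            significant += 1

    print(significant / n_tests)   # about 0.05: roughly 1 in 20 true nulls get rejected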






answered Mar 21 at 8:34 – digitig (edited Mar 21 at 14:17 by whuber)





– Cliff AB (Mar 21 at 17:54): (-1) P-values don't show you whether a hypothesis is probably true or false. You need a prior distribution for that. See this xkcd, for example. The problematic hand-waving that leads to this confusion is that if we have similar priors for a large number of hypotheses, then the p-value will be proportional to the probability that each is true or false. But before seeing any data, some hypotheses are much more probable than others!

– R.M. (Mar 21 at 18:10): While this effect is something that shouldn't be discounted, it is far from being a significant point of the referenced article.













                tl;dr- It's fundamentally impossible to prove that things are unrelated; statistics can only be used to show when things are related. Despite this well-established fact, people frequently misinterpret a lack of statistical significance to imply a lack of relationship.




A good encryption method should generate a ciphertext that, as far as an attacker can tell, doesn't bear any statistical relationship to the protected message. Because if an attacker can determine some sort of relationship, then they can get information about your protected messages just by looking at the ciphertexts – which is a Bad Thing™.



However, the ciphertext and its corresponding plaintext 100% determine each other. So even if the world's very best mathematicians can't find any significant relationship no matter how hard they try, we still know that the relationship is not merely there, but completely and fully deterministic. This determinism can exist even when we know that it's impossible to find a relationship.



                Despite this, we still get people who'll do stuff like:



                1. Pick some relationship they want to "disprove".


                2. Do some study on it that's inadequate to detect the alleged relationship.


                3. Report the lack of a statistically significant relationship.


                4. Twist this into a lack of relationship.


                This leads to all sorts of "scientific studies" that the media will (falsely) report as disproving the existence of some relationship.



                If you want to design your own study around this, there're a bunch of ways you can do it:



1. Lazy research:
  The easiest way, by far, is to just be incredibly lazy about it. It's just like the figure linked in the question: you can easily get that "'Non-significant' study (high $P$ value)" result by simply having small sample sizes, allowing a lot of noise, and other various lazy things. In fact, if you're so lazy as to not collect any data, then you're already done!


2. Lazy analysis:
  For some silly reason, some people think a Pearson correlation coefficient of $0$ means "no correlation". Which is true, in a very limited sense. But here're a few cases to observe: [figure: data sets with zero Pearson correlation but a clear, non-linear relationship]. That is, there may not be a "linear" relationship, but obviously there can be a more complex one. And it doesn't need to be "encryption"-level complex, but rather "it's actually just a bit of a squiggly line" or "there're two correlations" or whatever. (A short numerical sketch at the end of this answer illustrates this.)


                3. Lazy answering:
                  In the spirit of the above, I'm going to stop here. To, ya know, be lazy!


                But, seriously, the article sums it up well in:




                Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero.
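
Finally, the numerical sketch promised under 'lazy analysis' above (my own illustration, not the answer's original figure): a perfectly deterministic but non-linear relationship can have a Pearson correlation of essentially zero.

    import numpy as np
    from scipy.stats import pearsonr

    x = np.linspace(-1.0, 1.0, 201)
    y = x ** 2                      # y is completely determined by x, but not linearly
    r, _ = pearsonr(x, y)
    print(f"Pearson r = {r:.2e}")   # essentially zero despite the perfect relationship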







– ntg (Mar 22 at 5:21): +1 because what you write is both true and thought-provoking. However, in my humble opinion, you can prove that two quantities are reasonably uncorrelated under certain assumptions. You have to of course first start by e.g. supposing a certain distribution for them, but this can be based on the laws of physics, or statistics (e.g. the speeds of the molecules of a gas in a container are expected to be Gaussian, and so on).

– Nat (Mar 22 at 8:05): @ntg Yeah, it's hard to know how to word some of this stuff, so I left a lot out. I mean, the general truth is that we can't disprove that some relationship exists, though we can generally demonstrate that a specific relationship doesn't exist. Sorta like, we can't establish that two data series are unrelated, but we can establish that they don't appear to be reliably related by a simple linear function.

– Alexis (2 days ago): -1 "tl;dr- It's fundamentally impossible to prove that things are unrelated": Equivalence tests provide evidence of absence of an effect within an arbitrary effect size.

– Nat (yesterday): @Alexis I think you misunderstand equivalence testing; you can use equivalence testing to evidence the absence of a certain relationship holding, e.g. a linear relationship, but not to evidence the absence of any relationship.

– Nat (yesterday): @Alexis Statistical inference can provide you as much evidence of the absence of an effect larger than a specific effect size within the context of some model. Perhaps you're assuming that the model will always be known?















                For me, the most important part was:




                ...[We] urge authors to
                discuss the point estimate, even when they have a large P value or a
                wide interval, as well as discussing the limits of that interval.




In other words: place more emphasis on discussing estimates (the point estimate and its confidence interval), and less emphasis on null-hypothesis testing.



How does this work in practice? A lot of research boils down to measuring effect sizes, for example "We measured a risk ratio of 1.20, with a 95% C.I. ranging from 0.97 to 1.33". This is a suitable summary of a study: you can immediately see the most probable effect size and the uncertainty of the measurement. Using this summary, you can quickly compare the study to other studies like it, and ideally you can combine all the findings in a weighted average.



Unfortunately, such studies are often summarized as "We did not find a statistically significant increase in the risk ratio". That is a valid conclusion for the study above, but it is not a suitable summary of it, because you can't easily compare studies using these kinds of summaries. You don't know which study had the most precise measurement, you can't intuit what the finding of a meta-analysis might be, and you don't immediately spot when a study claims a "non-significant risk ratio increase" only because its confidence interval is so large you could hide an elephant in it.
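
To make that concrete, here is a small Python sketch with invented counts (my own toy numbers, not from any real study): the risk ratio and its 95% confidence interval for two hypothetical studies with the same point estimate. Both would be summarized as "no statistically significant increase", yet one interval is informative and the other is elephant-sized.

    # Risk ratio with a 95% CI from a 2x2 table (standard log-scale Wald interval).
    # The event counts below are invented purely for illustration.
    import math

    def risk_ratio_ci(events_t, n_t, events_c, n_c, z=1.96):
        rr = (events_t / n_t) / (events_c / n_c)
        se_log = math.sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)
        lo = math.exp(math.log(rr) - z * se_log)
        hi = math.exp(math.log(rr) + z * se_log)
        return rr, lo, hi

    # Tiny study: RR = 1.20, but the interval could hide an elephant.
    print("tiny study : RR = %.2f, 95%% CI %.2f to %.2f" % risk_ratio_ci(12, 100, 10, 100))
    # Larger study: same RR = 1.20, interval still crosses 1, but far more informative.
    print("large study: RR = %.2f, 95%% CI %.2f to %.2f" % risk_ratio_ci(180, 1500, 150, 1500))

Reporting the estimate and interval, as the Comment urges, makes the difference in precision immediately visible; the log-scale standard errors are also exactly what an inverse-variance meta-analysis would weight by.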






answered Mar 22 at 9:44 (edited Mar 22 at 11:11) by Martin J.H.












• That depends on one's null hypothesis. For example, rejecting $H_0: |\theta| \ge \Delta$ provides evidence of an absence of effect larger than an arbitrarily small $\Delta$. – Alexis, 2 days ago
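
For readers wondering what the equivalence test mentioned in that comment looks like, here is a rough Python sketch of the TOST ("two one-sided tests") idea on simulated data. The equivalence margin, sample sizes and seed are arbitrary choices of mine, and a normal approximation stands in for the t distribution to keep the sketch short.

    # Rough sketch of an equivalence test (TOST, "two one-sided tests"):
    # conclude equivalence (reject H0: |mean difference| >= delta) only when
    # BOTH one-sided tests reject. Margin, sample sizes and seed are arbitrary.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    delta = 0.5                                  # equivalence margin, chosen in advance
    x = rng.normal(0.0, 1.0, 200)                # two groups with no true difference
    y = rng.normal(0.0, 1.0, 200)

    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1)/len(x) + y.var(ddof=1)/len(y))

    p_lower = 1 - norm.cdf((diff + delta) / se)  # test of H0: difference <= -delta
    p_upper = norm.cdf((diff - delta) / se)      # test of H0: difference >= +delta
    p_tost = max(p_lower, p_upper)
    print(f"difference = {diff:+.3f}, TOST p = {p_tost:.2g}")

A small TOST p-value supports |difference| < delta, i.e., the absence of an effect larger than the margin chosen in advance; as the earlier exchange notes, it says nothing about relationships other than this particular one.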
















For a didactic introduction to the problem, Alex Reinhart wrote a book that is fully available online and also published by No Starch Press (with more content):
https://www.statisticsdonewrong.com



It explains the root of the problem without sophisticated maths and has specific chapters with examples built on simulated data sets:



                https://www.statisticsdonewrong.com/p-value.html



                https://www.statisticsdonewrong.com/regression.html



In the second link, a graphical example illustrates the p-value problem: the p-value is often used as the single indicator of a statistical difference between datasets, but it is clearly not enough on its own.



                Edit for a more detailed answer:



In many cases, studies aim to measure a precise kind of quantity, either physical measurements (say, the number of particles in an accelerator during a specific experiment) or quantitative indicators (like the number of patients developing specific symptoms during a drug trial). In either situation, many factors can interfere with the measurement process, such as human error or natural variation (people reacting differently to the same drug). This is why experiments are repeated hundreds of times when possible, and why drug trials are run, ideally, on cohorts of thousands of patients.



The data set is then reduced to its simplest summary values using statistics: means, standard deviations and so on. The problem with comparing groups through their means is that the measured values are only estimates of the true values, and they fluctuate depending on the number and precision of the individual measurements. We have ways to make a good guess about which measured values are likely to reflect the same underlying quantity and which are not, but only with a certain level of confidence. The usual convention is: if, assuming the two underlying values were actually equal, data at least this different would be seen less than one time in twenty, we call the difference "statistically different" (that is what $P<0.05$ means); otherwise, we do not conclude.



This leads to the odd situation illustrated in Nature's figure, where two studies measure the same quantity and obtain essentially the same estimate, yet the researchers' conclusions differ because of the sample sizes. This, and other traps hidden in statistical vocabulary and habits, is becoming more and more important in the sciences. Another side of the problem is that people tend to forget that they are using statistical tools, and draw conclusions about effects without properly checking the statistical power of their samples.



As another illustration, the social and life sciences are currently going through a real replication crisis, partly because many effects were taken for granted by people who never checked the statistical power of the famous studies involved (others outright falsified data, but that is a separate problem).
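
To tie this back to the figure in the Nature Comment, here is a small Python simulation (my own, with arbitrary effect size, sample sizes and seed): two studies of the same true effect, where the smaller one will usually fail to reach $P<0.05$ simply because its power is low, followed by an approximate power calculation that shows why.

    # Two studies of the SAME true effect (0.2 SD), simulated with arbitrary
    # sample sizes and seed: the small study will usually miss P < 0.05
    # purely for lack of power, even though nothing else differs.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    true_effect = 0.2

    for n in (64, 1000):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        t_stat, p = stats.ttest_ind(treated, control)
        print(f"n = {n:4d} per group: estimate = {treated.mean() - control.mean():+.2f}, p = {p:.3f}")

    # Approximate power of the two-sided test at alpha = 0.05 (normal approximation):
    for n in (64, 1000):
        se = np.sqrt(2.0 / n)
        z = true_effect / se
        power = 1 - stats.norm.cdf(1.96 - z) + stats.norm.cdf(-1.96 - z)
        print(f"n = {n:4d} per group: power to detect a 0.2 SD effect ~ {power:.0%}")

With roughly 20% power, the small study "fails to replicate" the large one about four times out of five, even though nothing differs but the sample size.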






answered Mar 21 at 13:06 (edited 2 days ago) by G.Clavier












• Oh, that looks really helpful for the uninitiated, or for those whose initiation expired decades ago. Thanks! – uhoh, Mar 21 at 13:12

• While not just a link, this answer has all the salient characteristics of a "link only answer". To improve this answer, please incorporate the key points into the answer itself. Ideally, your answer should be useful as an answer even if the content of the links disappears. – R.M., Mar 21 at 18:22

• About p-values and the base rate fallacy (mentioned in your link), Veritasium published this video called the bayesian trap. – jjmontes, Mar 21 at 20:42

• Sorry then, I'll try to improve and develop the answer as soon as possible. My idea was also to provide useful material for the curious reader. – G.Clavier, Mar 22 at 16:43

• @G.Clavier and the self-described statistics newbie and curious reader appreciates it! – uhoh, Mar 22 at 22:17















                It is "significant" that statisticians, not just scientists, are rising up and objecting to the loose use of "significance" and $P$ values. The most recent issue of The American Statistician is devoted entirely to this matter. See especially the lead editorial by Wasserman, Schirm, and Lazar.






answered yesterday by rvl

















For several reasons, p-values have indeed become a problem.



                        However, despite their weaknesses, they have important advantages such as simplicity and intuitive theory. Therefore, while overall I agree with the Comment in Nature, I do think that rather than ditching statistical significance completely, a more balanced solution is needed. Here are a few options:



                        1. "Changing the default P-value threshold for
                        statistical significance from 0.05 to 0.005 for claims of new
                        discoveries".
                        In my view, Benjamin et al addressed very well the
                        most compelling arguments against adopting a higher standard of
                        evidence.



                        2. Adopting the second-generation p-values. These seem
                        to be a reasonable solution to most of the problems affecting
                        classical p-values. As Blume et al say here, second-generation
                        p-values could help "improve rigor, reproducibility, & transparency in statistical analyses."



                        3. Redefining p-value as "a quantitative measure of certainty — a
                        “confidence index” — that an observed relationship, or claim,
                        is true."
                        This could help change analysis goal from achieving significance to appropriately estimating this confidence.



                        Importantly, "results that do not reach the threshold for statistical significance or “confidence” (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods."



                        I think that could help mitigate the obsession with p-values by leading journals, which is behind the misuse of p-values.
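To make points 1 and 2 concrete, here is a minimal Python sketch. The data, the confidence level, and the interval-null bounds are made up for illustration; the second-generation p-value follows, to my reading, the interval-overlap formula described by Blume et al.

    import numpy as np
    from scipy import stats

    # Made-up example: difference in means between two groups of 80.
    rng = np.random.default_rng(1)
    x = rng.normal(0.20, 1.0, 80)   # modest true effect of 0.2 SD
    y = rng.normal(0.00, 1.0, 80)

    t, p = stats.ttest_ind(x, y)
    print(f"p = {p:.4f}")
    print("declared significant at 0.05? ", p < 0.05)
    print("declared significant at 0.005?", p < 0.005)  # stricter default of point 1

    # Second-generation p-value (point 2): overlap between the interval
    # estimate I and an interval null H0 of practically negligible effects.
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    ci = (diff - 1.96 * se, diff + 1.96 * se)  # approximate 95% CI for the difference
    h0 = (-0.1, 0.1)                           # made-up "trivial effect" zone

    def sgpv(interval, null):
        lo, hi = interval
        a, b = null
        overlap = max(0.0, min(hi, b) - max(lo, a))
        len_i, len_h0 = hi - lo, b - a
        # Correction factor for intervals more than twice as wide as the null zone.
        return (overlap / len_i) * max(len_i / (2 * len_h0), 1.0)

    print("95% CI:", ci)
    print("second-generation p-value:", sgpv(ci, h0))

Roughly speaking, a second-generation p-value of 0 means the interval estimate lies entirely outside the trivial-effect zone, 1 means it lies entirely inside it, and intermediate values flag the data as inconclusive, so no single yes/no cut-off is forced on the reader.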






                        $endgroup$

















– Krantz (answered Mar 22 at 18:55, edited 17 hours ago)





















                                $begingroup$

One thing that has not been mentioned is that error and significance are statistical estimates, not actual physical measurements: they depend heavily on the data you have available and on how you process it. You can only provide a precise value of error or significance if you have measured every possible event. This is usually not the case, far from it!



Therefore, every estimate of error or significance, in this case any given P-value, is by definition inaccurate and should not be trusted to describe the underlying research (let alone the phenomena!) accurately. In fact, it should not be trusted to convey anything about the results without knowledge of what is being represented, how the error was estimated, and what was done to quality-control the data. For example, one way to reduce the estimated error is to remove outliers. If this removal is also done statistically, how can you actually know that the outliers were real errors rather than unlikely but genuine measurements that should have been included in the error? How could the reduced error improve the significance of the results? What about erroneous measurements near the estimates? They shrink the estimated error and can improve the apparent statistical significance, yet they can lead to wrong conclusions!
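A tiny simulation (the numbers are invented, not from my field) illustrates how a purely statistical outlier rule can tighten the estimated error and shrink the p-value even when the discarded point was a perfectly genuine measurement:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    sample = rng.normal(0.3, 1.0, 25)   # true mean really is 0.3
    sample[0] = -4.0                    # one extreme but genuine observation

    def report(data, label):
        t, p = stats.ttest_1samp(data, 0.0)
        se = data.std(ddof=1) / np.sqrt(len(data))
        print(f"{label}: mean = {data.mean():.2f}, SE = {se:.2f}, p = {p:.3f}")

    report(sample, "all data        ")
    # "Clean" the data with a purely statistical rule: drop anything beyond 2 SD.
    keep = np.abs(sample - sample.mean()) < 2 * sample.std(ddof=1)
    report(sample[keep], "outliers removed")

The trimmed version will typically look more precise and more "significant", but only because a legitimate observation was thrown away.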



For that matter, I do physical modeling and have created models myself where the 3-sigma error is completely unphysical. That is, statistically around one event in a thousand (well, more often than that, but I digress) would take a completely ridiculous value. The magnitude of a 3-sigma error interval in my field is roughly equivalent to a best possible estimate of 1 cm turning out to be a meter every now and then. However, this is indeed an accepted result when providing a statistical +/- interval calculated from physical, empirical data in my field. Sure, a narrow uncertainty interval is respected, but often the best-guess estimate is the more useful result, even when the nominal error interval would be larger.
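(For reference, under a plain normal model, which is of course itself an assumption, the nominal rate of such 3-sigma events is about 1 in 370; a quick check:)

    from scipy import stats

    # Two-sided probability of landing beyond 3 standard deviations
    # under a normal model: about 0.0027, i.e. roughly 1 in 370.
    p_3sigma = 2 * stats.norm.sf(3)
    print(p_3sigma, "about 1 in", round(1 / p_3sigma))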



As a side note, I was once personally responsible for one of those one-in-a-thousand outliers. I was in the process of calibrating an instrument when an event happened that we were supposed to measure. Alas, that data point would have been exactly one of those 100-fold outliers, so in a sense they DO happen and are included in the modeling error!






$endgroup$

– Geenimetsuri (answered Mar 21 at 13:28, edited Mar 21 at 17:52; new contributor)












                                • $begingroup$
                                  "You can only provide accurate measure, if you have measured every possible event." Hmm. So, accuracy is hopeless? And also irrelevant? Please expand on the difference between accuracy and bias. Are the inaccurate estimates biased or unbiased? If they are unbiased, then aren't they a little bit useful? "For example, one way to reduce error is to remove outliers." Hmm. That will reduce sample variance, but "error"? "...often the value of best guess estimate is more useful result even when nominal error interval would be larger" I don't deny that a good prior is better than a bad experiment.
                                  $endgroup$
                                  – Peter Leopold
                                  Mar 21 at 16:27










                                • $begingroup$
Modified the text a bit based on your comment. What I meant was that a statistical measure of error is always an estimate unless you have all the possible individual tests, so to speak, available. This rarely happens, except when e.g. polling a set number of people (n.b. not as a sample from a larger crowd or the general population).
                                  $endgroup$
                                  – Geenimetsuri
                                  Mar 21 at 17:49







• $begingroup$
                                  I am a practitioner who uses statistics rather than a statistician. I think a basic problem with p values is that many who are not familiar with what they are confuse them with substantive significance. Thus I have been asked to determine which slopes are important by using p values regardless of whether the slopes are large or not. A similar problem is using them to determine relative impact of variables (which is critical to me, but which gets surprisingly little attention in the regression literature).
                                  $endgroup$
                                  – user54285
                                  Mar 22 at 23:10
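A minimal illustration of that last point (the numbers here are invented): with a large enough sample, a practically negligible slope still earns a tiny p-value, so the p-value alone says nothing about whether the slope matters.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 1_000_000
    x = rng.normal(size=n)
    y = 0.01 * x + rng.normal(size=n)   # true slope is tiny: 0.01

    res = stats.linregress(x, y)
    print(f"slope = {res.slope:.4f}, p = {res.pvalue:.1e}, R^2 = {res.rvalue**2:.5f}")
    # The slope is overwhelmingly "statistically significant", yet it explains
    # only about 0.01% of the variance in y, so it is not substantively important.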














