
What does “Scientists rise up against statistical significance” mean? (Comment in Nature)


The Comment in Nature, "Scientists rise up against statistical significance", begins with:




Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly crucial effects.




and later contains statements like:




Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.




I think I can grasp that the image below does not say that the two studies disagree because one "rules out" no effect while the other does not. But the article seems to go into much more depth than I can understand.



Towards the end there seems to be a summary in four points. Is it possible to summarize these in even simpler terms for those of us who read statistics rather than write it?




When talking about compatibility intervals, bear in mind four things.



  • First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible...


  • Second, not all values inside are equally compatible with the data, given the assumptions...


  • Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention...


  • Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval...





Nature: Scientists rise up against statistical significance










Tags: statistical-significance, p-value, bias

asked Mar 21 at 1:19 by uhoh, edited Mar 22 at 22:14

  • Basically, they want to fill research papers with even more false positives! – asdf, Mar 21 at 7:23

  • See the discussion on Gelman's blog: statmodeling.stat.columbia.edu/2019/03/20/…. Obviously the article raises some valid points, but see comments raised by Ioannidis against this article (and also, separately, against the "petition" aspect of it), as quoted by Gelman. – amoeba, Mar 21 at 8:52

  • This isn't a new concept though. Meta-analysis has been a thing for the better part of 50 years, and Cochrane have been doing meta-analyses of medical/healthcare studies (where it's easier to standardise objectives and outcomes) for the last 25 years. – Graham, Mar 21 at 15:11

  • Fundamentally the problem is trying to reduce "uncertainty" which is a multidimensional problem to a single number. – MaxW, Mar 22 at 19:06

  • Basically if people stated "we found no evidence of an association between X and Y" instead of "X and Y are not related" when finding $p > \alpha$ this article wouldn't likely exist. – Firebug, Mar 22 at 19:35










10 Answers

Answer (62 votes)

The first three points, as far as I can tell, are a variation on a single argument.



Scientists often treat uncertainty measurements ($12 \pm 1$, for instance) as probability distributions that look like this:



[image: a uniform (flat-topped) probability distribution]



When actually, they are much more likely to look like this:
[image: a peaked (roughly bell-shaped) probability distribution]



As a former chemist, I can confirm that many scientists with non-mathematical backgrounds (primarily non-physical chemists and biologists) don't really understand how uncertainty (or error, as they call it) is supposed to work. They recall a time in undergrad physics where they maybe had to use them, possibly even having to calculate a compound error through several different measurements, but they never really understood them. I too was guilty of this, and assumed all measurements had to come within the $\pm$ interval. Only recently (and outside academia) did I find out that error measurements usually refer to a certain standard deviation, not an absolute limit.
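
To make the "one standard deviation" reading concrete, here is a minimal numerical sketch (Python with SciPy). It assumes the quoted $\pm$ is one standard deviation of a roughly normal error model, which is exactly the reading described above:

```python
from scipy.stats import norm

# Assumption: "12 ± 1" means mean 12 with standard deviation 1 and roughly normal errors.
mu, sigma = 12.0, 1.0

outside_1sd = 2 * norm.sf(1)   # P(|X - mu| > 1 * sigma)
outside_2sd = 2 * norm.sf(2)   # P(|X - mu| > 2 * sigma)

print(f"outside +/-1 sd: {outside_1sd:.1%}")    # ~31.7%, the "32% chance" in point 1 below
print(f"outside +/-2 sd: {outside_2sd:.1%}")    # ~4.6%, close to the conventional 5% cutoff
print(f"95% half-width: {norm.ppf(0.975) * sigma:.2f}")   # ~1.96 sigma (point 3 below)
```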



So to break down the numbered points in the article:



  1. Measurements outside the CI still have a chance of happening, because the real (likely Gaussian) probability is non-zero there (or anywhere, for that matter, although it becomes vanishingly small far out). If the values after the $\pm$ do indeed represent one s.d., then there is still a 32% chance of a data point falling outside of them.


  2. The distribution is not uniform (flat topped, as in the first graph), it is peaked. You are more likely to get a value in the middle than you are at the edges. It's like rolling a bunch of dice, rather than a single die.


  3. 95% is an arbitrary cutoff, and coincides almost exactly with two standard deviations.


  4. This point is more of a comment on academic honesty in general. A realisation I had during my PhD is that science isn't some abstract force, it is the cumulative efforts of people attempting to do science. These are people who are trying to discover new things about the universe, but at the same time are also trying to keep their kids fed and keep their jobs, which unfortunately in modern times means some form of publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications.


Arbitrary thresholds such as $p < 0.05$ can often be self-perpetuating, especially among those who don't fully understand statistics and just need a pass/fail stamp on their results. As such, people do sometimes half-jokingly talk about 'running the test again until you get $p < 0.05$'. It can be very tempting, especially if a PhD/grant/employment is riding on the outcome, for these marginal results to be jiggled around until the desired $p = 0.0498$ shows up in the analysis.



Such practices can be detrimental to science as a whole, especially if done widely, all in the pursuit of a number which is, in the eyes of nature, meaningless. This part, in effect, exhorts scientists to be honest about their data and work, even when that honesty is to their detriment.














  • +1 for "... publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications." There was an interesting paper that came out years back that talks about how this "publish or perish" leads to compounding error/bias throughout academia: Why Most Published Research Findings Are False (Ioannidis, 2005) – J. Taylor, Mar 21 at 8:03

  • I don't agree with "the real (likely Gaussian) uncertainty..." – Gaussian is another oversimplification. It's somewhat more justified than the hard-limits model thanks to the Central Limit Theorem, but the real distribution is generally something different still. – leftaroundabout, Mar 21 at 8:43

  • @leftaroundabout The real distribution is likely different still, but unless the value is physically impossible, the probability is likely still mathematically nonzero. – gerrit, Mar 21 at 9:38

  • @leftaroundabout saying that the uncertainty is likely Gaussian is not inherently a simplification. It describes a prior distribution, which is justified by the CLT as the best prior in the absence of other supporting data, but by expressing uncertainty over the distribution the acknowledgement that the distribution could well not be Gaussian is already there. – Will, Mar 21 at 10:42

  • @inisfree you are very, very mistaken. Many scientific disciplines (like chemistry and biology, as I stated earlier) use almost zero maths, beside basic arithmetic. There are otherwise brilliant scientists out there who are almost math illiterate, and I've met a few of them. – Ingolifs, Mar 22 at 6:58


















Answer (19 votes)

I'll try.



  1. The confidence interval (which they rename compatibility interval) shows the values of the parameter that are most compatible with the data. But that doesn't mean the values outside the interval are absolutely incompatible with the data.

  2. Values near the middle of the confidence (compatibility) interval are more compatible with the data than values near the ends of the interval.

  3. 95% is just a convention. You can compute 90% or 99% or any% intervals (see the sketch after this list).

  4. The confidence/compatibility intervals are only helpful if the experiment was done properly, if the analysis was done according to a preset plan, and the data conform with the assumption of the analysis methods. If you've got bad data analyzed badly, the compatibility interval is not meaningful or helpful.
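
As a small illustration of point 3, a sketch with made-up numbers (Python with SciPy): the same point estimate and standard error yield wider or narrower intervals purely as a function of the chosen level.

```python
from scipy.stats import norm

# Hypothetical estimate and standard error, chosen only for illustration.
est, se = 2.0, 1.0

for level in (0.90, 0.95, 0.99):
    z = norm.ppf(0.5 + level / 2)               # ~1.64, 1.96, 2.58
    print(f"{level:.0%} interval: ({est - z * se:.2f}, {est + z * se:.2f})")
```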

























Answer (17 votes)

    Much of the article and the figure you include make a very simple point:




    Lack of evidence for an effect is not evidence that it does not exist.




    For example,




    "In our study, mice given cyanide did not die at statistically-significantly higher rates" is not evidence for the claim "cyanide has no effect on mouse deaths".




    Suppose we give two mice a dose of cyanide and one of them dies. In the control group of two mice, neither dies. Since the sample size was so small, this result is not statistically significant ($p > 0.05$). So this experiment does not show a statistically significant effect of cyanide on mouse lifespan. Should we conclude that cyanide has no effect on mice? Obviously not.
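
    To make the two-mouse example concrete, here is a minimal sketch (Python with SciPy). The answer does not name a particular test, so Fisher's exact test is used here only as one reasonable choice for this 2x2 table:

```python
from scipy.stats import fisher_exact

# Two-mouse example: treated group 1 dead / 1 alive, control group 0 dead / 2 alive.
table = [[1, 1],
         [0, 2]]
odds_ratio, p_value = fisher_exact(table)   # two-sided by default
print(p_value)   # 1.0, far above 0.05, yet cyanide obviously has an effect
```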



    But this is the mistake the authors claim scientists routinely make.



    For example in your figure, the red line could arise from a study on very few mice, while the blue line could arise from the exact same study, but on many mice.



    The authors suggest that, instead of reporting effect sizes and p-values, scientists describe the range of possibilities that are more or less compatible with their findings. In our two-mouse experiment, we would have to write that our findings are both compatible with cyanide being very poisonous and with it not being poisonous at all. In a 100-mouse experiment, we might find a confidence interval range of $[60\%, 70\%]$ fatality with a point estimate of $65\%$. Then we should write that our results would be most compatible with an assumption that this dose kills 65% of mice, but our results would also be somewhat compatible with percentages as low as 60 or as high as 70, and that our results would be less compatible with a truth outside that range. (We should also describe what statistical assumptions we make to compute these numbers.)














    • I disagree with the blanket statement that "absence of evidence is not evidence of absence". Power calculations allow you determine the likelihood of deeming an effect of a particular size significant, given a particular sample size. Large effect sizes require less data to deem them significantly different from zero, while small effects require a larger sample size. If your study is properly powered, and you are still not seeing significant effects, then you can reasonably conclude that the effect does not exist. If you have sufficient data, non-significance can indeed indicate no effect. – Nuclear Wang, Mar 21 at 13:23

    • @NuclearWang True, but only if the power analysis is done ahead of time and only if it is done with correct assumptions and then correct interpretations (i.e., your power is only relevant to the magnitude of the effect size that you predict; "80% power" does not mean you have 80% probability to correctly detect zero effect). Additionally, in my experience the use of "non-significant" to mean "no effect" is often applied to secondary outcomes or rare events, which the study is (appropriately) not powered for anyways. Finally, beta is typically >> alpha. – Bryan Krause, Mar 21 at 15:05

    • @NuclearWang, I don't think anyone is arguing "absence of evidence is NEVER evidence of absence", I think they are arguing it should not be automatically interpreted as such, and that this is the mistake they see people making. – usul, Mar 21 at 17:28

    • It's almost like people are not trained in tests for equivalence or something. – Alexis, 2 days ago


















Answer (9 votes)

    The great XKCD did this cartoon a while ago, illustrating the problem. If results with $P \gt 0.05$ are simplistically treated as proving a hypothesis - and all too often they are - then 1 in 20 hypotheses so proven will actually be false. Similarly, if $P \lt 0.05$ is taken as disproving a hypothesis then 1 in 20 true hypotheses will be wrongly rejected. P-values don't tell you whether a hypothesis is true or false, they tell you whether a hypothesis is probably true or false. It seems the referenced article is kicking back against the all-too-common naïve interpretation.
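
    One part of this that is easy to check numerically is what the 0.05 threshold does guarantee: when a null hypothesis is exactly true, about 1 in 20 tests of it will still come out "significant" purely by chance. A small simulation sketch (Python with NumPy and SciPy; the setup is hypothetical, not taken from this answer):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# 10,000 experiments in which the null is exactly true (both groups identical).
n_sig = 0
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, 30)
    b = rng.normal(0.0, 1.0, 30)
    if ttest_ind(a, b).pvalue < 0.05:
        n_sig += 1

print(n_sig / 10_000)   # ~0.05: about 1 in 20 true nulls still looks "significant"
```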






    – digitig (new contributor)








    • (-1) P-values don't show you whether a hypothesis is probably true or false. You need a prior distribution for that. See this xkcd, for example. The problematic hand-waving that leads to this confusion is that if we have similar priors for a large number of hypotheses, then the p-value will be proportional to the probability it is true or false. But before seeing any data, some hypotheses are much more probable than others! – Cliff AB, Mar 21 at 17:54

    • While this effect is something that shouldn't be discounted, it is far from being a significant point of the referenced article. – R.M., Mar 21 at 18:10


















Answer (6 votes)

    tl;dr- It's fundamentally impossible to prove that things are unrelated; statistics can only be used to show when things are related. Despite this well-established fact, people frequently misinterpret a lack of statistical significance to imply a lack of relationship.




    A good encryption method should generate a ciphertext that, as far as an attacker can tell, doesn't bear any statistical relationship to the protected message. Because if an attacker can determine some sort of relationship, then they can get information about your protected messages by just looking at the ciphertexts – which is a Bad Thing™.



    However, the ciphertext and its corresponding plaintext 100% determine each other. So even if the world's very best mathematicians can't find any significant relationship no matter how hard they try, we still obviously know that the relationship isn't just there, but that it's completely and fully deterministic. This determinism can exist even when we know that it's impossible to find a relationship.



    Despite this, we still get people who'll do stuff like:



    1. Pick some relationship they want to "disprove".


    2. Do some study on it that's inadequate to detect the alleged relationship.


    3. Report the lack of a statistically significant relationship.


    4. Twist this into a lack of relationship.


    This leads to all sorts of "scientific studies" that the media will (falsely) report as disproving the existence of some relationship.



    If you want to design your own study around this, there're a bunch of ways you can do it:



    1. Lazy research:
      The easiest way, by far, is to just be incredibly lazy about it. It's just like the figure linked in the question: you can easily get that "'Non-significant' study (high P value)" result by simply having small sample sizes, allowing a lot of noise, and other various lazy things. In fact, if you're so lazy as to not collect any data, then you're already done!


    2. Lazy analysis:
      For some silly reason, some people think a Pearson correlation coefficient of $0$ means "no correlation". Which is true, in a very limited sense. But here are a few cases to observe:
      [image: scatter plots with zero Pearson correlation but an obvious non-linear relationship]
      That is, there may not be a "linear" relationship, but obviously there can be a more complex one. And it doesn't need to be "encryption"-level complex, but rather "it's actually just a bit of a squiggly line" or "there're two correlations" or whatever (see the sketch after this list).


    3. Lazy answering:
      In the spirit of the above, I'm going to stop here. To, ya know, be lazy!
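
    Here is the sketch mentioned in point 2 (Python with NumPy, hypothetical data): a relationship can be perfectly deterministic and still have a Pearson correlation of essentially zero.

```python
import numpy as np

x = np.linspace(-3, 3, 1001)
y = x**2                        # y is completely determined by x, just not linearly

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))              # ~0.0: Pearson r reports no *linear* correlation
```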


    But, seriously, the article sums it up well in:




    Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero.



















    • +1 cause what you write is both true and thought provoking. However, in my humble opinion, you can prove that two quantities are reasonably uncorrelated under certain assumptions. You have to of course first start by e.g. supposing a certain distribution about them, but this can be based on the laws of physics, or statistics (e.g. the speed of molecules of a gas in a container are expected to be Gaussian or so on). – ntg, Mar 22 at 5:21

    • @ntg Yeah, it's hard to know how to word some of this stuff, so I left a lot out. I mean, the general truth is that we can't disprove that some relationship exists, though we can generally demonstrate that a specific relationship doesn't exist. Sorta like, we can't establish that two data series are unrelated, but we can establish that they don't appear to be reliably related by a simple linear function. – Nat, Mar 22 at 8:05

    • -1 "tl;dr- It's fundamentally impossible to prove that things are unrelated": Equivalence tests provide evidence of absence of an effect within an arbitrary effect size. – Alexis, 2 days ago

    • @Alexis I think you misunderstand equivalence testing; you can use equivalence testing to evidence the absence of a certain relationship holding, e.g. a linear relationship, but not evidence the absence of any relationship. – Nat, yesterday

    • @Alexis Statistical inference can provide you as much evidence of the absence of an effect larger than a specific effect size within the context of some model. Perhaps you're assuming that the model will always be known? – Nat, yesterday


















Answer (4 votes)

    For me, the most important part was:




    ...[We] urge authors to
    discuss the point estimate, even when they have a large P value or a
    wide interval, as well as discussing the limits of that interval.




    In other words: Place a higher emphasis on discussing estimates (center and confidence interval), and
    a lower emphasis on "Null-hypothesis testing".



    How does this work in practice? A lot of research boils down to measuring effect sizes, for example
    "We measured a risk ratio of 1.20, with a 95% C.I.
    ranging from 0.97 to 1.33". This is a suitable summary of a study. You can
    immediately see the most probable effect size and the uncertainty of the measurement. Using this
    summary, you can quickly compare this study to other studies like it,
    and ideally you can combine all the findings in a weighted average.
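
    As a sketch of the "weighted average" idea (Python with NumPy; fixed-effect, inverse-variance pooling on the log scale; the first row of numbers echoes the example above and the others are made up):

```python
import numpy as np

# Hypothetical study summaries: (risk ratio, 95% CI low, 95% CI high).
studies = [(1.20, 0.97, 1.33), (1.10, 0.90, 1.35), (1.35, 1.05, 1.74)]

log_rr = np.log([rr for rr, lo, hi in studies])
se = np.array([(np.log(hi) - np.log(lo)) / (2 * 1.96) for rr, lo, hi in studies])

w = 1.0 / se**2                                  # inverse-variance weights
pooled = np.sum(w * log_rr) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))

print("pooled RR:", round(float(np.exp(pooled)), 3))
print("95% CI:", np.exp([pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se]))
```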



    Unfortunately, such studies are often summarized as "We did not find a statistically significant
    increase of the risk ratio". This is a valid conclusion of the study above. But it is not a
    suitable summary of the study, because you can't easily compare studies using these kinds of summaries. You don't
    know which study had the most precise measurement, and you can't intuit what the finding of a meta-study might be.
    And you don't immediately spot when studies claim "non-significant risk ratio increase" by having confidence
    intervals that are so large you can hide an elephant in them.


















    • That depends on one's null hypothesis. For example, rejecting $H_0: |\theta| \ge \Delta$ provides evidence of an absence of effect larger than an arbitrarily small $\Delta$. – Alexis, 2 days ago



















Answer (4 votes)

    For a didactic introduction to the problem, Alex Reinhart wrote a book, Statistics Done Wrong, that is fully available online and also published by No Starch Press (with more content):
    https://www.statisticsdonewrong.com



    It explains the root of the problem without sophisticated maths and has specific chapters with examples from simulated data sets:



    https://www.statisticsdonewrong.com/p-value.html



    https://www.statisticsdonewrong.com/regression.html



    In the second link, a graphical example illustrates the p-value problem. The p-value is often used as a single indicator of statistical difference between datasets, but it is clearly not enough on its own.



    Edit for a more detailed answer:



    In many cases, studies aim to reproduce a precise type of data, either physical measurements (say, the number of particles in an accelerator during a specific experiment) or quantitative indicators (like the number of patients developing specific symptoms during drug tests). In either situation, many factors can interfere with the measurement process, like human error or system variations (people reacting differently to the same drug). This is the reason experiments are, if possible, repeated hundreds of times, and drug testing is ideally done on cohorts of thousands of patients.



    The data set is then reduced to its simplest summary values using statistics: means, standard deviations and so on. The problem in comparing models through their means is that the measured values are only indicators of the true values, and they also vary statistically depending on the number and precision of the individual measurements. We have ways to make a good guess about which measures are likely to be the same and which are not, but only with limited certainty. The usual convention is that if we have less than a one-in-twenty chance of being wrong when saying two values are different, we consider them "statistically different" (that's the meaning of $P<0.05$); otherwise we do not conclude.



    This leads to the odd conclusions illustrated in Nature's article, where two studies give the same mean values but the researchers' conclusions differ because of the sample sizes. This, and other traps of statistical vocabulary and habits, is becoming more and more important in the sciences. Another side of the problem is that people tend to forget that they are using statistical tools and draw conclusions about an effect without properly verifying the statistical power of their samples.
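
    A sketch of that "same effect, different conclusion" pattern (Python with NumPy and SciPy; the effect size, noise level and sample sizes are all made up):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
true_shift = 0.3              # the same underlying effect in both hypothetical studies

for n in (20, 2000):          # one small study, one large study
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_shift, 1.0, n)
    print(f"n = {n:4d}: p = {ttest_ind(a, b).pvalue:.4f}")

# Typically the small study comes out "non-significant" and the large one
# "significant", even though both estimate the same underlying effect.
```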



    For another illustration, the social and life sciences are currently going through a true replication crisis, partly because many effects were taken for granted by people who didn't check the statistical power of famous studies (while others falsified data, but that is another problem).


















    • Oh that looks really helpful for the uninitiated or those whose initiation has expired decades ago. Thanks! – uhoh, Mar 21 at 13:12

    • While not just a link, this answer has all the salient characteristics of a "link only answer". To improve this answer, please incorporate the key points into the answer itself. Ideally, your answer should be useful as an answer even if the content of the links disappears. – R.M., Mar 21 at 18:22

    • About p-values and the base rate fallacy (mentioned in your link), Veritasium published this video called the bayesian trap. – jjmontes, Mar 21 at 20:42

    • Sorry then, I'll try to improve and develop the answer as soon as possible. My idea was also to provide useful material for the curious reader. – G.Clavier, Mar 22 at 16:43

    • @G.Clavier and the self-described statistics newbie and curious reader appreciates it! – uhoh, Mar 22 at 22:17


















Answer (3 votes)

    It is "significant" that statisticians, not just scientists, are rising up and objecting to the loose use of "significance" and $P$ values. The most recent issue of The American Statistician is devoted entirely to this matter. See especially the lead editorial by Wasserman, Schirm, and Lazar.


























Answer (2 votes)

      It is a fact that for several reasons, p-values have indeed become a problem.



      However, despite their weaknesses, they have important advantages such as simplicity and intuitive theory. Therefore, while overall I agree with the Comment in Nature, I do think that rather than ditching statistical significance completely, a more balanced solution is needed. Here are a few options:



      1. "Changing the default P-value threshold for
      statistical significance from 0.05 to 0.005 for claims of new
      discoveries".
      In my view, Benjamin et al addressed very well the
      most compelling arguments against adopting a higher standard of
      evidence.



      2. Adopting the second-generation p-values. These seem
      to be a reasonable solution to most of the problems affecting
      classical p-values. As Blume et al say here, second-generation
      p-values could help "improve rigor, reproducibility, & transparency in statistical analyses."



      3. Redefining p-value as "a quantitative measure of certainty — a
      “confidence index” — that an observed relationship, or claim,
      is true."
      This could help change analysis goal from achieving significance to appropriately estimating this confidence.



      Importantly, "results that do not reach the threshold for statistical significance or “confidence” (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods."



      I think that could help mitigate the obsession with p-values by leading journals, which is behind the misuse of p-values.
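
      For a rough sense of what option 1 asks for in practice, here is a small sketch (Python with SciPy, two-sided tests under a normal approximation) comparing the evidence required at the two thresholds:

```python
from scipy.stats import norm

# How much stricter is p < 0.005 than the usual p < 0.05?  (two-sided, normal approximation)
for alpha in (0.05, 0.005):
    z = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha}: need |z| > {z:.2f}")

# alpha = 0.05  -> |z| > 1.96
# alpha = 0.005 -> |z| > 2.81, i.e. roughly (2.81/1.96)**2 ~ 2x the sample size
#                  for the same true effect, since |z| grows like sqrt(n)
```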


























Answer (0 votes)

        One thing that has not been mentioned is that error or significance are statistical estimates, not actual physical measurements: they depend heavily on the data you have available and how you process it. You can only provide a precise value of error and significance if you have measured every possible event. This is usually not the case, far from it!



        Therefore, every estimate of error or significance, in this case any given P-value, is by definition inaccurate and should not be trusted to describe the underlying research – let alone phenomena! – accurately. In fact, it should not be trusted to convey anything about results WITHOUT knowledge of what is being represented, how the error was estimated and what was done to quality control the data. For example, one way to reduce estimated error is to remove outliers. If this removal is also done statistically, then how can you actually know the outliers were real errors instead of unlikely real measurements that should be included in the error? How could the reduced error improve the significance of the results? What about erroneous measurements near the estimates? They improve the error and can impact statistical significance but can lead to wrong conclusions!
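
        A tiny illustration of the outlier point (Python with NumPy; the numbers are made up): dropping a point labelled an "outlier" shrinks the estimated error, whether or not that point was a genuine mistake.

```python
import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 14.0])   # one suspicious-looking point

print(np.std(x, ddof=1))        # ~1.64 with the suspicious point kept
print(np.std(x[:-1], ddof=1))   # ~0.16 after dropping it: the estimated error "improves",
                                # whether or not 14.0 was a genuine measurement
```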



        For that matter, I do physical modeling and have created models myself where the 3-sigma error is completely unphysical. That is, statistically there's around one event in a thousand (well... more often than that, but I digress) that would result in a completely ridiculous value. The magnitude of a 3-sigma error in my field is roughly equivalent to a best possible estimate of 1 cm turning out to be a meter every now and then. However, this is indeed an accepted result when providing a statistical +/- interval calculated from physical, empirical data in my field. Sure, the narrowness of the uncertainty interval is respected, but often the value of the best-guess estimate is the more useful result, even when the nominal error interval would be larger.



        As a side note, I was once personally responsible for one of those one-in-a-thousand outliers. I was in the process of calibrating an instrument when an event happened that we were supposed to measure. Alas, that data point would have been exactly one of those 100-fold outliers, so in a sense, they DO happen and are included in the modeling error!






        – Geenimetsuri (new contributor)












        • "You can only provide accurate measure, if you have measured every possible event." Hmm. So, accuracy is hopeless? And also irrelevant? Please expand on the difference between accuracy and bias. Are the inaccurate estimates biased or unbiased? If they are unbiased, then aren't they a little bit useful? "For example, one way to reduce error is to remove outliers." Hmm. That will reduce sample variance, but "error"? "...often the value of best guess estimate is more useful result even when nominal error interval would be larger" I don't deny that a good prior is better than a bad experiment. – Peter Leopold, Mar 21 at 16:27

        • Modified the text a bit based on your comment. What I meant was that statistical measure of error is always an estimate unless you have all the possible individual tests, so to speak, available. This rarely happens, except when e.g. polling a set number of people (n.b. not as samples from larger crowd or general population). – Geenimetsuri, Mar 21 at 17:49

        • I am a practitioner who uses statistics rather than a statistician. I think a basic problem with p values is that many who are not familiar with what they are confuse them with substantive significance. Thus I have been asked to determine which slopes are important by using p values regardless of whether the slopes are large or not. A similar problem is using them to determine relative impact of variables (which is critical to me, but which gets surprisingly little attention in the regression literature). – user54285, Mar 22 at 23:10









        protected by gung Mar 21 at 13:31



        Thank you for your interest in this question.
        Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



        Would you like to answer one of these unanswered questions instead?














        10 Answers
        10






        active

        oldest

        votes








        10 Answers
        10






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        62












        $begingroup$

        The first three points, as far as I can tell, are a variation on a single argument.



        Scientists often treat uncertainty measurements ($12 pm 1 $, for instance) as probability distributions that look like this:



        uniform probability distribution



        When actually, they are much more likely to look like this:
        enter image description here



        As a former chemist, I can confirm that many scientists with non-mathematical backgrounds (primarily non-physical chemists and biologists) don't really understand how uncertainty (or error, as they call it) is supposed to work. They recall a time in undergrad physics where they maybe had to use them, possibly even having to calculate a compound error through several different measurements, but they never really understood them. I too was guilty of this, and assumed all measurements had to come within the $pm$ interval. Only recently (and outside academia), did I find out that error measurements usually refer to a certain standard deviation, not an absolute limit.



        So to break down the numbered points in the article:



        1. Measurements outside the CI still have a chance of happening, because the real (likely gaussian) probability is non-zero there (or anywhere for that matter, although they become vanishingly small when you get far out). If the values after the $pm$ do indeed represent one s.d., then there is still a 32% chance of a data point falling outside of them.


        2. The distribution is not uniform (flat topped, as in the first graph), it is peaked. You are more likely to get a value in the middle than you are at the edges. It's like rolling a bunch of dice, rather than a single die.


        3. 95% is an arbitrary cutoff, and coincides almost exactly with two standard deviations.


        4. This point is more of a comment on academic honesty in general. A realisation I had during my PhD is that science isn't some abstract force, it is the cumulative efforts of people attempting to do science. These are people who are trying to discover new things about the universe, but at the same time are also trying to keep their kids fed and keep their jobs, which unfortunately in modern times means some form of publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications.


        Arbitrary thresholds such as $p < 0.05$ can often be self-perpetuating, especially among those who don't fully understand statistics and just need a pass/fail stamp on their results. As such, people do sometimes half-jokingly talk about 'running the test again until you get $p < 0.05$'. It can be very tempting, especially if a Ph.D/grant/employment is riding on the outcome, for these marginal results to be, jiggled around until the desired $p = 0.0498$ shows up in the analysis.



        Such practices can be detrimental to the science as a whole, especially if it is done widely, all in the pursuit of a number which is in the eyes of nature, meaningless. This part in effect is exhorting scientists to be honest about their data and work, even when that honesty is to their detriment.






        share|cite|improve this answer











        $endgroup$








        • 25




          $begingroup$
          +1 for "... publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications." There was an interesting paper that came out years back that talks about how this "publish or perish" leads to compounding error/bias throughout academia: Why Most Published Research Findings Are False (Ioannidis, 2005)
          $endgroup$
          – J. Taylor
          Mar 21 at 8:03






        • 4




          $begingroup$
          I don't agree with “the real (likely Gaussian) uncertainty...” – Gaussian is another oversimplification. It's somewhat more justified than the hard-limits model thanks to the Central Limit Theorem, but the real distribution is generally something different still.
          $endgroup$
          – leftaroundabout
          Mar 21 at 8:43






        • 1




          $begingroup$
          @leftaroundabout The real distribution is likely different still, but unless the value is physically impossible, the probability is likely still mathematically nonzero.
          $endgroup$
          – gerrit
          Mar 21 at 9:38







        • 3




          $begingroup$
          @leftaroundabout saying that the uncertainty is likely Gaussian is not inherently a simplification. It describes a prior distribution, which is justified by the CLT as the best prior in the absence of other supporting data, but by expressing uncertainty over the distribution the acknowledgement that the distribution could well not be Gaussian is already there.
          $endgroup$
          – Will
          Mar 21 at 10:42






        • 7




          $begingroup$
          @inisfree you are very, very mistaken. Many scientific disciplines (like chemistry and biology, as I stated earlier) use almost zero maths, beside basic arithmetic. There are otherwise brilliant scientists out there who are almost math illiterate, and I've met a few of them.
          $endgroup$
          – Ingolifs
          Mar 22 at 6:58















        62












        $begingroup$

        The first three points, as far as I can tell, are a variation on a single argument.



        Scientists often treat uncertainty measurements ($12 pm 1 $, for instance) as probability distributions that look like this:



        uniform probability distribution



        When actually, they are much more likely to look like this:
        enter image description here



        As a former chemist, I can confirm that many scientists with non-mathematical backgrounds (primarily non-physical chemists and biologists) don't really understand how uncertainty (or error, as they call it) is supposed to work. They recall a time in undergrad physics where they maybe had to use them, possibly even having to calculate a compound error through several different measurements, but they never really understood them. I too was guilty of this, and assumed all measurements had to come within the $pm$ interval. Only recently (and outside academia), did I find out that error measurements usually refer to a certain standard deviation, not an absolute limit.



        So to break down the numbered points in the article:



        1. Measurements outside the CI still have a chance of happening, because the real (likely gaussian) probability is non-zero there (or anywhere for that matter, although they become vanishingly small when you get far out). If the values after the $pm$ do indeed represent one s.d., then there is still a 32% chance of a data point falling outside of them.


        2. The distribution is not uniform (flat topped, as in the first graph), it is peaked. You are more likely to get a value in the middle than you are at the edges. It's like rolling a bunch of dice, rather than a single die.


        3. 95% is an arbitrary cutoff, and coincides almost exactly with two standard deviations.


        4. This point is more of a comment on academic honesty in general. A realisation I had during my PhD is that science isn't some abstract force, it is the cumulative efforts of people attempting to do science. These are people who are trying to discover new things about the universe, but at the same time are also trying to keep their kids fed and keep their jobs, which unfortunately in modern times means some form of publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications.


        Arbitrary thresholds such as $p < 0.05$ can often be self-perpetuating, especially among those who don't fully understand statistics and just need a pass/fail stamp on their results. As such, people do sometimes half-jokingly talk about 'running the test again until you get $p < 0.05$'. It can be very tempting, especially if a Ph.D/grant/employment is riding on the outcome, for these marginal results to be, jiggled around until the desired $p = 0.0498$ shows up in the analysis.



        Such practices can be detrimental to the science as a whole, especially if it is done widely, all in the pursuit of a number which is in the eyes of nature, meaningless. This part in effect is exhorting scientists to be honest about their data and work, even when that honesty is to their detriment.






answered Mar 21 at 3:06 (edited Mar 21 at 9:15) – Ingolifs

– J. Taylor (Mar 21 at 8:03): +1 for "... publish or perish is at play. In reality, scientists depend on discoveries that are both true and interesting, because uninteresting results don't result in publications." There was an interesting paper that came out years back that talks about how this "publish or perish" leads to compounding error/bias throughout academia: Why Most Published Research Findings Are False (Ioannidis, 2005)

– leftaroundabout (Mar 21 at 8:43): I don't agree with “the real (likely Gaussian) uncertainty...” – Gaussian is another oversimplification. It's somewhat more justified than the hard-limits model thanks to the Central Limit Theorem, but the real distribution is generally something different still.

– gerrit (Mar 21 at 9:38): @leftaroundabout The real distribution is likely different still, but unless the value is physically impossible, the probability is likely still mathematically nonzero.

– Will (Mar 21 at 10:42): @leftaroundabout Saying that the uncertainty is likely Gaussian is not inherently a simplification. It describes a prior distribution, which is justified by the CLT as the best prior in the absence of other supporting data, but by expressing uncertainty over the distribution the acknowledgement that the distribution could well not be Gaussian is already there.

– Ingolifs (Mar 22 at 6:58): @inisfree You are very, very mistaken. Many scientific disciplines (like chemistry and biology, as I stated earlier) use almost zero maths besides basic arithmetic. There are otherwise brilliant scientists out there who are almost math illiterate, and I've met a few of them.



        I'll try.



1. The confidence interval (which they rename the compatibility interval) shows the values of the parameter that are most compatible with the data. But that doesn't mean the values outside the interval are absolutely incompatible with the data.

2. Values near the middle of the confidence (compatibility) interval are more compatible with the data than values near the ends of the interval.

3. 95% is just a convention. You can compute 90% or 99% or any% intervals (see the sketch below).

4. The confidence/compatibility intervals are only helpful if the experiment was done properly, if the analysis was done according to a preset plan, and the data conform with the assumptions of the analysis methods. If you've got bad data analyzed badly, the compatibility interval is not meaningful or helpful.
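
To make point 3 concrete, here is a minimal sketch (my own, with made-up numbers) showing that the same sample gives a wider or narrower interval purely depending on which conventional level you pick:

    import numpy as np
    from scipy import stats

    # Hypothetical measurements, invented purely for illustration.
    sample = np.array([11.2, 12.1, 12.6, 11.8, 12.4, 11.9, 12.3, 12.0])
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(len(sample))
    df = len(sample) - 1

    for level in (0.90, 0.95, 0.99):
        half_width = stats.t.ppf((1 + level) / 2, df) * sem
        print(f"{level:.0%} interval: {mean - half_width:.2f} to {mean + half_width:.2f}")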





answered Mar 21 at 1:32 – Harvey Motulsky (edited 2 days ago by ttnphns)






















                Much of the article and the figure you include make a very simple point:




                Lack of evidence for an effect is not evidence that it does not exist.




                For example,




                "In our study, mice given cyanide did not die at statistically-significantly higher rates" is not evidence for the claim "cyanide has no effect on mouse deaths".




                Suppose we give two mice a dose of cyanide and one of them dies. In the control group of two mice, neither dies. Since the sample size was so small, this result is not statistically significant ($p > 0.05$). So this experiment does not show a statistically significant effect of cyanide on mouse lifespan. Should we conclude that cyanide has no effect on mice? Obviously not.



                But this is the mistake the authors claim scientists routinely make.



                For example in your figure, the red line could arise from a study on very few mice, while the blue line could arise from the exact same study, but on many mice.



The authors suggest that, instead of using effect sizes and p-values, scientists instead describe the range of possibilities that are more or less compatible with their findings. In our two-mouse experiment, we would have to write that our findings are both compatible with cyanide being very poisonous and with it not being poisonous at all. In a 100-mouse experiment, we might find a confidence interval of $[60\%, 70\%]$ fatality with a point estimate of $65\%$. Then we should write that our results would be most compatible with an assumption that this dose kills 65% of mice, but that our results would also be somewhat compatible with percentages as low as 60% or as high as 70%, and that our results would be less compatible with a truth outside that range. (We should also describe the statistical assumptions we make to compute these numbers.)
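
A rough sketch of both experiments in Python (my own illustration; the 60-70% interval above was only an example, so the numbers below will not match it exactly):

    import numpy as np
    from scipy import stats

    # Two-mouse experiment: 1 of 2 treated mice died, 0 of 2 control mice died.
    table = [[1, 1],   # treated: died, survived
             [0, 2]]   # control: died, survived
    _, p = stats.fisher_exact(table)
    print(f"two-mouse experiment: p = {p:.2f}")   # p = 1.0, nowhere near 'significant'

    # 100-mouse experiment: 65 of 100 treated mice died.
    deaths, n = 65, 100
    p_hat = deaths / n
    half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # normal-approximation 95% CI
    print(f"100-mouse experiment: {p_hat:.0%} fatality, "
          f"95% CI about {p_hat - half_width:.0%} to {p_hat + half_width:.0%}")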






answered Mar 21 at 4:32 – usul







– Nuclear Wang (Mar 21 at 13:23): I disagree with the blanket statement that "absence of evidence is not evidence of absence". Power calculations allow you to determine the likelihood of deeming an effect of a particular size significant, given a particular sample size. Large effect sizes require less data to deem them significantly different from zero, while small effects require a larger sample size. If your study is properly powered, and you are still not seeing significant effects, then you can reasonably conclude that the effect does not exist. If you have sufficient data, non-significance can indeed indicate no effect.

– Bryan Krause (Mar 21 at 15:05): @NuclearWang True, but only if the power analysis is done ahead of time and only if it is done with correct assumptions and then correct interpretations (i.e., your power is only relevant to the magnitude of the effect size that you predict; "80% power" does not mean you have an 80% probability to correctly detect zero effect). Additionally, in my experience the use of "non-significant" to mean "no effect" is often applied to secondary outcomes or rare events, which the study is (appropriately) not powered for anyway. Finally, beta is typically >> alpha.

– usul (Mar 21 at 17:28): @NuclearWang, I don't think anyone is arguing "absence of evidence is NEVER evidence of absence"; I think they are arguing it should not be automatically interpreted as such, and that this is the mistake they see people making.

– Alexis (2 days ago): It's almost like people are not trained in tests for equivalence or something.













The great XKCD did this cartoon a while ago, illustrating the problem. If results with $P > 0.05$ are simplistically treated as proving a hypothesis - and all too often they are - then 1 in 20 hypotheses so proven will actually be false. Similarly, if $P < 0.05$ is taken as disproving a hypothesis, then 1 in 20 true hypotheses will be wrongly rejected. P-values don't tell you whether a hypothesis is true or false; they tell you whether a hypothesis is probably true or false. It seems the referenced article is kicking back against the all-too-common naïve interpretation.
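
A small simulation of the '1 in 20' intuition (my own sketch, in the spirit of the cartoon rather than taken from it): when the null hypothesis is true for every test, about 5% of tests still come out 'significant' at the 0.05 level.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(42)
    n_tests, n = 10_000, 50
    significant = 0

    for _ in range(n_tests):
        # No real difference: both groups are drawn from the same distribution.
        a, b = rng.normal(size=n), rng.normal(size=n)
        if ttest_ind(a, b).pvalue < 0.05:
            significant += 1

    print(significant / n_tests)   # about 0.05: roughly 1 in 20 true nulls get rejected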






answered Mar 21 at 8:34 – digitig (edited Mar 21 at 14:17 by whuber)





– Cliff AB (Mar 21 at 17:54): (-1) P-values don't show you whether a hypothesis is probably true or false. You need a prior distribution for that. See this xkcd, for example. The problematic hand-waving that leads to this confusion is that if we have similar priors for a large number of hypotheses, then the p-value will be proportional to the probability that each is true or false. But before seeing any data, some hypotheses are much more probable than others!

– R.M. (Mar 21 at 18:10): While this effect is something that shouldn't be discounted, it is far from being a significant point of the referenced article.













                tl;dr- It's fundamentally impossible to prove that things are unrelated; statistics can only be used to show when things are related. Despite this well-established fact, people frequently misinterpret a lack of statistical significance to imply a lack of relationship.




A good encryption method should generate a ciphertext that, as far as an attacker can tell, doesn't bear any statistical relationship to the protected message. Because if an attacker can determine some sort of relationship, then they can get information about your protected messages just by looking at the ciphertexts – which is a Bad Thing™.



However, the ciphertext and its corresponding plaintext 100% determine each other. So even if the world's very best mathematicians can't find any significant relationship no matter how hard they try, we still know that the relationship is not merely there, but completely and fully deterministic. This determinism can exist even when we know that it's impossible to find a relationship.



                Despite this, we still get people who'll do stuff like:



                1. Pick some relationship they want to "disprove".


                2. Do some study on it that's inadequate to detect the alleged relationship.


                3. Report the lack of a statistically significant relationship.


                4. Twist this into a lack of relationship.


                This leads to all sorts of "scientific studies" that the media will (falsely) report as disproving the existence of some relationship.



                If you want to design your own study around this, there're a bunch of ways you can do it:



1. Lazy research:
  The easiest way, by far, is to just be incredibly lazy about it. It's just like the figure linked in the question: you can easily get that "'Non-significant' study (high $P$ value)" result by simply having small sample sizes, allowing a lot of noise, and other various lazy things. In fact, if you're so lazy as to not collect any data, then you're already done!


2. Lazy analysis:
  For some silly reason, some people think a Pearson correlation coefficient of $0$ means "no correlation". Which is true, in a very limited sense. But here're a few cases to observe: [figure: data sets with zero Pearson correlation but a clear, non-linear relationship]. That is, there may not be a "linear" relationship, but obviously there can be a more complex one. And it doesn't need to be "encryption"-level complex, but rather "it's actually just a bit of a squiggly line" or "there're two correlations" or whatever. (A short numerical sketch at the end of this answer illustrates this.)


                3. Lazy answering:
                  In the spirit of the above, I'm going to stop here. To, ya know, be lazy!


                But, seriously, the article sums it up well in:




                Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero.
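
Finally, the numerical sketch promised under 'lazy analysis' above (my own illustration, not the answer's original figure): a perfectly deterministic but non-linear relationship can have a Pearson correlation of essentially zero.

    import numpy as np
    from scipy.stats import pearsonr

    x = np.linspace(-1.0, 1.0, 201)
    y = x ** 2                      # y is completely determined by x, but not linearly
    r, _ = pearsonr(x, y)
    print(f"Pearson r = {r:.2e}")   # essentially zero despite the perfect relationship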







– ntg (Mar 22 at 5:21): +1 because what you write is both true and thought-provoking. However, in my humble opinion, you can prove that two quantities are reasonably uncorrelated under certain assumptions. You have to of course first start by e.g. supposing a certain distribution for them, but this can be based on the laws of physics, or statistics (e.g. the speeds of the molecules of a gas in a container are expected to be Gaussian, and so on).

– Nat (Mar 22 at 8:05): @ntg Yeah, it's hard to know how to word some of this stuff, so I left a lot out. I mean, the general truth is that we can't disprove that some relationship exists, though we can generally demonstrate that a specific relationship doesn't exist. Sorta like, we can't establish that two data series are unrelated, but we can establish that they don't appear to be reliably related by a simple linear function.

– Alexis (2 days ago): -1 "tl;dr- It's fundamentally impossible to prove that things are unrelated": Equivalence tests provide evidence of absence of an effect within an arbitrary effect size.

– Nat (yesterday): @Alexis I think you misunderstand equivalence testing; you can use equivalence testing to evidence the absence of a certain relationship holding, e.g. a linear relationship, but not to evidence the absence of any relationship.

– Nat (yesterday): @Alexis Statistical inference can provide you as much evidence of the absence of an effect larger than a specific effect size within the context of some model. Perhaps you're assuming that the model will always be known?















                For me, the most important part was:




                ...[We] urge authors to
                discuss the point estimate, even when they have a large P value or a
                wide interval, as well as discussing the limits of that interval.




In other words: place more emphasis on discussing estimates (the point estimate and its confidence interval), and less emphasis on null-hypothesis testing.



How does this work in practice? A lot of research boils down to measuring effect sizes, for example "We measured a risk ratio of 1.20, with a 95% C.I. ranging from 0.97 to 1.33". This is a suitable summary of a study: you can immediately see the most probable effect size and the uncertainty of the measurement. Using this summary, you can quickly compare the study to other studies like it, and ideally you can combine all the findings in a weighted average.



Unfortunately, such studies are often summarized as "We did not find a statistically significant increase in the risk ratio". That is a valid conclusion for the study above, but it is not a suitable summary of it, because you can't easily compare studies using these kinds of summaries. You don't know which study had the most precise measurement, you can't intuit what the finding of a meta-analysis might be, and you don't immediately spot when a study claims a "non-significant risk ratio increase" only because its confidence interval is so large you could hide an elephant in it.
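
To make that concrete, here is a small Python sketch with invented counts (my own toy numbers, not from any real study): the risk ratio and its 95% confidence interval for two hypothetical studies with the same point estimate. Both would be summarized as "no statistically significant increase", yet one interval is informative and the other is elephant-sized.

    # Risk ratio with a 95% CI from a 2x2 table (standard log-scale Wald interval).
    # The event counts below are invented purely for illustration.
    import math

    def risk_ratio_ci(events_t, n_t, events_c, n_c, z=1.96):
        rr = (events_t / n_t) / (events_c / n_c)
        se_log = math.sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)
        lo = math.exp(math.log(rr) - z * se_log)
        hi = math.exp(math.log(rr) + z * se_log)
        return rr, lo, hi

    # Tiny study: RR = 1.20, but the interval could hide an elephant.
    print("tiny study : RR = %.2f, 95%% CI %.2f to %.2f" % risk_ratio_ci(12, 100, 10, 100))
    # Larger study: same RR = 1.20, interval still crosses 1, but far more informative.
    print("large study: RR = %.2f, 95%% CI %.2f to %.2f" % risk_ratio_ci(180, 1500, 150, 1500))

Reporting the estimate and interval, as the Comment urges, makes the difference in precision immediately visible; the log-scale standard errors are also exactly what an inverse-variance meta-analysis would weight by.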






answered Mar 22 at 9:44 (edited Mar 22 at 11:11) by Martin J.H.












• That depends on one's null hypothesis. For example, rejecting $H_0: |\theta| \ge \Delta$ provides evidence of an absence of effect larger than an arbitrarily small $\Delta$. – Alexis, 2 days ago
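
For readers wondering what the equivalence test mentioned in that comment looks like, here is a rough Python sketch of the TOST ("two one-sided tests") idea on simulated data. The equivalence margin, sample sizes and seed are arbitrary choices of mine, and a normal approximation stands in for the t distribution to keep the sketch short.

    # Rough sketch of an equivalence test (TOST, "two one-sided tests"):
    # conclude equivalence (reject H0: |mean difference| >= delta) only when
    # BOTH one-sided tests reject. Margin, sample sizes and seed are arbitrary.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    delta = 0.5                                  # equivalence margin, chosen in advance
    x = rng.normal(0.0, 1.0, 200)                # two groups with no true difference
    y = rng.normal(0.0, 1.0, 200)

    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1)/len(x) + y.var(ddof=1)/len(y))

    p_lower = 1 - norm.cdf((diff + delta) / se)  # test of H0: difference <= -delta
    p_upper = norm.cdf((diff - delta) / se)      # test of H0: difference >= +delta
    p_tost = max(p_lower, p_upper)
    print(f"difference = {diff:+.3f}, TOST p = {p_tost:.2g}")

A small TOST p-value supports |difference| < delta, i.e., the absence of an effect larger than the margin chosen in advance; as the earlier exchange notes, it says nothing about relationships other than this particular one.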
















For a didactic introduction to the problem, Alex Reinhart wrote a book that is fully available online and also published by No Starch Press (with more content):
https://www.statisticsdonewrong.com



It explains the root of the problem without sophisticated maths and has specific chapters with examples built on simulated data sets:



                https://www.statisticsdonewrong.com/p-value.html



                https://www.statisticsdonewrong.com/regression.html



In the second link, a graphical example illustrates the p-value problem: the p-value is often used as the single indicator of a statistical difference between datasets, but it is clearly not enough on its own.



                Edit for a more detailed answer:



In many cases, studies aim to measure a precise kind of quantity, either physical measurements (say, the number of particles in an accelerator during a specific experiment) or quantitative indicators (like the number of patients developing specific symptoms during a drug trial). In either situation, many factors can interfere with the measurement process, such as human error or natural variation (people reacting differently to the same drug). This is why experiments are repeated hundreds of times when possible, and why drug trials are run, ideally, on cohorts of thousands of patients.



The data set is then reduced to its simplest summary values using statistics: means, standard deviations and so on. The problem with comparing groups through their means is that the measured values are only estimates of the true values, and they fluctuate depending on the number and precision of the individual measurements. We have ways to make a good guess about which measured values are likely to reflect the same underlying quantity and which are not, but only with a certain level of confidence. The usual convention is: if, assuming the two underlying values were actually equal, data at least this different would be seen less than one time in twenty, we call the difference "statistically different" (that is what $P<0.05$ means); otherwise, we do not conclude.



This leads to the odd situation illustrated in Nature's figure, where two studies measure the same quantity and obtain essentially the same estimate, yet the researchers' conclusions differ because of the sample sizes. This, and other traps hidden in statistical vocabulary and habits, is becoming more and more important in the sciences. Another side of the problem is that people tend to forget that they are using statistical tools, and draw conclusions about effects without properly checking the statistical power of their samples.



As another illustration, the social and life sciences are currently going through a real replication crisis, partly because many effects were taken for granted by people who never checked the statistical power of the famous studies involved (others outright falsified data, but that is a separate problem).
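
To tie this back to the figure in the Nature Comment, here is a small Python simulation (my own, with arbitrary effect size, sample sizes and seed): two studies of the same true effect, where the smaller one will usually fail to reach $P<0.05$ simply because its power is low, followed by an approximate power calculation that shows why.

    # Two studies of the SAME true effect (0.2 SD), simulated with arbitrary
    # sample sizes and seed: the small study will usually miss P < 0.05
    # purely for lack of power, even though nothing else differs.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    true_effect = 0.2

    for n in (64, 1000):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        t_stat, p = stats.ttest_ind(treated, control)
        print(f"n = {n:4d} per group: estimate = {treated.mean() - control.mean():+.2f}, p = {p:.3f}")

    # Approximate power of the two-sided test at alpha = 0.05 (normal approximation):
    for n in (64, 1000):
        se = np.sqrt(2.0 / n)
        z = true_effect / se
        power = 1 - stats.norm.cdf(1.96 - z) + stats.norm.cdf(-1.96 - z)
        print(f"n = {n:4d} per group: power to detect a 0.2 SD effect ~ {power:.0%}")

With roughly 20% power, the small study "fails to replicate" the large one about four times out of five, even though nothing differs but the sample size.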






answered Mar 21 at 13:06 (edited 2 days ago) by G.Clavier












• Oh, that looks really helpful for the uninitiated, or for those whose initiation expired decades ago. Thanks! – uhoh, Mar 21 at 13:12

• While not just a link, this answer has all the salient characteristics of a "link only answer". To improve this answer, please incorporate the key points into the answer itself. Ideally, your answer should be useful as an answer even if the content of the links disappears. – R.M., Mar 21 at 18:22

• About p-values and the base rate fallacy (mentioned in your link), Veritasium published this video called the bayesian trap. – jjmontes, Mar 21 at 20:42

• Sorry then, I'll try to improve and develop the answer as soon as possible. My idea was also to provide useful material for the curious reader. – G.Clavier, Mar 22 at 16:43

• @G.Clavier and the self-described statistics newbie and curious reader appreciates it! – uhoh, Mar 22 at 22:17















                It is "significant" that statisticians, not just scientists, are rising up and objecting to the loose use of "significance" and $P$ values. The most recent issue of The American Statistician is devoted entirely to this matter. See especially the lead editorial by Wasserman, Schirm, and Lazar.






answered yesterday by rvl

















For several reasons, p-values have indeed become a problem.



                        However, despite their weaknesses, they have important advantages such as simplicity and intuitive theory. Therefore, while overall I agree with the Comment in Nature, I do think that rather than ditching statistical significance completely, a more balanced solution is needed. Here are a few options:



                        1. "Changing the default P-value threshold for
                        statistical significance from 0.05 to 0.005 for claims of new
                        discoveries".
                        In my view, Benjamin et al addressed very well the
                        most compelling arguments against adopting a higher standard of
                        evidence.



                        2. Adopting the second-generation p-values. These seem
                        to be a reasonable solution to most of the problems affecting
                        classical p-values. As Blume et al say here, second-generation
                        p-values could help "improve rigor, reproducibility, & transparency in statistical analyses."



                        3. Redefining p-value as "a quantitative measure of certainty — a
                        “confidence index” — that an observed relationship, or claim,
                        is true."
                        This could help change analysis goal from achieving significance to appropriately estimating this confidence.



                        Importantly, "results that do not reach the threshold for statistical significance or “confidence” (whatever it is) can still be important and merit publication in leading journals if they address important research questions with rigorous methods."



                        I think that could help mitigate the obsession with p-values by leading journals, which is behind the misuse of p-values.
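To make points 1 and 2 concrete, here is a minimal Python sketch. The data, the confidence level, and the interval-null bounds are made up for illustration; the second-generation p-value follows, to my reading, the interval-overlap formula described by Blume et al.

    import numpy as np
    from scipy import stats

    # Made-up example: difference in means between two groups of 80.
    rng = np.random.default_rng(1)
    x = rng.normal(0.20, 1.0, 80)   # modest true effect of 0.2 SD
    y = rng.normal(0.00, 1.0, 80)

    t, p = stats.ttest_ind(x, y)
    print(f"p = {p:.4f}")
    print("declared significant at 0.05? ", p < 0.05)
    print("declared significant at 0.005?", p < 0.005)  # stricter default of point 1

    # Second-generation p-value (point 2): overlap between the interval
    # estimate I and an interval null H0 of practically negligible effects.
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    ci = (diff - 1.96 * se, diff + 1.96 * se)  # approximate 95% CI for the difference
    h0 = (-0.1, 0.1)                           # made-up "trivial effect" zone

    def sgpv(interval, null):
        lo, hi = interval
        a, b = null
        overlap = max(0.0, min(hi, b) - max(lo, a))
        len_i, len_h0 = hi - lo, b - a
        # Correction factor for intervals more than twice as wide as the null zone.
        return (overlap / len_i) * max(len_i / (2 * len_h0), 1.0)

    print("95% CI:", ci)
    print("second-generation p-value:", sgpv(ci, h0))

Roughly speaking, a second-generation p-value of 0 means the interval estimate lies entirely outside the trivial-effect zone, 1 means it lies entirely inside it, and intermediate values flag the data as inconclusive, so no single yes/no cut-off is forced on the reader.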






                        $endgroup$

















– Krantz (answered Mar 22 at 18:55, edited 17 hours ago)





















                                $begingroup$

One thing that has not been mentioned is that error and significance are statistical estimates, not actual physical measurements: they depend heavily on the data you have available and on how you process it. You can only provide a precise value of error or significance if you have measured every possible event. This is usually not the case, far from it!



Therefore, every estimate of error or significance, in this case any given P-value, is by definition inaccurate and should not be trusted to describe the underlying research (let alone the phenomena!) accurately. In fact, it should not be trusted to convey anything about the results without knowledge of what is being represented, how the error was estimated, and what was done to quality-control the data. For example, one way to reduce the estimated error is to remove outliers. If this removal is also done statistically, how can you actually know that the outliers were real errors rather than unlikely but genuine measurements that should have been included in the error? How could the reduced error improve the significance of the results? What about erroneous measurements near the estimates? They shrink the estimated error and can improve the apparent statistical significance, yet they can lead to wrong conclusions!
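A tiny simulation (the numbers are invented, not from my field) illustrates how a purely statistical outlier rule can tighten the estimated error and shrink the p-value even when the discarded point was a perfectly genuine measurement:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    sample = rng.normal(0.3, 1.0, 25)   # true mean really is 0.3
    sample[0] = -4.0                    # one extreme but genuine observation

    def report(data, label):
        t, p = stats.ttest_1samp(data, 0.0)
        se = data.std(ddof=1) / np.sqrt(len(data))
        print(f"{label}: mean = {data.mean():.2f}, SE = {se:.2f}, p = {p:.3f}")

    report(sample, "all data        ")
    # "Clean" the data with a purely statistical rule: drop anything beyond 2 SD.
    keep = np.abs(sample - sample.mean()) < 2 * sample.std(ddof=1)
    report(sample[keep], "outliers removed")

The trimmed version will typically look more precise and more "significant", but only because a legitimate observation was thrown away.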



For that matter, I do physical modeling and have created models myself where the 3-sigma error is completely unphysical. That is, statistically around one event in a thousand (well, more often than that, but I digress) would take a completely ridiculous value. The magnitude of a 3-sigma error interval in my field is roughly equivalent to a best possible estimate of 1 cm turning out to be a meter every now and then. However, this is indeed an accepted result when providing a statistical +/- interval calculated from physical, empirical data in my field. Sure, a narrow uncertainty interval is respected, but often the best-guess estimate is the more useful result, even when the nominal error interval would be larger.
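(For reference, under a plain normal model, which is of course itself an assumption, the nominal rate of such 3-sigma events is about 1 in 370; a quick check:)

    from scipy import stats

    # Two-sided probability of landing beyond 3 standard deviations
    # under a normal model: about 0.0027, i.e. roughly 1 in 370.
    p_3sigma = 2 * stats.norm.sf(3)
    print(p_3sigma, "about 1 in", round(1 / p_3sigma))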



As a side note, I was once personally responsible for one of those one-in-a-thousand outliers. I was in the process of calibrating an instrument when an event happened that we were supposed to measure. Alas, that data point would have been exactly one of those 100-fold outliers, so in a sense they DO happen and are included in the modeling error!






$endgroup$

– Geenimetsuri (answered Mar 21 at 13:28, edited Mar 21 at 17:52; new contributor)












                                • $begingroup$
                                  "You can only provide accurate measure, if you have measured every possible event." Hmm. So, accuracy is hopeless? And also irrelevant? Please expand on the difference between accuracy and bias. Are the inaccurate estimates biased or unbiased? If they are unbiased, then aren't they a little bit useful? "For example, one way to reduce error is to remove outliers." Hmm. That will reduce sample variance, but "error"? "...often the value of best guess estimate is more useful result even when nominal error interval would be larger" I don't deny that a good prior is better than a bad experiment.
                                  $endgroup$
                                  – Peter Leopold
                                  Mar 21 at 16:27










                                • $begingroup$
Modified the text a bit based on your comment. What I meant was that a statistical measure of error is always an estimate unless you have all the possible individual tests, so to speak, available. This rarely happens, except when e.g. polling a set number of people (n.b. not as a sample from a larger crowd or the general population).
                                  $endgroup$
                                  – Geenimetsuri
                                  Mar 21 at 17:49







• $begingroup$
                                  I am a practitioner who uses statistics rather than a statistician. I think a basic problem with p values is that many who are not familiar with what they are confuse them with substantive significance. Thus I have been asked to determine which slopes are important by using p values regardless of whether the slopes are large or not. A similar problem is using them to determine relative impact of variables (which is critical to me, but which gets surprisingly little attention in the regression literature).
                                  $endgroup$
                                  – user54285
                                  Mar 22 at 23:10
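A minimal illustration of that last point (the numbers here are invented): with a large enough sample, a practically negligible slope still earns a tiny p-value, so the p-value alone says nothing about whether the slope matters.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 1_000_000
    x = rng.normal(size=n)
    y = 0.01 * x + rng.normal(size=n)   # true slope is tiny: 0.01

    res = stats.linregress(x, y)
    print(f"slope = {res.slope:.4f}, p = {res.pvalue:.1e}, R^2 = {res.rvalue**2:.5f}")
    # The slope is overwhelmingly "statistically significant", yet it explains
    # only about 0.01% of the variance in y, so it is not substantively important.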














