Table of Contents
- What Actually Got “Banned” (and by Whom)
- Why This Hit a Nerve
- So… Are Significance Tests “Bad” or Just Misused?
- What Editors and Methodologists Want Instead
- How a “No Significance Testing” Policy Changes Writing and Reviewing
- Practical Examples: Same Data, Different Story
- What This Means for Students, Readers, and the Public
- Conclusion: A Ban is a Signal, Not a Salvation
- Field Notes: Researcher Experiences Around the “No Significance Testing” Shift
Somewhere, a lonely asterisk just fell off a results table and rolled under the lab fridge. In 2015, a psychology journal did what most of us only fantasize about during Reviewer #2 season: it told “statistical significance” to pack its bags.
The headline version is irresistible: a psychology journal banned significance testing. The longer version is even more interesting, because it’s not just academic drama. It’s a snapshot of a bigger argument about how we decide what counts as “evidence,” why so many findings don’t replicate, and why p-values got promoted from “tool” to “trophy.”
What Actually Got “Banned” (and by Whom)
The journal at the center of the story was Basic and Applied Social Psychology (BASP). Its editors announced they would no longer publish papers that rely on the traditional null hypothesis significance testing framework, often shortened to NHST. That means no p-values, no test-statistic worship (think t and F values), and no “statistically significant / not significant” language as a stand-in for scientific reasoning.
And here’s the twist people forget: BASP didn’t just side-eye p-values. The editorial also rejected confidence intervals (CIs) for similar “inverse inference” reasons, while leaving the door cracked open for Bayesian approaches on a case-by-case basis. In other words: “Don’t bring the usual significance-testing party tricks, and don’t sneak in a CI as a fake mustache.”
Whether you think that policy was brave, misguided, refreshing, or all three, it did something undeniably useful: it forced a lot of researchers to say out loud what they had been muttering into their coffee for years: “Maybe we’re leaning too hard on p < .05.”
Why This Hit a Nerve
The p-value became a personality trait
In theory, a p-value is a specific thing: a measure of how incompatible your data are with a particular model (often a “no effect” null hypothesis), given a pile of assumptions. In practice, it often gets treated like a verdict from a tiny statistical courtroom: “Guilty of an effect” if p is small, “case dismissed” if it isn’t.
The trouble is that a p-value can’t do all the jobs we assigned it. It doesn’t tell you the size of an effect. It doesn’t tell you whether the effect matters in real life. It doesn’t tell you how likely your hypothesis is to be true. It’s one piece of evidence, not the whole evidence buffet.
“Significant” can mean “tiny but detectable,” not “important”
Here’s the classic mismatch: with a huge sample, you can detect a microscopic difference and still get p < .05. With a tiny sample, you can miss a meaningful difference and get a non-significant result. In other words, “statistical significance” is partly a function of sample size, which is why it can reward large studies and punish underpowered ones.
Imagine testing a new study app meant to improve exam scores. You run it on 10,000 students and the average score rises by 0.2 points on a 100-point test. That could easily be “statistically significant” and also totally irrelevant to any student who enjoys passing. A result can be real, measurable, and still not worth the hype.
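To put rough numbers on the study-app example, here is a minimal Python sketch. It assumes each student’s score change is measured before and after using the app, and that those changes have a standard deviation of 8 points. That 8 is a made-up figure (the example doesn’t give one), and `z_test_p_value` is an illustrative helper, not a library function.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(mean_change, sd_change, n):
    """Two-sided p-value for H0: mean change = 0, using a normal approximation."""
    standard_error = sd_change / sqrt(n)
    z = mean_change / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


MEAN_CHANGE = 0.2  # points on a 100-point test (from the example above)
SD_CHANGE = 8.0    # ASSUMED spread of individual score changes; hypothetical

p_large = z_test_p_value(MEAN_CHANGE, SD_CHANGE, n=10_000)
p_small = z_test_p_value(MEAN_CHANGE, SD_CHANGE, n=100)
standardized_effect = MEAN_CHANGE / SD_CHANGE  # mean change in SD units

print(f"n=10,000: p = {p_large:.3f}")
print(f"n=100:    p = {p_small:.3f}")
print(f"standardized effect = {standardized_effect:.3f}")
```

Under these assumptions, the same 0.2-point change clears p < .05 with 10,000 students and comes nowhere close with 100, while the standardized effect stays at 0.025 either way.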
Publication incentives turned the threshold into a finish line
Even when everyone knows “p-values aren’t everything,” the ecosystem often acts like they are. Journals prefer clean narratives. Researchers face pressure to publish. “Significant” results look like success, while null results get treated like awkward silence.
Add that to flexible analytic choices (different exclusion rules, different covariates, different endpoints) and you get the modern laboratory miracle known as “we tried a few reasonable options and the stars aligned at .049.” The result might not be fraud; it might simply be the math equivalent of turning your head until the picture frame looks straight.
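A rough way to see why that flexibility matters: if each analysis has a 5% chance of a false positive when no effect exists, the chance that at least one of k analyses crosses the line is 1 - 0.95^k. The sketch below assumes the analyses are independent, which real forking paths are not (they reuse the same data, so the actual inflation is usually smaller). `chance_of_any_false_positive` is an illustrative name.

```python
def chance_of_any_false_positive(num_analyses, alpha=0.05):
    """Probability that at least one of several INDEPENDENT tests is a false positive."""
    return 1 - (1 - alpha) ** num_analyses


for k in (1, 3, 5, 10):
    print(f"{k:>2} analyses: {chance_of_any_false_positive(k):.1%}")
```

Under that independence assumption, five tries already push the chance of a spurious “hit” above 20%.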
So… Are Significance Tests “Bad” or Just Misused?
If you listen closely, the most serious critics aren’t saying “burn the p-values.” They’re saying “stop using them like a magic wand.” That’s basically the spirit of the American Statistical Association’s guidance: a p-value can be informative, but it’s easy to misuse, and it should never be the sole basis for scientific or practical conclusions.
The deeper issue is NHST as a culture. For decades, many fields acted as if science is a two-bin sorting machine: effects go in the “real” bin if p < .05, and in the “no effect” bin if it isn’t. Real-world evidence is not a two-bin sorting machine. It’s messy, continuous, and full of uncertainty, like group projects, but with more regression.
BASP’s ban was controversial partly because it swung the pendulum hard. Plenty of methodologists argue that the fix is not to ban tools, but to demand better use: clearer assumptions, more transparency, and more emphasis on estimation, uncertainty, and replication.
What Editors and Methodologists Want Instead
1) Put effect sizes in the headline, not the footnote
If your paper’s main claim is “there is an effect,” the first follow-up question should be: “How big is it?” Effect sizes translate findings into magnitude: differences in means, standardized differences (like Cohen’s d), odds ratios, correlation sizes, risk differences, and so on.
Effect sizes don’t magically solve everything, but they force you to talk about the substance. A tiny effect can matter in some contexts (public health, for example), and a moderate effect can be meaningless if it’s too expensive or too fragile. Either way, you’re discussing reality, not just a threshold.
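As a small illustration of “magnitude first,” the sketch below computes a raw mean difference and Cohen’s d (using the pooled standard deviation) for two groups. The scores are invented for the example, and `cohens_d` is a hypothetical helper written here, not a function from any particular stats package.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical exam scores, invented purely for illustration.
control = [70, 72, 68, 75, 65]
treatment = [74, 76, 71, 79, 70]


def cohens_d(group_a, group_b):
    """Standardized mean difference (b minus a) using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_variance)


raw_difference = mean(treatment) - mean(control)
d = cohens_d(control, treatment)

print(f"raw difference: {raw_difference:.1f} points")
print(f"Cohen's d:      {d:.2f}")
```

Reporting both is often useful: the raw difference speaks in the test’s own units, while d says how large that difference is relative to the spread of scores.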
2) Describe uncertainty like you actually mean it
Most modern reporting standards push researchers to quantify uncertainty, typically through confidence intervals, credible intervals, or other interval estimates. BASP’s editorial rejected confidence intervals specifically, but many other mainstream guidelines treat interval estimates as a key part of responsible reporting.
Why? Because readers need to know not just “what you found,” but “how precise it is.” An estimated effect with a wide interval is a different kind of evidence than an estimated effect with a tight interval, even if both have the same p-value.
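Here is a minimal sketch of that last point, using a normal approximation. The two “studies” are hypothetical and chosen so the estimate-to-standard-error ratio is identical; `wald_summary` is an illustrative helper name.

```python
from statistics import NormalDist


def wald_summary(estimate, standard_error, level=0.95):
    """Return (lower, upper, two-sided p-value) under a normal approximation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = estimate - z_crit * standard_error
    upper = estimate + z_crit * standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(estimate / standard_error)))
    return lower, upper, p_value


# Two hypothetical studies with the same estimate-to-standard-error ratio.
noisy = wald_summary(estimate=5.0, standard_error=2.0)
precise = wald_summary(estimate=0.5, standard_error=0.2)

for label, (lower, upper, p_value) in (("noisy", noisy), ("precise", precise)):
    print(f"{label:>7}: 95% interval [{lower:.2f}, {upper:.2f}], p = {p_value:.3f}")
```

Both lines print the same p-value, yet one interval is ten times wider than the other. The p-value alone hides that difference; the interval puts it on the page.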
3) Be honest about decision-making and tradeoffs
In real research, you’re rarely asking “Is the effect exactly zero?” You’re asking something more practical: “Is the effect big enough to matter?” “Is it robust across reasonable analyses?” “Is it consistent with prior evidence?” “What would change my mind?”
This is where preregistration, sensitivity analyses, and transparent reporting shine. Instead of optimizing for a single number, you show the reader your reasoning and your robustness checks. That’s not only better science; it’s also reviewer-repellent in the best way: fewer vague accusations of cherry-picking when you’ve already shown your orchard map.
4) Use Bayesian and model-based approaches when they fit
Bayesian methods can help when you want to update beliefs with data, compare models, or incorporate prior information in a principled way. They also come with responsibilities: priors matter, assumptions matter, and “Bayesian” is not a cheat code for “my result is definitely true.”
Still, the broader takeaway is that good inference can take multiple forms. The goal isn’t to swap one ritual for another. The goal is to match the method to the question and communicate uncertainty clearly.
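To see what “priors matter” looks like in the simplest possible case, here is a sketch of a conjugate Beta-Binomial update for a success rate. Both priors and both datasets are hypothetical, picked only to show the pattern, and `posterior_mean` is an illustrative helper.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a success rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


# Both priors and both datasets are hypothetical.
PRIORS = {"flat Beta(1, 1)": (1, 1), "skeptical Beta(2, 8)": (2, 8)}
DATASETS = {"small study": (7, 10), "large study": (700, 1000)}  # (successes, trials)

for data_label, (successes, trials) in DATASETS.items():
    for prior_label, (a, b) in PRIORS.items():
        estimate = posterior_mean(a, b, successes, trials)
        print(f"{data_label}, {prior_label}: posterior mean = {estimate:.3f}")
```

With 10 observations the two priors give visibly different answers; with 1,000 they nearly agree. That is the practical reason a Bayesian analysis should state its prior and, ideally, show how sensitive the conclusion is to it.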
How a “No Significance Testing” Policy Changes Writing and Reviewing
A ban forces authors to write differently. If you can’t rely on “significant” as a mic-drop, you have to make an argument using design, measurement quality, effect sizes, descriptive patterns, and theoretical coherence.
It also changes what reviewers can demand. Reviewers can’t default to “add a test” or “report the p-values.” Instead, they’re pushed toward questions like: Are the measures valid? Are the results stable across reasonable specifications? Do the graphs and descriptive summaries tell a consistent story? Is the interpretation proportionate to the evidence?
That can be liberating, and also uncomfortable. A single threshold is a blunt tool, but it’s easy to apply. Richer evidence takes more judgment, and judgment is where disagreements live. (See also: literally every academic meeting with snacks.)
Practical Examples: Same Data, Different Story
Example A: The “tiny but significant” win
Suppose a lab studies whether adding “growth mindset” language to homework instructions improves completion rates. With tens of thousands of students per condition, the completion rate rises from 80.0% to 80.6%. A traditional test could yield a small p-value for a difference that small only because the sample is so large.
Under an estimation-first approach, the key question becomes: Is a 0.6 percentage-point increase meaningful for the cost and effort? In a massive school system, maybe yes. In a small classroom, maybe it’s noise. The interpretation depends on context, not a sacred decimal.
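The “depends on context” point can be made with back-of-the-envelope arithmetic. The sketch below takes the 80.0% and 80.6% rates from the example and compares the expected gain with ordinary binomial chance variation for two group sizes. The group sizes (30 and 500,000) are hypothetical stand-ins for “a small classroom” and “a massive school system.”

```python
from math import sqrt

BASELINE_RATE = 0.800  # completion rate without the new wording (from the example)
NEW_RATE = 0.806       # completion rate with it (from the example)


def extra_completions(num_students):
    """Expected additional completions if the 0.6-point gain were real."""
    return (NEW_RATE - BASELINE_RATE) * num_students


def chance_sd(num_students):
    """Binomial standard deviation of the completion count at the baseline rate."""
    return sqrt(num_students * BASELINE_RATE * (1 - BASELINE_RATE))


# Hypothetical group sizes: one classroom and one large school system.
for label, n in (("classroom", 30), ("school system", 500_000)):
    print(f"{label}: +{extra_completions(n):.1f} expected, "
          f"chance variation about {chance_sd(n):.1f}")
```

In the classroom, the expected gain is a fraction of one student and is swamped by chance variation of about two students. Across the hypothetical school system, it is roughly 3,000 extra completions, about ten times the chance variation.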
Example B: The “promising but underpowered” result
Now suppose a clinical psychology pilot tests a brief intervention for panic symptoms on 30 participants. The symptom score drops by a moderate amount on average, but variability is high and the interval estimate is wide. A significance test might be non-significant simply because the sample is small.
The estimation story could still be valuable: the effect might be meaningful, but uncertain. That’s a real scientific conclusion: “Promising signal, imprecise estimate, needs replication and better-powered follow-up.” It’s less satisfying than “significant!” but more honest, and honesty is a nice change of pace.
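Here is what “imprecise estimate, needs a better-powered follow-up” might look like in numbers. The sketch assumes a pre/post design with 30 participants, an average drop of 4 scale points, and a standard deviation of 12 for individual changes; the 4 and the 12 are invented for illustration. The critical value 2.045 is the standard two-sided 95% value for Student’s t with 29 degrees of freedom, and the follow-up sample size uses a simpler normal approximation.

```python
from math import ceil, sqrt
from statistics import NormalDist

N_PILOT = 30         # from the example above
MEAN_DROP = 4.0      # ASSUMED average symptom reduction, in scale points
SD_DROP = 12.0       # ASSUMED standard deviation of individual changes
T_CRIT_DF29 = 2.045  # two-sided 95% critical value of Student's t, 29 df

half_width = T_CRIT_DF29 * SD_DROP / sqrt(N_PILOT)
lower, upper = MEAN_DROP - half_width, MEAN_DROP + half_width


def sample_size_for_half_width(target, sd, level=0.95):
    """Approximate n so a normal-theory interval has the target half-width."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return ceil((z_crit * sd / target) ** 2)


follow_up_n = sample_size_for_half_width(target=2.0, sd=SD_DROP)

print(f"pilot 95% interval: [{lower:.1f}, {upper:.1f}]")
print(f"participants needed for a +/-2-point interval: {follow_up_n}")
```

Under these assumptions the pilot interval runs from slightly below zero to more than double the point estimate, which is exactly the “promising but imprecise” situation. Pinning the effect down to within 2 points would take roughly 140 participants rather than 30.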
What This Means for Students, Readers, and the Public
If you’re reading psychology research and you see fewer declarations of “statistical significance,” you’re not watching science collapse. You’re watching it grow up.
Here are better questions than “Was it significant?”:
- How big is the effect? (Magnitude beats asterisk-counting.)
- How uncertain is the estimate? (Precision matters.)
- How was the study designed? (Randomization, controls, measurement validity.)
- Were analyses transparent and pre-specified? (Less garden-of-forking-paths energy.)
- Does it replicate or align with prior evidence? (One study is a clue, not a conclusion.)
The big idea is not “never use p-values.” It’s “stop letting one number do the job of scientific reasoning.”
Conclusion: A Ban is a Signal, Not a Salvation
BASP’s decision became famous because it was dramatic. Bans always are. But the lasting lesson isn’t that every journal should copy-paste the policy. It’s that psychology (and science more broadly) has been rethinking how it turns data into claims.
The healthiest outcome is not swapping one rigid rule for another. The healthiest outcome is richer reporting: effect sizes, uncertainty, transparency, and conclusions that match the strength of the evidence. If a ban helped start that conversation, then at least the fallen asterisk didn’t die in vain.
Field Notes: Researcher Experiences Around the “No Significance Testing” Shift
Ask people who lived through the “p-value reckoning” what changed, and you’ll hear a familiar theme: the hardest part wasn’t running different statistics; it was changing how they talked about their results. One social psych grad student described rewriting a Results section like moving out of a studio apartment. At first, everything was stacked in one corner labeled “p < .05.” When that corner disappeared, they had to find homes for the actual furniture: effect sizes, descriptive patterns, graphs, and the uncomfortable truth that some findings were “suggestive” rather than “settled.”
Another common experience: lab meetings got longer but better. Instead of celebrating a threshold hit (“We got .049!”), teams started arguing about magnitude and robustness. Someone would ask, “Is this big enough to matter?” and the room would go quiet in the way people go quiet when they realize they’ve been using a shortcut for years. Then the debate would restart with more productive questions: “Does the effect survive if we use the preregistered exclusion rule?” “What happens if we analyze the outcome two ways?” “Does the plot look like a real pattern or like a few enthusiastic outliers doing all the work?”
Reviewers, meanwhile, split into recognizable character types. There’s the former-threshold-loyalist who feels unmoored without “significant/non-significant” labels and keeps asking for them like a person repeatedly requesting ketchup at a sushi restaurant. Then there’s the estimation enthusiast who wants interval estimates, model checks, and a rationale for every analytic choice, which can feel like being asked to show your math and your emotional work. The best reviewers do something rarer: they demand clarity without turning the paper into a statistical obstacle course. They ask for transparency, interpretability, and restraint.
Researchers also report a subtle psychological shift: fewer false “wins,” fewer false “losses.” A non-significant result no longer automatically reads as failure; it can be “inconclusive but informative,” especially in pilots. A small p-value no longer automatically reads as victory; it can be “precise but tiny,” or “sensitive to assumptions.” Over time, this reframing tends to reduce the emotional roller coaster of analysis. People still care about outcomes, because they’re human, but they’re less likely to let a single decimal determine the paper’s destiny.
And yes, there’s humor in the trenches. You’ll hear jokes about “p-hacking rehab,” memes about retiring the asterisk, and the occasional sincere toast to the day someone finally wrote, “The effect is small, uncertain, and probably not worth a press release,” and still got published. In a field that’s trying to improve credibility, that’s not just comedy; that’s progress.