For example, take a population of 100 people, and let us say one of them has changes in their fatty acid desaturase genes (FADS1 and FADS2) that alter how important long-chain omega-3 fatty acids (like those from fish) are for them. This happens more often in people from indigenous Arctic populations.
https://www.sciencedirect.com/science/article/pii/S000291652...
So the researcher tests whether omega-3 affects cardiovascular outcomes by adding a lot more fish oil to the diet of all 100 people. Since only one of them really needs it, the p-value will be insignificant and everyone will say fish oil does nothing. Yet for that one person it was literally everything.
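A toy simulation (Python, with invented numbers, not the actual study) shows how completely that one real responder disappears into the group-level test:

```python
# Hypothetical sketch: 100 people, 1 FADS-variant carrier who genuinely
# benefits from supplementation, 99 who don't respond at all.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
noise = rng.normal(loc=0.0, scale=1.0, size=n)   # ordinary variation in the outcome
true_effect = np.zeros(n)
true_effect[0] = -3.0                            # the single carrier improves a lot
observed_change = noise + true_effect

# Group-level test: "did supplementation shift the outcome on average?"
t_stat, p_value = stats.ttest_1samp(observed_change, popmean=0.0)
print(f"mean change = {observed_change.mean():+.2f}, p = {p_value:.3f}")
# The p-value will almost always land far above 0.05, even though the effect
# for person 0 was large and entirely real.
```

The trial "correctly" reports no average effect; what it cannot report is that the average is the wrong question for that one person.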
This is talked about only quietly in research, but I think the wider population needs to understand this to know how useless p < 0.05 is when testing nutritional effects in genetically diverse populations.
Interpreting Clinical Trials With Omega-3 Supplements in the Context of Ancestry and FADS Genetic Variation https://www.frontiersin.org/journals/nutrition/articles/10.3...
So although I look like a typical European Caucasian, my genetics are quite atypical of that population. And this also explains my family history of heart disease and mood disorders, which are likewise atypical for European Caucasians.
I agree, but then all these cheaper, easier studies are useless.
In most African populations this polymorphism does not exist at all, and even in Europeans it is present in only about 12% of the population.
Multimodal distributions are everywhere, and we are losing key insights by ignoring this. A classic example is the difference in response between men and women to a novel pharmaceutical.
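A quick sketch of that pharma example (hypothetical numbers, assuming the response goes in opposite directions for the two groups):

```python
# Hypothetical drug response: improves one subgroup, worsens the other.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
women = rng.normal(loc=-1.0, scale=1.0, size=50)   # real improvement
men = rng.normal(loc=+1.0, scale=1.0, size=50)     # real worsening
pooled = np.concatenate([women, men])

print("pooled p :", stats.ttest_1samp(pooled, 0.0).pvalue)   # typically far from significant
print("women  p :", stats.ttest_1samp(women, 0.0).pvalue)    # strongly significant
print("men    p :", stats.ttest_1samp(men, 0.0).pvalue)      # strongly significant
# Averaging over a bimodal response cancels two large, real, opposite effects.
```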
It’s certainly not the case that scientists are not aware of this fact, but there seems to be a strong bias to arrange studies to fit into normal distributions by, for example, being selective about the sample population (test only on men, to avoid complicating variables). That makes pragmatic sense, but I wonder if it perpetuates an implicit bias for ignoring complexity.
For example, checking account balances are far from a normal distribution!
Yes, but this is not usually done.
A cool, interesting, horrible problem to have :)
The fundamental problem is the lack of embrace of causal inference techniques - i.e., the choice of covariates/confounders is in itself a scientific problem that needs to be handled with love.
I do get what you're saying, that you can miss something important in the study, but I don't see how that makes a case for dropping statistical significance altogether?
Put another way, the population is simply too variable to attempt to eliminate all confounding factors. We can, at best, eliminate some of the ones we know about, and acknowledge the ones we can't.
What does this mean? Is it contrary to what OP is saying above?
To use OP's example, we know that the gene mentioned is frequently found in the Inuit population. But if an Inuk does not have that gene, it does not somehow make them less Inuit. We can't quantify percentage Inuitness, and doing so is logically unsound. This is because the term "Inuit" doesn't mean its biological correlates. It simply has biological correlates.
To use an example of a personal friend, slightly anonymized: My friend is an Ashkenazi Jew. There is absolutely no uncertainty about this; Jewishness is matrilineal, and their mother was an Ashkenazi Jew, and her mother before her, going back over eight documented generations of family history. But alas - their grandfather was infertile, a fact that was posthumously revealed. Their maternal grandmother had a sperm donor. The sperm donor was not an Ashkenazi Jew. Consequently, can said friend be said to be "only 75% Jewish," having missed the "necessary" genetic correlates? Of course not. By simple matrilineage they are fully an Ashkenazi Jew.
Why are these terms used in medicine, then? Because, put simply, it's the best we can do. Genetic profiling is a useful tool under some limited circumstances, and asking medical subjects their ethnicity is often useful in determining medical correlates. But there is nothing in the gene that says "I am Inuk, I am Ashkenazi," because these ideas are social first, not genetic first.
I often wonder how many entrenched culture battles could be ~resolved (at least objectively) by fixing people's cognitive variable types.
But even though I'm not happy with NHST (the testing paradigm you describe), in that paradigm it is a valid conclusion for the group the hypothesis was tested on. It has been known for a long, long time that you can't find small, individual effects when testing a group. You need to travel a much harder path for those.
What do you mean? They all "work". None of them work for everyone, but that doesn't mean they don't work at all. As the case I was looking at revolved around nutritional deficiencies (brought on by celiac in my case) and their effects on the heart, it is also the case that the downside of the 4 separate interventions if it was wrong was basically nil, as were the costs. What about trying a simple nutritional supplement before we slam someone on beta blockers or some other heavy-duty pharmaceutical? I'm not against the latter on principle or anything, but if there's something simpler that has effectively no downsides (or very, very well-known ones in the cases of things like vitamin K or iron), let's try those first.
I think we've lost a great deal more to this weakness in the "official" scientific study methodology than anyone realizes. On the one hand, p-hacking allows us to "see" things where they don't exist and on the other this massive, massive overuse of "averaging" allows us to blur away real, useful effects if they are only massively helpful for some people but not everybody.
Even something as simple as a few strands of hair sealed in a plastic bag in a filing cabinet somewhere would be better than nothing at all.
Even if there is no name saved with the genetic sample, the bar for identification is low. The genes are even more identifying than a name after all. Worse, it contains deep information about the person.
But... that's not a problem with the use of the p-value, because that's (quite probably) a correct conclusion about the target (unrestricted) population addressed by the study as a whole.
That's a problem with not publishing complete observations, or not reading beyond headline conclusions to come up with future research avenues. That effects which are not significant in a broad population may be significant in a narrow subset (and vice versa) are well-known truths (they are the opposites of the fallacies of division and composition, respectively.)
"Don’t believe that an association or effect exists just because it was statistically significant.
Don’t believe that an association or effect is absent just because it was not statistically significant.
Don’t believe that your p-value gives the probability that chance alone produced the observed association or effect or the probability that your test hypothesis is true.
Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)."
Hopefully this can help address the replication crisis[0] in (social) science.
[0]: https://en.wikipedia.org/wiki/Replication_crisis
Edit: Formatting (sorry formatting is hopeless).
I think it isn't just p-hacking.
I've participated in a bunch of psychology studies (questionnaires) for university, and I've frequently had situations where my answer to some question didn't fit into the possible answer choices at all. So I'd sometimes just choose whatever seemed the least wrong answer out of frustration.
It often felt like the study author's own beliefs and biases strongly influence how studies are designed and that might be the bigger issue. It made me feel pretty disillusioned with that field, I frankly find it weird they call it a science. Although that is of course just based on the few studies I've seen.
While studies should try to be as "objective" as possible, it isn't clear how this can be avoided. How can the design of a study not depend on the author's beliefs? After all, the study is usually designed to test some hypothesis (that the author has based on their prior knowledge) or measure some effect (that the author thinks exists).
Can we recognize the beliefs we have that bias our work and then take action to eliminate those biases? I think that is possible when we aren't studying humans, but beliefs we have about humans are on a much deeper level and psychology largely doesn't have the rigor to account for them.
If you can't do science, don't call it science.
This isn't obviously a bad thing, in the context of a belief that most results are misleading or wrong.
But surely let's have a "hard-line stance" on not drowning in BS?
We live in a money-dependent world. We cannot go without it.
People do routinely misuse and misinterpret p-values — the worst of it I've seen is actually in the biomedical and biological sciences, but I'm not sure that matters. Attending to the appropriate use of them, as well as alternatives, is warranted.
However, even if everyone started focusing on, say, Bayesian credibility intervals I don't think it would change much. There would still be some criterion people would adopt in terms of what decision threshold to use about how to interpret a result, and it would end up looking like p-values. People would abuse that in the same ways.
Although this paper is well-intentioned and offers reasonable, actionable advice, it suffers from some of the same problems I think are typical of this area. It tends to assume your data is fixed, and that the question is how to interpret your modeling and results. But in the broader scientific context, that data isn't ideally a fixed quantity: it's collected by someone, and there's a broader question of "why this N, why this design", and so forth. So yes, p cutoffs are arbitrary, but they're not necessarily arbitrary relative to your study design, in the sense that if p < 0.05 is the standard the field has adopted and you have a p = 0.053, the onus is on you to increase your N or choose a more powerful or more convincing design to demonstrate something at whatever threshold the field has settled on.
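For what it's worth, that "increase your N" step is just a power calculation; here is a rough normal-approximation sketch (the effect sizes and thresholds are placeholders, not recommendations):

```python
# Approximate n per group for a two-sample comparison (normal approximation).
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Samples per group needed to detect standardized effect size d."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) / d) ** 2

for d in (0.2, 0.5, 0.8):               # conventional small / medium / large effects
    print(f"d = {d}: ~{n_per_group(d):.0f} per group")
# A study landing at p = 0.053 was very likely underpowered for the effect it
# was chasing; the fix is a bigger or better-designed study, not rounding down.
```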
I'm not trying to argue for p-values per se necessarily, science is much more than p-values or even statistics, and think the broader problem lies with vocational incentives and things like that. But I do think at some level people will often, if not usually, want some categorical decision criterion to decide "this is a real effect not equal to null" and that decision criterion will always produce questionable behavior around it.
It's uncommon in science in general to be in a situation where the question of interest is to genuinely estimate a parameter with precision per se. There are cases of this, like in physics for example, but I think usually in other fields that's not the case. Many (most?) fields just don't have the precision of prediction of the physical sciences, to the point where differences of a parameter value from some nonzero theoretical one make a difference. Usually the hypothesis is just that there is some nonzero effect, or some difference from an alternative; moreover, even when there is some interest in estimating a parameter value, there's often (like in physics) some implicit desire to test whether or not the value deviates significantly from a theoretical one, so you're back to a categorical decision threshold.
Semi-related, I found UniverseHacker's take[0] on the myth of averaging apt with regard to leaning too heavily on sample means. Moving beyond p-values and inter/intra-group averages, there's fortunately a world of JASP[1].
I am sure there are plenty of people who misunderstand or misinterpret statistics. But in my experience these are mostly consumers. The people who produce "science" know damn well what they are doing.
This is not a scientific problem. This is a people problem.
As a statistician, I could not disagree more. I would venture to say that most uses of statistics by scientists that I see are fallacious in some way. It doesn't always invalidate the results, but that doesn't change the fact that it is built on a fallacy nonetheless.
In general, most scientists actually have an extremely poor grasp of statistics. Most fields require little more than a single introductory course to statistics with calculus (the same one required for pre-med students), and the rest they learn in an ad-hoc manner - often incorrectly.
People conduct science, and a lot of those people don't understand statistics that well. This quote from nearly 100 years ago still rings true in my experience:
"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."
- Ronald Fisher (1938)
A good researcher describes their study, shows their data, and lays out their own conclusions. There is just no need for (nor possibility of) a predefined recipe to reduce the study result to a "yes" or a "no".
Research is about increasing knowledge; marketing is about labelling.
Agreed. A single published paper is not science, a tree data structure of published papers that all build off of each other is science.
But if you take a lot of bricks and arrange them appropriately, then every single one of those bricks is wall.
In other words, just like the article points out down in the "dos" section, it depends on how you're treating that single unreplicated paper. Are you cherry-picking it, looking at it in isolation, and treating it as if it were definitive all by itself? Or are you considering it within a broader context of prior and related work, and thinking carefully about the strengths, limitations, and possible lacunae of the work it represents?
If a new paper with an outrageous claim pops up, people are automatically suspicious. Until it’s been reproduced by a few labs, it’s just “interesting”.
Then, once it’s been validated and new science is built off of it, it’s really accepted as foundational.
I guess it depends on what you're referring to as the "scientific method". As the article indicates, a whole lot of uses of p-values in the field - including in many scientific papers - actually invoke statistics in invalid or fallacious ways.
Sure, which is why I asked OP to define what they meant by "scientific method". The statement doesn't mean a whole lot if we're defining "scientific method" in a way that excludes 99% of scientific work that's actually produced.
No quotes needed, scientific method is well defined: https://en.wikipedia.org/wiki/Scientific_method
Tangent: I think that this attitude of scientific study can be applied to journalism to create a mode of articles between "neutral" reports and editorials. In the in-between mode, journalists can and should present their evidence without sharing their own conclusions, and then they should present their first-order conclusions (e.g. what the author personally thinks that this data says about reality) in the same article even if their conclusions are opinionated, but should restrain from second-order opinions (e.g. about what the audience should feel or do).
The way I do research is roughly Bayesian- I try to see what the aggregate of published experiments, anecdotes, intuition, etc. suggests are likely explanations for a phenomenon. Then I try to identify what realistic experiment is likely to provide the most evidence distinguishing between the top possibilities. There are usually many theories or hypotheses in play, and none are ever formally confirmed or rejected- only seen as more or less likely in the light of new evidence.
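The updating step itself is tiny; a toy sketch of it (the hypotheses, priors, and likelihoods are all invented for illustration):

```python
# Toy Bayesian update over competing explanations of a phenomenon.
import numpy as np

hypotheses = ["nutrient deficiency", "genetic variant", "measurement artifact"]
prior = np.array([0.5, 0.3, 0.2])        # rough plausibilities from prior literature/intuition
likelihood = np.array([0.7, 0.4, 0.05])  # how expected the new result is under each hypothesis

posterior = prior * likelihood
posterior /= posterior.sum()             # Bayes' rule, renormalized

for h, p in zip(hypotheses, posterior):
    print(f"{h}: {p:.2f}")
# Nothing is confirmed or rejected outright; each explanation just becomes
# more or less likely in light of the new evidence.
```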
The peaks in your spectra, the calculation results, or the microscopy image either support your findings or they don't, so P-values don't get as much mileage. I can't remember the last time I saw a P-value in one of those papers.
This does create a problem similar to publishing null result P-values, however: if a reaction or method doesn't work out, journals don't want it because it's not exciting. So much money is likely being wasted independently duplicating failed reactions over and over because it just never gets published.
The p-value cutoff of 0.05 just means "an effect this large, or larger, should happen by chance 1 time out of 20". So if 19 failed experiments don't publish and the 1 successful one does, all you've got are spurious results. But you have no way to know that, because you don't see the 19 failed experiments.
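You can watch that play out in a few lines (a sketch under a true null, i.e. no real effect anywhere):

```python
# 20 labs independently test a treatment that truly does nothing.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
p_values = []
for lab in range(20):
    treated = rng.normal(0.0, 1.0, size=30)   # no real effect in either arm
    control = rng.normal(0.0, 1.0, size=30)
    p_values.append(stats.ttest_ind(treated, control).pvalue)

hits = [p for p in p_values if p < 0.05]
print(f"{len(hits)} of 20 runs cross p < 0.05:", [round(p, 3) for p in hits])
# On average about 1 in 20 crosses the threshold by chance alone. If only
# those "successes" get published, the literature records an effect that
# isn't there.
```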
This is the unresolved methodological problem in empirical sciences that deal with weak effects.
More like "an effect this large, or larger, should happen by chance 1 time out of 20 in the hypothetical universe where we already know that the true size of the effect is zero".
Part of the problem with p-values is that most people can't even parse what they mean (not saying it's your case). P-values are never a statement about probabilities in the real world, but always a statement about probabilities in a hypothetical world where we assume all effects are zero.
"Effect sizes", on the other hand, are more directly meaningful and more likely to be correctly interpreted by people on general, particularly if they have the relevant domain knowledge.
(Otherwise, I 100% agree with the rest of your comment.)
"Woman gives birth to fish" is interesting because it has a p-value of zero: under the null hypothesis ("no supernatural effects"), a woman can never give birth to a fish.
I suggest reading your comments before you post them.
My point was basically that the reputation / career / etc. of the experimenter should be mostly independent of the study results. Otherwise you get bad incentives. Obviously we have limited ability to do this in practice, but at least we could fix the way journals decide what to publish.
But I am curious about something else. I am not a statistical mechanics person, but my understanding of information theory is that something genuinely refined emerges from a threshold (assuming it operates on SOME real signal), and the energy required to impose that threshold is what allows "lower entropy" systems to emerge. Isn't this the whole principle behind Maxwell's Demon? That if you could open a little door between two equal-temperature gas canisters you could perfectly separate the faster and slower gas molecules and paradoxically create a temperature difference? But to open the door only for fast molecules (thresholding them), the little door would require energy (so it is no free lunch)? And that effectively acts as a threshold on the continuous distributions? I guess what I am asking is: isn't there a fundamental importance to thresholds in generating information? Isn't that how neurons work? Isn't that how AI models work?
It means more than that to some people, and it shouldn't.
- DON'T is very clear and specific. Don't say "Stat-Sig", don't conclude causal effect, don't conclude anything based on p>0.05.
- DO is very vague and unclear. Do be thoughtful, do accept uncertainty, do consider all relevant information.
Obviously, thoughtful consideration of all available information is ideal. But until I get another heuristic for "should I dig into this more?" - I'm just gonna live with my 5-10% FPR, thank you very much.
Here's the way I'd put things - correlation by itself doesn't give you causation at all. You need correlation plus a plausible model of the world to have a chance.
Now science, at its best, involves building up these plausible models, so a scientist creates an extra little piece of the puzzle and has to be careful that the piece is also a plausible fit.
The problem you hit is that the ruthless sink-or-swim atmosphere, previous bad science and fields that have little merit make it easy to be in the "just correlation" category. And whether you're doing a p test or something else doesn't matter.
Another way to put it is that a scientist has to care about the truth in order to put together all the pieces of models and data in their field.
So the problem is ultimately institutional.
Why do you need a heuristic? In what areas are you doing research where you don't have any other intuition or domain knowledge to draw on?
And if you don't have that background, contextual knowledge, are you the right person to be doing the work? Are you asking the right questions?
That's the domain knowledge. The p-values themselves are useful, not the fixed cut-off. You know that in your research field p < 0.01 has importance.
A p-value does not measure "importance" (or relevance), and its meaning is not dependent on the research field or domain knowledge: it mostly just depends on effect size and number of replicates (and, in this case, due to the need to apply multiple comparison correction for effective FDR control, it depends on the number of things you are testing).
If you take any fixed effect size (no matter how small/non-important or large/important, as long as it is nonzero), you can make the p-value be arbitrarily small by just taking a sufficiently high number of samples (i.e., replicates). Thus, the p-value does not measure effect importance, it (roughly) measures whether you have enough information to be able to confidently claim that the effect is not exactly zero.
Example: you have a drug that reduces people's body weight by 0.00001% (clearly, an irrelevant/non-important effect, according to my domain knowledge of "people's expectations when they take a weight loss drug"); still, if you collect enough samples (i.e., take the weight of enough people who took the drug and of people who took a placebo, before and after), you can get a p-value as low as you want (0.05, 0.01, 0.001, etc.), mathematically speaking (i.e., as long as you can take an arbitrarily high number of samples). Thus, the p-value clearly can't be measuring the importance of the effect, if you can make it arbitrarily low by just having more measurements (assuming a fixed effect size/importance).
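A sketch of that last point (a fixed, trivially small true effect with hypothetical numbers; only the sample size changes):

```python
# Fixed tiny true effect; the p-value is driven down purely by adding samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_shift = 0.02                        # "real" but practically irrelevant weight change
for n in (100, 10_000, 1_000_000):
    drug = rng.normal(loc=-true_shift, scale=1.0, size=n)
    placebo = rng.normal(loc=0.0, scale=1.0, size=n)
    result = stats.ttest_ind(drug, placebo)
    print(f"n = {n:>9,}: estimated effect = {drug.mean() - placebo.mean():+.4f}, "
          f"p = {result.pvalue:.3g}")
# The true effect is fixed at -0.02 (irrelevant by any domain standard), yet
# the p-value eventually looks "highly significant" once n is big enough.
```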
What is research field (or domain knowledge) dependent is the "relevance" of the effect (i.e., the effect size), which is what people should be focusing on anyway ("how big is the effect and how certain am I about its scale?"), rather than p-values (a statement about a hypothetical universe in which we assume the null to be true).
We're all just fancy monkeys with lightning rocks, it's fine dude
They just find mistakes: people come to us because their report for this or that guy was missing something, the commissions are not precise, etc. Or they just get a bogus alert sometimes.
We could just shrug, but this erodes trust in the whole service and in the system itself. So we worked a lot to never have these. We have like 25 people on customer service, and I can tell you it could easily have been 50 otherwise.
The interesting part is how we had to kind of break people's common sense about what is a significant problem and what is not. The "this almost never happens, I can assure you" from new colleagues was a good source of laughter after some time.
In case you want your own PDF of the paper.
[1] https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1...
This is a damning criticism of the most common techniques used by many scientists. The answer isn't to shrug and keep doing the same thing as before.
Questions like how many samples to collect, what methods to use to fit the uncertain data, etc.