The Reproducibility Crisis: An Interview with Prof. Dorothy Bishop
On my recent visit to Great Britain (the first one post-Brexit) I had the pleasure of talking to Dorothy Bishop. Bishop is Professor of Psychology at the University of Oxford and has been a leading force in combating the reproducibility crisis in her own and other disciplines. You can find her on Twitter under the handle @deevybee. The comment for Nature magazine which I mention in the video is here.
Here is something from personal experience: about a decade ago I decided to get involved in biochemistry. My first project was to work with a researcher who had gotten great results from a wunderkind grad student. We were simply going to replicate his experiment before writing a grant proposal based on his results. His work had resulted in a published paper, by the way. And we couldn't do it. We found that his lab notes were inadequate, and when we spoke with the student, who had moved on to another lab, he didn't quite recall how he had done it. We spent six months trying to replicate those results, with no success. Ultimately the decision was made to assume the results were valid and to write that grant proposal anyway. Which we did. And the grant was awarded. I had a bad feeling about it and didn't want to be involved beyond that point. I found something else to do. I really don't know what came of this or what was ultimately published. I learned years later that the director of this particular project had retired. I do not know whether anyone ever managed to replicate the original study. I was somewhat disillusioned.
Thank you, Sabine, for this interview. Reproducibility and its four horsemen are indeed a growing problem in all areas of science. If I may share my perspective as a mathematician: it is getting steadily worse. It's bad, really bad.
Despite these reproducibility problems, the mathematical areas that have been exploding (in terms of publication numbers) are the ones derived from statistics. Long gone are the days when being an applied mathematician meant studying differential equations, numerical analysis, or probability. Statisticians are taking over university mathematics departments. Most of the time, all they do is data analysis, crunching numbers through statistics software to publish their results (I have colleagues who write articles in ONE day). And they do publish a lot of results... most of the time very similar results in different journals.
People like me, pure mathematicians, who can take 2-3 years (if not a lifetime) to find something interesting to say, cannot keep up with this level of production. This is killing fundamental research.
Journals are indeed not interested in null results. If you invest a year (or more) researching an idea and develop a great deal of knowledge about a problem, but end up finding nothing, no journal will publish your study. No one will know that the path or angle you chose is fruitless. This, too, is killing fundamental research.
Lastly, and on this I would like to have your opinions. Before this age of communication and easy access to information, research used to be "vertical": we built our knowledge on the work of our predecessors. It was a slow process, but it is what made the breakthroughs of the 20th century possible.
Nowadays, research seems "horizontal": huge numbers of articles are being published and no one is able to keep track of it all. There is a lot of noise, rubbish, a frenzy to publish as fast as possible. Our peer-review system cannot keep up with this. It does not work anymore.
And the worst of it is that, as researchers, we are not trying to solve difficult problems; we are biasing our work towards easier problems. In this context, anyone can be a researcher in any area, most of these areas... applied. And that is a big problem for me... because we are losing, little by little, all the know-how that made the discoveries of the 20th century possible.
Bourbaki,
First, let me say that I don't think it's an issue that in some research areas it's easier to churn out papers than in others. I say this both from personal experience and from my experience with bibliometric analysis. The field-dependence of publication pace is very easy to normalize for, and it's something that people tend to get intuitively right (though of course intuition isn't something we should rely on too much).
About the "huge amount" of research published. This is certainly true in absolute numbers just because there are more scientists today. If you look at the number of papers per authors, though, it has been mostly stable or has actually decreased in the past decades. I wrote about this here. What we are seeing is basically diminishing returns: More people does not translate into equally more results.
What is happening, though, because of the absolute growth of communities, is that they fall apart into super-specialized niches, which creates the problem you refer to: lots of papers that are read by few people, which creates the opportunity to publish crap, basically.
I think (and have written about this previously) that this is a problem that AI (or, more generally, smart literature analysis) can dramatically help with. You want an algorithm that finds what's relevant to your research and filters it better than "my friend wrote this" or "I've heard of this guy".
It pains me quite a bit that things are moving so incredibly slowly when it comes to this problem, but at least they are moving.
When it comes to the more general problem of combating perverse incentives, however, scientists are still not taking responsibility.
Bourbaki,
I see a complementarity between traditional review articles and preprints as one possible way of proceeding.
Some notes about this are here, with a focus on arXiv but perhaps extensible to other areas: https://www.niso.org/niso-io/2020/01/standards-and-role-preprints-scholarly-communication
I'm hoping to do more writing on this topic.
Though my gut reaction is that fixing psychology this way will be about like making socialism work, I am happy that this problem is at least now being acknowledged and worked on rather than not. I had to smile at how Professor Bishop suggested that certain standard textbook cases would be removed. Hopefully so, but apparently that hasn't yet happened with the Stanford Prison Experiment. Julia Galef did a great podcast on a French book by Thibault Le Texier which suggests psychologists are still in denial. http://rationallyspeakingpodcast.org/show/rs-241-thibault-le-texier-on-debunking-the-stanford-prison-e.html
Beyond efforts from within, I believe that science will need better structural principles from which to work, so that our soft sciences can finally harden up, or, now that evidence has become scarce in physics, so that even physics may be defended from getting “Lost in Math”. I propose one for metaphysics, two for epistemology, and one for axiology.
As a social scientist, I run across these problems constantly, and I agree with everything except the importance of a pre-registered research plan and hypothesis. Pre-registration in effect means that the researchers have the conceit to play god, pretending that they know enough to form a plan and hypothesis, and it rules out serendipitous results if they fall outside the researchers' initial mindsets. Such results are likely to be more important than a pre-set hypothesis. Pursuing them usually involves some p-hacking, but the problem is with the "p". Researchers need to use a much lower "p" than the standard .05. Physicists hunting for new particles use something like .000001. In many situations researchers can use more stringent significance tests, such as the Bonferroni correction, to handle multiple-testing problems.
Even with a pre-registered plan and hypothesis, the .05 criterion is misleading if many researchers are working on the same issue, because by chance some 5% will find significant relationships even when there is no relationship. And, as stressed, such results are far more likely to be published.
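To make the "some 5% by chance" point concrete, here is a minimal Python sketch; the number of studies and the sample sizes are invented purely for illustration. It simulates many independent studies of an effect that is truly zero and counts how many clear the .05 bar, with and without a Bonferroni adjustment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies = 1000    # hypothetical number of teams studying the same (null) effect
n_per_group = 50    # hypothetical sample size per group in each study
alpha = 0.05

# Every study compares two groups drawn from the same distribution,
# so the true effect is exactly zero.
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, n_per_group),
                    rng.normal(0, 1, n_per_group)).pvalue
    for _ in range(n_studies)
])

# Roughly 5% of the studies come out "significant" at .05 by chance alone.
print("fraction significant at .05:      ", (p_values < alpha).mean())

# Bonferroni correction: divide alpha by the number of tests being run.
print("fraction significant (Bonferroni):", (p_values < alpha / n_studies).mean())
```

The naive rate typically hovers around 0.05, while the Bonferroni-adjusted threshold lets essentially nothing through; the particular numbers are beside the point.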
t marvell,
I don't see the concern you raise. If a pre-registered study stumbles across something unexpected yet relevant, there is nothing preventing researchers from pursuing the matter in their next study.
That really depends on how expensive it is to repeat the study.
That would require a new data set developed in the same way as the data set used in the initial study (otherwise an inability to replicate might be due to data issues). That might be possible with medical and physics experiments, if the researcher can get additional funding to produce a new, independent data set for a rerun of the experiment.
As with much of social science, I work with existing data sets, mainly government data, so the next study would not be independent of the initial study and thus would encounter the problems I describe. I suspect that my situation is the same as for most cosmological research.
In other words, one cannot use a hypothesis that was developed by p-hacking and then use the same data to test that hypothesis.
In machine learning one runs into the issue that training data and model validation data ought to be distinct, but a work-around is k-fold cross-validation. https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f
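For what it's worth, here is a minimal scikit-learn sketch of the idea; the synthetic regression data and the ridge model are placeholders chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real data set.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 5-fold cross-validation: the data are split into five folds, each fold takes a
# turn as the held-out validation set, and the model is refit on the other four.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
print("mean R^2:", round(scores.mean(), 3))
```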
I observed, as a cognitive behavioural therapist, that research papers tend to have either a high p-value with a large standard deviation, or a small standard deviation with a low p-value. This was already a feature when I started training in the late 1990s.
As for the reproducibility of some of the classic experiments, I observe that people's personalities are a confounding variable when drawing conclusions about what the results meant.
So my inclination is to examine behaviours or symptoms, which can be measured, rather than cognitions or feelings, which are difficult to quantify objectively.
As in physics, there are those in psychology who subscribe to beliefs that are not scientific.
Great episode. I appreciate the concern about weaponization, but science to me is about shining light, so transparency with the public is important.
I understand not using data to support a hypothesis not originally contemplated by the experiment, but I assume it is okay to look at the data, acknowledge that it did not support the original hypothesis, and formulate a new hypothesis from the data, provided that the new hypothesis is then tested independently of the first attempt.
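One cheap way to keep the two steps independent within a single data set (a sketch with simulated numbers, not a prescription) is to split the sample before looking at it: generate hypotheses on one half and test them only on the other.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in for a real sample; here the true mean is slightly above zero.
data = rng.normal(loc=0.1, scale=1.0, size=400)

# Split once, up front, before any analysis has been done.
rng.shuffle(data)
explore, confirm = data[:200], data[200:]

# Exploratory half: look around freely and, say, notice the mean seems positive.
print("exploratory mean:", round(explore.mean(), 3))

# Confirmatory half: test the newly formed hypothesis on untouched data only.
res = stats.ttest_1samp(confirm, popmean=0.0)
print("confirmatory p-value:", round(res.pvalue, 3))
```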
This replication crisis has been a disaster for social psychology and medical research, but a boon for the researchers. P-hacking, contrary to the claims made in the interview, is a conscious act and will only be eliminated by constant attention and brutal consequences for the offenders - not a highly probable outcome, at least not less than .05.
When I started my PhD in solvent extraction forty-odd years ago, my supervisor got me to do an undergraduate experiment extracting copper. I could not get the accepted answer of 2.0 for the slope; I kept getting 1.8. I tried for three months... different spectrometers, etc. Eventually I gave up and got on with the thesis. A couple of years later, while supervising the undergraduate class, I had a student point out that he was getting 1.8 (none of the preceding or subsequent students complained). A bit later a fellow PhD student mentioned he could not get a slope of 2 either.
Eventually I went through the math and showed that it should not be 2.0. That was about the only original thing in my thesis, and it was not the main point of it.
For a recent quantitative analysis of how often experimental reports of the synthesis of a new material are repeated in the scientific literature, see this PNAS paper: https://www.pnas.org/content/117/2/877.short
The process of writing this paper involved some hard-working students in my group analyzing thousands of individual papers.
Hi Sabine,
The presentations from Metascience 2019 have recently been made available.
Glad you interviewed Professor Bishop. Her presentation, entitled "The role of cognitive biases in sustaining bad science", is available here:
https://www.metascience2019.org/presentations/dorothy-bishop/
Cheers
Andrew Gelman has also written extensively on this topic on his blog. One additional "horseman" I think he would add is the binary thinking promoted by null-hypothesis significance testing (NHST): results are either "significant", in which case the hypothesis is considered verified, or "not significant", in which case it is considered disproven. (Confidence intervals and Bayesian methods are two alternatives that avoid this problem.)
Gelman furthermore notes,
"The problem with null hypothesis significance testing is that rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A. This is a disaster. See here."
Thank you for the link to Gelman's article. I found it useful; it helps delineate some of the issues discussed above.
The worst thing about scientific studies is that they can only control for so many variables. Let's say we publish that sunlight causes melanoma. Everybody puts sunscreen on, and 40 years later we find out that the chemicals in the sunscreen cause 10 other kinds of cancer and that not getting enough sun causes 10 more. Studies can rarely be good enough to make decisions about health.
How does someone get a PhD in whatever branch of knowledge without knowing which research approaches, methodologies, and methods are pertinent to that subject and without having been properly trained in the applicable research methods? It appears to me that the standard for awarding PhDs needs to be raised so that any PhD student must show mastery of the applicable research methodologies. Yes, the incentives may be wrong, but is no one ethical anymore? How can you knowingly produce faulty research? How can journal reviewers knowingly accept faulty research? I do not buy the idea that they don't know - they appear rather to simply not care, as the driving force is article publication no matter the validity and reliability of the research.
Dear Lance,
Delete"How does someone get a PhD in whatever branch of knowledge without knowing which research approaches, methodologies and methods are pertinent to that subject and having been properly trained in the applicable research methods!!?"
Things were done that way in the "old times". You learned how to do research by doing research. That was the point of doing a PhD.
It was HARD to get a PhD. Some very brilliant people took decades to get their PhD (I am thinking of the particular example of Michael Herman, in mathematics). Many people FAILED.
There was a SELECTION based on the student's knowledge and capacity. You had to find and demonstrate a NEW result relevant and pertinent to the area studied.
That was how things were done before. Nowadays any person with the money to finance his or her education can get a PhD.
Areas of research have been created for that sole purpose. Many of those in the non-exact or applied sciences do not require any particular knowledge for an individual to get a PhD.
I think that is the point you and most of the people here are missing.
Bourbaki,
You don't need money. Most math PhDs are supported by the school.
Which maths PhDs are you talking about? Pure maths PhDs certainly are supported, since they are rare: a small number of candidates for a small number of grants. Applied maths PhDs are not: lots of candidates compared to the number of grants.
That is what I see in Europe. For example, it is quite easy to get a grant (or have it paid for by the financial institution (bank, etc.) where you work) in mathematical finance (in particular econometrics, risk management, or actuarial "science"). The same is not true in pure mathematics (unless the subject of the thesis is oriented towards practical application, but then it is not pure mathematics anymore). For instance, a lot of pure maths grants are being diverted into computational science; most of this is mathematical engineering, and that is not what mathematics is about... Physicists like our delightful host Sabine Hossenfelder are not mathematicians. Knowing enough maths to be able to use it as a tool does not make you a mathematician.
Bourbaki,
You said that nowadays anyone with money can get a PhD. All I was saying in my comment is that you don't even need money. Anyone can get a PhD, money or not. I didn't have any money, but I did get my PhD in pure math at a US university.
Dear Stor,
DeleteWhen I say "anyone" I mean even someone with a very weak knowledge in mathematics.
I live in a country (which is not France, where I did my PhD) where it is not unusual for my colleagues to write the PhD thesis of their students. All of it.
Some of my colleagues with a PhD in mathematics (statistics) do not know what a limit is. Here we have "algebraists" who have a hard time computing a basic integral and "analysts" who have no clue about elementary algebra.
That is what I mean by "anyone"...