Towards replicability and sustainability in cancer research
Background
Biomedical science is vital for human well-being, serving as a compass towards a healthier future. But whether our current practice of science meets this laudable aspiration is a more complicated question. At the height of the pandemic, an explosion of COVID-related research highlighted severe methodological shortcomings in the practice and communication of science. Given the immediacy of the crisis, these suspect findings quickly percolated into the public domain – most visibly in Donald Trump’s 2020 endorsement of hydroxychloroquine as a panacea for COVID-19, which stemmed from deeply flawed publications [1, 2]. Similar works extolling the anti-parasitic medicine ivermectin for COVID also attracted worldwide attention, despite later analysis exposing them as wholly unreliable [3]. Consequently, millions worldwide embraced ineffectual medicines, with deleterious consequences.
While the pandemic may have supplied the most high-profile examples of dubious science, these problems long predate it. An estimated 85% of medical research is deemed research waste [4]: so poorly conducted as to be uninformative, or so poorly reported that it is impossible to reproduce. Across biomedical science, there is increasing recognition that we are in the midst of a replication crisis [5], in which important results fail to hold up under scrutiny, with harmful ramifications for both researchers and patients. A recent high-profile scandal in Alzheimer’s research saw a seminal and hugely cited paper in the field exposed as likely fabricated and retracted earlier this year [6,7,8]. This retraction was the culmination of a suspect finding that misled the entire field for almost two decades, wasting hundreds of millions in research funding and countless human hours on a fool’s errand, steering the research community away from productive avenues to chase a phantom.
Cancer research is certainly not immune to these dark trends. As early as 2012, a systematic replication effort of what were deemed landmark cancer biology experiments exposed an alarming finding [9] – only 6 of the 53 experiments analysed, approximately 11%, had replicable results. A 2021 replication effort [10] in preclinical cancer research, which looked at 193 experiments from 53 high-impact published works, came to a similarly disquieting conclusion: most papers failed to report vital statistics and methodology, and none of the experiments had been reported in sufficient detail for replicators to validate them directly. When authors were contacted, they were frequently unhelpful or chose not to respond. Of the papers ultimately assessed, 67% required modification to the published protocol even to be undertaken.
Much of this is presumably due to ineptitude with data, reporting, or methodological rigour rather than anything untoward. But not every questionable result in cancer research can be considered so benign. Searching for ‘cancer’ on the Retraction Watch database, a repository of retracted scientific papers, lists 2874 retracted cancer papers to June 2024, the majority of which were pulled from the scientific record not for innocent error but for classic ‘fraud’ in the form of falsification, fabrication, and plagiarism (FFP). The last two years alone yield some deeply unedifying reports, from a slew of cancer research articles retracted in an authorship-for-sale scheme [11] to the federal Office of Research Integrity sanctioning researchers ranging from faculty members [12, 13] to cancer centre directors [14] for “recklessly reporting falsified and/or fabricated data”.
Despite the life-or-death importance of cancer science to patients, spurious results are alarmingly ubiquitous, causing harm by misleading research efforts down unfruitful avenues. This does not just create research waste; in the extreme, it can render research projects and even entire fields unsustainable endeavours. In this perspective, we elucidate some of the practices that render results unreliable, the factors that might make research non-replicable, and how we might strive towards a more sustainable research climate. Some of the factors leading to non-sustainability in research are shown in Fig. 1 and discussed herein.

Fig. 1: A non-exhaustive overview of some of the factors that contribute to research waste.
Torturing the data until it confesses: P-Hacking
The economist Ronald Coase famously quipped that if you torture the data enough, it will confess to anything. This is true across biomedical science, with inappropriate statistical analysis lying at the heart of many spurious results, and cancer science is no exception. P-hacking [15], or data dredging, is the misuse of data analysis to create a misleading impression. In the context of cancer research, where null-hypothesis significance testing is most often used to probe hypotheses, this typically manifests as artificially inflating a result so that it appears significant at some threshold, usually the conventional but profoundly arbitrary level of α = 0.05.
This threshold stems from a suggestion by Sir Ronald Fisher almost a century ago, who in 1925 proposed this cut-off as a heuristic above which results would be ignored, with the caveat that it was only a rule of thumb and should be adapted as the situation dictates [16]. Unfortunately, a misguided fixation on p-values has allowed serious misinterpretations and false-positive results to creep into the literature [17]. P-values are measures of neither effect size nor clinical significance, and in recent years many journals have suggested centring reporting on measures like effect size, which are both more clinically relevant and more readily interpretable. Abuses of p-values, however, persist [18].
Creating an illusion of an effect is disconcertingly trivial. Take, for example, an experimenter reporting that compound X demonstrated superior anticancer properties compared to a control agent. This might stir excitement, but any unbiased interpretation pivots largely on the degree to which the experimenter has been forthcoming. If the scientist had in reality tested n agents against the control, the probability of obtaining at least one false positive for an apparent difference follows from the binomial distribution, \(p_{f}(n) = 1 - (1 - \alpha)^{n}\). Had they tested 5 agents with no true efficacy, the probability of obtaining at least one false positive in the set would be approximately 22.6%, not the 5% the alpha value might imply.
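To make the arithmetic concrete, the following minimal Python sketch (purely illustrative, not drawn from any cited study) computes this family-wise false-positive probability for a few values of n:

```python
# Minimal sketch: probability of at least one false positive across n
# independent comparisons, p_f(n) = 1 - (1 - alpha)^n. Values are illustrative.
def familywise_false_positive_rate(n_tests: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 10, 20):
    print(f"{n:>2} comparisons: {familywise_false_positive_rate(n):.1%}")
# Five comparisons already yield ~22.6%, not the nominal 5%.
```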
This is a classic multiple comparison problem. It is of course readily corrected when reported honestly; a diligent experimenter might report that 5 comparisons were undertaken, employing a Bonferroni correction, Šidák correction, Benjamini–Hochberg procedure, or similar method to reduce false positives [18, 19]. An experimenter using a Bonferroni correction would, for example, arrive at a cut-off of \(\alpha_{B} = 0.05/5 = 0.01\) as the threshold for apparent significance. But an experimenter who obtains a value of \(p \le 0.02\) for the anticancer efficacy of compound X relative to control may be tempted to jettison information about any multiple comparisons they undertook, so that a spurious finding might be presented as meaningful. This is especially easy in preclinical cancer work, given that most early-stage laboratory work is not preregistered.
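As an illustration of how such corrections behave, here is a minimal sketch using five hypothetical p-values (the values and the use of the statsmodels library are assumptions for demonstration, not drawn from the works cited):

```python
# Sketch: adjusting five hypothetical p-values for multiple comparisons.
# The first value plays the role of compound X with a nominal p = 0.02.
from statsmodels.stats.multitest import multipletests

p_values = [0.02, 0.30, 0.45, 0.08, 0.61]  # invented for illustration

for method in ("bonferroni", "fdr_bh"):  # Bonferroni and Benjamini-Hochberg
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in p_adjusted], list(reject))

# Under Bonferroni the per-test threshold is 0.05 / 5 = 0.01, so the nominal
# p = 0.02 for compound X is no longer significant once all five comparisons
# are disclosed.
```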
Away from prying eyes, post-hoc analysis of already-gathered data can allow misguided researchers to find a particular effect in an exploratory sweep of the data, and then present it as if it were an a priori hypothesis. This practice is called HARKing [20, 21] – hypothesising after the results are known. There are myriad equally inept or cynical ways to achieve an illusion of an effect. One can, for example, engage in redaction bias [19], selectively jettisoning data unsupportive of a favoured hypothesis until the remaining data have been sufficiently curated to support it, or simply dichotomise continuous variables at some arbitrary cut-off point to create an impression of a real effect where none exists [22,23,24,25,26,27,28].
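A short simulation on entirely synthetic data (an assumption for illustration, not a reanalysis of any cited work) shows how hunting over arbitrary cut-offs on a continuous variable can manufacture apparent significance:

```python
# Sketch with synthetic data: a continuous biomarker and an outcome that are
# truly independent, dichotomised at a range of candidate cut-off points.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
biomarker = rng.normal(size=200)   # bears no real relationship to the outcome
outcome = rng.normal(size=200)

p_values = [
    stats.ttest_ind(outcome[biomarker > cut], outcome[biomarker <= cut]).pvalue
    for cut in np.quantile(biomarker, np.linspace(0.2, 0.8, 13))
]
print(f"smallest p-value over 13 candidate cut-offs: {min(p_values):.3f}")
# Reporting only the most favourable cut-off makes a null effect cross the
# 0.05 threshold far more often than the nominal 5% error rate implies.
```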
Unplanned subgroup analysis, common in clinical trials, can also in some cases amount to a special case of HARKing. The justification for performing these analyses is usually an attempt to extract the maximum amount of information from a clinical trial. But such analyses are intrinsically prone to multiplicity, with multiple comparisons nudging the probability of a false-positive result monotonically upwards [29]. There are many instances where an ostensible effect appears in one subgroup due solely to the vagaries of chance, and for this reason unplanned subgroup analyses should at most be considered exploratory or hypothesis-generating rather than confirmatory; as other authors have noted, “Unfortunately, however, they are too often overinterpreted or misused in the hope of resurrecting a failed study” [30].
Moving the goalposts – outcome switching and surrogate outcomes
Spurious findings are not solely a preclinical problem; they have metastasised even into clinical trials. Despite recent worldwide moves towards preregistration of outcomes, and initiatives like the Consolidated Standards of Reporting Trials (CONSORT) statement for randomised controlled trials (RCTs), which includes pre-specification of the clinical outcomes of interest, switching of these outcomes remains prevalent in clinical oncology. A 2023 cross-sectional study in JAMA Network Open [31] examined 755 phase 3 randomised clinical trials in cancer registered on ClinicalTrials.gov, noting that primary end point (PEP) changes were implemented after trial initiation in 19% (n = 155) of these studies. More concerningly, 70% of these outcome-switched trials (n = 102) failed to report this substantial change in the subsequent publication.
In RCTs with outcome switching, researchers frequently failed [32] to acknowledge the fundamental problems such switching creates for interpretation. This in effect constitutes a similar problem to HARKing, in so much as it allows the hypothesis under investigation to be changed after the results have been seen. It has the same intrinsic potential to bias results, because outcomes are no longer based on pre-specified a priori hypotheses but on data that have already been collected and usually seen, with the net result that the integrity of the study is compromised and the risk of reporting a false-positive result increased.
Interpretation of oncology trials is further complicated by the widespread use of surrogate outcomes [33]. While overall survival (OS) might be the gold standard for gauging the efficacy of any anti-cancer agent, alternative endpoints like progression-free survival (PFS) or even tumour volume are often invoked in its place. The rationale is that while OS can take many years to evaluate, proxies like PFS can in theory allow far more rapid interpretation [34]. But the allure of such substitutes is undermined by the troubling reality that, in many instances, these proxies have only weak relationships [35] with the variable of interest and can in practice be utterly misleading. A recent analysis [36] of FDA approvals for oncology drugs between 2005 and 2022 found that only one of fifteen such studies demonstrated a strong relationship between its surrogate outcome and OS. Nor can PFS always be said to be a good proxy for quality of life, and its usage here is frequently misleading to patients [37].
The troubling reality is that surrogate outcomes in oncology are increasingly common [38] and frequently over-estimate the relationship [39] between the surrogate and OS. In the USA, about 70% of oncology drug approvals arise from RCTs with surrogate outcomes [40] for OS, with a similar figure of roughly two-thirds for cancer drug approvals in Japan [41]. But while commonly cited as evidence, PFS correlates poorly with OS for several cancers [42, 43]. This has had regulatory and patient consequences – in 2008, the FDA fast-tracked approval of bevacizumab for metastatic breast cancer on the basis of PFS improvements, only to withdraw this approval when it failed to demonstrate any benefit in survival or patient quality of life [44]. While the FDA has long allowed PFS as a metric for assessment even when confirmatory trials do not materialise, recent re-evaluations of its applicability to OS and the deluge of trials employing it for regulatory approval have led the agency to pivot back towards the gold standard of OS [45, 46].
Outright fabrication
Most damningly, outright fraud exists in both preclinical and clinical oncology research. There are many forms this can take, but perhaps most concerning has been the staggering rise of inappropriate image manipulation. A 2016 study [47] of 20,261 papers, led by scientific image sleuth Dr Elisabeth Bik, found problematic image manipulation in 3.8% of all biomedical science papers screened, and documented a marked rise of the practice in the prior decade. These are hard to explain away as innocent mistakes, given the conscious effort required to edit images, but they have become worryingly prevalent in cancer research. This year has already witnessed several image-manipulation scandals, including one that saw several high-profile cancer research papers from the Harvard-affiliated Dana-Farber Cancer Institute retracted for research misconduct [48], with many similar cases listed in the Retraction Watch rogues’ gallery.
Plagiarism and fabrication, too, are substantial problems – the explosion of scientific publishing worldwide in recent years has enabled a cottage industry of predatory publishers, paid-for authorships, and fake peer review. In one appalling 2017 case, a single cancer journal retracted 107 papers involving more than 500 scientists (predominantly Chinese clinicians) after it was found that the peer review underpinning their acceptance had been compromised [49]. Unethical scientists can steal the research of others, manipulating it just enough to evade detection before submitting it to a different journal. One tell-tale sign of this is the use of tortured phrases, often nonsensical dictionary-based reshufflings: in several dubious papers, “breast cancer” has become “bosom peril” [50], with the even more bizarre “buttcentric waterway” substituted for “anal canal” [51].
These automated paraphrasing tools are readily found online, with one 2022 analysis finding papers from India and China most likely to engage in the practice [50]. The rise of artificial intelligence (AI) has also proved a boon for those willing to engage in research misconduct, made all the more worrying by the tendency of current AI platforms to confidently hallucinate utterly fictitious papers and claims, presented in a convincing manner [52]. In the past year, a number of scientific publications have appeared with clear evidence of undisclosed AI generation [53], their authors careless enough to leave in incriminating phrases like “As an AI language model”. This makes the problem of fraud detection much harder, and the option of fabrication much more appealing to scientists devoid of ethics.
Why research waste dominates
The three elements outlined above are far from an exhaustive list of the factors that undermine the reliability and replicability of cancer research. A deeper question concerns why they are so prevalent, and what motivates the proliferation of dubious findings. There are several reasons, a non-exhaustive list of which would include:
- Publish-or-perish pressure: Publications are effectively the currency of academia, and prolific publication is seen as a proxy for productivity. Scientists and clinicians are also under intense pressure to acquire funding, the acquisition of which often depends on a seemingly impressive research portfolio [54]. With career advancement so intrinsically linked to publications, this sets up a perverse incentive for spurious publishing. Such pressure is not only directly linked to the dominance of subpar research but has the unintended consequence of rewarding experimenters who are insufficiently cautious against false positives or, worse again, those willing to engage in research misconduct. This becomes even more pronounced in countries where career advancement is explicitly entangled with an arbitrary publication metric. China is a grim exemplar of this, where state targets for expected publications have led to a deluge of dubious outputs [50].
- Publication bias: Journals bear a great deal of responsibility for the current crisis, given their fixation on “novel” significant findings [54, 55]. High-impact journals are overwhelmingly more likely to publish a seemingly positive finding than a null result, with scant regard for the underlying methodological rigour of either. On the face of it this is absurd – it is equally vital to know that an oncology drug doesn’t work as to know that it does, but only the latter is well rewarded. A spurious significant finding of phantom efficacy for a drug is likely to be published even if the research quality underlying that result is underwhelming, whereas a diligently conducted null result appeals much less to journals. This also lends itself to the ‘file drawer’ problem [56], where null results are stored away and never published, despite their critical importance. Modelling suggests that this publication bias actively skews the publication record and creates research waste, as well as impeding the self-correcting nature of science [55].
- Pathological science: The scientists’ version of confirmation bias is unfortunately prevalent in research. In pathological science [57], experimenters are biased towards certain beliefs or pet hypotheses, inadvertently or even deliberately distorting the evidence to buttress this delusion. While in principle science concerns itself with the dispassionate appraisal of falsifiable ideas, scientists are human, and ideological beliefs or wishful thinking can often lead them to overstate a result that resonates with their hopes or reject one that conflicts with them.
Striving towards sustainable research
The great physicist Richard Feynman said of the scientific method that “the first principle is that you must not fool yourself – and you are the easiest person to fool”. In a 1974 commencement address to Caltech students [58], Feynman explained that scientific endeavour requires scientists not only to report all the details that support their interpretation, but also anything that might undermine or contradict it, saying:
“Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can—if you know anything at all wrong, or possibly wrong—to explain it. If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it, as well as those that agree with it. … In summary, the idea is to try to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgement in one particular direction or another.”
This spirit of transparency is vital for sustainable science, and there are several ways scientists can strive towards it, including:
- Preregistration of protocols and adherence to reporting guidelines.
- Sharing of raw data when possible, and all analysis code.
- Independent checking of statistical protocols prior to experimentation.
- Clear reporting of effect sizes and biological rationale, with p-values in context (see the sketch after this list).
- Publication of null results as well as significant findings.
- Clear elucidation of research limitations and alternative explanations.
- Partaking in replication efforts.
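As a minimal sketch of the effect-size point above (entirely hypothetical data, using standard SciPy and NumPy routines rather than any specific study), the same comparison can be reported as an estimate with uncertainty alongside the p-value:

```python
# Sketch: reporting an effect size and confidence interval alongside the p-value
# for two hypothetical treatment arms (all values invented for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=1.00, scale=0.25, size=30)  # e.g. relative tumour volume
treated = rng.normal(loc=0.85, scale=0.25, size=30)

t_stat, p_value = stats.ttest_ind(treated, control)

# Cohen's d using a pooled standard deviation
pooled_sd = np.sqrt((np.var(treated, ddof=1) + np.var(control, ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd

# Approximate 95% CI for the difference in means via the standard error
diff = treated.mean() - control.mean()
se = np.sqrt(np.var(treated, ddof=1) / treated.size + np.var(control, ddof=1) / control.size)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"difference = {diff:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f}), "
      f"Cohen's d = {cohens_d:.2f}, p = {p_value:.3f}")
```

Reporting the estimate and its uncertainty in this way gives readers the information needed to judge clinical relevance, rather than a bare declaration of significance.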
But perhaps greater still is the impact that journals and academic institutions themselves can have on the sustainability of research. Modelling studies suggest that removing publication bias would markedly improve the trustworthiness of published research, provided null results are rightly valued as much as significant findings. This is something journal editors should be mindful of, as diligent null findings are absolutely crucial for research sustainability. The inescapable reality is that the publish-or-perish mantra most scientists toil under is supremely counterproductive, the ostensible productivity it measures being a poor metric for scientific contribution. Rather than the tired old platitude of insisting that more studies are needed, it might be more accurate to say that fewer but better studies are needed. Rewarding diligent and reproducible research would drastically improve the trustworthiness of cancer science, shatter the barriers to lasting sustainable research efforts, and ultimately benefit patients far more than our current approaches.