Most ankle sprain research is either false or clinically unimportant: A 30-year audit of randomized controlled trials

2021-10-09 11:26 Chris Bleakley, Mark Matthews, James Smoliga
Journal of Sport and Health Science, 2021, Issue 5

Chris M. Bleakley, Mark Matthews, James M. Smoliga

a School of Health Science, Ulster University, County Antrim, Northern Ireland, BT37 0QB, UK

b School of Sport, Ulster University, County Antrim, Northern Ireland, BT37 0QB, UK

c Department of Physical Therapy, Congdon School of Health Science, High Point University, High Point, NC 27268, USA

Abstract Background: Lateral ankle sprain is the most common musculoskeletal injury. Although clinical research in this field is growing, there is a broader concern that clinical trial outcomes are often false and fail to translate into patient benefits. Methods: We audited 30 years of experimental research related to lateral ankle sprain management (n=74 randomized controlled trials) to determine if reports of treatment effectiveness could be validated beyond statistical certainty. Results: A total of 77% of trials reported positive treatment effects, but there was a high risk of false discovery. Most trials were unregistered and relied solely on statistical significance, or lack of statistical significance, rather than on interpreting key measures of minimum clinical importance (e.g., minimal detectable change, minimal clinically important difference). Conclusion: Future clinical trials must adopt higher standards of reporting and data interpretation. This includes consideration of the ethical responsibility to preregister their research and interpretation of clinical outcomes beyond statistical significance.

Keywords: Ankle sprain; False discovery; MCID; MDC

1.Background

Lateral ankle sprain (LAS) is the most prevalent musculoskeletal injury in physically active populations.[1] Although often considered innocuous, LAS has the highest re-injury rate across all lower-limb musculoskeletal injuries,[2] and the annual cost associated with sports-related ankle sprain in the Netherlands is estimated at EUR 187,200,000.[3] LAS also occurs frequently in the general population, with large cohorts suffering chronic problems;[4] indeed, 30%-75%[5,6] of the general population develops a clinical condition known as chronic ankle instability (CAI), characterized by recurrent injury and self-reported instability.[5] The long-term costs associated with LAS and CAI are significant[7,8] and relate to lower quality of life,[9] physical inactivity,[4] and an increased risk of post-traumatic ankle osteoarthritis.[5,10-13]

Randomized controlled trials (RCTs) are currently considered to be the gold standard methodology for determining treatment superiority.[14] The first RCT involving acute LAS was published in 1972.[15] The Physiotherapy Evidence Database (PEDro) now archives more than 150 RCTs involving patients with LAS or CAI, and a 2017 meta-evaluation[16] in this field included 46 systematic reviews. Having access to high volumes of experimental research should improve the quality of healthcare, but there is much concern that many clinical trial outcomes are either false[17,18] or fail to translate into clinical benefits for patients.[19] False discovery in science (e.g., erroneously claiming that a treatment is effective) often occurs due to overreliance on frequentist reasoning and p value thresholds,[20] a problem further compounded by unplanned multiple testing, selective reporting, and confirmation bias.[21]

Recently we introduced a 4-point checklist, FAIR (false-positive risk, a priori registration, important clinically, replication), which aims to validate experimental research beyond statistical certainty.[22] The checklist assesses the following criteria: (1) false-positive risk (FPR), which is “the probability of observing a statistically significant p value and declaring that an effect is real, when it is not”;[23] (2) a priori registration, which is essential for controlling the “degrees of freedom” researchers have during data analysis and reporting,[21] thereby reducing the risk of false-positive findings; (3) clinical importance, whereby the magnitude of treatment effect is compared to relevant minimal detectable change (MDC) and minimal clinically important difference (MCID)[24] data; and (4) replication, which should underpin all scientific discovery.

Evidence-based healthcare relies on the production of valid experimental data that translates into clinical benefits. This review examines the validity of conclusions from 30 years of clinical trials into the most common musculoskeletal injury among physically active populations and its resulting condition: LAS and CAI. Our primary objective was to examine the extent to which reports of treatment effectiveness in this field could be validated beyond statistical certainty. The FAIR checklist[22] was applied, with higher validity placed on trials presenting with low FPR, pre-registration, treatment effect magnitudes that exceeded relevant MDC and MCID values, and the corroboration of treatment effectiveness through independent replication.

2.Methods

2.1.Trial selection

Review methods aligned with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).[25] Electronic searching was undertaken independently by 2 authors (CMB, MM) on MEDLINE and PEDro.[26,27] In MEDLINE, we undertook a broad search strategy based on MeSH terms (“ankle” AND “RCT”), and we used the PEDro search interface to run 3 separate searches for clinical trials using the terms “ankle sprain”, “chronic ankle instability”, and “CAI”. Citation tracking was also undertaken using a recent meta-evaluation.[16] To be eligible for inclusion, trials had to meet the following criteria: (1) the studies had to be designed as RCTs, (2) participants with LAS and/or CAI had to have received at least 1 conservative treatment intervention, and (3) study assessments had to include at least 1 clinically relevant outcome measure (e.g., pain, function, range of motion, strength, and balance). Trials were excluded if they involved any surgical intervention. No restrictions were placed on injury severity, participant demographics, or follow-up duration. We did not include RCTs using more than 2 treatment arms, equivalency or non-inferiority trials, pilot trials, or trials published prior to 1990. Any disagreements in trial selection were resolved through consensus with a third reviewer (JMS).

2.2.Data extraction and analysis

PICO (population, intervention, comparison, outcome) characteristics were extracted from the full text of all eligible trials, in addition to aims and hypothesis, the number of participants, follow-up time points, and the total number of between-group statistical comparisons undertaken. Included trials were then classified as being either statistically significant or null. A statistically significant trial was defined as a trial having a p value of less than 0.05 in the trial results tab for any clinical outcome.[28] We also calculated the proportion of between-group comparisons that resulted in statistically significant findings within each individual trial and whether they were recorded in primary or secondary outcome measures. When trials included multiple outcome measures but did not clearly specify a primary outcome, the primary outcome was determined by the authors based on the nature of the research question and the following definition of a primary outcome: “a specific key measurement(s) or observation(s) used to measure the effect of experimental variables in a trial”.[29] The FAIR checklist[22] was applied as follows.

2.2.1.False-positive risk

Calculation of FPR followed methods used in a previous research audit in this field.[30] FPR calculation is a special case of Bayesian analysis. It allows the p value to be supplemented by a single number that gives a much better idea of the strength of the evidence than a p value alone.[23] We calculated FPR for all trials reporting a statistically significant finding from their primary outcome. All FPR calculations were performed using the False Positive Risk Web Calculator and the following data: the n of participants in each group, a relevant p value, and the corresponding effect size (Hedges g).[31] Further details of the analysis script and simulated examples of FPR calculations can be found in Colquhoun’s recent articles.[20,23] If a trial reported a p value threshold such as p < 0.05 rather than an exact p value, we assumed that the p value was 1 decimal place below the threshold value (e.g., p < 0.05 was input as 0.049). The calculation of FPR also requires an estimation of the prior probability that there is a real effect (P(H1)) for a given treatment. In all trials, we initially assumed that P(H1) was 0.5; that is, treatment interventions had a 50:50 chance of a (positive) real effect before the experiment was done.[18,20] In all cases, FPR estimations were calculated using the p-equals method because our aim was to interpret a single p value from a single experiment (rather than trying to estimate the long-term error rate).[31] Descriptive statistics were used to determine the median FPR and the number (%) of statistically significant p values associated with FPR less than 5%.
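To make the p-equals idea concrete, the following is a minimal sketch rather than the web calculator used in the audit. It assumes a normal (z-test) approximation in place of the t distribution, two groups of equal size, and a hypothetical function name (`false_positive_risk`). It compares how probable the observed test statistic is under the null hypothesis with how probable it is under a real effect of the stated size, then weights both by the prior.

```python
import math
from statistics import NormalDist


def false_positive_risk(p_value, n_per_group, effect_size, prior=0.5):
    """Approximate p-equals false-positive risk (normal approximation).

    Assumptions (not from the audit): z-test approximation, equal group
    sizes, and a standardized effect size such as Hedges g.
    """
    if not 0 < p_value < 1 or not 0 < prior < 1:
        raise ValueError("p_value and prior must lie strictly between 0 and 1")
    std_normal = NormalDist()
    # Test statistic that corresponds exactly to the observed two-sided p value.
    z_obs = std_normal.inv_cdf(1 - p_value / 2)
    # Expected value of the test statistic if the effect is real.
    shift = effect_size * math.sqrt(n_per_group / 2)
    # Density of a result this extreme under H0 (either tail).
    density_h0 = 2 * std_normal.pdf(z_obs)
    # Density of the observed statistic under H1 (opposite tail ignored as negligible).
    density_h1 = std_normal.pdf(z_obs - shift)
    weighted_h0 = (1 - prior) * density_h0
    weighted_h1 = prior * density_h1
    return weighted_h0 / (weighted_h0 + weighted_h1)


if __name__ == "__main__":
    # Hypothetical trial: "p < 0.05" entered as 0.049, 16 per group, g = 1.0.
    print(round(false_positive_risk(0.049, 16, 1.0), 3))
```

Under these assumptions, a result that only just reaches p < 0.05 carries a false-positive risk well above 5% even with a 50:50 prior, and the risk rises further as the prior probability of a real effect falls.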

2.3.A priori trial registration

We determined the number (%) of eligible trials reporting preregistration, defined as the trial protocol being publicly available within a trial registry (e.g., www.ClinicalTrials.gov) prior to the initiation of participant recruitment. In a secondary analysis, we used odds ratio (OR) and 95% confidence interval (95%CI) to determine whether the likelihood of reporting a statistically significant outcome was influenced by a priori trial registration.
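The secondary analysis above can be sketched as follows. The text does not state which interval method was used, so this example assumes the standard log (Woolf) method for a 2x2 table; the function name and the counts in the usage line are hypothetical and are not taken from the audit.

```python
import math


def odds_ratio_with_ci(sig_unreg, nonsig_unreg, sig_reg, nonsig_reg, z=1.96):
    """Odds ratio and 95% CI for significant vs. non-significant comparisons.

    Rows: unregistered vs. registered trials.
    Columns: significant vs. non-significant between-group comparisons.
    Uses the log (Woolf) method, which is an assumption of this sketch.
    """
    odds_ratio = (sig_unreg * nonsig_reg) / (nonsig_unreg * sig_reg)
    se_log_or = math.sqrt(
        1 / sig_unreg + 1 / nonsig_unreg + 1 / sig_reg + 1 / nonsig_reg
    )
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(odds_ratio_with_ci(20, 80, 10, 90))
```

An interval that excludes 1 indicates that the odds of a significant finding differ between unregistered and registered trials at the chosen confidence level.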

2.4.Clinical importance

Initially, we determined the number (%) of trials that referenced or reported MDC and/or MCID values within the full-text manuscript. When enough data were available, we calculated the mean differences and 95%CI for each clinical outcome, where the mean difference = mean_experimental - mean_control. The mean difference (95%CI) data were then compared to corresponding MDC and MCID data. If a trial did not report MDC or MCID data for a particular outcome, we searched the literature for relevant figures and input them. MDC was set at 95%CI and considered to be “the amount of change that must be observed before it is considered above the bounds of measurement error”.[32] The MCID was considered to be “the smallest change that would be important to patients” and could have been quantified by externally referenced (anchor) or internally referenced (distribution) methods.[33]
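The comparison described above can be illustrated with a short sketch. It assumes group means, standard deviations, and sample sizes are available, and it uses a normal-approximation interval (z = 1.96) with unpooled variances; the function names and all numbers in the usage example are hypothetical.

```python
import math


def mean_difference_ci(mean_exp, sd_exp, n_exp, mean_ctrl, sd_ctrl, n_ctrl, z=1.96):
    """Mean difference (experimental minus control) with an approximate 95% CI."""
    diff = mean_exp - mean_ctrl
    # Standard error of the difference between two independent means.
    se = math.sqrt(sd_exp ** 2 / n_exp + sd_ctrl ** 2 / n_ctrl)
    return diff, diff - z * se, diff + z * se


def compare_to_thresholds(diff, mdc, mcid):
    """Flag whether the magnitude of the effect exceeds MDC and MCID."""
    magnitude = abs(diff)
    return {"exceeds_mdc": magnitude > mdc, "exceeds_mcid": magnitude > mcid}


if __name__ == "__main__":
    # Hypothetical function scores (0-100 scale) and hypothetical thresholds.
    diff, lower, upper = mean_difference_ci(80, 10, 25, 72, 10, 25)
    print(diff, round(lower, 2), round(upper, 2))
    print(compare_to_thresholds(diff, mdc=5, mcid=9))
```

In this made-up example the effect is larger than measurement error (MDC) but smaller than the change patients would consider important (MCID), which is the pattern the audit was designed to detect.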

2.5.Replication

PICO criteria were compared across trials. If possible, homogeneous trials were subgrouped, and their trial effects (magnitude and direction) were compared in order to screen for successful replication.

3.Results

We screened 1098 titles and abstracts (937 from MEDLINE and 161 from PEDro), with 169 selected for full-text review. A total of 74 RCTs were eligible for inclusion (Supplementary Data 1), with the remainder (n=95) excluded for the following reasons: RCTs with more than 2 treatment arms (n=45), no clinical outcomes (n=9), non-RCT (n=8), non-English language (n=8), surgical intervention (n=7), non-inferiority/non-equivalency (n=5), non-ankle sprain/non-CAI (n=5), and others (n=8) (Fig. 1). Trials included participants with either LAS (n=53 trials) or CAI (n=21 trials). In the included trials, the primary intervention involved external supports (n=17), exercise intervention (n=27), pharmacotherapy (n=14), manual therapy (n=9), electro-physical agents (n=4), and other intervention (n=3). The mean sample size was n=85.1 (SD=96.8, range: 13-522), and 50% (37/74) of the trials reported using an a priori sample size calculation. Most sample size estimations included α (Type 1 error) and β (Type 2 error) levels of 5% and 20%, respectively, with the average effect size estimated at 0.7 (SD=0.45) a priori.

A total of 23% (17/74) of the RCTs were classed as null (no treatment effects reported). The remaining 77% (57/74) reported statistically significant findings from at least 1 outcome measure. We extracted an aggregate of 966 p values relating to between-group statistical comparisons involving primary or secondary outcomes, of which 35% (342/966) were statistically significant (p<0.05) (Fig. 2A). Most statistically significant findings were derived from secondary outcomes, with just 17% (58/342) derived from primary outcome measures (Fig. 2B). Of the 966 p values reported in the literature, only 11 (1%) represented statistically significant findings in a primary outcome measure reported from a pre-registered trial (Fig. 2C) (Supplementary Data 2).
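As a quick arithmetic check, the percentages in the paragraph above can be recomputed from the counts it reports. The snippet below is only a sketch of that check; the dictionary labels and function names are illustrative.

```python
# (numerator, denominator, percentage as reported in the Results text)
REPORTED = {
    "null trials": (17, 74, 23),
    "trials with at least 1 significant outcome": (57, 74, 77),
    "significant p values": (342, 966, 35),
    "significant p values from primary outcomes": (58, 342, 17),
    "significant, primary outcome, pre-registered": (11, 966, 1),
}


def percent(numerator, denominator):
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


def check_reported(reported):
    """Return, per label, whether the recomputed percentage matches the text."""
    return {
        label: percent(numerator, denominator) == stated
        for label, (numerator, denominator, stated) in reported.items()
    }


if __name__ == "__main__":
    for label, matches in check_reported(REPORTED).items():
        print(f"{label}: {'consistent' if matches else 'MISMATCH'}")
```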

Fig. 1. Flow diagram summarizing trial selection. CAI=chronic ankle instability; RCT=randomized controlled trial.

Fig. 2. Area plots subgrouping p values (n=966) by (A) level of significance, (B) primary outcomes, and (C) pre-registration. Each square represents ~10 p values generated from between-group comparisons. White squares=no statistical significance (p>0.05). Shaded squares represent: (A) statistically significant-primary or secondary outcomes; (B) statistically significant-primary outcomes only, any trial; and (C) statistically significant-primary outcomes, pre-registered trials only.

3.1.False-positive risk

Enough data were available to calculate effect sizes and FPR in 68% of trials (39/57) reporting significant effects (p<0.05) in their primary outcome. FPR is summarized in Fig. 3. The median FPR was 14% (range: 0.6%-100%), and 28% of trials (11/39) had an FPR<5% (Supplementary Data 3).

3.2.A priori trial registration

Only 19% (14/74) of trials were pre-registered. The average number of between-group comparisons reported across registered and unregistered trials was similar (12.8 (SD=9.0) vs. 13.3 (SD=10.9), respectively); however, unregistered trials were more likely to report p values <0.05 (OR=1.7, 95%CI: 1.2-2.4, p=0.004).

3.3.Clinical importance

Fig. 3. Violin plot summarizing false-positive risk in trials reporting significant (p<0.05) effects in their primary outcome.

Of the 57 trials reporting statistical significance, only 9% (5/57) made any reference to either MDC or MCID values. In an additional 16 trials, we were able to obtain relevant MDC and/or MCID values from the existing literature for the following outcome measures: (1) foot and ankle outcome measure,[34,35] (2) Cumberland ankle instability tool,[36] (3) lower extremity functional scale,[37] (4) isometric/isokinetic ankle strength,[38,39] (5) limb circumference/swelling,[40,41] (6) range of motion,[38,42] (7) postural control,[27] and (8) pain.[43] Effect magnitudes (mean difference) exceeded the respective MDC or MCID values in 12 trials and 7 trials, respectively. Effect magnitudes exceeded both MDC and MCID in just 3 trials (Supplementary Data 3).

3.4.Replication

Fig. 4 summarizes the number of trials meeting more than one of the FAIR criteria. Three trials were both pre-registered and reported a low FPR (<5%), and one of the pre-registered trials also reported a clinically important effect. No trial met all the following conditions: pre-registered, low FPR (<5%), and clear evidence that the magnitude of treatment effect exceeded both MDC and MCID values. There were no instances when a positive treatment effect was independently replicated.

4.Discussion

Fig. 4. Venn diagram illustrating the number of trials meeting more than 1 FAIR criterion. FAIR=false-positive risk, a priori registration, importance, replication.

Our study raises concerns that a large proportion of scientific research is based on false-positive, non-replicable conclusions.[17] Strategies known to reduce the risk of false discovery include mandatory trial registration,[21] FPR calculation,[20] and use of MDC and MCID values to determine if reported treatment magnitudes are clinically meaningful.[22,24] There is a dearth of empirical meta-research investigating the credibility of research practices in sport and exercise medicine research. Recent audits have highlighted a high propensity for questionable research practices (e.g., hypothesizing after the results are known or HARKing, cherry picking, p-hacking) in high-impact sport and exercise medicine journals,[44] and we have previously found a high risk of false-positive claims in the sports physiotherapy literature.[30]

Our meta-research study is the first to use a saturation of RCTs from a single field of musculoskeletal medicine. In the 74 trials that met our inclusion criteria, 77% reported statistically significant findings from at least 1 outcome measure. However, in most trials, data interpretation was limited to all-or-nothing null hypothesis significance testing, and most positive conclusions could not be validated beyond statistical certainty.

Only 19% of trials in the LAS/CAI research literature were pre-registered. Trial registration is now required as a condition of ethical approval,[45] and audits of clinical trials undertaken in other fields of medicine (cardiology, rheumatology, and gastroenterology) show better adherence to current guidelines.[46] One of our key findings was that unregistered trials had 70% higher odds of reporting statistical significance (OR=1.7, 95%CI: 1.2-2.4) than those that were registered a priori. Unregistered trials typically carry a higher risk of false discovery due to significance seeking, selective reporting of outcomes,[47] or HARKing.[21] In contrast, preregistration helps to control the “degrees of freedom” a researcher has during data analysis and reporting,[21] thereby reducing such risks. A related finding[47] was that out of the 342 statistically significant p values (<0.05) reported across trials, only 11 were generated from primary outcomes within preregistered trials. Consequently, the vast majority of statistically significant findings within the LAS/CAI evidence base are derived from secondary outcomes in unregistered trials and should therefore be considered exploratory or hypothesis generating.[21]

Measures of minimum clinical importance (MDC and MCID) are increasingly recognized as important thresholds for evaluating the efficacy of an intervention. However, the reporting of clinical significance is poor in RCTs involving patients with LAS or CAI, with just 9% of trials referring to MDC or MCID data. After extracting MDC and MCID for clinical outcomes relating to pain, function, instability, strength, and swelling, we were able to examine clinical efficacy in 21 trials; however, the results were disappointing, with 50% of the trials recording treatment effects that could not be differentiated from measurement error. Furthermore, in most trials, the treatment effects did not exceed relevant MCID figures and are therefore unlikely to be considered important by patients with LAS and CAI. An initial audit[48] of interventional research in the sports medicine literature found that MDC and MCID were considered in 53% and 40% of trials, respectively. However, a much larger audit of orthopedic literature found that only 7.5% of clinical science articles made reference to MCID.[24]

It is expected that musculoskeletal injuries are managed from an evidence-based perspective, whereby the best available evidence is integrated with patient preference, clinical expertise, and the clinical context. Because RCTs represent the gold standard methodology for determining treatment superiority, they have a considerable influence on the relevance of adopting an evidence-based framework when treating patients with LAS or CAI. Our results raise fundamental questions about the current value of evidence-based practice in this field and clarify that future clinical trials must adopt higher standards of reporting and data interpretation. Interestingly, there is a lack of robust clinical interpretation in other fields of medicine,[49] and continuing to rely solely on null hypothesis significance testing not only wastes research funding, but also erodes credibility and slows down scientific progress.[50] Although null hypothesis significance testing remains an important step for determining treatment effectiveness, it is most efficient in the context of long-run repeated testing.[50] We support the idea that p values should be supplemented with a formal estimation of the FPR,[18,31] which represents the idea that “the probability, in the light of the p value that you observe, you declare that an effect is real, when in fact, it isn’t”.[23] Although it is often assumed that the FPR is equal to the reported p value, they are different constructs and often vary considerably. Indeed, our audit shows that the median FPR associated with statistically significant findings (p<0.05) was 14% (range: 0.6%-100%), and only 28% of trials (11/39) had an FPR of less than 5%. These figures suggest that statistical significance alone is not a solid foundation for determining treatment effect, particularly when it is based on binary thresholds (p<0.05).

Higher validity of an RCT was assumed under the following conditions: the RCT was registered a priori, it had low FPR, and its treatment effects exceeded the MDC and MCID values. This list of conditions is not exhaustive, and we did not fully consider false discoveries related to multiple treatment arms, the analysis of multiple outcomes, or multiple analyses of the same outcome at different times.[51] We acknowledge that although preregistration increases the transparency and validity of trial conclusions, it is not a cure-all for efficient and accurate dissemination. Audits of RCTs at www.ClinicalTrials.gov show that approximately 20% of registered trials disseminate their results within one year of completion,[52] while others indicate that there is quite a high risk of discordance between the original registry data and the published data.[53]

We must also acknowledge that our FPR calculations were based on the assumption that the prior probability of effect was 50%, but it is likely that some trials were underpinned by more extreme hypotheses. In previous data simulations,[28] we have shown that a positive conclusion from an optimistic research question (i.e., a higher prior probability) is likely to be correct, whereas an unlikely hypothesis (where researchers are driven by pursuit of novelty) will have a much higher risk of false-positive reporting. Alternatives to FPR have been discussed by Colquhoun.[23] Perhaps the most clinically intuitive option is the use of a reverse Bayesian approach,[54] where the observed p value is used to calculate the prior probability required to achieve a specific or minimal FPR (e.g., 5%). This then allows the researcher to determine whether the calculated prior probability is plausible or not.[30] Finally, many latent constructs influence false discovery; these include a scientific culture that places the most value on statistically significant findings or novel discoveries.[21]
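The reverse Bayesian approach mentioned above can be sketched by rearranging the FPR formula to solve for the prior. The example below is illustrative only: it assumes a normal (z-test) approximation with two equal groups, and the function names and inputs are hypothetical rather than taken from the cited work.

```python
import math
from statistics import NormalDist


def likelihood_ratio(p_value, n_per_group, effect_size):
    """Likelihood of the observed statistic under H1 relative to H0 (p-equals)."""
    std_normal = NormalDist()
    z_obs = std_normal.inv_cdf(1 - p_value / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    density_h0 = 2 * std_normal.pdf(z_obs)      # either tail under H0
    density_h1 = std_normal.pdf(z_obs - shift)  # observed tail under H1
    return density_h1 / density_h0


def prior_required(p_value, n_per_group, effect_size, target_fpr=0.05):
    """Prior probability of a real effect needed to reach the target FPR.

    Derived from FPR = (1 - prior) / ((1 - prior) + prior * LR).
    """
    lr = likelihood_ratio(p_value, n_per_group, effect_size)
    return (1 - target_fpr) / ((1 - target_fpr) + target_fpr * lr)


if __name__ == "__main__":
    # Hypothetical trial: p entered as 0.049, 16 per group, g = 1.0.
    print(round(prior_required(0.049, 16, 1.0), 2))
```

If the prior returned here is implausibly high for the intervention in question, a marginally significant p value is weak grounds for claiming a real effect.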

5.Conclusion

Our audit indicates that there is a high risk of false-positive discovery in a core field of musculoskeletal research. A key concern is that most of the research in this field remains unregistered and relies solely on statistical significance, or lack of statistical significance, rather than on the interpretation of the magnitude of change. Researchers must heed the ethical responsibility of preregistering their research, and their interpretation of clinical outcomes must evolve beyond statistical significance.

Authors’ contributions

CMB conceived the audit, planned and carried out the review, extracted data, undertook much of the analysis, and drafted the original manuscript; MM assisted with the review, the analysis, and in writing the final manuscript; JMS conceived the audit, helped to plan and carry out the review, verified the analytical methods, and contributed to writing the manuscript. All authors have read and approved the final version of the manuscript, and agreed with the order of presentation of the authors.

Competing interests

The authors declare that they have no competing interests.

Supplementary materials

Supplementary material associated with this article can be found in the online version at doi:10.1016/j.jshs.2020.11.002.