Psychedelic-assisted treatments for psychiatric ailments look promising for appropriately screened samples (Illingworth et al., 2021; Luoma, Chwyl, Bathje, Davis, & Lancelotta, 2020; Romeo, Karila, Martelli, & Benyamina, 2020; Zeifman et al., 2022). Most reviews of clinical trials with psychedelics end with a clarion call for replications. Tacit hopes for these replications are numerous but often include repeating the same therapy with comparable clients to confirm ameliorative effects. If a psychedelic-assisted treatment repeatedly demonstrates greater efficacy than an alternative intervention, the relevant data would serve as a first step toward empirical support, much as evidence-based approaches have benefitted other forms of therapy and medicine (Sakaluk, Williams, Kilshaw, & Rhyner, 2019). But even within a single trial, outcomes vary. For instance, the occasional member of a treatment group often ends up worse off than some members of the control group. The public purportedly trusts experts to conclude that, in the long run and on average, those who complete a treatment improve more than others who did not receive it. But traditions around replication and null hypothesis testing might lead any therapist, client, or public official to expect more than a good treatment can deliver. Expecting every trial to reveal that psychedelic-assisted treatment is superior to reasonable controls, let alone alternative treatments, demands more than simple sampling error will permit. Maintaining reasonable expectations as these lines of research continue will be essential for establishing true efficacy. Data suggest that even authors of published research lack good intuitions about some of the difficulties inherent in replication, leading them to overestimate the probability that one study's statistically significant result will appear in a second experiment (Cumming, Williams, & Fidler, 2004).
Despite the promising reputation developing for psychedelic-assisted therapy during this new renaissance, many other publications lament a replication crisis in the social and medical sciences more broadly (see Amrhein, Trafimow, & Greenland, 2019; Ioannidis, 2005). Recent data also suggest a broader distrust of science among the public (Krause, Brossard, Scheufele, Xenos, & Franke, 2019). Meanwhile, many people with psychological ailments continue to suffer or find available treatments lacking (Earleywine & De Leo, 2020). Reasonable expectations for treatments and treatment outcome studies would help. A close look at available data and the natural variation in sampling suggests that a series of replications would prove informative. At the same time, expecting some experiments to fail to replicate initial promising, statistically significant results is entirely reasonable. Tolerating these moments of uncertainty and occasional failures to replicate might prove difficult, but a complete absence of replication failures would likely suggest larger problems associated with incomplete reporting or publication bias. In fact, a series of results completely free of replication failures might suggest that some investigators decided not to write up experiments that showed no differences in treatment outcomes, or that some editors did not accept those null findings for publication. Without infinite resources, we should expect some replication failures even when a treatment is effective, simply because of sampling error (Amrhein et al., 2019).
Part of the problem with replication might arise from the ritual surrounding P-values less than 0.05. Available randomized clinical trials of psychedelic-assisted therapy almost invariably examine the experimental treatment and a control, compare their average impacts, and report whether or not group differences reached statistical significance. Readers familiar with the spirited arguments about null hypothesis testing might relish this chance to repeat critiques of the practice that began nearly a century ago (Neyman & Pearson, 1928). But reconsidering what we do (and do not) call a replication might be in order. The question seems straightforward. Either the psychedelic-assisted treatment creates benefits or it does not. But small samples, including those in the trials already published, invariably produce more variable estimates of the population effect size. Incredible statistical power could help, as enormous samples provide more trustworthy answers. But resources are limited, and these trials already require tremendous time, effort, and cash.
With these issues in mind, those eager to answer questions about the efficacy of psychedelic-assisted treatments are left in a quandary about what does and does not qualify as a replication. Strict adherence to null hypothesis testing could prove time consuming and expensive. But trusting in clinical lore and subjective impressions feels like something other than science. One helpful idea concerns the prediction interval: the range of effect sizes a replication should be expected to produce, given the effect and sample size of an original experiment and a specified sample size for the replication. Any effect that falls within this interval could qualify as a replication, regardless of statistical significance. This approach proved helpful when psychology appeared to have a “replication crisis.” Although only 36% of findings appeared to replicate based on statistical significance, a close look at prediction intervals revealed that 77% of findings were consistent with the initial publications (Patil, Peng, & Leek, 2016). A focus on identifying effects within the interval, rather than on reaching statistical significance, could have comparable advantages for psychedelic-assisted treatments.
For example, a recent Phase III trial of MDMA-assisted treatment for post-traumatic stress disorder represents years of herculean work performed in multiple nations and reveals a promising effect size of d = 0.91 (Mitchell et al., 2021). The treatment itself makes sense. Alternative treatments lead to challenging drop-out rates and limited success. MDMA, administered over multiple sessions as part of a manualized treatment lasting 18 weeks, outperformed the placebo control group. This original experiment had 46 people receive MDMA and 43 serve as controls. The effect size, Cohen's d, is a standardized difference between the MDMA and control group means: literally, the difference between the means divided by the pooled standard deviation.
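The arithmetic behind Cohen's d is simple enough to sketch. The snippet below computes d from summary statistics; the group means and standard deviations are invented for illustration, not taken from the published trial:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference: (mean1 - mean2) / pooled SD."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical symptom-score summaries (not the trial's actual data):
# both groups share SD = 2, so a 2-point mean difference yields d = 1.0
d = cohens_d(mean1=10.0, sd1=2.0, n1=46, mean2=8.0, sd2=2.0, n2=43)
print(round(d, 2))  # 1.0
```

Because d divides by the pooled standard deviation, it expresses the group difference in standard-deviation units, which is what allows comparisons across trials that use different outcome scales.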
Any team eager to replicate these results might initially plan to use the same sample sizes present in the original experiment. Understandably, they might wonder what sort of effect size they could anticipate if their results came from the same population as the original experiment, given the variation expected from sampling error. Note that anticipating the same statistical significance could be unrealistic. Ideally, the replicating team would shift from expecting another P value below 0.05 to a focus on identifying an effect within the prediction interval. The replication team (and consumers of their research) would essentially ask how different their d could be from the original d if the underlying effect really were identical and the two estimates differed only through the natural variation we expect from sampling.
Thanks to a freely available Shiny app (https://replication.shinyapps.io/dvalue/) and simple calculations (Spence & Stanley, 2016), we learn that this interval includes all the values from d = 0.29 to 1.53. That is, if we assume a replication experiment with sample sizes identical to the original ones, the 95% prediction interval ranges from 0.29 to 1.53. The replication d has a 95% chance of falling into this interval, assuming sampling error alone (note that the 5% of results falling outside this range could still stem from sampling error). This range might appear vast. Cohen's (1992) classic work on power and effect sizes suggests that 0.29 is between “small” and “medium,” but 1.53 is much larger than the 0.8 considered “large.” The original sample size is fixed, but a team might imagine a bigger N for the replication. With 50 per group, the 95% prediction interval would be a little narrower, ranging from 0.31 to 1.51. But the returns for a larger sample size diminish. With 500 participants per group, the interval still runs from d = 0.45 to 1.36. No research team is at fault here. These are the simple vagaries of sampling error. The variation and sample size within the original study leave the replication team with a potentially daunting task. In a sense, any replication that falls into this interval could be considered “a success” (Patil et al., 2016). That is, any result in this range is a reasonable expectation for a replication if this second sample differed from the original only because of sampling error.
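The interval itself follows from combining the sampling variances of the original and replication effect sizes. Here is a minimal sketch of that logic using a standard approximation to the variance of d and a normal critical value; the Shiny app may use a t-based variant, so treat this as the logic rather than its exact code:

```python
import math

def var_d(d, n1, n2):
    # Common approximation to the sampling variance of Cohen's d
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def prediction_interval(d, n1, n2, m1, m2, z=1.959964):
    """95% prediction interval for a replication's d, in the spirit of
    Spence & Stanley (2016): original (n1, n2) and replication (m1, m2)
    sampling variances add, because both estimates wobble."""
    half_width = z * math.sqrt(var_d(d, n1, n2) + var_d(d, m1, m2))
    return d - half_width, d + half_width

# Same sample sizes as the original trial: 46 MDMA, 43 control
lo, hi = prediction_interval(0.91, 46, 43, 46, 43)
print(round(lo, 2), round(hi, 2))  # 0.29 1.53

# A slightly larger replication (50 per group) narrows it only modestly
lo, hi = prediction_interval(0.91, 46, 43, 50, 50)
print(round(lo, 2), round(hi, 2))  # 0.31 1.51
```

Because the original study's variance term never shrinks, even a 500-per-group replication leaves a wide interval, which is the diminishing-returns pattern described above.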
In contrast, demands for identical statistical significance could be excessive. Any replicating team might dream of publishing the standard statistically significant result that appeared in the previous trial. They might sit down with power tables to see what they would need to reach the sacred P < 0.05 for the same dependent variable used in the original experiment. If they were only to accept a statistically significant result, even one-tailed (given that the therapy should do better than the control), they might want to specify power up front. Assuming a Type I error rate of 0.05, the knee-jerk response is often to set power to 0.80. But this plan literally means that 20% of the time, a true effect will go undetected. The team might prefer something more definite, like power of 0.99. But the results for this range of effect sizes are humbling.
For the high end of the prediction interval (d = 1.53), power of 0.99 with an alpha of 0.05 (one-tailed) requires a mere 15 participants per group. But trusting an effect size estimate from such a small sample seems ill-advised given that the Phase III trial was almost three times as large. Should the effect exactly replicate the Phase III trial's d = 0.91, 39 participants per group would provide power of 0.99. But the lower bound of 0.29 would require 376 per group to reach power of 0.99, a per-group count more than four times the total sample of the original experiment (Faul, Erdfelder, Buchner, & Lang, 2009). Demanding statistical significance could require dramatically more resources than focusing on the prediction interval. The time to recruit, screen, and complete the process for 752 people (376 for the treatment and 376 controls) could take many years. Many people suffering from PTSD could miss out on an efficacious treatment while waiting for these data to appear.
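Those G*Power figures can be approximated with the textbook normal formula for a two-sample comparison, n per group ≈ 2((z_alpha + z_beta) / d)². The sketch below reproduces the numbers for d = 0.91 and d = 0.29; G*Power's exact noncentral-t routine runs slightly higher for very small samples, giving 15 rather than 14 per group at d = 1.53:

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.99):
    """Approximate per-group n for a one-tailed, two-sample comparison.
    Normal approximation; exact noncentral-t calculations (as in G*Power)
    give slightly larger answers when n is small."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)   # one-tailed critical value (1.645 at alpha = .05)
    z_beta = z(power)        # quantile for desired power (2.326 at .99)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (1.53, 0.91, 0.29):
    print(d, n_per_group(d))  # 14 (G*Power: 15), 39, and 376 per group
```

The formula makes the cost of the lower bound vivid: required n scales with 1/d², so halving the anticipated effect size roughly quadruples the sample.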
Alternatively, attempts at replication could focus on sample sizes comparable to those used to demonstrate efficacy for other treatments of the same or comparable problems, and on the prediction interval rather than statistical significance. For example, a meta-analytic review of psychological therapies for PTSD examined over 100 trials with an average sample size much like the one used for this Phase III MDMA study (see Lewis, Roberts, Andrew, Starling, & Bisson, 2020). A replication using the same sample sizes as the original trial, and a hypothesis involving the prediction interval, might well be within reach. In contrast, a demand for a minuscule P-value risks abandoning a treatment simply because random error related to sampling got in the way. Resources are too valuable and the need for treatment too dire to ignore prediction intervals as an alternative approach. A focus on prediction intervals, rather than P-values, would help establish psychedelic-assisted therapies as empirically validated treatments without risking dismissal of their efficacy because of sampling error.
Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication. The American Statistician, 73(sup1), 262–270.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
Cumming, G., Williams, J., & Fidler, F. (2004). Replication and researchers' understanding of confidence intervals and standard error bars. Understanding Statistics, 3(4), 299–311.
Earleywine, M., & De Leo, J. (2020). Psychedelic-assisted psychotherapy for depression: How dire is the need? How could we do it? Journal of Psychedelic Studies, 4(2), 88–92.
Faul, F., Erdfelder, E., Buchner, A., & Lang, A. G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41(4), 1149–1160.
Illingworth, B. J., Lewis, D. J., Lambarth, A. T., Stocking, K., Duffy, J. M., Jelen, L. A., & Rucker, J. J. (2021). A comparison of MDMA-assisted psychotherapy to non-assisted psychotherapy in treatment-resistant PTSD: A systematic review and meta-analysis. Journal of Psychopharmacology, 35(5), 501–511.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
Krause, N. M., Brossard, D., Scheufele, D. A., Xenos, M. A., & Franke, K. (2019). Trends—Americans’ trust in science and scientists. Public Opinion Quarterly, 83(4), 817–836.
Lewis, C., Roberts, N. P., Andrew, M., Starling, E., & Bisson, J. I. (2020). Psychological therapies for post-traumatic stress disorder in adults: Systematic review and meta-analysis. European Journal of Psychotraumatology, 11(1), 1729633.
Luoma, J. B., Chwyl, C., Bathje, G. J., Davis, A. K., & Lancelotta, R. (2020). A meta-analysis of placebo-controlled trials of psychedelic-assisted therapy. Journal of Psychoactive Drugs, 52(4), 289–299.
Mitchell, J. M., Bogenschutz, M., Lilienstein, A., Harrison, C., Kleiman, S., Parker-Guilbert, K., …, & Doblin, R. (2021). MDMA-assisted therapy for severe PTSD: A randomized, double-blind, placebo-controlled phase 3 study. Nature Medicine, 27(6), 1025–1033.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference, part I. Biometrika, 20A(1–2), 175–240.
Patil, P., Peng, R. D., & Leek, J. T. (2016). What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspectives on Psychological Science, 11(4), 539–544.
Romeo, B., Karila, L., Martelli, C., & Benyamina, A. (2020). Efficacy of psychedelic treatments on depressive symptoms: A meta-analysis. Journal of Psychopharmacology, 34(10), 1079–1085.
Sakaluk, J. K., Williams, A. J., Kilshaw, R. E., & Rhyner, K. T. (2019). Evaluating the evidential value of empirically supported psychological treatments (ESTs): A meta-scientific review. Journal of Abnormal Psychology, 128(6), 500.
Spence, J. R., & Stanley, D. J. (2016). Prediction interval: What to expect when you’re expecting… A replication. PLoS ONE, 11(9), e0162874.
Zeifman, R. J., Yu, D., Singhal, N., Wang, G., Nayak, S. M., & Weissman, C. R. (2022). Decreases in suicidality following psychedelic therapy: A meta-analysis of individual patient data across clinical trials. The Journal of Clinical Psychiatry, 83(2), 39235.
Zou, G. Y. (2007). Toward using confidence intervals to compare correlations. Psychological Methods, 12(4), 399.