Journal of Clinical Epidemiology
Volume 103, November 2018, Pages 131-133

Commentary
Validation in prediction research: the waste by data splitting

https://doi.org/10.1016/j.jclinepi.2018.07.010

Highlights

  • In the absence of sufficient sample size, independent validation is misleading and should be dropped as a model evaluation step.

  • We should accept that small studies on prediction are exploratory in nature, at best show the potential of new biological insights, and cannot be expected to provide clinically applicable tests, prediction models, or classifiers.

  • Validation studies should have at least 100 events to be meaningful. In Big Data, heterogeneity in model performance across settings should be quantified rather than only average performance.

Abstract

Accurate prediction of medical outcomes is important for diagnosis and prognosis. Major medical journals nowadays require that validity outside the development sample be shown. Is such data splitting a waste of resources? In large samples, interest should shift to the assessment of heterogeneity in model performance across settings. In small samples, cross-validation and bootstrapping are more efficient approaches. In conclusion, random data splitting should be abolished for the validation of prediction models.

Section snippets

Large sample validation

Examples with large sample sizes for development and validation are found in virtually all prediction models derived from the QResearch general practice database, which have resulted in the Q score algorithms [3]. These can be seen as big data approaches. Here, routinely collected data from hundreds of general practices are used for model development and data from hundreds of other practices for validation. Such a split-sample approach is attractive for its simplicity in providing an independent and, hence, unbiased assessment of model performance.
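
The shift of interest advocated here, from a single pooled estimate toward heterogeneity across settings, can be made concrete in a few lines of code. The sketch below is illustrative only: it assumes a pandas DataFrame with hypothetical columns 'practice', 'y', and 'risk' holding the setting identifier, the observed outcome, and the predictions of an already developed model, and it simply computes the C-statistic within each setting and reports the spread across settings rather than a single average.

import pandas as pd
from sklearn.metrics import roc_auc_score

def c_statistic_by_setting(df, cluster_col="practice", y_col="y", risk_col="risk"):
    """C-statistic (area under the ROC curve) computed separately within each setting."""
    per_setting = {}
    for setting, grp in df.groupby(cluster_col):
        if grp[y_col].nunique() == 2:          # C-statistic needs both outcome classes
            per_setting[setting] = roc_auc_score(grp[y_col], grp[risk_col])
    return pd.Series(per_setting, name="c_statistic")

# Usage (df as described above): report the distribution across settings, not just its mean.
# c_by_practice = c_statistic_by_setting(df)
# print(c_by_practice.describe())
# print(c_by_practice.quantile([0.025, 0.975]))   # between-setting variation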

Small sample validation

A recent and rather extreme example of data splitting was the evaluation of the prognostic value of single-cell analyses in leukemia [9]. To predict relapse, a data set of 54 patients was available. A model was constructed in 80% of the sample (44 patients) and validated in the remaining 20% (10 patients). Discriminative performance was assessed by a standard measure, the C-statistic [4], [10]. The study found that there were three relapses among the 10 patients in the validation cohort, with perfect …
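
To see why a 10-patient validation set with three events is so fragile, consider the illustrative sketch below (hypothetical numbers, not the data from the cited leukemia study). With 3 events and 7 non-events there are only 21 event/non-event pairs, so the C-statistic can only move in steps of 1/21, and even an uninformative model reaches C = 1.0 by chance in roughly 1 of 120 random orderings.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2018)
y_val = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # 3 "relapses" among 10 patients

# Predicted risks drawn at random, i.e., a model with no prognostic value at all.
c_stats = np.array([roc_auc_score(y_val, rng.uniform(size=10)) for _ in range(10000)])

print("attainable C values:", np.unique(c_stats.round(3)))   # multiples of 1/21
print("share with C == 1.0:", (c_stats == 1.0).mean())       # about 1 in 120
print("share with C >= 0.9:", (c_stats >= 0.9).mean())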

Simulation study

A simulation study was designed with three sample sizes and a 30% event rate (as in the leukemia study): extremely small (10 patients, three events), moderate (333 patients, 100 events), and large (1,667 patients, 500 events). We examined the variability of three different prediction models (or “classifiers”) by simulation, assuming that the true C-statistic of the prediction model was 0.7, 0.8, or 0.9 (Fig. 1). We found that, with only three events, a substantial fraction of validations …
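
The simulation logic can be sketched as follows. This is a minimal reconstruction under a binormal assumption (normally distributed linear predictor in events and non-events, with the mean shift chosen to give the desired true C-statistic), not necessarily the authors' exact design; the sample sizes and event rates mirror the ones described above.

import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def simulate_observed_c(true_c, n_events, n_nonevents, n_sim=2000, seed=42):
    """Observed C-statistics in repeated validation samples from a model with known true C."""
    rng = np.random.default_rng(seed)
    delta = np.sqrt(2) * norm.ppf(true_c)   # mean shift that yields the desired true AUC
    observed = []
    for _ in range(n_sim):
        lp_events = rng.normal(loc=delta, scale=1.0, size=n_events)
        lp_nonevents = rng.normal(loc=0.0, scale=1.0, size=n_nonevents)
        y = np.r_[np.ones(n_events), np.zeros(n_nonevents)]
        lp = np.r_[lp_events, lp_nonevents]
        observed.append(roc_auc_score(y, lp))
    return np.array(observed)

for n_events, n_nonevents in [(3, 7), (100, 233), (500, 1167)]:
    for true_c in (0.7, 0.8, 0.9):
        c_hat = simulate_observed_c(true_c, n_events, n_nonevents)
        print(f"events={n_events:4d}  true C={true_c}  "
              f"2.5-97.5 percentile of observed C: "
              f"{np.percentile(c_hat, 2.5):.2f}-{np.percentile(c_hat, 97.5):.2f}")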

Implications

From the aforementioned, three implications can be drawn for the practice of validation of prediction models:

  1) In the absence of sufficient sample size, independent validation is misleading and should be dropped as a model evaluation step [14]. It is preferable to use all data for model development, with some form of cross-validation or bootstrap validation for the assessment of the statistical optimism in average predictive performance [15]; a sketch of such a bootstrap procedure follows this list.

  2) Basically, we should accept that small studies on prediction are exploratory in nature, at best show the potential of new biological insights, and cannot be expected to provide clinically applicable tests, prediction models, or classifiers.
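
As referenced in point 1), a minimal sketch of bootstrap validation is given below. It follows the usual optimism-correction recipe, with a logistic regression model as a stand-in for the prediction model: refit the model in each bootstrap sample, compare its performance in the bootstrap sample with its performance in the original data, and subtract the average difference (the optimism) from the apparent C-statistic. X (a NumPy predictor matrix) and y (a binary outcome vector) are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def bootstrap_corrected_c(X, y, n_boot=200, seed=0):
    """Apparent and optimism-corrected C-statistic for a logistic regression model."""
    rng = np.random.default_rng(seed)
    n = len(y)

    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent_c = roc_auc_score(y, model.predict_proba(X)[:, 1])

    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample patients with replacement
        if len(np.unique(y[idx])) < 2:          # skip resamples without both outcome classes
            continue
        boot = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        c_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
        c_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
        optimism.append(c_boot - c_orig)        # how much the bootstrap fit flatters itself

    return apparent_c, apparent_c - np.mean(optimism)

# Usage: the corrected estimate uses all patients for development instead of
# sacrificing a handful of them to a one-off random split.
# apparent, corrected = bootstrap_corrected_c(X, y)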

References (22)

  • P.C. Austin et al. Validation of prediction models: examining temporal and geographic stability of baseline risk and estimated covariate effects. Diagn Progn Res (2017)