Introduction

Asthma is a heterogeneous disease, defined by the most recent Global Initiative for Asthma (GINA) global strategy for asthma management and prevention consensus as a condition characterised by the presence of respiratory symptoms such as wheeze, shortness of breath, chest tightness and cough that vary over time and in intensity, together with variable airflow obstruction [1]. However, various definitions of asthma do not capture the heterogeneity of this common complex condition. It is becoming increasingly clear that asthma is not a single disease, but a syndrome which consists of a number of disease subtypes with similar observable clinical characteristics [2]. These observable characteristics of the disease are often referred to as asthma phenotypes. The term 'asthma endotype' is not synonymous with phenotype, and it should be used to refer to the distinct disease entity under the umbrella diagnosis of asthma, which has defined pathophysiological mechanisms that give rise to clinical symptoms [3]. It should be emphasised that the same observable characteristic (i.e. phenotype) can arise as a consequence of different underlying pathologies (i.e. endotypes), which is consistent with observations showing that there are subtypes of asthma that share similar clinical symptoms but have differing underlying pathophysiological mechanisms [4]. There are numerous examples in other disease areas of a similar or identical clinical presentation arising as a consequence of different pathology (e.g. fever in childhood can be caused by numerous different mechanisms).

The traditional constructs of ‘asthma phenotypes’ have been largely descriptive, with little uniformity, and usually informed by subjective observations of single dimensions of the disease, such as triggering factors (e.g. extrinsic and intrinsic asthma [5], exercise-induced asthma [6]), patterns of airway obstruction (e.g. reversible and irreversible asthma [7]), or pathology (e.g. eosinophilic and non-eosinophilic asthma [8]).

In paediatric asthma, changes over time in symptoms such as wheeze have been used to define phenotypes of wheezing illness during childhood [9]. For example, based on clinical observation of changes in the temporal pattern of wheezing illness during childhood, as confirmed in the birth cohort study (Tucson Children’s Respiratory Study), Martinez et al. divided children into three groups (or phenotypes) of wheezing: transient early wheezers, late-onset wheezers, and persistent wheezers [10]. Although these phenotypes are clinically meaningful in their association with lung function and subsequent development of asthma [11], their distinct underlying pathophysiological mechanisms have not been elucidated or confirmed—they cannot be considered as endotypes.

Based on expert opinion and consensus, Lotvall et al. [4] suggested the existence of six asthma endotypes: aspirin-sensitive asthma, allergic bronchopulmonary mycosis, allergic asthma, asthma predictive index-positive preschool wheezers, severe late-onset hypereosinophilic asthma, and asthma in cross-country skiers. However, the well-defined pathophysiological mechanisms and biomarkers which differentiate these proposed endotypes have not been discovered, and there is no universal agreement that these subtypes of asthma represent true endotypes [12]. At this time, the endotype concept remains largely hypothetical, but may have a tangible value in helping us to formulate strategies to better understand the mechanisms underlying different asthma-related diseases, and thus to identify more effective stratified treatment strategies [13].

In recent years, approaches to subtyping asthma have evolved from subjective expert opinion to more data-driven methodologies such as machine learning [14, 15]. Statistical machine-learning methods facilitate the efficient exploration of data for the identification and analysis of disease patterns. These methods are able to draw upon the vast array of data generated from birth and patient cohorts in order to cluster, classify, regress, and make predictions from data based on inherent patterns within the large complex data set. This is in contrast to the traditional methods based on human observation and testing of hypotheses using prior knowledge. Within the context of asthma subtyping, methods such as unsupervised clustering approaches, factor analysis, and principal component analysis have come into wide use within the last decade. These are hypothesis-generating, with the overarching notion that the inherent patterns within the data may be a reflection of different underlying aetiologies, genetic basis, and/or immunopathophysiologies, and that identified clusters may represent distinct asthma endotypes. If this assumption is correct, clustering methodologies could facilitate better understanding of the disease mechanisms, identification of novel therapeutic targets, and better clinical trial design incorporating group-specific targeted treatment, all of which are essential steps towards delivery of stratified medicine in asthma.

Here we present a review of the different clustering methodologies—model-free and model-based—and their applications in asthma subtyping. We provide an overview of the major studies and discuss the implications and approaches used.

What Is Clustering?

Cluster analysis is a popular unsupervised machine-learning method that seeks to identify similar characteristics in subjects (or variables) and to group them together on that basis. In selecting groups, the primary aim is to minimize intra-group variance while simultaneously maximizing inter-group variance. Clustering ‘classifies’ data by labelling objects with cluster ‘labels’ or giving each object a probability of belonging to a certain cluster. Cluster labels are not known a priori, and are derived solely from the data. This is in contrast to supervised methods such as logistic regression and support vector machines, which seek to derive rules for classifying new objects based on a set of previously classified objects.

Selection of Variables/Features and Dimension Reduction

Cluster analysis lacks the ability to differentiate between clinically relevant and irrelevant variables; thus the choice of variables to input into the clustering algorithm is one of the most important considerations. Variable or feature selection can be performed subjectively or objectively. Subjective methods involve choosing relevant variables based on expert advice and published work. In contrast, objective methods use data-driven approaches to variable/feature selection, the most common of which are stepwise methods (such as backward and forward selection) and dimension reduction techniques (such as principal components analysis [PCA] and factor analysis [FA]). Forward selection progressively adds variables of greatest significance (based on pre-set p values) to the model. Backward selection starts with all variables and progressively drops the least significant ones until all the remaining variables are statistically significant.

To reduce the large number of variables, the majority of studies we reviewed employed manual extraction based on expert advice. For example, Moore et al. [16] manually reduced the number of variables from 600 to 34 by excluding variables with missing data and those that were either deemed redundant because information was captured by another variable (multicollinearity) or considered not clinically relevant. Other studies used dimension reduction techniques such as PCA and FA, which reduce data by generating small subsets of generally uncorrelated variables from a large data set of potentially correlated variables. These techniques are useful when we assume that there are underlying latent (unobserved) constructs (factors/components) in the data which cannot be measured directly but which can influence responses on measured variables. Although these two methods were used almost interchangeably in the literature we reviewed, there are differences between them. As a general rule, PCA is used to reduce data into smaller subsets, while FA is used to determine the unobserved factors which explain the data.
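As an illustration of how PCA condenses correlated variables into a component, the following is a minimal sketch using only the Python standard library. It extracts the first principal component by power iteration on the sample covariance matrix. The function name, the toy data, and the choice of power iteration are assumptions made for illustration; they are not taken from any of the reviewed studies, which would typically rely on a statistical package.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio) of the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the eigenvector with the largest eigenvalue
    # (assumes the covariance matrix is not all zeros).
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total_variance

# Hypothetical toy data: two strongly correlated measurements plus a noisy third.
toy = [[1.0, 2.1, 0.3], [2.0, 3.9, -0.2], [3.0, 6.2, 0.1],
       [4.0, 7.8, -0.3], [5.0, 10.1, 0.2]]
direction, ratio = first_principal_component(toy)
```

In this toy example a single component captures nearly all of the variance, because the first two variables carry almost the same information. The variables are left unscaled here; in practice they are often standardised first so that no single variable dominates because of its units.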

Clustering Methods

Three main clustering methods have generally been used in asthma subtyping: hierarchical approaches, non-hierarchical or partitioning-based approaches, and model-based or probabilistic approaches.

Hierarchical Clustering

Hierarchical clustering aims to create a pyramidal or (as its name implies) ‘hierarchical’ grouping of homogeneous clusters that can be displayed in a tree-like graph (dendrogram). It does not require the number of clusters to be specified a priori, and cluster assignment is based on similarity of measured characteristics. Within hierarchical clustering there are two subcategories: agglomerative and divisive methods (Fig. 1).

Fig. 1 Overview of the difference between agglomerative and divisive hierarchical clustering

Agglomerative Method

The agglomerative method is a bottom-up approach that starts with each data point assigned to its own cluster, and iteratively merges the two closest clusters until all the data belong to a single cluster [17]. Once clusters are formed, there is no inter-cluster switching. The choice of which clusters to combine is determined by measuring distances, similarities/dissimilarities, and/or using linkage criteria.

This method formulates decisions based on the pattern of variables used, without accounting for the overall distribution.
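The bottom-up merging described above can be sketched as follows. This is a deliberately naive illustration using Ward's criterion (merge the pair of clusters that gives the smallest increase in within-cluster sum of squares), written with the standard library only. The function name and toy points are hypothetical, and the merging stops at a requested number of clusters rather than building the full dendrogram.

```python
def ward_agglomerative(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts in its own cluster

    def centroid(members):
        dim = len(points[0])
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if clusters a and b were merged.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > n_clusters:
        # Find the pair of clusters whose merge is cheapest.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merged clusters are never split again
        del clusters[j]
    return clusters

# Hypothetical toy data: two tight groups and one isolated point.
toy_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0)]
groups = ward_agglomerative(toy_points, 3)
```

Because a merge is never undone, an early poor merge propagates to all later levels, which is the absence of inter-cluster switching noted above.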

Divisive Method

This variant is a top-down approach whereby all objects initially belong to one cluster, which is then recursively divided into sub-clusters until the desired number of clusters is obtained [18]. By initially having a single cluster, the model gains insight into the spread and type of data, and subsequently makes decisions on when and how to divide the sub-clusters.

Similarity/Dissimilarity Measures

To determine whether objects within the same clustered group are similar or dissimilar, distance measures and linkage criteria (Table 1) are used. Distance metrics measure the distance between observations, while linkage criteria measure the distance between clusters. In order to define a similarity measure, the actual similarities between objects can be evaluated using a distance measure. Choosing a measure for calculating the distance between data can sometimes be arbitrary, as there are no general theoretical guidelines. The Euclidean distance measure, which is the default method in most statistical packages, was used in all but one of the studies reviewed here [19].
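To make the notion of a distance measure concrete, the sketch below implements the Euclidean distance together with two alternatives that appear later in this review (Jaccard for binary data and Gower for mixed data). These are textbook definitions written with the standard library; the function names, the `ranges` argument, and the example values are assumptions for illustration only.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_distance(a, b):
    """Jaccard distance for binary vectors: 1 - |both present| / |either present|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either

def gower_distance(a, b, ranges):
    """Gower distance for mixed data.

    ranges[k] is the (non-zero) observed range of numeric variable k,
    or None when variable k is categorical.
    """
    parts = []
    for x, y, r in zip(a, b, ranges):
        if r is None:
            parts.append(0.0 if x == y else 1.0)  # simple mismatch for categories
        else:
            parts.append(abs(x - y) / r)          # range-scaled absolute difference
    return sum(parts) / len(parts)

# Hypothetical patients described by FEV1 (% predicted) and atopy status.
d_mixed = gower_distance((80.0, "atopic"), (60.0, "non-atopic"), (100.0, None))
```

The same pair of subjects can be close under one measure and distant under another, which is one reason the choice of measure can alter the resulting clusters.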

Table 1 Most commonly used linkage criteria

Non-Hierarchical Clustering

The prototype of non-hierarchical clustering is k-means (Fig. 2), which is a partitioning method in which the number of clusters is specified a priori and the optimal solution is chosen. It is a variance-minimizing algorithm whereby each subject is assigned to its nearest cluster based on the minimum squared Euclidean distance. This method is sensitive to outliers and is generally limited to numeric attributes.
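The assignment and update steps of k-means can be sketched as below (Lloyd's algorithm, standard library only). The starting centroids are passed in explicitly so that the example is deterministic; in practice they are usually chosen at random and the procedure is repeated from several starts. The function name and toy data are hypothetical.

```python
def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy data with two well-separated groups.
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centres = k_means(toy_points, centroids=[(1.0, 1.0), (9.0, 8.5)])
```

Because each centroid is a mean, a single extreme observation can pull it away from the bulk of its cluster, which is the sensitivity to outliers mentioned above.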

Fig. 2 A silhouette plot used for non-hierarchical clustering (k-means) (from [20], with permission). A silhouette plot shows how close observations from neighbouring clusters are to each other using a measure of −1 to +1. A value of +1 indicates that observations are far away, 0 indicates that the observations are very close to the boundary of deciding exactly which cluster they belong to, and −1 indicates that the observations may be assigned to the wrong cluster
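The silhouette value plotted in such a figure can be computed per observation as (b − a) / max(a, b), where a is the mean distance to the other members of the observation's own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below follows this standard definition using Euclidean distance; the function name and toy data are hypothetical, and at least two clusters are assumed.

```python
import math

def silhouette_values(points, labels):
    """Per-observation silhouette values (requires at least two distinct labels)."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            values.append(0.0)  # convention for a cluster with a single member
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

# Hypothetical one-dimensional data: a sensible labelling and a poor one.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_values(toy_points, [0, 0, 1, 1])
poor = silhouette_values(toy_points, [0, 1, 1, 1])
```

With the sensible labelling every value is close to +1; with the poor labelling the second observation receives a strongly negative value, flagging a likely misassignment.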

Model-Based Clustering

Model-based clustering (also known as latent class analysis or mixture modelling) is based on the assumption that the observed data are generated by a collection of models, with each cluster corresponding to a different model. Each resulting cluster is represented by a (most commonly) parametric distribution, and can be either spherical or ellipsoidal, with varying size and variance. The advantage of model-based clustering is that it can produce probabilistic cluster assignments for individuals; that is, it captures the uncertainty in assigning individuals to clusters. Bayesian extensions (e.g. Markov chain Monte Carlo [MCMC], expectation-maximisation [EM]) of model-based clustering can also be used to incorporate prior distributions to reflect uncertainty around model assumptions.

A major challenge in model-based clustering is identifying and representing the underlying model assumptions with reasonable complexity. However, unlike a model-free approach, log-likelihood-based statistics such as the Bayesian information criterion (BIC) and model evidence allow us to select the most parsimonious set of assumptions by trading off goodness of fit against model complexity. This is in contrast to model-free clustering, where an arbitrary distance measure is used to find clusters. Importantly, choosing the best statistically fitting model is not enough; there must be an element of expert input into choosing the number of clusters to maximise the potential clinical relevance of the identified subgroups.
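The following sketch illustrates both ideas on a small scale: a one-dimensional Gaussian mixture is fitted by expectation-maximisation, each subject receives a probability of belonging to each cluster, and the number of clusters is compared using BIC = p ln(n) − 2 ln(L), where p is the number of free parameters (lower is better under this sign convention). The function name, the deterministic starting values, the variance floor, and the FEV1-like toy values are all assumptions for illustration.

```python
import math

def fit_gmm_1d(xs, k, iterations=200):
    """Fit a k-component 1-D Gaussian mixture by EM; return (bic, responsibilities)."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(xs) / n
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * k
    weights = [1.0 / k] * k

    def e_step():
        # Posterior probability of each component for each observation,
        # plus the log-likelihood under the current parameters.
        resp, log_lik = [], 0.0
        for x in xs:
            joint = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([value / total for value in joint])
        return resp, log_lik

    for _ in range(iterations):
        resp, _unused = e_step()
        for j in range(k):  # M-step (assumes no component becomes empty)
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, 1e-6)  # floor guards against collapse
    resp, log_lik = e_step()
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    return n_params * math.log(n) - 2 * log_lik, resp

# Hypothetical FEV1 (% predicted) values forming two groups.
fev1 = [55, 58, 60, 62, 65, 90, 93, 95, 97, 100]
bic_one, _ = fit_gmm_1d(fev1, 1)
bic_two, resp_two = fit_gmm_1d(fev1, 2)
```

Here the two-component model has the lower BIC despite its additional parameters, and `resp_two` holds the probabilistic assignments. As noted above, such a statistical preference would still need to be weighed against clinical interpretability.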

Stability of Resulting Clusters

Cluster stability is an important aspect of validity, because cluster methods can generate groups even in fairly homogeneous data sets. Furthermore, there is always a risk of identifying less meaningful clusters. Stability in this context refers to clusters not disappearing when, for example, outliers are added, the data are subsetted, or random error is introduced to every point to simulate measurement error [21]. The most common means of assessing this is to apply the same cluster method to data sets resampled from the original one (also termed bootstrapping) and to identify similar clusters using similarity measures. The similarity values are then compared, and stability is taken to be the mean similarity across the resampled data sets [21].
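A bootstrap stability check of this kind can be sketched as follows. The clustering step is a toy one-dimensional method (split at the largest gap) standing in for whichever algorithm is being assessed, and the Jaccard coefficient is used as the similarity measure; both choices, along with the function names and data, are assumptions for illustration. Observations are identified by their values, so the toy data must contain no duplicates.

```python
import random

def split_at_largest_gap(values):
    """Toy 1-D two-cluster method (a stand-in for any clustering algorithm)."""
    ordered = sorted(set(values))
    if len(ordered) < 2:
        return set(values), set()
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, cut = max(gaps)
    threshold = ordered[cut]
    return ({v for v in values if v <= threshold},
            {v for v in values if v > threshold})

def jaccard(a, b):
    """Jaccard similarity between two sets of observations."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def bootstrap_stability(values, n_boot=200, seed=0):
    """Mean Jaccard similarity of each original cluster across bootstrap resamples."""
    rng = random.Random(seed)
    original = split_at_largest_gap(values)
    totals = [0.0] * len(original)
    for _ in range(n_boot):
        sample = [rng.choice(values) for _ in values]  # resample with replacement
        resampled = split_at_largest_gap(sample)
        present = set(sample)
        for k, cluster in enumerate(original):
            # Compare the original cluster (restricted to resampled points)
            # with its best-matching cluster in the bootstrap solution.
            restricted = cluster & present
            totals[k] += max(jaccard(restricted, c) for c in resampled)
    return [t / n_boot for t in totals]

# Hypothetical data with two clearly separated groups.
stability = bootstrap_stability([55, 58, 60, 62, 65, 90, 93, 95, 97, 100])
```

Values near 1 indicate that a cluster is recovered in almost every resample, whereas low values suggest that it may be an artefact of the particular sample.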

Clustering Methods in Asthma Subtyping

The Use of Principal Components Analysis/Factor Analysis in Asthma Subtyping

Studies which used PCA/FA as stand-alone analyses for demonstrating the heterogeneity of asthma syndrome and its risk factors are summarised in Table 2 [22–40]. Sample sizes ranged from 69 to 16,635, and the number of variables used initially ranged from 5 to 97. The number of resulting components/factors ranged from one to six.

PCA was first used in the context of asthma by Smith et al. to examine whether syndromes of coexisting respiratory symptoms could be discovered using the response to a large number of questions (>100) from validated questionnaires administered in a birth cohort (Manchester Asthma and Allergy Study [MAAS]) [22]. The analysis demonstrated that symptom components (wheeze, cough, wheeze with allergens, wheeze with irritants, chest congestion) were better indicators of the presence and developmental changes in observable secondary asthma phenotypes (such as lung function, airway reactivity, and immunoglobulin E (IgE)-mediated sensitisation) than the presence of individual symptoms such as wheeze.

Using factor analysis, Bailey et al. [32] found that the intensity of asthma symptoms, asthma management, and airflow impairment (forced expiratory volume [FEV1]) were independent components of the disease. This was also seen in the study by Grazzini et al. [36], where lung function (FEV1) was a factor independent from asthma symptoms in a mixed teenager-adult population of 69 asthmatics. Lung function was also independent of inflammatory markers (fraction of exhaled nitric oxide [FeNO], sputum eosinophils) in other studies [33, 39, 40]. The study by Juniper et al. [37], which included 763 patients older than 12 years who participated in clinical trials, showed that, despite medication, daytime and nighttime symptoms were distinct and independent factors of asthma. Clemmer et al. [31] used PCA to demonstrate that a clinical ‘endophenotype’ relating to corticosteroid responsiveness best predicted corticosteroid response in all replication populations. Other studies in Brazilian [26], British [28], and Japanese [41] children have shown that ‘Western diets’ were independently associated with an increased risk of wheezing by school age.

More recently, both PCA and FA have been used as dimension reduction techniques to generate small subsets from a large number of variables; these small subsets (components/factors) were then used for further clustering. For example, Just et al. used PCA to reduce 40 variables to 19, characterising age and body mass index (BMI), asthma duration, medication use, hospitalisation, atopy, and lung function [42], which were then used in hierarchical clustering. This approach acts as feature extraction in that it can initially visualize/reveal clusters prior to the cluster analysis.

Asthma Subtype Classification with Model-Free Approaches

The studies identified from our literature search which used model-free approaches for subtyping asthma are shown in Table 3 [16, 19, 43–61]. Of 22 studies, 12 were carried out in adult populations. Population sample sizes ranged from 57 to 1843. The approach of choice was Ward’s hierarchical method with some form of data reduction, whether with PCA, multiple regression analysis, discriminant analysis, factor analysis, or decision trees. k-means clustering was performed in 9 of 22 studies, but always as a supplementary method. The resulting numbers of clusters ranged from two to six.

Table 2 Studies using principal components analysis/factor analysis in asthma subtyping

Paediatric Studies

The Trousseau Asthma Program (TAP) in France used Ward’s hierarchical clustering as the method of choice [42, 50, 53]. In the TAP preschool population of 551 wheezers, ‘three clusters of wheezing’ were identified: mild episodic viral wheeze, atopic multiple-trigger wheeze, and non-atopic uncontrolled wheeze [50]. The mild episodic viral wheeze class was identified in one British [62] and one French cohort [63] using model-based approaches (see below), and the non-atopic uncontrolled wheeze cluster was reproduced in a separate TAP cohort [53]. The multiple-trigger wheeze was previously identified using supervised methods in the Avon Longitudinal Study of Parents and Children (ALSPAC) [64]. This cluster described children with either early- or late-onset persistent wheezing characterised by atopy and poor lung function. A similar description of wheezing was used in the MAAS cohort to demonstrate that persistent wheezing and multiple early atopy were associated with diminished lung function by age 11 years [65].

The clusters of wheezing described in the TAP cohort remained stable at age 5 years [53]. However, at school age, the clusters were different: ‘asthma with severe exacerbations and multiple allergies’, ‘severe asthma with bronchial obstruction’, and ‘mild asthma’ [42]. These accounted for two ‘phenotypes’: asthma with severe exacerbations, and multiple allergic severe asthma with bronchial obstruction [42]. It is important to note, however, that not only were the children from a separate cohort within the TAP, but the clustering methodology was also different: PCA was used for data reduction, followed by a two-step clustering approach including k-means [42]. Furthermore, differing post hoc analyses were used.

The Severe Asthma Research Program (SARP) is a US multi-centre study comprising both children and adults with persistent asthma. The study by Fitzpatrick et al. [46] included 161 children aged 6–17 years. Variables were selected subjectively with no data reduction technique, and the authors derived ‘composite variables’ from binary and questionnaire data discerned by physicians. After Ward’s hierarchical clustering, four clusters were identified: ‘late-onset symptomatic asthma’, ‘early-onset atopic asthma and normal lung function’, ‘early-onset atopic asthma with mild airflow limitation and comorbidities’, and ‘early-onset atopic asthma with advanced airflow limitation’. These results and the accompanying clinical characteristics exhibited by the children were consistent with previously reported data from clinical observations [66–68]. However, these results differed from findings in a Turkish cohort of children aged 6–18 years with moderate–severe asthma [19]. In contrast to previous studies, the predictive ability of clusters and of original variables in relation to asthma severity in this population was relatively poor [19]. The authors concluded that the search for asthma subtypes needs careful selection of variables, which should be consistent across studies, and that a cautious interpretation of results is warranted [19].

Studies in Adults

The initial work that sparked further interest in clustering methodology was the study conducted by Haldar et al. in Leicester, UK [43]. A two-step Ward’s hierarchical and subsequent k-means cluster analysis was performed in three different data sets (refractory asthmatics from secondary care, primary care data, refractory asthmatics from clinical trial). After variable selection to identify the ‘most clinically relevant’ variables, PCA was performed, which reduced the variables into five components. Results of the subsequent cluster analysis revealed three clusters in the primary care data set and four clusters in the secondary care data. Two clusters were common to both data sets: ‘early-onset atopic asthma’ and ‘obese female with no eosinophilic inflammation’. The primary care data set identified a third ‘benign asthma’ cluster, while the secondary care set identified an ‘early-onset, symptom-predominant group with minimal eosinophils’ cluster as well as a ‘late-onset, male predominant, eosinophilic inflammation with few symptoms’ cluster. These results were then validated in the clinical trial data set, which revealed a three-cluster model similar to that in the secondary care set.

Expanding on Haldar’s findings, the SARP study [16], which included 726 patients older than 12 years, began with 628 variables, which were reduced to 34 by excluding missing data, text data, and redundant and ‘irrelevant’ variables. Half of the variables were composite. Ward’s method and post hoc discriminant analysis for tree analysis were performed to describe five clusters highly determined by frequency of symptoms, medication use, and lung function. Both studies identified a group of obese women with adult-onset asthma and less atopy, as well as a group of severe late-onset atopic asthmatics with poor lung function. However, SARP did not use sputum eosinophilia, which was an important feature in the Leicester study. A few years later, the SARP group used a different approach, and identified six clusters [60]. k-means clustering partitioned the 378 subjects, while Ward’s method clustered the 112 variables into 10 InfoGain-ranked variable clusters (InfoGain, or information gain, measures how well variables predict clusters) based on symptoms, atopy, medication use, lung function, corticosteroid use and cause, Th2 inflammation, inflammatory markers, and demographics. Preprocessing of the data included imputing variables with less than 5% missing data while excluding those with more than 5%. Markov blanket algorithms identified redundant variables. Three clusters overlapped with previous results (severe asthmatics, female late-onset with normal lung function), while two were novel (late-onset severe eosinophilic asthmatics with nasal polyps, severe atopic Hispanics). It is interesting to note that similar clusters were seen in children from SARP and the Asthma Severity Modifying Polymorphisms (AsthMaP) Project [45], though the degree of lung function impairment was less.

Patrawalla et al. [49] based their clustering and variable selection technique on SARP, and identified clusters similar to those found by Wu et al. [60], though the Hispanic women had milder disease. This was explained by the fact that the sample was from an urban New York City population with a higher proportion of Hispanics.

The results obtained in the Leicester and SARP populations were reproduced in part in a Dutch cohort of patients with severe asthma that included more thorough inflammatory markers [58]. The resulting three clusters confirmed the existence of two previously reported clusters: ‘severe eosinophilic inflammation-predominant asthma with few symptoms and poor lung function’, and ‘obese late-onset asthma with low eosinophils additionally provoked by comorbidities such as gastro-oesophageal reflux disease (GORD)’. The third cluster in the Dutch cohort (‘mild adult-onset well-controlled asthma’), which was not found in Leicester or SARP, had been seen in studies in Asian populations which included smoking status in their analysis [54, 55].

The recurring obesity-related subtypes were explored in more detail in two US trials comprising 250 adults [52]. With the incorporation of detailed data on inflammation, major differences were found between the obese and non-obese populations. Non-obese asthmatics had significantly better lung function. Obese patients with early-onset asthma and poor lung function had greater degrees of systemic inflammation (represented by the inverse association between hsCRP and GCRα); this was directly associated with increased glucocorticoid resistance (measured by reduced MKP-1 expression via dexamethasone).

Asthma Subtyping and Model-Based Approaches

Latent Variable Modelling

This topic was recently discussed in detail in another review article, which identified a total of 36 studies within the last 5 years that used model-based approaches to asthma subtyping (four in adult populations, 32 in children) [69]. Sample sizes in these studies ranged from 201 to 11,632. Methods included latent class analysis (14 studies), longitudinal latent class analysis (11 studies), latent class growth analysis (one study), latent growth mixture modelling (eight studies), and mixture models (two studies). The number of resulting classes ranged from three to eight, and the classes were in most cases characterised by physician-diagnosed asthma, atopy, and/or FeNO. The most common outcome was ‘wheeze phenotype’ [64, 71–82], followed by ‘atopy class’ [64, 76, 81–86].

In these studies, the wheeze classes (often referred to as ‘phenotypes’, although by definition these were not observable, but latent) were described as either early-onset (transient [78, 87, 88] or prolonged [70]), late-onset (characterised as wheeze after age 3 years, persisting into later childhood) [70, 74, 78, 80, 83], or persistent (controlled and troublesome, characterised by diminished lung function by school age) [9, 74]. Early-onset wheeze was found to be predictive of poor lung function, but not atopy, eczema, or rhinitis at age 6–8 years [87]. Late-onset wheeze was associated with bronchial hyperresponsiveness and, in some cohorts, poorer lung function at age 6 years [64]. The persistent wheeze phenotype was consistently characterized by diminished lung function by school age [9, 74].

Atopic sensitisation was the second most common phenotype investigated by latent variable modelling, based on the hypothesis that distinct subtypes may be present. Simpson et al. applied a hidden Markov chain model to cluster children in MAAS into five sensitization classes using skin tests and specific IgE data at ages 1, 3, 5, and 8 years [83]. The underlying assumption was that children in each class had the same probability of becoming sensitized or resolving sensitization at each age (and to a similar panel of inhalant and food allergens), and that this differed between classes. Children in one of the four classes (comprising ~25% of sensitised participants), which the authors assigned as ‘multiple early atopy’, were much more likely to have asthma and worse lung function than children in any of the other classes [65, 83]. An almost-identical five-class model was identified by extending the analysis in MAAS through to age 11 years and in another British birth cohort (Isle of Wight study), indicating stability over time and across different populations [84, 89]. However, these classes of sensitisation can be identified only by using statistical inference on longitudinal data, and differentiation between classes at any single cross-sectional point is currently not possible. This underscores the need to develop diagnostic tools that delineate different classes at any cross-sectional time point among the patient population, in order to facilitate the application of these findings in clinical practice [89–92].

In the adult population, Newby et al. performed a cluster analysis using mixture models on a multi-centre longitudinal observational study of 349 asthma patients in the British Thoracic Society Severe Refractory Asthma Registry [93]. Variables were initially restricted to those with less than 30% missing data that were non-categorical, and factor analysis was then applied. The resulting five factors (airflow obstruction, exacerbation frequency, IgE/BMI, treatment scaling, blood eosinophilia) were used in the cluster analysis to describe five clusters: (1) ‘early-onset atopic’, (2) ‘obese, late-onset’, (3) ‘normal lung function least severe asthma’, (4) ‘late-onset, eosinophilic’, and (5) ‘airflow obstruction’. The best-fitting models were chosen by the Akaike information criterion (AIC) or BIC, and the clusters were validated using a classifier on a separate data set from the same registry. Cluster stability for the whole group was only 52%, with cluster 2 the most stable (71%), while cluster 4 reached only 25%. A significant proportion of subjects in clusters 1, 4, and 5 moved to clusters 2 and 3 at follow-up, indicating greater obesity, lower blood eosinophilia, better lung function, and fewer exacerbations. Taking into account small differences in variables used, the results were broadly in accordance with previously reported clusters derived using model-free approaches [16, 43]. Gaussian mixture model clustering was also used to investigate cytokine response patterns of peripheral blood mononuclear cells to mite allergens, with results suggesting that asthma was associated with a broad range of immunophenotypes [94]. Various machine-learning approaches were also used to identify patterns of IgE responses to a large number of individual allergen molecules in component-resolved diagnostics microarrays and to associate these with asthma and allergic diseases [14].

Challenges in Asthma Clustering

Mixed Types of Data

Medicine generates many different types of data, including binary, numerical, and categorical variables, non-normal distributions, missing values, and outliers, and applying a model that combines these is challenging. One solution may be to transform the raw variables into a single type (e.g. all binary variables). Prosperi et al. [19] showed that, although results were vastly different when comparing the raw and binary variables, they were still clinically consistent with each other. However, in certain instances, changing continuous variables into binary variables would require the creation of categories. For example, if we take FEV1 and categorise it based on levels of obstruction (e.g. above 80%, 60–80%, below 60%), we assume that an FEV1 of 60% has the same clinical significance as an FEV1 of 79%, which is not necessarily true. Other issues with dichotomizing variables include a loss of information, leading to a reduction in statistical power, a loss of linear relationships between two groups, and underestimation of outcome variability between groups [95]. Another way to minimize this problem is to create clinically meaningful categories, but this will likely introduce an element of subjectivity.
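The information lost by categorisation can be seen in a few lines. The sketch below codes FEV1 (% predicted) into the three bands used in the example above; the function name and the handling of the exact boundary values are assumptions for illustration.

```python
def fev1_category(percent_predicted):
    """Hypothetical three-level coding of FEV1 (% predicted)."""
    if percent_predicted > 80:
        return "above 80%"
    if percent_predicted >= 60:
        return "60-80%"
    return "below 60%"

# A 19-point difference disappears, while a 1-point difference changes the category.
same_band = fev1_category(60) == fev1_category(79)
different_band = fev1_category(59) != fev1_category(60)
```

After coding, a clustering algorithm treats subjects with an FEV1 of 60% and 79% as identical on this variable, yet separates subjects at 59% and 60%.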

Lack of Robustness to Choice of Variables and Clustering Methods

Different input parameters, even within the same data set, may produce different results. For example, in the SARP, the same hierarchical clustering techniques on the same data set produced different clusters [16, 46]. The major differences were in the preprocessing of the data and the cluster input. Wu et al. also included inflammatory markers in their analysis, which would account for better atopy delineation [60].

As mentioned previously, the choice of variables has been generally limited to consideration of expert opinion based on previous work. Furthermore, there is a practical consideration involved in that the variables chosen must correspond to the type of data in the cohort, given that some studies included all variables [58, 60, 61] in the data set, while others chose those that were ‘most relevant’ [42, 43, 48, 54, 55, 57]. This resulted in patient exclusion, particularly when there was a requirement to remove variables with missing data. Although some studies implemented imputation techniques in order to overcome this [60, 93], the impact on clinical outcome was not fully explored, which should be taken into account when interpreting the results.

In most studies, the choice of distance measure was not specified, and so it was assumed that the default measures in statistical packages were used (i.e. Euclidean distance). Only two studies [19, 44] specified varying the distance measures (Gower and/or Jaccard) to observe the effect. One study group used centroid linkage as their linkage criterion, whereas the rest used Ward's method. Consequently, we cannot say whether the methods employed were the most reliable, given that hundreds of alternative distance and linkage options are available.
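The practical effect of the distance measure can be seen in a small sketch that compares Euclidean distance on raw numeric values with a Gower-type dissimilarity, which scales each numeric feature by its range and scores categorical features as match or mismatch. The three patient records, the feature ranges, and the function name `gower_distance` are hypothetical and serve only to illustrate the calculation.

```python
import math

def gower_distance(x, y, kinds, ranges):
    """Gower dissimilarity: mean of per-feature dissimilarities.
    Numeric features use |x - y| / range; categorical features use 0/1."""
    parts = []
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind == "num":
            parts.append(abs(xi - yi) / rng if rng else 0.0)
        else:
            parts.append(0.0 if xi == yi else 1.0)
    return sum(parts) / len(parts)

# Hypothetical records: FEV1 (% predicted), blood eosinophils (cells/uL), atopy.
kinds = ["num", "num", "cat"]
ranges = [60.0, 900.0, None]  # assumed observed ranges of the numeric features
a = (85, 150, "atopic")
b = (40, 200, "non-atopic")
c = (84, 400, "atopic")

# Euclidean distance on the raw numeric columns ignores atopy and is
# dominated by the eosinophil count because of its larger scale.
euclid_ab = math.dist(a[:2], b[:2])
euclid_ac = math.dist(a[:2], c[:2])
gower_ab = gower_distance(a, b, kinds, ranges)
gower_ac = gower_distance(a, c, kinds, ranges)
print(euclid_ab < euclid_ac, gower_ac < gower_ab)
```

In this example the nearest neighbour of patient `a` is `b` under raw Euclidean distance but `c` under the Gower-type measure, which shows how an unreported default can change which patients are grouped together.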

Prosperi et al. hypothesized that clusters resulting from various studies differed because of variation in investigator choice of factors, encoding/categorization/transformation of variables, and methodology [19]. They proceeded to verify this using different hierarchical clustering and data reduction approaches on a cohort of children aged 6–18 years from the Paediatric Asthma Clinic in Ankara, Turkey. Data reduction was performed by both FA and PCA, resulting in five ‘dimensions’ of variables accounting for 35% of the variance. Multiple hierarchical clustering analyses were performed by varying the variable encoding scheme, distance linkages, feature selection, and dimensionality reduction space. Although the authors demonstrated that small variations in linkage-distance functions did not affect the resulting clusters [19], they tested only two, and it is possible that other linkage criteria may have influenced the results. Most significant was the fact that changes in variable encoding schemes and transformations resulted in different clusters [19]. While it is possible to test the strength of the methods employed by bootstrapping and/or multiple repetitions, this does not necessarily translate into more plausible results overall.

This is where model-free clustering runs into issues, and where a model-based approach might provide more structured methods, as MCMC and EM algorithms are applicable to all modelled distributions. However, in latent class analysis, there is no agreement on the optimal way to determine the number of classes. The most common method is the BIC, though other methods such as the AIC, likelihood tests, bootstrapping, and entropy have been used extensively, which may account for the different classes across populations.
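That different selection criteria can favour different numbers of classes follows directly from their standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, n the number of observations, and L the maximised likelihood. The sketch below applies both to a set of hypothetical model fits; the log-likelihoods, parameter counts, and sample size are invented for illustration and do not come from any of the studies discussed.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical fits: number of classes -> (log-likelihood, free parameters).
fits = {2: (-1250.0, 9), 3: (-1225.0, 14), 4: (-1212.0, 19)}
n_obs = 500  # hypothetical sample size

best_aic = min(fits, key=lambda k: aic(*fits[k]))
best_bic = min(fits, key=lambda k: bic(*fits[k], n_obs))
print(best_aic, best_bic)  # prints "4 3"
```

Because the BIC penalty per parameter grows with ln n while the AIC penalty is fixed at 2, the BIC tends to select fewer classes in larger samples, so two studies applying different criteria to similar data may report different numbers of subtypes.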

Differing Subtypes Across Populations

It is clear that different clusters are identified across different populations (see Table 3). Apart from differences in statistical methodologies, these disparities may be due to differences in the features/variables selected to inform the model (for example, the choice of lung function variables differed among studies, and post-bronchodilator FEV1 was included in only a few of these [43]). Of note, in addition to influencing heterogeneity in identified clusters, the non-inclusion of some potentially important variables (e.g. post-bronchodilator lung function) may result in failure to capture some important underlying mechanisms. Additionally, most studies were conducted in patients with severe or moderate–severe asthma, and the same subtypes may not be seen in the mild asthma population (Table 3).

Table 3 Studies using model-free approaches for subtyping asthma

It is also important to note that clusters identified cross-sectionally at a specific time point may not always be seen at different time points. Further longitudinal analysis is required to visualize how the clusters vary over time.

Conclusion

Our understanding of asthma has come a long way, and data-driven hypothesis-generating clustering methods have aided in identifying distinct subtypes. However, we must exercise caution when translating these results into clinical practice, as statistical inference on a large data set is needed to identify disease subtypes, and biomarkers that would allow differentiation of such subtypes at any cross-sectional time point are in most cases not available. Further challenges to the optimal use of clustering methodologies include tailoring models to individual data sets and incorporating genetic, epigenetic, and more detailed molecular-level data. The resulting models should then be able to accommodate large volumes of data in order to discern the developmental profiles of each individual, facilitating a genuinely personalised approach to asthma management.