Rationale: Chronic obstructive pulmonary disease (COPD) is a heterogeneous disease that likely includes clinically relevant subgroups.
Objectives: To identify subgroups of COPD in ECLIPSE (Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints) subjects using cluster analysis and to assess clinically meaningful outcomes of the clusters during 3 years of longitudinal follow-up.
Methods: Factor analysis was used to reduce 41 variables determined at recruitment in 2,164 patients with COPD to 13 main factors, and the variables with the highest loading were used for cluster analysis. Clusters were evaluated for their relationship with clinically meaningful outcomes during 3 years of follow-up. The relationships among clinical parameters were evaluated within clusters.
Measurements and Main Results: Five subgroups were distinguished using cross-sectional clinical features. These groups differed regarding outcomes. Cluster A included patients with milder disease and had fewer deaths and hospitalizations. Cluster B had less systemic inflammation at baseline but had notable changes in health status and emphysema extent. Cluster C had many comorbidities, evidence of systemic inflammation, and the highest mortality. Cluster D had low FEV1, severe emphysema, and the highest exacerbation and COPD hospitalization rate. Cluster E was intermediate for most variables and may represent a mixed group that includes further clusters. The relationships among clinical variables within clusters differed from those in the entire COPD population.
Conclusions: Cluster analysis using baseline data in ECLIPSE identified five COPD subgroups that differ in outcomes and inflammatory biomarkers and show different relationships between clinical parameters, suggesting the clusters represent clinically and biologically different subtypes of COPD.
Patients with chronic obstructive pulmonary disease (COPD) are heterogeneous, manifesting a wide range of clinical features, physiology, anatomy, and progression (1, 2). This heterogeneity is likely to reflect differing pathogenetic mechanisms that may respond differently to specific therapeutic interventions. Thus, the identification of more homogeneous subgroups may help in the development of treatments that could alter the course of the disease.
A recent consensus definition proposed that a COPD phenotype is “a single or combination of disease attributes that describe differences between individuals with COPD as they relate to clinically meaningful outcomes (symptoms, exacerbations, response to therapy, rate of disease progression or death)” (3). It follows that any such subtype or phenotype must have discriminant power among patients with COPD and must be prospectively tested to determine its validity (3).
Data mining methods, such as cluster analysis (4) and recursive partitioning (5), can be used to identify discrete groups of patients with similar combinations of disease characteristics. These techniques have been used previously to identify clusters of subjects with COPD (6–12), although the length of long-term follow-up information and the size of the cohorts used in these studies have varied. The ECLIPSE (Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints) study is a large, observational, longitudinal study aimed at characterizing the natural history of COPD (1, 13–16). The large size of the ECLIPSE COPD cohort (n = 2,164 patients); the detailed clinical, functional, radiological, and biological characterization of participants; and its 3-year follow-up allow the use of cluster analysis to identify subgroups of patients with COPD and to relate them to different clinically relevant outcomes.
We hypothesized that, within the ECLIPSE cohort, there will be discrete groups of subjects with different clinical characteristics that are associated with different outcomes. To test this hypothesis, we used clustering to identify COPD subgroups and then determined relationships with lung function decline, exacerbation frequency, progression of emphysema (assessed by computed tomography [CT]), 6-minute-walk distance (6MWD), health status, and mortality over 3 years.
A total of 2,164 subjects with COPD with GOLD spirometry grades 2 to 4 recruited in the ECLIPSE study (Clinicaltrials.gov identifier NCT00292552; GlaxoSmithKline study SCO104960) were used for analysis. The ECLIPSE study design has been published previously (13). ECLIPSE complies with the Declaration of Helsinki and Good Clinical Practice Guidelines and has been approved by the ethics committees of the participating centers. All participants provided signed consent.
The methods used in ECLIPSE have been published (1, 13, 15). Briefly, standardized questionnaires were used to record symptoms, health status, and patient comorbidity. Nutritional status was assessed using body mass index (BMI) and fat-free mass index (FFMI). The latter was measured by bioelectrical impedance. Exacerbations in the year before the study and during follow-up were also recorded (14). Spirometry and 6MWD were performed according to international guidelines (13, 16, 17). Low-dose CT scan of the chest (GE Healthcare, Little Chalfont, Buckinghamshire, UK or Siemens Healthcare, Erlangen, Germany) (1, 13, 15) was used to quantify the extent of emphysema at –950 Hounsfield units (Pulmonary Workstation 2.0; VIDA Diagnostics, Iowa City, IA) (15, 18). Peripheral venous blood was collected into vacutainer tubes in the morning after an overnight fast.
Circulating white blood cell (WBC) count was measured at each participating center. Serum and plasma samples were stored at –80°C until analyzed centrally. Fibrinogen, C-reactive protein (CRP) (high sensitivity method), club cell secretory protein-16, surfactant protein-D (SP-D), chemokine (C-C motif) ligand 18, IL-6, IL-8, and TNF-α serum concentrations were determined by validated immunoassays as reported in detail elsewhere (19).
Results are summarized using mean and SD for continuous variables and percentages for categorical variables. Due to their skewed distributions, circulating inflammatory biomarker results are reported as median and interquartile range.
SAS version 9.1 (SAS Institute Inc., Cary, NC) was used for the factor analysis, demographic summaries, and longitudinal analyses. R version 2.11.0 (R Foundation for Statistical Computing, Vienna, Austria) was used to perform the clustering based on classification trees created using the random forest (RF) (20) package (version 4.5–34), based on the work of Breiman (21). More specifically, the approach to RF clustering described by Shi and Horvath was followed (22). This technique uses a dissimilarity measure based on RF predictors, which allows for mixed data types to be used (continuous, discrete, and categorical) and produces results robust to outliers. Multiple imputations (×10) were performed corresponding to 10 RF distance matrices to deal with missing data. S-PLUS 7.0 (TIBCO, Seattle, WA) was used for the partitioning of the reduced set of clinical variables using the RPART routine (5).
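The dissimilarity step of random forest clustering can be sketched in a few lines. The example below is a minimal illustration, not the ECLIPSE analysis code: it assumes the forest has already been fitted and that the terminal node of each subject in each tree is available (the `toy_leaves` input and the function name are hypothetical). Proximity is taken as the share of trees in which two subjects fall in the same terminal node, and dissimilarity as sqrt(1 − proximity), the transformation usually associated with the Shi and Horvath approach.

```python
import math


def rf_dissimilarity(leaf_ids):
    """Build a dissimilarity matrix from per-tree terminal-node assignments.

    leaf_ids[t][i] is the terminal node of subject i in tree t
    (hypothetical input; no forest is fitted here).
    """
    n_trees = len(leaf_ids)
    n_subjects = len(leaf_ids[0])
    dissim = [[0.0] * n_subjects for _ in range(n_subjects)]
    for i in range(n_subjects):
        for j in range(i + 1, n_subjects):
            # Proximity: fraction of trees where i and j share a terminal node.
            shared = sum(1 for tree in leaf_ids if tree[i] == tree[j])
            proximity = shared / n_trees
            # Assumed transformation: dissimilarity = sqrt(1 - proximity).
            d = math.sqrt(1.0 - proximity)
            dissim[i][j] = dissim[j][i] = d
    return dissim


# Toy input: 4 trees, 3 subjects (made-up node labels).
toy_leaves = [
    [1, 1, 2],
    [3, 3, 3],
    [5, 6, 6],
    [2, 2, 4],
]
toy_dissim = rf_dissimilarity(toy_leaves)
```

Because the measure depends only on whether two subjects share a terminal node, it is indifferent to whether the underlying variables are continuous, discrete, or categorical, which is the property the text relies on.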
Differences in demographic and clinical characteristics between resulting clusters were analyzed using ANOVA for continuous normally distributed variables, χ2 tests for categorical variables, and the Kruskal-Wallis test for nonnormally distributed variables. Differences in longitudinal changes between cluster groups were also assessed using ANOVA. Longitudinal outcomes of mortality and time-to-first exacerbation were tested using log-rank tests and are displayed using Kaplan-Meier curves. Because this analysis is “hypothesis generating,” all testing was done at a nominal significance level of 0.05 without adjustment for multiple comparisons.
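For readers unfamiliar with the Kaplan-Meier curves mentioned above, the product-limit estimator can be sketched as follows. This is a minimal illustration on made-up follow-up times, assuming right-censored data coded as 1 for an observed event and 0 for censoring; the function name and toy values are hypothetical, and the log-rank comparison between clusters is not shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times[i] is the follow-up time of subject i; events[i] is 1 if the
    event (e.g., death) was observed and 0 if the subject was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all subjects whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy data: the subject with time 2 is censored.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In the toy data the censored subject leaves the risk set without lowering the curve, so the estimate drops only at times 1, 3, and 4.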
The study was sponsored by GlaxoSmithKline. A Steering Committee and a Scientific Committee comprised of 10 academics and six GlaxoSmithKline employees formulated the plan for the current analyses, approved the statistical plan, had full access to the data, and were responsible for decisions regarding publication. The study sponsor did not place any restrictions on statements made in the final paper.
The clinical, functional, radiological, and biological characteristics of the 2,164 patients with COPD included in the analysis have been reported elsewhere (1).
Initial variables for the factor analysis were selected by consensus of the ECLIPSE Steering and Scientific Committees. For this purpose, 41 variables from several distinct domains, including clinical, physiologic, imaging, and biomarker parameters, were selected (Table 1), eliminating those known to overlap significantly (e.g., FEV1 % predicted and FEV1 absolute). These variables were used as the input for a factor analysis (23), which identified 13 factors with eigenvalues >1 (see Table E1 in the online supplement). These 13 factors accounted for 61% of the overall variability in the set of 41 selected parameters. With two exceptions, variables with the highest loading on each factor were selected for the cluster analysis (Table 2). FFMI was chosen over sex because it allowed for additional within-sex analyses without altering the set of selected variables, and fibrinogen was chosen over CRP because it has better assay sensitivity (24). The factor loadings for both substitutions (FFMI/sex, fibrinogen/CRP) were very similar.
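The selection logic described above (retain factors with eigenvalues >1, then take the highest-loading variable on each) can be sketched as follows. All numbers and variable names are invented for illustration and are not derived from one another or from ECLIPSE. The explained-variance line assumes the eigenvalues come from a correlation matrix, so that they sum to the number of variables.

```python
def retained_factors(eigenvalues, threshold=1.0):
    """Indices of factors whose eigenvalue exceeds the threshold."""
    return [k for k, ev in enumerate(eigenvalues) if ev > threshold]


def top_variable_per_factor(loadings, factors):
    """For each factor, pick the variable with the largest absolute loading.

    loadings[var] is a list of loadings, one entry per retained factor.
    """
    chosen = {}
    for k in factors:
        chosen[k] = max(loadings, key=lambda var: abs(loadings[var][k]))
    return chosen


# Hypothetical eigenvalues for a 4-variable correlation matrix.
toy_eigenvalues = [2.1, 1.3, 0.4, 0.2]
# Hypothetical loadings on the first two factors only.
toy_loadings = {
    "var_a": [0.82, 0.10],
    "var_b": [0.79, -0.05],
    "var_c": [0.12, -0.88],
    "var_d": [0.05, 0.30],
}

kept = retained_factors(toy_eigenvalues)
selected = top_variable_per_factor(toy_loadings, kept)
# Share of total variance carried by the retained factors.
explained = sum(toy_eigenvalues[k] for k in kept) / len(toy_eigenvalues)
```

The two substitutions reported in the text (FFMI for sex, fibrinogen for CRP) would correspond to manually overriding entries of `selected` where a near-equal loading made another variable preferable.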
| Demographics | Symptoms | Biochemical | Clinical/Functional |
|---|---|---|---|
| Age* | CES-D depression score | White blood cell count | 6-minute-walk distance |
| Sex | FACIT-F fatigue score | Neutrophil count | Blood oxygen % |
| Current smoker (Y/N) | mMRC dyspnea score | Eosinophil count | FEV1 % predicted |
| Pack-year history | Exacerbation history (>1 vs. ≤1) | Hematocrit | FEV1 percent reversibility |
| | ATS-DLD items: | CC-16 | FVC percent predicted |
| | Chronic cough | SP-D | FEV1/FVC ratio |
| | Chronic sputum | Fibrinogen | AX (impulse oscillometry) |
| | Cardiovascular events | IL-8 | R5 (impulse oscillometry) |
| | Reflux | IL-6 | R5–R15 (impulse oscillometry) |
| | Osteoporosis | TNF-α | Emphysema % LAA (–950 HU) |
| | Anxiety | CCL18/PARC | Qualitative CT grade |
| | Diabetes | CRP | BMI |
| | Hypertension | | Fat-free mass index |
Baseline Characteristics | Cluster A (n = 205) | Cluster B (n = 98) | Cluster C (n = 423) | Cluster D (n = 321) | Cluster E (n = 1,117) | Overall P Value |
---|---|---|---|---|---|---|
Demographics | ||||||
Age, yr | 64 (7)* | 63 (7) | 64 (7) | 63 (7) | 63 (7) | 0.169 |
Female, % | 34 | 34 | 13 | 38 | 42 | <0.001 |
BMI, kg/m² | 26 (4) | 27 (6) | 32 (6) | 24 (4) | 25 (5) | <0.001 |
Fat-free mass index† | 17.9 (2.1) | 18.2 (2.6) | 20.8 (2.3) | 16.9 (1.9) | 17.4 (2.1) | <0.001 |
Current smoker, % | 25 | 31 | 35 | 25 | 42 | <0.001 |
Pack-year history | 46 (26) | 49 (28) | 54 (30) | 48 (26) | 47 (26) | <0.001 |
Symptoms | ||||||
mMRC score (≥2), % | 24 | 51 | 60 | 69 | 52 | <0.001 |
Chronic phlegm,† % | 35 | 37 | 46 | 56 | 56 | <0.001 |
Chronic cough, % | 28 | 42 | 48 | 53 | 53 | <0.001 |
Chronic bronchitis, % | 23 | 20 | 32 | 36 | 39 | <0.001 |
Chronic wheezing, % | 33 | 38 | 40 | 40 | 41 | 0.335 |
Exacerbations prior year | 0.56 | 0.81 | 0.83 | 1.13 | 0.85 | <0.001 |
SGRQ total score | 35 (16) | 47 (17) | 50 (19) | 56 (16) | 48 (18) | <0.001 |
Depression (CES-D ≥16) | 9 | 26 | 26 | 33 | 28 | <0.001 |
Fatigue (FACIT-F score)† | 43 (7) | 33 (10) | 34 (10) | 32 (10) | 35 (11) | <0.001 |
Physiology | ||||||
FEV1 % predicted | 55 (15) | 51 (17) | 51 (15) | 38 (13) | 49 (15) | <0.001 |
FEV1 reversibility %† | 13 (14) | 12 (10) | 10 (11) | 6 (11) | 12 (15) | <0.001 |
Respiratory resistance, 5–15 Hz† | 0.14 (0.07) | 0.18 (0.09) | 0.17 (0.08) | 0.22 (0.09) | 0.19 (0.10) | <0.001 |
6MWD, m | 426 (107) | 377 (110) | 354 (116) | 325 (125) | 376 (121) | <0.001 |
BODE Index | 1.8 (1.6) | 2.8 (2.1) | 3.1 (2.1) | 4.5 (2.1) | 3.1 (2.0) | <0.001 |
Imaging | ||||||
LAA % (–950 HU)† | 14 (10) | 18 (12) | 12 (7) | 32 (12) | 16 (11) | <0.001 |
Comorbidities | ||||||
Cardiovascular, %‡ | 28 | 32 | 48 | 27 | 31 | <0.001 |
Hypertension,† % | 32 | 34 | 59 | 32 | 38 | <0.001 |
Diabetes, % | 8 | 7 | 22 | 4 | 8 | <0.001 |
Reflux, % | 23 | 30 | 25 | 25 | 26 | 0.667 |
Systemic inflammation markers | ||||||
WBC† (10⁹/l) | 5.5 (0.7) | 6.2 (0.1) | 9.5 (2.2) | 8.5 (1.6) | 7.8 (2.3) | <0.001 |
Fibrinogen,† mg/dl | 393 (354–434)§ | 430 (370–489) | 478 (429–545) | 462 (405–525) | 438 (375–505) | <0.001 |
CC-16,† ng/ml | 5.6 (4.1–7.7) | 4.2 (3.0–5.5) | 5.7 (3.8–7.6) | 5.2 (3.7–7.0) | 4.7 (3.2–6.6) | <0.001 |
SP-D,† ng/ml | 129 (90–196) | 108 (82–151) | 128 (92–182) | 108 (75–154) | 118 (82–173) | <0.001 |
IL-8,† pg/ml | 5.8 (3.2–10.0) | 5.6 (3.6–10.0) | 8.4 (4.6–14.7) | 6.4 (3.4–11.0) | 7.2 (3.0–14.0) | <0.001 |
IL-6, pg/ml | 1.7 (0.7–3.3) | 1.9 (0.6–3.3) | 3.4 (0.8–7.1) | 1.9 (0.6–4.4) | 2.0 (0.5–4.4) | <0.001 |
TNF-α ≥ 4.7,†¶ pg/ml, % | 34 | 29 | 25 | 27 | 30 | 0.155 |
Cluster analysis, without an a priori–dependent variable to anchor the classification of data into groups, was used sequentially to segment the entire cohort (n = 2,164). This method can divide the population into any number of clusters up to the population size. Assessment of silhouette width demonstrated monotonic increase through 10 clusters. Figure 1 shows the grouping of patients as cluster number increased through six clusters. The clustering that resulted in five groups was chosen for further analysis, in part because existing groups identified with smaller numbers of clusters were being split up rather than new groupings identified and also due to clinical judgment that the clinical features going from five to six groups resulted in patient characteristics that were becoming more homogeneous rather than more distinct.
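Silhouette width, used above to compare candidate numbers of clusters, can be computed directly from a dissimilarity matrix. The sketch below is a minimal illustration on four made-up points and assumes at least two clusters; the function name and toy data are hypothetical and unrelated to the ECLIPSE distance matrices.

```python
def mean_silhouette(dissim, labels):
    """Average silhouette width for a labeling, given pairwise dissimilarities.

    dissim[i][j] is the dissimilarity between subjects i and j;
    labels[i] is the cluster of subject i. Requires >= 2 clusters.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean dissimilarity to the other members of the same cluster.
        a = sum(dissim[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster.
        b = min(
            sum(dissim[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Toy example: four points on a line, dissimilarity = absolute difference.
toy_points = [0.0, 1.0, 10.0, 11.0]
toy_matrix = [[abs(p - q) for q in toy_points] for p in toy_points]
good = mean_silhouette(toy_matrix, [0, 0, 1, 1])  # natural grouping
bad = mean_silhouette(toy_matrix, [0, 1, 0, 1])   # mismatched grouping
```

Values near 1 indicate compact, well-separated clusters and negative values indicate subjects that sit closer to another cluster than to their own. A statistic that keeps rising with the number of clusters, as reported here, does not by itself pick a solution, which is why clinical judgment was also used.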
The clinical characteristics of the five clusters at baseline are shown in Table 2. As expected, there were significant differences in a number of characteristics across the five clusters. Table 3 shows that the identified clusters also showed different relationships with the outcomes investigated, except rate of FEV1 decline, which was similar for all five clusters.
Longitudinal Outcomes | Cluster A (n = 205) | Cluster B (n = 98) | Cluster C (n = 423) | Cluster D (n = 321) | Cluster E (n = 1,117) | Overall P Value* |
---|---|---|---|---|---|---|
Died within 3 yr,† % | 3 | 6 | 13 | 12 | 9 | <0.001 |
Median time to COPD exacerbation, d | 492 | 318 | 347 | 150 | 285 | <0.001 |
COPD hospitalization within 3 yr,† % | 25 | 29 | 33 | 53 | 31 | <0.001 |
COPD exacerbation rate, PPPY | 0.76 | 1.08 | 1.12 | 1.77 | 1.17 | <0.001 |
COPD hospitalization rate, PPPY | 0.12 | 0.21 | 0.27 | 0.48 | 0.24 | <0.001 |
Rate of decline in FEV1, ml | 34 (39)‡ | 35 (41) | 33 (45) | 33 (37) | 32 (43) | 0.945 |
SGRQ total score change at Year 3 | 1.7 (13.3) | 2.6 (12.8) | −0.2 (13.3) | 2.8 (11.2) | 0.2 (13.1) | 0.027 |
Change in 6-min-walk distance at Year 3, m | 2 (95) | −5 (94) | −15 (92) | −31 (106) | −19 (100) | 0.022 |
Change in emphysema % LAA at Year 3 | 1.9 (4.5) | 2.9 (5.7) | 1.0 (4.3) | 2.3 (5.3) | 1.8 (4.8) | 0.017 |
Cluster A included the patients with milder COPD within the ECLIPSE cohort. At baseline, FEV1, 6MWD, St. George’s Respiratory Questionnaire (SGRQ) total score, and modified Medical Research Council (mMRC) dyspnea score were best preserved in these subjects (Table 2). Circulating levels of inflammatory biomarkers were generally lower in Cluster A, except for SP-D, which was high (Table 2). Over the subsequent 3 years, this group had the lowest exacerbation frequency, hospitalization rate, and mortality and had the best preserved 6MWD, whereas progression of emphysema and change in SGRQ total score were intermediate. Based on these characteristics, we have labeled Cluster A as “Moderate-quasi-stable.”
At the other extreme, Cluster D includes patients with more symptoms, worse lung function, and more emphysema at baseline (Table 2). Inflammatory biomarker levels were generally intermediate, except for low SP-D and high fibrinogen levels (Table 2). During follow-up, this group had the highest exacerbation rate, the greatest deterioration of 6MWD and SGRQ, the second highest mortality, and the second greatest progression of emphysema (Table 3; Figure 2). Accordingly, we labeled this cluster as “Emphysematous exacerbators.”
Cluster C included mostly men and demonstrated the highest BMI, least emphysema, most comorbidities, and highest levels of systemic inflammatory markers (Table 2). Over 3 years, this group had the worst survival (Figure 2) but had the second lowest rate of exacerbations and the least progression of emphysema (Table 3). Accordingly, we labeled this cluster as “Inflamed comorbids” (Figure 2).
At recruitment, Cluster B showed intermediate values of airflow limitation, SGRQ, 6MWD, and emphysema and low levels of inflammatory markers (Table 2). However, over 3 years this group had a greater loss of health status and progression of emphysema (Table 3) but better survival (Figure 2) and better preservation of 6MWD than any of the other groups except Cluster A (Table 3). We labeled this cluster as “Functional emphysema.”
Cluster E was the largest cluster in the ECLIPSE cohort (Figure 1). This group may contain additional subgroups of patients with COPD who could not be readily distinguished by the cluster analysis or may simply represent the inherent heterogeneity of COPD. Subjects in this group had intermediate values for most baseline features and longitudinal outcomes, so we labeled it as “Mixed” (Figure 2).
We hypothesized that correlations between clinical parameters determined within a cluster may be stronger or weaker than those determined in the population of patients with COPD at large. For this reason, a correlation matrix of key variables was determined within the population as a whole and within individual clusters, and their relationships were explored using network analysis (details are provided in the online supplement). Interestingly, the relationships among clinical features differ among the clusters, further supporting the concept that the clusters represent different clinical entities.
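The idea that a pooled correlation can mask different within-cluster relationships is easy to demonstrate. The sketch below computes Pearson correlations for the whole sample and for each cluster separately; the data, labels, and function names are invented for illustration, and the network analysis described in the online supplement is not reproduced.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_by_cluster(x, y, labels):
    """Correlation in the pooled sample ("all") and within each cluster."""
    groups = {}
    for xi, yi, lab in zip(x, y, labels):
        gx, gy = groups.setdefault(lab, ([], []))
        gx.append(xi)
        gy.append(yi)
    result = {"all": pearson(x, y)}
    for lab, (gx, gy) in groups.items():
        result[lab] = pearson(gx, gy)
    return result


# Toy data: the two clusters have opposite relationships between x and y.
toy_x = [1, 2, 3, 1, 2, 3]
toy_y = [2, 4, 6, 6, 4, 2]
toy_labels = ["A", "A", "A", "B", "B", "B"]
toy_corr = correlation_by_cluster(toy_x, toy_y, toy_labels)
```

In this deliberately extreme toy case the pooled correlation is zero while the within-cluster correlations are +1 and −1.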
The goal of the clustering analysis was to identify unique, clinically relevant groups using the available clinical, physiological, imaging, and biomarker data from the ECLIPSE study. However, the RF method used for clustering does not cut individual variables at particular points the way that a partitioning algorithm does but accounts for all variables together at once. Moreover, some of the measures that were used to construct the clusters are not easily used by a general practitioner or may involve tests that are expensive and carry some risk, such as CT scans. To facilitate determination of whether findings could be replicated by others with commonly obtained clinical data sets, we attempted to determine if the clusters that emerged through factor analysis and clustering could be replicated to some extent with a partitioning algorithm, using more routinely collected information. To this end, 23 of the variables used to create the clusters (italicized in Table 1) were considered accessible in a routine clinical setting: demography, patient-reported medical and exacerbation history, blood markers (WBC, neutrophils, fibrinogen, and CRP), basic pulmonary function (FEV1, FVC, and FEV1/FVC), and the mMRC dyspnea scale. A recursive partitioning algorithm (5) was used to assign subjects into the five defined clusters using these 23 selected variables. Using this method, optimal partitioning of five variables (WBC, BMI, mMRC dyspnea score, fibrinogen, and FEV1 % predicted) assigned the subjects to the group identified by cluster analysis in approximately two-thirds of the cases. The full decision tree and the percentage of correct assignments to each cluster are shown in Figure 3. The groups defined by these five variables were assessed for prognosis (Table 4); outcomes were very similar to the clustering results, even though approximately one-third of the subjects were classified differently.
The largest difference between the partitioning and the original clustering was in Cluster D, which had fewer subjects assigned, using the partitioning method (n = 120) compared with clustering (n = 321).
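The core step of a recursive partitioning algorithm is the search for the cut point on one variable that best separates the target groups; the tree is built by repeating this search within each resulting subset. The sketch below illustrates a single such search. It assumes Gini impurity as the splitting criterion, a common default for classification trees; the source does not state which criterion was used, and the values and labels are made up rather than taken from Figure 3.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Find the cut point on one variable that minimizes weighted Gini impurity.

    Returns (cut_point, weighted_impurity); cut_point is None if no split
    improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot cut between identical values
        cut = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (cut, score)
    return best


# Toy data: one hypothetical measurement and the cluster each subject belongs to.
toy_values = [4.0, 5.0, 6.0, 9.0, 10.0, 11.0]
toy_clusters = ["A", "A", "A", "C", "C", "C"]
toy_split = best_split(toy_values, toy_clusters)
```

Unlike the RF dissimilarity, which uses all variables jointly, each node of such a tree yields an explicit threshold on a single variable, which is what makes the result usable as a bedside rule.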
Longitudinal Outcomes | Group A (n = 209) | Group B (n = 122) | Group C (n = 319) | Group D (n = 120) | Group E (n = 1,394) | Overall P Value* |
---|---|---|---|---|---|---|
Died within 3 yr,† % | 2 | 7 | 14 | 8 | 10 | <0.001 |
Median time to COPD exacerbation, d | 538 | 268 | 303 | 135 | 265 | <0.001 |
COPD hospitalization within 3 yr,† % | 18 | 30 | 37 | 61 | 34 | <0.001 |
COPD exacerbation rate, PPPY | 0.70 | 1.18 | 1.16 | 1.96 | 1.23 | <0.001 |
COPD hospitalization rate, PPPY | 0.09 | 0.23 | 0.29 | 0.55 | 0.27 | <0.001 |
Rate of decline in FEV1, ml | 30 (42)‡ | 33 (40) | 30 (48) | 32 (28) | 34 (42) | 0.581 |
SGRQ total score change at Year 3 | 1.5 (13.4) | 2.7 (13.2) | −0.3 (14.4) | 1.4 (11.1) | 0.7 (12.6) | 0.335 |
Change in 6-min-walk distance at Year 3, m | 3 (90) | −14 (89) | −19 (101) | −40 (101) | −18 (100) | 0.025 |
Change in emphysema % LAA at Year 3 | 1.4 (4.5) | 2.9 (5.3) | 0.5 (3.8) | 2.9 (5.5) | 1.9 (4.9) | <0.001 |
The results of this study confirm our working hypothesis and show that, within the ECLIPSE cohort, there are discrete groups of patients with COPD that differ in both their clinical characteristics at recruitment and their association with clinically relevant outcomes after 3 years of follow-up.
A number of studies have looked at different methods to group subtypes of patients with COPD. Pistolesi and colleagues (6) grouped 322 patients with COPD based on predominant characteristics of airflow limitation using multidimensional scaling. Cho and colleagues (7) used clinical and genetic characteristics to cluster patients with COPD in the National Emphysema Treatment Trial. Vanfleteren and colleagues (12) identified five comorbidity clusters in 213 patients with COPD entering a rehabilitation program. Most recently, Castaldi and colleagues (25) evaluated 10,192 subjects from the COPDGene cohort. These studies did not relate the identified clusters to prospective outcomes. Two recent studies, however, have related COPD clusters to outcomes. Garcia-Aymerich and colleagues (8) identified three clusters in 342 patients with COPD hospitalized for the first time because of an exacerbation of COPD. Interestingly, these three clusters correspond very well to our Cluster A (Moderate-quasi-stable), Cluster C (Inflamed comorbids), and Cluster D (Emphysematous exacerbators) in terms of their clinical features and for subsequent mortality and hospitalizations, which were the only outcomes assessed. Likewise, Burgel and colleagues (10, 11) classified 322 patients with COPD into four clusters, which also correspond very well with our results. In that study, one cluster had more mild-to-moderate disease with very low risk of mortality, which closely resembled our Cluster A; the two clusters associated with the highest risk of mortality corresponded with our Clusters C and D. A subsequent analysis of a cohort of 527 patients found the same groups (9).
Although the concordance of these studies is based on published summary descriptions, the striking similarity supports the robustness of the groupings, especially considering the differences in patient selection, cohort sizes, clinical features, and strategy used for clustering between our study and the two previous studies. The current study further demonstrates that these clusters, which can be identified within a COPD population by several different strategies, differ for outcomes other than mortality. Interestingly, the analysis by Castaldi and colleagues (25) from COPDGene used a very different set of measures, based primarily on CT scan–defined features, and identified four clusters. One cluster was individuals who appeared to be relatively resistant to the effects of cigarette smoke exposure, a group not assessed in the current study, which was limited to individuals with GOLD stage 2, 3, and 4 COPD. The remaining three clusters identified by Castaldi and colleagues appear very congruent with Clusters B, C, and D in the current study, despite using very different features as the basis for clustering. Taken together, our study and prior work suggest that the subtypes of COPD that have been identified do not depend on the methods used but are robust subgroups that differ prognostically.
In this context, changes in the 6MWD showed distinct differences across clusters, and these changes were greatest in Cluster D (–31 m). Because a minimal clinically important difference of about 30 m has been reported for COPD (16), heart disease (26), idiopathic pulmonary fibrosis (27), and, possibly, pulmonary hypertension (28), this suggests substantial deterioration in many of the patients in this cluster for this parameter (29) despite this group being the most impaired at baseline. In contrast, the 6MWD improved 2 m on average in Cluster A, suggesting that most of the subjects in this cluster are stable for this parameter.
Several observations based on the clustering results are noteworthy because they suggest different pathophysiology and clinical courses for the different clusters. First, although SGRQ and 6MWD are well recognized to be related to FEV1, the correlation in large groups of patients with COPD is relatively weak (30–32). In our clusters, the strength of correlation of FEV1 with SGRQ varied from 0.50 to 0.24. This supports the concept that airflow limitation may be a more important cause of functional compromise in Cluster A and less in Cluster D compared with the other clusters, suggesting there may be a different relationship between airflow and health status in these two clusters.
Emphysema progression also differed across clusters. Cluster B had the most progression despite having less pronounced changes regarding other outcomes. The severity of emphysema in these subjects was intermediate and similar to that in other groups at baseline. In contrast, baseline emphysema was greatest in Cluster D, which also showed progression in this component of the disease. These two groups were nearly identical in age and smoking history (pack-years), making it unlikely that Cluster B represents subjects who will progress to Cluster D. Rather, it suggests two different groups with progressive emphysema.
The selected measures of inflammation assessed were differentially expressed across clusters (except for TNF-α, which was detectable in a low percentage of the subjects). Cluster C generally had the highest levels for WBC, fibrinogen, SP-D, IL-6, and IL-8. However, Cluster D had high levels of fibrinogen, low levels of SP-D, and intermediate levels of the other measures. In contrast, Cluster A had high levels of SP-D, intermediate levels of club cell secretory protein-16, and low levels of the other measures. Cluster B had intermediate levels of fibrinogen and low levels of the other measures, and Cluster E was generally intermediate. This suggests that each cluster has a distinct pattern of inflammation. When correlations were assessed between measures of inflammation and clinical features of COPD, these were generally stronger in Cluster C, suggesting a different role for inflammation in this group than in the other clusters.
Clusters C and D had the highest mortality. Because mortal events were not adjudicated in ECLIPSE, it is not possible to relate clusters to cause of death. However, Cluster C is strikingly enriched for individuals with cardiac and related comorbidities. These conditions are often associated with inflammatory biomarkers, which were also found in the current analysis. Whether these biomarkers are a result of the comorbidities or contribute to their pathogenesis (or both) remains to be determined. That the association is much stronger within a specific cluster, however, suggests that further studies designed to explore these questions may be better approached within specific clusters.
The relationship of SP-D is of particular interest. The current analysis found that SP-D was reduced among individuals with more severe emphysema but higher in those in whom emphysema was progressing. A relationship has been reported between elevated levels of SP-D and lower lung function (33). SP-D is produced in the lung and can leak into plasma when permeability is increased (e.g., when there is inflammation) (34). High levels, therefore, may indicate high disease activity and progression, whereas low levels may reflect loss of lung tissue and more severe emphysema. It is therefore possible that the implications of a given biomarker could differ from cluster to cluster.
FEV1 decline was not significantly different across the five clusters identified, which is at variance with the evolution of other disease characteristics. Perhaps this is because the different conditions included under the term COPD are all defined based on reduced FEV1. A study of longer duration and/or one that includes larger numbers of subjects, however, may be able to distinguish differences in rate of decline in airflow.
The current analysis is fairly regarded as “hypothesis generating.” The approach used will always generate clusters, the validity of which must then be further tested. There are several approaches to this. The simplest is to replicate the findings in a different data set. In the current study, the entire data set was used for clustering, which increases power to discern clusters, and two different approaches were used in an attempt to validate the clusters and to help define their clinical significance. First, several key data parameters, such as health status (SGRQ) and exacerbation frequency, were excluded from the clustering. That SGRQ and exacerbations were subsequently found to differ among clusters provides internal validation. More importantly, the current analysis took advantage of the longitudinal nature of ECLIPSE to determine if the various clusters manifested differing natural histories, which was the case. This supports the validity of the clusters and the concept that they identify groups that are distinct for key clinical outcomes. Final validation of any approach to clustering will require a combination of approaches. To this end, several efforts to perform metaclustering, using multiple and varied cohorts, are underway.
One of the strengths of the current analysis is the richness of the clinical data available, including the availability of longitudinal assessments and clinical outcomes. The approach used, which takes advantage of all available data, does not permit easy “diagnosis” of individuals. For practical application, the clusters defined in the current study must be evaluable by others. In this regard, the simpler set of five measures identified, using the partitioning algorithm with defined cut points, may be of particular value because they provide fairly similar groupings to the clustering approach. This algorithm can be readily applied in a number of settings and may help subset the COPD population for a number of applications, including selection for clinical trials and in clinical management.
The current analysis has several limitations. Although the information available on the ECLIPSE cohort is very detailed, potentially important parameters were not measured, including lung diffusing capacity, sleep disturbances, and microbiological data, among others. This limitation is particularly likely for biomarkers. Identification of biomarkers of potential relevance for COPD, including measures of inflammation in the lungs, is advancing rapidly, and measures not available for the current analysis may prove very useful in the future. In addition, the current analysis is not a population-based sample. Subjects were recruited by a variety of means, including recruitment from clinical practices and by advertisement. Thus, the population must be considered a convenience sample. To what degree the clusters identified reflect the COPD population at large is, therefore, undetermined.
It is also likely that the clusters identified here capture the heterogeneity of COPD only partially. The largest cluster (Cluster E; see Figure 1) was generally intermediate and may include a number of smaller clusters that could not be distinguished in the current analysis. Perhaps this is not surprising because the number of genetic and etiologic factors that can contribute to fixed airflow limitation, the defining feature of COPD, is very large. How many “subtypes” of COPD will be definable remains to be determined, but it is likely many more than four. In addition, although the 3-year follow-up in ECLIPSE allows prognosis to be assessed within clusters, this is still a relatively short follow-up in the context of the course of COPD.
Finally, the current clustering analysis was designed to determine if clinically distinct subgroups of patients with COPD could be identified. It was not designed as a diagnostic. Thus, the accuracy of assignment of individuals to specific clusters was not optimized; nor were the parameters used for assignment designed to be a parsimonious and efficient diagnostic paradigm. The analysis, however, does identify that distinct groups of patients with COPD exist that have different prognoses.
Cluster analysis, using clinical and biological variables assessed at enrollment in ECLIPSE, was able to identify five subgroups that differed regarding clinical features, appeared to be more homogeneous internally, and, importantly, had distinctly different natural histories over a 3-year follow-up. Three of these clusters and their relationship to mortality are corroborated by previous findings in smaller longitudinal cohorts, where similar types of patient subgroups have been found using different variables and different clustering methods, suggesting that these groups represent robust COPD subtypes. Dividing the COPD population into subgroups that are more homogeneous and that have similar prognoses has the potential to improve the ability to understand COPD pathogenesis and to test interventions designed to alter the natural history of COPD.
This work was supported by GlaxoSmithKline.
Author Contributions: S.I.R., N.L., and A.A. were involved in writing the manuscript or had substantial involvement in its revision before submission. All authors were involved in the conception/design of the study or the acquisition of data or analysis or interpretation of data, and all authors have approved the manuscript.
This article has an online supplement, which is accessible from this issue's table of contents at www.atsjournals.org.
Author disclosures are available with the text of this article at www.atsjournals.org.