## Abstract

**Background** Heart failure (HF) is a chronic condition in which the heart does not pump enough blood to meet the body's demands. Diffusing capacity of the lung for nitric oxide (*D*_{LNO}) and carbon monoxide (*D*_{LCO}) may be used to classify patients with HF, as *D*_{LNO} and *D*_{LCO} are lung function measurements that reflect pulmonary gas exchange. Our objectives were to determine 1) if *D*_{LNO} added to *D*_{LCO} testing predicts HF better than *D*_{LCO} alone and 2) whether the binary classification of HF is better when *D*_{LNO} z-scores are combined with *D*_{LCO} z-scores than using *D*_{LCO} z-scores alone.

**Methods** This was a retrospective secondary data analysis in 140 New York Heart Association Class II HF patients (ejection fraction <40%) and 50 patients without HF. z-scores for *D*_{LNO}, *D*_{LCO} and *D*_{LNO}+*D*_{LCO} were created from reference equations from three articles. The model with the lowest Bayesian Information Criterion was the best predictive model. Binary HF classification was evaluated with the Matthews Correlation Coefficient (MCC).

**Results** The top two of 12 models were combined z-score models. The highest MCC (0.51) was from combined z-score models. At most, only 32% of the variance in the odds of having HF was explained by combined z-scores.

**Conclusions** Combined z-scores explained 32% of the variation in the likelihood of an individual having HF, which was higher than models using *D*_{LNO} or *D*_{LCO} z-scores alone. Combined z-score models had a moderate ability to classify patients with HF. We recommend using the NO–CO double diffusion technique to assess gas exchange impairment in those suspected of HF.

## Tweetable abstract

**D_{LNO}+D_{LCO} is better at classifying heart failure than either one alone** https://bit.ly/3upnR46

## Introduction

In patients with heart failure (HF), there is significant pulmonary gas exchange impairment even at rest, as the alveolar–arterial oxygen tension difference (*P*_{A–aO2}) is excessive (∼28 mmHg) and the arterial oxygen tension (*P*_{aO2}) is low (mild hypoxaemia, ∼72 mmHg) [1]. Diffusing capacity of the lung for carbon monoxide (*D*_{LCO}) and nitric oxide (*D*_{LNO}) can identify gas exchange issues in HF patients. *D*_{LNO} z-scores share 48–56% variance with *P*_{A–aO2}, while *D*_{LCO} z-scores share 38–49% with *P*_{aO2} [2, 3], aiding gas exchange assessment.

Measuring *D*_{LNO} and *D*_{LCO} together provides a detailed understanding of lung gas exchange issues. While *D*_{LCO} reflects mostly pulmonary vascular/blood volume problems, accounting for ∼70–80% of the red blood cell barrier, *D*_{LNO} predominantly captures the diffusion between the alveolar–capillary membrane and red blood cell membrane (∼60%) (figure 1) [4]. Together, they provide a comprehensive view of different diffusion pathways: what *D*_{LCO} misses, *D*_{LNO} captures. Despite its potential, the NO–CO double diffusion technique remains underutilised. Companies manufacture the device [5], but clinicians are generally unaware of it and therefore cannot request it. Furthermore, no devices that measure *D*_{LNO} are yet approved by the US Food and Drug Administration and, despite several conversations with manufacturers, they have been reluctant to spend the time and effort to obtain approval. Thus, the NO–CO double diffusion technique has only been used for research purposes by a select few scientists. Recently, the European Respiratory Society (ERS) 2017 Technical Standards for *D*_{LNO} [4] highlighted its use, increasing clinicians' exposure to the technical advantages of *D*_{LNO} over *D*_{LCO}, which favour its routine use in pulmonary function testing [6]. Thus, hopefully, companies will move towards obtaining the required governmental approvals so that *D*_{LNO} becomes as ubiquitous as *D*_{LCO} in routine pulmonary function testing.

HF leads to a decline in the alveolar membrane diffusing capacity (*D*_{m}), especially as HF worsens [7]. Even though some HF research employs the single-breath NO–CO method [8–12], combining *D*_{LCO} and *D*_{LNO} in HF evaluation remains under-researched. Considering ∼35% of HF patients also have COPD [13], which affects both pulmonary capillary blood volume (*V*_{c}) and *D*_{m} [3], the NO–CO double diffusion method, which captures both of these issues, may offer a more accurate HF assessment than *D*_{LCO} alone, which primarily detects *V*_{c} and/or haemoglobin concerns.

The ERS/American Thoracic Society 2022 Technical Standard on lung function interpretation encourages the classification of the severity of lung function impairment based on z-scores [14]. A z-score is a standardised score that indicates how many standard deviations a value is from the mean of a reference population. The z-score, compared with the percentage predicted, is a better way to classify pulmonary diffusion impairment since the percentage predicted at the lower limit of normal (LLN) changes with age [15].
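To illustrate why a fixed percentage-predicted cut-off is problematic, the minimal sketch below computes the percentage predicted that corresponds to the LLN (z = −1.645). The predicted values and the constant residual standard error are hypothetical placeholders chosen only for illustration; they are not taken from any published reference equation.

```python
LLN_Z = -1.645  # 5th percentile of a standard normal distribution


def percent_predicted_at_lln(predicted: float, rse: float) -> float:
    """Percentage of predicted that corresponds to the LLN.

    Assumes normally distributed residuals with a constant residual
    standard error (rse), which is a simplification for illustration.
    """
    lln_value = predicted + LLN_Z * rse
    return 100.0 * lln_value / predicted


# Hypothetical predicted DLCO values (mL/min/mmHg) for a younger and an older adult.
younger = percent_predicted_at_lln(predicted=30.0, rse=4.0)
older = percent_predicted_at_lln(predicted=20.0, rse=4.0)
print(f"LLN as % predicted: younger {younger:.1f}%, older {older:.1f}%")
```

Under these assumptions the same z-score threshold maps to a lower percentage predicted as the predicted value falls, so a single percentage cut-off cannot mark the 5th percentile at every age.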

As such, this study aimed to 1) assess *D*_{LNO} z-scores' impact on *D*_{LCO} z-scores and their HF correlation and 2) classify HF using the Matthews Correlation Coefficient (MCC) [16]. We hypothesised that *D*_{LNO} z-scores added to *D*_{LCO} z-scores improve HF association and classification compared with *D*_{LCO} z-scores alone.

## Methods

This was a retrospective secondary data analysis that utilised four prior studies involving White HF patients and the single-breath NO–CO double diffusion technique [9–11, 17]. At Centro Cardiologico Monzino (Milan, Italy), all patients provided consent, allowing their data to be used for retrospective research in a completely anonymous manner. As the data were previously published by the same research group [9–11, 17], the transferred data were fully anonymised so that a data transfer agreement was not necessary. These four studies were a combination of cross-sectional or case–control research designs.

This article's analyses focus on classifying those with New York Heart Association (NYHA) Class II HF from controls using *D*_{LNO}, *D*_{LCO} or their combined z-scores, distinguishing it from the goals of previous studies [9–11, 17]. Throughout these studies, we maintained consistency regarding breath-hold time (BHT) (∼5.5 s), testing apparatus and the research team involved. Since all four studies' data were gathered at the same research centre, employing fixed statistical models was deemed suitable. We opted for a BHT of ∼5.5 s because the NO electrochemical sensor used for *D*_{LNO} assessments functions within the ppm range. A 10 s breath-hold could reduce the exhaled NO concentration into the ppb range, which falls below the sensor's detection limit. As the established reference equations are grounded on a BHT of 6.2±1.3 s [4, 15, 18], the potential influence of a slightly shorter BHT on gas diffusion in the lung is offset by comparing the HF patients to reference equations that employ similar BHTs.

### HF prediction

We derived the *D*_{LCO} and *D*_{LNO} z-scores from three primary sources: ERS Technical Standards [4], Munkholm *et al.* [18] and Zavorsky and Cao [15]. Subsequently, we combined the individual z-scores (*D*_{LCO} z-score+*D*_{LNO} z-score) for each method. As a result, both the ERS Technical Standards [4] and Munkholm *et al*. [18] yielded three outcomes per patient: one *D*_{LCO} z-score, one *D*_{LNO} z-score and one combined z-score. On the other hand, Zavorsky and Cao [15] presented prediction formulas using segmented linear regression and generalised additive models for location, scale and shape (GAMLSS). Consequently, this approach produced two sets of scores: two *D*_{LCO} z-scores, two *D*_{LNO} z-scores and two combined z-scores for each patient. It is worth noting that the ERS Technical Standards' reference equations [4] drew from aggregated adult data [19–21], while Zavorsky and Cao's [15] were based on a combined dataset of children and adults [18–22].

These 12 normalised z-scores, considering age, sex, height, altitude and lung function device, acted as independent variables in binary logistic regression for HF prediction. (For both linear and segmented linear regression, the z-score is determined by the formula: (measured value–predicted value)/residual standard error. For the z-score computation *via* the GAMLSS model from Zavorsky and Cao [15], please consult the footnote in table 3 of their publication.) The model with the lowest Bayesian Information Criterion (BIC), a typical selection method for lung function reference equations, was chosen for the best overall fit [23, 24].
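The z-score formula for the linear and segmented linear equations, and the summed combined score, can be sketched as follows. The measured values, predicted values and residual standard errors in the example are hypothetical placeholders, not outputs of any of the cited reference equations, and the GAMLSS-based computation is not covered here.

```python
def z_score(measured: float, predicted: float, rse: float) -> float:
    """Standardised residual: (measured - predicted) / residual standard error."""
    if rse <= 0:
        raise ValueError("residual standard error must be positive")
    return (measured - predicted) / rse


def combined_z(z_dlco: float, z_dlno: float) -> float:
    """Combined score as described in the text: the simple sum of both z-scores."""
    return z_dlco + z_dlno


# Hypothetical patient (all numbers are illustrative assumptions).
z_co = z_score(measured=17.0, predicted=25.0, rse=4.0)
z_no = z_score(measured=69.0, predicted=105.0, rse=15.0)
z_sum = combined_z(z_co, z_no)
print(f"DLCO z = {z_co:.2f}, DLNO z = {z_no:.2f}, combined = {z_sum:.2f}")
```

Because the combined score is a sum of two z-scores, it is not itself a z-score, which is why its classification threshold is taken from the ROC analysis instead of the −1.645 cut-off.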

Between-model differences in BIC were interpreted as follows: BIC difference 0–2: weak evidence (50–75% probability that the lower BIC model is better); BIC difference 2–6: positive evidence (75–95% probability that the lower BIC model is better); BIC difference 6–10: strong evidence (95–99% probability that the lower BIC model is better); and BIC difference >10: very strong evidence (>99% probability that the lower BIC model is better) [25].
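As a rough sketch, the BIC and the evidence categories above could be coded as follows. The BIC formula is the standard definition; the handling of the shared boundary values (2, 6 and 10), which the categories above leave ambiguous, is an assumption made here for illustration, as are the function names.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def bic_evidence(delta_bic: float) -> str:
    """Map a non-negative BIC difference to the evidence category in the text.

    Boundary values are assigned to the weaker category (an assumption).
    """
    if delta_bic < 0:
        raise ValueError("delta_bic must be non-negative")
    if delta_bic <= 2:
        return "weak"
    if delta_bic <= 6:
        return "positive"
    if delta_bic <= 10:
        return "strong"
    return "very strong"


# Hypothetical log-likelihoods for two one-predictor logistic models (n = 190).
bic_a = bic(log_likelihood=-100.0, n_params=2, n_obs=190)
bic_b = bic(log_likelihood=-102.0, n_params=2, n_obs=190)
print(bic_evidence(bic_b - bic_a))
```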

To predict HF through pulmonary diffusion anomalies, we utilised the z-score threshold derived from the optimal combination of sensitivity and specificity from the receiver operating characteristic (ROC) curve analysis for all combined models. Additionally, we employed a z-score threshold of −1.645 (corresponding to the 5th percentile, or LLN) for the standalone *D*_{LCO} and *D*_{LNO} z-score models. We applied two-sided independent t-tests to contrast absolute *D*_{LCO} with *D*_{LCO} z-scores, absolute *D*_{LNO} with *D*_{LNO} z-scores and the *D*_{LNO}/*D*_{LCO} ratio between different groups. We used the Benjamini–Hochberg procedure to control the false discovery rate among MCCs and set it at 0.05 [26].
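For readers unfamiliar with the Benjamini–Hochberg step-up procedure, a minimal standard-library sketch is given below. The p-values are invented for illustration and are not results from this study.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return a reject/keep flag per hypothesis, controlling the FDR at q.

    Step-up rule: find the largest rank k with p_(k) <= (k / m) * q and
    reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values (illustrative only).
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(example, q=0.05))
```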

### Classification of HF *versus* controls

Model efficacy was further gauged using the area under the ROC curve (AUC) and the MCC. Notably, the MCC is an especially reliable metric when evaluating binary categories in datasets where the number of disease cases does not match non-disease cases, as in our pooled dataset [16, 27, 28]. The MCC considers true positives, true negatives, false positives and false negatives to provide a comprehensive score for model classification. High MCC scores are only achieved when predictions accurately classify a significant proportion of diseased and non-diseased patients, regardless of any class imbalance. We also derived the 95% confidence interval for the MCC from 1000 bootstrap samples and evaluated each MCC accordingly [29, 30].
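The MCC and a bootstrap confidence interval can be sketched as below. The percentile method is an assumption (the text does not state which bootstrap interval was used), and the confusion-matrix counts are hypothetical: they reuse the group sizes of 140 and 50 but are not the study's classification results.

```python
import math
import random
from typing import List, Sequence, Tuple


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient; returns 0.0 if any margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def bootstrap_mcc_ci(y_true: List[int], y_pred: List[int], n_boot: int = 1000,
                     alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the MCC (resampling patients with replacement)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(mcc(*confusion([y_true[i] for i in idx],
                                    [y_pred[i] for i in idx])))
    stats.sort()
    lo_idx = int(n_boot * alpha / 2)
    return stats[lo_idx], stats[n_boot - lo_idx - 1]


# Hypothetical counts: 140 HF (110 detected), 50 controls (40 correctly cleared).
y_true = [1] * 140 + [0] * 50
y_pred = [1] * 110 + [0] * 30 + [0] * 40 + [1] * 10
point = mcc(*confusion(y_true, y_pred))
low, high = bootstrap_mcc_ci(y_true, y_pred)
print(f"MCC = {point:.2f} (95% CI {low:.2f} to {high:.2f})")
```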

The prevalence of HF in Italians aged between 60 and 85 years is ∼4% [31]. This starkly contrasts with the prevalence in our four-study cohort (∼74%). Due to this disparity, we have chosen not to report the positive and negative predictive values, along with the false positive and false negative rates. Nevertheless, we did calculate parameters such as the true positive rate (sensitivity, probability of detection), true negative rate (specificity), false omission rate, false detection rate, and both positive and negative likelihood ratios since they are unaffected by disease prevalence.
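The prevalence-independence of sensitivity, specificity and the likelihood ratios can be checked with a small sketch. The counts are hypothetical, and the second call simply multiplies the control group by 10 to mimic a lower-prevalence sample.

```python
import math
from typing import Dict


def diagnostic_summary(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity and positive/negative likelihood ratios."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = math.inf if specificity == 1 else sensitivity / (1 - specificity)
    lr_neg = math.inf if specificity == 0 else (1 - sensitivity) / specificity
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
    }


# Hypothetical counts with a high disease prevalence (140 of 190).
high_prev = diagnostic_summary(tp=110, tn=40, fp=10, fn=30)
# Same test performance, ten times as many controls (lower prevalence).
low_prev = diagnostic_summary(tp=110, tn=400, fp=100, fn=30)
print(high_prev)
print(low_prev)
```

Both calls return the same four values because each metric is computed within the diseased group or within the non-diseased group, never across them.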

SPSS Statistics version 29 (IBM, Armonk, NY, USA) and R version 4.2.2 (www.r-project.org) were used for statistical analyses (specifically, R packages AICcmodavg version 2.3.1 and glmnet version 4.1.6). A p-value <0.05 signified statistical significance.

## Results

### General findings

140 patients with NYHA Class II HF (113 males, 27 females) and 50 control subjects (26 males, 24 females) were included in this analysis (table 1). The same researchers tested all subjects in the same hospital centre [9–11, 17]. Only baseline values were used when patients were tested multiple times or where they were in different experimental conditions. These were a cohort of ambulatory low ejection fraction (<40%) HF patients that were regularly followed. The subjects without HF, on average, were 4 years younger and had a mean body mass index 2.5 kg·m^{−2} less than those with HF (p<0.05). Approximately 50% of those with HF had an obstructive or restrictive spirometric pattern compared with 20% without HF (p<0.001). The mean BHT for the diffusing capacity testing was 5.5±0.5 s (range 4.2–7.4 s).

There were significantly different diffusing capacities between groups (z-scores in non-HF: *D*_{LCO}= −1.03±0.84, *D*_{LNO}= −1.30±0.91, combined z-scores= −2.33±1.45; z-scores in HF: *D*_{LCO}= −2.18±1.32, *D*_{LNO}= −2.35±1.19, combined z-scores= −4.53±2.34) (p<0.001 between groups for all, z-scores created from segmented regression [15]).

About 24% of the subjects without HF showed mild *D*_{LCO} impairment (per the diffusion impairment classification scheme of Zavorsky and Cao [15]). Conversely, 65% of the HF patients displayed mild, moderate or severe *D*_{LCO} impairment. About 36% of the non-HF subjects showed mild *D*_{LNO} impairment. Conversely, 69% of the HF patients displayed mild, moderate or severe *D*_{LNO} impairment.

The absolute values for *D*_{LCO} and *D*_{LNO} for those with HF were 17.0±5.7 and 69±23 mL·min^{−1}·mmHg^{−1}, respectively. The absolute values for *D*_{LCO} and *D*_{LNO} for those without HF were 21.8±4.9 and 88±21 mL·min^{−1}·mmHg^{−1}, respectively. The 95% CI of the difference was 3.3–6.4 and 13–26 mL·min^{−1}·mmHg^{−1} for *D*_{LCO} and *D*_{LNO}, respectively (two-sided p-value <0.001). The *D*_{LNO}/*D*_{LCO} ratio was 4.09±0.67 for those with no disease and 4.20±0.92 for those with heart disease (95% CI of the difference −0.36–0.12) (two-sided p-value 0.36), indicating that the ratio cannot be used to distinguish those with and those without HF. The correlation between *D*_{LCO} and *D*_{LNO} in mL·min^{−1}·mmHg^{−1} was 0.79 (95% CI 0.73–0.85) and the correlation between *D*_{LCO} and *D*_{LNO} z-scores was 0.73 (95% CI 0.64–0.79) when the z-scores were created from segmented regression [15].

### Prediction results

Table 2 presents the summary results of the 12 models. The “ΔBIC” column indicates the change in BIC compared with the first model. In this case, there is positive evidence that Model 1 is a better fit than Models 2 and 3, strong evidence that Model 1 is better than Models 4–7, and very strong evidence that Model 1 is better than Models 8–12. Thus, models that used ERS reference equations [4] or GAMLSS reference equations [15] were the best fit among the 12 models. Approximately 32% of the variability in the odds of having HF can be explained by combined z-scores when the z-scores were generated from the ERS equations. In other words, the combined z-scores from Model 1 provide information that helps explain 32% of the variance in the likelihood of an individual having HF. Four of the six highest-ranked models used combined z-scores.

The “BIC weight” column in table 2 indicates how likely each model is to be the best of the models that have been fitted. For example, Model 1 has a BIC weight of 0.69, which suggests that there is a 69% chance that Model 1 is the best model compared with the 11 other models tested. In the second row, Model 2 has a BIC weight of 0.17, which suggests that there is a 17% chance that Model 2 is the best model compared with the other 11 models tested. Model 1 is therefore 4.1 times more likely than Model 2 to be the better fit (0.69/0.17=4.1). The “Cumulative weight” column in table 2 shows the percentage chance that a particular group of models is the best in the table. For example, Model 2 has a cumulative weight of 0.86, suggesting that there is an 86% chance that either Model 1 or Model 2 is the best model of all the models tested in table 2.
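BIC weights of this kind are conventionally computed as exp(−ΔBIC/2) normalised over all candidate models. The sketch below assumes that convention; the three BIC values are hypothetical and are not those of table 2.

```python
import math
from itertools import accumulate
from typing import List, Sequence


def bic_weights(bics: Sequence[float]) -> List[float]:
    """Normalised model weights: exp(-0.5 * delta_BIC) / sum over all models."""
    best = min(bics)
    relative = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical BIC values for three candidate models, best model first.
bics = [200.0, 202.8, 204.0]
weights = bic_weights(bics)
cumulative = list(accumulate(weights))
evidence_ratio = weights[0] / weights[1]  # how much more likely model 1 is than model 2

for b, w, c in zip(bics, weights, cumulative):
    print(f"BIC {b:.1f}  weight {w:.2f}  cumulative {c:.2f}")
print(f"evidence ratio, model 1 vs model 2: {evidence_ratio:.1f}")
```

The ratio of two weights depends only on the BIC difference between those two models, so it is unchanged when further models are added to or removed from the comparison.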

While table 2 summarises the comparative results of different reference equations from three distinct studies [4, 15, 18], a detailed examination of each reference equation reveals a consistent trend. The BIC for combined z-scores was always superior to the BIC for standalone *D*_{LNO} or *D*_{LCO} z-scores within the same reference equation set, with a minimum BIC difference of 4 units. This pattern emerged across various models. Specifically, using the combined z-scores from the ERS reference equations [4] resulted in a lower BIC than using individual *D*_{LCO} or *D*_{LNO} z-scores from that set. A similar trend was observed in the reference equations from Munkholm *et al.* [18], where combined z-scores led to a lower BIC than standalone *D*_{LCO} or *D*_{LNO} scores. In the segmented linear regression or GAMLSS reference equation models by Zavorsky and Cao [15], the combined z-scores again demonstrated a lower BIC than the individual *D*_{LCO} or *D*_{LNO} z-scores. Therefore, while Models 1 and 2 stood out as the best overall models among the other 10 presented, combined z-scores consistently offered better BIC results when partitioning the three studies separately [4, 15, 18].

### Odds ratios from the prediction models

Table 3 displays the odds ratio of heart disease for every 1 unit increase in z-scores. For example, when using the calculated *D*_{LCO} z-scores from the ERS Task Force equations [4], it can be shown that for every 1 unit increase in *D*_{LCO} z-scores, the odds of having HF are reduced by 45% to 74%.
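The percentages quoted above follow from the usual reading of a logistic regression coefficient: the odds ratio is exp(β), and the percentage change in odds per 1-unit increase is (OR − 1) × 100. The sketch below back-calculates the odds ratios implied by the quoted 45% and 74% reductions; these values are derived from the text, not read from table 3.

```python
import math


def odds_ratio(beta: float) -> float:
    """Odds ratio for a 1-unit increase in the predictor."""
    return math.exp(beta)


def percent_change_in_odds(or_value: float) -> float:
    """Percentage change in odds per 1-unit increase (negative = reduction)."""
    return (or_value - 1.0) * 100.0


# Odds ratios implied by reductions of 45% and 74% (derived, see lead-in).
for implied_or in (0.55, 0.26):
    beta = math.log(implied_or)
    change = percent_change_in_odds(odds_ratio(beta))
    print(f"OR {implied_or:.2f} -> beta {beta:.3f} -> {change:.0f}% change in odds")
```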

The odds that the *D*_{LCO}, *D*_{LNO} or combined (*D*_{LCO}+*D*_{LNO}) z-scores are below the LLN when a patient has HF are presented in table 4. For example, when the ERS prediction equations [4] are used, the odds that the combined z-scores are below the LLN are ∼6–34 times higher in a patient with HF.

### Classification results

Table 5 shows that Model 1 (row 1) had a higher MCC than models presented in rows 6–12. The positive and negative likelihood ratios describe how the probability of disease shifts when the finding is present and absent, respectively [32]. While the AUC for Model 1 is 0.79, demonstrating good discriminatory ability between positive and negative cases, this can be misleading as the MCC was only 0.51 (table 5). So, the power of Model 1 in determining true positives and true negatives while minimising false positives and false negatives was only “moderate”. (An MCC of +1 represents a perfect classification, an MCC of 0 indicates random classification and an MCC of –1 indicates total disagreement between observation and classification. Since the MCC of 0.51 is above the random classification level and indicates a correlation that is more than half-way towards perfect classification, in that sense one may describe an MCC of 0.51 as moderate, as it is past the half-way point but not near perfect.)

## Discussion

This study evaluated the combined effectiveness of *D*_{LNO} and *D*_{LCO} z-scores in predicting and classifying NYHA Class II HF patients from controls. Class II HF patients, given their mild exercise limitations and relatively few symptoms, are often less inclined to adopt the recommended comprehensive therapy regimen for HF. Our analysis revealed that the combined z-scores significantly outperformed individual z-scores, with an 86% probability that one of the first two combined z-score models was the best predictor of HF among the 12 we assessed. Interestingly, the first combined model alone had a 69% likelihood of being the top performer.

The most informative metric we used was the MCC, which captures the true balance between positive and negative classifications, minimising errors. Four out of the top six MCC values came from models utilising combined z-scores. While the AUC is often used as a performance metric, it can be misleading, especially in datasets with high disease prevalence [33]. For instance, an 80% accuracy in a population with 74% prevalence is not a notable achievement [27]. The MCC, which aggregates true positives, true negatives, false positives and false negatives, offers a more truthful representation, especially when both true positives and true negatives are equally significant.

In this pooled dataset, the *D*_{LCO} z-scores and *D*_{LNO} z-scores shared 53% (95% CI 41–64%) of their variance. This implies that for HF patients, approximately half of the variability in one z-score can be predicted from the other. Since there is not complete overlap between the *D*_{LNO} and *D*_{LCO} z-scores (*i.e.* they do not share 100% of their variance), it is advisable to measure both *D*_{LNO} and *D*_{LCO} to capture the entirety of the variance present. This accentuates the advantages of employing the NO–CO double diffusion technique in pulmonary function tests over traditional *D*_{LCO} tests alone. The combined use of *D*_{LNO} and *D*_{LCO} z-scores provides a broader perspective, capturing variances that either score might miss individually. There are certain standalone conditions, like anaemia [34], polycythaemia, CO poisoning [35] or pulmonary capillary haemorrhage, where *D*_{LNO} is much less sensitive than *D*_{LCO} in detecting microvascular changes in the haemoglobin sink for oxygen transfer, but generally the combination proves more robust in classifying HF.

Although *D*_{LNO} is technically superior to *D*_{LCO} [6], it does not render *D*_{LCO} obsolete. The combined approach offers the most reliable identification of HF. While *D*_{LNO} can replace *D*_{LCO} in some cardiopulmonary diseases [6], the synergy of using both for impairment of pulmonary diffusing capacity is crucial for HF.

Our analysis also highlighted that only about a third of the variance in combined z-scores directly correlates with HF identification. The factors causing the remaining two-thirds of the variance remain an enigma. Despite the strength of combined z-scores with HF, established screening tools (e.g. brain natriuretic peptide (BNP), N-terminal pro-BNP and echocardiography) are still better [36, 37]. Our intent was not to substitute these tools but to showcase how *D*_{LNO} z-scores could complement *D*_{LCO} z-scores under specific circumstances. Indeed, combining *D*_{LNO} and *D*_{LCO} offers a more comprehensive insight into the specific location of resistance in the oxygen diffusion pathway, from the alveoli to the interior of the red blood cell.

### Limitations

Several challenges were present in this study. First, the limited application of the NO–CO double diffusion method constrained the amount of available data. Second, the dataset mainly represents an adult White population, excluding children and diverse ethnic groups. The equations in Model 1 do not account for child-related data or specific device adjustments. Third, even though Model 1 performed the best among the 12 models examined, it might not be the most suitable in specific conditions like anaemia [34], polycythaemia or elevated carboxyhaemoglobin levels [35], such as in cases of CO poisoning. Under these conditions, only *D*_{LCO} is influenced. Fourth, we chose not to adjust for haemoglobin in *D*_{LCO} due to feasibility and model fit concerns. Notably, ∼20% of NYHA Class II HF patients display signs of anaemia (11.7±1.1 g·dL^{−1}) [38]. At this concentration and prevalence, *D*_{LCO} results likely remain stable. Lastly, the participants labelled as “non-HF” were not perfect controls. Approximately 20% of these control participants displayed restrictive or obstructive patterns despite normal heart function. The MCC would likely be higher if these controls had no lung obstruction or restriction.

### Conclusions

In conclusion, combined *D*_{LNO}+*D*_{LCO} z-score models outperform single z-score models for HF fit, and binary HF classification benefits from combined z-scores over individual *D*_{LCO} or *D*_{LNO} LLN cut-offs. However, combined z-scores explain only ∼32% of the variance in the likelihood of HF, which argues against their use for HF screening. Instead, the NO–CO technique is recommended for gas exchange assessment and improved binary classification in HF patients.

## Acknowledgements

The authors wish to thank the colleagues of P. Agostoni, specifically Elisabetta Salvioni and Anna Apostolo (Centro Cardiologico Monzino IRCCS, Milan, Italy), for obtaining the de-identified raw data from four of their studies for use here.

## Footnotes

Provenance: Submitted article, peer reviewed.

Data availability: The data that were used in this study will be available from G.S. Zavorsky or P. Agostoni, but only if the researchers requesting the data correctly cite this work and allow G.S. Zavorsky or P. Agostoni to be co-authors on any subsequent publication that uses these data.

Conflict of interest: None declared.

- Received September 1, 2023.
- Accepted November 15, 2023.

- Copyright ©The authors 2024

This version is distributed under the terms of the Creative Commons Attribution Non-Commercial Licence 4.0. For commercial reproduction rights and permissions contact permissions@ersnet.org