Introduction

In 1816 Dr. Laënnec invented the most common symbol of medicine: the stethoscope1. The use of the stethoscope is considered an essential skill in the medical profession, and the instrument is often chosen for its ease of use, as well as for its appearance and reputation2. Auscultation of the respiratory system is non-invasive, safe, inexpensive and easy to perform. History taking and a detailed physical examination, including auscultation, are considered essential parts of clinical examination. However, detailed auscultation alone can take up to 10 minutes3. Nowadays, physicians might not be in a position to spend that amount of time evaluating chest sounds, potentially leading to an inefficient and superficial examination and delaying further diagnostic work-up and treatment3,4.

To date, it is still unclear how this diagnostic tool contributes to the diagnostic work-up for various pulmonary entities. Although the diagnostic accuracy of lung auscultation is widely debated, the stethoscope is still a first-line diagnostic tool and is used for clinical or therapeutic decision-making.

The question arises whether the use of the stethoscope still contributes to further diagnostic work-up or whether using it is just a waste of time. So, is the stethoscope, 200 years after its invention, ready to be relegated to a museum shelf, or does it still provide vital clues to aid in the diagnosis5,6? The objective of this meta-analysis is to evaluate the diagnostic accuracy of lung auscultation in various clinical settings for the four most common acute respiratory pathologies: congestive heart failure, (hemato)pneumothorax, pneumonia, and obstructive lung diseases.

Methods

Search strategy and selection criteria

This is a systematic review and meta-analysis following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, to improve the quality of the meta-analysis7. The protocol was registered at ‘PROSPERO International prospective register of systematic reviews’ (http://www.crd.york.ac.uk/PROSPERO, registration number: CRD42016035312).

The following inclusion criteria were used:

  • Study designs: case-control studies, cross-sectional studies, prospective or retrospective observational studies and randomized controlled trials.

  • Time frame: all medical literature published up to the full search conducted on 19 January 2017.

  • Participants: adult patients admitted to any clinical department of a primary or secondary care institution.

  • Index test: lung auscultation, or lung auscultation as part of the physical examination.

  • Comparator: all studies comparing or evaluating lung auscultation, or lung auscultation as part of the physical examination, with a reference standard mentioned below.

  • Target condition: cardiopulmonary edema (referred to as congestive heart failure in this meta-analysis), (hemato)pneumothorax, pneumonia, and obstructive lung diseases.

  • Outcome measures: all data concerning diagnostic accuracy (sensitivity, specificity, positive and negative likelihood ratios (LRs), area under the curve (AUC) and heterogeneity). Raw data must be mentioned or retrievable.

  • Reference standard: chest radiography (CXR), thoracic computed tomography (CT), Doppler echocardiography, spirometry (FEV1/FVC ratio) or final diagnosis by an expert panel, for various medical conditions.

  • Language: manuscripts published in all languages.

A medical literature search specialist of the Free University medical library (J.C.F.K.) was consulted to define a robust search strategy. The PubMed® Resource Guide search engine was used to access the MEDLINE® database. The following terms were used (including all synonyms and closely related words) as index terms or free-text words: ‘stethoscopes’ or ‘auscultation’ or ‘respiratory system’ and ‘sensitivity’ or ‘specificity’. Supplementary Appendix A shows the complete PubMed® (MEDLINE®) search. An EMBASE® search was defined; however, due to the large number of duplicates with PubMed® and the disproportionate number of articles, only the extensive PubMed® search was analysed. If necessary, authors were contacted for further information.

Abstracts and titles of all articles were analysed by two independent investigators (L.A. and E.H.T.L.). First, all abstracts were screened using the inclusion and exclusion criteria described above. This step was followed by reading the remaining full-text articles, from which relevant articles were selected. For a significant number of full-text articles, raw data were not retrievable, and these articles were excluded. The reference lists of included articles were scanned during the screening process: backward and forward citations were reviewed. Any disagreements were resolved during consensus meetings with a third reviewer (P.R.T.).

Covidence and EndNote X7® Software were used to manage the references. When described, the different breath sounds detected by the index test were also recorded with their sensitivity and specificity. To standardize nomenclature, we followed published guidelines for the definition of the different breath sounds8,9.

Data Analysis

QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) was used to assess risk of bias and applicability concerns (www.quadas.org)10. Supplementary Appendix B shows the form used for the QUADAS-2 assessment. Quality assessment was done by two reviewers (L.A. and E.H.T.L.). Any disagreements were resolved during consensus meetings with a third reviewer (P.R.T.).

A statistician (P.M.v.d.V.) performed the statistical analysis. We selected four patient groups with the most common diagnoses in pulmonary pathology to reduce the heterogeneity encountered during the conduct of this study. The groups of pulmonary pathology included were: congestive heart failure (CHF), (hemato)pneumothorax (HPT), pneumonia, and obstructive lung diseases (OLD). The numbers of true positives, false positives, true negatives and false negatives were obtained from the articles and used for further analysis. As several studies considered different index tests for the same outcome in the same sample of patients, a multilevel approach accounting for this clustering, using the xtmelogit procedure in Stata 12® (StataCorp LLC, College Station, TX), was used to obtain pooled estimates for sensitivity and specificity and their 95% confidence intervals (CI)11. The MIDAS command in Stata was used for forest plots and pooled estimates for LR+, LR−, diagnostic odds ratio (DOR) and area under the ROC curve (AUC). Deeks’ funnel plot asymmetry test was used to test for publication bias.
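The four counts extracted from each article map onto the accuracy measures reported in this meta-analysis through standard definitions. The following is a minimal per-study sketch in Python; the class name `TwoByTwo` and the example counts are hypothetical and are not taken from any included study. It does not reproduce the multilevel pooling performed in Stata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from one study comparing an index test with a reference standard."""

    tp: int  # index test positive, target condition present
    fp: int  # index test positive, target condition absent
    fn: int  # index test negative, target condition present
    tn: int  # index test negative, target condition absent

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def lr_positive(self) -> float:
        # How much more often a positive finding occurs in diseased patients.
        return self.sensitivity() / (1 - self.specificity())

    def lr_negative(self) -> float:
        # How much more often a negative finding occurs in diseased patients.
        return (1 - self.sensitivity()) / self.specificity()

    def diagnostic_odds_ratio(self) -> float:
        return self.lr_positive() / self.lr_negative()


# Hypothetical counts, for illustration only.
study = TwoByTwo(tp=40, fp=10, fn=60, tn=90)
print(f"sensitivity = {study.sensitivity():.2f}")
print(f"specificity = {study.specificity():.2f}")
print(f"LR+ = {study.lr_positive():.2f}, LR- = {study.lr_negative():.2f}")
print(f"DOR = {study.diagnostic_odds_ratio():.2f}")
```

The sketch assumes that no cell is zero; studies with zero cells would need a continuity correction before the ratios can be computed.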

A meta-regression was performed separately for sensitivity and specificity. Predictors considered were diagnosis group, index test used, type of department, percentage male and average age of the study sample. Univariate analyses were performed first, followed by a multivariate analysis in which all five predictors were included. Supplementary Appendix C shows extended information about the data analysis performed.

Results

Study selection and characteristics

After removing the duplicates from the extended search for PubMed® (MEDLINE®), the remaining 5,873 articles were critically analysed, of which 34 were included. A large number of articles were excluded after screening the abstract, based on the inclusion and exclusion criteria of this meta-analysis. Figure 1 shows the selection process following the PRISMA four-phase flow diagram (also see supplementary Table 1 for the PRISMA checklist). Table 1 summarizes the characteristics of the 34 included studies. A total of 14,814 patients were included in this analysis. Auscultation was performed by different types of investigators, with or without teaching interventions.

Figure 1 Flow chart of selection process.

Table 1 Characteristics of included studies.

Diagnostic summary measures

The overall pooled sensitivity for lung auscultation is 37% (95% CI: 30–47%) and specificity 89% (95% CI: 85–92%) (see Table 2 and Fig. 2). Table 3 shows the pooled estimates of sensitivity and specificity for the different types of breath sounds: abnormal, decreased or absent breath sounds, crackles, rhonchi, and wheezes. Heterogeneity was significant when considering all outcomes (P < 0.001), but also when restricted to CHF, OLD and pneumonia. Only heterogeneity of study outcomes for HPT was not significant (P = 0.38). Deeks’ Funnel Plot for all studies (Fig. 3) suggests publication bias (P = 0.01) when considering all outcomes. Publication bias was not significant, when restricting to CHF (P = 0.18), HPT (P = 0.34), OLD (P = 0.75) and pneumonia (P = 0.99). It must, however, be noted that the estimates of the bias when restricting to CHF and HPT were larger than the estimate of the bias based on all outcomes. Therefore, lack of significance for these pathology groups may be due to the small sample sizes (n = 10 and n = 6, respectively). Estimates of bias in the OLD and pneumonia subgroups were much smaller than the estimate of the bias based on all outcomes and sample sizes were larger compared to other subgroups (n = 22 and n = 29, respectively), suggesting the absence of publication bias for those pathology groups (see e-Fig. 1A-D).
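The idea behind Deeks’ funnel plot asymmetry test can be sketched as a weighted regression of the log diagnostic odds ratio on the inverse square root of the effective sample size (ESS), weighted by ESS. The Python sketch below only estimates the intercept and slope and does not compute the P value; the function name `deeks_regression`, the choice to add 0.5 only to studies with a zero cell, and the example counts are assumptions for illustration, not the MIDAS implementation used in this meta-analysis.

```python
import math


def deeks_regression(studies):
    """Fit ln(DOR) = intercept + slope * 1/sqrt(ESS), weighted by ESS.

    studies: iterable of (tp, fp, fn, tn) tuples, one per study.
    Returns (intercept, slope); a slope far from zero suggests asymmetry.
    """
    xs, ys, ws = [], [], []
    for tp, fp, fn, tn in studies:
        if 0 in (tp, fp, fn, tn):
            # Assumed continuity correction so that the log is defined.
            tp, fp, fn, tn = (c + 0.5 for c in (tp, fp, fn, tn))
        n_diseased = tp + fn
        n_non_diseased = fp + tn
        ess = 4 * n_diseased * n_non_diseased / (n_diseased + n_non_diseased)
        xs.append(1 / math.sqrt(ess))
        ys.append(math.log((tp * tn) / (fp * fn)))
        ws.append(ess)

    # Closed-form weighted least squares for a single predictor.
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical studies: the smallest one reports the largest odds ratio.
example = [(80, 20, 120, 180), (40, 10, 60, 90), (25, 2, 25, 48)]
intercept, slope = deeks_regression(example)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

In this framing, a positive slope means that smaller studies tend to report higher diagnostic odds ratios, which is the pattern expected under publication bias.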

Table 2 Diagnostic accuracy (sensitivity, specificity, positive and negative likelihood ratios, diagnostic odds ratio, and area under the curve) for different pulmonary pathologies.

Figure 2 Forest plot of sensitivity and specificity together with their 95% confidence intervals for different acute pulmonary pathologies. Side note: Estimates and confidence intervals for pooled estimates may differ slightly from those in Table 2, as the correlation of sensitivities (and specificities) observed for the different index tests within the same study was ignored when making the forest plot. Abbreviations: PNA: pneumonia; Decr. br. sounds: decreased breath sounds; Air. Obstr.: airway obstruction; dulln: dullness; COPD: chronic obstructive pulmonary disease; Abn. Ausc.: abnormal auscultation; HPT: (hemato)pneumothorax; CHF: congestive heart failure; Uneq. br. sounds: unequal breath sounds; pen.: penetrating.

Table 3 Diagnostic accuracy (sensitivity, specificity, positive and negative likelihood ratios, diagnostic odds ratio, and area under the curve) for different breath sounds.

Figure 3 Deeks’ funnel plot test for publication bias.

Congestive heart failure

Six prospective observational studies included patients with (acute) dyspnoea and compared auscultation with Doppler echocardiography, the Framingham criteria or the diagnosis of an expert panel for CHF12,13,14,15,16,17. Considering the results listed in Table 2, the diagnostic accuracy of auscultation in patients with CHF is poor. Supplementary Figure 2 and Table 3 show that in all six studies the presence of crackles is more sensitive than the presence of wheezes for CHF.

(Hemato)pneumothorax

Four prospective observational studies and one retrospective study included patients with blunt or penetrating chest trauma to compare auscultation with CXR for the detection of hematothorax, pneumothorax or hematopneumothorax18,19,20,21,22. The results in Table 2 show an excellent diagnostic accuracy of auscultation for HPT in trauma patients. The exception is the study of Rodriques et al., which found a low sensitivity for abnormal breath sounds in patients with HPT21. This is the only study that took abnormal breath sounds into account (see Supplementary Figure 3).

Obstructive lung disease

Ten prospective observational studies, one retrospective observational study, and one cross-sectional study included patients with a diagnosis of chronic obstructive pulmonary disease (COPD) or asthma and compared auscultation with spirometry for the detection of airway obstruction23,24,25,26,27,28,29,30,31,32,33,34. The results listed in Table 2 show a poor diagnostic accuracy of auscultation for OLD. Table 3 shows that for the diagnosis of COPD, abnormal, decreased or absent breath sounds have an LR+ of 4.3, with five available studies, and wheezes have an LR+ of 3.6, with ten available studies (see also Supplementary Figure 4).

Pneumonia

Nine prospective observational studies and two case-control studies included patients with acute respiratory symptoms or with suspected pneumonia and compared auscultation with CXR for the detection of pneumonia35,36,37,38,39,40,41,42,43,44,45. Table 2 shows a low diagnostic accuracy of auscultation for pneumonia in these patients. Supplementary Figure 5 demonstrates a higher sensitivity for the combination of different breath sounds (decreased breath sounds, crackles, rhonchi) found by Ebrahimazedeh et al., followed by crackles as a single breath sound (see Table 3)36.

Meta-regression

Sensitivities

In univariate analyses sensitivities were found to be associated with diagnosis-group (P < 0.001), index test used (P < 0.001), percentage male (P = 0.041) and department (P < 0.001), but not with average age of study sample (P = 0.72).

With regard to diagnosis group, sensitivities were significantly higher for HPT compared to OLD (P < 0.001) and pneumonia (P = 0.002). No other pairs of diagnosis groups were found to differ significantly in terms of sensitivity.

With regard to index test used, sensitivities were significantly higher for absent, decreased or unequal breath sounds compared to wheezes (P < 0.001) and rhonchi (P = 0.003). Sensitivities for crackles were significantly higher compared to wheezes (P < 0.001) and rhonchi (P = 0.004). No difference was found between rhonchi and wheezes (P = 1.000) or between absent, decreased or unequal breath sounds and crackles (P = 1.000). With regard to departments, sensitivities were higher for the Intensive Care Unit (ICU) compared to mixed patients from the Emergency Department (ED) and wards (P = 0.042) or General Practice (GP), wards or ED only (P < 0.001 for all three). No differences were found in terms of sensitivity between ED, wards and GP. Sensitivity increased by 0.5% (95% CI: 0.0–0.9%) with each additional percent of males included in the study.

In a multivariate analysis including all five candidate predictors, diagnosis group no longer reached significance (P = 0.051). Index test used (P < 0.001), percentage male (P = 0.005) and department (P < 0.001) remained significantly associated with sensitivity. Sensitivities were not found to be associated with average age of study sample (P = 0.47).

Specificities

In univariate analyses specificities were found to be associated with diagnosis group (P < 0.001), index test used (P = 0.013), department (P < 0.001) and average age of study sample (P = 0.001), but not with percentage male (P = 0.88).

With regard to diagnosis group, specificities were significantly higher for HPT compared to CHF (P < 0.001) and pneumonia (P = 0.001). No other pairs of diagnosis groups were found to differ significantly in terms of specificity.

With regard to index test used, specificities were significantly higher for absent, decreased or unequal breath sounds compared to wheezes (P = 0.028). No other pairs of index tests were found to differ significantly in terms of specificity. With regard to departments, specificities were significantly higher for ICU compared to ED. No other differences were found. Specificity decreased by 0.6% (95% CI: 0.3–1.0%) for each year increase in average age.

In a multivariate analysis for specificity including all five candidate predictors, only diagnosis group remained significant (P = 0.036). Specificities were not found to be associated with average age of study sample (P = 0.89), index test used (P = 0.88), percentage male (P = 0.17) or department (P = 0.22). Post-hoc tests using Bonferroni correction revealed no pairs of diagnosis groups that differed significantly in terms of their specificity.
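An overall effect of diagnosis group can be significant while no individual pair survives a Bonferroni correction, because each raw P value is multiplied by the number of comparisons. The Python sketch below illustrates this for the six pairwise comparisons among the four diagnosis groups; the raw P values are hypothetical and the helper `bonferroni` is an illustration, not the procedure output from Stata.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each raw P value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}


groups = ["CHF", "HPT", "OLD", "pneumonia"]
pairs = list(combinations(groups, 2))  # 6 pairwise comparisons

# Hypothetical raw P values, for illustration only.
raw = dict(zip(pairs, [0.012, 0.30, 0.45, 0.020, 0.015, 0.80]))

adjusted = bonferroni(raw)
significant = [pair for pair, p in adjusted.items() if p < 0.05]

for pair, p in adjusted.items():
    print(f"{pair[0]} vs {pair[1]}: adjusted P = {p:.3f}")
print("pairs significant after correction:", significant)
```

With these assumed inputs, three raw P values fall below 0.05, yet none remains below 0.05 after multiplication by six. The cap at 1 also shows how an adjusted value of P = 1.000 can arise.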

Risk of bias and applicability concerns

Table 4 summarizes the risk of bias and applicability assessment of included studies. Supplementary Appendix D shows the complete risk of bias and applicability assessment following the QUADAS-2 guidelines. Overall, the risk of bias for most studies was considered high. Risk of bias was considered low when physicians were provided with some clinical data, which was assumed to reflect a normal clinical situation. Almost all studies matched the review question, resulting in low applicability concerns. The reasons for high risk of bias most often encountered were: a highly selected group of patients; no consecutive selection of patients; no description of how selection was performed; and patients who were potentially incorrectly excluded from the analysis. Many studies did not clearly describe whether the physicians performing auscultation were blinded to the reference test. The studies concerning patients with suspected HPT and pneumonia did not use thoracic CT or final diagnosis by the treating physician, which can be considered the gold standard, but CXR as the reference standard, giving a high risk of bias for the reference standard.

Table 4 QUADAS-2: risk of bias and applicability assessment of included studies.

Discussion

The main findings of this meta-analysis evaluating the diagnostic accuracy of lung auscultation in adult patients with acute respiratory pathology are a low sensitivity and an acceptable specificity of lung auscultation for the different pulmonary conditions studied, with an overall pooled sensitivity of 37% (95% CI: 30–47%) and specificity of 89% (95% CI: 85–92%). LRs and AUCs of auscultation for CHF, OLD and pneumonia are low. An exception is the presence of abnormal or decreased breath sounds in trauma patients, which is highly accurate for the detection of HPT. This is confirmed by the multivariate analysis for specificity, in which diagnosis group remained significant. Results of the meta-regression showed that the heterogeneity found could be explained by diagnosis group, index test used, and department. We must be aware that the high risk of bias and heterogeneity reduced the quality of evidence found in this meta-analysis.

Firstly, considering the results of this meta-analysis, auscultation cannot be considered clinically useful for making a diagnosis in most circumstances, based on the cut-offs by Tape, T.G. (see Supplementary Appendix C), although it is hard to determine a cut-off for a minimally accepted diagnostic accuracy. Secondly, its value depends on the prevalence of the disease, the clinical setting or context, and the competence of the physician performing the investigation. Therefore, the different outcomes found per department can be explained by the high prevalence of disease at the ICU compared to other wards, as found in the meta-regression, where sensitivities, and also specificities, were higher for patients at the ICU compared to mixed patients from ED and wards or GP, wards or ED only. Thirdly, next to accuracy, the efficacy of auscultation also depends on how it changes clinical behaviour, e.g. how it alters clinical diagnoses and treatment decisions. For example, consider auscultation for decompensated heart failure. Crackles on auscultation have a sensitivity of 51–75% and a specificity of 45–84%, carrying an LR+ of 1.8 and an LR− of 0.56. This limits their use in ruling decompensated heart failure in or out, because their presence or absence only marginally alters the provisional diagnosis. Although efficacy was not studied in this meta-analysis, the overall low sensitivity, LR+ and AUC suggest that lung auscultation must often be considered unfit as a screening tool and for confirming a diagnosis. Especially in patients with normal auscultation and without a high burden of disease, many diagnoses will go undetected, and therefore additional work-up needs to be performed. In addition, it has been shown that findings from abnormal auscultation alone are insufficient to establish a diagnosis, e.g. in pneumonia, and it is advised that a CXR should be performed when diagnostic certainty is required46. An exception can be made for trauma patients outside the hospital with suspected HPT, for whom probably no further diagnostic work-up is needed and a chest tube can be placed based on the auscultatory findings. In almost all other circumstances in which auscultation is performed, further work-up is still needed to confirm the exact diagnosis. Fourthly, another important finding of this meta-analysis is that, although particular breath sounds are more related to a specific pathologic condition, a certain breath sound can also be present in other pulmonary diseases, lowering the diagnostic accuracy in less selected groups of patients, where the likelihood of the target condition being present is much lower. For example, decreased breath sounds, which are highly specific for HPT in trauma patients, are also often found in patients with OLD or pneumonia. Fifthly, in daily practice the value of lung auscultation is further jeopardized by the experience and time of the physician performing auscultation, the subjectivity of perception and the difficulty in using standardized terminology to describe auditory findings8,47. As stated by Hirschtick, a “quick physical exam” is often performed by inexperienced fingers and is not worth much47. Lastly, a diagnostic tool can be considered obsolete when a more accurate diagnostic test is available, for example lung ultrasound, which is further described below48.
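The crackles example above (LR+ of 1.8, LR− of 0.56) can be worked through with the odds form of Bayes’ theorem to see how little the finding shifts the provisional diagnosis. The Python sketch below assumes a hypothetical pre-test probability of 30% for decompensated heart failure; this value and the function name `post_test_probability` are illustrative and are not taken from the included studies.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Update a probability with a likelihood ratio via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


PRE_TEST = 0.30  # hypothetical pre-test probability, for illustration only
LR_POSITIVE = 1.8  # crackles present
LR_NEGATIVE = 0.56  # crackles absent

with_crackles = post_test_probability(PRE_TEST, LR_POSITIVE)
without_crackles = post_test_probability(PRE_TEST, LR_NEGATIVE)

print(f"pre-test probability:      {PRE_TEST:.0%}")
print(f"crackles present (LR+):    {with_crackles:.0%}")
print(f"crackles absent (LR-):     {without_crackles:.0%}")
```

Under this assumption, the probability moves from 30% to roughly 44% when crackles are present and to roughly 19% when they are absent, so neither result would settle the diagnosis on its own.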

Considering the above, we must reconsider the use of the stethoscope in patient groups with a low prevalence of disease and in clinical situations where more advanced diagnostic modalities are available. Only in resource-limited areas, with a high prevalence of disease and in experienced hands, does the stethoscope have some clinical relevance.

Strengths and limitations

The strengths of this meta-analysis are that it is the first on this topic, that it used a highly sensitive search strategy, that it gives a complete overview of the diagnostic accuracy of lung auscultation in a wide range of clinical settings and in predefined subgroups, and that quality was assessed according to the QUADAS-2 guidelines, which is a validated and reliable instrument. When tested for, publication bias was considered less likely. To reduce publication bias, backward citations were searched.

This meta-analysis also has weaknesses. Although the search strategy was robust, it is still possible that not all studies were identified. Most included studies were considered to have some risk of bias. Limitations of the included studies were the wide range in the number of physicians who performed auscultation, in reference standards, and in clinical departments. Lastly, we changed the protocol during the conduct of the study to analyse and reduce heterogeneity.

Further implications

We are supporters of the history and physical exam and advocate the use of eyes, ears, nose and hands to study the patient’s condition. However, clinicians must be progressive, embrace new modalities and let go of less reliable methods. Segall et al. stated in 1963: “By the year 2016, electronic systems of collecting and analysing data about the cardiovascular system may render the stethoscope obsolete.”49 Next to newer stethoscopes, with computerized acoustic technology which can correlate lung sounds with disease states, lung ultrasonography (LUS) has been studied extensively and seems to fulfil the role of the new modality envisioned by Segall49,50. LUS, which should be seen as part of the physical examination, has many potential advantages over lung auscultation and CXR: its high accuracy; quick and easy performance and interpretation at the bedside; dynamic imaging; avoidance of radiation and contrast burden; evaluation of disease progress; and reduction of costs51. LUS turned out to be highly accurate for most diagnoses studied in this meta-analysis, with a sensitivity and specificity of more than 90%48,51,52,53,54,55,56. There is also evidence showing that LUS detects respiratory problems at an early stage and impacts clinical decision making54,57,58,59,60,61. Therefore, it has been suggested before that LUS should replace lung auscultation50,51,62. Some important changes have to take place before LUS can be further implemented in today’s practice. For example, more ultrasound devices have to be purchased and medical education has to shift its attention to ultrasonography62. Experts think these barriers to the implementation of LUS can be tackled relatively easily50; for example, costs are decreasing rapidly, and handheld ultrasound devices are available on the market for around 1500 euros (1670 US dollars).

Conclusion

This meta-analysis shows that in different patient populations with acute respiratory pathology, lung auscultation has a low sensitivity, LR+ and AUC and an acceptable specificity and LR−. The results underline that auscultation only marginally alters the provisional diagnosis, although results are limited by a high risk of bias and heterogeneity of included studies. Now, 200 years after the invention of the stethoscope, better diagnostic options are available, such as lung ultrasound. Therefore, when better diagnostic modalities are available, they should replace lung auscultation. Only in resource-limited settings, with a high prevalence of disease and in experienced hands, does lung auscultation still have a role.