Research review
Analysis of Variance: Is There a Difference in Means and What Does It Mean?

https://doi.org/10.1016/j.jss.2007.02.053

To critically evaluate the literature and to design valid studies, surgeons require an understanding of basic statistics. Despite the increasing complexity of reported statistical analyses in surgical journals and the decreasing use of inappropriate statistical methods, errors, such as those made in the comparison of multiple groups, still persist. This review introduces the statistical issues relating to multiple comparisons, describes the theoretical basis behind analysis of variance (ANOVA), discusses the essential differences between ANOVA and multiple t-tests, and provides an example of the computations and computer programming used in performing ANOVA.

Introduction

Suppose that a researcher performs an experiment to assess the effects of an antibiotic on interleukin-6 (IL-6) levels in a cecal ligation and puncture rat model. He randomizes 40 rats to one of four equally sized groups: placebo with sham laparotomy, antibiotic with sham laparotomy, placebo with cecal ligation and puncture, and antibiotic with cecal ligation and puncture. He measures IL-6 levels in all four groups and wishes to determine whether a difference exists between the levels in the control rats (placebo with sham laparotomy) and the other groups. He performs two-tailed Student’s t-tests on all of the possible pairwise comparisons and determines that there is a significant difference between the control rats and rats receiving placebo with cecal ligation and puncture (P = 0.049). Is this statistical analysis valid?
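
As an illustration of the approach described above (not of the correct analysis), the following minimal Python sketch runs all pairwise two-tailed t-tests on simulated IL-6 values; the group names, sample sizes, and numbers are hypothetical.

```python
# A minimal sketch of the analysis described above, using simulated
# (hypothetical) IL-6 values rather than real data.
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(0)
groups = {
    "placebo_sham": rng.normal(10, 3, 10),     # control rats
    "antibiotic_sham": rng.normal(10, 3, 10),
    "placebo_clp": rng.normal(14, 3, 10),      # cecal ligation and puncture
    "antibiotic_clp": rng.normal(12, 3, 10),
}

# All possible pairwise two-tailed t-tests -- the approach questioned above,
# because each test carries its own 5% Type 1 error risk.
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    t, p = stats.ttest_ind(a, b)
    print(f"{name_a} vs {name_b}: t = {t:.2f}, P = {p:.3f}")
```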

Just as methodological flaws in research design can influence the interpretation of trial results, failure to use appropriate statistical tests may result in inaccurate conclusions. Readers must be knowledgeable enough to recognize data analytic errors and to interpret the reported statistical findings. However, in a survey of 91 fifth-year surgery residents in 1987, 92% reported fewer than 5 hours of instruction in statistics during their residency [1]. In a more recent survey, reported in 2000, of 62 surgical residency programs, only 33% included education in statistics as a formal component of their curricula [2].

Given the growing impetus to practice evidence-based medicine, surgeons must be able to understand basic statistics to interpret the literature. Although descriptive statistics and t-tests are the most widely used statistical methods [3, 4, 5], researchers are employing increasingly sophisticated techniques for analyzing data. A review of trends in statistical techniques in surgical journals in 2003 compared to 1985 reported that statistical analyses have become more complicated with time [5]. In particular, the most significant changes were increases in the use of analysis of variance (ANOVA), nonparametric tests, and contingency table analyses. While the use of more advanced statistical methods may reflect increasing contributions of statisticians and epidemiologists to study design and interpretation, researchers must still be able to understand basic statistical concepts so as to choose the appropriate test. Additionally, surgeons must be able to judge the validity of the statistical methods and results reported in the literature both for research purposes and for clinical application.

Over the past several decades, not only have statistical analyses become more sophisticated, but the appropriate application of tests has improved as well. For example, in 2003, out of 187 randomly selected articles from surgical journals, 14 (7%) study authors incorrectly used t-tests instead of ANOVA for comparison of means for three or more groups [5]. In comparison, in 1985, 50 journal articles from the New England Journal of Medicine were analyzed, of which 27 (54%) used inappropriate statistical methods for comparison of multiple means [6]. Although advancements have been made in the statistics included in medical journals, errors still occur; inappropriate statistical analyses were identified in 27% of studies examined from 2003 surgical journals [5]. Therefore, readers must be able to recognize common errors and the appropriate methods for addressing them. The primary purpose of this paper is to address the problem of multiple comparisons and to discuss why, when, and how to use ANOVA. The intended audience for the main text is surgical researchers and clinicians; therefore, the concepts and applications of ANOVA are highlighted. For interested readers, the calculations for the main test statistic for a simple, one-way ANOVA are included (Appendix 1). A simulated example is also provided with calculations and basic computer programming (Appendix 2). The appendices provide concrete examples to reinforce the concepts presented in the paper and to increase readers’ confidence in using ANOVA. Lastly, definitions of the statistical terms used but not explained in the paper are included in a Glossary section.

ANOVA expands on the basic concepts used in performing a t-test. In a previous article in the Journal of Surgical Research, Livingston discussed the use of Student’s t-test to detect a statistical difference in means between two normally distributed populations [7]. The F-ratio or F-statistic, which is used in ANOVA, can also be used to compare the means of two groups, and yields equivalent results to the t-statistic in this situation. In fact, mathematically, when comparing only two groups, the F-ratio is equal to the square of the t-statistic. However, there are several key differences between the two tests. First, ANOVA can be used for comparing the means of more than two groups and is in fact more statistically powerful in this situation. Moreover, variants of ANOVA can include covariates, which allow one to control statistically for confounders and to detect interactions whereby one variable moderates the effects of another variable.
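
The equivalence between the two tests for two groups can be verified directly. The following sketch, using simulated data and SciPy, shows that the one-way ANOVA F-statistic equals the square of the two-sample t-statistic; the values are illustrative only.

```python
# Illustration (with simulated data) that, for two groups, the one-way
# ANOVA F-statistic equals the square of the two-sample t-statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 15)
b = rng.normal(0.5, 1.0, 15)

t, p_t = stats.ttest_ind(a, b)      # Student's t-test (equal variances)
f, p_f = stats.f_oneway(a, b)       # one-way ANOVA on the same two groups

print(round(t**2, 6), round(f, 6))  # identical up to rounding error
print(round(p_t, 6), round(p_f, 6)) # the p-values also agree
```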

t-Tests and F-tests differ essentially in the method of quantifying the variability around the group means. The t-statistic is calculated using the actual difference between means, while the F-statistic is calculated from the squared sums of the differences between means. This difference has implications for the probability distributions and the interpretation of the two test statistics. To better understand these differences, a discussion of the t- and F-families of probability distributions and of degrees of freedom is necessary. Degrees of freedom is a parameter, dependent upon sample size, that is used to calculate the probability distributions for certain statistical models. Degrees of freedom may be considered a measure of parsimony: it reflects the number of observations that remain free to vary after additional parameters have been estimated. In other words, as more parameters are estimated, fewer degrees of freedom are available.

The t-test is based upon the t-distribution, which is similar to a normal distribution (e.g., it resembles a bell-shaped curve whereby 95% of data points lie within two standard deviations and 99.7% lie within three standard deviations of the mean) except that the sample rather than the true population standard deviation is used [7]. The t-distribution approaches a normal distribution as the sample size, n, increases. A smaller sample size and fewer degrees of freedom (n − 1) result in heavier tails of the t-distribution, which contain a greater proportion of the data points. Thus, there is a family of t-distributions that depend upon the degrees of freedom. All members of the family of t-distributions are symmetric around zero, as depicted in Fig. 1A.

The probability density function, or the equation for generating the family of F-distributions, is also dependent upon the sample size, n. The total degrees of freedom for the F-distribution, as for the t-distribution, is n − 1. However, the total degrees of freedom is divided into the between-groups and within-groups degrees of freedom, both of which contribute to the probability distribution. Because the F-statistic is based on sums of squares, the F-distribution is always positive (Fig. 1B). The flatness and skewness of the distribution depend upon the between- and within-groups degrees of freedom. For more about the calculations of degrees of freedom for the F-ratio, refer to Appendix 1.
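
The dependence of both distributions on degrees of freedom can be explored numerically. The sketch below uses SciPy’s distribution functions with arbitrary degrees of freedom and is purely illustrative.

```python
# Sketch of how the t- and F-distributions depend on degrees of freedom,
# using scipy.stats quantile functions (degrees of freedom are illustrative).
from scipy import stats

# t-distribution: heavier tails (larger critical values) with fewer degrees of freedom
for df in (3, 10, 100):
    print("t critical (two-tailed, alpha = 0.05), df =", df, ":",
          round(stats.t.ppf(0.975, df), 3))

# F-distribution: shape depends on both between- and within-groups df;
# the distribution is defined only for non-negative values
for dfn, dfd in ((3, 20), (3, 100), (10, 20)):
    print("F critical (alpha = 0.05), df =", (dfn, dfd), ":",
          round(stats.f.ppf(0.95, dfn, dfd), 3))
```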

These differences in probability distributions result in two main distinctions between the t- and the F-tests. First, directionality of hypothesized statistical relations can be evaluated using a one-tailed t-test, which answers the question of whether the mean of one group is larger than the other. In contrast, the F-test cannot determine the direction of a difference, only that one exists. The reason is that for a t-test, the critical value, or the value at which the t-statistic is significant, can be either positive or negative (since the distribution is centered about zero). Therefore, the t-test can evaluate hypotheses at either tail. In contrast, the F-ratio is always a positive number. Second, t-tests are not additive; that is, multiple t-tests cannot be summed together to identify a difference between multiple groups. For example, if the t-statistic for a comparison between A and B is −3 and the t-statistic for a comparison between B and C is +3, then the t-statistic for a comparison between A and C is not 0; that is, one cannot conclude that there is no difference between A and C. On the other hand, the F-test can identify an overall difference between three or more means using a single test that compares all of the groups simultaneously; thus, the F-test is referred to as an omnibus test.

One important advantage of the F-test is that as an omnibus test, it maintains an appropriate familywise error rate in hypothesis testing. In contrast, multiple t-tests result in an increased probability of making at least one Type 1 error. The problem of multiple comparisons is important to recognize in the literature, especially since the increase in the error rate may be substantial. As an example, in an analysis of 40 studies in orthopedic surgery journals, 182 significant results were reported. However, after adjustment for multiple comparisons, only 59.3% of these remained statistically significant [8]. Therefore, the Type 1 error or false positive rate was much greater than the standard, predetermined rate of 5%.

The probability of at least one Type 1 error increases rapidly with the number of comparisons. The mathematical explanation for this increase is derived as follows: assuming an α equal to 0.05, the probability of not making a Type 1 error on a single comparison is 1 − α, or 0.95. However, if two independent comparisons are made, the probability of making no Type 1 error is no longer 0.95. Rather, that probability is (1 − α)^2, or 0.90, and the likelihood of at least one Type 1 error is 1 − 0.90, or 0.10. Therefore, the probability that at least one Type 1 error occurs when k comparisons are made is 1 − (1 − α)^k; if 10 comparisons are made, the Type 1 error rate increases to approximately 40%.

When all pairwise comparisons are made for n groups, the total number of possible combinations is n*(n − 1)/2. However, some pairwise comparisons may not be biologically plausible and other pairwise comparisons may be related to each other. Therefore, the true overall Type 1 error rate is unknown. Nonetheless, the take-home message is that the false-positive error rate can far exceed the accepted rate of 0.05 when multiple comparisons are performed.
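
These two quantities can be computed directly. The short sketch below (plain Python, illustrative values only) reproduces the arithmetic described above.

```python
# Direct computation of the quantities described above: the number of
# pairwise comparisons among n groups and the familywise Type 1 error rate
# after k independent comparisons at alpha = 0.05.
alpha = 0.05

def n_pairwise(n: int) -> int:
    return n * (n - 1) // 2

def familywise_error(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k

for n in (3, 4, 5):
    k = n_pairwise(n)
    print(f"{n} groups -> {k} pairwise comparisons, "
          f"familywise error = {familywise_error(k, alpha):.2f}")

print(f"10 comparisons -> familywise error = {familywise_error(10, alpha):.2f}")  # ~0.40
```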

Different statistical methods may be used to correct for inflated Type 1 error rates associated with multiple comparisons. One such method is the Bonferroni correction, which resets the significance threshold to α/k, where k represents the number of comparisons made. For example, if 10 hypotheses are tested, then only results with a P-value of less than 0.05/10, or 0.005, would be considered statistically significant. The Bonferroni correction therefore results in fewer statistically significant results. However, the resultant trade-off for minimizing the likelihood of a Type 1 error is a potential inflation of the Type 2 error rate. Another statistical method to minimize the number of comparisons performed is to use an omnibus test, such as the F-ratio in ANOVA, thereby diminishing the Type 1 error rate.
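
As an illustration, the following sketch applies a Bonferroni correction to a set of hypothetical P-values using statsmodels; the values themselves are invented for the example.

```python
# A small sketch of the Bonferroni correction applied to hypothetical p-values.
from statsmodels.stats.multitest import multipletests

p_values = [0.049, 0.020, 0.004, 0.300, 0.041, 0.008]  # e.g., six pairwise tests

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

for p, p_adj, r in zip(p_values, p_adjusted, reject):
    print(f"raw P = {p:.3f}, Bonferroni-adjusted P = {p_adj:.3f}, significant: {r}")
```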

In the initial example, the total number of pairwise comparisons that can be made between four groups of rats is 4*(4 − 1)/2, or six. Therefore, the probability of at least one Type 1 error is 1 − (1 − 0.05)^6, or 0.26, which is considerably higher than the predetermined level for rejecting the null hypothesis of 0.05. Using a Bonferroni correction, the adjusted significance threshold would be 0.05/6, or 0.008, for each comparison. Therefore, a P value of 0.049 would not be considered statistically significant. Rather than having to perform six separate pairwise comparisons, ANOVA would have identified whether any significant difference in means existed using a single test. An F-ratio less than the critical value would have precluded further unnecessary testing.

ANOVA was developed by Sir Ronald A. Fisher and introduced in 1925. Although termed analysis of variance, ANOVA aims to identify whether a significant difference exists between the means of two or more groups. The question that ANOVA answers is: are all of the group means the same? Or is the variance between the group means greater than would be expected by chance? For example, consider the data in Table 1 representing 23 observations distributed among four groups. Expressed in words, the null hypothesis in ANOVA is that the means of all four groups are equivalent; that is, the means for each column are equal. Expressed as an equation, the null hypothesis is μ1 = μ2 = μ3 = μ4, where μj represents the mean of the jth group. The alternative hypothesis is then that the means of the four groups are not all equivalent; expressed as an equation, μj ≠ μj′ for at least one pair of groups j and j′. Although not intuitive, testing of the null hypothesis is accomplished by examining the total variance as an aggregated measure of all mean differences; the total variance is then partitioned into the variance due to the factors of interest (the independent variables) and the variance due to random error. In other words, the variation among observations within each column is compared to the variation among observations between columns.
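
Because Table 1 is not reproduced here, the following sketch tests the same null hypothesis on simulated data (four groups totaling 23 observations; the allocation across groups is assumed) using SciPy’s one-way ANOVA.

```python
# One-way ANOVA testing the null hypothesis that all group means are equal.
# The four groups below contain simulated values purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group1 = rng.normal(10, 2, 6)
group2 = rng.normal(10, 2, 6)
group3 = rng.normal(13, 2, 6)
group4 = rng.normal(11, 2, 5)   # 23 observations in total, as in Table 1

f_ratio, p_value = stats.f_oneway(group1, group2, group3, group4)
print(f"F = {f_ratio:.2f}, P = {p_value:.4f}")
# A small P-value supports rejecting the null hypothesis that mu1 = mu2 = mu3 = mu4.
```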

Figure 2 illustrates pictorially the comparison of four group means. In one scenario, the group means are different (Fig. 2A), which would result in a statistically significant F-statistic. In the second scenario, the group means are equivalent (Fig. 2B); a non-significant F-statistic would then preclude further statistical testing.

As with the t-test, the F-test is used when the outcome of interest is a continuous variable; the outcome is designated the dependent variable in an ANOVA. The variable postulated to explain or predict the outcome in ANOVA is referred to as the independent variable or factor; that is, the variable responsible for the group classification is the independent variable. Other explanatory or predictor variables (covariates) can be included in the analysis, which is referred to as ANCOVA or analysis of covariance. MANOVA, or multivariate analysis of variance, allows analysis of multiple dependent variables. MANCOVA, or multivariate analysis of covariance, is similar to ANCOVA but includes more than one dependent variable. All of these variants belong to the same family of statistical models called general linear models, which also includes linear regression models and Student’s t-test. Ultimately, researchers should become familiar with all of these techniques so as to be able to choose the model with the best fit.

Take the example at the start of this paper where IL-6 levels are being compared in rats receiving placebo or antibiotic and sham laparotomy or cecal ligation and puncture. The null hypothesis for the experiment is that the mean IL-6 levels of the groups are the same. The alternate hypothesis is that there is a difference in mean levels between at least two of the groups, presumably due to the antibiotic and/or the operation. The independent variables are the antibiotic and the operation, and the dependent variable is the IL-6 level. The null hypothesis is tested by apportioning the total variance into systematic variance and error variance, or more specifically, variance due to differences resulting from the interventions being tested (variance between groups, or systematic variance) and random variation within groups, which is due to chance (variance within groups, or error variance). If the null hypothesis is rejected and the alternate hypothesis supported, then the researcher concludes that there is a difference in the levels of IL-6 between at least two of the groups that is due to the antibiotic, the operation, or the combination of the two.
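
A minimal sketch of this 2 × 2 analysis, using simulated IL-6 values and statsmodels, is shown below; the variable names drug, operation, and il6, and the simulated effects, are chosen purely for illustration.

```python
# Sketch of a 2 x 2 factorial ANOVA for the IL-6 example, using simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)
rows = []
for drug in ("placebo", "antibiotic"):
    for operation in ("sham", "clp"):
        effect = 5.0 if operation == "clp" else 0.0                    # hypothetical CLP effect
        effect -= 2.0 if (drug == "antibiotic" and operation == "clp") else 0.0
        for _ in range(10):                                            # 10 rats per group
            rows.append({"drug": drug, "operation": operation,
                         "il6": rng.normal(10 + effect, 2)})
df = pd.DataFrame(rows)

# Main effects of drug and operation plus their interaction
model = ols("il6 ~ C(drug) * C(operation)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```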

Comparison of systematic and error variance is accomplished in ANOVA with the F-test. The F-ratio or F-statistic is the value obtained from the ratio of the variance between groups to the variance within groups. The F-test determines the significance of the F-ratio by comparing it to a critical value derived from the probability distribution (e.g., the value along the F-distribution above which 5% of the area under the curve lies, corresponding to P < 0.05). If the F-ratio is greater than the critical value, then the F-test supports rejection of the null hypothesis. The critical value is never less than 1 because an F-ratio of 1 indicates that the variance between groups is the same as the variance within groups (which is assumed to be due to chance). Therefore, an F-ratio of 1 or less represents no significant difference between groups. The larger the F-ratio, the more of the variation in the outcome is explained by differences in the independent variable. Because the F-test is an omnibus test, if the F-test is statistically significant, then there is at least one significant difference in means (see Appendix 1 for more detailed calculations of the F-ratio). Post-hoc tests can then be used to perform specific comparisons to discover the origin(s) of the difference.
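
The comparison of an F-ratio to its critical value can be made explicit with SciPy; the F-ratio and degrees of freedom in the sketch below are hypothetical.

```python
# Comparing an F-ratio to its critical value, as described above.
from scipy import stats

f_ratio = 4.2
df_between, df_within = 3, 19          # e.g., 4 groups, 23 observations

critical_value = stats.f.ppf(0.95, df_between, df_within)   # alpha = 0.05
p_value = stats.f.sf(f_ratio, df_between, df_within)        # area to the right of the F-ratio

print(f"critical F = {critical_value:.2f}, P = {p_value:.4f}")
# If f_ratio > critical_value (equivalently, P < 0.05), reject the null hypothesis.
```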

In describing ANOVA, there are several important conventions based on the number of factors and levels being analyzed. The term factor describes the independent variable by which the groups are determined. The number of subgroups defined by each factor is referred to as the number of levels of the factor. A one-way ANOVA refers to a single factor analysis; that is, a one-way ANOVA tests for a difference in outcome between two or more levels of a single independent variable or factor. For example, a researcher studying the effects of three different dosages (levels) of an experimental drug would use a one-way ANOVA. A factorial ANOVA is used for two or more factors or independent variables; thus, a 2-way ANOVA compares two independent variables as in the initial example, e.g., the effects of different medications and different operations on IL-6 levels. A 2 × 2 ANOVA is a two-way factorial ANOVA, which is used to compare two levels of one independent variable and two levels of a second independent variable. The IL-6 example is a 2 × 2 ANOVA comparing rats receiving one of two levels of medication (placebo versus antibiotic) and one of two levels of operation (sham laparotomy versus cecal ligation and puncture).

A fixed effects ANOVA is used when inferences are being made only about the specific levels of the factor included in the study whereas random effects ANOVA is used when inferences are being made about the levels of the factor not included in the study. A fixed effects model assumes random allocation of the level of a factor, but not random sampling. Therefore, the results of the trial would only be applicable to the specific levels studied and not to all levels possible. On the other hand, a random effects model assumes random sampling of the levels assigned to the factor of interest and the results can be generalized to the population. As an example, in an experiment evaluating the effect of a drug on enzyme levels, a fixed effects model might specify three different dosages to be tested, 1 mg, 5 mg, and 10 mg. Results would then only be applicable to those drug dosages. No conclusions could be made about enzyme levels at a dosage of 20 mg. A random effects model would randomly select the dosages to be evaluated and, therefore, the results would be generalizable to the drug at all dosages, even dosages not specifically studied.

There are three assumptions that must be satisfied to apply ANOVA. The first assumption is that of normality; the outcome variable should be normally distributed within each group. This assumption can be evaluated by examining a histogram of the observations, which should resemble a bell-shaped curve, or by using formal tests such as the Kolmogorov-Smirnov test or the Shapiro-Wilk test (see Appendix 2 for an example of how to test the assumption of normality using a computer program). However, the F-test is relatively resistant, or robust, to violations of this assumption; that is, the Type 1 error rate does not appear to be greatly affected by skewed populations, particularly if the group distribution is balanced. The statistical power of the test also does not appear to be substantially affected by violation of this assumption, although it may be diminished with smaller group sizes. Alternative tests are available when the data are skewed, such as the Kruskal-Wallis non-parametric procedure.
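
As an illustration, the following sketch (simulated data only) checks normality within each group with the Shapiro-Wilk test and then applies the Kruskal-Wallis procedure as the non-parametric alternative.

```python
# Checking the normality assumption and a non-parametric fallback on simulated groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups = [rng.normal(10, 2, 8), rng.normal(12, 2, 8), rng.lognormal(2.3, 0.5, 8)]

for i, g in enumerate(groups, start=1):
    w, p = stats.shapiro(g)            # Shapiro-Wilk test of normality
    print(f"group {i}: Shapiro-Wilk P = {p:.3f}")

# If the data are clearly skewed, the Kruskal-Wallis test is an alternative
h, p = stats.kruskal(*groups)
print(f"Kruskal-Wallis H = {h:.2f}, P = {p:.3f}")
```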

The second assumption is that the variance in each group is the same (homogeneity of variance), which can be assessed using the Levene test (see Appendix 2 for an example). The F-test is also fairly robust to violations of the assumption of homogeneity of variance. Balanced designs, in which sample sizes are equal across groups, make the F-test less sensitive to violations of this assumption. In unbalanced designs, however, error rates are more likely to be inflated; for example, when the smallest group has the largest variance or the largest group has the smallest variance, error rates will be increased. Welch’s and O’Brien’s ANOVA are alternative approaches for analyzing data that violate the homogeneity assumption.
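
A brief sketch of the Levene test on simulated groups is shown below; the closing comment notes the Welch-type alternative without prescribing a particular implementation.

```python
# Assessing homogeneity of variance with the Levene test (simulated groups).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
groups = [rng.normal(10, 1, 12), rng.normal(10, 1, 12), rng.normal(10, 4, 6)]

stat, p = stats.levene(*groups)
print(f"Levene statistic = {stat:.2f}, P = {p:.3f}")
# A small P-value suggests unequal variances; a Welch-type ANOVA (available in
# several statistical packages) would then be a reasonable alternative.
```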

The third assumption is that the observations are independent; that is, the observations are not correlated with or related to each other. This requirement is often addressed during study design. For serial observations within subjects, repeated measures ANOVA can be used as long as the subjects are independent of each other. For example, a study assessing the same outcome measure in the same subjects at different time points should be analyzed using repeated measures ANOVA.
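
A minimal sketch of a repeated measures ANOVA on simulated serial measurements, using statsmodels’ AnovaRM, follows; the column names and time points are illustrative assumptions.

```python
# Sketch of a repeated measures ANOVA for serial measurements within subjects.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(6)
records = []
for subject in range(1, 11):                 # 10 independent subjects
    baseline = rng.normal(10, 2)
    for t, time in enumerate(("day1", "day3", "day7")):
        records.append({"subject": subject, "time": time,
                        "il6": baseline + 1.5 * t + rng.normal(0, 1)})
df = pd.DataFrame(records)

result = AnovaRM(data=df, depvar="il6", subject="subject", within=["time"]).fit()
print(result)
```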

If the F-ratio is significant, indicating that a difference between means exists, then post-hoc analyses can be performed to uncover the source of the significance or, in other words, to determine which specific means are different. A significant F-test may occur unexpectedly, in which case specific comparisons between factor levels, or contrasts, may be conducted. These contrasts are referred to as post-hoc comparisons. The appropriate post-hoc analysis is dependent upon the number and type of comparisons planned. If specific comparisons are planned or hypothesized up front, then these contrasts are referred to as a priori comparisons.

In the example from the beginning of the paper, suppose that preliminary experiments were conducted with the same antibiotic, but at a lower dosage—rats received cecal ligation and puncture and either no treatment or the antibiotic at the lower dosage. Suppose that there was no difference in IL-6 levels between the two groups. Now suppose that the dosage used in this experiment is significantly higher than previously tested. Additionally, suppose that a criticism of the prior experiment was the lack of a control group (sham laparotomy). Therefore, the experiment as originally described above, with four groups, was conducted. If the F-test were significant, then the researcher would wish to explore whether the source of the difference(s) detected was due to the inclusion of the control group or due to the higher dosage of antibiotics or both. Different strategies for performing these contrasts are described below.

Tukey’s HSD (Honestly Significant Difference) procedure allows the comparison of all pairs of means. When used with equal sample sizes, the familywise error rate is exactly equal to α, which is usually set at 0.05. However, when used with unequal sample sizes (also referred to here as the Tukey-Kramer procedure), the procedure yields a conservative estimate of the chance of a Type 1 error. The Tukey procedure also allows for the derivation of confidence intervals about the mean difference.
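
The following sketch applies Tukey’s HSD to simulated data using statsmodels; group labels and values are illustrative only.

```python
# Sketch of Tukey's HSD on simulated data; prints pairwise mean differences
# with adjusted significance decisions and confidence intervals.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(7)
values = np.concatenate([rng.normal(10, 2, 10),
                         rng.normal(10, 2, 10),
                         rng.normal(14, 2, 10)])
labels = ["A"] * 10 + ["B"] * 10 + ["C"] * 10

result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result.summary())
```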

Scheffe’s procedure differs from Tukey’s in that it allows for comparisons of all types, not just pairwise. Scheffe’s procedure is the most conservative of all of the post-hoc analyses, meaning that the critical F-test value for significance is the largest and that the familywise error rate is minimized in the setting of the largest number of possible comparisons. Therefore, if only pairwise comparisons are planned, Tukey’s procedure should be used because it will result in narrower confidence limits. Nonetheless, if the F-test using Scheffe’s procedure is statistically significant, then at least one contrast out of all possible contrasts is statistically significant. The likelihood of a Type 1 error for Scheffe’s test is exactly α regardless of whether the sample sizes are equal.

For a limited number of planned comparisons, Bonferroni’s procedure can be used. This procedure is superior to Tukey’s if the number of contrasts of interest is equal to or less than the number of factor levels. However, if all pairwise comparisons are of interest, then Tukey’s is superior and will result in smaller confidence intervals.

Another post-hoc analysis is the Newman-Keuls procedure, which ranks groups according to their means and then takes into account the number of steps between the groups in calculating the critical value for significance. Duncan’s procedure is similar to the Newman-Keuls test but is less conservative; therefore, Duncan’s test is more likely to declare a difference significant, particularly when larger numbers of groups are compared.

There are other post-hoc analyses that can be performed. Dunnett’s test is used for comparison of groups with a control such that for n groups, there are n − 1 comparisons. Hsu’s multiple comparisons with the best (MCB) test is used for comparison of the group with the highest mean versus each of the other groups. The appropriate post-hoc analysis therefore depends upon whether the comparisons were planned or unplanned and the number and type of comparisons. However, they all address the problem of multiple comparisons and thus minimize the Type 1 error rate.
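
For Dunnett’s comparisons against a control, recent SciPy releases provide a direct implementation; the sketch below uses simulated data and is illustrative only.

```python
# Sketch of Dunnett's comparison of several treatment groups against a control,
# using scipy.stats.dunnett (available in recent SciPy releases); data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
control = rng.normal(10, 2, 10)
treat1 = rng.normal(12, 2, 10)
treat2 = rng.normal(10.5, 2, 10)
treat3 = rng.normal(14, 2, 10)

result = stats.dunnett(treat1, treat2, treat3, control=control)
print(result.pvalue)   # one adjusted p-value per treatment-vs-control comparison
```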

In a previous paper in the Journal of Surgical Research, Livingston discussed sample size calculations for analyses using the Student’s t-test [9]. For a t-test, the determinants of sample size include the magnitude of the hypothesized effect or effect size, standard deviation, and probabilities of Type 1 and 2 errors. The calculations for sample size for ANOVA are more complicated and beyond the scope of this paper. However, the basic principles are similar. First, the researcher must determine the hypothesized effect size based on a number of factors including the proposed difference between means, the within group standard deviation, and the number of groups being compared. The sample size is then based on the proposed distribution of means if the null hypothesis is rejected and the alternate hypothesis is supported. Based on this distribution and the desired α and β, the sample size can be calculated using either a statistical program or standardized table. Similarly, the power can be calculated based on the probability of obtaining the critical F-value given the adjusted F-distribution if the null hypothesis were to be rejected.
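
As an illustration, the following sketch uses statsmodels’ power calculator for the one-way ANOVA F-test; the effect size and other inputs are hypothetical.

```python
# Sketch of a sample size calculation for a one-way ANOVA using statsmodels.
# The effect size (Cohen's f) and other inputs below are hypothetical.
from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()
n_total = analysis.solve_power(effect_size=0.40,   # hypothesized Cohen's f
                               k_groups=4,         # number of groups compared
                               alpha=0.05,         # Type 1 error rate
                               power=0.80)         # 1 - Type 2 error rate
print(round(n_total))   # required total number of observations across all groups
```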

Conclusions

In summary, the appropriate use and interpretation of statistical tests is necessary to evaluate scientific data. While there is significant overlap between different statistical analyses, depending upon the research question and design, there are advantages and disadvantages to each. ANOVA is an appropriate test for evaluating the effect of categorical independent variables on a continuous outcome variable. ANOVA minimizes the inflation of a Type 1 error due to multiple comparisons, reduces

Acknowledgments

This article was supported by an NIH grant (5K23RR20020-2) (PI: L.S.K.).

The authors acknowledge Dr. Virginia Moyer and Dr. Robert Lasky for their critical reading of the manuscript.

Glossary

Assumption:
a criterion that must be met by the data for a statistical test to be valid.
Balanced design:
study design whereby all groups are equally sized.
Comparisonwise error rate:
the probability of making a Type 1 error for a single comparison, in contrast to the Familywise error rate.
Confounder:
also known as confounding variable; an additional variable that is related to both the predictor variable and the outcome of interest, causing the predictor and outcome falsely to appear related; for

