Confounding variables can affect the results from studies of children with Down syndrome and their families. Traditional methods for addressing confounders are often limited, providing control for only a few confounding variables. This study introduces propensity score matching to control for multiple confounding variables. Using Tennessee birth data as an example, newborns with Down syndrome were compared with a group of typically developing infants on birthweight. Three approaches to matching on confounders (nonmatched, covariate matched, and propensity matched) were compared using 8 potential confounders. Fewer than half of the newborns with Down syndrome were matched using covariate matching, and the matched group differed from the unmatched newborns. Using propensity scores, 100% of newborns with Down syndrome could be matched to a group of comparison newborns, a decreased effect size was found on newborn birthweight, and group differences were not statistically significant.
Families of children with Down syndrome differ from both families of children with other disabilities and families of children without disabilities (Most, Fidler, Laforce-Booth, & Kelly, 2006). Many of these differences exist before the child's birth, and these pre-existing differences may affect child outcomes. For example, mothers of children with Down syndrome are more likely to be older (Hook, 1981; Urbano & Hodapp, 2007), with rates of Down syndrome significantly increased in mothers older than 35 years. Older mothers are at higher risk for having health-related complications, such as hypertension, and adverse birth outcomes, including preterm delivery and low birthweight (Cleary-Goldman et al., 2005).
Pre-existing group differences can confound, or confuse, the interpretation of study findings. When groups differ on a variable also related to the outcome of interest, it is difficult to interpret a finding of significant group differences. For example, maternal age confounds the relationship between Down syndrome and birthweight (Figure 1). Children with Down syndrome have lower average birthweight compared with children without Down syndrome (Cronk, 1978), and newborns of older mothers typically weigh less than newborns of younger mothers (Cleary-Goldman et al., 2005). In addition, older mothers are more likely to have a child with Down syndrome (Hook, 1981; Urbano & Hodapp, 2007). Because both maternal age and Down syndrome may contribute to lower birthweight, maternal age is a confounder in studies of the effect of Down syndrome on birthweight. If the confounder, maternal age, is not measured and accounted for, the effect of Down syndrome on birthweight will be overestimated.
When uncontrolled, confounder variables can bias study results by underestimating or overestimating the true relationship between two variables. Biased estimates, in either direction, can negatively affect research, practice, and policy. When effect sizes are underestimated, important factors related to developmental disabilities are not identified, and when effect sizes are overestimated, time, energy, and monetary resources might be devoted to factors that are not actually related to developmental disabilities.
Confounders are a known problem in Down syndrome research (Hodapp, Glidden, & Kaiser, 2005; Seltzer, Abbeduto, Krauss, Greenberg, & Swe, 2004; Stoneman, 2005), and methods, such as matching and covariate analysis, exist to control for confounders. Matching is a common strategy for selecting a group of comparison participants who are similar to the participants of interest on the potential confounders, especially when the group of interest is limited. Covariate analyses can adjust for the confounders statistically, but matching has several advantages (Rosenbaum & Rubin, 1983): Analysis of matched data is relatively easy, matching produces effect-size estimates with smaller variance than covariate adjustment, analyses on matched data are more robust, and matching can control for more confounders than covariate adjustment, for a given sample size.
Covariate matching, the systematic selection of participants to balance groups (e.g., with or without Down syndrome) on select variables, is one method for controlling confounders. Covariate matching methods have two major practical limitations: The number of confounders is restricted by the sample size, and participants must be available for both groups.
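A minimal sketch of exact covariate matching may make the procedure concrete. The function name `exact_match`, the variable names, and the toy records are hypothetical illustrations and are not study data; the sketch assumes one comparison participant per case and exact agreement on every listed confounder.

```python
from collections import defaultdict

def exact_match(cases, comparisons, keys):
    """Pair each case with one unused comparison record that has identical
    values on every key; cases with no identical record stay unmatched."""
    # Index the comparison records by their combination of confounder values.
    pool = defaultdict(list)
    for idx, rec in enumerate(comparisons):
        pool[tuple(rec[k] for k in keys)].append(idx)

    pairs, unmatched = [], []
    for case_idx, rec in enumerate(cases):
        candidates = pool.get(tuple(rec[k] for k in keys))
        if candidates:
            pairs.append((case_idx, candidates.pop()))  # use each comparison once
        else:
            unmatched.append(case_idx)
    return pairs, unmatched

# Hypothetical toy records (not study data).
keys = ["advanced_age", "married"]
cases = [
    {"advanced_age": 1, "married": 1},
    {"advanced_age": 1, "married": 0},
    {"advanced_age": 0, "married": 1},
]
comparisons = [
    {"advanced_age": 0, "married": 1},
    {"advanced_age": 0, "married": 0},
    {"advanced_age": 1, "married": 1},
]
pairs, unmatched = exact_match(cases, comparisons, keys)
match_rate = len(pairs) / len(cases)
```

In this toy example one of the three cases has no identical comparison record and would be dropped, which is the kind of sample loss that the limitations below describe.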
For covariate matching, the number of confounding variables must be relatively small. As the number of confounding variables increases, the sample size requirement increases exponentially. Each additional confounder adds a group to the study design. For example, in a study of Down syndrome with two groups (with and without Down syndrome) of 20 children each, adding advanced maternal age (yes/no) as a confounder adds an additional factor (2 × 2). With 20 children per group, the sample size is now doubled to 80 children (2 × 2 × 20). This is problematic because the number of potential confounders in Down syndrome research is likely to be large. In studies of children with Down syndrome and their families, potential confounders include both factors related to risk for Down syndrome (e.g., maternal age, maternal race) and factors that may be related to child or family outcome (e.g., marital status, birth order, child temperament, and sibling temperament [see Stoneman, 2005]). Adding a second confounder would increase the research design to three factors (2 × 2 × 2) and the sample size to 160. Thus, matching on more than two confounding variables quickly becomes prohibitive.
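The arithmetic in this example can be sketched as follows, assuming that every confounder is binary and that each cell of the design needs 20 children. The function name `required_sample_size` is a hypothetical helper introduced only for illustration.

```python
def required_sample_size(n_binary_confounders, per_cell=20, n_groups=2):
    """Total participants when each binary confounder doubles the number of design cells."""
    cells = n_groups * 2 ** n_binary_confounders
    return cells * per_cell

# Zero to three binary confounders in a two-group design with 20 children per cell.
sizes = [required_sample_size(k) for k in range(4)]
```

Under these assumptions the totals are 40, 80, 160, and 320 children, and eight binary confounders would require 10,240. Confounders with more than two levels would increase the totals further.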
A related issue is that potential participants must be available for both groups. As described above, as sample size requirements increase, it can be difficult to find sufficient numbers of study participants. This is especially true when the participant population, such as newborns with Down syndrome, is relatively small. Availability of potential participants can be further compounded by the distribution of the confounder across the two groups. For example, advanced maternal age is a good candidate for matching because maternal age is associated with both increased risk for Down syndrome and multiple outcomes, such as low birthweight. However, the proportion of mothers with advanced maternal age is different for newborns with and without Down syndrome. In Tennessee births from 2001, 23% of the children born with Down syndrome had mothers 40 years or older (Tennessee Birth Records, 2001), but for children without Down syndrome, only 1.5% had mothers 40 years or older. Thus, the population of possible newborns without Down syndrome who also have older mothers is substantially reduced, making it difficult to find matches.
Propensity scores provide a method for matching on multiple confounding variables, without the limitations of covariate matching methods. Propensity score methods were first introduced 25 years ago (Rosenbaum & Rubin, 1983) and have been used in several different fields, such as cardiology (Normand et al., 2001), psychology (Hill, Brooks-Gunn, & Waldfogel, 2003), and psychiatry (Sernyak, Desai, Stolar, & Rosenheck, 2001). However, to our knowledge, propensity scores have only been used once in developmental disabilities research (Witt, Kasper, & Riley, 2003) and have yet to be used in research on Down syndrome.
A propensity score represents a group of variables (confounders) with a single number (Rosenbaum & Rubin, 1983). Specifically, the propensity score is a number between zero and one that represents the predicted probability that a person is in a particular group, given the confounders. For example, using a group of potential confounders such as maternal age, education, and weight gain, a propensity score represents the predicted likelihood that a specific newborn has Down syndrome. A propensity score of .90 means that the newborn has a .90 predicted probability of having Down syndrome, given his/her scores on the specific group of confounding variables.
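In symbols, this definition can be written as below. The notation is an assumption introduced here for illustration: Z_i = 1 indicates that newborn i is in the group of interest (e.g., has Down syndrome), and x_i is that newborn's vector of observed confounders.

```latex
e(\mathbf{x}_i) = \Pr\left(Z_i = 1 \mid \mathbf{X}_i = \mathbf{x}_i\right),
\qquad 0 \le e(\mathbf{x}_i) \le 1
```

A propensity score of .90 therefore corresponds to e(x_i) = 0.90.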
Propensity scores are computed using logistic regression, a standard statistical method. In the logistic regression, the outcome variable is a binary variable representing group (e.g., with or without Down syndrome), and the predictor variables are the potential confounders. The logistic regression analysis produces an empirically derived formula (i.e., weighting of predictor variables) that best discriminates between the two groups. Applying this formula to each individual's values on all of the predictor values produces a predicted score. This predicted score is the individual's propensity score.
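The computation can be sketched in a few lines. This is a minimal illustration that fits the logistic regression by plain gradient ascent using only the Python standard library; it is not the SAS or R procedure used in the study, and the function names, learning rate, iteration count, and toy data are all assumptions.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit an intercept and weights by batch gradient ascent on the log-likelihood."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

def propensity_scores(X, w):
    """Apply the fitted formula to each individual's predictor values."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

# Hypothetical toy data: one standardized confounder (e.g., maternal age);
# y = 1 marks membership in the group of interest.
X = [[-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
w = fit_logistic(X, y)
scores = propensity_scores(X, w)
```

Each entry of `scores` is a predicted probability of group membership. In practice, a standard statistical package would be used to fit the model and to report diagnostics.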
Propensity matching uses propensity scores to find matched participants from a comparison group. There are many methods for creating matched pairs, such as nearest available, Mahalanobis metric, and caliper matching (D'Agostino, 1998; Rosenbaum & Rubin, 1985a). Successful matching requires both variability in the propensity scores within each group and overlap of propensity scores across groups (i.e., common support region). After the matched groups are created, standard statistical analyses are used to test for group differences on the outcome variables (see Blackford, 2007, for a detailed description of propensity score methods).
Propensity matching has multiple strengths. First, because propensity matching requires only the single propensity score to select a comparison group, the sample size limitations associated with covariate matching are avoided. Second, the independence of the propensity score from the outcome variable of interest makes it a good proxy for random assignment. Third, research shows that estimated propensity scores are better than the true propensity score at removing bias (Joffe & Rosenbaum, 1999). Last, propensity methods can be implemented using standard software packages.
Propensity matching also has several caveats and requirements. A major caveat is that propensity scores can only control for observed confounders (Joffe & Rosenbaum, 1999). Unobserved confounders likely still exist and may influence study findings. Computation of the propensity score using logistic regression generally requires 10 observations per confounder. Thus, the number of confounders that can be controlled is proportional to sample size, limiting the use of this approach for small samples. In addition, propensity scores can only be used for matching two groups at a time. Studies with three or more groups require separate propensity scores for each two-group comparison.
To determine the utility of propensity scores in Down syndrome research, we compared propensity score matching with no matching and covariate matching in a statewide sample of newborn births. We tested the hypothesis that newborns with Down syndrome have lower birthweight than a comparison group of newborns without Down syndrome. Matching success rates, statistical significance, and effect size were compared.
Participants included newborns born in Tennessee during the year 2001. Administrative databases of births and hospital discharges were provided by the state of Tennessee. Birth records from the year 2001 (Tennessee Birth Records, 2001) were linked to hospital discharge records from 2001 to 2003 (Tennessee Hospital Discharge Data System, 2001–2003). The birth record included demographic variables, birth variables, and identification of Down syndrome. Although data collected for administrative purposes, such as birth records, may contain errors (Iezzoni, 1997), descriptive demographic and birthweight data from Tennessee birth certificates have been reported to be reliable for both healthy births and those with adverse outcomes (Piper et al., 1993). However, birth records in that study showed low sensitivity for the identification of congenital anomalies. Birth records may underreport diagnoses, such as Down syndrome, for multiple reasons, including delays in genetic testing. To increase the accuracy of Down syndrome diagnosis in this study, the birth records were linked to hospital discharge records to identify additional children with Down syndrome who were not identified at birth (see Urbano & Hodapp for details on data linkage and Down syndrome identification).
In 2001, 84,398 infants were born in Tennessee. Of these newborns, 85 (0.10%) had Down syndrome; 1,150 (1.4%) had a congenital anomaly; and 83,163 did not have Down syndrome or other congenital anomalies. For this study, we compared newborns with Down syndrome with newborns without Down syndrome or other congenital anomalies. For these two groups, complete data for all analytic variables were available for 90% of the sample. For the purpose of this simulation study, only complete cases were included in the sample. The final analytic sample included 77 newborns with Down syndrome and 75,155 typically developing newborns without congenital anomalies.
We selected possible confounders based on putative contributions to Down syndrome or birthweight. Potentially confounding variables included maternal age (Hook, 1981), maternal education (Chapman et al., 2008; Urbano & Hodapp, 2007), birth order (Seidman et al., 1988), plural birth (Luke, 1994), maternal weight gain (Abrams & Selvin, 1995), maternal race (Canfield et al., 2006), marital status (Luo, Wilkins, & Kramer, 2004), and child gender (Seidman et al., 1988). Maternal age, maternal education, and maternal weight gain were continuous variables. Birth order (first/second/third, etc.), plural birth (yes/no), maternal race (White/non-White), marital status (married/not married), and child gender (male/female) were categorical.
Covariate matches were created by selecting newborns from the Down syndrome group and a comparison group with the same values on each of the confounding variables. For propensity matching, propensity scores were first computed for each newborn using logistic regression. Matches were made by finding the closest propensity score match for each newborn with Down syndrome from the comparison group of newborns (Parsons, 2001). This process was performed without replacement so that after a match was made, the comparison newborn was no longer available for other matches. This method is known as nearest neighbor, nearest available, or greedy matching (Cochran & Rubin, 1973; Rosenbaum & Rubin, 1985a).
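The nearest available procedure described above can be sketched as follows. This is a simplified illustration, not the matching routine cited in the study: the function name `greedy_match` and the toy scores are hypothetical, cases are processed in the order given, and ties are broken by the lower comparison index.

```python
def greedy_match(case_scores, comparison_scores):
    """Nearest-available matching on the propensity score, without replacement.

    Returns (case_index, comparison_index) pairs, processing cases in order."""
    available = set(range(len(comparison_scores)))
    pairs = []
    for i, score in enumerate(case_scores):
        if not available:
            break  # no comparison participants left to match
        # Closest remaining comparison score; ties go to the lower index.
        j = min(available, key=lambda k: (abs(comparison_scores[k] - score), k))
        pairs.append((i, j))
        available.remove(j)  # a matched comparison cannot be reused
    return pairs

# Hypothetical propensity scores (not study data).
case_scores = [0.80, 0.30, 0.55]
comparison_scores = [0.10, 0.28, 0.52, 0.60, 0.79, 0.33]
pairs = greedy_match(case_scores, comparison_scores)
```

Because each match removes a comparison participant from the pool, the order in which cases are processed can change the resulting pairs. This is a known property of greedy matching.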
Following propensity score matching, t tests were performed to test for balance between groups for each of the potentially confounding variables. Variables that remained unbalanced, based on a significant difference between the two groups, were included as covariates in subsequent analyses (Rosenbaum & Rubin, 1985a). Tests of group differences were unnecessary in the covariate-matching sample because each match was exact.
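A balance check of this kind can be sketched as below. The sketch computes a pooled-variance two-sample t statistic for each confounder. For simplicity it approximates the two-sided p value with the standard normal distribution, which is an assumption of this illustration and is reasonable only for large samples; a statistical package would use the t distribution. The function names and toy records are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def balance_t(group_a, group_b):
    """Pooled-variance two-sample t statistic, degrees of freedom,
    and a normal-approximation two-sided p value."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (mean(group_a) - mean(group_b)) / se
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx

def unbalanced_covariates(cases, comparisons, names, alpha=0.05):
    """List the confounders that still differ between the matched groups."""
    flagged = []
    for name in names:
        _, _, p = balance_t([r[name] for r in cases], [r[name] for r in comparisons])
        if p < alpha:
            flagged.append(name)
    return flagged

# Hypothetical matched groups (not study data): age is balanced, weight gain is not.
cases = [{"age": a, "gain": g} for a, g in zip([30, 32, 34, 36, 38], [10, 11, 12, 11, 10])]
comparisons = [{"age": a, "gain": g} for a, g in zip([31, 32, 34, 35, 38], [20, 21, 22, 21, 20])]
t_age, df_age, p_age = balance_t([r["age"] for r in cases], [r["age"] for r in comparisons])
flagged = unbalanced_covariates(cases, comparisons, ["age", "gain"])
```

Any confounder returned in `flagged` would then be carried forward as a covariate in the outcome analysis.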
The effect of matching method on newborn birthweight was tested by comparing the newborns with Down syndrome to three comparison groups: (a) a nonmatched group (75,155 typically developing newborns without congenital anomalies), (b) a covariate matched group, and (c) a propensity matched group. For each of the matching methods, an analysis of variance (ANOVA) was performed to test for group (Down syndrome/comparison) differences in birthweight. Repeated measures ANOVAs were performed to test for group differences in birthweight for the covariate matched and propensity matched groups because matched pairs represent dependent samples. Down syndrome and comparison group differences are reported with both effect size (d) and statistical significance (p < .05).
Effect sizes provide a standardized estimate of the magnitude of the relationship between Down syndrome and birthweight, regardless of sample size. For the two matched groups, we used a standard Cohen's d (Cohen, 1992) to measure the standardized difference in group means. However, for very large sample sizes, standard methods for computing effect size may be unsuitable because effect sizes become small and unrepresentative when the sample size gets very large. For the very large, nonmatched group, we used a bootstrap procedure (Efron, 1979) to compute effect size for repeated samples (n = 77, the same as the propensity matched group) from the full group. The bootstrap procedure provides a precise estimate of effect size comparable with the other matching methods.
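The effect size computation can be sketched as follows. The sketch assumes that Cohen's d uses the pooled standard deviation and that each bootstrap sample is drawn with replacement from the comparison group; the study's exact resampling scheme is not specified here. The function names, number of draws, seeds, and simulated birthweights are hypothetical.

```python
import math
import random
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference (a minus b) using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled_sd

def resampled_d(cases, comparison_pool, n_draws=500, seed=0):
    """Average Cohen's d over repeated samples of the comparison pool,
    each drawn with replacement and equal in size to the case group."""
    rng = random.Random(seed)
    k = len(cases)
    draws = [cohens_d(rng.choices(comparison_pool, k=k), cases) for _ in range(n_draws)]
    return mean(draws)

# Hypothetical simulated birthweights in grams (not study data).
sim = random.Random(1)
cases = [sim.gauss(3000, 500) for _ in range(77)]
comparison_pool = [sim.gauss(3300, 500) for _ in range(5000)]
d_hat = resampled_d(cases, comparison_pool)
```

Here a positive `d_hat` means that the comparison samples are heavier on average than the case group. Averaging over many samples of 77 puts the nonmatched estimate on the same footing as the matched groups.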
To determine the utility of propensity scores in Down syndrome research, we used matching success rate, statistical significance, and effect size to compare each of the three matching methods. All statistical analyses were performed using SAS (Version 9.1; Cary, NC) or R (Version 2.5.1; The R Foundation for Statistical Computing [http://www.r-project.org]). An alpha of .05 was used for all tests.
Newborns with Down syndrome and their mothers differed from the comparison newborns on six of the eight confounders (Table 1). There were significant group differences in maternal age, maternal weight gain, birth order, plurality, marital status, and newborn gender. Newborns with Down syndrome were more likely to be male, plural birth, and later born, and on average their mothers were older, gained less weight during pregnancy, and were more likely to be married.
Using covariate matching methods, only 48% of the newborns with Down syndrome had at least 1 match from the comparison group. Out of the 75,155 newborns in the comparison group, only 37 were exact matches to at least 1 infant in the Down syndrome group on all eight confounding variables. At that rate, at least 156,572 newborns would be needed in the comparison group to achieve 100% matching. Moreover, of the 37 newborns with matches, 41% only matched to 1 child in the comparison group. The high number of single matches indicated that the probability of finding an exact match on a moderate number of confounders was very low.
In contrast, with propensity-score matching, 100% of the newborns with Down syndrome had a match from the comparison group. The matched groups were balanced on all confounding variables (all ps > .20) except maternal race (p = .03), where mothers of newborns with Down syndrome were more likely to be White.
It is also noteworthy that for covariate matching, differences emerged when comparing the matched versus unmatched newborns with Down syndrome. Matched infants (those newborns for whom it was possible to find a match with the comparison group) differed from unmatched infants (those 40 for whom an appropriate match could not be found) on maternal age, t(75) = −4.02, p = .0001, and maternal weight gain, t(75) = 3.19, p = .002, with trends for plurality, χ²(1, N = 77) = 3.52, p = .06, and birth order, χ²(2, N = 77) = 4.98, p = .08 (see Table 2). Matching each newborn with Down syndrome, and deleting from the study infants for whom matches could not be found, produced a group that differed from the overall sample of newborns with Down syndrome on several confounding variables.
Statistical Significance and Effect Size
The presence of Down syndrome was significantly related to birthweight when the nonmatched and covariate matching methods were used. However, this effect was no longer significant (p = .16) when the propensity matching method was used to control for confounders. This difference in statistical significance was accompanied by differences in effect size across the three matching methods (Table 3). The effect sizes for the nonmatched (d = .60) and covariate-matched (d = 1.02) groups were larger than the effect size for the propensity matched group (d = .33).
This study introduced propensity scores as a method for controlling for multiple confounding variables in Down syndrome research. Most studies in developmental disabilities research have been able to control for a few confounders at most. When uncontrolled, confounders bias study results by underestimating or overestimating true effect sizes, which can negatively affect research, practice, and policy. When effect sizes are underestimated, researchers fail to identify important factors related to developmental disabilities, and when effect sizes are overestimated, they may devote time, energy, and monetary resources to the wrong factors.
This study compared three different methods—nonmatched, covariate matched, and propensity matched—to test the relationship between Down syndrome and birthweight. There were four major study findings: (a) pre-existing group differences, (b) low matching rates for covariate matching, (c) bias introduced by covariate matching, and (d) overestimation of effect size and statistical significance by nonmatching and covariate matching.
First, there were many pre-existing differences between the newborns with Down syndrome and the comparison newborns. Newborns with Down syndrome were more likely to be male, plural birth, and later born. The mothers of newborns with Down syndrome were older and gained less weight. The pre-existing maternal and newborn differences reported here are similar to differences found by others (Hook, 1981; Urbano & Hodapp, 2007). Prior studies have also reported significant differences in maternal race (Canfield et al., 2006) and education (Urbano & Hodapp, 2007). Although there were mean differences in maternal race and education in this sample, neither reached statistical significance.
Second, when attempting to match on eight confounding variables, there were substantial differences in the number of successful matches for covariate matching compared with propensity matching methods. With covariate matching, fewer than half of the newborns with Down syndrome could be matched to a newborn from the comparison group. In contrast, all of the Down syndrome group could be matched using propensity score matching. Covariate matching methods required eliminating over half of the newborns in the Down syndrome group, reducing the overall study sample size and negatively affecting statistical power.
Third, covariate matching methods introduced a systematic bias into the study. The newborns with Down syndrome who were successfully matched were not representative of the entire group of newborns with Down syndrome. That is, the matching process resulted in a biased sample of newborns with Down syndrome. Compared with the newborns who were not matched, those with matches were born to younger mothers who gained more weight during pregnancy. Maternal age and maternal weight gain showed pre-existing group differences and were included as potential confounders. The bias resulted in an overestimation of the effect of Down syndrome on birthweight. A study comparing propensity matching with categorical matching (Rosenbaum & Rubin, 1985b) in children with prenatal barbiturate exposure reported similar low matching success and bias. Only 57% of the children with prenatal barbiturate exposure were successfully matched, and the unmatched children were significantly different from the matched children. Thus, methods that leave some participants unmatched introduce strong bias due to systematic differences in the matched and unmatched groups.
Last, no matching and covariate matching methods overestimated the effect size of Down syndrome on birthweight and produced erroneously significant findings. That is, the two methods produced estimates of effect size that were larger than the true effect. The effect size estimated from the propensity score method was .33. Compared with the propensity score method, the effect sizes from the no matching and covariate matching methods were 80% and 276% larger, respectively. For the nonmatched group, the overestimation resulted from uncontrolled confounders. Covariate matching introduced systematic bias that inflated effect size. The overestimation of effect size was greatest for covariate matching, suggesting that, at least in this case, covariate matching was worse than no matching. The effect size differences were accompanied by differences in statistical significance. The group differences in birthweight found with the nonmatched and covariate matching methods lost statistical significance after the confounders were controlled with propensity matching.
This study has several limitations. First, the sample included only a single year of births in one southern state. Infants born in Tennessee, and their mothers, may not be representative of the United States or other countries. These likely demographic differences might affect matching results. Second, administrative databases may contain errors. Although we were able to improve accuracy in the identification of newborns with Down syndrome by record linkage with hospital data, we were not able to identify or correct other potential errors in the birth data. In addition, we required exact matches for the covariate matching method. Less stringent methods, such as matching on categories or ranges of values, may result in more successful matching and, therefore, less bias.
Nonetheless, this study demonstrates that propensity matching can control for many confounders in Down syndrome research, overcoming the limitations of covariate matching methods. With propensity matching, 100% of the newborns with Down syndrome were successfully matched, the confounding variables were controlled, and a less biased estimate of effect size resulted.
The study findings have implications for practitioners, policymakers, and researchers by illustrating the importance of controlling for confounders. Failure to identify and control for possible confounders will likely result in under- or overestimates of the true relationship. For example, in this study, Down syndrome was associated with lower birthweight in newborns. However, after removing the effects due to various maternal and newborn factors, there was no significant association between Down syndrome and birthweight. This suggests that Down syndrome may not uniquely contribute to birthweight but has an impact through another factor such as maternal age or weight gain. This finding could change both prenatal health care and future research in Down syndrome by focusing on factors related to prenatal or maternal weight gain. Practitioners, policymakers, and researchers should consider the role of confounders when evaluating the potential impact of research findings.
In addition, research findings in developmental disabilities research may be affected by the method selected to control for confounders. Covariate matching methods had multiple practical limitations but, most important, introduced serious bias into the estimation of the effect of Down syndrome on newborn birthweight. This finding highlights the importance of selecting matching methods, such as propensity matching, that neither restrict the final matched sample nor introduce bias.
Propensity matching was used in this study to identify a matched comparison group from a large, pre-existing, administrative dataset. However, propensity scores can also be used to control for confounders in other study types. Confounder control may be especially important for developmental disability research because there are often multiple potential confounders and small samples. For example, in a two-group study design, the propensity score could be included as a covariate, providing statistical control for a group of potential confounders. Representation of multiple confounder variables with a single score provides confounder control without sacrificing degrees of freedom or reducing statistical power. However, propensity methods are not applicable to all study designs (e.g., prospective assignment to treatment groups) because propensity scores can only be computed on complete datasets. Future studies should test for propensity score method advantages in other types of study designs.
This work was supported by funding from the National Institutes of Health, National Institute of Mental Health (K01-MH083052 to J.U.B.) and the National Institute of Child Health and Human Development Program Grant (P30 HD15052 to the Vanderbilt Kennedy Center for Research on Human Development). Birth (2001) and Hospital Discharge (2001–2003) data were provided to the Vanderbilt Kennedy Center under contract NC-06-00849-00 with the Tennessee Department of Health, Office of Policy, Planning and Assessment.
I thank the following for assistance with this manuscript: Robert Hodapp for significant guidance in manuscript preparation, Trent Rosenbloom for review and comments, and Richard Urbano for data linkage.
Editor-in-Charge: Susan Parish
Jennifer Urbano Blackford, PhD (E-mail: Jennifer.Blackford@Vanderbilt.edu), Assistant Professor of Psychiatry, Department of Psychiatry, Vanderbilt University Medical Center, Nashville, TN 37212.