We reviewed a Mental Retardation article by Conroy et al. (2003) on consumer outcomes following the closure of the Hissom Center in Oklahoma. In this article the authors misconstrued their sample of 254 subjects, implying it was representative of the Hissom Focus Class while failing to account for 128 subjects included in an earlier analysis. We found the research to be seriously compromised by data collection problems and by discrepancies between the reported findings and those obtained when the analyses were replicated. Problems ranged from ones that seriously compromise the findings (such as the sample) to errors in basic data management, transcription, and analysis that, taken together, undermine the integrity of the findings, lead to inappropriate interpretations, and give rise to misleading conclusions that do not follow from the data.
If people with developmental disabilities are to benefit from scientific findings, then the empirical data constituting such findings need to be carefully analyzed and presented. Science is difficult to do well. Data need to be collected carefully, cleaned thoroughly, analyzed appropriately, and interpreted correctly; errors at any step may undermine the accuracy of conclusions. A peer-reviewed article published in the August 2003 issue of Mental Retardation raises concerns about scientific merit because of data collection and analysis errors and subsequent interpretation problems. In that article, Conroy, Spreat, Yuskauskas, and Elks (2003) evaluated the deinstitutionalization outcomes for a group of 254 individuals in Oklahoma who had been living at the Hissom Memorial Center in 1990 but had transitioned to community-based supported living by 1995. Our concerns were initially prompted by the important differences between the published article and a previously unpublished report (Conroy, 1996), which was one of a series of studies on this population.
Conroy (1996) submitted the original report to the Oklahoma Department of Human Services, under a contract with the Center for Outcomes Analysis, as well as to a review panel of the United States District Court. The 1996 report described the research sample, methodology, and findings and appeared to serve as a model for the 2003 Mental Retardation article. Upon inspection, we found four types of problems in the peer-reviewed article (Conroy et al., 2003): (a) problems with the sample, (b) problems in data analysis, (c) measurement problems, and (d) problems in interpretation. We identified discrepancies between the published article and the earlier report and found that the sample size did not match the known size of the Hissom Focus Class as described by Conroy (1996). Some of the methodological problems are critical and represent serious flaws that call major findings into question, whereas others, though sometimes serious, are due to calculation or analytical errors or represent departures from preferred analytic practice. To clarify these problems, we conducted a critical review of Conroy et al. (2003), reanalyzed much of the original data, and performed several secondary analyses.
To accomplish this review, in September 2003, we requested the data used to produce this article from the authors under the Data Sharing Policy that governs the peer-reviewed journals of the American Association on Mental Retardation (AAMR). In December 2003, prior to receiving the data, we also requested permission for access to the data from the Oklahoma Department of Human Services (DHS), which was granted on December 18, 2003, provided that individual identifiers were removed. In January of 2005, the authors provided the dataset used for the analyses in Conroy et al. (2003) on a disk. In this critique, we refer to the report submitted to the Oklahoma DHS and a U.S. District Court Review Panel (Conroy, 1996) as the 1996 Report and the published article (Conroy et al., 2003) as the 2003 article or by using standard citation format (i.e., Conroy et al., 2003).
Problems With the Sample
Conroy et al. (2003) presented what appears to be a study of outcomes associated with the relocation of persons with developmental disabilities from a large, publicly funded Intermediate Care Facility/Mental Retardation (ICF/MR) to residences of 3 or fewer persons. They employed a longitudinal design through which they followed individuals who were relocated to the community, seemingly to link various outcomes to the deinstitutionalization process. Specifically, these individuals were rated on a battery of instruments prior to leaving the Hissom Memorial Center (the 1990 data) and then 5 years later while residing in supported living settings (the 1995 data). Unfortunately, changes in the outcomes cannot be unambiguously attributed to deinstitutionalization because the sample was highly selected: Individuals in the Hissom Focus Class not enrolled in supported living were excluded. Without consideration of the outcomes for the Hissom Focus Class Members who did not move to supported living, the study is only a retrospective look at individuals in a particular type of community setting rather than a study of the outcomes of deinstitutionalization in general, because people with poorer outcomes were systematically excluded. In that sense, the study is not what it appears to be—a study of the effects of institutional closure on individuals with developmental disabilities. Unfortunately, the authors repeatedly used the term Hissom Focus Class Members when referring to findings from a sample drawn from that group, a sample that was demonstrably not representative of the larger Hissom Focus Class.
Although it was appropriate for Conroy et al. (2003) to study only those who moved from the Hissom Center to supported living arrangements, that is not how they presented the article. The first paragraph of the Discussion implies that the findings apply generally to the entire Hissom Focus Class: “There can be little argument that the lives of the Focus Class Members improved over the 5-year period of study” (p. 271). Throughout the Results and Discussion sections, the authors provided findings and interpretations referring to Focus Class Members as if they applied to the entire Hissom Focus Class. In one place, they referred to “this study of the closure of Hissom Memorial Center …” (p. 272, italics added), clearly indicating that the intent was for the findings to document the effects of deinstitutionalization. Unfortunately, because of the restricted and likely biased sample, these findings do not apply to the entire Hissom Focus Class.
The Hissom Focus Class, the referenced population, included 520 individuals who were living on the grounds of the facility after May 2, 1985 (Conroy et al., 2003, p. 265). A number of Focus Class Members were lost to even the earliest rounds of follow-up as far back as 1990; for example, Conroy (1996) recognized the loss of 84 individuals, noting that data had been collected on only 436 of the 520 Focus Class Members: “The other 84 people identified as Focus Class Members were not visited in 1990” (Conroy, 1996, p. 7). In the 1995 follow-up, 427 individuals were contacted, of whom only 382 had complete pre–post comparison data (1990 and 1995) available for the 1996 Report. As Conroy stated in that report:
There were 382 people who were visited in both 1990 and 1995 … to see changes in independent functioning, I will compare 1990 to 1995 scores on the adaptive behavior scale, and there will be 382 people with complete data. (p. 8)
In their 2003 article, Conroy et al. also presented these 1990–1995 comparisons, albeit in a subsample that was different from the one used in the 1996 Report. Although the original size of the Hissom Focus Class (520) is mentioned in the introductory paragraphs of the 2003 article, there is no description of why there were results for only 254 members (i.e., a loss of an additional 128 individuals from the 382 for whom complete data were available in the 1996 Report). Thus, in the 2003 article, readers were often given the impression that the 254 subjects comprised the entire population that was studied (i.e., the Hissom Focus Class Members). Although the group of individuals who moved from the Hissom Memorial Center to supported living in the community appears to be the targeted subset, no information is given as to whether the sample (i.e., N = 254) is the entire subset, is representative of this targeted subset, or is representative of the larger Hissom Focus Class (as we will show, it is not).
Adding to the confusion, Conroy et al. (2003) stated in the Discussion section that all members of the Hissom Focus Class were moved to “community-based homes typically housing two or three people” (p. 271). In fact, the 1996 Report (p. 9) shows that of the original Hissom Focus Class Members lost to follow-up, 18 went to private ICFs/MR, 17 went to group homes larger than 7 beds, 9 went to group homes of 4 to 6 beds, 1 went to another public institution, the whereabouts of 8 were unknown, and 6 went to “other” placements. Furthermore, none of the eventual placements of the 128 additional individuals missing from the 2003 article (i.e., of the 382 included in the 1996 Report) are known.
In the end, Conroy et al. (2003) included only 49% of the original 520 Hissom Focus Class Members, and only 67% of the 382 individuals on whom data were collected in both 1990 and 1995. Despite these problems and the concerns any researchers would have in following this large and complex population (cf. Freeland & Carney, 1992), Conroy et al. (2003) did not discuss subject loss or the extent to which the sample used was representative of the Hissom Focus Class Members. Therefore, it is difficult to gain an understanding of who constituted their sample.
In fact, subject loss resulted in systematic differences between the sample used by Conroy et al. (2003) and the original 520 members of the Hissom Focus Class. For example, there were differences in the measured level of mental retardation. According to the 2003 article, in the larger Hissom Focus Class of 520 individuals, “73% were labeled as having severe or profound mental retardation” (p. 273), whereas the sample in the 1996 Report included 67.9% so labeled. In contrast, in the data provided to us by Conroy et al., which they used to produce the 2003 study, there were 242 individuals for whom level of mental retardation was known; of these, 91% were classified as having either severe (n = 48) or profound (n = 173) mental retardation. Unfortunately, in the Method section of the 2003 article, the authors incorrectly reported this percentage as 80% (p. 265). It is almost certain that a swing of 18 percentage points (from the Hissom Focus Class in general) or 23 percentage points (from the sample used in the 1996 Report) in the number of people with severe or profound mental retardation would have given rise to differences in outcomes and should have been analyzed and shared with readers. Similar discrepancies occurred for other variables as well; for example, there were 3.3 percentage points more males (and 3.3 percentage points fewer females) in the sample of 254 in Conroy et al. (2003) than in the 1996 Report.
Why do such discrepancies represent a danger to the interpretation of the findings in the Conroy et al. (2003) article? The answer is that by selecting only individuals in supported living and excluding those with poorer outcomes, the authors misrepresented the Hissom Focus Class. In the end, Conroy et al. could draw the conclusion, on some measures perhaps, that individuals in supported living are better off than they were in the institution, but they cannot say the same for the Focus Class as a whole. Some of these individuals may very well have moved into other ICF/MR settings, larger community residences, or nursing homes or they may have died. As a result, findings reported in the 2003 article are simply descriptions of the characteristics of individuals who, once deinstitutionalized, remained in supported living. Based on these facts alone, no general conclusions can be made about the prospective effects of deinstitutionalization of the Hissom Focus Class on personal outcomes. As such, the authors did not evaluate outcomes of deinstitutionalization in a general sense. As we show in the remainder of this review, however, even the modest goal of describing the discharge effects of the selected sample is undermined by many additional problems.
Problems in Data Analysis and Statistics
In our review of Conroy et al. (2003), we identified technical problems and errors according to the categories of outcome variables presented by the authors.
We found that the overall adaptive behavior scale scores for both 1990 and 1995 as reported in the article (i.e., 1990: M = 41.5, SD = 28.8; 1995: M = 47.3, SD = 29.5) did not match the values produced by the dataset the authors provided (1990: M = 34.9, SD = 24.2; 1995: M = 41.2, SD = 25.7). We further note that the 1990 and 1995 means for overall adaptive behavior scores reported in the 2003 article are nearly the same as the means reported in the 1996 Report (to within two tenths of a point in both cases: 1990 = 41.3 and 1995 = 47.5). Because the sample in the 2003 article was a subset of the one in Conroy (1996) and included 23% more individuals with severe or profound levels of mental retardation, we do not understand how the overall adaptive behavior means could be nearly identical from one report to the other.
The inclusion of a factor analysis for adaptive behavior in Conroy et al. (2003) raised the possibility of clarifying the data; unfortunately, the factor analysis also contained a number of errors. The analysis resolved 31 adaptive behavior items (although the authors said that there were 32 such items) into three Adaptive Behavior factors. One item (“walking and running”) was inexplicably dropped from the analysis, even though it was the only item for which a significant decline was reported in the 1996 Report. The three-factor solution included a Self-Care factor that reportedly accounted for the most variance (52.2%). The remaining factors accounted for less variance: an Academic factor accounted for 9.1% of the variance and a Socialization factor, 4.5%. The authors conducted an overall repeated measures analysis of variance on the three factors that was statistically significant and led to post hoc comparisons. The factor explaining the most variance, Self-Care, was not significantly different between 1990 and 1995, whereas the two remaining factors were both reported to be statistically significant, with the Academic factor scores showing an increase between 1990 and 1995 and the Socialization factor scores showing a significant decline over the same period. We were surprised that the Self-Care factor, which accounted for over 50% of the variance, was not reported to be statistically significant, whereas two lesser factors were.
To determine the reasons for this finding, we inspected the data and discovered that the number of individuals who increased on these factors exceeded the number whose scores stayed the same or decreased; we also found errors in the algorithms used to calculate all three factor scores. For example, in addition to other problems, we found that the algorithm Conroy et al. (2003) used to calculate the Academic factor score for 1990 left out an entire variable—use of sentences. We re-calculated the factor scores based on the factor loadings presented in Table 1 of Conroy et al. (2003) and found that the descriptive statistics we derived did not match the means and SDs they provided (in Table 2 of the original article, p. 268). We have reproduced the factor score means and SDs from Conroy et al., showing the discrepancies between the published article and our calculations using the dataset. In only two cases did the figures originally reported match those we calculated from the data (see Table 1).
When we re-ran the comparisons described by Conroy et al. (2003), we obtained a significant overall repeated measures analysis of variance, as did Conroy et al., but we found in post hoc comparisons that the Self-Care factor was, indeed, significant, t(253) = 9.86, p < .001, with Self-Care factor scores increasing about 5 points on average (see Table 1). We also found that the Socialization factor, which Conroy et al. found to significantly decrease, actually increased 1.9 points on average, t(253) = 5.78, p < .001. Thus, all three factors were significant, and all three increased from 1990 to 1995. We do not know why Conroy et al. did not find the Self-Care factor to be significantly different or why they found a decrease in the Socialization factor score. For the Academic factor, we used a paired t test and found that the increase between 1990 and 1995 was significant, t(253) = 6.9, p < .001, although the actual magnitude of the average factor score difference was only 1.8 points (see Table 1), not the 8.5 points reported in the 2003 article, which suggested far more change in the factor than actually occurred.
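The paired comparisons we re-ran are straightforward to reproduce. The following is a minimal sketch (in Python, using small hypothetical numbers rather than the Hissom dataset) of the paired-samples t statistic underlying these replications:

```python
import statistics

def paired_t(before, after):
    """Paired-samples t: mean of the pairwise differences divided by
    the standard error of those differences; returns (t, df)."""
    diffs = [b - a for a, b in zip(before, after)]
    n = len(diffs)
    se = statistics.stdev(diffs) / n ** 0.5
    return statistics.mean(diffs) / se, n - 1

# Hypothetical illustration (not the actual factor scores)
t, df = paired_t([1, 2, 3, 4], [2, 3, 4, 6])
print(round(t, 2), df)  # 5.0 3
```

With a large sample such as N = 254, even small mean differences produce large t values, which is why effect sizes (discussed below) matter alongside significance.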
Because two of the factors increased by less than 2 points, we looked at the factor score differences to try to understand what these findings meant. The fact that such modest differences achieved statistical significance is, in part, related to the large sample; therefore, we wondered what clinical or practical significance these small differences might actually mean. For example, the average Academic factor score increase of 1.8 was quite small relative to the overall factor score distribution, which ranged between 0 and 33. A mean factor score difference of 1.8, for any individual, could reflect rating changes from “does not use money” to “uses money with some help” or from “no understanding of numbers” to “counts two objects.” A similar interpretation is possible for the Socialization factor score, which increased, on average, 1.9 points for scores ranging between 0 and 31. Such a finding for an individual, for example, could have arisen from the difference between “will not pay attention to an activity for 5 minutes” to “will pay attention to an activity for 10 minutes” or from “interacts with others in limited ways” to “interacts with others for longer than 5 minutes.”
Conroy et al. (2003) explained or interpreted findings little beyond simply stating statistical significance levels, what Rosnow and Rosenthal (1989) called the “habit of defining the results of research in terms of significance levels alone” (p. 1276). To understand the extent of the adaptive behavior findings better, we calculated Cohen's (1988) d statistic for effect size for each of the factor scores. Cohen's d is a standardized mean difference, with values of d around .20 taken to be small effects; those around .50, moderate effects; and those of .80 or greater, large effects (Cohen, 1988). In Conroy et al. (2003), all three factor score differences reflect small effects (i.e., for both Self-Care and Academics, d = .25, and for Socialization, d = .28), showing that the combined effects of the movement to supported living plus time (i.e., 5 years) resulted in increases of only about a quarter of an SD in each case.
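As a concrete illustration of this computation, here is a minimal Python sketch of Cohen's d for a pre–post comparison, computed as the mean difference over an equal-n pooled SD; the numbers are made up for illustration, not taken from the study data:

```python
import statistics

def cohens_d(pre, post):
    """Cohen's d: standardized mean difference using an equal-n pooled SD.
    By convention, d near .20 is small, .50 moderate, .80+ large."""
    mean_diff = statistics.mean(post) - statistics.mean(pre)
    pooled_sd = ((statistics.stdev(pre) ** 2
                  + statistics.stdev(post) ** 2) / 2) ** 0.5
    return mean_diff / pooled_sd

# Hypothetical scores: a uniform 1-point gain on a spread-out scale
pre = [10, 12, 14, 16, 18]
post = [x + 1 for x in pre]
print(round(cohens_d(pre, post), 2))  # 0.32 -- a small effect
```

The example shows how a gain that is real but modest relative to the spread of scores yields a d of about a quarter of an SD, the magnitude observed for all three factors.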
Because these effects are small, they may well represent increased opportunities rather than, or in addition to, improved behavioral functioning. Increased opportunities represent something positive for the individual and may represent real improvements to quality of life and, therefore, are legitimate outcomes. However, Conroy et al. (2003) should have been careful not to confuse improved adaptive behavior scores due to better opportunities (e.g., a phone is nearby) with actual behavior change due to improved functional abilities (e.g., a person learns to look up numbers and place a call). On the basis of the data they presented, Conroy et al. could not adequately distinguish between these opportunity effects and real functional behavior change. Although the possibility of such effects is mentioned by the authors in the Discussion, there was little or no examination of them in juxtaposition to skill change. In fact, they seemed to reject opportunity effects in the Discussion, writing that the sample “evidenced increased skills” (p. 271, emphasis added) and later that the data “suggest that community placement was responsible for at least the observed changes in adaptive behavior” (p. 272, emphasis added). In our view, both of these statements appear to ignore opportunity effects and, strictly speaking, go beyond the data presented.
Finally, we are concerned that the authors did not include either discussion or analyses of the ranges of factor scores or their distributions. Understanding changes in adaptive behavior is facilitated if one goes beyond the overall significance tests to an examination of the distributions of these variables (Feingold, 1995). For example, the high rates of zero, or very low, scores are striking. In 1990, 177 individuals (69.7% of the entire sample) had factor scores under 5 on the Academic factor, of which 93 (36.6%) had a score of 0; in 1995, 146 individuals (57.5%) had scores under 5, and 47 individuals (18.5%) had a score of 0. Given the nature of the sample (i.e., 91% with severe and profound disabilities), it is not surprising to find scores clustering in the lower half of the highly skewed distribution (skewness = 1.5 in 1995) for the Academics factor in both 1990 and 1995. For example, in 1995, although the Academic factor scores ranged from 0 to 34 (M = 6.2, SD = 7.4), there were only 30 individuals (11.8%) with scores higher than the middle of the score range. Also, in 1990, 72% of the sample had an Academics factor score of 5 or less; in 1995, 63% had scores in this range, with many individuals scoring 0 (93 in 1990 and 47 in 1995).
Similar, but less extreme, was the extent of low or 0 scores for Self-Care, with 20% of the 1990 sample scoring under 5 (n = 51) and 35 individuals (13.8%) scoring under 5 in 1995, whereas for Socialization factor scores, there were relatively few individuals with scores under 5. Furthermore, recall that factor scores are made up of a number of adaptive behavior items: 15 items for Self-Help, 9 items for Academics, and 7 items for Socialization (see Conroy et al., 2003, Table 1) that are scored on Likert-type scales, with values ranging between 0 and 6. Thus, for example, a factor score of 0, when composed of up to 15 item scores, is peculiar indeed and calls the sensitivity of the measures into question. In a sample comprised mostly of individuals with severe or profound disabilities, we believe that many of the items simply may not have applied to most participants. For example, we wonder what it actually means (as an outcome) to have had a 1990 total of 0 or 1 across all 9 Academic items combined and then, 5 years later, a total of 2 or 3 (recall that the average increase in the Academic factor score was 1.8).
In stark contrast to the distribution of Academic scores, the 1995 Self-Help factor scores ranged between 0 and 60 (M = 28.8, SD = 19.9, skewness = .033). There were fewer scores of 0 in both years (13 in 1990 and 8 in 1995), with the scores distributed more evenly (e.g., with 63% of the population falling below the middle of the score range and 37% above it), suggesting that changes (due either to opportunity effects or real behavior change) were more evenly distributed across the sample.
In our view, Conroy et al. (2003) did not provide the necessary information on scores, score ranges, and distributions to adequately interpret their findings. Further, we think these score range issues may also point to critical flaws in the data collection. It is clear that data were collected from different informants in 1990 and 1995 by different interviewers, resulting in data that likely suffered from both reliability and validity problems; and we have already noted possible sensitivity problems in the measures given the nature of the sample. Although Conroy et al. did acknowledge potential measurement problems, these concerns were generally dismissed by reference to reliability studies.
Because the differences in the Adaptive Behavior factor scores are modest, and because of the outright calculation errors and potential measurement problems, we believe that caution should be exercised in interpreting these factor scores. Overall, in our re-analysis of the adaptive behavior measures in Conroy et al. (2003), we failed to replicate the findings as reported and found, using data supplied by these authors, rather different results for the three Adaptive Behavior factors. We found statistically significant increases in all three factors between 1990 and 1995, although the actual differences were small and may be of dubious real-world significance given the range of the scales and their distributions, potential problems in measurement, and the impact of opportunity effects.
Conroy et al. (2003) employed a 16-item behavioral rating scale that was reported to assess both frequency and severity of challenging behaviors. In reporting the findings on challenging behavior, Conroy et al. noted that the 1990–1995 differences were generally modest; nonetheless, data and scoring problems also threaten the validity of these findings. The dataset furnished to us included two overall challenging behavior scores, frequency and severity, and the raw data for frequency on 3 of the 16 challenging behaviors (rebellious behavior, runs away, and unresponsive to activities), all of which were reported as significantly improved in 1995. We did not receive raw data for the frequency of the remaining behaviors or for severity of any challenging behavior variables. We calculated effect sizes (Cohen's d) for all of the 1990–1995 challenging behavior comparisons Conroy et al. presented (i.e., for overall frequency, overall severity, and frequency for the three specific behaviors mentioned). We found that three of the five effect sizes met Cohen's criterion for small effects: overall frequency, d = .20; severity, d = .22; and frequency of rebellious behavior, d = .23; the effect sizes for runs away, d = .34, and unresponsive to activities, d = .35, fell midway between small (.20) and medium (.50) effects in Cohen's scheme. Regardless, there is a question of whether such parametric statistics are appropriate at all given the skewed nature of these data; for example, in both 1990 and 1995, the distributions for both frequency and severity had skewness coefficients below −1.3, and in all of the distributions, more than 20% of the scores were 100 (i.e., no behavior problems).
Of greater concern were inexplicable problems in the actual scoring of the challenging behavior items. We came to understand that the 16 challenging behavior items were scored on a 3-point scale for frequency and a 4-point scale for severity. Unfortunately, in their Method section, Conroy et al. (2003) did not provide any information on how the frequency and severity scores were derived. In the earlier report, however, Conroy (1996), using the same instrument, reported that severity was scored on “4-point scales and increases in scores are favorable—higher scores mean better, less problematic, behavior” (p. 32). The three frequency items for which we received raw data were scored on 3-point scales as follows: more than five times per week in the past month (0), five times a week or less in the past month (1), and not observed in the past 4 weeks (2), again resulting in higher scores denoting less frequent challenging behavior. Given these scoring rules, we are at a loss to understand how 16 challenging behavior items scored on 3- or 4-point scales (either frequency or severity) can range from 0 to 100. The maximum total on the frequency items (each scored 0, 1, or 2) could only be 32 (16 items each receiving the maximum score of 2), and an individual with a severity rating of 4 (i.e., no problem) on all 16 challenging behavior items could only achieve a total score of 64. Nonetheless, in the data provided to us, more than 95% of the overall challenging behavior scores for both frequency and severity, in both 1990 and 1995, exceed these maximum possible scores.
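The ceiling arithmetic can be checked directly from the item coding described above (frequency coded 0–2; severity with a top rating of 4). This short Python sketch computes the largest totals a simple sum of the 16 items could produce:

```python
# Maximum possible simple-sum totals for the 16-item challenging
# behavior scale, given the item coding described in the text.
n_items = 16
max_frequency_rating = 2  # 3-point frequency scale coded 0, 1, 2
max_severity_rating = 4   # 4-point severity scale; 4 = no problem

max_frequency_total = n_items * max_frequency_rating
max_severity_total = n_items * max_severity_rating
print(max_frequency_total, max_severity_total)  # 32 64

# Overall scores approaching 100 therefore cannot be simple sums
# of the item ratings.
```

Any overall frequency or severity score above these ceilings must have been produced by some transformation or weighting scheme that the article does not describe.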
Although it is possible that some weighting scheme was used to produce the overall frequency and severity scores, none is presented in the Method section of Conroy et al. (2003). Nor is any such scoring scheme offered for these scales in Fullerton, Douglass, and Dodder (1999), who reported on their reliability. In fact, in the Method section, Conroy et al. cited a reference (Conroy, Efthimiou, & Lemanowicz, 1982) whose authors specifically described their preference for simple summed item scores. Summing across 16 items on a 3- or 4-point rating scale cannot produce scale scores over 32 and 64, respectively, and certainly not scores approaching 100 as reported in the 2003 article.
Finally, because of the non-normal distributions of these variables, it is instructive to note that the reported significant changes in challenging behavior were produced by half or fewer of the individuals in the sample. In general, the frequency data showed that more than 20% of the 254 subjects had no challenging behaviors whatsoever in either 1990 or 1995. Therefore, we examined subgroups of individuals who improved (1995 score higher than 1990 score), stayed the same (1995 score equal to 1990 score), or exhibited more challenging behaviors (1995 score lower than 1990 score). For overall frequency and severity of challenging behavior, 48% and 44%, respectively, stayed the same or exhibited more challenging behavior in 1995 than in 1990. For the three individual frequency variables (rebellious behavior, runs away, and unresponsive to activities), these same figures were, respectively, 73%, 90%, and 78%. Thus, the significance tests showing decreased challenging behavior in 1995 are driven by about half of the sample for overall frequency and severity and by about a quarter of the sample or less for the three individual items.
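The subgroup tabulation just described is simple to reproduce. Here is a minimal Python sketch (with hypothetical scores, where higher values denote fewer challenging behaviors) of the improved/same/worse classification:

```python
def classify_change(scores_1990, scores_1995):
    """Count who improved (1995 > 1990), stayed the same, or got worse,
    given that higher scores denote less challenging behavior."""
    counts = {"improved": 0, "same": 0, "worse": 0}
    for before, after in zip(scores_1990, scores_1995):
        if after > before:
            counts["improved"] += 1
        elif after == before:
            counts["same"] += 1
        else:
            counts["worse"] += 1
    return counts

# Hypothetical data: a mean gain driven by only two of six people
y1990 = [100, 100, 80, 60, 50, 40]
y1995 = [100, 100, 80, 90, 70, 35]
print(classify_change(y1990, y1995))
# {'improved': 2, 'same': 3, 'worse': 1}
```

As the toy example shows, a sample mean can increase significantly even when most individuals stayed the same or got worse, which is why this tabulation matters for interpretation.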
Therefore, in our review of the data on challenging behavior, we were able to replicate the t tests and found that all of the effects are small effects based on Cohen's d; however, we are at a loss in understanding what Conroy et al. (2003) meant in light of the scoring problems. Because it is not known how the overall frequency and severity scores were derived from the 16 challenging behavior items on the instrument used, it is not clear how the overall challenging behavior findings can be interpreted.
Developmentally Oriented Therapy and Services
Conroy et al. (2003) presented statistics on nine service areas, such as homemaker services, occupational and physical therapy, and nursing services (reproduced here in Table 2, with additional information). The data collected for each service were the mean hours a person was estimated to have received that service during the month prior to data collection; for some of the variables presented, however, the data are not comparable. This is because in 1990, the Oklahoma data collectors “made 99 the maximum allowable value on the number of hours of each service per month. Any higher values were simply coded back to 99 hours” (Conroy, 1996, p. 50). These were also the data used for the 2003 article. Unfortunately, in our re-analysis we discovered that the 1995 data on developmentally oriented services used in Conroy et al. (2003) were not “coded back” in the same way as the 1990 data, resulting in invalid comparisons for 6 of the 9 service variables in which 1995 values exceeded 99. Therefore, significant effects could be an artifact of the coding scheme rather than actual change between 1990 and 1995.
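The coding rule at issue is simple enough to state in code. This Python sketch (with invented monthly hours, not the study data) shows how capping one year's values but not the other's invalidates the comparison:

```python
def code_back(hours, cap=99):
    """Apply the 1990 coding rule: values above the cap are recoded to the cap."""
    return [min(h, cap) for h in hours]

# Hypothetical monthly service hours as recorded in 1995 (uncapped)
raw_1995 = [10, 50, 150, 300, 744]
print(code_back(raw_1995))  # [10, 50, 99, 99, 99]

# Comparing capped 1990 values against uncapped 1995 values like these
# inflates the apparent 1995 means for any service exceeding 99 hours.
```

A valid 1990–1995 comparison would require applying the same cap to both years (or to neither), so that any observed difference reflects service change rather than the coding scheme.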
In addition, in 1995, the ranges for several of these service variables were quite large (Table 2), whereas in 1990, ranges did not exceed 99 hours because of the coding limit. Beyond limiting the ranges of the 1990 variables (e.g., for habilitation training, 87 cases, or 34.2%, were limited in this way), this discrepancy left the 1995 distributions highly skewed, driving up both the mean and SD. Although excessive ranges can be identified in 6 of the 9 developmentally oriented therapy service variables, there are still other problems with these data. For example, in two places, Conroy et al. (2003) stated that the “number of individuals” receiving a service constituted the data of interest (pp. 266, 270). Despite these statements, in Table 5 Conroy et al. did not report the number of persons; rather, the data were presented as estimated service hours.
Furthermore, in Table 2 here, two additional elements are immediately observable: (a) very large differences in mean hours of services per month between 1990 and 1995 in some of the variables (e.g., 36.9 hours vs. 187.2 hours for habilitation training) and (b) the extremely large SDs in comparison to the sizes of means (e.g., the SD for homemaker services in 1995 is more than six times as large as its corresponding mean; for nursing services, the SD is 4.5 times the size of its mean). Variability of this magnitude points to an extremely wide range of service intensity and, in these types of services, suggests that many of the individuals in the sample were actually not receiving services (i.e., that there would be many with 0 hours) in conjunction with perhaps a smaller number of individuals who received many hours of services (Table 2 also shows the number who received each service, that is, those with non-zero values in the dataset). Therefore, we expected that the distributions would be skewed, and, indeed, we found that all of the service area variables had skewed distributions. Skewness refers to the fact that the distribution of scores is not symmetric as in a normal distribution that has a skewness value of 0. Positive values denote an inordinately long right tail. The level of skew in these data is one reason that Conroy et al. (2003) should have considered nonparametric statistics.
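The pull that a long right tail exerts on the mean, while the median stays at zero, can be illustrated with a small synthetic example (the values below are invented for illustration and are not taken from the Oklahoma dataset):

```python
import numpy as np
from scipy.stats import skew

# Hypothetical monthly service hours for 254 people: most receive nothing,
# a few receive very many hours -- the zero-inflated pattern described above.
hours = np.array([0.0] * 180 + [10.0] * 40 + [50.0] * 24 + [400.0] * 10)

print(round(hours.mean(), 1))   # mean is pulled well above the median
print(float(np.median(hours)))  # median remains 0.0
print(round(skew(hours), 2))    # strongly positive skewness
```

With 70% of cases at zero, the mean (about 22 hours) is entirely a product of the small high-hours minority, which is why means and SDs alone misrepresent such distributions and nonparametric statistics are indicated.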
Although all of the service areas had skewed distributions, Conroy et al. (2003) singled out nursing services for additional analysis. As they stated: “Closer inspection of the nursing data revealed extreme skew in the data” (p. 270). Of course, as Table 2 here shows, it is not only nursing services that were skewed; in fact, in 1995, 6 of the 9 service areas had distributions that were more skewed than nursing services. Unfortunately, in the 2003 article, the authors neither flagged any service area other than nursing as skewed nor analyzed any of the others further on that basis. Nonetheless, Conroy et al. did reanalyze the nursing data using a nonparametric test, and they found that despite the mean increase in reported hours, there was actually a decline in nursing services from 1990 to 1995. This result arose because of the extremely large range in nursing service monthly hours in 1995, from 0 to 744 hours, which the authors explained was caused by 10 individuals requiring around-the-clock nursing coverage.
We have reanalyzed the data on the 9 developmentally oriented service variables using nonparametric test statistics (Wilcoxon), with the 1995 data adjusted in the same way as the 1990 data (i.e., 1995 cases with a variable value over 99 were “coded back” to 99). Our results were consistent with Conroy et al. (2003) on 7 of the 9 variables. However, whereas Conroy et al. found that homemaker services increased significantly, we did not find significance; and whereas Conroy et al. reported no difference for occupational therapy services, we found a significant decline. Of the 6 significant tests, 4 showed decreases in 1995. Despite these statistical differences, the majority of individuals actually stayed the same on these variables from 1990 to 1995. For 5 of the 6 significant variables, the percentage of individuals who did not change ranged from 57.1% (physical therapy) to 87.0% (habilitative services); for the 6th variable, nursing services, 22.0% of the sample did not change. Thus, there is a danger, based on significant test results, that readers may overemphasize the amount of change in the sample or the number of individuals experiencing change; in all cases but one in Conroy et al. (2003), only a minority of individuals had changed scores.
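Our recoding-and-retesting procedure can be sketched as follows. This is a minimal illustration with invented data; `scipy.stats.wilcoxon` here stands in for whatever statistical software was actually used, and the point is that a test can reach significance even when most of the sample does not change at all:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired monthly service-hour estimates for n = 254 people:
# most receive 0 or modest hours in both years; 60 people increase sharply.
h1990 = np.array([0.0] * 140 + [20.0] * 50 + [99.0] * 64)
h1995 = h1990.copy()
h1995[:60] = 250.0          # large 1995 values, above the 1990 ceiling of 99

# Apply the 1990 ceiling to the 1995 data ("coded back to 99") so the
# two years are measured on the same scale before testing.
h1995_capped = np.minimum(h1995, 99.0)

stat, p = wilcoxon(h1990, h1995_capped)   # nonparametric paired comparison
unchanged = float(np.mean(h1990 == h1995_capped))

print(p < 0.05)             # the test is significant...
print(round(unchanged, 2))  # ...although most of the sample did not change
```

In this sketch roughly three quarters of the cases are identical in both years, yet the Wilcoxon test is highly significant, mirroring the interpretive danger discussed above.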
With these differences in mind, Table 2 also shows that some of the data presented in Conroy et al. (2003) are quite unusual. For example, in the habilitation service area, the 1995 data ranged between 0 and 819 hours per month (despite the fact that there are only 744 total hours in a 31-day month); homemaker services ranged between 0 and 730 hours; nursing services, between 0 and 744 hours per month. Thus, for 3 of the developmentally oriented services presented in the 2003 article, even if the reported data ranges are accurate, they surely represent statistical outliers that should have been made known to readers. For example, someone receiving daily round-the-clock private duty nursing coverage would have 744 such hours in a 31-day month (there were 10 cases in the data in 1995 estimated to have received 720 or more hours during the past month). At the very least, it seems dubious, without further qualification, to include such nursing coverage under the heading of developmentally oriented services.
Although occurring to a lesser extent, this same effect is evident for both physical therapy and occupational therapy services (services that actually decreased from 1990 to 1995). For example, 2 individuals in the dataset were listed as receiving 120 hours of occupational therapy per month in 1995; another individual received 100 hours of occupational therapy; 1 participant received 120 hours of physical therapy; an additional individual received 84 hours, yielding weekday averages of up to 6 hours per day. We have never known this level of rehabilitation therapy to occur in any typical setting for individuals with developmental disabilities.
Inspection of the distributions of these variables also reveals striking findings. Distributions were not normal, and medians for all of these services in 1990 were 0, except for nursing (median = 4); medians in 1995 were 0 for all services except communication (median = 2). In addition, distributions shifted radically from 1990 to 1995; for example, the 1990 and 1995 distributions for habilitation services are strikingly different. In 1990, 137 individuals (53.9%) were estimated to have received no habilitative services (0 hours) during the past month, whereas 87 (34.3%) received 99 or more hours (i.e., coded back to 99). Thus, 88.2% of the sample received either the lowest (0) or the highest (99) score, producing a compact U-shaped distribution, with a mean of 36.9 hours but a median of 0. In contrast, in 1995, because the data were not coded back to 99, the distribution was skewed to the right (skewness = 1.33), with a mean of 187.2 and a median of 70 hours of service per month. We have already noted our concerns about extremely high estimates of service hours per month; for habilitative services, however, examining the distributions reveals that more than half of the sample was estimated to have had no habilitative services in the past month in 1990, whereas in the 1995 data, 42.5% of the sample received over 100 hours of habilitative services during the past month. Because the Hissom Memorial Center was an ICF/MR facility, it would have operated under a number of Conditions of Participation with the federal Medicaid program, a central part of which is the delivery of active treatment. It is not clear how more than half of the sample could have had no habilitative services for an entire month in a facility required to deliver active treatment. We suspect measurement error here, either in defining what constituted habilitative services or in how the estimates were carried out and rated, again resulting in noncomparability between 1990 and 1995.
Therefore, in the area of developmentally oriented services our re-analysis identified clear measurement problems as well as incomparability resulting from statistical adjustments of the ranges of variables in 1990 but not 1995. Our re-analyses confirmed certain findings from Conroy et al. (2003) but disconfirmed others, as shown in Table 2. Because of these problems, we believe that these results also need to be interpreted with extreme caution.
Productive uses of time
In this category of variables, Conroy et al. presented 1990–1995 comparisons on educational and vocational service variables. However, as we worked through the dataset provided to us by the authors, we came to believe that the data were not likely to have been checked or “cleaned” according to accepted practices (e.g., Fowler, 1993; van Kammen & Stouthamer-Loeber, 1998). We also suspected that the dataset was not examined closely to gain an understanding of its structure using tools such as frequency distributions, histograms, or stem-and-leaf plots. Such methods point researchers toward appropriate analyses and help avoid analytical missteps. Instead of relying solely on significance tests to understand the data, we examined data lists, frequency distributions, and descriptive statistics for all of the variables. When we did this for variables reported under the heading Productive Uses of Time, we again found numerous inconsistencies and confusing data values. We also discovered that the median value was zero for 10 of the 11 variables in this category, again showing that there were many individual zero scores. Once again, we found that the 1990 data had been limited to a maximum value of 99, with no corresponding limit applied to the 1995 data.
It became clear that the variables used by Conroy et al. (2003) for educational activities or school hours were not precisely aligned with enrollment in school, despite their names and labels in the dataset we received from the authors (names: “school90” and “school95” labeled “school hours”). Instead, when we looked at the data, we were surprised to find that receipt of hours of education and receipt of vocational services in one of the four vocational levels (prevocational, supported work, sheltered employment, or competitive employment) were not always independent. That is, in both 1990 and 1995, a subset of individuals received both, sometimes at striking levels. For example, we found an individual in the dataset who was estimated to have had 99 hours of educational services per month and 99 hours of prevocational training (combined, 9.9 hours per day for weekdays); another was reported to have had 16 hours of education, 80 hours of prevocational services, and another 60 hours of sheltered employment. Thus, within vocational services, we found the same phenomenon: enrollment in one of the four levels did not preclude hours in one or more other levels. This occurred more in the 1995 data than in the 1990 data—1990: 4 cases (1.6% of the sample); 1995: 23 cases (9%).
Although terminology in both the 2003 article and the dataset itself suggest that educational hours were associated with school enrollment, we found several cases in which individuals over 21 received educational hours. Again, this occurred more often in the 1995 data (14 cases) than in 1990 (6 cases). Overall, it appeared to us that not all of the data collectors were employing the categories in the same way either across the two data-collection times (1990 and 1995) or even within a single data-collection effort.
In comparing educational services between 1990 and 1995, we found that there were 191 individuals in 1990 (75.2% of the sample) and 230 individuals in 1995 (90.6%) who received 0 as an estimate for “hours in the past month” for this activity. Conroy et al. (2003) reported a significant decline in educational hours based on a paired t test with 253 degrees of freedom. However, had only school-aged students been eligible for school (as the authors suggested), it would not have been possible for non-school-aged individuals to have “hours” for educational services. Although we initially thought that hours of services such as adult education might have been included, Conroy et al. (2003) implied this variable was related to school entitlements (p. 269). If that is the case, then it is inappropriate to use 253 as the degrees of freedom for this test (an error that reduces the critical value for significance). Furthermore, as before, the data for educational hours in 1990 were limited to 99 (range 0 to 99), although the 1995 data were not similarly re-coded (range 0 to 160). Upon examination of the data, we found that in 1990, 30 of the 63 individuals who were 21 or younger (47.6%) were estimated to have received at least 99 hours of educational services per month (i.e., a typical 5-hour school day). Because of aging, comparable figures for 1995 showed 24 individuals continuing to receive some educational hours, with 10 of these (41.7%) receiving at least 99 hours. A paired t test on the 11 individuals who attended school in both 1990 and 1995 revealed no significant difference in school hours, t(10) = .35, p = .732.
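The degrees-of-freedom point can be made concrete with a toy calculation. The hour values below are invented; the essential feature is that only individuals eligible for school in both years enter the paired test, so the degrees of freedom come from that subset rather than the full sample:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical monthly school hours for the 11 people of school age in
# BOTH years; everyone else is ineligible and must not enter the test.
edu90 = np.array([99.0, 80.0, 60.0, 99.0, 40.0, 99.0, 70.0, 99.0, 55.0, 90.0, 99.0])
edu95 = np.array([95.0, 99.0, 50.0, 99.0, 60.0, 80.0, 99.0, 90.0, 40.0, 99.0, 85.0])

t, p = ttest_rel(edu90, edu95)   # paired-sample t test
df = len(edu90) - 1              # df = n - 1 = 10, not 253
print(df, round(t, 2), round(p, 3))
```

Using 253 degrees of freedom when only a small eligible subset exists lowers the critical value and makes spurious significance more likely, which is exactly the concern raised above.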
With respect to vocational services, once again the 1990 and 1995 data were not directly comparable because the 1990 data were limited to 99. We recoded any higher 1995 values back to 99 and compared these data using Wilcoxon tests; we found that prevocational services declined significantly in 1995, whereas sheltered employment and supported employment increased significantly. Nonetheless, once again on these three variables, most of the participants did not change (i.e., in each case more than 50% remained the same: prevocational, 53.1%; sheltered employment, 58.6%; and supported employment, 74.0%), demonstrating that the significant effects were again driven by a minority of the sample. Thus, setting aside potential problems in the data, the significance tests in the area of vocational services reported by Conroy et al. (2003) held up for prevocational, sheltered work, and supported employment.
For competitive employment, however, we were concerned that there was only a single individual in the dataset so engaged in 1990 and only 9 in 1995 (representing, respectively, .4% and 3.5% of the sample of 254). Because of this, Conroy et al.'s (2003) reported statistically significant increase between 1990 and 1995 is misleading. In reviewing the data used to produce descriptive statistics for this article (p. 269, Table 4), we discovered that the 1990 entry under the mean hours of competitive employment (.1 hours) was calculated by dividing by 254 the estimated number of hours worked (20) by a single individual, the only person in the entire 1990 sample with any competitive employment hours. The corresponding 1995 mean included 9 individuals. Calculating a mean and SD and carrying out a t test comparing group means when one of the groups comprises a single individual obfuscates the real nature of the data. Further, we believe that there may be other errors (or outliers) in the competitive employment data. For example, we note that 1 of the 9 individuals who had competitive employment hours in 1995 was a man with moderate disabilities who, in 1990, was enrolled only in prevocational services, yet was reported in 1995 to have had 216 hours of competitive employment per month (or about 11 hours per day given a typical 20-day work month). Another individual, who was reported to have had 120 hours of competitive employment (6 hours per day for a 20-day month), was a 31-year-old woman with profound disabilities who was also reported to have received 120 hours of sheltered workshop services in the same month. Given these striking data in 2 of the 9 cases of competitive employment, without any further explanation in the article, we discontinued our examination of this variable.
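The dilution behind the tabled 1990 mean of .1 hours can be reproduced with simple arithmetic (a sketch; only the single 20-hour case and the sample size of 254 are from the article):

```python
# One person with an estimated 20 hours of competitive employment;
# the remaining 253 sample members have 0 hours.
hours_1990 = [20] + [0] * 253

mean_1990 = sum(hours_1990) / len(hours_1990)      # 20 / 254
n_employed = sum(1 for h in hours_1990 if h > 0)   # how many people it reflects

print(round(mean_1990, 1), n_employed)
```

The "group mean" of roughly 0.1 hours describes the experience of exactly one person, which is why a t test on such a mean obscures rather than summarizes the data.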
In addition, we suspect that in the 1995 data collection, when an individual was enrolled in a program “full time,” many interviewers/raters assigned a value of 120 hours (the most frequently occurring value in the 1995 data, and one that was not possible in 1990). This number of hours (120) converts, on a 20-day month basis, to a 6-hour day. Although not unreasonable for some, it should be remembered that 91% of the sample had either severe or profound intellectual disabilities. Nevertheless, in 1995, 38% of the sample (nearly 100 individuals) had values equal to or higher than 120 hours per month with respect to vocational/school activity (i.e., programming days of 6 hours or more). Without further information on how the informants/interviewers made estimates of “hours” in a particular setting (e.g., Was full-day programming simply taken to be 6 hours in all cases? Was transportation time factored into the time estimates? Were there data-collection standards or rules for assigning values to estimates?), we are prone to consider these data as too unstable for interpretation.
Further, in the dataset provided to us, 242 individuals had known functioning levels; only 21 had either mild or moderate cognitive disabilities (8.3%), leaving 221 who functioned in either the severe (48) or profound (173) range. Nonetheless, in 1995, 68 individuals were reported to be enrolled in either supported employment (59) or competitive employment (9), including 29 individuals with profound disabilities (3 of whom worked in competitive employment). These atypical juxtapositions of the types of vocational settings and the levels of intellectual disabilities caused us to wonder about the definitions of program types used in the study and about potential rating errors in these measures.
Finally in this area, data were missing from the dataset forwarded to us. For example, there was no combined overall vocational/educational measure for 1995 in the dataset, although there was such a variable for 1990 with comparisons reported in Conroy et al. (2003). When we attempted to calculate the variable in the same manner as it was calculated in 1990, we were not able to produce an algorithm that replicated the 1990 data values. Upon closer inspection we found that the variables that were summed for the overall vocational/educational measure did not appear to be standard (i.e., different variables were included for different individuals). When we attempted to re-calculate new variables, using algorithms that we created based on the text, the resulting measures were influenced by the extreme ranges noted above, resulting, in several cases, in unreasonable values (e.g., 240 total hours of either vocational or educational services per month, or 12 hours per day). Having no way to verify the data, we discontinued our efforts on these variables. Based on all of these factors, we find it difficult to place much confidence in the findings presented under Productive Uses of Time (Conroy et al., 2003, p. 269) because, in our view, the variables in this area appear to include measurement or transcription problems or, in the case of competitive employment, statistical problems.
In the 2003 article, Conroy et al. explained that the Family Contact Score was a combination of three frequency items: family contact, family visits (to consumer), and consumer visits to family. Unfortunately, only the scale scores for 1990 and 1995 were provided to us, so we were unable to examine the raw item data for these scores. Although the differentiation between family “contact” and “visits” is not explained in the 2003 article, in the 1996 Report Conroy noted that the former referred to phone calls or mail. Of more concern here is the range of these Family Contact Scale scores. In the Method section of the 2003 article, the following description is provided:
The three family questions covered frequency of family contact, frequency of family visits, and frequency that the consumer visits the family. Summing these three scores yields a brief Family Contact Scale … a score of 18 on this scale would indicate maximal family involvement [indeed, an individual would literally have to live with his or her family to achieve this score]. A score of 3 would indicate that either the individual had no family or has not had family contact within the past year. (Conroy et al., 2003, p. 267)
From this it would seem the three items are likely to have ratings between 1 and 6, although we also found scores of zero in the dataset suggesting that actual ratings ranged between 0 and 6. Nevertheless, frequency distributions of the 1990 and 1995 Family Contact Scores showed that the 1995 data ranged not between 0 and 18, but between 0 and 27. Additionally, we were unable to replicate any of the statistics presented in this section from the data forwarded to us. For example, Conroy et al. (2003) stated:
Persons who had no family were excluded from this analysis. … Approximately 17.4% of the 1990 participants fell into the category of having no family involvement. In 1995, only 8.3% of the participants fell into the category of having no family involvement. (p. 270)
However, the database we received contained scores associated with no family contact (3 or lower) for 20.2% of the sample in 1990 and 10.2% in 1995. In addition, if the authors actually excluded 17.4% of the 1990 participants due to lack of family contact, then the degrees of freedom for their t test could not have been 240 as reported (p. 270); rather, it could have been no larger than 209. Therefore, in the area of family contact, we are again not sure what to make of these data other than to say that they seem to be improperly collected and analyzed and, likely, in error.
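The degrees-of-freedom check here is simple arithmetic, assuming only the 17.4% exclusion rate and the sample of 254 reported in the article:

```python
# If 17.4% of the 254 participants had no family involvement and were
# excluded, at most 210 cases remain, so a paired t test on the remainder
# has at most 209 degrees of freedom -- not the 240 reported.
n_total = 254
n_excluded = int(0.174 * n_total)   # 44 participants excluded
n_included = n_total - n_excluded   # 210 at most
df_max = n_included - 1

print(n_included, df_max)
```

A reported df of 240 implies that 241 pairs entered the test, which is arithmetically incompatible with the stated exclusions and is one more indication that the analysis cannot be reconstructed from the description given.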
In fact, in our view the three components making up the Family Contact Scale pose too many problems to be easily (and validly) combined into a single family contact measure. Without some empirically derived weighting scheme, the scoring approach used in Conroy et al. (2003) presumes that a 2-minute phone call or a birthday card has equal weight when compared to a weekend at home. Furthermore, because individuals were leaving the Hissom Center in cohorts, there are likely to have been some selection factors operating for different subgroups across the study period. As individuals moved out, it would have been natural to find an increase in family contact surrounding the actual move. Thus, contact of all types may have increased at the time of a move to a new home, but may have tapered off to original levels thereafter. It is not possible to examine the contact data in this way because neither the actual data collection dates nor the raw data on each component measure were provided to us in the dataset.
Health care access/need
In this area, Conroy et al. (2003) examined (a) the urgency of health care needs, (b) the frequency of physician visits, and (c) the difficulty in securing medical services. We were unable to examine the second of these because the 1995 data on physician visits were missing from the dataset we received. With respect to the urgency of health care needs, we find the rating system inadequate (see Table 3); informants were not trained health care professionals, and the ratings are too global to be captured by means, SDs, and t tests. In our view, the “mean rating difference” in urgency between 1990 (M = 2.4) and 1995 (2.5) is simply not interpretable, nor is the significant t test of these two reported in the 2003 article. Comparison of the actual 1990 and 1995 data (see Table 3) shows them to be quite similar, with minor differences likely attributable to the increased health orientation available in the ICF/MR setting in 1990.
Likewise, the statistically significant difference reported by Conroy et al. (2003) for the difficulty in securing medical services is equally uninterpretable because of slight shifts in rather unstable ratings (Table 3). Data for both trouble in receiving medical services and for physician visits were provided only as range estimates (e.g., as seen in Table 3, the trouble and physician-visit ratings used a 6-point scale that ranged from more than once a day to twice a year or less). As noted, we could not compare physician visits because the 1995 data were missing from our dataset. Nonetheless, one wonders what an average value for such ratings actually means; arguably, these data do not even constitute proper parametric data for statistical analyses. In conclusion, once again, we find that the data presented were based on ratings by nonprofessionals that did not adequately take environmental contexts into account, employed a less than useful rating basis, and did not address the possibility of health care changes over time; such data are, therefore, difficult, if not impossible, to interpret.
In the 2003 article, Conroy et al. reported an 8.7% decrease in the use of antipsychotics from 20.9% to 12.2% for the entire sample between 1990 and 1995; they also reported an offsetting increase of 8.6% in other classes of medications (2.7% increase in anxiolytics and a 5.9% increase in sedative/hypnotics). Unfortunately, because no dose levels or specific medications were identified (i.e., medications were only tracked by drug class), we are unclear about the interpretation of the authors' findings. Antipsychotics was the only class in which more than 50 individuals received one or more specific drugs. Unfortunately, the changes reported (i.e., a reduction of the use of antipsychotics with a concomitant increase in other psychotropic medications) are not, in our view, sufficiently specific to allow broad conclusions.
Furthermore, data published later by this group (Spreat, Conroy, & Fullerton, 2004), but available to those authors prior to the 2003 article, are contradictory. In a larger analysis of the medications of “movers” and “stayers” of public ICF/MR settings in Oklahoma, Spreat et al. reported a 39% increase in the number of individuals using any type of antipsychotic medication (from 27.5% to 38.9% of the sample) when individuals moved from ICF/MR settings to supported living in community settings, a finding consistent with Conroy's findings from other states. This fact, in conjunction with the findings from Conroy et al. (2003), suggests that sample-specific factors may have affected medication usage (e.g., the high percentage of individuals with severe and profound disabilities in Conroy et al., 2003).
Notes on Measurement
For research to have scientific merit, conclusions must rest on data that are carefully collected with reliable and valid instruments. We have repeatedly shown in this review that the measures used by Conroy et al. (2003) were often not carefully constructed or carefully employed. With respect to reliability and validity, we note that although the authors cited literature in support of some of the measures, nearly all of the citations supported the reliability of the instruments as opposed to their validity. Furthermore, we are concerned that typical reliability studies may not directly apply because the measures in Conroy et al. largely consisted of interviewer transcriptions of informant ratings. One wonders how the repeatability inherent in typical measures of reliability (e.g., test–retest or interrater) applies in the present case, in which both the informant and the interviewer are different. In fact, the reliability problem here is not between two raters at all; rather, there are four individuals who may affect reliability (two informants and two interviewers), a problem referred to as interrespondent reliability, which studies have shown to be quite complex (e.g., Felce & Perry, 1996). For example, in the Oklahoma Hissom project reported in Conroy et al. (2003), there were four individuals, either rating people or transcribing ratings, 5 years apart, in entirely different contexts. In our view, measurement reliability is a real problem that must be directly addressed in a study of this complexity, especially given the number of data-based problems we have identified here.
The one concurrent validity study cited (Pawlarczyk & Shumacher, 1983) compared the adaptive behavior measure used in Oklahoma (the Behavior Development Survey) with other prominent scales. However, the study was published more than 20 years ago, and the authors used a single rater at a single data-collection point. Even then, they cautioned about psychometric problems, especially in the measurement of maladaptive behavior. Finally, in the Hissom closure, it is also possible, because raters (and interviewers) were not blinded as to the purpose of the study and may have even known about the Court's interests, that rater or interviewer bias, or both, could easily account for a part of the often modest differences in outcomes.
With respect to many of the other measures used by Conroy et al. (2003), validity data are typically not cited, but there are very clear threats to their validity. First, we are skeptical that estimates of service hours during a month, obtained from different informant/interviewer pairs 5 years apart, can be considered valid outcome measures, especially when we have identified a substantial number of apparent transcription or definition errors. Second, we have shown that, for several of the variables used here, a coding limit (i.e., 99) was imposed on the 1990 data but not on the 1995 data. Nevertheless, Conroy et al. presented the results of several statistical tests used to compare these groups on mean differences without taking the coding limit into account. Third, we have found that the range of challenging behavior scores does not seem to be possible given the rating system and the number of items. Finally, we believe that the ways in which the items used in the Oklahoma Hissom studies were presented, aggregated, combined, and scored all pose threats to validity. Examples include the repeated finding of extreme score values that do not reflect typical service delivery settings and the Family Contact Scale, in which seemingly disparate types of contact enter equally into the overall score through a simple sum.
Summary and Conclusion
In this review we have shown that problems with the sample, data-collection instruments and their use, and data analyses have seriously compromised the findings of Conroy et al. (2003). We have shown serious design errors as well as many data transcription and analysis errors that often reverse the findings reported in their article. In other cases, these problems represent errors that are less egregious, but when taken together, add to the validity problems in the article.
These problems were often compounded by a lack of specific interpretation provided in the article beyond simply listing significant statistical tests (many of which we have shown to be in error). We assert that readers of Conroy et al. required more information to fully understand the implications of the study described.
We have shown that throughout the text, the authors appear to document the effects of deinstitutionalization from a large ICF/MR on the Hissom Focus Class, although, as we have noted, this was not the case due to sample problems. In addition to the evidence presented here showing that the sample in the 2003 article was not representative of the Hissom Focus Class, other studies by these same authors have reinforced this conclusion more broadly. For example, in another study in their series on costs, Jones, Conroy, and Spreat (1999) reported that in the community, the Hissom Focus Class Members cost, on average, $34,000 more than other Oklahoma community residents and have staffing ratios that are often nearly 2.5 times as generous. Unfortunately, these findings and the selected nature of the sample in Conroy et al. (2003) are not addressed in the Discussion section. Instead, readers are left with the impression that if there were statistically significant differences, then there were real changes in the participants, although we have shown, first, that the effects were often small as measured by the d statistic; second, that opportunity effects may, in some cases, have been responsible for part of the effects; third, that such differences may have little real-world significance; and fourth, that statistical significance was often driven by a very small portion of the sample. In short, we believe that authors have a responsibility to readers not only to present the results of significance tests, but to place them in context, within the data being analyzed, and in relation to other known data and past research. We do not believe that Conroy et al. (2003) adequately carried this out.
Science, in its idealized form, is a continuously self-correcting enterprise. To accomplish this self-correction, and ensure the quality of accumulated scientific knowledge, the methods of science include two complementary mechanisms: peer review and replication. Peer review is the process of allowing competent scientific peers to examine research reports, ask questions, and draw conclusions about the scientific validity of reported claims.
The second mechanism, replication, is a dynamic process that involves redoing a scientific study or project, in part or in whole, to shed light on aspects of the original findings and conclusions. Replication can be exact, as when a scientist in a lab attempts to reconstruct precisely the experimental conditions of a published study in order to reproduce the original findings. In applied work, where it is usually not possible to reconstruct the exact conditions of a studied event, replication often takes the form of secondary data analyses or re-analyses of data by another team.
In this review we have replicated many of the analyses from Conroy et al. (2003), using data provided by the original authors, and have documented (a) an inordinately large number of lost subjects, (b) an unacceptable number of data transcription and analytic errors, (c) several measurement and psychometric concerns, and (d) a narrow perspective on interpreting findings, consisting largely of simply reporting the results of statistical tests. Among the principal findings of this review is that Conroy et al. (2003), as published, cannot account for over 50% of the Focus Class Members of the Hissom Memorial Center, although in several places in the article the findings are presented as if they applied to the entire class.
Our re-analyses also set aside several “statistically significant” findings in the major outcome areas, while uncovering important findings that were missed in the original analyses. We have shown that although changes in adaptive behavior factor scores were found, they were modest and constituted small effects. We have demonstrated how changes reported in challenging behavior are difficult to interpret because the data do not match the scoring possibilities of the scales (frequency and severity) used. We also note that severity ratings, such as those used in Conroy et al., address an area of adaptive behavior assessment with known scoring and reliability problems that demands a great deal of caution in interpretation; this caution was raised in several of the sources cited in Conroy et al. but was not addressed in the text.
We have shown how the data underlying both the productive uses of time and the service area variables contain many errors and analysis problems, which give rise to serious interpretive difficulties. We have revealed that data were often not comparable and highlighted evidence pointing to systematic measurement error in the data collection, leading us to conclude that several of the parametric tests were inappropriate, given either the level of measurement of the data or the shape of the variable distributions.
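When paired scores are ordinal or markedly non-normal, a distribution-free alternative to the paired t test is generally preferable. As an illustration of the kind of analysis we have in mind (the function and the example data are ours, not drawn from the Oklahoma dataset), the following sketch implements a simple two-sided sign test for paired change scores:

```python
from math import comb

def sign_test(diffs):
    """Two-sided sign test for paired change scores.

    Asks only whether positive changes outnumber negative ones more
    often than chance would predict, so it is suitable for ordinal
    scores where a paired t test's interval-scale and normality
    assumptions are doubtful.
    """
    nonzero = [d for d in diffs if d != 0]   # zeros carry no sign information
    n = len(nonzero)
    k = sum(1 for d in nonzero if d > 0)     # count of positive changes
    tail = min(k, n - k)                     # size of the smaller tail
    # Exact binomial tail probability under H0: P(change > 0) = 0.5
    p_one = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_one)               # two-sided, capped at 1

# Hypothetical example: 8 participants, all of whom improved
print(sign_test([2, 1, 3, 1, 2, 1, 1, 2]))  # 0.0078125
```

Because the test uses only the direction of each change, it is robust to the kinds of scale and distribution problems described above, at some cost in statistical power relative to parametric alternatives when their assumptions do hold.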
While conducting several of the analyses for this critique, we were, at times, concerned that the dataset actually was not the exact one used to produce the analyses in Conroy et al. (2003). However, upon close examination, there were a sufficient number of exact replications (e.g., in various descriptive statistics and in selected findings, see Table 1) to conclude that the dataset is the correct one, although it was originally analyzed with little rigor.
Comparing the content of this review to the content of the Discussion section from Conroy et al. (2003) is instructive if only to show how improperly conceptualized measures and poorly executed statistical analyses can lead to conclusions that are clearly not supported by the data. For example, the first and last sentences of the first paragraph of the Discussion read as follows:
This study adds to the growing body of empirical literature that attests to the benefits of community living for persons with mental retardation. … There can be little argument that the lives of the Focus Class Members improved over the 5-year period of study. (p. 271)
Although such statements fit in well with a particular zeitgeist with respect to the community imperative, they do not follow from a study of less than 50% of the Focus Class Members. Further, our re-analyses appear to raise several “arguments” as to whether there was actually meaningful change in the lives of sample participants.
Later in the Discussion section, as Conroy et al. (2003) presented statements on loneliness and isolation, they made arguments that drew together several threads of the findings and interpreted them as showing increased “economic intercourse” in community settings, suggesting that it underlies social integration. We note, in response, that Socialization factor scores increased, on average, only 1.9 points on a scale that ranged from 0 to 31 and comprised 7 distinct items: a 1990-to-1995 difference that constituted a small effect based on Cohen's d. We have also shown that some of the employment and activity data underlying the “economic intercourse” arguments are invalid, incorrectly analyzed, or based on a very small number of individuals, suggesting that the authors' conclusions about community-based economic intercourse and social integration go beyond the data presented.
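For readers unfamiliar with the d statistic, Cohen's d for a change score is simply the mean change divided by the standard deviation of that change, judged against Cohen's (1988) conventional benchmarks of 0.2 (small), 0.5 (medium), and 0.8 (large). The sketch below uses the 1.9-point mean Socialization gain reported above; the standard deviation is a purely hypothetical value supplied for illustration, since the actual value would have to come from the dataset:

```python
def cohens_d(mean_change: float, sd_change: float) -> float:
    """Standardized mean difference (Cohen's d) for paired change scores."""
    return mean_change / sd_change

# 1.9 is the mean Socialization gain discussed in the text; 6.0 is a
# hypothetical standard deviation used only for illustration.
d = cohens_d(1.9, 6.0)
print(round(d, 2))  # 0.32
```

Under this illustrative assumption, a 1.9-point gain falls in the "small" range, which is the sense in which we characterize the reported adaptive behavior changes as modest.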
Overall, in our replication of the analyses in this article, using data provided by Conroy et al., we have shown that their findings are generally not as presented. As such, we do not find that this article adds to the literature as the authors contended. As the number of traditional ICF/MR congregate institutional settings declines and newer models of community support arise, it is vitally important to examine objectively the impact that service system changes have on individuals with complex and profound disabilities. By and large, the population now residing in institutional settings represents the most complex and challenging subgroup of individuals with developmental disabilities. Their welfare demands the highest quality scientific studies to help shape public policy on the services, and service models, needed and desired by this group.
We understand that a secondary analysis of any data by a new research team is likely to result in unanswered questions and even occasional errors. Although desired, perfection in any human endeavor is often not achieved. Nonetheless, the range and depth of the difficulties outlined here demonstrate pervasive problems that seriously call into question conclusions that are likely to be relied upon to make social policy decisions. In our view, the article published by Conroy et al. (2003) may actually pose problems for individuals with developmental disabilities who are similar to the Hissom Focus Class. Furthermore, articles with this level of inaccuracy seem to violate the tenets fundamental to the appropriate practice of scientific research. As such, we conclude that readers of the journal Mental Retardation should exercise an abundance of caution in relying upon findings in Conroy et al. (2003). We also believe that prior to drawing firm conclusions or taking action, readers should consider the ramifications of the information presented in this critique.
Authors: Kevin K. Walsh, PhD (email@example.com), Director of Quality Management and Research, Developmental Disabilities Health Alliance, Inc., 1527 Forest Grove Rd., Vineland, NJ 08360-1865. Theodore A. Kastner, MD, Associate Professor of Clinical Pediatrics, New Jersey Medical School, University of Medicine and Dentistry of New Jersey, and President, Developmental Disabilities Health Alliance, 1285 Broad Street, Bloomfield, NJ 07003-3045