Between-rater reliability of the 6-Minute Walk test, Berg Balance Scale, and handheld dynamometry in people with multiple sclerosis.

This study investigated the between-rater reliability of the Berg Balance Scale (BBS), 6-Minute Walk test (6MW), and handheld dynamometry (HHD) in people with multiple sclerosis (MS). Previous studies that examined BBS and 6MW reliability in people with MS have not used more than two raters, or analyzed different mobility levels separately. The reliability of HHD has not been previously reported for people with MS. In this study, five physical therapists assessed eight people with MS using the BBS, 6MW, and HHD, resulting in 12 pairs of data. Data were analyzed using intraclass correlation coefficients (ICCs), Spearman correlation coefficients (SCCs), and Bland and Altman methods. The results suggest excellent agreement for the BBS (SCC = 0.95, mean difference between raters [d̄] = 2.08, standard error of measurement [SEM] = 1.77) and 6MW (ICC = 0.98, d̄ = 5.22 m, SEM = 24.76 m) when all mobility levels are analyzed together. Reliability is lower in less mobile people with MS (BBS SCC = 0.6, d̄ = -1.83; 6MW ICC = 0.95, d̄ = 20.04 m). Although the ICC and SCC results for HHD suggest good-to-excellent reliability (0.65-0.85), d̄ ranges up to 17.83 N, with SEM values as high as 40.95 N. While the small sample size is a limitation of this study, the preliminary evidence suggests strong agreement between raters for the BBS and 6MW and decreased agreement between raters for people with greater mobility problems. The mean differences between raters for HHD are probably too high for it to be applied in clinical practice.

In order to assess the outcome of clinical interventions, reliable outcome measures are essential. Reliability refers to the amount of change in score that can be attributed to measurement error rather than actual improvement or deterioration of the person being measured. It can be expressed as relative or absolute reliability. Relative reliability is estimated with correlation coefficients, which should always be augmented by Bland and Altman calculations. 1 Absolute reliability expresses measurement error in the same units as the original measurement, as with standard error of measurement (SEM). 2 The need for large randomized controlled trials (RCTs) may lead to multicenter trials with many raters, making between-rater reliability estimates between numerous raters increasingly relevant when interpreting results. Between-rater reliability is also important when assessments are performed in clinical practice by two or more different clinicians.
Balance, mobility, and strength are some of the most commonly impaired body functions in people with multiple sclerosis (MS); such impairments can significantly limit a person's ability to take part in social, work, and family activities. 3,4 A recent profiling study of people with MS in Ireland found that balance was reported as the main problem experienced, followed closely by walking, mobility, and strength. 3 These domains are frequently measured in both research and clinical practice.
The Berg Balance Scale (BBS) is a 14-item, 56-point scale designed to measure balance among older people by the assessment of functional tasks. 5 Its concurrent validity has been previously established for people with MS. 6 A study investigating BBS between-rater reliability for two raters in 25 people with MS 7 obtained an SEM estimate of 1.51 with an intraclass correlation coefficient (ICC) of 0.96. Four of the participants used a walking aid. A second study 8 using two raters estimated the SEM at 0.85 with an ICC of 0.99. The overall mean score achieved by the study participants was 52.62 out of a possible 56, and four of the nine participants used a walking aid. The high BBS scores achieved in the study by Paltamaa et al. 8 and in that of Cattaneo et al. 7 (mean BBS score of 47.79) may indicate a ceiling effect of the BBS for these samples that resulted in artificially increased reliability coefficients. These studies used only two raters, and the results were not subanalyzed to take into account the variability of mobility levels in people with MS.
The 6-Minute Walk test (6MW) is a simple functional walking test that measures exercise capacity and endurance levels and is a good predictor of habitual walking. 9 Goldman et al. 10 found a strong relationship between the 6MW and subjective measures of ambulation and physical fatigue in 40 people with MS. They stratified their population by Expanded Disability Status Scale (EDSS) score and found mean distances of 603 m, 507 m, and 389 m for those with mild, moderate, and severe disability, respectively. An ICC value of 0.91 was obtained for between-rater reliability, but no SEM was reported. Paltamaa et al. 8 reported an ICC of 0.93 with an SEM of 35.85 for the 6MW in nine people with MS who achieved a mean distance of 427 m.
Handheld dynamometry (HHD) is a means of assessing muscle strength using a small handheld force plate. It has been previously found to be a reliable method of assessing muscle strength in people with diseases such as Huntington chorea 11 and other neurologic conditions. 12,13 Newsome et al. 14 found HHD to have high between-rater reliability for the muscle actions of ankle dorsiflexion and hip flexion, with ICC values of 0.97 and 0.98, respectively. To the best of our knowledge, corresponding SEM data for HHD have not been collected for this population.
The aim of this study was to examine the absolute and relative reliability between more than two raters of the BBS, the 6MW, and HHD when used with people with MS, and to examine the effect of stratifying the participants by mobility level on reliability scores.

Methods
Participants
Physical therapists were recruited for this study through the MS Society of Ireland. Therapists who had received training as part of a multicenter trial were invited to participate by e-mailing them the study information leaflet. Five therapists who provided consent were available to participate on the day of testing. People with MS were recruited through the local MS Society office through a database of those interested in taking part in research. Eight people with MS gave their consent to participate.
The inclusion criterion for participating physical therapists was having attended the training sessions for the multicenter study. People with MS needed to be ambulatory, using at most a walker to walk a minimum of 10 m unaided. People with MS who were pregnant or who were experiencing a relapse or an exacerbation of symptoms at the time of testing or within 2 months before the testing were excluded. Written informed consent was obtained from each participating therapist and person with MS, and people with MS were also required to fill out a Physical Activity Readiness Questionnaire (PAR-Q). 15 Ethical approval for this study was obtained through the University of Limerick Research Ethics Committee.

Procedure
During preparation for the study, a number of different reliability studies that used more than two raters were examined in terms of their methodologies. After review of 12 studies, it was decided to divide the raters into pairs, as was done by Lafave et al., 16 Hesketh et al., 17 and Howe et al., 18 owing to the effectiveness and efficiency of this methodology as noted in those studies. In accordance with the study carried out by van Loo et al., 19 it was decided that each pair would rate the same BBS and 6MW but mark individually. Because of uneven numbers, four out of the five raters examined the BBS and 6MW in randomly allocated pairs. The nature of HHD meant that it had to be carried out independently by each assessor, so all five raters were used for the assessment of HHD.
The BBS was administered using the standardized test instructions. 5 The 6MW was conducted by having the person with MS walk around two chairs set 10 m apart and marking the distance walked in 6 minutes. Participants were instructed to walk "as quickly and safely as possible." In pairs, raters watched the person with MS perform the tests, but marked separately. Handheld dynamometry was tested individually by each rater using a standardized protocol described by Bohannon. 12
People with MS were stratified by the mobility section of Guy's Neurological Disability Scale (GNDS). 20 This was used instead of the EDSS because the GNDS can be administered by any health-care professional without the need for prior training. People with MS were categorized into group A and group B. Group A consisted of individuals who walk independently or use at most one stick (cane) outside, equating to a GNDS score of 0 (no problems walking), 1 (problems walking but don't use a stick), or 2 (use a stick to mobilize outdoors), with a rough EDSS equivalent of between 0 and 5.5. Group B consisted of individuals who use bilateral support or a walker or may occasionally use a wheelchair, equating to a GNDS score of 3 (use bilateral device) or 4 (use bilateral device and wheelchair for longer distances), with a rough EDSS equivalent of 5.5 to 6.5.

Data Analysis
Data were analyzed using SPSS, version 16 (SPSS, Chicago, IL). Data were initially assessed for normality using the Shapiro-Wilk test. As some variables met the assumptions for normal distribution and others were not normally distributed, both parametric and nonparametric statistics are presented. Normally distributed data were analyzed with a two-way mixed model using the ICC and associated 95% confidence interval (CI). The Spearman coefficient was ascertained for non-normally distributed data. However, use of ICCs or Spearman coefficients alone does not give an accurate description of reliability and cannot be interpreted clinically, as it does not elaborate on the extent of disagreement between measurements. 1,21 Therefore, Bland and Altman 22 methodologies were also employed to determine the mean difference between all pairs of data (d̄), the standard deviation of the differences (SDdiff), and the standard error (SE) of d̄. In order to further enhance the clinical applicability of the results, the standard error of measurement (SEM) was calculated as detailed by Stratford 23 and Beckerman et al. 24 The SEM gives us a clinically applicable value based on the mean difference between raters that enables us to determine the amount of change that is due to an intervention itself as opposed to measurement error. 2 The equation used for the SEM was as follows:
SEM = SDdiff / √2 (where SDdiff comes from the Bland and Altman analysis). 23,24

Results
There were four people with MS in group A: one male (GNDS = 1) and three females (two with GNDS = 0 and one with GNDS = 1). There were four people with MS in group B: two males (GNDS = 3, 4) and two females (GNDS = 3, 4). All participants answered "no" to all questions on the PAR-Q. Testing resulted in 12 pairs of data. The BBS group B, 6MW group B, elbow flexion HHD, and ankle dorsiflexion HHD values were normally distributed (Shapiro-Wilk test, P > .05). Results of the ICC and Spearman calculations for each of the outcome measures are presented in Table 1. The highest correlation coefficient was 1.0 for BBS group A. Five of the six pairs of data for group A participants scored maximally on the scale, which may have falsely lowered the variability between scores; therefore, the data were reanalyzed separately with the group B participants only. The 6MW data were also reanalyzed with group A and group B participants separated. The lowest correlations were 0.647 for HHD of hip extension and 0.575 for ankle dorsiflexion. The confidence intervals for the ICC of the HHD measures were large.
The results of the Bland and Altman analysis and SEM values are presented in Table 2, along with the mean scores of participants on each outcome measure. The SEM value as a percentage of the mean score of all tests for each outcome measure is also shown. The SEM as a percentage of the mean score is lowest for the BBS and group A 6MW, and ranges from 19.8% to 31.7% for HHD.

Discussion
This study aimed to evaluate the between-rater reliability of the BBS, 6MW, and HHD in people with MS. Although the reliability estimates for the BBS and 6MW were good, the differences (as a percentage of the mean scores) between raters for HHD were large.
We analyzed both the relative and absolute reliability of the scales using the correlation coefficients (ICC and Spearman correlation) and SEM, respectively. The correlation coefficients give an estimate of the strength of the relationship between raters, and the P values and 95% confidence intervals for the ICC give us a sense of the accuracy of these estimates. On the other hand, the SEM provides an estimate of the variability between raters expressed in the original values of the measure. It has been suggested that the difference between raters would need to be greater than the SEM to conclude that the difference was due to the interventions rather than measurement error.
The relative reliability of the BBS when both groups were included can be considered excellent according to Fleiss, 25 who recommended that values of greater than 0.75 represent "excellent" reliability while values between 0.4 and 0.75 represent "fair to good" reliability. However, five of the six pairs of data in group A were maximal scores and were therefore identical, at 56 for both raters. When they were removed from the analysis and only group B considered, the Spearman correlation dropped to 0.60. This value is considerably lower than the value of 0.99 reported for the ICC by Paltamaa et al., 8 although the overall mean score of their participants was 52.62 on a 56-point scale. It is possible that their results were also influenced by the ceiling effect of the measure, leading to identical pairs of data at the maximum score. Our SEM for the total BBS was larger than that of both Paltamaa et al. 8 and Cattaneo et al. 7 (1.77 vs. 0.85 and 1.51, respectively), and our mean score was 46.71, compared with 52.62 and 47.79, respectively, suggesting that absolute reliability decreases as the score decreases. This is supported by an SEM value of 2.91 for those in group B, which is much larger than previous estimates. To our knowledge these are the first published data estimating an SEM of the BBS in people with MS that is applicable to specific mobility levels.
The ICC values for the 6MW for the total cohort and the two groups can also be considered excellent. The SEM value for the 6MW for the total cohort was 24.76 m, compared with a value of 35.85 m found by Paltamaa et al., 8 suggesting greater agreement between the raters in this study. This may be accounted for by the standardized training previously undertaken by all physical therapists used in this study. 26 When divided into groups, the SEM for the more disabled people was 33.14 m, compared with 13.26 m for those using at most a stick to walk. The issue of a ceiling effect is not present in the 6MW, which is based on meters walked.
When 6MW data for this study were split into group A and group B, the SEM values for those with less disability decreased and the SEM values for those with greater disability increased. Interestingly, the mean distance walked by our minimally disabled group was 353.36 m, which is more similar to the result for the severely disabled group (EDSS 4.5-6.5) in the study by Goldman et al. 10 Our more disabled group had a slower speed, averaging 228.15 m in 6 minutes, similar to the moderate MS group in the study by Gijbels et al., 9 who achieved a mean distance of 294 m, and to those with EDSS scores of 5 to 6.5 in the study by Learmonth et al., 27 who scored 246.88 m. The method of stratifying by disability (GNDS vs. EDSS) may explain this difference in walking speed in our minimally disabled group.
The contrast between the estimates of relative and absolute reliability for HHD highlights the importance of using both methods. For example, the ICC values for ankle dorsiflexion and elbow flexion are 0.799 and 0.852, respectively, suggesting excellent between-rater reliability, 25 a finding similar to a previous evaluation of between-rater reliability of HHD in 30 patients with neurologic conditions. 13 However, the ICC confidence intervals are wide, and the large values obtained for d̄, SDdiff, and SEM demonstrate the poor absolute agreement of these results. The variability between raters was greatest for knee extension, with the SEM at 31.7% of the mean score of the participants.
Two of the five therapists had not used HHD prior to the day of this reliability study but did receive training on the day of testing. Experience with HHD has been previously cited as a significant factor affecting between-rater reliability results. 13,28 It is possible that the inexperience of these two raters in using the tool contributed to the increased error of the HHD scores. Physical strength differences between the assessors may also have been a factor, as postulated in earlier studies of HHD reliability. 13,28 As all testing took place in a single day, fatigue may have influenced the variability of the HHD data. However, in order to minimize the effects of fatigue, at least 1 hour of rest was incorporated between tests. The proportion of change in the HHD measures required to exceed the criterion of the SEM calls into question whether the values obtained by two different therapists can be used before and after an intervention.
Until further research is carried out on the psychometric properties of these measures using larger sample sizes in this population, we suggest, based on the results of this and previous studies, 12,13,28 that only HHD measures taken by the same rater before and after intervention may be reliably used to assess the effectiveness of the intervention.
The small sample size is the main limitation of this study, and the external validity and precision of the estimates would be improved considerably by increasing the sample size. The data on absolute and relative reliability would need to be replicated in a larger study in conjunction with test-retest reliability estimates before application in clinical practice. When combined with the information from previous studies, however, these data do provide important preliminary evidence that between-rater reliability decreases as disability increases.
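For readers who wish to reproduce the absolute-reliability calculations described under Data Analysis, the following is a minimal Python sketch of the Bland and Altman quantities (d̄, SDdiff, and the SE of d̄) and of SEM = SDdiff / √2. It is an illustration only: the analysis reported here was run in SPSS, the function name and the scores are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import math
from statistics import mean, stdev


def between_rater_agreement(rater_a, rater_b):
    """Bland and Altman summary plus SEM for paired scores from two raters."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same participants")
    if len(rater_a) < 2:
        raise ValueError("at least two paired scores are needed")

    # Paired differences, rater A minus rater B.
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    d_bar = mean(diffs)
    sd_diff = stdev(diffs)  # assumption: sample SD (n - 1 denominator)
    return {
        "d_bar": d_bar,                               # mean difference
        "sd_diff": sd_diff,                           # SD of the differences
        "se_d_bar": sd_diff / math.sqrt(len(diffs)),  # SE of the mean difference
        "sem": sd_diff / math.sqrt(2),                # SEM = SDdiff / sqrt(2)
    }


# Hypothetical BBS scores for six paired ratings (NOT data from this study).
rater_a = [50, 42, 56, 38, 47, 53]
rater_b = [48, 43, 56, 35, 45, 52]

summary = between_rater_agreement(rater_a, rater_b)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Because the SEM is expressed in the units of the measure (points for the BBS, meters for the 6MW, newtons for HHD), it can be compared directly with an observed change in score.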

Conclusion
The results of this between-rater reliability study suggest that there is strong agreement between raters for the BBS and 6MW. This is reflected by the high correlation coefficients, low SEM values, and mean differences that are a small percentage of the scores obtained.
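The SEM as a percentage of the mean score can be illustrated with a short Python sketch. The SEM and mean values below are those quoted in the text of this article; pairing each SEM with the mean shown is an assumption made for illustration, the function name is hypothetical, and the percentages in Table 2 remain the reference figures.

```python
def sem_percent_of_mean(sem, mean_score):
    """Express the SEM as a percentage of the mean score on the same measure."""
    if mean_score <= 0:
        raise ValueError("mean score must be positive")
    return 100.0 * sem / mean_score


# (SEM, mean score) pairs quoted in the text; the pairing is an assumption.
quoted_values = {
    "BBS, all participants (points)": (1.77, 46.71),
    "6MW, group A (m)": (13.26, 353.36),
    "6MW, group B (m)": (33.14, 228.15),
}

for label, (sem, mean_score) in quoted_values.items():
    percent = sem_percent_of_mean(sem, mean_score)
    print(f"{label}: SEM is {percent:.1f}% of the mean score")
```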
The results for HHD are more difficult to interpret, as the correlation coefficients contradict Bland and Altman data. Nevertheless, it is clear that knee extension strength has the greatest variability of all the outcome measures. Given the variations illustrated in this study, it must be concluded that HHD may not be a reliable measure when used by more than one rater and that only the results before and after intervention from the same rater may be considered reliable.
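The apparent contradiction arises partly because the Fleiss cut-offs cited in the Discussion consider only the size of the coefficient. A minimal Python sketch of that banding follows; the function name is hypothetical, and the label "poor" for values below 0.4 is an assumption, since the text quotes only the two upper bands.

```python
def fleiss_band(coefficient):
    """Band a reliability coefficient using the cut-offs quoted from Fleiss."""
    if coefficient > 1.0:
        raise ValueError("a reliability coefficient cannot exceed 1.0")
    if coefficient > 0.75:
        return "excellent"
    if coefficient >= 0.4:
        return "fair to good"
    return "poor"  # assumption: this band is not defined in the text


# Coefficients reported in this article.
reported = {
    "6MW, all participants (ICC)": 0.98,
    "BBS, group B (Spearman)": 0.60,
    "HHD elbow flexion (ICC)": 0.852,
    "HHD hip extension": 0.647,
}

for label, value in reported.items():
    print(f"{label}: {value} -> {fleiss_band(value)}")
```

A coefficient can therefore fall in the "excellent" band while the SEM for the same measure is a large fraction of the mean score, which is why both relative and absolute estimates are reported.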
Despite the study limitation of small sample size, these data provide preliminary evidence to suggest that as disability increases, the agreement between raters on these standardized outcome measures decreases.