Dynamic postural control has gained popularity as a more useful assessment of function than static postural control. One measurement of dynamic postural control that has increased in frequency of use is the Star Excursion Balance Test (SEBT). Although the intrarater reliability of the SEBT is excellent, few authors have determined interrater reliability. Preliminary evidence has shown poor reliability between assessors.
To determine interrater reliability using a group of investigators at 2 testing sites. A corollary purpose was to examine the interrater reliability when using normalized and nonnormalized performance scores on the SEBT.
Descriptive laboratory study.
University research laboratory.
A total of 29 healthy participants between 18 and 50 years of age.
Participants were evaluated by 5 raters at 2 testing sites. After participants performed 4 practice trials, each rater assessed 3 test trials in the anterior, posteromedial, and posterolateral reaching directions of the SEBT.
Reaching distances were analyzed both normalized to leg length and nonnormalized. Additionally, the mean and maximum values from the 3 test trials were analyzed, producing a total of 16 variables.
For all 16 measures, the interrater reliability was excellent. For the normalized maximum excursion distances, the intraclass correlation coefficients (1,1) ranged from 0.86 to 0.92. Reliability for the nonnormalized measurements was stronger, ranging from 0.89 to 0.94.
When the raters have been trained by an experienced rater, the SEBT is a test with excellent reliability when used across multiple raters in different settings. This information adds to the body of knowledge that exists regarding the usefulness of the SEBT as an assessment tool in clinical and research practice. Establishing excellent interrater reliability with normalized and nonnormalized scores strengthens the evidence for using the SEBT, especially at multiple sites.
When multiple raters in different settings were trained by an experienced rater, the Star Excursion Balance Test had excellent reliability.
Whether the chosen outcome was the average or the maximum score and whether raw or normalized data were used, the anterior, posteromedial, and posterolateral directions had excellent reliability.
Clinicians often use postural-control assessments to evaluate the risk of injury, initial deficits resulting from injury, and the level of improvement after intervention for an injury. Dynamic postural control has gained popularity in clinical and research settings as an assessment of function. One measurement of dynamic postural control that has increased in frequency of use is the Star Excursion Balance Test (SEBT). The measure of dynamic postural control is inferred from how far a participant can reach while maintaining a base of support. Widespread use of the SEBT in the clinical and research settings has demonstrated its strong capability to differentiate patients with lower extremity conditions such as ankle instability,1–8 anterior cruciate ligament reconstruction,9 and patellofemoral pain.10 Additionally, the SEBT can assess improvements in dynamic postural control after exercise interventions.4,11,12
The limited literature available suggests that when a single investigator performs the assessments and the participant has had an adequate number of practice trials, the conventional categorization of reliability13 is consistently moderate or better, even across multiple days of testing.9,14–16 Kinzey and Armstrong15 were the first to examine the reliability of the SEBT. A single investigator conducted trials on 20 healthy participants, who performed reaches in 4 directions of the SEBT during 2 sessions, with moderate to strong intraclass correlation coefficient (ICC) scores ranging from 0.67 to 0.87.15 Hertel et al14 recruited 16 healthy women who performed all 8 directions of the SEBT over 2 testing sessions, with 2 investigators evaluating each participant on each day. The range of interrater reliability for the 8 directions was wide (ICC = 0.35–0.93), whereas the range of intrarater reliability was narrower, with stronger reliability scores (ICC = 0.78–0.96). The authors attributed the wide range in interrater reliability to lower scores that occurred on the first day of testing and were a potential artefact of a learning effect. They recommended 6 practice trials to overcome the learning effect and consequently improve reliability of the measure. More recently, Robinson and Gribble17 found that 4 practice trials were sufficient to overcome the learning effect, with better consistency in subsequent test trials. Similarly, Munro and Herrington16 noted that reliability between test trials improved after a fourth consecutive trial, with excellent reliability (ICC = 0.84–0.92) in the subsequent 3 trials. Thus, repeated testing using the SEBT is reliable after 4 practice trials are implemented to account for the learning effect associated with the test.
Although the intrarater reliability of the SEBT is excellent, few investigators have determined interrater reliability, and preliminary evidence suggests poor reliability between assessors.14 Plisky et al18 showed that the SEBT may be a useful, inexpensive tool to screen athletes for risk of lower extremity musculoskeletal injury and reported strong intrarater reliability (ICC = 0.84–0.87) and test-retest reliability (ICC = 0.89–0.93). Their work demonstrates a likely underused purpose for the SEBT: assessment of dynamic postural control in the clinical and research settings. This could lead to large-scale, collaborative efforts to establish appropriate screening methods for injury prevention.
However, we first need to determine whether multiple assessors at different sites can provide strong consistency in SEBT measures. To date, no authors have had more than 2 assessors examine interrater reliability; a larger number of assessors is paramount to strengthening interrater reliability and expanding the use of this inexpensive tool to multi-site applications in which a number of individuals may be sharing information about patient screening. Interrater reliability is also essential to underpin the development of prevention and intervention strategies for lower extremity injury. Additionally, since the first reliability studies were performed, the now-accepted practice is to use normalized reach distances (reach distance / leg length).19 Therefore, we must revisit the reliability of the SEBT using normalized, rather than absolute, reach distances.
The primary purpose of our study was to determine interrater reliability using a group of investigators at 2 testing sites. The raters were all trained by the same instructor. This approach to determining reliability was used to establish whether the SEBT can be applied in large-scale assessments of dynamic postural control.
A corollary purpose was to examine the interrater reliability when using normalized and nonnormalized scores of SEBT performance. Until now, the reliability of normalized scores has not been examined. We also wished to examine the interrater reliability when using average and maximum performance scores on the SEBT. Traditionally, average scores are used, but when the SEBT is conducted in clinical and research practices, the largest distance from a series of test trials may be more appropriate; however, reliability has not yet been established for this approach. A tertiary purpose was to establish the interrater reliability of leg-length measures. Because these measures are essential for normalizing SEBT performance, we needed to examine their reliability alongside the reliability of the normalized SEBT values.
A total of 29 participants (19 women, 10 men, age = 31.72 ± 10.8 years, height = 169.52 ± 8.8 cm, mass = 65.58 ± 12.3 kg) from the University of Toledo (Ohio) and the University of Sydney (Australia) volunteered for the study. Each participant completed an injury-history questionnaire and was free of any known musculoskeletal injury or other condition that would preclude completion of the test. Before the study, all participants read and signed an informed consent form approved by the Human Research Ethics Committee at each university.
Before the test sessions at each laboratory, an investigator with more than 11 years of experience (P.A.G.) with the SEBT instructed the raters at the test site using a script and a standardized demonstration. This investigator then served as the practice model for the other raters at each site and established that the raters were properly instructed and could take measures independently. Participants were scheduled for the day after the raters were trained. On testing days, 3 raters (the supervising investigator and 2 trained raters) each assessed the SEBT performance of each participant.
A total of 29 individuals volunteered to participate in the trial: 19 at 1 test site and 10 at the other site. At each site, participants reported to the laboratory for a single testing session. The stance leg was determined by randomization. The length of the stance leg was measured from the anterior-superior iliac spine to the most distal point of the ipsilateral medial malleolus, using a standard tape measure while participants lay supine on a plinth.
Each participant's performance on the SEBT was rated by all 3 raters in the manner described in the “Performance of the SEBT” section, with the order of raters being randomized. A verbal and visual demonstration of the SEBT was given to participants by the first rater, and the constraints of the test were explained. All participants then underwent the same protocol, and their SEBT performance was measured by the 3 raters. To minimize any learning effects, participants performed 4 practice trials of the SEBT in each direction, in any order, with the rater to whom they were initially randomly assigned.17 The participant then performed 3 test trials in each direction for each of the 3 raters, sitting in a chair to rest for 5 minutes between raters. The 3 reach directions tested were anterior (ANT), posteromedial (PM), and posterolateral (PL). The order of the reach directions was randomized for each participant and kept constant across all 3 raters.
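The randomization described above can be illustrated with a minimal sketch. It assumes a seeded pseudorandom generator and hypothetical rater labels; the function name `make_schedule` and the dictionary layout are illustrative only and are not the procedure the investigators actually used to randomize.

```python
import random

DIRECTIONS = ("ANT", "PM", "PL")

def make_schedule(rater_ids, rng):
    """Build one participant's schedule (hypothetical helper).

    The rater order is shuffled, and a single direction order is drawn
    once and reused for every rater, as the protocol describes.
    """
    rater_order = rng.sample(list(rater_ids), k=len(rater_ids))
    direction_order = rng.sample(DIRECTIONS, k=len(DIRECTIONS))
    return {
        # The first rater gives the demonstration and observes practice trials.
        "practice_rater": rater_order[0],
        "rater_order": rater_order,
        # Same direction order for all 3 raters.
        "direction_order": direction_order,
    }

# Hypothetical rater labels; the seed is arbitrary and only for repeatability.
rng = random.Random(0)
schedule = make_schedule(["rater_A", "rater_B", "rater_C"], rng)
print(schedule)
```

Drawing the direction order once per participant, rather than once per rater, is what keeps it constant across the 3 raters.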
Performance of the SEBT
Participants performed the SEBT by standing in the middle of a testing grid with strips of tape placed at 45° angles, reaching with 1 foot as far as possible along the different grid lines, and then returning to the starting position.20,21 The goal was to have the individual establish a stable base of support on the stance limb at the apex of the testing grid and maintain support through a maximal reach excursion in multiple directions.20,21 While standing barefoot or in socks on a single limb and keeping the hands on the hips, the participant made an effort to reach as far as possible with the reaching limb along each tape measure; touch lightly on the tape measure with the most distal portion of the reaching foot, without shifting weight to or coming to rest on the foot of the reaching limb; and return the reaching limb to the start position at the apex of the grid, resuming a stable bilateral stance. Standardized oral instructions were given to every participant (Table 1). A trial was not considered complete if the participant touched heavily or came to rest at the touchdown point, had to make contact with the ground with the reaching foot to maintain balance, or lifted or shifted any part of the foot of the stance limb during the trial.19,21
Although the SEBT consists of 8 directions, conventional testing procedures have adopted a condensed version of the test, using the ANT (Figure 1), PM (Figure 2), and PL (Figure 3) reaching directions.11,18,22 For the ANT reach, the toes of the stance foot were placed at the 0 mark position of the anterior-reach direction line. For the PM and PL reaches, the heel was placed at the 0 mark position of the anterior-reach direction line. At the first testing station, 4 practice trials were required in each direction.17 Participants were afforded 5 minutes' rest between the practice and test trials.
From each reaching direction (ANT, PM, and PL), the excursion distances were recorded (cm) and considered the nonnormalized data. Additionally, the results from the 3 directions were averaged to create a composite nonnormalized score.
For the 4 dependent variables (ANT, PM, PL, and composite), the nonnormalized scores (cm) were recorded and analyzed. Additionally, the excursion distances in each direction (cm) were normalized by dividing by a participant's leg length (cm) and multiplying by 100 (normalized maximum excursion distance) for the percentage score.19
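As a minimal sketch of the normalization just described (reach distance divided by leg length, multiplied by 100), the following uses a hypothetical function name and made-up values that are not study data.

```python
def normalize_reach(reach_cm: float, leg_length_cm: float) -> float:
    """Express a reach distance as a percentage of leg length."""
    if leg_length_cm <= 0:
        raise ValueError("leg length must be positive")
    return reach_cm / leg_length_cm * 100.0

# Hypothetical values for illustration only (not study data).
normalized = normalize_reach(reach_cm=72.0, leg_length_cm=90.0)
print(round(normalized, 1))
```

Because both inputs are in centimeters, the result is unitless and can be compared across participants of different stature.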
Finally, using the nonnormalized and normalized scores from the 4 dependent variables, we determined 2 measures: the average and the maximum score. For the average score of each direction, the means and standard deviations from the 3 trials were used. For the maximum score, the maximum reach distance of the 3 trials was used.
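The two summary measures can be sketched as follows. The function names and trial values are hypothetical, and the composite is assumed here to be the simple mean of the ANT, PM, and PL scores for whichever summary (average or maximum) is being reported; the text does not spell out that detail.

```python
from statistics import mean

def summarize_trials(trials_cm):
    """Return (average, maximum) of the 3 test trials in one direction."""
    if len(trials_cm) != 3:
        raise ValueError("expected exactly 3 test trials")
    return mean(trials_cm), max(trials_cm)

def composite(score_by_direction):
    """Average the ANT, PM, and PL scores into a composite (assumed definition)."""
    return mean(score_by_direction[d] for d in ("ANT", "PM", "PL"))

# Hypothetical trial data in cm (not study data).
trials = {"ANT": [60.0, 62.0, 64.0], "PM": [88.0, 90.0, 92.0], "PL": [82.0, 84.0, 86.0]}
averages = {d: summarize_trials(t)[0] for d, t in trials.items()}
maxima = {d: summarize_trials(t)[1] for d, t in trials.items()}
print(composite(averages), composite(maxima))
```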
In addition to the SEBT analyses, the interrater reliability of the single trial of leg-length measures needed to be established because this value is used in the normalized version of SEBT performance reporting. Therefore, this measure was included as a separate variable for analysis.
Thus, in total, 16 SEBT variables and 1 leg-length variable were available for our analyses. All variables were analyzed for 1 randomly selected limb from each participant.
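The count of 16 SEBT variables follows from crossing 4 dependent variables with 2 scalings and 2 summary measures. A short sketch of that enumeration, with hypothetical variable labels, is:

```python
from itertools import product

DIRECTIONS = ("ANT", "PM", "PL", "composite")
SCALINGS = ("nonnormalized", "normalized")
SUMMARIES = ("average", "maximum")

# 4 dependent variables x 2 scalings x 2 summaries = 16 SEBT variables.
sebt_variables = [
    f"{direction}_{scaling}_{summary}"
    for direction, scaling, summary in product(DIRECTIONS, SCALINGS, SUMMARIES)
]

# The single leg-length measure is analyzed as a separate, 17th variable.
all_variables = sebt_variables + ["leg_length"]
print(len(sebt_variables), len(all_variables))
```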
Interrater reliability refers to variation between 2 or more assessors who measure the same group of participants.13,23 We used a conservative interpretation of interrater reliability by pooling data from 2 testing sites to determine whether consistent SEBT assessments could be achieved by multiple raters, all instructed by the same source. Using an ICC (1,1) model, we examined the reliability of a single assessment by a rater, wherein different sets of raters assessed different groups of participants.13,23,24 The interrater reliability of the SEBT was determined by calculating ICCs (1,1) with 95% confidence intervals for each of the 4 primary variables (ANT, PM, PL, and composite) for both normalized and nonnormalized measurements. Within the groupings of normalized and nonnormalized measurements, an ICC (1,1) was calculated for the average and maximum score from the 3 trials for each of the 4 primary variables. Additionally, this model was applied to the leg-length measure. The data from the 2 sites were pooled, and an ICC (1,1) was used because participants were rated by different sets of 3 raters. An ICC (1,1) of <0.4 represents poor reliability; 0.4 to 0.75, fair to good reliability; and >0.75, excellent reliability.25
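For readers who want to see how an ICC (1,1) is obtained, the following is a minimal sketch that assumes the standard one-way random-effects formulation, ICC (1,1) = (MSB − MSW) / (MSB + (k − 1) MSW), where MSB and MSW are the between-participant and within-participant mean squares and k is the number of ratings per participant. The function names and the example ratings are hypothetical, this is not the authors' analysis code, and the 95% confidence interval (which requires F-distribution quantiles) is omitted.

```python
def icc_1_1(ratings):
    """One-way random-effects, single-rating ICC.

    `ratings` is a list of rows, one per participant, each holding the
    k ratings that participant received (raters need not be the same
    people from row to row, which is why the one-way model applies).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 participants and equal k >= 2 ratings per row")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-participant and within-participant mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denominator = msb + (k - 1) * msw
    if denominator == 0:
        raise ValueError("no variance in ratings; ICC is undefined")
    return (msb - msw) / denominator

def classify_reliability(icc):
    """Apply the cutoffs quoted in the text (<0.4, 0.4 to 0.75, >0.75)."""
    if icc < 0.4:
        return "poor"
    if icc <= 0.75:
        return "fair to good"
    return "excellent"

# Hypothetical ratings: 3 participants, 3 ratings each (not study data).
example = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
value = icc_1_1(example)
print(round(value, 3), classify_reliability(value))
```

In this toy example MSB is 27 and MSW is 1, so the coefficient is 26/29. In practice an established statistics package would also supply the confidence interval.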
For all 16 measures, the interrater reliability was excellent. For the normalized maximum excursion distances, the ICC (1,1) ranged from 0.86 to 0.92 (Table 2). Reliability for the nonnormalized measurements was stronger, ranging from 0.89 to 0.94 (Table 3). The interrater reliability of the leg-length measurement was excellent (ICC [1,1] = 0.92, 95% confidence interval = 0.86, 0.96). The ICC (1,1) and 95% confidence interval for the average, maximum, and composite scores in each direction, for both normalized and nonnormalized measurements, are shown in Tables 2 and 3, respectively.
Assessment of dynamic postural control with the SEBT had excellent interrater reliability, as per the classification of Fleiss.25 For each type of measure (average and maximum) and for both normalized and nonnormalized data, our results demonstrate strong consistency of measurements by multiple investigators for the ANT, PM, PL, and composite scores. This information adds to the body of knowledge regarding the usefulness of the SEBT as an assessment tool in clinical and research practice. Establishing excellent interrater reliability with normalized and nonnormalized scores supports the use of the SEBT, especially at multiple sites.
This is the first study in which more than 2 raters evaluated the interrater reliability of the SEBT. Five investigators measured the participants overall; however, each participant was assessed by 3 investigators, and testing took place at 2 sites. Experience varied among the investigators, but each was trained by the same SEBT expert before testing. Despite the varied experience of the investigators, the interrater reliability results were excellent. This allows us to conclude that the SEBT can be used with confidence across raters of different experience levels without compromising the reliability of results, provided each rater is initially trained in the measurement of the SEBT by an experienced rater. This finding has promising clinical and research implications: consistent, reliable data can be collected when multiple investigators at multiple sites are trained and then assess participants performing the SEBT. A limitation to the application of our findings is that an expert provided the training and was involved in rating the participants. The next needed step is to determine whether similar levels of reliability can be obtained with the use of written or video instructions (or both) that could be distributed to clinicians and researchers.
In the only previous report of interrater reliability of the SEBT, Hertel et al14 reported ICCs between 0.35 and 0.93 when 16 healthy females performed all 8 directions of the SEBT over 2 testing sessions and 2 investigators evaluated each participant on each day. Lower estimates of reliability were found on day 1 of testing: ICCs were 0.76, 0.58, and 0.80 for the ANT, PL, and PM directions, respectively, less than demonstrated in our study. Hertel et al14 included the initial published recommendation for a specific number of requisite practice trials before SEBT assessment because of higher ICC values on a second day of testing. The interrater reliability scores in our study might be higher because we included 4 practice trials in each direction before the test was performed, based on a more recent study17 examining the learning effect during SEBT performance.
Furthermore, although it is not stated in their testing protocol, Figure 1 in the investigation of Hertel et al14 shows a participant performing the test wearing footwear. Our participants performed the test barefoot, potentially allowing for a more accurate measurement of excursion distance along the tape measure. Additionally, Hertel et al14 did not normalize reaching distances. One purpose of our investigation was to establish the interrater reliability using the normalized procedures that are now standard for the SEBT.19 Both normalized and nonnormalized reaching distances were associated with stronger interrater reliability than previously reported.14 Again, this is likely because we afforded participants a specific number of practice trials, which was a recommendation from the initial reliability study by Hertel et al.14
Little difference was observed in the strength of the reliability between the normalized and nonnormalized reaching data. Although the ICC values associated with the nonnormalized data were slightly higher, all values indicated excellent reliability with narrow 95% confidence intervals. Therefore, SEBT performance should continue to be reported using normalized reaching distances as previously established,19 and we can be confident that the associated reliability will be excellent.
A tertiary purpose of this study was to establish interrater reliability of the leg-length measurements. Because this measurement is critical to the normalization of the SEBT reaches, it was imperative that this be reliable across our group of investigators. We demonstrated excellent interrater reliability using the selected leg-length measurement technique. This technique was chosen based on the original article19 describing normalizing SEBT reach distances. It should be noted that all the investigators were either credentialed clinicians or students in clinician preparation programs, and all had experience using this technique. Therefore, this result may apply only to raters with similar clinical backgrounds.
An additional secondary purpose of the study was to compare the reliability of the SEBT assessments when using the average of 3 trials versus the maximum reach distance among 3 trials. In more than half of the assessments, the average normalized value had higher associated ICC values than the maximum values. However, for all the assessments, the ICC values were strong (>0.81). Common practice is to use an average of 3 or more trials on the SEBT. These data could be interpreted to show that if reliability is strong when maximum trials are used, perhaps greater time efficiency could be gained from only recording the maximum trial. When working with large sample sizes, this reduction may be advantageous. Additional investigation will be needed to determine how much time is saved when using only the maximum of 3 trials.
Each investigator in our study was trained in the measurement of the SEBT before the testing sessions. An investigator with more than 11 years of experience in SEBT performance and measurement provided instruction to the raters at each site before deeming them competent to take measures independently. The Hertel et al14 reliability study was conducted early in the initial development of the SEBT. Since this time, the evidence base surrounding the SEBT has been well established, and the test has become widely used in both clinical and research settings. The pretesting training of raters in the current study by an individual with expertise using the SEBT may have been responsible for greater consistency of measurement among raters and may provide an explanation for the higher interrater reliability we obtained. We cannot conclude from our study exactly how much experience is needed to adequately train other individuals to ensure excellent interrater reliability. Future authors should investigate this factor to further strengthen the use of the SEBT in multi-site testing designs.
It is commonly accepted that to obtain accurate estimates of reliability in research, data should be collected from a wide range of participants. Whereas previous trials may have been limited in participant characteristics of either sex16 or age,17 the participants in our study ranged widely in age, height, and mass: women predominated, the age range was 21 to 57 years, the height range was 152 to 190 cm, and the mass range was 42 to 112 kg. Despite the demographic variability, interrater reliability results were excellent. This finding further supports the strength of the current results, improves the external validity of the study, and allows for greater generalizability of results.
The main limitation of our study is that only 3 reach directions were evaluated. Therefore, we can conclude only that interrater reliability is excellent for the ANT, PM, and PL reach directions, as well as the composite score of these 3 reaching directions. However, previous researchers5 have demonstrated considerable redundancy in performance of the different reach directions of the SEBT in participants with and without chronic ankle instability. It is now common practice to include only 3 reach directions to assess dynamic postural control with the SEBT, and the 3 reach directions in our trial are those most often used in both clinical and research settings.
Additionally, we had a relatively small total sample size, with unbalanced numbers of participants at the testing sites. Our sample sizes were based on convenience and were restricted by time and availability due to the nature of the multi-site design and the international collaboration. We believe that our study provides interesting and useful information, but future researchers could consider replicating our methods with larger and equal sample sizes at each testing site.
All participants performed the procedures without shoes, but we did not control whether the participants wore socks during the performances. Although this could pose a limitation, a recent investigation,26 using the same 3 reaching directions as those in our study, demonstrated no differences in SEBT performance between a bare foot and wearing a regular sock.
Finally, a potential limitation is that the verbal instructions and practice trials were only provided to the participant at the first testing station by the first randomly assigned rater, followed by performing test trials for each of the 3 raters. This protocol was implemented because our interpretation of the previous reliability studies14,17 suggested that during a single testing session, after a specific number of practice trials are allowed, no significant improvement in performance follows. This protocol is useful for studying reliability, but we do not believe that having 3 raters is practical. Instead, a single rater would provide the verbal instructions and observe the prescribed practice trials for a single patient. This is a small discrepancy between our study protocol and practical implementation, yet our results support the strong reliability in using the SEBT, as we have discussed in previous sections.
The SEBT is a reliable test when used across multiple raters in different settings, provided the raters are trained by an experienced rater. Reaching in the ANT, PM, and PL directions has excellent reliability, whether the chosen outcome is the average or maximum score using raw or normalized data. Therefore, researchers, especially those with clinical backgrounds, should be able to use this tool for assessing dynamic postural control after receiving instruction and practice, making it a suitable and inexpensive tool in clinical and research settings.