Children with autism spectrum disorder (ASD) display differences in multisensory function as quantified by several different measures. This study estimated the stability of variables derived from commonly used measures of multisensory function in school-aged children with ASD. Participants completed: a simultaneity judgment task for audiovisual speech, tasks designed to elicit the McGurk effect, listening-in-noise tasks, electroencephalographic recordings, and eye-tracking tasks. Results indicate that the stability of indices derived from tasks tapping multisensory processing varies across measures. These findings have important implications for measurement in future research. Averaging scores across repeated observations will often be required to obtain acceptably stable estimates and, thus, to increase the likelihood of detecting effects of interest related to multisensory processing in children with ASD.
Recent literature and updated diagnostic criteria suggest that sensory abnormalities represent a core feature of autism spectrum disorder (ASD; American Psychiatric Association [APA], 2013). Children with ASD have been observed to display unusual responses to stimuli presented within a number of sensory modalities (e.g., audition and vision) and to demonstrate atypical responses to sensory stimuli presented across multiple sensory modalities (i.e., multisensory stimuli, such as audiovisual stimuli) in studies employing a broad range of measures, including psychophysical tasks, eye tracking, and electroencephalography (EEG). Differences in responding to multisensory stimuli are most consistently observed, seemingly, for stimuli that are social in nature, specifically audiovisual speech stimuli (Baum et al., 2015; Irwin & DiBlasi, 2017; Smith et al., 2017; Stevenson et al., 2014).
The aforementioned findings have fostered interest in exploring the extent to which variables derived from measures of audiovisual speech processing and perception may be valid for predicting other core and related symptoms of ASD, and the degree to which such indices are potentially malleable with targeted treatment. It is critical, however, to first ascertain the stability of variables derived from measures of multisensory function, as this influences the potential validity of these variables for predicting ASD and related symptomatology, as well as detecting training and intervention effects (McCrae et al., 2011). Indeed, discrepant findings across past studies exploring multisensory processing in children with ASD could be explained by instability of the measures that have been employed in prior work (Feldman et al., 2018; Magnotti & Beauchamp, 2018). Therefore, the present study explores the stability of several variables derived from commonly used measures of multisensory function—in particular, variables indexing attention to and integration of audiovisual speech.
Attention to and Integration of Audiovisual Speech in Typically Developing Individuals
Speech is inherently a multisensory process, wherein highly synchronized cues from the moving mouth accompany the dynamic acoustic signal (Chandrasekaran et al., 2009). The perception of speech is influenced by the presence of these complementary visual speech cues (Calvert & Campbell, 2003; Massaro & Palmer, 1998). This influence is illustrated by the McGurk effect, a perceptual illusion wherein persons presented with incongruent auditory and visual speech cues (e.g., a visual “ga” paired with an auditory “ba”) often report perceiving an illusory percept (e.g., “da” or “tha”) purported to reflect a “fusion” of the mismatched multisensory information (McGurk & MacDonald, 1976).
Typically developing (TD) individuals attend to and integrate audiovisual speech cues very early in life (Soto-Faraco et al., 2012). Specifically, TD infants begin to look to the mouth of a speaker (the source of multisensory redundancy) by approximately 8 months of age (Lewkowicz & Hansen-Tift, 2012). Once this propensity to lipread emerges, TD individuals will continue to capitalize on the corresponding visual cues from the mouth across the lifespan whenever speech processing becomes challenging (e.g., when in a noisy environment and/or when faced with an unfamiliar dialect or language; Barenholtz et al., 2016; Buchan et al., 2008). TD children begin to show some sensitivity to the temporal synchrony of auditory and visual speech cues, looking preferentially to the visual cues that are more highly correlated with a fluent auditory speech stream by approximately their first birthday (i.e., 12–14 months of age; Lewkowicz et al., 2015), and display increasing temporal acuity for audiovisual speech throughout childhood and adolescence (Hillock-Dunn et al., 2016; Hillock et al., 2011; Lewkowicz & Flom, 2014).
Access to multisensory speech cues affords a number of functional benefits. Psychophysical studies have demonstrated that concurrent visual speech cues boost perceptual accuracy substantially for TD individuals, particularly in the presence of noise or otherwise difficult listening conditions (e.g., Fraser et al., 2010). Studies of electrophysiological responsiveness in TD children as well as adults have found that their processing of audiovisual speech is more efficient than their processing of auditory-only speech (e.g., Knowland et al., 2014; van Wassenhove et al., 2005). This increase in speech processing efficiency is evident via faster latencies and reduced amplitudes for multiple EEG waveform components indexing the brain's response to speech, in particular in the negative-going deflection that occurs around 100 ms (i.e., N1 or N100) and the positive deflection that occurs around 200 ms (i.e., P2 or P200) following stimulus onset (Knowland et al., 2014; van Wassenhove et al., 2005).
Attention to and Integration of Audiovisual Speech in Individuals With ASD
A large and ever-growing literature utilizing diverse measures of multisensory function suggests that children with ASD display differences in their attention to and integration of audiovisual speech relative to their TD peers (see Feldman et al., 2018, for a review). For example, studies using eye-tracking technology have shown that children with ASD display diminished attention to audiovisual speech (i.e., reduced looking to the mouth of a speaker) in comparison to TD controls (Grossman et al., 2015; Riby & Hancock, 2009). Investigations using psychophysical approaches have additionally reported that children with ASD, on average, tend to show reduced multisensory integration (i.e., report fewer perceptual fusions) in response to discrepant McGurk stimuli (Iarocci et al., 2010; Irwin et al., 2011; Williams et al., 2004). Furthermore, children with ASD have been observed to exhibit a lesser degree of audiovisual “gain” for speech-in-noise stimuli when compared to TD children (Foxe et al., 2015; Smith & Bennetto, 2007).
School-age children with ASD are also less attuned to the typical temporal relations between auditory and visual speech cues. In the context of a simultaneity judgment task, responses to paired audiovisual stimuli presented at various temporal offsets can be used to estimate the window of time over which an individual tends to “bind” auditory and visual information together, or perceive such multisensory information as arising from a unitary event (i.e., the temporal binding window; TBW). When compared to TD controls, children with ASD present with significantly wider TBWs (Stevenson et al., 2014; Woynaroski et al., 2013). Findings from studies using electroencephalography (EEG) and event-related potentials (ERPs) are limited for this clinical population; however, those that are available suggest that neural responses to multisensory stimuli differ for school-age children with ASD as a function of the severity of their core and related symptoms (Brandwein et al., 2013, 2015; Woynaroski et al., 2019).
A Need to Assess Stability of Variables Derived From Commonly Used Measures of Multisensory Speech Processing
The aforementioned findings of altered multisensory function have engendered interest in exploring whether measures of multisensory integration may be valid for predicting ASD and related symptomatology or may be sensitive to effects of sensory-based interventions geared towards children on the autism spectrum. However, we currently know little about the stability of commonly used estimates of audiovisual functioning across observations and contexts. In fact, to date, no study has comprehensively investigated the stability of the variables derived from measures routinely used to tap audiovisual speech processing and perception in children with ASD. Ascertaining the stability of indices of multisensory function in this clinical population is critical because the stability of any given variable limits its validity for detecting effects of interest (i.e., the validity cannot exceed the square root of the stability; DeVellis, 2006; Nunnally, 1978).
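As a worked illustration of this ceiling, the bound can be written out for two stability values. This is only a sketch in classical test theory notation: the symbols r_xy (the correlation between a measure and a criterion) and g_x (the stability of the measure) are introduced here for illustration, and the bound assumes the criterion itself is measured without error.

```latex
% r_{xy}: correlation between measure x and criterion y (illustrative symbol)
% g_x:    stability (reliability) of measure x (illustrative symbol)
r_{xy} \le \sqrt{g_x}
\qquad
g_x = .80 \;\Rightarrow\; r_{xy} \le \sqrt{.80} \approx .89,
\qquad
g_x = .31 \;\Rightarrow\; r_{xy} \le \sqrt{.31} \approx .56
```

Under these assumptions, a variable with a stability of .31 could correlate at most about .56 with any other variable, regardless of the true strength of the association.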
Generalizability (G) studies are a useful tool for measuring stability because they allow us to parse the variance in a given variable that is attributable to the construct of interest versus facets of measurement error. Decision (D) studies then draw upon the results of a generalizability study and extrapolate beyond observed data to predict the level of stability that would be achieved for a hypothetical number of observations (and/or levels of other facets of interest in the study; Mushquash & O'Connor, 2006). This study uses G&D studies to ascertain the degree to which variables derived from some of the most frequently used measures of multisensory function are stable and to determine how many observations are required to obtain acceptable stability for variables of interest (Yoder et al., 2018). Our specific research questions were:
How stable are variables derived from the various commonly used measures of multisensory function, in particular, indices of selective attention to and integration of audiovisual speech, in children with ASD?
How many observations are required to reach acceptable stability for each of the variables of interest?
Eleven children (7 male; 4 female) aged 7–16 years old participated in the study (see Table 1 for descriptive information). Eligibility criteria were as follows: (a) diagnosis of ASD as confirmed by research-reliable administrations of the Autism Diagnostic Observation Schedule, second edition (ADOS-2; Lord et al., 2012) and clinical judgment of a licensed clinician on the research team; (b) no history of seizure disorders; (c) no diagnosed genetic disorders, such as Fragile X, Down syndrome, or tuberous sclerosis; and (d) normal or corrected-to-normal vision and normal hearing, as confirmed by screening at entry to the study.
This study was conducted at Vanderbilt University. Participants completed a series of psychophysical, EEG, and eye-tracking measures once per day, on two different days, within a 1-week timeframe. For each participant, data collection was conducted in the same order on each measurement day, and each procedure took place at the same time of day across measurement days; however, procedure order was randomized across participants. All procedures were approved by the institutional review board at Vanderbilt University. Parents provided written informed consent, and participants provided written or verbal assent prior to participation in the study. All participants were compensated for their participation.
Participants completed all psychophysical measures in a sound- and light-attenuated booth (WhisperRoom Inc., Morristown, TN, USA). Stimulus presentation for all tasks was managed by E-Prime software. Visual stimuli were presented on a Samsung Syncmaster 2233RZ 22-inch personal computer (PC) monitor. Auditory stimuli were presented binaurally via Sennheiser HD550 series supra-aural headphones (simultaneity judgment task and McGurk tasks) or via an M-AUDIO BX8 D2 speaker (listening-in-noise task).
Simultaneity Judgment Task
Audiovisual stimuli for the simultaneity judgment task consisted of a neutral-faced adult female speaker saying the syllable “ba” against a white background. The auditory and visual components of the stimuli were separated in the video editing software Adobe Premiere. Stimuli were presented synchronously and asynchronously at various stimulus onset asynchronies (SOAs; the difference in the presentation of the auditory and visual components of the stimuli, with negative values indicating auditory first and positive values indicating visual first). Asynchronous stimuli were presented at 14 SOAs: ±500 ms, ±400 ms, ±350 ms, ±300 ms, ±250 ms, ±150 ms, and ±100 ms.
Each participant was instructed to report whether he or she saw and heard the speech at the “same time” or at a “different time” by pressing the “1” and “2” keys on the keyboard, respectively. To ensure comprehension of the task, participants completed a practice round, consisting of two trials of stimuli presented synchronously and two trials of stimuli presented at an SOA of ±900 ms in a randomized order. Participants were required to correctly respond to all items of the practice round prior to starting the task. After this comprehension check, synchronous trials and asynchronous trials at each SOA were presented four times, in a random order (total of 60 trials per run). Participants completed five runs (300 total trials) of the task each day.
Data from E-Prime were exported into MATLAB. TBWs were derived for each child by fitting two psychometric functions to the data for his or her reported rate of perceived synchrony across SOAs (i.e., the number of times that the child answered “synchronous” over the total number of trials presented for each SOA) using the glmfit function in MATLAB (see Powers, Hillock, & Wallace, 2009; Stevenson et al., 2014, for a detailed description of this approach), one for auditory-leading (left) trials and another for visual-leading (right) trials, after normalizing the data (i.e., setting the maximum value to 100%; see Figure 1). The point at which each psychometric function crossed 75% perceived synchrony was considered the left- and right-TBW. The TBW was then calculated as the difference between these values.
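The derivation described above can be sketched in Python as follows. This is a minimal illustration, not the study's MATLAB code: the function names are hypothetical, a hand-written Newton-Raphson logistic fit stands in for glmfit, and the synchronous (0 ms) condition is assumed to be included in both the left and right fits.

```python
import math

def fit_logistic(x, p, iters=50):
    """Fit p ~ 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson
    (binomial likelihood with a logit link, proportions as responses)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, pi in zip(x, p):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = mu * (1.0 - mu)
            g0 += pi - mu              # gradient w.r.t. intercept
            g1 += (pi - mu) * xi       # gradient w.r.t. slope
            h00 += w                   # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def crossing(b0, b1, level=0.75):
    """SOA at which the fitted curve equals `level`."""
    return (math.log(level / (1.0 - level)) - b0) / b1

def temporal_binding_window(soas, p_sync, level=0.75):
    """Return (left crossing, right crossing, TBW width) in the units of `soas`."""
    peak = max(p_sync)
    norm = [p / peak for p in p_sync]  # normalize so the maximum is 100%
    left = [(s, p) for s, p in zip(soas, norm) if s <= 0]   # auditory-leading
    right = [(s, p) for s, p in zip(soas, norm) if s >= 0]  # visual-leading
    l = crossing(*fit_logistic(*zip(*left)), level)
    r = crossing(*fit_logistic(*zip(*right)), level)
    return l, r, r - l

# Hypothetical proportions of "same time" responses at each SOA (ms).
soas = [-500, -400, -350, -300, -250, -150, -100, 0,
        100, 150, 250, 300, 350, 400, 500]
p_sync = [0.10, 0.20, 0.35, 0.45, 0.60, 0.85, 0.90, 0.95,
          0.95, 0.90, 0.80, 0.65, 0.50, 0.35, 0.15]
left, right, tbw = temporal_binding_window(soas, p_sync)
print(round(left), round(right), round(tbw))
```

Because auditory-leading SOAs are coded as negative, subtracting the left crossing from the right crossing yields the full width of the window.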
Audiovisual stimuli utilized in the McGurk task were derived from media files of the same adult female speaker previously described saying the syllables “ba,” “ga,” “pa,” and “ka” with a neutral facial expression. Adobe Premiere was used to create visual-only, auditory-only, matched audiovisual, and mismatched audiovisual (i.e., McGurk; auditory “ba” + visual “ga”; auditory “pa” + visual “ka”) stimuli. Participant responses were recorded via a four-button response box labeled with four syllables for each task (i.e., “ba,” “ga,” “da,” “tha” for the Ba/Ga task; “pa,” “ka,” “ta,” “ha” for the Pa/Ka task).
Participants completed two runs each of the two different McGurk tasks, one task with the auditory “ba” and visual “ga” syllables (which frequently induce a fused percept of “da” or “tha”) and one task with auditory “pa” and visual “ka” syllables (which frequently induce a fused percept of “ta” or “ha”). Prior to starting the task, participants were provided oral instructions to press the button that corresponded to the syllable they perceived during each trial (e.g., participants were told to “press ‘ba’ if you think she says ‘ba,’ ‘ga’ if you think she says ‘ga,’” etc.). Prior to each run of the task, the participants completed a comprehension check wherein they were prompted to press the designated button for each syllable in a random order. During each run, participants were presented with 10 trials of each syllable in the auditory-only, visual-only, and matched audiovisual conditions, and 10 trials of the incongruent audiovisual (McGurk) stimuli in a randomized order (70 trials per run). After each trial, participants reported the syllable they perceived using the four-button response box.
Data from E-Prime were exported into MATLAB. Magnitude of multisensory integration in response to McGurk measures was operationalized as the proportion of trials in which participants reported perceiving the illusory percept in response to incongruent audiovisual stimuli in each task (i.e., “da” and “tha” for the Ba/Ga task; “ta” and “ha” for the Pa/Ka task).
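The operationalization above amounts to a simple proportion. The following Python sketch illustrates it; the function name and the example responses are hypothetical and are not taken from the study's data.

```python
def fusion_proportion(responses, fused=("da", "tha")):
    """Proportion of incongruent (McGurk) trials answered with a fused syllable."""
    if not responses:
        raise ValueError("no incongruent trials to score")
    return sum(r in fused for r in responses) / len(responses)

# Hypothetical responses to the 10 incongruent trials of one Ba/Ga run.
ba_ga_run = ["da", "ba", "tha", "da", "ga", "da", "ba", "tha", "da", "ba"]
print(fusion_proportion(ba_ga_run))

# For the Pa/Ka task, the fused percepts would be passed explicitly.
pa_ka_run = ["ta", "pa", "ha", "ka"]
print(fusion_proportion(pa_ka_run, fused=("ta", "ha")))
```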
Listening-in-noise stimuli were videos of an adult female speaker saying monosyllabic words with a neutral facial expression (described in Picou et al., 2011). These words were arranged in eight lists of 25 words each (as in Picou et al., 2017). Each list was balanced for audibility, and stimuli were presented via a single (mono) speaker positioned above a monitor at 0° azimuth and calibrated to a sound level of 50 dB SPL. Speech-shaped noise was created in MATLAB by generating Gaussian white noise via the wgn function and shaping it based on the long-term average spectrum of the speech (LTASS; described in Donley, Ritz, & Kleijn, 2018). This noise was presented at 53 dB (for a −3 dB signal-to-noise ratio [SNR]) and 56 dB (for a −6 dB SNR). These SNRs were selected based on the largest group differences previously reported between individuals with ASD and individuals with typical developmental histories in this age range (Foxe et al., 2015). At each SNR, stimuli were presented in audiovisual and auditory-only conditions.
Four wordlists (1–4) were used on the first observation day (i.e., two modalities × two SNRs), and four different wordlists (5–8) were used on the second observation day. The testing order was randomized for each participant on each day. Participants were instructed to listen and repeat the word they perceived to a research assistant who then typed the word and confirmed the participants' responses orally and via the typed response on the monitor. To ensure comprehension of the task, participants were presented with five words without white noise. The participant was required to correctly identify each word before proceeding to the task. After this comprehension check, participants were presented with the four lists, one word at a time. Identification accuracy for listening-in-noise measures was calculated as the percent of whole words correctly identified in each condition for each recording day.
Stimuli used in the ERP measure were consonant-vowel syllables (i.e., “ba”) naturally spoken by an adult female speaker using a neutral facial expression (see simultaneity judgment and McGurk tasks) in two conditions. In the audiovisual (AV) condition, the corresponding auditory and visual stimuli were presented in synchrony (i.e., with the visual cues from the face and neck temporally preceding the auditory cues in onset as naturally produced by the speaker). In the auditory-only (AO) condition, the auditory stimulus was presented in conjunction with a static face (i.e., a still image of the speaker) in order to isolate the contribution of visual articulatory cues versus simply the presence/absence of a face on speech processing.
Stimuli were presented via E-Prime in conjunction with an Eyelink 1000 Plus eyetracker, which ensured that videos were presented only when participants were gazing at the screen (i.e., when each participant's gaze was focused on a fixation cross centered on the speaker's face for the 500 ms interval immediately preceding stimulus presentation). Data were collected using NetStation and a 128-channel Geodesic sensor net (Net Amps 400 amplifier, Hydrocel GSN 128 EEG cap, EGI Systems Inc.). The raw EEG signal was sampled at 1000 Hz and referenced to the vertex (Cz). Electrode impedances were kept at or below 40 kΩ.
Prior to the task, children's eye gaze was calibrated using a five-point calibration procedure. This procedure was performed twice to validate accuracy of calibration. Participants then viewed a video wherein a member of the research team briefly described the task, and a TD peer modeled the task (i.e., wore the EEG cap and attended to the screen during stimulus presentation). Following calibration and presentation of the introductory video, the experimental task was initiated. The task employed an equiprobable paradigm, wherein 50 trials of each stimulus type (i.e., AV and AO, as described above) were presented in random order in each of two blocks, for a total of 100 trials of each stimulus type across the two blocks. Trials were separated by an interstimulus interval (ISI) that was randomly jittered between 400 ms and 800 ms, plus the 500 ms gaze contingency period (i.e., an effective ISI of at least 900 ms to 1300 ms). Between the two blocks, children took a scheduled break. During each block, images of cartoon aliens were presented periodically in between trials (i.e., after every fourth trial, there was a 50% chance of an alien image appearing) to maintain participant attention to the task. Participants were instructed to hit a BIGmack button (AbleNet Inc., Roseville, MN, USA) to “catch” the aliens each time one appeared on the screen.
EEG data were bandpass filtered from 0.5 Hz to 50 Hz using the EEGLAB firfiltnew.m function, which implements a bidirectional zero-phase finite impulse response filter, and artifacts and bad channels were manually removed in EEGLAB (Delorme & Makeig, 2004). An average of 73.2% of trials (146.4 trials) were retained across children. After data were cleaned, they were re-referenced to the average, and removed channels were interpolated. Trials were baseline corrected from 200 ms to 0 ms pre-stimulus onset. The amplitude and latency of the N1 (i.e., window defined a priori as occurring between 100 ms and 140 ms post-stimulus onset) and P2 (i.e., window defined a priori as occurring between 160 ms and 240 ms) components as measured at a centrally located electrode site (Cz) were extracted from the average waveform of each participant for each EEG observation day and manually reviewed (see Table 2 for further detail regarding ERP variables).
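The baseline correction and component extraction described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's pipeline: the function name is hypothetical, and peak picking (most negative point for N1, most positive point for P2) is assumed because the text does not specify whether peak or mean amplitude was used.

```python
def erp_peaks(times_ms, avg_uv):
    """Baseline-correct an averaged waveform (Cz) and return, for N1 and P2,
    a tuple of (peak amplitude, peak latency in ms)."""
    # Mean of the 200 ms pre-stimulus interval serves as the baseline.
    base = [v for t, v in zip(times_ms, avg_uv) if -200 <= t < 0]
    offset = sum(base) / len(base)
    corrected = [v - offset for v in avg_uv]

    def peak(lo, hi, pick):
        # Tuples compare on amplitude first, so min/max select the peak sample.
        window = [(v, t) for t, v in zip(times_ms, corrected) if lo <= t <= hi]
        return pick(window)

    return {
        "N1": peak(100, 140, min),  # negative-going deflection
        "P2": peak(160, 240, max),  # positive-going deflection
    }

# Hypothetical averaged waveform sampled every 1 ms (1000 Hz).
times = list(range(-200, 401))
wave = [2.0] * len(times)
wave[times.index(120)] = -3.0  # artificial N1 trough
wave[times.index(200)] = 6.0   # artificial P2 peak
print(erp_peaks(times, wave))
```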
Stimuli utilized in the eye-tracking measure were 50-second video clips used in several past studies of attention to multisensory speech (e.g., Lewkowicz & Hansen-Tift, 2012; Pons et al., 2019). In each video clip, an adult female actor recited a prepared monologue in children's native language (i.e., English) or a non-native language (i.e., Spanish) in a child-directed manner (i.e., with high pitch excursions, prosodically exaggerated speech and slow articulation, while smiling) or in an adult-directed manner (i.e., with minimal pitch variation, average speed of articulation, and neutral affect). Visual stimuli were presented on a 24-inch computer monitor positioned approximately 50 cm in front of the participant. Auditory stimuli were presented at 75 dB by an M-AUDIO BX8 D2 speaker placed in front of the participant just below the computer monitor. A SensoMotoric Instruments (SMI) REDn Scientific Eye Tracking System (SMI, Teltow, Germany) was used to control stimulus presentation and randomization and to track eye gaze via pupil-centered corneal reflection.
Participants were seated in front of the eye-tracking system, monitor, and speaker. Eye gaze was calibrated using a five-point calibration procedure during which participants were instructed to watch a looming star that moved from the center to each corner of the computer screen. Following calibration, participants were presented with the four video clips (i.e., English and Spanish clips presented in a child- and adult-directed manner) in random order, on each observation day. Prior to the presentation of each video clip, participants were instructed to “please watch the movie.”
SMI's BeGaze software was utilized to automatically quantify the duration of looking to a priori specified areas of interest (AOIs; i.e., the mouth, eyes, and face) during stimulus presentation (see Figure 2). Attention to audiovisual speech was operationalized as the proportion of total looking time deployed to the mouth AOI (the source of multisensory redundancy) and the eye AOI (a commonly used contrast region), respectively, out of the total time spent fixating on any part of the face during stimulus presentation in each condition (i.e., English infant-directed, Spanish infant-directed, English adult-directed, Spanish adult-directed).
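The attention variables described above reduce to two ratios per condition. A minimal Python sketch follows; the function name and the looking durations are hypothetical and serve only to illustrate the computation.

```python
def aoi_proportions(mouth_ms, eyes_ms, face_ms):
    """Share of face-directed looking time spent on the mouth and eye AOIs."""
    if face_ms <= 0:
        raise ValueError("no looking time on the face")
    return mouth_ms / face_ms, eyes_ms / face_ms

# Hypothetical looking durations (ms) for one 50-second clip.
mouth_prop, eyes_prop = aoi_proportions(mouth_ms=12000, eyes_ms=18000, face_ms=40000)
print(mouth_prop, eyes_prop)
```

Dividing by time on the face rather than by clip duration keeps the index from being driven by how long a child looked at the screen overall.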
Generalizability (G) and Decision (D) studies were carried out using EduG (Swiss Society for Research in Education Working Group, 2012). EduG is freeware created specifically for generalizability analysis. G and D studies were conducted on the variables derived from psychophysics, ERP, and eye-tracking measures (see Table 2 for a summary). For each of these variables, random effects models constituting a total of 22 observations (11 participants × 2 days) in a crossed design (participant × day) were run. Absolute g coefficients, which are preferred over relative g for their inclusion of all effects of measurement facets in the computation of the coefficient (Yoder et al., 2018), were derived to quantify the stability achieved for observed data (i.e., one and two observations). In the D studies, the g coefficient was projected beyond the number of observed sessions to determine how many observations would be needed to achieve acceptably stable scores. Our a priori threshold for acceptable stability was set at g = .8, a criterion commonly applied in previous stability studies (e.g., Bottema-Beutel et al., 2019; Sandbank & Yoder, 2014; Woynaroski et al., 2017; Yoder et al., 2016).
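For readers unfamiliar with the computation, the following Python sketch shows how an absolute g coefficient can be obtained for a crossed participant × day design with one score per cell. It illustrates the standard ANOVA-based estimators and is not EduG's implementation: the function names are hypothetical, and setting negative variance estimates to zero is assumed here as a common convention.

```python
def variance_components(scores):
    """scores[p][d]: one score per participant (row) and day (column).
    Returns (participant, day, residual) variance components."""
    n_p, n_d = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_d)
    p_means = [sum(row) / n_d for row in scores]
    d_means = [sum(row[d] for row in scores) / n_p for d in range(n_d)]
    ms_p = n_d * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_d = n_p * sum((m - grand) ** 2 for m in d_means) / (n_d - 1)
    ss_res = sum((scores[p][d] - p_means[p] - d_means[d] + grand) ** 2
                 for p in range(n_p) for d in range(n_d))
    ms_res = ss_res / ((n_p - 1) * (n_d - 1))
    # Negative estimates are set to zero (assumed convention).
    var_p = max((ms_p - ms_res) / n_d, 0.0)
    var_d = max((ms_d - ms_res) / n_p, 0.0)
    return var_p, var_d, ms_res

def absolute_g(var_p, var_d, var_res, n_obs=1):
    """Absolute g: participant variance over participant plus all error
    variance, with error divided by the number of observations averaged."""
    error = (var_d + var_res) / n_obs
    total = var_p + error
    return var_p / total if total > 0 else 0.0

# Hypothetical scores for three participants on two days.
demo = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
components = variance_components(demo)
print(absolute_g(*components, n_obs=1), absolute_g(*components, n_obs=2))
```

Unlike a relative coefficient, the absolute coefficient counts the day main effect as error, so a systematic shift in scores from the first day to the second lowers the estimate.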
Variables Derived From Eye-Tracking Measure
Variables derived from the eye-tracking measure showed relatively high stability. For English adult-directed speech, stability was high for both proportion of time looking at the mouth (g for a single observation = 0.91) and proportion of time looking at the eyes (g for a single observation = 0.87). These variables exceeded acceptable stability with one observation. During English infant-directed speech, stability was somewhat lower but still relatively high for proportion of time looking at the mouth (g for a single observation = 0.70) and proportion of time looking at the eyes (g for a single observation = 0.75). Both of these variables were acceptably stable after two observations.
For Spanish adult-directed speech, proportion of time looking at the mouth (g for a single observation = 0.95) and proportion of time looking at the eyes (g for a single observation = 0.96) were both highly stable. These variables both exceeded our established threshold for acceptable stability with a single observation. For Spanish infant-directed speech, proportion of time looking at the mouth (g for a single observation = 0.74) and proportion of time looking at the eyes (g for a single observation = 0.93) were acceptably stable after two observations and one observation, respectively. Figure 3 depicts the results for variables derived from the eye-tracking measure.
Variables Derived From Psychophysical Measures
The stability of variables derived from psychophysical measures was mixed. Proportion of reported McGurk illusions in response to auditory “pa” and visual “ka” and TBW for audiovisual speech showed the highest stability (g coefficients = 0.84 and 0.74 for a single observation, respectively). These variables were acceptably stable after one and two observations, respectively. The remaining variables, the proportion of reported McGurk illusions in response to auditory “ba” and visual “ga” stimuli (g for a single observation = 0.47) and whole-word recognition of audiovisual speech presented at –3 dB SNR (g for a single observation = 0.31) and –6 dB SNR (g for a single observation = 0.19), were less stable. The D studies indicated that proportion of reported Ba/Ga McGurk illusions would be acceptably stable after five observations. Whole-word recognition of audiovisual speech presented at –3 dB SNR would be acceptably stable after nine observations, and whole-word recognition of audiovisual speech at –6 dB SNR would require more than 10 observations to achieve acceptable stability. Figure 4 summarizes the results for variables derived from psychophysics tasks.
Variables Derived From ERP Measure
The stability of variables derived from the ERP measure was highly heterogeneous. Of these variables, P2 amplitudes for the auditory-only (g for a single observation = 0.74) and audiovisual (g for a single observation = 0.90) conditions were the most stable, exceeding our criterion for acceptable stability after two observations and one observation, respectively. N1 amplitudes were relatively less stable (g = 0.00 for a single observation in the auditory-only condition, and g = 0.41 for a single observation in the audiovisual condition). D studies indicate that six observations would be required to achieve acceptable stability for N1 amplitude in the audiovisual condition and that it may not be possible to obtain a stable estimate of N1 amplitude in the auditory-only condition in school-age children with ASD, even with repeated sampling (i.e., the model shows no sign of converging on acceptable stability even for estimated coefficients at 10 or more observations).
In the auditory-only condition, latency variables for both N1 (g for a single observation = 0.05) and P2 (g for a single observation = 0.24) had low stability. In the audiovisual condition, stability for latency of N1 (g for a single observation = 0.11) and P2 (g for a single observation = 0.00) was also low. According to D studies, it would take more than 10 observations to achieve acceptable stability for latencies of N1 and P2 across conditions. Refer to Figure 5 for a summary of findings for variables derived from the ERP task. Table 3 provides a detailed summary of estimated stability for each variable of interest according to varied numbers of observations, to facilitate planning for future studies.
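The D-study projections reported in this section can be approximated directly from the single-observation coefficients. The Python sketch below assumes that, in a one-facet crossed design, all error variance is divided by the number of observations averaged, which gives a Spearman-Brown-type formula. The function names are hypothetical, and because the published coefficients are rounded, results near the threshold may differ slightly from those EduG derives from the underlying variance components.

```python
def project_g(g1, n):
    """Projected absolute g for the mean of n observations,
    given the coefficient g1 for a single observation."""
    return n * g1 / (1.0 + (n - 1) * g1)

def observations_needed(g1, target=0.8, max_n=10):
    """Smallest n (up to max_n) whose projected g reaches the target,
    or None if the target is not reached within max_n observations."""
    for n in range(1, max_n + 1):
        if project_g(g1, n) >= target:
            return n
    return None

# Single-observation coefficients as reported in the text above.
reported = {
    "McGurk Pa/Ka": 0.84,
    "TBW": 0.74,
    "McGurk Ba/Ga": 0.47,
    "AV words, -3 dB SNR": 0.31,
    "AV words, -6 dB SNR": 0.19,
    "N1 amplitude, AV": 0.41,
    "N1 amplitude, AO": 0.00,
}
for name, g1 in reported.items():
    print(name, observations_needed(g1))
```

Under these assumptions, the projections agree with the numbers of observations reported above (e.g., five for the Ba/Ga McGurk variable and nine for word recognition at –3 dB SNR), and a coefficient of 0.00 never reaches the threshold no matter how many observations are averaged.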
This study examined the stability of several variables derived from commonly used measures of multisensory function in children with ASD. Stability of variables is critical if measures of multisensory function are to be employed in studies aiming to predict heterogeneity in broader ASD and related symptomatology and/or to assess intervention efficacy in children on the autism spectrum, though such psychometric work has been limited to date (Basu Mallick et al., 2015; Powers et al., 2009). The present results indicate that the stability of variables derived from measures of multisensory function differs across (and in some cases within) measure type in school-age children with ASD. Averaging scores across repeated observations will often be required to obtain acceptably stable estimates and increase the likelihood of detecting effects of interest related to multisensory function in this clinical population.
Variables Derived From Eye Tracking Are Highly Stable
Variables derived from the eye-tracking measure were highly stable. This result is somewhat surprising, given the limited sampling (i.e., less than 1 minute of data collection per condition) and relative lack of structure (i.e., passive viewing with little instruction beyond a request to “watch the movie”) associated with this sampling context. Though such eye-tracking tasks do not necessarily tap “integration” of multisensory stimuli, the present findings suggest that measures of eye-gaze patterns yield variables that are highly stable and, thus, have potential construct validity for indexing attention to multisensory stimuli, which has been theoretically and empirically linked to social, communication, and language development in children with or at risk for ASD (Klin et al., 2002; Santapuram et al., 2019; Tenenbaum et al., 2014). It is also notable that these brief and low-demand measures have high potential to translate to clinical practice if sufficient support is obtained for their validity in predicting symptomatology and/or detecting effects of interventions targeting looking behavior and more distal ASD symptoms.
Stability of Variables Derived From Psychophysics Measures Is More Mixed
In contrast, the stability of variables derived from psychophysics tasks was more heterogeneous. Although two indices exceeded the a priori criterion we established for acceptable stability with only one or two observations (i.e., TBW for audiovisual speech and one variable indexing magnitude of integration in response to incongruent audiovisual McGurk stimuli), other variables would necessitate much more extensive sampling to achieve acceptable stability. We considered the possibility that restricted variance among participants could explain the relatively low stability of some variables derived from psychophysics tasks. However, there was substantial variability among participants in variables derived from all measures of multisensory function employed in the present work, in accord with the extant literature. For example, on both McGurk tasks, participants showed a high degree of heterogeneity in their responses, reporting rates of perceived fusion ranging from 0% to 100%, which is consistent with prior research reporting on individual differences in integration on McGurk tasks (e.g., Basu Mallick et al., 2015). Thus, it is unlikely that a truncated range of responses could explain the relatively low stability observed for variables derived from McGurk tasks in the present report.
There are some other possible explanations for the variability observed for stability of scores derived from psychophysical measures. First, it is notable that the listening-in-noise task represents the only measure for which the exact same stimuli were not employed across observation days. The use of different wordlists was necessary to control for the possibility of children simply “learning” the words with repeated exposure. The relative instability observed for speech-in-noise variables across sessions may, however, reflect less than optimal balancing of stimuli, though wordlists were reportedly designed to be equally audible/intelligible in prior work (Picou et al., 2017).
The contrast in stability for variables derived from different McGurk tasks is somewhat more challenging to explain. These tasks technically did involve different stimuli, but the stimuli utilized in each task were highly similar consonant-vowel syllables that were spoken by the same speaker in the same manner and that have previously been shown to induce perceptual fusion (e.g., Iarocci et al., 2010; Irwin et al., 2011). The two versions of the McGurk task differed only in the specific audiovisual stop consonant-vowel combinations employed. The acoustic features that facilitate fusion for these two incongruent canonical syllable pairs do differ. Specifically, the audio “pa” plus visual “ka” combination (commonly inducing a percept of “ta” or “ha”) reflects an instance of ambiguity in voice onset time, whereas the audio “ba” plus visual “ga” combination (frequently leading to a fused percept of “da” or “tha”) reflects an instance of ambiguity in the frequency of the second formant. It is notable that this acoustic distinction of consonants (e.g., second formant frequency of “ba” versus “ga”) is less consistent, especially for natural speech stimuli (Ohde & Sharf, 1992). It is unclear whether this explanation could account for less reliable fusion of the “ba” and “ga” stimuli. It is clear, however, that the increased measurement error present for the auditory “ba” plus visual “ga” stimulus combination could account for the failure to replicate findings across the large extant literature that has employed these particular stimuli to investigate the magnitude of multisensory integration for audiovisual speech in children with ASD (Basu Mallick et al., 2015; Woynaroski et al., 2013).
Variables Derived From the ERP Measure Are Relatively Unstable
The variables derived from our ERP task were the least stable of all indices we explored. Nevertheless, differences in relative stability across indices were apparent. Consistent with prior literature, latency variables were, on the whole, less stable than amplitude variables (Cassidy et al., 2012; Huffmeijer et al., 2014) in our sample. Additionally, N1 indices were less stable than P2 indices, likely because the N1 component is still emerging and not fully consolidated in early childhood, for children with or without ASD (Espy et al., 2004). These findings collectively point toward a focus on one ERP variable—P2 amplitude—for future research into the neural response to multisensory speech in school-age children with autism.
The present study has clear implications for future research focused on multisensory function in children with ASD, but it is not without limitations. First, our sample size was small. The small n was necessary, given the extensive and repetitive nature of measurement, and was within the range of sample sizes frequently considered sufficient for estimating the amount of variance attributable to the construct(s) of interest versus other facets of the measurement context (i.e., 10–20 participants; Bottema-Beutel et al., 2019; Woynaroski et al., 2017; Yoder et al., 2016). However, a known ramification of small sample size, reflected in wide confidence intervals, is reduced confidence that results will represent variable stability in similar participants. Furthermore, our participant sample was limited to children who were relatively older and higher functioning (i.e., cognitively and linguistically able), specifically children who were age 7 and up and capable of completing the broad range of tasks utilized in the present study, including tasks that necessitated attending and actively responding for an extended period of time. Our participant sample is comparable to those of previous studies employing similar measures of multisensory functioning (e.g., Basu Mallick et al., 2015; Grossman et al., 2015; Hillock et al., 2011; Iarocci et al., 2010). Future studies are needed, though, to explore the stability of variables of multisensory function in children with ASD who represent the broader range of chronological ages, developmental stages, and functioning levels. The stability of variables derived from these frequently used measures of multisensory function would likely be lower than observed here in children who are chronologically or developmentally younger (Sandbank & Yoder, 2014).
Additionally, our findings may not generalize beyond the methods used in this study; for example, it is possible that a different number of trials in the psychophysical and ERP tasks or different stimuli may yield different results.
To our knowledge, this is the first study to comprehensively examine the stability of variables derived from commonly used measures of multisensory function in children with ASD. Results suggest that the stability of such variables is highly heterogeneous. Though a number of indices demonstrated relatively high stability, suggesting they hold some promise for detecting effects of interest as they relate to multisensory function in future research (and perhaps ultimately in clinical practice), other variables were much less stable. Thus, obtaining representative estimates of constructs of interest related to multisensory processing may require averaging scores across repeated observations or, in some cases, may not be feasible. Collectively, our findings highlight the importance of considering psychometrics in planning future studies focused on children with autism and other neurodevelopmental conditions.
This work was supported by NIH U54 HD083211, NIH/NCATS KL2TR000446, NIH/NIDCD R21 DC016144, NIH/NIDCD F31 DC015956, and NIH T32 MH064913. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the funding agencies. Results from this manuscript were previously presented at the 2018 International Multisensory Research Forum and the 2019 Gatlinburg Conference on Research and Theory in Intellectual and Developmental Disabilities.