Cortex

Volume 104, July 2018, Pages 12-25

Research Report
Does dynamic information about the speaker's face contribute to semantic speech processing? ERP evidence

https://doi.org/10.1016/j.cortex.2018.03.031

Abstract

Face-to-face interactions characterize communication in social contexts. These situations are typically multimodal, requiring the integration of linguistic auditory input with facial information from the speaker. In particular, eye gaze and visual speech provide the listener with social and linguistic information, respectively. Despite the importance of this context for an ecological study of language, research on audiovisual integration has mainly focused on the phonological level, leaving aside effects on semantic comprehension. Here we used event-related potentials (ERPs) to investigate the influence of dynamic facial information on semantic processing of connected speech. Participants were presented with either a video or a still picture of the speaker together with auditory sentences. Across three experiments, we manipulated the presence or absence of the speaker's dynamic facial features (mouth and eyes) and compared the amplitudes of the semantic N400 elicited by unexpected words. Contrary to our predictions, the N400 was not modulated by dynamic facial information; semantic processing therefore seems to be unaffected by the speaker's gaze and visual speech. However, during the processing of expected words, dynamic faces elicited a long-lasting late posterior positivity compared to the static condition. This effect was significantly reduced when the mouth of the speaker was covered. Our findings may indicate increased attentional processing in richer communicative contexts. They also demonstrate that in natural face-to-face encounters, perceiving the speaker's face in motion provides supplementary information that is taken into account by the listener, especially when auditory comprehension is undemanding.

Introduction

In human verbal communication, there is a natural prevalence for face-to-face interactions, involving the multimodal interplay of visual and auditory signals sent from the speaker to the listener. Though auditory information alone is sufficient for effective communication (Giraud & Poeppel, 2012), seeing the interlocutor's facial motions apparently provides further advantages (e.g., Crosse et al., 2015, Fort et al., 2013, Peelle and Sommers, 2015, Rohr and Abdel Rahman, 2015, van Wassenhove, 2013). Some authors refer to this effect as visual enhancement (Peelle & Sommers, 2015), underscoring that human communication involves multisensory adaptation. Audiovisual integration in language processing is becoming, therefore, an area of growing interest.

Most of the literature on audiovisual integration in language processing has focused on the phonological level. Visual speech seems to increase the ability of a listener to correctly perceive utterances (Cotton, 1935, Sumby and Pollack, 1954), increase the speed at which phonemes are perceived (Soto-Faraco, Navarra, & Alsius, 2004), and may even alter the perception of phonemes (McGurk & MacDonald, 1976). This multisensory gain depends on various factors, including spatial congruency, temporal coincidence, behavioral relevance, and experience (for review, see van Atteveldt, Murray, Thut, & Schroeder, 2014).

Audiovisual integration has also been studied with electrophysiological measures like event-related brain potentials (ERPs). This technique is characterized by fine-grained temporal resolution and allows investigating the neural mechanisms underlying multisensory integration at different levels. At the phonological level, the facilitation provided by audiovisual integration is reflected in shorter latencies (Alsius et al., 2014, Baart et al., 2014, Knowland et al., 2014, Stekelenburg and Vroomen, 2007, van Wassenhove et al., 2005) and smaller amplitudes (Hisanaga et al., 2016, Stekelenburg and Vroomen, 2007, Stekelenburg and Vroomen, 2012, van Wassenhove et al., 2005) of the auditory N1 and P2 components of the ERP. Moreover, studies with functional magnetic resonance imaging and magnetic field potentials have reported that visual input about the speaker's lip positions or movements can modulate the activity of the primary auditory cortex (Calvert et al., 1997, Lakatos et al., 2008).

Available evidence suggests that audiovisual integration also facilitates lexical access at the semantic level as shown with cross-modality priming. In this paradigm, a silent video of a speaker uttering a (prime) word is followed by the auditory-only version of the critical word. Such priming by visual speech can improve semantic categorizations (Dodd, Oerlemens, & Robinson, 1989), lexical decisions (Fort et al., 2013, Kim et al., 2004), and word recognition (Buchwald, Winters, & Pisoni, 2009) of critical words. These findings support an influence of visual speech on lexical or post-lexical processes and indicate that visual and auditory speech modalities share cognitive resources (Buchwald et al., 2009).

In typical audiovisual communication, facial gestures precede the auditory input by about 150 ms (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009). A set of lexical candidates might therefore be available before the utterance can be heard (Fort et al., 2013) and, consequently, semantic processing might be easier if the visually pre-activated word matches the auditory input. Further, audiovisual presentation of complex texts improves performance compared to auditory-only conditions, as shown with comprehension questionnaires (Arnold and Hill, 2001, Reisberg et al., 1987). In sum, the perception of the speaker's facial dynamics can improve not only phonetic perception but also semantic comprehension.

Evidence from on-line measures of brain activity, such as ERPs, on the neural correlates of audiovisual integration at the semantic level during sentence processing is surprisingly scarce. To our knowledge, the only pertinent ERP study has been reported by Brunellière, Sánchez-García, Ikumi, and Soto-Faraco (2013). In a first experiment, these authors manipulated the semantic constraints (expectancy) of critical words within audiovisual sentences, as well as the articulatory saliency of lip movements. These variables interacted during the late part of the N400, an ERP component reflecting the access to semantic knowledge during language comprehension (Kutas & Federmeier, 2011). Compared to low visual articulatory saliency, high saliency increased the N400 amplitude for unexpected words and yielded a widespread effect across the scalp. In a second experiment, Brunellière et al. (2013) compared the effect of visual articulatory saliency with respect to an audio-alone condition without manipulating semantic constraints. Words with high articulatory saliency yielded a significant N400 effect that was enhanced under audiovisual presentation relative to the audio-alone condition, which the authors interpreted in terms of late phonological effects.

The present study aimed to add further evidence to the scarce literature on audiovisual integration at the semantic level by comparing the neural processing of expected and unexpected words in spoken sentences. Sentences were presented either in a dynamic audiovisual mode, showing videos of the speaker, or in a still face mode, showing pictures of the speaker. This paradigm allowed exploring how the dynamics of the speaker's facial movements impact the semantic processing of words during sentence comprehension. Our study therefore focused on semantic processing of connected speech while the speaker's face is seen in a dynamic versus a static mode.

Most studies on audiovisual processing of language have not considered that visual perception in face-to-face contexts is not restricted to oro-facial (i.e., mouth) speech movements. However, the perception of the eyes and their gaze direction strongly captures and directs attention (Conty et al., 2006, von Grünau and Anston, 1995, Senju and Hasegawa, 2005) and modulates the activity of neurons in auditory cortex (van Atteveldt et al., 2014). Evolutionary evidence supports the importance of the eyes in human communication, such as the white sclera, an adaptation specific to humans (Kobayashi and Kohshima, 2001, Tomasello et al., 2007). A study in macaques showed enhanced activity of ventrolateral prefrontal neurons in response to combining vocalizations with pictures of direct-gaze faces (Romanski, 2012), demonstrating the role of this brain area in the integration of social-communicative information.

Gaze perception seems to influence social interactions and communication among humans. For instance, conversations typically begin with eye contact between individuals (Schilbach, 2015), and communicative intent is usually signalled by direct gaze (Farroni et al., 2002, Gallagher, 2014). Hence, the eyes are extremely informative, both about the mental state of the interaction partner and about what a speaker demands from a listener (Myllyneva & Hietanen, 2015). Eye contact between persons can modulate concurrent cognitive and behavioural activities, a phenomenon known as the "eye contact effect" (Senju & Johnson, 2009), which is mediated by the social brain network, including areas such as the fusiform gyrus, superior temporal sulcus, medial prefrontal and orbitofrontal cortex, and amygdala. Eyes are thus evidently "special" visual stimuli for humans.

As reviewed above, facial information relevant for language comprehension is not restricted to oro-facial movements but also includes other cues such as eye gaze. The present ERP study explored the effects of dynamic facial information on semantic processing by manipulating the presence of the two main sources of information in the face, the mouth (visual speech) and the eyes (gaze). To this aim, we compared the amplitude of the semantic N400 component elicited by unexpected words within audiovisual connected speech. The expectancy of critical words within a given sentence was manipulated by a preceding context sentence. This allowed comparing exactly the same stimulus material across conditions with a maximum degree of experimental control, by merely exchanging the preceding context sentence. In parallel to the auditory material, participants were presented with two kinds of visual information, consisting of either a video of the speaker's face and upper torso (dynamic conditions) or stills of the speaker taken from these videos (static control conditions). In three experiments, we investigated the effects of visual speech and gaze on ERP correlates of semantic processing of connected speech, focusing on the N400 component: In Experiment 1, participants were presented with the speaker's whole face in dynamic and static versions. In Experiments 2 and 3, the specific contributions of facial information to the observed effects were studied by concealing either the eyes or the mouth, respectively.
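
For illustration, the following minimal sketch (Python, using the MNE library) shows how mean amplitudes in a typical N400 time window could be extracted from epoched EEG data for a design like the one described above. The file name, condition labels, time window, and electrode selection are hypothetical placeholders and are not taken from the present study.

    # Minimal sketch: extracting mean N400-window amplitudes from epoched EEG.
    # Assumptions (not from the paper): file name, condition labels, time window,
    # and centro-parietal electrode selection are hypothetical placeholders.
    import mne

    # Preprocessed epochs, time-locked to critical word onset (hypothetical file)
    epochs = mne.read_epochs("sub-01_critical_words-epo.fif")

    # Hypothetical condition labels coding a 2 x 2 design within one experiment
    conditions = ["expected/dynamic", "expected/static",
                  "unexpected/dynamic", "unexpected/static"]

    # Assumed N400 window and centro-parietal channels
    tmin, tmax = 0.300, 0.500
    channels = ["Cz", "CPz", "Pz", "CP1", "CP2"]

    mean_amplitude = {}
    for cond in conditions:
        sel = epochs[cond].copy().pick(channels).crop(tmin=tmin, tmax=tmax)
        data = sel.get_data()               # shape: (n_trials, n_channels, n_times)
        mean_amplitude[cond] = data.mean()  # average over trials, channels, samples

    # N400 expectancy effect per presentation mode (unexpected minus expected)
    n400_dynamic = mean_amplitude["unexpected/dynamic"] - mean_amplitude["expected/dynamic"]
    n400_static = mean_amplitude["unexpected/static"] - mean_amplitude["expected/static"]
    print(f"N400 effect, dynamic: {n400_dynamic:.2e} V; static: {n400_static:.2e} V")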

In Experiment 1, we expected modulations of the N400 component by dynamic information provided by the whole face. According to the literature reviewed above, visibility of lip movements should affect lexical access and semantic processing of unexpected words, leading to increased amplitudes of the N400 component (Brunellière et al., 2013). As direct eye contact has been shown to attract attention and to activate a broad network of social brain areas, resources for semantic processes might be diminished, presumably reflected in reduced N400 amplitudes. Therefore, we expected an increased N400 amplitude when the speaker's eyes are occluded and only visual speech is available (Experiment 2), and a reduced N400 when the eyes are visible and the mouth is covered (Experiment 3).

Section snippets

Experiment 1. Whole face presentation

This experiment investigated audiovisual processing of words in spoken sentences in an ecological context, that is, perceiving the whole face of the speaker while listening to connected speech. In the video condition, participants could therefore focus on both eye gaze and visual speech available from the whole face of the speaker while concurrently listening to auditory speech.

Experiment 2. Eyes covered

The aim of this experiment was to explore the influence of dynamic mouth movements (visual speech) on semantic processing of connected speech. The procedure of Experiment 2 was the same as in Experiment 1, except that the eyes of the speaker were covered (Fig. 1).

Experiment 3. Mouth covered

By occluding the mouth region (Fig. 1), Experiment 3 investigated whether information other than lip movements, leaving the eyes visible as the most important facial feature, contributes to the effect of dynamic versus static presentation mode observed in Experiment 1.

Data analysis

In order to directly compare the potential impact of the different face areas visible during speech processing, we analysed the effects of presentation mode and word expectancy across all three facial feature presentation conditions: whole face (Exp. 1), eyes covered (Exp. 2), and mouth covered (Exp. 3). To this aim, a mixed ANOVA was first performed including Facial Feature (Experiment) as a between-subjects factor and Expectancy and Presentation Mode as within-subject factors.
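
As an illustration only, a design of this kind (one between-subjects factor, two within-subject factors) could be approximated with a linear mixed-effects model, as in the minimal Python sketch below using statsmodels. The input file and column names are hypothetical, and a mixed-effects model is a substitute for, not a reproduction of, the classical mixed ANOVA reported here.

    # Minimal sketch: approximating the 3-way mixed ANOVA with a linear
    # mixed-effects model (statsmodels). The CSV file and column names are
    # hypothetical; the paper itself reports a classical mixed ANOVA.
    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per subject x condition cell: mean N400-window amplitude (microvolts).
    # Assumed columns: subject, amplitude, expectancy (expected/unexpected),
    # presentation (dynamic/static), facial_feature (whole_face/eyes_covered/mouth_covered)
    df = pd.read_csv("n400_mean_amplitudes.csv")

    model = smf.mixedlm(
        "amplitude ~ expectancy * presentation * facial_feature",
        data=df,
        groups=df["subject"],  # random intercept per participant
    )
    result = model.fit()
    print(result.summary())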

General Discussion

In the present study, we investigated whether dynamic features of a speaker's face can facilitate the semantic processing of connected speech, as compared to a static condition. We recorded ERPs to expected and unexpected words, embedded in spoken sentences, in two different conditions. In one condition, the spoken utterances appeared together with a video of the speaker; in the other condition, a static picture of the face was shown. Importantly, the same critical words embedded in the same

Conclusions

The main finding of the present study is enhanced attentional processing of contexts that most closely resemble natural communicative situations, as long as semantic speech processing is not very demanding. Contrary to our predictions, we could not find any modulation of the N400 semantic effect by the concomitant dynamic facial information. The speaker's dynamic facial features did not affect semantic processing. When semantic comprehension is more demanding (i.e., when an unpredicted or

Conflict of interest statement

All the authors declare no conflict of interest.

Acknowledgments

The authors thank Johannes Rost for stimulus preparation, Hossein Sabri for data collection, Anna Eiserbeck for her help in data analysis, and Guido Kiecker for technical support. This research was supported by the structured graduate program “Self Regulation Dynamics Across Adulthood and Old Age: Potentials and Limits”. It was funded by the DFG excellence initiative “Talking heads” from the DAAD [grant code 57049661], and the Spanish Ministerio de Economía y Competitividad [grant code

References (52)

  • L. Schilbach. Eye to eye, face to face and brain to brain: Novel approaches to study the behavioral dynamics and neural mechanisms of social interactions. Current Opinion in Behavioral Sciences (2015).
  • S. Schindler et al. People matter: Perceived sender identity modulates cerebral processing of socio-emotional language feedback. NeuroImage (2016).
  • C.E. Schroeder et al. Low-frequency neuronal oscillations as instruments of sensory selection. Trends in Neurosciences (2009).
  • A. Senju et al. The eye contact effect: Mechanisms and development. Trends in Cognitive Sciences (2009).
  • S. Soto-Faraco et al. Assessing automaticity in audiovisual speech integration: Evidence from the speeded classification task. Cognition (2004).
  • M. Tomasello et al. Reliance on head versus eyes in the gaze following of great apes and human infants: The cooperative eye hypothesis. Journal of Human Evolution (2007).
  • A. Alsius et al. Effect of attentional load on audiovisual speech perception: Evidence from ERPs. Frontiers in Psychology (2014).
  • P. Arnold et al. Bisensory augmentation: A speech reading advantage when speech is clearly audible and intact. British Journal of Psychology (2001).
  • H. Brouwer et al. On the proper treatment of the N400 and P600 in language comprehension. Frontiers in Psychology (2017).
  • A.B. Buchwald et al. Visual speech primes open-set recognition of spoken words. Language and Cognitive Processes (2009).
  • G.A. Calvert et al. Activation of auditory cortex during silent lipreading. Science (1997).
  • C. Chandrasekaran et al. The natural statistics of audiovisual speech. PLoS Computational Biology (2009).
  • H. Colonius et al. Multisensory interaction in saccadic reaction time: A time-window-of-integration model. Journal of Cognitive Neuroscience (2004).
  • L. Conty et al. Searching for asymmetries in the detection of gaze contact versus averted gaze under different head views: A behavioural study. Spatial Vision (2006).
  • J.C. Cotton. Normal "visual hearing". Science (1935).
  • M.J. Crosse et al. Congruent visual speech enhances cortical entrainment to continuous auditory speech in noise-free conditions. The Journal of Neuroscience (2015).