Computational Cognition and Learning Lab

Ongoing Research Projects

 

Embodied Language Acquisition

Language is built on symbols, and during human development those symbols must be grounded in the physical environment. Recent work has drawn increasing attention to the essential role that inferences about speakers' referential intentions play in grounding those symbols. Experiments have shown that these inferences, as revealed in eye, head and hand movements, serve as an important driving force in language learning at a relatively early age. The challenge ahead is to develop formal models of language acquisition that can shed light on the leverage provided by embodiment. We present an implemented computational model of embodied language acquisition that learns words from natural interactions with users. The system can be trained in an unsupervised mode in which users perform everyday tasks while describing their behaviors in natural language. We collect acoustic signals in concert with user-centric multisensory information from non-speech modalities, such as video from the user's perspective, gaze positions, head directions and hand movements. A multimodal learning algorithm first spots words in continuous speech and then associates action verbs and object names with their grounded meanings. The central idea is to use non-speech contextual information to facilitate word spotting, and to use the user's attention as a deictic reference for discovering temporal correlations among data from different modalities, from which lexical items are built. We report the results of a series of experiments that demonstrate the effectiveness of our approach.
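As a minimal sketch of the temporal-correlation idea, the snippet below counts how often a spotted word overlaps in time with an attended object. It is an illustration under stated assumptions, not the lab's implementation: the function name `pair_words_with_attention`, the tuple layouts and the `max_lag` slack are all hypothetical, and the word spotter and gaze tracker are assumed to have already produced timestamped output.

```python
from collections import Counter

def pair_words_with_attention(words, fixations, max_lag=0.5):
    """Count how often each spotted word overlaps in time with each attended object.

    words: list of (word, t_start, t_end) from a word spotter (assumed input).
    fixations: list of (object_label, t_start, t_end) from gaze tracking (assumed input).
    max_lag: hypothetical slack, in seconds, allowed between speech and gaze.
    """
    counts = Counter()
    for word, ws, we in words:
        for obj, fs, fe in fixations:
            # The two intervals overlap once the fixation is widened by max_lag on both sides.
            if ws <= fe + max_lag and we >= fs - max_lag:
                counts[(word, obj)] += 1
    return counts

words = [("stapler", 1.0, 1.4), ("paper", 3.0, 3.3)]
fixations = [("STAPLER", 0.8, 1.6), ("PAPER", 2.9, 3.5)]
counts = pair_words_with_attention(words, fixations)
```

Counts of this kind only record co-occurrence; turning them into lexical items still requires a learning step that resolves which word goes with which meaning.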

Developmental and Statistical Models of Early Word Learning

Previous work on early language acquisition has shown that word meanings can be acquired by an associative procedure that maps perceptual experience onto linguistic labels based on cross-situational observation. A more recent line of work, termed social-pragmatic theory, focuses on the effect of the child's social-cognitive capacities, such as joint attention and intention reading. We argue that statistical and social cues can be seamlessly integrated to facilitate early word learning. To support this idea, we first introduce a statistical learning mechanism that provides a formal account of cross-situational observation. The main part of this work focuses on a unified model that makes use of different kinds of social cues, such as joint attention and prosody in maternal speech, within the statistical learning framework. In a computational analysis of infant data, our unified model outperforms the purely statistical learning method in computing word-meaning associations.
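One way to make cross-situational learning concrete is an expectation-maximization loop over word-meaning co-occurrences, sketched below. This is an assumption about how such a mechanism could look, not the model described above: the function `learn_associations` and its `cue_weight` hook are hypothetical, and the hook only stands in for a social cue such as joint attention by weighting the candidate meanings in a situation.

```python
from collections import defaultdict

def learn_associations(situations, n_iter=20, cue_weight=None):
    """Estimate p(word | meaning) from cross-situational co-occurrence via EM.

    situations: list of (words, meanings) pairs, one per utterance.
    cue_weight: optional function (situation_index, meaning) -> positive weight,
        a hypothetical stand-in for social cues such as joint attention.
    """
    words = {w for ws, _ in situations for w in ws}
    meanings = {m for _, ms in situations for m in ms}
    # Start from a uniform table, so no pairing is favored in advance.
    p = {m: {w: 1.0 / len(words) for w in words} for m in meanings}
    for _ in range(n_iter):
        counts = {m: defaultdict(float) for m in meanings}
        for i, (ws, ms) in enumerate(situations):
            for w in ws:
                # E-step: split each word's evidence among the co-present meanings.
                scores = {m: p[m][w] * (cue_weight(i, m) if cue_weight else 1.0)
                          for m in ms}
                total = sum(scores.values())
                if total == 0:
                    continue
                for m, s in scores.items():
                    counts[m][w] += s / total
        # M-step: renormalize the expected counts for each meaning.
        for m in meanings:
            z = sum(counts[m].values())
            if z > 0:
                p[m] = {w: counts[m][w] / z for w in words}
    return p

situations = [
    (["ball", "dog"], ["BALL", "DOG"]),
    (["ball", "cat"], ["BALL", "CAT"]),
    (["dog", "cat"], ["DOG", "CAT"]),
]
table = learn_associations(situations)
best_for_ball = max(table["BALL"], key=table["BALL"].get)
```

No single situation in the toy data identifies a referent, yet the pairings become recoverable once evidence is pooled across situations, which is the core of the cross-situational account. A purely statistical run leaves `cue_weight` unset; supplying it changes how evidence is shared among the candidate meanings.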

Action Recognition

Humans perceive an action stream as a sequence of clearly segmented "action units". This gives rise to the idea that action recognition means interpreting continuous human behavior as a sequence of action primitives such as "picking up a coffee pot". The novelty of our approach is to segment continuous actions in natural tasks by detecting agent-centered switches of attention. Because eye and head movements are closely linked to attention, we develop a method that detects attention by integrating eye gaze and head position information. Attention switches are then computed and used to segment the action sequence into action units, which are recognized by Hidden Markov Models. An experimental system built for recognizing actions in the natural task of "stapling a letter" demonstrates the effectiveness of the approach. We also observe that, when asked to describe others' activities, an observer usually produces verbal descriptions that correspond to subtasks rather than action units. The observer thus conceptualizes the sensory input at an abstract level corresponding to tasks or subtasks, and then verbalizes the perceptual results as utterances. In light of this, this work concentrates on recognizing tasks rather than action primitives alone. By tracking the course of gaze and head movements, our approach detects agent-centered attention switches, which are then used to segment an action sequence into action units. Building on the recognition of those action primitives, parallel Hidden Markov Models are applied to model and integrate the probabilistic sequences of action units from different body parts.
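The segmentation step can be illustrated with a small sketch that cuts a recording wherever gaze or head position jumps between consecutive frames. The details are assumptions made for illustration: the function name `segment_by_attention`, the per-frame `(x, y)` sample format, the two thresholds, the minimum segment length, and the rule that a jump in either signal counts as an attention switch. The source does not specify how gaze and head information are integrated.

```python
import math

def segment_by_attention(gaze, head, gaze_thresh=30.0, head_thresh=10.0, min_len=3):
    """Split a recording into candidate action units at attention switches.

    gaze, head: equal-length lists of (x, y) samples, one per frame (assumed format).
    gaze_thresh, head_thresh: hypothetical per-frame displacement thresholds.
    min_len: shortest run of frames kept as a segment.
    Returns a list of (start, end) frame index pairs, with end exclusive.
    """
    def speed(points, i):
        # Displacement between frame i-1 and frame i.
        (x0, y0), (x1, y1) = points[i - 1], points[i]
        return math.hypot(x1 - x0, y1 - y0)

    n = len(gaze)
    segments, start = [], 0
    for i in range(1, n):
        # Assumed integration rule: a large jump in either signal marks a switch.
        switched = speed(gaze, i) >= gaze_thresh or speed(head, i) >= head_thresh
        if switched:
            if i - start >= min_len:
                segments.append((start, i))
            start = i  # the frame after the jump opens the next candidate segment
    if n - start >= min_len:
        segments.append((start, n))
    return segments

gaze = [(0, 0)] * 5 + [(100, 100)] * 5
head = [(0, 0)] * 10
units = segment_by_attention(gaze, head)
```

Each returned span would then be handed to a recognizer, such as one Hidden Markov Model per action unit, which is outside the scope of this sketch.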

Multimodal Perceptual Interface

The next generation of computers is expected to interact and communicate with users in a cooperative and natural manner while users engage in everyday activities. Situated in users' environments, intelligent computers should not only have basic perceptual abilities but also use knowledge of the associations between different perceptual inputs. Toward this goal, we develop a multimodal perceptual interface in which a virtual agent interacts with users in real time, verbally describes what users are doing (action recognition) and what they are looking at (visual object recognition), and performs actions (action generation) in response to spoken commands (speech understanding).