Ongoing Research Projects
Embodied Language Acquisition

Language is about symbols, and those symbols must be grounded in the physical environment during human development. Recently, there has been increased awareness of the essential role that inferences about speakers' referential intentions play in grounding those symbols. Experiments have shown that these inferences, as revealed in eye, head and hand movements, serve as an important driving force in language learning at a relatively early age. The challenge ahead is to develop formal models of language acquisition that can shed light on the leverage provided by embodiment. We present an implemented computational model of embodied language acquisition that learns words from natural interactions with users. The system can be trained in an unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. We collect acoustic signals in concert with user-centric multisensory information from non-speech modalities, such as video from the user's perspective, gaze positions, head directions and hand movements. A multimodal learning algorithm first spots words in continuous speech and then associates action verbs and object names with their grounded meanings. The central idea is to use non-speech contextual information to facilitate word spotting, and to use the user's attention as a deictic reference for discovering temporal correlations among data from different modalities, from which lexical items are built. We report the results of a series of experiments that demonstrate the effectiveness of our approach.
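The association step can be illustrated with a minimal sketch. It assumes that word spotting has already produced time-stamped words and that gaze and head tracking have produced time-stamped attended objects. All names, the slack window, and the data are hypothetical, and simple overlap counting stands in for the actual learning algorithm, which this page does not specify.

```python
from collections import defaultdict

def associate(words, fixations, window=0.5):
    """Count how often each spotted word overlaps in time with each attended object.

    words:     list of (word, start_sec, end_sec) from a hypothetical word spotter
    fixations: list of (object_label, start_sec, end_sec) from gaze/head tracking
    window:    slack in seconds allowed between speech and attention (assumed value)
    """
    counts = defaultdict(lambda: defaultdict(int))
    for word, w_start, w_end in words:
        for obj, f_start, f_end in fixations:
            # Two intervals overlap if each starts before the other ends (plus slack).
            if w_start <= f_end + window and f_start <= w_end + window:
                counts[word][obj] += 1
    return counts

def best_meaning(counts):
    """Pick, for each word, the object it co-occurred with most often."""
    return {word: max(objs, key=objs.get) for word, objs in counts.items()}

# Toy recording: the speaker names objects while looking at them.
words = [("stapler", 1.0, 1.4), ("paper", 3.0, 3.3), ("stapler", 6.0, 6.5)]
fixations = [("STAPLER", 0.8, 1.8), ("PAPER", 2.6, 3.6), ("STAPLER", 5.7, 6.9)]
lexicon = best_meaning(associate(words, fixations))
```

Here attention acts as the deictic reference: a word is only paired with objects the user was attending to at roughly the time the word was spoken, which keeps the number of candidate meanings small.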
Developmental and Statistical Models of Early Word Learning

Previous work on early language acquisition has shown that word meanings can be acquired by an associative procedure that maps perceptual experience onto linguistic labels based on cross-situational observation. A more recent trend, termed social-pragmatic theory, focuses on the effect of the child's social-cognitive capacities, such as joint attention and intention reading. We argue that statistical and social cues can be seamlessly integrated to facilitate early word learning. To support this idea, we first introduce a statistical learning mechanism that provides a formal account of cross-situational observation. The main part of this work focuses on a unified model that can make use of different kinds of social cues, such as joint attention and prosody in maternal speech, within the statistical learning framework. In a computational analysis of infant data, our unified model outperforms the purely statistical learning method in computing word-meaning associations.
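One way to picture how social cues could enter a statistical learner is the sketch below. It is not the model described above: it assumes an EM procedure in the style of a simple translation model, and it assumes that a social cue such as joint attention can be expressed as a per-referent weight. The function name, the weights, and the toy data are all hypothetical.

```python
from collections import defaultdict

def learn_associations(situations, iterations=20):
    """Estimate p(word | meaning) by EM over cross-situational data.

    situations: list of (words, meanings), where meanings maps each candidate
                referent to a positive attention weight (1.0 = no social cue).
    """
    vocab = {w for words, _ in situations for w in words}
    # Start from a uniform distribution over words for every meaning.
    p = defaultdict(lambda: 1.0 / len(vocab))
    for _ in range(iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)
        for words, meanings in situations:
            for w in words:
                # E-step: share the word among co-present meanings,
                # scaled by how strongly each meaning was attended.
                scores = {m: weight * p[(w, m)] for m, weight in meanings.items()}
                norm = sum(scores.values())
                if norm == 0:
                    continue
                for m, s in scores.items():
                    counts[(w, m)] += s / norm
                    totals[m] += s / norm
        # M-step: renormalise expected counts per meaning.
        p = defaultdict(float, {(w, m): c / totals[m]
                                for (w, m), c in counts.items() if totals[m] > 0})
    return p

# Cross-situational statistics alone: each word recurs with exactly one referent.
plain = learn_associations([
    (["doggie", "ball"], {"DOG": 1.0, "BALL": 1.0}),
    (["doggie", "cup"], {"DOG": 1.0, "CUP": 1.0}),
    (["ball", "cup"], {"BALL": 1.0, "CUP": 1.0}),
])

# Statistically ambiguous data: both referents are always present.
ambiguous = learn_associations([
    (["doggie"], {"DOG": 1.0, "BALL": 1.0}),
    (["ball"], {"DOG": 1.0, "BALL": 1.0}),
])

# Same data, with the jointly attended referent weighted more heavily.
cued = learn_associations([
    (["doggie"], {"DOG": 3.0, "BALL": 1.0}),
    (["ball"], {"DOG": 1.0, "BALL": 3.0}),
])
```

In the first data set, co-occurrence statistics are enough to pair each word with its referent. In the second, the statistics cannot separate the two referents and the estimate for `("doggie", "DOG")` stays at chance; adding attention weights in the third is what breaks the tie.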
Action Recognition

Humans perceive an action stream as a sequence of clearly segmented "action units". This gives rise to the idea that action recognition means interpreting continuous human behavior as a sequence of action primitives such as "picking up a coffee pot". The novelty of our approach is that it segments the continuous actions in natural tasks by detecting agent-centered switches of attention. Because eye and head movements are closely linked to attention, we develop a method that detects attention by integrating eye gaze and head position information. Attention switches are then computed and used to segment the action sequence into action units, which are recognized by hidden Markov models. An experimental system built for recognizing actions in the natural task of "stapling a letter" demonstrates the effectiveness of the approach.

We also observe that, when asked to describe others' activities, an observer usually produces verbal descriptions that correspond to subtasks rather than action units. The observer thus conceptualizes the sensory input at an abstract level corresponding to tasks or subtasks, then verbalizes the perceptual results to yield utterances. In light of this, this work concentrates on recognizing tasks instead of action primitives. By tracking the course of gaze and head movements, our approach uses gaze and head cues to detect agent-centered attention switches, which are then used to segment an action sequence into action units. Once those action primitives are recognized, parallel hidden Markov models are applied to model and integrate the probabilistic sequences of the action units of different body parts.
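The segmentation and recognition steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the thresholds, the rule that a frame counts as attended when both gaze and head are nearly still, the toy HMM parameters, and all function names are hypothetical, and a single discrete HMM per action stands in for the parallel HMMs mentioned in the text.

```python
import math

def segment_by_attention(gaze, head, gaze_thresh=2.0, head_thresh=1.0, min_len=3):
    """Split a recording into candidate action units at attention switches.

    gaze, head: equal-length lists of (x, y) samples, one per frame.
    A frame counts as attended when both gaze and head moved less than their
    thresholds since the previous frame (threshold values are assumed).
    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    """
    stable = [True]
    for i in range(1, len(gaze)):
        gaze_step = math.dist(gaze[i], gaze[i - 1])
        head_step = math.dist(head[i], head[i - 1])
        stable.append(gaze_step < gaze_thresh and head_step < head_thresh)

    segments, start = [], None
    for i, ok in enumerate(stable + [False]):  # sentinel closes the last run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete-output HMM."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    scale = sum(alpha)
    log_p = math.log(scale)
    alpha = [a / scale for a in alpha]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][o]
                 for s in range(n)]
        scale = sum(alpha)
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_p

def classify(obs, models):
    """Return the name of the model that assigns `obs` the highest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Toy data: gaze rests at one location for 5 frames, then jumps to another.
gaze = [(0.0, 0.0)] * 5 + [(10.0, 10.0)] * 5
head = [(0.0, 0.0)] * 10
units = segment_by_attention(gaze, head)

# Hypothetical models as (start, transition, emission) over 3 quantised motion symbols.
models = {
    "pick_up": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "staple": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]),
}
labels = [classify(obs, models) for obs in ([0, 0, 1, 1], [2, 2, 2, 2])]
```

The frame on which the gaze jumps is treated as the attention switch and excluded from both neighbouring units. In a fuller pipeline, the motion samples inside each unit would be quantised into symbols and passed to `classify`; that feature extraction step is omitted here.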
Multimodal Perceptual Interface

The next generation of computers is expected to interact and communicate with users in a cooperative and natural manner while users engage in everyday activities. Because they are situated in users' environments, intelligent computers should not only have basic perceptual abilities but also use knowledge of the associations between different perceptual inputs. Toward this goal, we are developing a multimodal perceptual interface in which a virtual agent can interact with users in real time, verbally describe what users are doing (action recognition) and what they are looking at (visual object recognition), and perform actions (action generation) in response to spoken commands (speech understanding).
Computational Cognition and Learning Lab
The Department of Psychological and Brain Sciences
1101 East Tenth Street
Bloomington, IN 47405
PHONE: 812-856-1920
Comments: Computational Cognition and Learning Lab Director
Copyright 2004, The Trustees of Indiana University
Web Site and Graphic Design by Heather Winne