Description:

While service robotics has been making steady progress in recent years in various fields such as computer vision, manipulation, navigation and natural language understanding, we are still far away from truly natural human-robot interaction (HRI), which will ultimately be required for robots acting together with humans in the same environment. We believe that this is in part because two important modalities are often studied in isolation, namely visual scene understanding (VSU) and natural language understanding (NLU).

Imagine a "room cleaning" scenario, where a human instructs a robot to put objects in their proper places in a living room. For a human instructed in the same fashion, it would be rather unnatural to listen to the task description with closed eyes and only then look at the scene, trying to match linguistic expressions to referents. Yet this is how robots with distinct subsystems often operate.

Natural interaction, however, does not work like that. When instructing a robot, humans will naturally look towards an intended object or point to it, gazing back at the robot to check whether it is attending to the object. If the robot is able to follow the human's eye gaze to the target object, both human and robot will establish joint attention, which allows the human instructor to check quickly (and often subconsciously) that the robot understood the request correctly. In addition to looking at the object, humans will typically also expect a robot to verbally acknowledge understanding by saying "OK" or "got it", or to ask for clarification, such as "the one by the table?". Feedback is often expected already for partial utterances, again through eye gaze, verbal acknowledgments, or through the immediate initiation of an action, such as the robot reaching for a book after hearing "put the red book..." while the utterance is still going on.
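
As a concrete illustration, the following toy sketch (our own simplification, not an actual system component) shows how such feedback could be produced incrementally: candidate referents are narrowed word by word, the robot acknowledges and starts acting as soon as the referent is unique, and it asks for clarification if ambiguity remains at the end of the utterance. The scene objects and attribute vocabulary are hypothetical.

```python
# A toy sketch of incremental feedback (our illustration, not project code):
# candidate referents are narrowed word by word, and the robot reacts
# before the utterance is complete. Scene objects are hypothetical.

SCENE = [
    {"name": "red book", "attrs": {"red", "book"}},
    {"name": "blue book", "attrs": {"blue", "book"}},
    {"name": "red cup", "attrs": {"red", "cup"}},
]

def process_incrementally(words):
    candidates = list(SCENE)
    for word in words:
        # Keep only objects consistent with the partial utterance so far;
        # words that match nothing ("put", "the") leave the set unchanged.
        matching = [o for o in candidates if word in o["attrs"]]
        if matching:
            candidates = matching
        if len(candidates) == 1:
            # Unique referent mid-utterance: acknowledge and act early.
            print(f"OK, got it -> reaching for the {candidates[0]['name']}")
            return candidates[0]
    # Still ambiguous at the end of the utterance: ask for clarification.
    options = " or ".join(o["name"] for o in candidates)
    print(f"Which one: the {options}?")
    return None

process_incrementally("put the red book on the shelf".split())
```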

Note that in this kind of interactive setting, vision and natural language processing can mutually and incrementally constrain each other. For example, visually observing a scene that is being talked about can support understanding of ambiguous or underspecified utterances while they are being processed: "the red book on the floor" will most likely refer to a book visible to the instructor and not the one behind her back. Similarly, a syntactically ambiguous sentence like "put the book on the table on the shelf" will become clear as soon as the robot detects a book on the table, thus using visually observed spatial relations to constrain parsing and semantic analysis.
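
The following minimal sketch illustrates this direction of the loop under simplified assumptions (the readings, relation names, and scene contents are made up for the example): syntactic readings whose presupposed spatial relations are not visually confirmed are simply filtered out.

```python
# An illustrative sketch (not the project's parser): visually observed
# spatial relations rule out syntactic readings whose presuppositions
# do not hold in the scene. All names and relations are made up.

# Two attachments for "put the book on the table on the shelf":
# reading 1: move [the book on the table] to [the shelf]
# reading 2: move [the book] to [the table on the shelf]
readings = [
    {"theme": "the book on the table", "goal": "the shelf",
     "presupposes": ("on", "book", "table")},
    {"theme": "the book", "goal": "the table on the shelf",
     "presupposes": ("on", "table", "shelf")},
]

# Spatial relations detected in the scene (assumed given by vision).
scene_relations = {("on", "book", "table"), ("on", "cup", "shelf")}

# Keep only readings whose presupposed relation is visually confirmed.
for r in readings:
    if r["presupposes"] in scene_relations:
        print(f"interpretation: put {r['theme']} on {r['goal']}")
# -> interpretation: put the book on the table on the shelf (reading 1)
```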

Conversely, incremental processing of a verbal description of a scene can direct visual processing to the relevant elements, e.g., "Put the red [here visual attention shifts to red image regions] shoe on [here attention shifts to horizontal supporting surfaces] the box". In addition, non-linguistic cues such as pointing and gaze direction can be incrementally integrated with partial meanings to steer attention to those elements of the scene relevant to the current discourse situation.
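
A toy version of this priming mechanism might look as follows; the regions, their properties, and the word-to-filter mapping are hypothetical stand-ins for real detector output and lexical semantics:

```python
# A toy model of language priming vision (hypothetical detector output):
# each incoming word activates a filter that narrows the attended regions.

regions = [
    {"id": "r1", "color": "red", "category": "shoe", "is_surface": False},
    {"id": "r2", "color": "red", "category": "cup", "is_surface": False},
    {"id": "r3", "color": "brown", "category": "box", "is_surface": True},
]

# Words mapped to predicates over regions; unknown words are ignored.
FILTERS = {
    "red": lambda r: r["color"] == "red",
    "shoe": lambda r: r["category"] == "shoe",
    "on": lambda r: r["is_surface"],  # retarget to supporting surfaces
    "box": lambda r: r["category"] == "box",
}

def attend(words):
    focus = list(regions)
    for w in words:
        pred = FILTERS.get(w)
        if pred is None:
            continue
        # "on" shifts attention to candidate goal surfaces among all
        # regions; other words narrow the current focus set.
        pool = regions if w == "on" else focus
        focus = [r for r in pool if pred(r)]
        print(f"after '{w}': attending to {[r['id'] for r in focus]}")
    return focus

attend("put the red shoe on the box".split())
```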

Against this background, visual scene and language understanding are tightly coupled, forming the vision-language loop. In this loop, language primes vision by modulating attention, and visually reconstructed scene elements are fed back as referents for language understanding. These processes are interleaved at a fine temporal granularity to make the best use of partial interpretations in both directions.
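
Schematically, and under the same simplifying assumptions as the sketches above, the loop could be organized around a shared state: each language increment tightens the visual constraints, and the referents vision returns are immediately available to the next language step. All names below are illustrative.

```python
# A schematic of the vision-language loop under our assumptions: each
# word modulates visual attention, and the referents vision returns are
# immediately available to constrain the next language increment.

scene = {"book1": {"red", "book", "floor"}, "book2": {"red", "book", "shelf"}}

def language_increment(word, state):
    # Language side: extend the partial interpretation and emit the
    # visual constraint implied so far (here simply the content words).
    if word in {"red", "book", "floor", "shelf"}:
        state["constraints"].add(word)

def vision_update(state):
    # Vision side: re-rank scene elements under the current constraints
    # and feed the surviving referents back to language understanding.
    state["referents"] = [
        name for name, attrs in scene.items()
        if state["constraints"] <= attrs
    ]

state = {"constraints": set(), "referents": []}
for word in "the red book on the floor".split():
    language_increment(word, state)  # language primes vision
    vision_update(state)             # vision feeds referents back
    print(f"after '{word}': referents = {state['referents']}")
# The referent set shrinks from {book1, book2} to {book1} as the
# utterance unfolds, before the sentence is even finished.
```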

In this project, we will tackle the critical problem of integrating visual scene understanding with natural language understanding in an unprecedented way. We believe that for robots to reach human-like performance in the kinds of simple, natural interactions described above, in natural human-like environments, the vision, natural language, and action subsystems of the robotic architecture need to be very tightly integrated so that they can mutually constrain each other. This, in turn, requires concurrent processing of vision, language, and actions, where all algorithms must be interruptible and able to incorporate new information incrementally, on the fly. It also requires a software framework that allows seamless integration of components and algorithms at a fine temporal granularity, down to tens of milliseconds.
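
One possible shape for such a framework contract, stated purely as an assumption on our part rather than an actual API, is sketched below: every component consumes input incrementally, works in small time slices of a few tens of milliseconds, and can be interrupted between slices.

```python
# One possible shape of such a framework contract (an assumption on our
# part, not the project's actual API): components consume input
# incrementally, work in small time slices, and can be interrupted.

import time
from abc import ABC, abstractmethod

class IncrementalComponent(ABC):
    @abstractmethod
    def feed(self, increment):
        """Incorporate a new piece of input (a word, a camera frame)."""

    @abstractmethod
    def step(self, budget_s: float):
        """Do at most ~budget_s seconds of work, then yield control."""

    @abstractmethod
    def interrupt(self):
        """Drop the current hypothesis when new input invalidates it."""

class ToyParser(IncrementalComponent):
    def __init__(self):
        self.words = []

    def feed(self, increment):
        self.words.append(increment)

    def step(self, budget_s: float = 0.02):
        deadline = time.monotonic() + budget_s
        while time.monotonic() < deadline:
            pass  # placeholder for real incremental parsing work

    def interrupt(self):
        self.words.clear()

# A scheduler could round-robin such components every few tens of
# milliseconds, interleaving vision, language, and action updates.
parser = ToyParser()
for word in "put the red book".split():
    parser.feed(word)
    parser.step(budget_s=0.02)
print(parser.words)
```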