Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind

Jan 31, 2026 · 1 min read

Hiring-manager view

This project, formerly listed as SIFToM, connects language, perception, and decision-making in embodied AI. It is a useful signal for applied scientist roles that require modeling user intent, reasoning under uncertainty, and evaluating behavior in simulated and real-world settings.

Scientific problem

Robots need to infer intended goals from spoken instructions even when the acoustic signal is noisy or ambiguous. The challenge is to model both what the speaker said and what a human listener likely perceived.

Method

  • Contributed to a theory-of-mind formulation for pragmatic spoken instruction following.
  • Used vision-language modeling and model-based mental inference to infer the goal the speaker intends for the robot from speech-related evidence and task context.
  • Evaluated the approach on simulated and real-world embodied task data.
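
To make the inference step concrete, the following is a minimal sketch of Bayesian goal inference over uncertain speech evidence. It is not the project's implementation: the function name infer_goal, the goal labels, the utterances, and every probability are hypothetical values chosen for illustration. It assumes a prior over goals derived from task context, a small set of speech recognition hypotheses with confidences, and a model of what a speaker would say for each goal.

```python
from typing import Dict


def infer_goal(
    prior: Dict[str, float],
    asr_hypotheses: Dict[str, float],
    likelihood: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Return a posterior over goals given uncertain speech evidence.

    prior[g]          : P(g | task context), hypothetical values
    asr_hypotheses[u] : recognizer confidence that utterance u was spoken
    likelihood[g][u]  : P(speaker says u | goal g), hypothetical values
    """
    unnormalized = {}
    for goal, p_goal in prior.items():
        # Marginalize over the candidate transcripts instead of
        # trusting only the top recognition hypothesis.
        evidence = sum(
            conf * likelihood[goal].get(utt, 0.0)
            for utt, conf in asr_hypotheses.items()
        )
        unnormalized[goal] = p_goal * evidence

    total = sum(unnormalized.values())
    if total == 0.0:
        # No goal explains any hypothesis; return the prior unchanged.
        return dict(prior)
    return {goal: value / total for goal, value in unnormalized.items()}


# Hypothetical scene: the human is making tea, so a cup is the likelier goal.
prior = {"fetch_cup": 0.8, "fetch_cap": 0.2}

# Noisy audio: the recognizer slightly prefers the wrong word.
asr_hypotheses = {"bring the cap": 0.6, "bring the cup": 0.4}

likelihood = {
    "fetch_cup": {"bring the cup": 0.9, "bring the cap": 0.1},
    "fetch_cap": {"bring the cap": 0.9, "bring the cup": 0.1},
}

posterior = infer_goal(prior, asr_hypotheses, likelihood)
best_goal = max(posterior, key=posterior.get)
print(best_goal, round(posterior[best_goal], 3))
```

In this toy case the task context outweighs the recognizer's top hypothesis, so the inferred goal is fetch_cup even though "bring the cap" scored higher acoustically. That is the kind of pragmatic correction the formulation above is meant to capture.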

Evaluation signal

The central evaluation question is whether the model improves goal inference and instruction-following robustness when speech, perception, and intent are uncertain.
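
One way to frame that question in code is to compare goal-inference accuracy across speech conditions. The sketch below is an assumption about how such a comparison could be organized, not the project's evaluation pipeline: the function accuracy_by_condition, the condition names, and the records are made-up placeholders and do not represent measured results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_condition(
    records: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Compute goal-inference accuracy per speech condition.

    Each record is (condition, predicted_goal, true_goal).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, predicted, true in records:
        total[condition] += 1
        correct[condition] += int(predicted == true)
    return {condition: correct[condition] / total[condition] for condition in total}


# Placeholder records for illustration only, not experimental data.
records = [
    ("clean", "fetch_cup", "fetch_cup"),
    ("clean", "fetch_cap", "fetch_cap"),
    ("noisy", "fetch_cup", "fetch_cup"),
    ("noisy", "fetch_cap", "fetch_cup"),
]

scores = accuracy_by_condition(records)
print(scores)
```

Reporting accuracy separately for each condition, rather than as one pooled number, shows whether robustness holds specifically where the speech signal is degraded.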