LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in …

SEAGULL: An Embodied Agent for Instruction Following through Situated Dialog

The growing demand for advanced AI necessitates the development of an intelligent agent capable of perceiving, reasoning, acting, and communicating within an embodied environment. We introduce SEAGULL, an interactive embodied agent designed for the …

DANLI: Deliberative Agent for Following Natural Language Instructions

Recent years have seen an increasing amount of work on embodied AI agents that can perform tasks by following human language instructions. However, most of these agents are reactive, meaning that they simply learn and imitate behaviors encountered in …

MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences

Human communication is multimodal in nature; it is through multiple modalities, such as language, voice, and facial expressions, that opinions and emotions are expressed. Data in this domain exhibits complex multi-relational and temporal interactions. …