
The growing demand for advanced AI necessitates the development of an intelligent agent capable of perceiving, reasoning, acting, and communicating within an embodied environment. We introduce SEAGULL, an interactive embodied agent designed for the inaugural Alexa Prize SimBot Challenge, which can complete complex tasks in the Arena simulation environment through dialog with users. SEAGULL is engineered to be efficient, user-centric, and continuously improving. To achieve these goals, we develop a modular system that combines neural and symbolic components. Our natural language understanding module employs a hierarchical pipeline to convert user utterances into logical symbolic representations of their intentions and semantics. Meanwhile, a neural vision module detects object classes, states, and spatial relations. These multi-sensory inputs are then processed by a state tracker to update the agent’s beliefs regarding the world state, user intentions, and task progress. A central policy interprets the neurosymbolic beliefs and selects one of several available skills, including navigation, planning, and dialog. We place particular emphasis on optimizing dialog flow and user experience, ensuring that users have a responsive, natural, informative, and engaging interaction with our bot. Furthermore, we have developed tools and pipelines to augment our vision and language data, continually enhancing our system’s robustness and performance.
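As a rough illustration of the control flow described above (language understanding and vision feeding a state tracker, whose beliefs drive a central policy that picks a skill), the following minimal Python sketch may help. It is not SEAGULL's implementation: every name in it (`Intent`, `Detection`, `Beliefs`, `understand`, `update_beliefs`, `select_skill`) is hypothetical, the keyword rules are toy stand-ins for the hierarchical NLU pipeline and the neural vision module, and the skill-selection rules are assumptions made only for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Intent:
    """Symbolic representation of a user utterance (hypothetical schema)."""
    action: str
    target: Optional[str] = None


@dataclass
class Detection:
    """One object reported by the vision module (hypothetical schema)."""
    object_class: str
    state: str


@dataclass
class Beliefs:
    """The agent's beliefs: observed object states and the current user intent."""
    objects: Dict[str, str] = field(default_factory=dict)  # class -> state
    intent: Optional[Intent] = None


def understand(utterance: str) -> Intent:
    """Toy keyword-based stand-in for the NLU module."""
    tokens = utterance.lower().rstrip(".!?").split()
    if not tokens:
        return Intent("unknown")
    if tokens[0] in {"go", "walk", "move"}:
        return Intent("goto", tokens[-1])
    if tokens[0] in {"pick", "grab", "take"}:
        return Intent("pickup", tokens[-1])
    return Intent("unknown")


def update_beliefs(beliefs: Beliefs, intent: Intent,
                   detections: List[Detection]) -> Beliefs:
    """State tracker: merge the language and vision inputs into the beliefs."""
    beliefs.intent = intent
    for det in detections:
        beliefs.objects[det.object_class] = det.state
    return beliefs


def select_skill(beliefs: Beliefs) -> str:
    """Central policy: choose a skill from the symbolic beliefs (assumed rules)."""
    intent = beliefs.intent
    if intent is None or intent.action == "unknown":
        return "dialog"      # ask the user to clarify
    if intent.target not in beliefs.objects:
        return "navigation"  # target not observed yet, so go look for it
    if intent.action == "goto":
        return "navigation"
    return "planning"        # target is known, so plan the manipulation
```

In this sketch the policy reads only the symbolic `Beliefs` object, so the perception components could be replaced by learned models without changing the policy's interface.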