Seven Xiaomi Papers Accepted at the Top Conference AAAI, Covering the Full Range of Frontier Fields!
自动驾驶之心 · 2025-12-22 03:23
Core Viewpoint
- Xiaomi has made significant strides in AI research, with seven papers accepted at AAAI 2026, demonstrating broad capability across AI domains including sound editing, speech question answering, embodied intelligence, and autonomous driving [5][6][41].

Group 1: Research Achievements
- The seven accepted papers span a wide range of AI research areas, reflecting Xiaomi's commitment to foundational technology and its long-term investment in AI [6][41].
- The topics covered are sound-effect editing, speech question answering, 3D embodied agents, vision-language navigation, retrieval models, inference decoding strategies, and autonomous driving [6][41].

Group 2: AutoLink Framework
- AutoLink tackles large-scale text-to-SQL by letting the model explore a database schema iteratively rather than loading it all at once, achieving strict schema-linking recall of 97.4% on Bird-Dev and 91.2% on Spider-2.0-Lite [9][10].
- The framework lets an LLM act as an agent that dynamically identifies the schema fragments relevant to SQL generation, improving both efficiency and scalability (a toy sketch of this exploration loop appears after this summary) [10].

Group 3: SpecFormer Model
- SpecFormer redefines the draft model in speculative decoding by combining unidirectional and bidirectional attention, enabling faster decoding without the need for complex draft trees [12][13][15].
- The draft model understands context while generating its predictions in parallel, yielding lower training costs and better hardware compatibility for large-scale deployment (the draft-and-verify loop it plugs into is sketched below) [15].

Group 4: CLSR for Long-form Speech
- CLSR (Contrastive Language-Speech Retriever) improves long-form speech question answering by retrieving the segments of a lengthy recording that are relevant to the question, improving both accuracy and efficiency [17][20].
- Filtering out irrelevant audio lets the large model concentrate on the key content, which markedly improves performance on speech Q&A tasks (a retrieve-then-answer sketch follows this summary) [20].

Group 5: AV-Edit for Sound Editing
- AV-Edit advances sound-effect editing by jointly modeling visual, audio, and textual semantics, enabling precise, contextually appropriate sound modifications [21][24].
- Its tri-modal generative framework produces high-quality edited audio that stays aligned with the video content, outperforming traditional methods [24].

Group 6: ORS3D for Task Scheduling
- ORS3D introduces a new task formulation for embodied agents that emphasizes parallel task execution and efficient scheduling in 3D environments (a greedy-scheduling toy example appears below) [26][29].
- The accompanying GRANT model adds scheduling tokens to optimize execution order, showing competitive performance in language understanding and spatial reasoning [28][29].

Group 7: SpNav for Spatial Navigation
- SpNav fills a gap in embodied navigation by pairing high-level human instructions with spatial understanding, so that robots can navigate complex environments effectively (a goal-grounding sketch follows) [33][35].
- The framework trains agents on a dataset of 10,000 trajectories to interpret spatial descriptions and execute precise navigation plans [35].

Group 8: VILTA for Autonomous Driving
- VILTA (VLA-in-the-Loop Trajectory Adversary) strengthens autonomous-driving policies by generating adversarial trajectories for rare, complex scenarios, improving system robustness (a perturbation-search sketch closes this article) [37][40].
- The method brings vision-language models into the loop to refine trajectory generation, ensuring the resulting paths are both diverse and physically feasible [40].
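To make the AutoLink idea concrete, here is a minimal Python sketch of an iterative schema-exploration loop. Everything in it is an assumption for illustration: the toy schema and the helper `toy_model_pick` (a keyword-matching stand-in for the LLM's exploration step) are hypothetical, since the paper's actual interfaces are not given in this summary.

```python
"""Minimal sketch of AutoLink-style iterative schema linking.
Assumptions: toy schema; `toy_model_pick` is a keyword-matching
stand-in for the LLM's exploration step, not the paper's method."""

# Toy database schema: table name -> column names.
SCHEMA = {
    "orders":    ["order_id", "customer_id", "order_date", "total"],
    "customers": ["customer_id", "name", "city"],
    "products":  ["product_id", "name", "price"],
}

def toy_model_pick(question, linked, candidates):
    """Stand-in for the LLM: pick tables sharing a keyword with the
    question that are not yet linked. Returns [] when done."""
    words = set(question.lower().replace("?", "").split())
    picks = [(t, cols) for t, cols in candidates.items()
             if t in words or t.rstrip("s") in words]
    return [(t, c) for t, c in picks if t not in linked]

def autolink(question, schema, max_rounds=5):
    """Grow a working sub-schema over rounds instead of prompting
    the model with the entire schema at once."""
    linked = {}
    for _ in range(max_rounds):
        picks = toy_model_pick(question, linked, schema)
        if not picks:          # model signals the sub-schema suffices
            break
        linked.update(picks)   # expand the linked schema incrementally
    return linked

if __name__ == "__main__":
    q = "Which customers placed orders in 2024?"
    print(autolink(q, SCHEMA))  # links only 'customers' and 'orders'
```

The structural point is that the model never sees the full schema up front; it accumulates only the fragments it judges relevant, which is what keeps the approach tractable on very large databases.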
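The next sketch shows the generic draft-and-verify loop that a draft model such as SpecFormer plugs into. It is a toy under stated assumptions: `draft_parallel` mimics SpecFormer's one-shot parallel drafting but does not implement its mixed uni-/bidirectional attention, and `target_next` is a hypothetical deterministic target model used only to make the loop runnable.

```python
"""Sketch of the speculative-decoding loop SpecFormer plugs into.
Assumptions: toy stand-in models; the real SpecFormer drafts all k
tokens in a single parallel forward pass, which `draft_parallel`
imitates with arithmetic on token ids."""

def draft_parallel(prefix, k):
    # Toy draft: propose k tokens at once (one forward pass,
    # not k autoregressive steps).
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]

def target_next(prefix):
    # Toy target model: deterministic "ground truth" next token.
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, k=4):
    """Accept the longest draft prefix the target model agrees with,
    then append one token from the target itself."""
    draft = draft_parallel(prefix, k)
    accepted = []
    for tok in draft:
        if tok == target_next(prefix + accepted):
            accepted.append(tok)       # draft token verified
        else:
            break                      # first mismatch stops acceptance
    accepted.append(target_next(prefix + accepted))  # free target token
    return accepted

if __name__ == "__main__":
    seq = [0]
    for _ in range(3):
        seq += speculative_step(seq)
    print(seq)   # several tokens emitted per target-model step
```

The speed-up comes from amortization: the expensive target model runs once per step but can ratify several draft tokens at a time, and a draft that proposes all k tokens in parallel (as SpecFormer does) removes the draft-side autoregressive bottleneck as well.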
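For CLSR, the retrieve-then-answer pattern can be sketched as below. The encoders `embed_text` and `embed_speech` are hypothetical random stand-ins for CLSR's contrastively trained towers, so the ranking here is mechanically illustrative only; the part the sketch shows is the pipeline shape, namely rank segments by similarity and pass only the top few to the LLM.

```python
"""Sketch of CLSR-style retrieve-then-answer over long audio.
Assumptions: `embed_text` / `embed_speech` are random stand-ins for
the paper's contrastively trained encoders."""

import numpy as np

rng = np.random.default_rng(0)

def embed_text(question, dim=64):
    # Stand-in for the trained text encoder.
    return rng.standard_normal(dim)

def embed_speech(segment, dim=64):
    # Stand-in for the trained speech encoder.
    return rng.standard_normal(dim)

def top_k_segments(question, segments, k=3):
    """Rank fixed-length speech segments by cosine similarity to the
    question and keep only the top-k for the downstream LLM."""
    q = embed_text(question)
    q = q / np.linalg.norm(q)
    scored = []
    for seg in segments:
        e = embed_speech(seg)
        e = e / np.linalg.norm(e)
        scored.append((float(q @ e), seg))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [seg for _, seg in scored[:k]]

if __name__ == "__main__":
    # A two-hour recording chopped into 30-second segments (stand-ins).
    segments = [f"segment_{i:03d}" for i in range(240)]
    picked = top_k_segments("When was the budget approved?", segments)
    print(picked)  # only these few segments enter the LLM context
```

With trained encoders the same top-k step is what shrinks hours of audio down to the handful of segments the answering model actually needs to read.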
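The scheduling intuition behind ORS3D can be shown with a toy example. This is a plain greedy heuristic, an assumption standing in for GRANT's learned scheduling tokens: each task is modeled as an active phase that occupies the agent plus a passive phase that runs unattended (a washing machine, say), and launching long passive phases first lets them overlap the remaining work.

```python
"""Sketch of the scheduling idea behind ORS3D. Assumptions: a simple
greedy heuristic, not the paper's learned scheduling tokens; tasks as
(active_minutes, passive_minutes) pairs."""

# (active_minutes, passive_minutes): active needs the agent,
# passive runs unattended afterwards.
TASKS = {
    "wipe_table":     (4, 0),
    "microwave_food": (1, 5),
    "boil_water":     (1, 8),
    "start_laundry":  (2, 40),
}

def makespan(order, tasks):
    """Simulate serial execution in `order`; each passive phase runs
    in the background after its active phase finishes."""
    t = 0
    finish = []
    for name in order:
        active, passive = tasks[name]
        t += active                 # agent busy for the active phase
        finish.append(t + passive)  # passive phase completes on its own
    return max(finish)

def greedy_order(tasks):
    # Launch long passive phases first so they overlap the rest.
    return sorted(tasks, key=lambda n: tasks[n][1], reverse=True)

if __name__ == "__main__":
    naive = list(TASKS)
    smart = greedy_order(TASKS)
    print("naive :", naive, makespan(naive, TASKS))   # 48 minutes
    print("greedy:", smart, makespan(smart, TASKS))   # 42 minutes
```

Starting the laundry first saves six minutes here; ORS3D's point is that an embodied agent should reason about exactly this kind of overlap, and GRANT bakes that reasoning into the model rather than a hand-written rule.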
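For SpNav, the sketch below shows what grounding a spatial instruction like "go to the chair to the left of the table" into a metric goal involves. It is a hand-written relation resolver plus a BFS planner on a toy grid, not the paper's learned agent; the object layout and relation rule are assumptions for illustration.

```python
"""Sketch of spatial-instruction grounding in the spirit of SpNav.
Assumptions: hand-written 'left of' rule and BFS planner; the real
system learns this from 10,000 trajectories."""

from collections import deque

GRID_W, GRID_H = 10, 8
OBSTACLES = {(4, 4), (4, 5), (4, 6)}   # a short wall segment
OBJECTS = {"table": (6, 5), "chair_a": (2, 5), "chair_b": (8, 5)}

def resolve_goal():
    """'The chair to the left of the table' -> the chair whose
    x-coordinate is smaller than the table's (assumes one exists)."""
    tx, _ = OBJECTS["table"]
    chairs = [n for n in OBJECTS if n.startswith("chair")]
    name = min((n for n in chairs if OBJECTS[n][0] < tx),
               key=lambda n: OBJECTS[n][0])
    return OBJECTS[name]

def bfs(start, goal):
    """Shortest 4-connected grid path avoiding obstacles."""
    queue, seen = deque([(start, [start])]), {start}
    while queue:
        (x, y), path = queue.popleft()
        if (x, y) == goal:
            return path
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < GRID_W and 0 <= ny < GRID_H
                    and (nx, ny) not in OBSTACLES
                    and (nx, ny) not in seen):
                seen.add((nx, ny))
                queue.append(((nx, ny), path + [(nx, ny)]))
    return None

if __name__ == "__main__":
    print(bfs((0, 0), resolve_goal()))  # path to chair_a at (2, 5)
```

The gap SpNav targets is exactly the first function: turning a relational description into a concrete goal before any path planning can happen.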
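Finally, the "adversarial yet physically feasible" constraint behind VILTA can be illustrated with a toy perturbation search. Random hill-climbing with a kinematic feasibility filter stands in for the paper's VLM-guided adversary; the trajectories, limits, and scoring here are all assumptions for illustration.

```python
"""Sketch of adversarial trajectory search in the spirit of VILTA.
Assumptions: random hill-climbing plus a kinematic feasibility check,
standing in for the paper's VLM-guided trajectory generator."""

import numpy as np

rng = np.random.default_rng(1)
DT, V_MAX, A_MAX = 0.5, 15.0, 4.0   # step (s), speed (m/s), accel (m/s^2)

def feasible(traj):
    """Reject trajectories exceeding speed or acceleration limits."""
    v = np.diff(traj, axis=0) / DT
    a = np.diff(v, axis=0) / DT
    return (np.linalg.norm(v, axis=1) <= V_MAX).all() and \
           (np.linalg.norm(a, axis=1) <= A_MAX).all()

def adversarial_score(adv_traj, ego_traj):
    # Smaller minimum gap to the ego = harder scenario.
    return -np.min(np.linalg.norm(adv_traj - ego_traj, axis=1))

def search(adv_traj, ego_traj, iters=200, sigma=0.3):
    """Hill-climb on small perturbations, keeping only feasible ones."""
    best, best_s = adv_traj, adversarial_score(adv_traj, ego_traj)
    for _ in range(iters):
        cand = best + rng.normal(0, sigma, best.shape)
        cand[0] = adv_traj[0]        # the starting pose stays fixed
        if feasible(cand):
            s = adversarial_score(cand, ego_traj)
            if s > best_s:
                best, best_s = cand, s
    return best, -best_s

if __name__ == "__main__":
    t = np.arange(10)[:, None] * DT
    ego = np.hstack([t * 10.0, np.zeros_like(t)])           # ego, straight
    adv = np.hstack([t * 10.0 + 8.0, np.ones_like(t) * 4])  # neighbour lane
    tightened, gap = search(adv, ego)
    print(f"min gap after search: {gap:.2f} m")  # closer, still feasible
```

The feasibility filter is the essential piece: an adversary that violates vehicle kinematics produces scenarios no planner could or should be trained against, whereas VILTA's point is to stress the policy with hard cases that remain physically plausible.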