Vision-Language-Action Models (VLAs)
Cold test of "Linglong One," the world's first reactor of its kind, succeeds; Fudan team builds a solar cell that does away with lead pollution, reaching 17.7% power conversion efficiency | Smart Manufacturing Daily
创业邦 (Cyzone) · 2025-10-17 03:24
Group 1
- The world's first sub-angstrom snapshot spectral imaging chip, "Yuheng," has been developed by a team led by Professor Fang Lu at Tsinghua University, marking a significant advance in intelligent photonics for high-precision imaging and measurement [2]
- RoboChallenge, the first large-scale, multi-task benchmark for real robots operating in real physical environments, has been launched, providing reliable and comparable evaluation standards for vision-language-action models in robotics [2]

Group 2
- A new artificial muscle developed by a research team at Ulsan National Institute of Science and Technology can switch between "soft and flexible" and "hard and powerful" states and can lift objects 4,000 times its own weight, which may advance soft robotics, wearable devices, and medical assistive technologies [4][5]
- Fudan University has developed a tin-based perovskite solar cell that achieves a world-record power conversion efficiency of 17.7% while remaining environmentally friendly across its entire life cycle [5]
- The "Linglong No. 1" modular small reactor has successfully completed its cold test, laying a solid foundation for subsequent fuel loading and commercial operation; it is expected to generate 1 billion kWh annually, enough to meet the electricity needs of 526,000 households in Hainan and cut carbon emissions by approximately 880,000 tons [5]
The RoboChallenge benchmark opens the era of real-robot testing! The Robot ETF (562500), "the market's only robot ETF at the 20-billion scale," trades weakly in the morning session as industry leaders pull back
Mei Ri Jing Ji Xin Wen · 2025-10-16 06:32
Group 1
- The Robot ETF (562500) is down slightly by 0.97%, hovering near its opening price, with a trading volume of approximately 314 million shares and a turnover of 321 million yuan, indicating active trading amid cautious market sentiment [1]
- Of the 73 constituent stocks, 13 have risen and 59 have fallen, reflecting a bearish tone with industry leaders under pressure and suggesting continued expectations of short-term adjustment [1]
- RoboChallenge, a large-scale benchmark for robots in real physical environments, has been launched, addressing performance-validation challenges and providing reliable evaluation standards for vision-language-action models (VLAs) in robotics [1]

Group 2
- Nomura Orient Securities is optimistic about investment opportunities in the Tesla humanoid robot supply chain, with key catalysts including the upcoming third-generation humanoid robot and mass production expected in 2026 [2]
- The Robot ETF (562500) is the only robot-themed ETF in the market with assets exceeding 20 billion yuan, covering segments such as humanoid robots, industrial robots, and service robots and giving investors access to the entire robotics supply chain [2]
Embodied AI moves into the real world! RoboChallenge: from simulation to physical robots, the world's first large-scale multi-task real-robot benchmark
具身智能之心 · 2025-10-15 11:03
Core Insights
- The article covers the launch of RoboChallenge, a large-scale, multi-task benchmark platform for embodied intelligence initiated by Dexmal and Hugging Face, aimed at addressing the lack of real-robot testing in the field [5][41]

Group 1: Challenges in the Embodied Intelligence Field
- The embodied intelligence sector has advanced rapidly, but the absence of real-robot testing and the limitations of existing evaluation systems have become significant bottlenecks [3][4]
- Current mainstream benchmarks rely primarily on simulation environments, so algorithms that perform well in simulation often fail in real-world applications [4][10]

Group 2: Introduction of RoboChallenge
- RoboChallenge is the first large-scale benchmark platform on which real robots perform tasks in a physical environment, providing a more reliable and comparable evaluation standard for vision-language-action models (VLAs) [5][10]
- The platform aims to overcome challenges in real-environment performance validation, standardized testing conditions, and accessibility [5][10]

Group 3: Features of RoboChallenge
- RoboChallenge adopts a "remote robot" paradigm that lets users drive real machines without owning hardware, lowering the entry barrier for researchers and developers (a client-side sketch of this paradigm follows this list) [15][19]
- The platform supports a wide range of tasks, with an initial benchmark set (Table30) comprising 30 diverse tasks designed to evaluate the core capabilities of VLA models [12][26]

Group 4: Evaluation Mechanism
- The evaluation mechanism combines end-to-end task success rates with process scoring, ensuring a rigorous and transparent assessment of models (see the scoring sketch below) [16][20]
- RoboChallenge uses a "visual input matching" method to keep testing conditions consistent, reducing variability introduced by human testers (illustrated in the third sketch below) [23][25]

Group 5: Open and Collaborative Ecosystem
- RoboChallenge promotes an open ecosystem by offering free evaluation services, publicly sharing task demonstration data, and keeping results transparent [34][41]
- The platform encourages collaboration among researchers, developers, and industry professionals, fostering innovation in embodied intelligence [38][41]

Group 6: Future Directions
- RoboChallenge plans to expand by introducing more robot types and more challenging tasks, aiming to strengthen the evaluation of embodied intelligence in real-world scenarios [42]
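The "remote robot" paradigm in Group 3 boils down to an observe-act loop against a hosted robot. Below is a minimal Python sketch of what a policy client for such a setup could look like; the endpoint URL, JSON field names, and the `policy` callable are all hypothetical, since the article does not document RoboChallenge's actual API.

```python
import base64
import io

import requests
from PIL import Image

SERVER = "https://robochallenge.example/api"  # hypothetical endpoint, for illustration only

def get_observation(session_id: str) -> dict:
    """Fetch the latest camera frame and task instruction from the remote robot."""
    resp = requests.get(f"{SERVER}/sessions/{session_id}/observation", timeout=30)
    resp.raise_for_status()
    # Assumed payload shape: {"image": <base64 JPEG>, "instruction": str, "done": bool}
    return resp.json()

def send_action(session_id: str, action: list[float]) -> None:
    """Send one low-level action (e.g. a joint or end-effector command) to the robot."""
    resp = requests.post(f"{SERVER}/sessions/{session_id}/action",
                         json={"action": action}, timeout=30)
    resp.raise_for_status()

def run_episode(session_id: str, policy) -> None:
    """Drive one evaluation episode: observe, act, repeat until the server ends it."""
    while True:
        obs = get_observation(session_id)
        if obs.get("done"):
            break
        image = Image.open(io.BytesIO(base64.b64decode(obs["image"])))
        action = policy(image, obs["instruction"])  # user-supplied VLA model
        send_action(session_id, action)
```

The point of the paradigm is that everything model-side stays on the user's machine; only observations and actions cross the network, which is what removes the hardware barrier for participants.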
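The evaluation mechanism in Group 4 pairs a binary end-to-end outcome with partial credit for intermediate progress. A minimal sketch of how such a combined score could be computed is below; the milestone fields and the simple averaging are assumptions, as the article does not specify RoboChallenge's exact scoring formula.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    success: bool          # did the episode complete the task end-to-end?
    milestones_hit: int    # intermediate steps credited during the run
    milestones_total: int  # milestones defined for this task

def task_scores(episodes: list[EpisodeResult]) -> tuple[float, float]:
    """Return (end-to-end success rate, mean process score) over all episodes."""
    success_rate = sum(e.success for e in episodes) / len(episodes)
    process_score = sum(e.milestones_hit / e.milestones_total
                        for e in episodes) / len(episodes)
    return success_rate, process_score

# e.g. two runs of a task with 4 milestones: one full success, one partial
results = [EpisodeResult(True, 4, 4), EpisodeResult(False, 2, 4)]
print(task_scores(results))  # -> (0.5, 0.75)
```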
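"Visual input matching" (Group 4) can be pictured as checking that the live camera view matches a reference image of the approved scene layout before an episode starts, so every model sees the same setup. The mean-absolute-difference test below is an assumed stand-in; the article does not state which similarity measure RoboChallenge actually uses.

```python
import numpy as np
from PIL import Image

def scene_matches(live_path: str, reference_path: str, tol: float = 12.0) -> bool:
    """Return True if the live frame is pixel-wise close to the reference frame."""
    live = np.asarray(Image.open(live_path).convert("RGB"), dtype=np.float32)
    ref = np.asarray(Image.open(reference_path).convert("RGB"), dtype=np.float32)
    if live.shape != ref.shape:
        return False
    return float(np.abs(live - ref).mean()) <= tol  # mean per-channel difference
```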
New from Princeton! VLM2VLA: fine-tuning a VLM into a VLA while avoiding catastrophic forgetting
具身智能之心 · 2025-10-07 10:00
Core Insights
- The article examines the catastrophic forgetting problem that arises when fine-tuning vision-language models (VLMs) into vision-language-action (VLA) models for robotic control, tracing it to the mismatch between pre-training and fine-tuning data distributions [2][4]

Group 1: Catastrophic Forgetting
- Catastrophic forgetting occurs when the model loses its original reasoning and multimodal understanding capabilities during action-generation training [2]
- The root cause is the distribution mismatch between internet-scale pre-training data (primarily image-text pairs) and the low-dimensional action vectors used for robotic fine-tuning [2]

Group 2: VLM2VLA Approach
- VLM2VLA addresses the distribution mismatch by converting low-dimensional actions into natural-language descriptions, aligning the fine-tuning data with the pre-training data (a toy sketch of this conversion follows this list) [3][4]
- The method uses low-rank adaptation (LoRA) for fine-tuning, minimizing modifications to the VLM backbone and avoiding catastrophic forgetting [4]

Group 3: Hierarchical Action Representation
- The VLM2VLA framework decomposes action prediction into a three-level reasoning process that uses natural-language descriptions at every level (sketched below) [6]
- High-level subtask prediction generates intermediate tasks from initial observations and the overall task instruction [6]
- Mid-level motion planning produces spatially grounded movement descriptions, while low-level action generation emits executable action sequences with language annotations [6]

Group 4: Data Reconstruction Pipeline
- VLM2VLA uses Gemini 2.5 to automatically reconstruct raw robot trajectory datasets into language-annotated datasets compatible with VLM pre-training formats [9]
- The reconstruction process provides context, decomposes trajectories into subtasks, and standardizes the format to align with VLM data [9]

Group 5: Efficient Fine-Tuning Strategy
- The Gemma-3-12B-IT model is fine-tuned with LoRA on its linear layers, without altering the VLM architecture or requiring joint training on internet-scale data (a configuration sketch follows below) [12][13]
- Key training parameters: LoRA rank 16, learning rate 5e-5, and an effective batch size of 8 [12][13]

Group 6: Experimental Validation
- Experiments focus on three core questions, comparing VLM2VLA against baseline models on retention of multimodal understanding, competitiveness in robotic manipulation, and generalization of knowledge to new scenarios [14][15]
- VLM2VLA performs competitively on both in-distribution and out-of-distribution tasks, demonstrating its hierarchical reasoning capabilities [17][19]

Group 7: Limitations and Future Directions
- The model currently faces challenges such as reasoning latency and the need for larger-scale language-annotated robot datasets to improve generalization [19]
- Future work may optimize decoding strategies, expand language annotation to dexterous actions, and integrate verification capabilities into the VLM itself [19][22]
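The key move in Group 2, expressing low-dimensional robot actions as natural language so the fine-tuning data resembles the image-text pairs the VLM was pre-trained on, can be illustrated with a toy converter. The axis conventions, 1 cm threshold, and phrasing below are assumptions for illustration; the paper's actual annotations (produced via Gemini 2.5) are not reproduced here.

```python
import numpy as np

def action_to_language(delta: np.ndarray, gripper: float, step_cm: float = 1.0) -> str:
    """Render a (dx, dy, dz) end-effector delta plus a gripper command as text."""
    axes = [("forward", "backward"), ("left", "right"), ("up", "down")]
    parts = []
    for value, (pos, neg) in zip(delta, axes):
        cm = value * 100  # metres -> centimetres
        if abs(cm) >= step_cm:  # drop negligible motion along this axis
            parts.append(f"move {pos if cm > 0 else neg} {abs(cm):.0f} cm")
    parts.append("close the gripper" if gripper < 0.5 else "open the gripper")
    return ", then ".join(parts)

# e.g. action_to_language(np.array([0.03, 0.0, -0.02]), gripper=0.0)
# -> "move forward 3 cm, then move down 2 cm, then close the gripper"
```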
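Group 3's three-level decomposition can be pictured as three chained natural-language queries to the fine-tuned model. The prompt wording and the `vlm` callable below are hypothetical; the article describes the levels but not the exact templates.

```python
def predict_action(vlm, image, task: str) -> str:
    """Chain high-level subtask -> mid-level motion -> low-level actions, all in language."""
    subtask = vlm(image, f"Task: {task}\nWhat subtask should be done next?")
    motion = vlm(image, f"Subtask: {subtask}\nDescribe the motion needed to perform it.")
    actions = vlm(image, f"Motion: {motion}\nList the low-level actions, annotated in language.")
    return actions
```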
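Group 5's fine-tuning recipe maps naturally onto the Hugging Face `peft` library. A sketch of such a setup is below, using the hyperparameters the article reports (rank 16, learning rate 5e-5, effective batch size 8); the target-module list, `lora_alpha`, checkpoint id, and batch-size split are assumptions, and this is not the authors' actual training code.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText, TrainingArguments

# Load the backbone named in the article (requires a recent transformers release;
# the exact checkpoint id is assumed).
model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-12b-it")

lora_config = LoraConfig(
    r=16,                   # LoRA rank reported in the article
    lora_alpha=32,          # assumed; not stated in the article
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # the "linear layers"
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # adapters train; backbone stays frozen

training_args = TrainingArguments(
    output_dir="vlm2vla-lora",
    learning_rate=5e-5,              # from the article
    per_device_train_batch_size=2,   # assumed split of the
    gradient_accumulation_steps=4,   # effective batch size of 8
)
```

Because only the low-rank adapters receive gradients, the pre-trained weights carrying the VLM's multimodal knowledge are left untouched, which is the mechanism the article credits for avoiding catastrophic forgetting.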