Synthetic Data
Tsinghua's Deng Zhidong: "World Model Agents" Are Reshaping Intelligent Driving, and the Compute Race Has Begun
Xin Jing Bao· 2025-09-30 07:34
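In the cloud, pre-training on massive volumes of real and synthetic data and building the world model may require hundreds of thousands of AI accelerator cards and tens of EFLOPS (1 EFLOPS = 10^18 floating-point operations per second) of compute, which creates a very high capital and technical barrier. On the vehicle side, to deliver low-cost, low-latency, high-efficiency real-time response, the compute demanded of in-vehicle AI chips is moving from today's peak of 500-600 TOPS toward 2,500 TOPS and beyond. The race tests not only how much companies can invest, but also their combined strength in chip design, architectural innovation, and system integration.
On the data challenge the industry cares most about, especially the "failure to acclimatize" that a system like FSD may face on entering China, Deng Zhidong said that a taxi driver is safer than a novice mainly because he has accumulated far more driving mileage, not because he is smarter or knows more from books. For autonomous driving to become safer than human drivers under the world-model-agent approach, the AI must learn from more than a thousand times the mileage of a human driver. Collecting real data through on-road testing is extremely costly and slow, so using "digital twin" technology to generate massive amounts of "synthetic data" has become the way to break the deadlock. This means that companies able to provide high-quality simulation platforms and data services will be more valuable in the future industry chain.
Deng Zhidong said that a fierce "compute arms race" is already under way, a two-front war fought simultaneously in the cloud and in the vehicle.
Xin Jing Bao Beike Finance (reporter Lin Zi): "Intelligent driving is entering its ...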
It Is AGI, Not Scaling Laws, That Has Hit the Wall.
自动驾驶之心 (Heart of Autonomous Driving) · 2025-09-28 23:33
Core Viewpoint
- The article posits that scaling laws do not necessarily lead to AGI (Artificial General Intelligence) and may even diverge from it, suggesting that the underlying data structure is a critical factor in the effectiveness of AI models [1].
Group 1: Data and Scaling Laws
- The scaling laws are described as an intrinsic property of the underlying data, indicating that the performance of AI models is heavily reliant on the quality and distribution of the training data [14].
- It is argued that the raw internet data mix is unlikely to provide the optimal data distribution for achieving AGI, as not all tokens are equally valuable, yet the same computational resources are allocated per token during training [15].
- The article emphasizes that the internet data, while abundant, is actually sparse in terms of useful contributions, leading to a situation where AI models often only achieve superficial improvements rather than addressing core issues [8].
Group 2: Model Development and Specialization
- GPT-4 is noted to have largely exhausted the available internet data, resulting in a form of intelligence that is primarily based on language expression rather than specialized knowledge in specific fields [9].
- The introduction of synthetic data by Anthropic in models like Claude Opus 3 has led to improved capabilities in coding, indicating a shift towards more specialized training data [10].
- The trend continues with GPT-5, which is characterized by a smaller model size but greater specialization, leading to a decline in general conversational abilities that users have come to expect [12].
Group 3: Economic Considerations and Industry Trends
- Due to cost pressures, AI companies are likely to move away from general-purpose models and focus on high-value areas such as coding and search, which are projected to have significant market valuations [7][12].
- The article raises concerns about the sustainability of a single language model's path to AGI, suggesting that the reliance on a "you feed me" deep learning paradigm limits the broader impact of AI on a global scale [12].
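Fudan University's Dou Dejing on China's AI Development: Strengthen Guidance on Scenario Applications and Build Competitiveness in Trusted Data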
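Dou Dejing
◎ Reporter Li Xingcai
Recently, on the 23rd session of the Shanghai Securities News Chief Forum (上证首席讲坛), Dou Dejing, distinguished professor at Fudan University's School of Computer Science and chief scientist of 北电数智 (Beidian Shuzhi), gave an accessible talk on the breakthrough points and future application scenarios of large AI models, and also gave an exclusive interview to a Shanghai Securities News reporter.
A veteran scholar and industry practitioner in artificial intelligence, Dou Dejing has worked in AI for more than two decades. He has witnessed the industry's ups and downs and has taken part in the full chain from research and development to industrial deployment. As generative AI sets off a global wave of change, he brings a perspective that spans industry, academia, and research to interpret the core logic and future opportunities of China's AI development.
To break through, AI must take root in concrete scenarios
From the halls of academia to the front lines of industry, Dou Dejing's résumé traces a cross-disciplinary path of industrial and personal growth.
In 1996, after earning his bachelor's degree from Tsinghua University's Department of Electronic Engineering, Dou Dejing went to Yale University for a master's degree in electrical engineering, and then pursued a doctorate in artificial intelligence under the world-renowned AI scholar Drew McDermott. He later served as a visiting associate professor at Stanford University's biomedical informatics research center and as a full professor in the Department of Computer and Information Science at the University of Oregon. He has published more than 250 papers, with over 13,000 citations on Google Scholar, and has become a well-known scholar in the international AI community.
After 2010, as breakthroughs in deep learning brought AI its third boom, Dou Dejing chose to devote himself to industry practice ...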
Robots Go to School in Beijing
经济观察报 (The Economic Observer) · 2025-09-21 04:57
Core Viewpoint
- The article emphasizes the importance of high-quality data in the development of embodied intelligence, highlighting that this data must be collected in real or simulated environments to train robots effectively, similar to teaching a child through demonstration and correction [1][5].
Group 1: Data Collection and Training
- In Beijing, various companies and institutions are establishing data collection centers for embodied intelligence, with a focus on creating immersive environments that replicate real-life scenarios for robots to learn tasks like opening refrigerators and serving tea [3][4].
- The training process involves thousands of data collectors who perform repetitive tasks to teach robots to execute actions naturally and accurately, with a significant emphasis on the quality of the data collected [4][22].
- The Beijing Humanoid Robot Innovation Center has created 1:1 replicas of various environments, such as kitchens and supermarkets, to facilitate realistic training for robots [6][8].
Group 2: Economic Value of Data
- High-quality embodied intelligence data is now recognized as having clear economic value, being tradable and eligible for government subsidies, which can aid in financing and expanding applications [5][12].
- The Beijing Economic and Technological Development Zone has introduced measures to incentivize data collection, including financial rewards for high-quality data sets and the issuance of "data vouchers" to support businesses [17][18].
Group 3: Technological Approaches
- The industry is currently exploring diverse technological routes for data collection, with some companies focusing on real-world data while others prioritize synthetic data for efficiency and cost-effectiveness [29][30].
- Companies like Galaxy General are adopting a "virtual-real combination" approach, using synthetic data primarily while supplementing it with real data for fine-tuning, which significantly enhances training efficiency [30][31].
Group 4: Workforce and Training Roles
- The role of data collectors, now termed embodied intelligence trainers, is crucial in the data collection process, requiring physical capability and coordination to perform tasks that robots will eventually learn [24][25].
- The job market for data collectors is evolving, with companies seeking individuals who can adapt to the physical demands of the role, and there is a growing trend of remote data collection systems being implemented [26][28].
Robots Go to School in Beijing
Jing Ji Guan Cha Wang· 2025-09-21 03:37
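Economic Observer reporter Zhou Yue
Folding clothes is the first household lesson that Qianxun Intelligence (千寻智能) teaches its robots.
In an office building in Haidian, Beijing, a data collector sits in front of a robotic arm: pick up, align, fold, put down. Each motion is repeated more than a hundred times, just so the robot learns to do housework "like a person".
Similar training is under way in parallel across Beijing. To the west, at the Shijingshan Humanoid Robot Data Training Center, more than a hundred robots learn actions such as opening doors, picking up objects, and arranging flowers in a "nine-year integrated schooling" training area and a "robot university" scenario area. To the south, in the Beijing Economic-Technological Development Area (hereafter "Beijing Yizhuang"), the Beijing Humanoid Robot Innovation Center (the National and Local Co-built Embodied Intelligence Robot Innovation Center, hereafter "Beijing Humanoid") has replicated kitchens, living rooms, supermarkets, gas stations, and other spaces at 1:1 scale to create an immersive data collection factory. Around several hundred data collection robots are spread across the building, including humanoid, wheeled, and robotic-arm types.
The Economic Observer found in its visits that many companies and institutions in Beijing have set up data collection centers, including the Zhiyuan Research Institute (智源研究院, BAAI), Galaxy General, the Beijing Humanoid Robot Innovation Center, Xinghaitu (星海图), and Qianxun Intelligence, with teams ranging from thirty or forty people to more than a hundred.
Embodied intelligence is currently in a "hundred schools of thought contend" phase of technical exploration, with diverse routes, but one consensus is becoming clear: high-quality data is the key to whether robots can leave the lab and truly enter society.
Unlike large language models, which rely on massive text corpora, embodied intelligence models must learn actions, language, vision, and other multimodal ...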
Data: 99% + 1% Can Get "From 0 to 10,000". Galaxy General's Wang He: Let Robots Ditch the Remote Control and Work with "Eyes Open"
Xin Hua She· 2025-09-15 21:46
Core Insights
- The article discusses the advancements in humanoid robots, particularly focusing on the capabilities of the Galbot developed by Beijing Galaxy General Robotics Co., which can operate autonomously without remote control [2][3][6].
- The key challenge in achieving true autonomy in robots lies in the quality and richness of data, which is essential for enhancing their cognitive abilities and adaptability [3][9][10].
Group 1: Company Developments
- Beijing Galaxy General Robotics has launched the world's first city-level humanoid robot demonstration zone, featuring an unmanned supermarket operated by robots [2].
- The company has successfully implemented humanoid robots in various sectors, including industrial applications like assembly line handling and sorting, as well as retail, with plans to open 100 smart pharmacies nationwide by the end of the year [5][12].
- The Galbot has achieved notable success in competitions, showcasing its advanced capabilities compared to other robots that often rely on pre-programmed sequences and remote control [2][5].
Group 2: Technological Challenges
- The transition from action intelligence to cognitive intelligence in robots is heavily dependent on the availability of high-quality data, which is crucial for improving their robustness and generalization capabilities [3][9].
- The current landscape shows a division among robotics companies, with some focusing on showcasing impressive movements while others, like Galaxy General, prioritize practical applications in real-world scenarios [4][12].
- The company emphasizes that achieving a commercial breakthrough in robotics will depend on identifying scalable applications that can be replicated across various environments [12].
Group 3: Data and Model Development
- High-quality synthetic data is deemed essential for training robots, with 99% of their capabilities potentially derived from such data, while only 1% requires real-world data collection [9][10].
- The development of a closed-loop feedback model is critical for enabling robots to perform tasks autonomously across different scenarios, which Galaxy General is actively pursuing [6][7].
- The company believes that the quality of data is more important than quantity, as diverse and representative data sets lead to more effective learning and adaptability in robots [10][11].
Robots Cross the "Three Gates": Realities and Trends Witnessed by Embodied Intelligence Innovators | Yishiting (议事厅)
Xin Hua Wang· 2025-09-15 03:44
Group 1
- The humanoid robot industry is experiencing a dichotomy, with significant advancements in practical applications contrasted by challenges in scaling production and securing orders [1][5][36].
- Investment in humanoid robotics has surged, with over 20 companies in the sector pursuing IPOs, marking a transformative year for mass production of humanoid robots [1][5].
- The development of embodied intelligence is at a crossroads, requiring a balance between technological innovation and practical profitability [1][15].
Group 2
- Companies like Beijing Galaxy General Robotics are leading the way in deploying humanoid robots in various sectors, achieving significant milestones in industrial and retail applications [5][8].
- The key challenge for humanoid robots lies in their ability to operate autonomously without remote control, which is dependent on advanced data and model training [10][12].
- High-quality data is crucial for enhancing the capabilities of humanoid robots, with a focus on diverse and rich datasets to improve their performance in real-world scenarios [12][30].
Group 3
- The success of humanoid robots in competitive environments, such as soccer, demonstrates their potential for real-world applications and helps in refining their operational capabilities [36][41].
- The industry faces a "chicken or egg" dilemma, where technological advancements must align with market demand to create a sustainable business model [37][42].
- The transition from demonstration to practical application is essential for the industry, with a focus on creating a commercial ecosystem that supports ongoing development and deployment [35][42].
Galaxy General's Zhang Zhizheng: Developing Embodied Large Models Requires Upwards of a Trillion Data Points
Di Yi Cai Jing· 2025-09-11 07:33
Group 1
- The development of embodied large models may require trillions of data points, as stated by Zhang Zhizheng, co-founder of Galaxy General Robotics [1].
- The challenge lies in the insufficient and unsustainable nature of real data collection, making synthetic data an inevitable choice [1].
The "Poison" and "Cure" of Synthetic Data: What's New in Understanding Model Collapse?
机器之心 (Jiqizhixin) · 2025-08-30 01:30
Group 1
- The core viewpoint of the article highlights the advancements in synthetic data research, particularly in understanding the collapse mechanisms of models during self-training with synthetic data and establishing application processes in various stages of model development [1].
Group 2
- Research over the past year has revealed new findings regarding the "toxicity" of synthetic data, indicating that model collapse occurs during iterative training, leading to a gradual pollution of the training dataset [5].
- In the early collapse stage, models begin to lose information about the distribution tails (low-probability events), while in the late collapse stage, models converge to outputs that bear little resemblance to the original data distribution [6][7].
- The occurrence of this collapse is influenced by model design, learning processes, and the quality of the data used [7].
- Various generative models, including language models, Variational Autoencoders (VAE), and Gaussian Mixture Models (GMM), are prone to collapse phenomena [8].
- However, some researchers argue that the risks of model collapse may be overstated, suggesting that maintaining a certain proportion of real data and following proper training processes can mitigate these issues [4][5].
Group 3
- Despite the risks associated with model collapse, synthetic data plays an irreplaceable role in model training, prompting the industry to propose a systematic framework for generating and applying synthetic data [9].
- A table summarizing the usage of synthetic data across various stages of model training is referenced, indicating its significance in pre-training, fine-tuning, post-training, and evaluation [10].
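The collapse mechanism summarized above can be illustrated with a toy simulation. The sketch below is not taken from the article: it assumes a one-dimensional Gaussian "model" that is refit each generation on samples drawn from the previous fit, and the function name `simulate`, the sample size, and the `real_fraction` parameter are all hypothetical choices made for illustration. It only shows the general tendency for the fitted spread to shrink under purely synthetic retraining, and for a share of fresh real data to counteract that.

```python
import random
import statistics


def simulate(generations, n, real_fraction, seed=0):
    """Refit a Gaussian each generation on data sampled from the previous fit.

    real_fraction is the share of every generation's training set that is
    drawn fresh from the true distribution N(0, 1); the rest is synthetic,
    i.e. sampled from the previous generation's fitted Gaussian.
    Returns the fitted standard deviation of each generation.
    """
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # generation 0: all real
    stds = []
    for _ in range(generations + 1):
        # "Train" the toy model: estimate mean and standard deviation.
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        stds.append(sigma)
        # Build the next training set from real and synthetic samples.
        n_real = round(n * real_fraction)
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        data = real + synthetic
    return stds


if __name__ == "__main__":
    pure = simulate(generations=1000, n=10, real_fraction=0.0)
    mixed = simulate(generations=1000, n=10, real_fraction=0.5)
    print("synthetic only: first std", pure[0], "last std", pure[-1])
    print("half real data: first std", mixed[0], "last std", mixed[-1])
```

With a small sample size, each refit loses a little of the spread on average, so the tails disappear first and the fitted distribution eventually narrows to almost a single point. Keeping half of each training set real anchors the estimate to the true distribution, which is the kind of mitigation the more optimistic researchers describe.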
Tsinghua University's Zhang Xiaojin on Data Annotation: Wherever High-Quality Datasets Go, AI Follows
Nan Fang Du Shi Bao· 2025-08-29 06:50
Core Insights
- The data annotation industry is at a new strategic stage, indicating a maturation process with evolving roles and responsibilities among companies [3].
- The relationship between high-quality datasets and artificial intelligence is symbiotic, driving advancements in both fields [6][8].
Industry Development
- The demand for data annotation is shifting towards economically developed regions and AI frontier areas, reflecting a trend in labor distribution [4].
- The industry is primarily concentrated in information technology and scientific research, with a notable demand for annotation in AI research sectors [4].
- Traditional manual annotation is facing intense competition and transformation, with future prospects leaning towards automation and intelligent tools [4].
Future Trends
- The synthetic data field is gaining attention due to the limitations of real-world data and the high costs associated with annotation processes [5].
- A 2x2 matrix categorization of data annotation companies reveals trends based on scene strength and foundational strength, indicating diverse development paths [5].
- The development of AI-assisted annotation and fully automated technologies is essential for transitioning from labor-intensive to knowledge-intensive processes [8].
Recommendations for Industry Growth
- Establish multi-round quality inspection and feedback mechanisms to ensure high-quality data for AI models [8][9].
- Develop targeted annotation systems to leverage China's rich application scenarios and data resources [9].
- Enhance collaboration between academia and industry to accelerate technology transfer and standardization [9].
- Focus on skill training and optimizing human resource allocation to support high-quality annotation work [9].