Synthetic Data
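Eight Leading Figures in Embodied Intelligence Discuss "Non-Consensus": Data, World Models, and How to Spend the Money
36Kr · 2025-11-24 01:00
By Fu Chong (富充) | Edited by Su Jianxun (苏建勋)

"If your company were given 10 billion yuan to advance embodied intelligence, how would you spend it?" At the roundtable forum of the 2025 BAAI Embodied Open Day (智源具身Open Day), held on November 20, the moderator posed this open-ended question.

The panelists came from eight leading Chinese companies and institutions in embodied intelligence:
- Wang Zhongyuan (王仲远), President of the Beijing Academy of Artificial Intelligence (BAAI, 智源研究院)
- Luo Jianlan (罗剑岚), Partner and Chief Scientist at AgiBot (智元机器人)
- Wang He (王鹤), Assistant Professor at Peking University and founder of Galbot (银河通用)
- Zhao Hang (赵行), Assistant Professor at Tsinghua University's Institute for Interdisciplinary Information Sciences and co-founder of Galaxea (星海图)
- Cheng Hao (程昊), founder and CEO of Booster Robotics (加速进化)
- Wang Qian (王潜), founder and CEO of X Square Robot (自变量)
- Zhang Jiaxing (张家兴), AI Chief Scientist at China Merchants Group (招商局集团)
- Zhao Dongbin (赵冬斌), Professor at the University of Chinese Academy of Sciences

To sharpen the clash of views, the roundtable included a "hold up a card" segment: panelists raised card 1, 2, or 3 to signal agreement, neutrality, or disagreement.

The cards showed that non-consensus persists even among China's top practitioners. The sharpest split was over how to solve "data scarcity". Zhao Hang of Galaxea and Zhang Jiaxing of China Merchants Group argued for the importance of real physical-world data; Wang He of Galbot stressed that synthetic data will play an important role where real data is hard to collect. Wang Qian of X Square Robot held that blended data can be used, but that the data source should be chosen to suit each task.

"I don't think 10 billion yuan is quite enough." 加速进 ...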
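Eight Leading Figures in Embodied Intelligence Discuss "Non-Consensus": Data, World Models, and How to Spend the Money
36Kr · 2025-11-23 12:56
Even among China's top practitioners, non-consensus persists. The different answers reflect each founder's own "first principles" and strategic priorities.

By Fu Chong (富充) | Edited by Su Jianxun (苏建勋) | Source: 智能涌现 (ID: AIEmergence) | Cover image: BAAI (智源研究院)

"If your company were given 10 billion yuan to advance embodied intelligence, how would you spend it?" At the roundtable forum of the 2025 BAAI Embodied Open Day (智源具身Open Day), held on November 20, the moderator posed this open-ended question.

The panelists came from eight leading Chinese companies and institutions in embodied intelligence:
- Wang Zhongyuan (王仲远), President of the Beijing Academy of Artificial Intelligence (BAAI, 智源研究院)
- Luo Jianlan (罗剑岚), Partner and Chief Scientist at AgiBot (智元机器人)
- Wang He (王鹤), Assistant Professor at Peking University and founder of Galbot (银河通用)
- Zhao Hang (赵行), Assistant Professor at Tsinghua University's Institute for Interdisciplinary Information Sciences and co-founder of Galaxea (星海图)
- Cheng Hao (程昊), founder and CEO of Booster Robotics (加速进化)
- Wang Qian (王潜), founder and CEO of X Square Robot (自变量)
- Zhang Jiaxing (张家兴), AI Chief Scientist at China Merchants Group (招商局集团)
- Zhao Dongbin (赵冬斌), Professor at the University of Chinese Academy of Sciences

"I don't think 10 billion yuan is quite enough," Cheng Hao of Booster Robotics replied with a smile, drawing a knowing laugh from the audience. "If there were only 10 billion, I would probably bring in more friends to push the embodied intelligence industry forward together, for example by putting the money into BAAI."

Luo Jianlan, partner at AgiBot, would lean toward using the money to solve the current 数 ...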
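CHAGEE (霸王茶姬) Founder to Marry Trina Solar Co-Chair; Yu Minhong Denies Antarctic Cruise Cabins Cost 1.48 Million Yuan; He Tongxue (何同学) Says His Company Lost in the Million-Yuan Range This Year; Meizu Responds to Reports of Selling Its Headquarters Building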
Sou Hu Cai Jing· 2025-11-21 07:33
| Rank | Province | Welfare lottery sales (100 million yuan) |
| --- | --- | --- |
| 1 | Guangdong | 226.52 |
| 2 | Zhejiang | 168.49 |
| 3 | Jiangsu | 140.72 |
| 4 | Shandong | 112.31 |
| 5 | Sichuan | 101.15 |
| 6 | Yunnan | 99.70 |
| 7 | Xinjiang | 93.35 |
| 8 | Hunan | 88.92 |
| 9 | Shaanxi | 83.42 |
| 10 | Anhui | 82.44 |
| 11 | Hubei | 75.41 |
| 12 | Liaoning | 73.41 |
| 13 | Henan | 66.80 |
| 14 | Hebei | 64.01 |
| 15 | Shanghai | 61.66 |
| 16 | Beijing | 58.97 |
| 17 | Fujian | 58.15 |
| 18 | Chongqing | 48.25 |
| 19 | Guangxi | 45.89 |
| 20 | Jiangxi | 44.00 |
| 21 | Inner Mongolia | 43.97 |
| 22 | Guizhou | 38.69 |
| 23 | Shanxi | 35.04 |

Heilongjiang 2 ...
Thoughts on GEN-0 and the Subsequent Development of VLA
具身智能之心· 2025-11-21 00:04
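Author: 阿汐猫猫 | Original link: https://zhuanlan.zhihu.com/p/1970094649956868665

Reposted from the author's blog; see https://axi404.top/blog/embodied-talk-3

Preface

The recent release of GEN-0 [1] has caused a sensation in the embodied intelligence field. Manipulation has long been the crown jewel of robotics and is an indispensable link if embodied intelligence is to deliver real-world productivity, yet it is notorious for how hard it is to generalize. Because practical deployment scenarios are lacking, there is no data flywheel, and the resulting data scarcity makes model pre-training hard to scale up; models instead depend heavily on post-training data.

Before this, the most representative work in the field was the Pi series [2][3], pre-trained on the private Pi dataset. The results were significant: this kind of pre-training improved model performance during post-training. In actual deployment, Pi, unlike a number of models that claim to have overtaken it, in action coherence ...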
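Exclusive | DataArc (数创弧光) Closes Two Consecutive Rounds at a Valuation in the Hundreds of Millions: Decoding the "Data Wall-Breaker" of the Large-Model Era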
Z Potentials· 2025-11-20 04:12
Core Viewpoint
- DataArc, an AI startup focusing on synthetic data for large models, has recently completed seed and seed+ financing rounds totaling several tens of millions of RMB, with a post-investment valuation in the hundreds of millions of RMB [1][2].

Group 1: Synthetic Data as a Necessity
- The large model industry is approaching a structural inflection point where the quality and quantity of usable real data are diminishing rapidly, necessitating the use of high-quality synthetic data to enhance model capabilities [3][5].
- Synthetic data has transitioned from an optional resource to a critical variable that can fill structural gaps in data availability, especially under privacy and compliance constraints [3][6].
- The demand for synthetic data is driven by the need for task-specific data in sectors like finance, healthcare, and law, where real data is difficult to collect and often subject to regulatory limitations [5][6].

Group 2: DataArc's Technological Approach
- DataArc has developed a comprehensive synthetic data solution that covers the entire lifecycle of large model training, including pre-training, supervised fine-tuning, and reinforcement learning fine-tuning [7][8].
- The company employs a "contextual graph" approach to connect documents, projects, personnel, and business knowledge, enabling the generation of logical and diverse synthetic data while maintaining accuracy [8][10].
- DataArc's synthetic data encryption training technology allows models to train on encrypted data without decryption, addressing both model performance and privacy compliance [10].

Group 3: Market Strategy and Positioning
- DataArc targets the overseas low-resource language market, particularly in regions like the Middle East, where real data is scarce and culturally nuanced [12][13].
- The company has established partnerships with leading cloud and hardware providers and is actively pursuing commercial deployments in the Middle East, having received positive feedback during its first appearance at an overseas tech exhibition [13][14].
- The strategic focus on high data scarcity and high business value areas positions DataArc to effectively address the unique challenges of low-resource languages [11][12].

Group 4: Building Competitive Moats
- The technical challenges associated with low-resource language markets serve as a core barrier to entry for competitors, as overcoming these challenges can create a significant competitive advantage [14][16].
- DataArc's team, with strong academic backgrounds and industry experience, is well-equipped to navigate the complexities of synthetic data generation and application [16][18].
- The company's future plans include expanding from text to multimodal capabilities and evolving its architecture from a pure cloud model to a hybrid edge-cloud approach, enhancing its competitive edge in the AI landscape [18][20].
Musk Uses Staggering Compute to Build Grok 5, a 6-Trillion-Parameter Performance Monster, Aiming at AGI
36Kr · 2025-11-17 02:54
Core Insights
- Elon Musk predicts that by 2030, the overall capabilities of AI may surpass that of all humanity combined [3][57]
- Musk's company xAI is rapidly developing its AI model Grok, which has undergone multiple iterations in a short time frame, showcasing a unique approach to AI development [4][6][10]

Development of Grok
- Grok was launched in November 2023 as an early testing version on the X platform [5]
- The xAI team quickly upgraded Grok to version 1.5 in Spring 2024, enhancing reasoning capabilities and increasing context length to 128k tokens [6]
- Grok-1.5V, which includes visual understanding capabilities, was announced in April 2024, allowing it to process multimodal information [7]
- Grok-2 was introduced in August 2024, featuring significant performance improvements and new skills like image generation [8]
- Grok-3, released in February 2025, focuses on complex reasoning and advanced problem-solving [9]
- The latest version, Grok-4, is claimed to be among the industry's best in terms of comprehensive intelligence [10]

Team and Philosophy
- xAI has attracted top talent from companies like DeepMind and OpenAI, aiming to "deeply understand the truth of the universe" [12]
- Grok is designed to be an alternative AI that is "truthful and humorous," inspired by the sci-fi classic "The Hitchhiker's Guide to the Galaxy" [13][14]
- The goal is to pursue truth to the greatest extent, utilizing AI to generate synthetic data for knowledge reconstruction rather than relying on potentially biased internet data [19]

Resource Integration
- Musk leverages the vast real-time data from the X platform to enhance Grok's learning and response capabilities [20][21]
- xAI has developed advanced search skills to dig deeper into X's internal information, improving the timeliness and accuracy of responses [23]
- The integration of Tesla's computing power and chip technology supports xAI's AI development, with the upcoming AI5 chip expected to enhance performance significantly [25][31]

Infrastructure and Computing Power
- The Colossus supercomputing center, built in a record 122 days, provides substantial computational resources for training Grok [26][28]
- The center's GPU cluster has reached nearly 1 quintillion operations per second, positioning xAI as a formidable player in hardware investment [36]

Competitive Positioning
- Musk believes xAI will soon surpass all companies except Google in the AGI race, driven by rapid infrastructure expansion and model iteration speed [36]
- xAI's approach contrasts with competitors by promoting a more open and less politically correct AI, appealing to users dissatisfied with stricter AI models [38][41]

Ethical Considerations
- Musk acknowledges the potential risks of a more open AI, as Grok has faced controversies regarding its content [44][46]
- xAI aims to balance the pursuit of truth with safety measures to prevent harmful outputs, reflecting a commitment to responsible AI development [47]

Open Source Strategy
- xAI has begun to open source its models, starting with Grok-2.5, to promote transparency and community involvement [50][53]
- The open-source approach is limited by a custom "community license agreement," preventing direct commercial exploitation by competitors [52]

Global Perspective
- Musk recognizes the rapid advancements in AI from companies in China, highlighting the competitive landscape beyond the U.S. [56]
- He views AI as a crucial component for enhancing human intelligence and believes that AGI could be essential for maintaining progress in civilization [57]
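2025 Analysis of Global and Chinese Synthetic Data Industry Drivers, Market Size, Investment and Financing Activity, and Future Trends: Large Models' Growing Demand for High-Quality Data Pushes the Synthetic Data Market Past 4.7 Billion Yuan [Figure]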
Chan Ye Xin Xi Wang· 2025-11-17 01:16
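Summary: Synthetic data is simulated data generated by computer algorithms. It mimics the distribution and characteristics of real-world data, using mathematical models and generative techniques to construct new datasets rather than drawing directly on real-world observations or records. Large-model training and development demand ever more data, especially high-quality data, yet the data available for training is increasingly tight, with problems summed up as "not enough, not good enough, not usable". With its strong scenario simulation and generation capabilities, synthetic data opens a new research paradigm for many frontier fields that lack real observational data or where physical experiments are costly and highly risky.

The global synthetic data market continues to expand, growing rapidly from 1.18 billion yuan in 2021 to 4.76 billion yuan in 2025, a compound annual growth rate of 41.8% over the period. Thanks to mature technology ecosystems, strict data regulations, and early, active enterprise adoption, synthetic data solutions have their highest penetration in North America and Europe, at 35%-40% and 25%-30% respectively. The Chinese market is growing fastest, driven by a huge internet user base, abundant application scenarios, and strong policy support, with penetration of about 20%-25%. Penetration in the rest of Asia-Pacific and in emerging markets is currently relatively low, but the growth potential is large.

Focusing on the Chinese market: in the digital-economy era, China attaches great importance to the data industry and supports it across the board, driving steady growth, and synthetic data likewise faces good development opportunities. ...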
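The growth figure quoted in this summary can be sanity-checked with the standard compound annual growth rate formula. The sketch below is only an illustration: the helper name `cagr` is hypothetical, and it assumes four compounding years between the 2021 and 2025 figures (11.8 and 47.6, in units of 100 million yuan). Under those assumptions the rate comes out near 41.7%, close to the stated 41.8%; the small gap presumably reflects rounding in the published market sizes.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate over `periods` compounding periods."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1 / periods) - 1


# Market size in units of 100 million yuan, taken from the summary above.
# Assumption: 2021 -> 2025 is treated as 4 compounding periods.
rate = cagr(11.8, 47.6, 4)
print(f"Implied CAGR: {rate:.1%}")
```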
Earn 1,000 Yuan an Hour Doing Housework: A New Human Job in the Embodied Intelligence Era
量子位· 2025-10-24 03:53
Core Insights
- The article discusses the rising trend of using household chore videos as high-value training data for humanoid robots, with companies like Encord, Micro1, and Scale AI actively purchasing this content [7][10][19].

Industry Overview
- The robotics sector is currently experiencing significant investment, with venture capital in the field reaching $12.1 billion this year alone [10].
- There is a notable data scarcity issue in the robotics industry, as robots require real-world training data that is not readily available like internet datasets for language models [11].

Data Sources
- Training data for robots can be sourced from two main paths: real-world data and synthetic data [12].
- Real-world data can be collected through precise equipment that remotely controls robots, capturing detailed physical interactions [12][14].
- Synthetic data is generated in virtual environments, allowing for the creation of numerous action variations at a lower cost [16].

Data Processing Strategies
- Companies are combining real and synthetic data to address the scarcity of quality training data, utilizing a small amount of real-world data alongside large volumes of synthetic data [18].
- Encord has reported a fourfold increase in data processing this year compared to last year, with high compensation for high-skill task videos reaching $150 per hour [19].

Market Demand
- Demand for training data is coming from companies like Physical Intelligence and Boston Dynamics [22].
- Some startups are even advertising for users to film household chores for as little as $10 to $20 per hour [23].

Data Availability Challenges
- Despite efforts from various companies, high-quality training data remains scarce, with the largest available datasets only amounting to about 5,000 hours, which is insufficient for training needs [26].
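The idea of pairing a small real-world dataset with a much larger synthetic one, mentioned under data processing strategies, can be sketched as a simple batch sampler. This is a minimal illustration, not any company's actual pipeline: the function name `mix_batch`, the toy datasets, and the 25% real-data share are all assumptions made for the example, since the article does not state a ratio.

```python
import random


def mix_batch(real, synthetic, batch_size, real_fraction, rng):
    """Draw one batch with a fixed share of real samples.

    Sampling is with replacement, so a small real dataset can still
    fill its share of every batch.
    """
    n_real = round(batch_size * real_fraction)
    n_synth = batch_size - n_real
    batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_synth)
    rng.shuffle(batch)  # avoid real samples always sitting at the front
    return batch


# Hypothetical toy data: 10 real clips versus 1,000 synthetic ones.
rng = random.Random(0)
real = [("real", i) for i in range(10)]
synthetic = [("synthetic", i) for i in range(1000)]

# Assumed ratio for illustration only: 25% real, 75% synthetic.
batch = mix_batch(real, synthetic, batch_size=32, real_fraction=0.25, rng=rng)
real_count = sum(1 for source, _ in batch if source == "real")
print(f"{real_count} real / {len(batch) - real_count} synthetic")
```

Fixing the share per batch, rather than simply concatenating the two datasets, keeps the scarce real samples from being drowned out by the synthetic majority. What share works in practice would have to be tuned per task.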
Behind the Giants' "Abandonment" of Scale AI: The Core of AI Competition Has Shifted to "Data Order"
Zheng Quan Shi Bao Wang· 2025-10-22 07:46
Core Insights
- The global AI industry is experiencing a resurgence, highlighted by Micro1's $35 million Series A funding and a post-money valuation of $500 million, positioning itself as a new data supplier for major players like OpenAI, Google, and Meta [1]
- The shift in the AI ecosystem emphasizes the importance of data quality and order, as opposed to solely focusing on algorithms and computational power [1][2]
- The AI data annotation industry is characterized as a labor-intensive and knowledge-intensive sector, where the core metric is "auditable order" of data [2]

Industry Dynamics
- The AI data industry has transitioned from "human outsourcing" to "data governance," with leading companies leveraging machine learning to enhance annotation processes [3]
- The industry faces a complex investment landscape, requiring a balance of quality, automation, and compliance, with any failure in these areas posing systemic risks [3][4][5]
- The three critical thresholds defining the AI data industry are quality consistency, efficiency in human-machine collaboration, and compliance with data governance [4][5][6]

Investment Perspective
- The investment logic in the AI data sector prioritizes structural understanding over speed, categorizing companies based on quality, automation, and compliance [7]
- Companies that can create a closed-loop system across these three axes are expected to become foundational infrastructure in the AI landscape [7][8]
- Chinese AI infrastructure companies are accelerating their efforts in data governance and compliance, leveraging their strengths in system engineering and industrial depth [8]

Future Outlook
- The rise of synthetic data has sparked discussions about the future of human annotation, but it is viewed as a supplement rather than a replacement, emphasizing the need for human-defined semantic boundaries [8]
- The focus of the AI industry is shifting from "creating intelligence" to "governing intelligence," with future competition centered on the quality of order rather than model performance [8]
- The long-term sustainability of the AI data annotation business is highlighted as a critical aspect of the industry, despite its lack of immediate glamour or capital stories [9]
Jensen Huang's Eldest Daughter Appears on a Livestream to Talk About Embodied Intelligence
量子位· 2025-10-16 09:30
Core Viewpoint
- The discussion focuses on how to bridge the gap between virtual and physical worlds for robots, emphasizing the importance of synthetic data and simulation in overcoming data challenges in robotics [1][4].

Group 1: Company Overview
- Lightwheel Intelligence is a company specializing in synthetic data technology, aiming to help AI better understand and interact with the physical world, primarily focusing on embodied intelligence and autonomous driving [3][9].
- The collaboration between NVIDIA and Lightwheel Intelligence began due to the reliance of various NVIDIA projects on Lightwheel's support, such as the Gear Lab and Seattle Robotics Lab [6][10].

Group 2: Importance of Synthetic Data
- Synthetic data is crucial for addressing the data challenges faced by robots, with Lightwheel's SimReady assets needing to be both visually and physically accurate [7][19].
- The need for a synthetic data factory is highlighted, as robots cannot easily gather data like language models can, necessitating the use of simulation as a solution [8][19].

Group 3: Challenges in Sim2Real
- The transition from simulation to reality (Sim2Real) presents different challenges for autonomous driving and robotics, with robotics being more complex due to the need for physical interaction and manipulation capabilities [12][15].
- Physical accuracy is identified as a core issue, with high-quality data being essential for training robotic systems and generating correct algorithms [15][16].

Group 4: Data and Efficiency
- A significant amount of data is required for deploying embodied intelligence in the real world, potentially exceeding the data needs of large language models [16].
- Lightwheel Intelligence is leveraging physical devices to collect precise data for simulation environments and is developing efficient methods for running large-scale simulations [20][21].

Group 5: Collaboration and Innovations
- Lightwheel is collaborating with NVIDIA to develop a solver for cable simulation, which is complex due to the dual nature of cables as both flexible and rigid objects [23].
- The partnership also focuses on creating the Isaac Lab Arena, a next-generation framework for benchmarking, data collection, and large-scale reinforcement learning [28].