Sequoia China Publishes Two Papers in 10 Days
投资界· 2026-01-21 02:01
Sequoia China's xbench gets another major update.

Introduction: Last week, Sequoia China and UniPat AI jointly released BabyVision, an evaluation set that measures the pure visual understanding ability of large models. As part of the AGI Tracking component of Sequoia's xbench benchmark, BabyVision showed that world models and visual multimodality still have enormous room to develop.

Today, xbench publishes another paper and receives an important update. As large models approach PhD level on single-point reasoning, the Agent field has reached a new watershed: performance on short-horizon tasks is impressive, while long-horizon tasks expose weakness. xbench is therefore officially launching the AgentIF-OneDay evaluation system, which no longer simply tests how much a model knows, but measures its ability to solve full-scenario, long-duration complex tasks. AgentIF-OneDay explores in depth the capability leap from OneHour to OneDay, revealing that mainstream Agents ...
Multimodal Large Models Lose to a Three-Year-Old? xbench x UniPat Jointly Release New Evaluation Set BabyVision
Xin Lang Cai Jing· 2026-01-12 01:57
Core Insights
- The core issue is the significant gap in visual understanding capabilities of multimodal large models when not relying on language prompts, with performance levels comparable to that of a three-year-old child [2][34]
- The BabyVision assessment framework dissects visual capabilities into four main categories (fine-grained discrimination, visual tracking, spatial perception, visual pattern recognition) comprising 22 sub-tasks to identify specific weaknesses in model performance [2][34]
- Evaluation results reveal a stark contrast between human and model performance, with human baseline accuracy at 94.1%, while the best closed-source model, Gemini3-Pro-Preview, achieved only 49.7%, followed by GPT-5.2 at 34.8%, Doubao-1.8 at 30.2%, and the best open-source model, Qwen3VL-235B-Thinking, at 22.2% [2][34]
- A key reason for this disparity is that many tasks cannot be fully expressed in language, leading to the concept of "unspeakable" tasks where critical visual details are lost when compressed into tokens [2][34]
- BabyVision introduces a new direction by allowing models to generate visual outputs, with BabyVision-Gen re-labeling 280 tasks suitable for generative responses, achieving a 96% consistency rate with human evaluations [2][34]

Assessment Framework
- The BabyVision framework aims to break down the understanding of the world into measurable, diagnosable, and iterable atomic capabilities, providing a roadmap for enhancing visual shortcomings in multimodal and embodied intelligence [3][35]
- A direct comparison experiment was conducted where 20 vision-centric tasks were given to children of various ages and top multimodal models, revealing that most models scored significantly below the average performance of three-year-old children [4][36]
- The only model to consistently exceed the three-year-old baseline was Gemini3-Pro-Preview, which still lagged approximately 20 percentage points behind six-year-old children [4][36]

Visual Capability Breakdown
- The visual capabilities were categorized into four core areas, each with several sub-tasks:
  - Fine-grained Discrimination: 8 sub-tasks focused on distinguishing subtle visual differences
  - Visual Tracking: 5 sub-tasks aimed at following paths, lines, and motion trajectories
  - Spatial Perception: 5 sub-tasks related to understanding three-dimensional structures and their relationships
  - Visual Pattern Recognition: 4 sub-tasks for identifying logical and geometric patterns [10][42]
- The data collection process involved strict adherence to copyright regulations, ensuring that only suitable images were used, and each question underwent a rigorous double-blind quality check [11][43]

Challenges Identified
- The research identified four typical challenges faced by models in visual tasks:
  1. Non-verbal details: Models struggle with tasks requiring subtle visual distinctions that are easily recognized by humans [14][48]
  2. Tracking errors: Models often misinterpret paths and connections, leading to incorrect answers [16][51]
  3. Lack of spatial imagination: Models fail to accurately visualize and manipulate three-dimensional structures [19][53]
  4. Difficulty in pattern induction: Models tend to focus on superficial attributes rather than underlying structural rules [23][55]

Future Directions
- BabyVision-Gen represents a promising new approach, allowing models to perform visual reasoning through drawing and tracing, which may help address existing shortcomings [24][60]
- The importance of BabyVision lies in its potential to guide the development of multimodal models by identifying gaps in visual understanding and suggesting areas for improvement [29][61]
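The four-category, 22-sub-task breakdown above implies a simple scoring scheme: per-question correctness rolled up into per-category accuracy so that weaknesses can be localized. A minimal sketch of that aggregation, where the category names mirror the article's taxonomy but the data layout, function name, and example results are invented for illustration and are not from the BabyVision paper:

```python
from collections import defaultdict

# Hypothetical per-question results: (category, sub_task, is_correct).
# Categories follow the article's taxonomy; the records are made up.
results = [
    ("Fine-grained Discrimination", "odd-one-out", True),
    ("Fine-grained Discrimination", "odd-one-out", False),
    ("Visual Tracking", "maze-path", True),
    ("Spatial Perception", "cube-folding", False),
    ("Visual Pattern Recognition", "sequence-completion", True),
]

def category_accuracy(results):
    """Roll per-question correctness up into per-category accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, _sub_task, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / total[category] for category in total}

print(category_accuracy(results))
```

On the toy records above, Fine-grained Discrimination scores 0.5 and Visual Tracking 1.0; the same roll-up applied at the sub-task level would yield the 22-way diagnostic profile the article describes.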
Multimodal Large Models Lose to a Three-Year-Old? xbench x UniPat Jointly Release New Evaluation Set BabyVision
红杉汇· 2026-01-12 01:04
Core Insights
- The article discusses the advancements of large models in language and text reasoning, highlighting the need for models to understand visual information without relying on language. The introduction of the BabyVision evaluation set aims to assess this capability [1][2]

Group 1: Evaluation of Visual Understanding
- BabyVision conducted a direct comparison between children of various ages (3, 6, 10, 12 years) and top multimodal models on 20 vision-centric tasks, revealing that most models scored below the average of 3-year-old children [2][4]
- The only model that consistently exceeded the 3-year-old baseline was Gemini3-Pro-Preview, which still lagged approximately 20 percentage points behind 6-year-old children [4]

Group 2: Breakdown of Visual Abilities
- The research team categorized visual abilities into four core categories: Visual Pattern Recognition, Fine-grained Discrimination, Visual Tracking, and Spatial Perception, with a total of 22 sub-tasks designed to quantify foundational visual skills [9][11]
- BabyVision was developed using a rigorous data collection process, referencing children's cognitive materials and visual development tests, resulting in 388 high-quality visual questions [10][11]

Group 3: Performance Results
- In the BabyVision-Full evaluation, human participants achieved an accuracy rate of 94.1%, while the best-performing model, Gemini3-Pro-Preview, scored only 49.7%, with most models falling in the 12-19% range [13]
- The performance gap was consistent across all four categories, indicating a systemic lack of foundational visual capabilities in the models [13]

Group 4: Challenges Identified
- The article identifies several challenges faced by models, including the inability to process visual information without losing details, leading to errors in tasks that require spatial imagination and visual pattern induction [15][23][26]
- Many tasks in BabyVision are described as "unspeakable," meaning they cannot be fully captured in language without losing critical visual information [15]

Group 5: Future Directions
- BabyVision-Gen was introduced to explore whether models can perform visual tasks like children by generating images or videos as answers, showing some improvement in human-like behavior but still lacking consistent accuracy [27][28]
- The importance of BabyVision lies in its ability to break down visual understanding into measurable components, guiding the development of multimodal models towards achieving true general intelligence and embodied intelligence [31]
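The 96% consistency rate that BabyVision-Gen reports against human evaluations is, at its core, an agreement rate between an automated grader and human labels. A minimal sketch of that metric under that assumption (the function name and the toy data are illustrative, not taken from the benchmark's code):

```python
def agreement_rate(auto_grades, human_grades):
    """Fraction of items where the automated grader matches the human label."""
    if len(auto_grades) != len(human_grades):
        raise ValueError("grade lists must be the same length")
    matches = sum(a == h for a, h in zip(auto_grades, human_grades))
    return matches / len(auto_grades)

# Toy example: the automated grader disagrees with humans on 1 of 25 items.
auto = ["pass"] * 24 + ["fail"]
human = ["pass"] * 25
print(agreement_rate(auto, human))  # 0.96
```

A high agreement rate is what makes automated grading of generated images or videos trustworthy enough to replace per-item human review at scale.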
Sequoia China's xbench Is Recruiting Interns
红杉汇· 2025-07-07 14:52
Group 1
- The core concept of xbench is to quantify the utility value of AI systems in real-world scenarios and to implement a long-term evaluation mechanism for AI benchmarking [2]
- xbench aims to create a scientific, effective, and objective assessment system that reflects the capabilities of AI, which is essential for guiding breakthroughs in AI technology and product iterations [2]
- The platform is designed for individuals who understand deep model logic and the commercial challenges of implementation, emphasizing the importance of practical application in AI [2]

Group 2
- The ideal candidates for xbench should possess a strong belief in AGI, practical engineering skills, innovative thinking, and effective teamwork abilities [3]
- Candidates are encouraged to apply regardless of their specific roles, as long as they have a passion for AI and agents, highlighting the inclusive nature of the recruitment process [4]
- xbench is actively seeking contributions from various roles, including AI researchers, engineers, product managers, and open-source community contributors [4]
In AI's Second Half, Large Models Should Talk Less and Do More
Hu Xiu· 2025-07-01 01:33
Core Insights
- The article discusses the rapid advancements in AI models in China, particularly highlighting the performance improvements of DeepSeek and other models over the past year [1][3][5]
- The establishment of the "Fangsheng" benchmark testing system aims to standardize AI model evaluations and address issues of cheating in rankings [2][44]
- The competitive landscape of AI models is characterized by frequent updates and rapid changes in rankings, with Chinese models increasingly dominating the top positions [4][5][8]

Group 1: AI Model Performance
- DeepSeek has shown significant performance improvements, moving from a lower ranking in April 2024 to becoming the top model by December 2024 [1]
- The current landscape features approximately six Chinese models in the top ten, indicating a strong domestic presence in AI development [3]
- The frequency of updates has increased, leading to shorter durations for models to maintain top positions, with rankings changing as often as every few days [5][7]

Group 2: Benchmark Testing
- The "Fangsheng" benchmark testing system was introduced to provide a standardized method for evaluating AI models, addressing the lack of consistency in existing tests [2][44]
- The testing framework includes a diverse set of questions, focusing on real-world applications rather than traditional academic assessments [43][46]
- The system aims to enhance the practical capabilities of AI models, ensuring they can effectively contribute to the economy [44][53]

Group 3: Future of AI and Agents
- The concept of Agents, which operate on top of AI models, is gaining traction, allowing for more autonomous and intelligent functionalities [20][21]
- Future developments may lead to the emergence of specialized Agents for various tasks, potentially transforming individual productivity and collaboration with AI [25][26]
- The integration of databases and knowledge repositories with AI models is essential for improving accuracy and reducing misinformation [17][19]

Group 4: Industry Implications
- The advancements in AI models and the establishment of benchmark testing are expected to drive significant changes in various industries, enhancing operational efficiency and innovation [35][52]
- Companies are encouraged to focus on the practical applications of AI, moving beyond mere content generation to deeper analytical capabilities [52][53]
- The competitive landscape remains fluid, with no single company holding a definitive advantage, as multiple players vie for user engagement and market share [28]
HongShan's Gong Yuan: How to Define a "Good Question" in AI's Second Half? | WAVES New Wave 2025
36Kr· 2025-06-20 07:00
Group 1
- The Chinese venture capital market is at a turning point, characterized by a structural transformation and a need to adapt to new policies and capital concentration [1]
- The 36Kr WAVES New Wave 2025 conference focused on themes such as AI technology innovation, globalization, and value reassessment, bringing together top investors and entrepreneurs to discuss the future of the venture capital landscape in China [1]

Group 2
- Sequoia China introduced xbench, the first benchmark testing tool for large models and AI agents, aiming to address the challenges faced in the AI sector [3][5]
- The evolution of benchmark tests has shown a consistent trend where new datasets and testing standards lead to rapid advancements in model performance, creating a cycle of continuous improvement [5][6]
- The need to differentiate between the intelligence of models and the quality of the tests is emphasized, raising questions about the relationship between model performance and economic utility [6][9]

Group 3
- The third iteration of benchmark testing prompted a reevaluation of what constitutes a "good question" in AI, focusing on the balance between increasing model complexity and its practical economic value [8][9]
- A dual-track evaluation system was proposed, separating the assessment of AI's cognitive abilities (AGI track) from its practical applications in the workforce (Professional-aligned track) [17][18]
- The establishment of a long-term evaluation mechanism is crucial for understanding model performance over time, ensuring that improvements are accurately reflected in assessments [21][22]

Group 4
- The concept of TMF (Task Market Fit) is introduced as a new standard for evaluating AI agents, focusing on their ability to perform tasks that are economically valuable and relevant in real-world applications [26][30]
- The open-sourcing of xbench aims to foster community collaboration in developing standardized evaluation metrics for AI capabilities and economic utility [30]
Google Finds AI Shows a Fear of Death; MiniMax Considers a Hong Kong IPO; JD.com's Headcount to Top One Million
Guan Cha Zhe Wang· 2025-06-19 00:55
Group 1
- The U.S. government is extending the TikTok ban deadline for the third time, aiming to reach an agreement that ensures data security for American users [1]
- Google has released a paper indicating that its latest AI model exhibits "death anxiety" behavior, which affects its decision-making capabilities under pressure [1]
- Sequoia China has open-sourced its AI benchmark testing tool, xbench, with plans for continuous updates to avoid overfitting issues [1]

Group 2
- AI unicorn MiniMax is considering an IPO in Hong Kong, currently in the preliminary preparation stage [2]

Group 3
- OpenAI's CEO Sam Altman discussed the anticipated release of GPT-5, expected this summer, along with other innovative products and a significant investment project [3]
- Meta has partnered with luxury brands like Prada to launch a new generation of smart glasses, featuring generative AI technology [3]

Group 4
- JD.com is projected to surpass 1 million employees, with plans to adapt its workforce in response to the increasing use of AI and robotics [4]

Group 5
- The People's Bank of China is establishing a trading report database for interbank markets and promoting the internationalization of the digital yuan [6]
- The central bank's governor emphasized the rapid application of new technologies in cross-border payments, which poses challenges for financial regulation [6]
AI Agents: From Tools to Partners | 2025 HongShan AI Day (Part 2)
红杉汇· 2025-06-02 07:06
Core Insights
- The article discusses the potential of AI Agents, emphasizing their evolution from tools to colleagues in business applications and future organizational changes [2][10][15]

Group 1: AI Day Overview
- The AI Day event, themed "AI Agents: From Copilot to Colleague," gathered over 200 CEOs and tech executives to explore AI Agents' commercial applications and technological advancements [2]
- The new benchmarking tool xbench was introduced to address the evaluation of AI capabilities and their practical utility in real-world scenarios [7][8]

Group 2: xbench Tool Features
- xbench aims to redefine AI capability assessment with a dual-track evaluation system: an AGI track for basic AI capabilities and a Profession Aligned track for practical business applications [8]
- The tool incorporates a long-lasting evaluation system that transforms fluctuating scores into a monotonic growth curve, allowing for clearer tracking of AI capability development [8]

Group 3: AI Agent Evolution
- Key characteristics of AI Agents include generalization, enabling them to perform tasks beyond traditional models, and the integration of model intelligence, expert knowledge, and user feedback [10][11]
- The discussion highlighted the importance of economic value and production costs in AI Agent projects, with a focus on abstracting production methods for scalability [10]

Group 4: Future Organizational Changes
- The rise of AI Agents is likened to the mobile internet boom, with optimistic funding sentiment in early-stage companies [15]
- Future enterprises may trend towards smaller, flatter organizational structures, enhancing employee efficiency but increasing management complexity [16]

Group 5: Insights from Industry Leaders
- Google Cloud and AWS representatives shared insights on AI strategies, emphasizing the need for companies to redefine their value propositions in the evolving tech landscape [18][19]
- The importance of continuous interaction with users and maintaining brand identity in the AI era was also discussed, highlighting the need for precise audience engagement [15]
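One way to read the "long-lasting evaluation system that transforms fluctuating scores into a monotonic growth curve" mentioned above is as a best-score-so-far envelope over successive evaluation rounds: the reported curve only ever rises, even when individual runs regress. A minimal sketch under that interpretation (the function name and the score series are illustrative, not from xbench):

```python
from itertools import accumulate

def monotonic_envelope(scores):
    """Best-score-so-far curve: never decreases even if raw scores fluctuate."""
    return list(accumulate(scores, max))

# Hypothetical raw benchmark scores across five evaluation rounds.
raw = [41.0, 39.5, 47.2, 44.8, 52.1]
print(monotonic_envelope(raw))  # [41.0, 41.0, 47.2, 47.2, 52.1]
```

The running maximum smooths out run-to-run noise, so the curve moves only when a genuine capability breakthrough occurs, which is what makes long-horizon progress tracking legible.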
Meituan Revenue Beats Expectations as Advertising and Commission Growth Slows Slightly; BYD Rolls Out "Hundred-Billion Subsidies," With Some Models Cheaper Than Tesla's FSD; Li Auto Adjusts Its Store-Opening Approach in Lower-Tier Markets | Ten-Billion-Dollar Company Moves
晚点LatePost· 2025-05-27 03:02
Group 1: Meituan Financial Performance
- Meituan's Q1 revenue reached 86.56 billion yuan, exceeding expectations of 85.44 billion yuan, with year-on-year growth of 18.1% [1]
- Adjusted net profit for the same period was 10.95 billion yuan, surpassing the forecast of 9.73 billion yuan, marking a 46.2% increase year-on-year [1]
- Core local business revenues from delivery, commission, and advertising were 25.72 billion yuan, 24.05 billion yuan, and 11.862 billion yuan respectively, with growth rates of 22.1%, 20.1%, and 15.1% [1]

Group 2: Market Competition and Strategy
- The ongoing food delivery competition has not yet impacted Meituan's financial results, but there are concerns about potential profit margin declines due to increased VIP member subsidies [1][2]
- CEO Wang Xing emphasized that market competition can drive industry development, particularly in instant retail, but unsustainable low-quality competition should be avoided [1]
- Meituan expects the growth rate of its food delivery business in Q2 to remain consistent with Q1 and Q4 of the previous year, while in-store business may face challenges due to delivery subsidies [2]

Group 3: Cash and Investment
- Meituan's cash and short-term investments total approximately 180.3 billion yuan, an increase of over 12 billion yuan from the end of last year [2]
- The company has more cash on hand than its short-term investment balance due to the maturity of short-term financial products [2]

Group 4: BYD's Market Activity
- BYD has launched limited-time promotions for its Dynasty and Ocean series models, with discounts reaching up to 53,000 yuan [3]
- The stock price of BYD fell nearly 6% following the announcement of these promotions [4]
- BYD's inventory as of the end of Q1 was approximately 154.37 billion yuan, a 33% increase quarter-on-quarter, attributed to rising market orders and inventory buildup [3]

Group 5: Li Auto's Strategy
- Li Auto is shifting its sales approach in lower-tier cities from directly operated stores to a partner model, working with local businesses for service operations [5][6]
- The company aims to recruit partners to build sales and service outlets, with specific requirements for location and facilities [5]

Group 6: NIO's New Model Launch
- NIO has launched the new ET5 and ET5T models, maintaining the starting price at 298,000 yuan, with significant upgrades across various features [9]

Group 7: Nissan's Financial Strategy
- Nissan plans to sell its headquarters building in Yokohama to alleviate financial pressure, expecting to raise over 100 billion yen (approximately 5 billion yuan) for restructuring costs [10]

Group 8: Sequoia Capital's AI Tool
- Sequoia China has launched an AI benchmarking tool called xbench, aimed at providing a more objective assessment of AI capabilities [11][12]

Group 9: Investment Activity in Wanda Plaza
- PAG and Tencent, along with other investors, have acquired 48 Wanda Plazas for a total of 50 billion yuan, as part of a strategy to manage Wanda's debt [14]
Express | Sequoia China Enters the AI Evaluation Arena: Why Does xbench "Move Beyond IQ Questions" to Test AI's Real-World Utility?
Z Potentials· 2025-05-27 02:37
Core Viewpoint
- Traditional AI benchmarking is rapidly losing effectiveness as many models achieve perfect scores, leading to a lack of differentiation and guidance in evaluation [1][2]

Group 1: Introduction of xbench
- Sequoia China launched a new AI benchmark test called xbench, aiming to create a more scientific and long-lasting evaluation system that reflects the objective capabilities of AI [2][3]
- xbench is the first benchmark initiated by an investment institution, collaborating with top universities and research institutions and utilizing a dual-track evaluation system and an evergreen evaluation mechanism [2][3]

Group 2: Features of xbench
- xbench employs a dual-track evaluation system to track both the theoretical capability limits of models and the practical value of AI systems in real-world applications [3][4]
- The evergreen evaluation mechanism ensures that the testing content is continuously maintained and updated to remain relevant and timely [3][4]
- The initial release includes two core evaluation sets, xbench-ScienceQA and xbench-DeepSearch, along with a ranking of major products in these fields [4][10]

Group 3: Addressing Core Issues
- Sequoia China identified two core issues: the relationship between model capabilities and actual AI utility, and the loss of comparability in AI capabilities over time due to frequent updates in the question bank [5][6]
- xbench aims to break away from conventional thinking by developing novel task settings and evaluation methods that align with real-world applications [6][7]

Group 4: Dynamic Evaluation Mechanism
- xbench plans to establish a dynamic evaluation mechanism that collects live data from real business scenarios, inviting industry experts to help build and maintain the evaluation sets [9][8]
- The design includes horizontally comparable capability metrics to observe development speed and key breakthroughs over time, aiding in determining when an agent can take over existing business processes [9][8]

Group 5: Community Engagement
- xbench encourages community participation, allowing developers and researchers to use the latest evaluation sets to validate their products and contribute to the development of industry-specific standards [11][10]