Building a SOTA Visual Agent in 500 Lines of Code! UniPat AI's Latest Open-Source Release
量子位· 2026-03-16 07:14
Core Insights
- The article discusses the impressive advancements in multimodal large models' coding capabilities, while highlighting their frequent errors in basic visual tasks [1][2]
- UniPat AI's SWE-Vision framework allows models to write and execute Python code to enhance the accuracy of their visual judgments, achieving state-of-the-art results across five major visual benchmarks [1][5]

Group 1: Model Performance and Limitations
- Multimodal large models have shown remarkable progress in coding, comparable to experienced engineers, but struggle to understand the visual world accurately [2][3]
- The BabyVision benchmark revealed that models often provide seemingly reasonable reasoning but fail at basic measurement, counting, and spatial-relationship judgments [2][3]

Group 2: SWE-Vision Framework
- SWE-Vision is a minimalist visual intelligence framework that enables models to use coding as a tool to compensate for visual processing inaccuracies [3][6]
- The framework's tool layer contains only two functions: execute_code, which runs Python in a persistent Jupyter environment, and finish, which outputs the final answer [7][8]

Group 3: Execution and Iteration
- SWE-Vision operates through a standard agentic loop, allowing the model to organize user queries and images, execute code, and iterate on the results until a final answer is reached [9][15]
- The persistent Jupyter kernel retains state across multiple calls, enabling step-by-step analysis similar to that of a human analyst [11][18]

Group 4: Results and Implications
- SWE-Vision achieved significant improvements over leading vision-language models, scoring 64.4 on BabyVision, 94.0 on MathVision, 50.1 on Zero-Bench-Sub, 69.0 on OmniSpatial, and 82.5 on CharXiv-RQ [5][22]
- The framework demonstrates that introducing coding capabilities can systematically elevate the visual performance of advanced models, particularly on basic perception and precise processing tasks [20][28]

Group 5: Future Directions
- Future developments aim to integrate coding as an inherent capability of visual intelligence agents, enhancing their ability to perceive, act, and reflect [30][31]
- Key areas for improvement include recognizing when visual reasoning requires code assistance, validating intermediate results, and seamlessly merging observation with computation [32]
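The two-tool agentic loop described above (execute_code running in a persistent environment, finish emitting the answer) can be sketched roughly as follows. This is an illustrative reconstruction from the article's description, not UniPat AI's actual code: `call_model` is a hypothetical stand-in for the underlying LLM API, and a plain Python namespace stands in for the persistent Jupyter kernel.

```python
# Minimal sketch of a two-tool agentic loop. A shared `namespace` dict
# stands in for the persistent Jupyter kernel: state set by one
# execute_code call is visible to the next.
import io
import contextlib

def execute_code(code: str, namespace: dict) -> str:
    """Run Python in a persistent namespace; return captured stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)  # state persists across calls via `namespace`
    return buf.getvalue()

def agent_loop(query, call_model, max_turns=10):
    """Drive the model: execute code and iterate until it calls finish."""
    namespace = {}        # persistent state, like a Jupyter kernel
    history = [query]     # query, images, and tool outputs would go here
    for _ in range(max_turns):
        action = call_model(history)       # e.g. {"tool": ..., "arg": ...}
        if action["tool"] == "finish":
            return action["arg"]           # final answer
        output = execute_code(action["arg"], namespace)
        history.append(output)             # model iterates on the result
    return None
```

Because the namespace persists, a model could load an image in one turn and measure it in the next, mirroring the step-by-step analysis the article describes.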
UniPat AI Open-Sources SWE-Vision: Building a SOTA Visual Agent in 500 Lines of Code!
机器之心· 2026-03-16 01:31
Core Insights
- The article discusses the impressive advancements in multimodal large models' coding capabilities, while highlighting their frequent errors in basic visual tasks. UniPat AI has developed a minimalist visual intelligence framework called SWE-Vision, which allows models to write and execute Python code to process and validate their visual judgments. SWE-Vision has achieved state-of-the-art results across five mainstream visual benchmarks [1][3][9]

Group 1: Model Limitations and Observations
- Multimodal large models have made significant progress in coding, comparable to experienced engineers, but struggle to understand the visual world, often making errors in basic measurement, counting, and spatial relationships [3][4]
- The BabyVision benchmark revealed that models often provide seemingly reasonable reasoning but fail at fundamental visual processing tasks, indicating a gap in their capabilities [3][4]
- A key observation is that while models can "see," they often cannot process visual information accurately, which motivated the idea of using coding as a tool to improve visual processing precision [5][7]

Group 2: SWE-Vision Framework
- SWE-Vision is designed as a minimalist visual intelligence agent built around two tools, execute_code and finish, letting models rely on familiar programming actions rather than a battery of specialized visual APIs [10][11][12]
- The framework runs a standard agentic loop that enables the model to organize user queries and images, execute code, and return results for further decision-making [13][16]
- SWE-Vision operates in a persistent Jupyter environment, retaining state across multiple code executions, which enables a more human-like iterative analysis process [14][21]

Group 3: Performance and Results
- SWE-Vision showed remarkable improvements on five different visual benchmarks, enhancing the performance of leading large language models (LLMs) such as GPT-5.2-xhigh and Seed-2.0-Pro [9][30]
- The results indicate that introducing code execution capabilities systematically raises the visual performance ceiling of advanced models, particularly on basic perception and precise processing tasks [28][34]
- The framework's design allows for multi-step analysis and verification, in contrast to traditional models that rely on intuitive observation [24][25]

Group 4: Future Directions
- The article suggests that future development should focus on making "code-enhanced vision" a native capability of visual intelligence agents, requiring a shift toward interactive environments that support reinforcement learning and tool use [36][37]
- Key future directions include learning to identify when visual reasoning requires code assistance, actively verifying intermediate results, and seamlessly integrating observation with computation [39][40]
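The observation that models can "see" but cannot process visual information accurately is concrete: estimating a count by eye is error-prone, while computing it is exact. A toy illustration of delegating the precise step to code (the grid below is made-up stand-in data, not anything from the article):

```python
# Toy illustration of "coding for precise visual processing": instead of
# eyeballing a count, compute it exactly. The binary grid is an invented
# stand-in for pixel data a model might extract from an image.
grid = [
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]

# A model prone to miscounting can delegate the count to code:
object_pixels = sum(cell for row in grid for cell in row)
print(object_pixels)  # 5
```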
A Formidable Newcomer in Research AI: An Open-Source 30B Small Model Takes On Gemini and Claude
量子位· 2026-03-09 02:01
Core Viewpoint
- The article discusses the capabilities of the UniScientist model developed by UniPat AI, emphasizing its ability to conduct autonomous scientific research with only 30 billion parameters, outperforming larger closed-source models on various scientific benchmarks [2][3][36]

Group 1: Model Capabilities
- UniScientist can autonomously propose hypotheses, collect evidence, execute reproducible deductions, and iteratively validate until conclusions are established [2][10]
- The model addresses the limitations of existing AI in scientific research, which often only mimics the appearance of research without true validation or reproducibility [7][8]
- It treats scientific research as a dynamic system, allowing continuous evolution of the evidence state and refinement of hypotheses [17][20]

Group 2: Data Engine and Research Process
- The UniScientist data engine is designed to balance the scale and diversity of model-generated data with the quality and verifiability provided by human experts [12][16]
- The research process is formalized as a series of verifiable unit tests, breaking open scientific questions down into independent, verifiable rubric items [24][25]
- The dataset includes over 4,700 research-grade instances, covering more than 50 disciplines and 400 research directions, with each instance validated by experts [26][30]

Group 3: Performance and Benchmarking
- UniScientist scored 28.3 on the FrontierScience-Research benchmark, surpassing several larger models, and reached 33.3 in results-aggregation mode [36][37]
- The model's performance indicates that it has learned to integrate retrieval, deduction, validation, and writing into a coherent research workflow [42]
- Even without tools, the model showed significant performance improvements, suggesting that training enhanced its research reasoning capabilities [40][41]

Group 4: Future Directions
- The next steps for UniScientist involve expanding its capabilities to include real-world experimental resources and computational infrastructure for controlled orchestration and execution [47]
- Integrating a code interpreter aims to shift the research process from narrative reasoning to a "test-correct" cycle, allowing hypotheses to be instantiated as computational experiments [44][45]
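The dynamic-system view described above — hypotheses evolving against an accumulating evidence state until it stabilizes — can be sketched as a simple loop. The control flow below is an illustrative reading of the article's three actions (generate, acquire evidence, update); `propose`, `gather_evidence`, and `refine` are hypothetical stand-ins, and "no new evidence" is assumed here as the stability criterion.

```python
# Illustrative hypothesis-evidence loop: propose, gather evidence,
# refine, and repeat until the evidence state stabilizes. The callables
# are hypothetical stand-ins for UniScientist's actual operations.
def research_loop(question, propose, gather_evidence, refine, max_rounds=10):
    hypothesis = propose(question)
    evidence = []
    for _ in range(max_rounds):
        new_evidence = gather_evidence(hypothesis)
        if not new_evidence:              # evidence state is stable: stop
            return hypothesis, evidence
        evidence.extend(new_evidence)     # integrate the new evidence
        hypothesis = refine(hypothesis, evidence)
    return hypothesis, evidence
```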
Can AI Really Do Research? UniPat AI Open-Sources UniScientist, and Its 30B Small Model Answers Yes
机器之心· 2026-03-09 02:00
Core Viewpoint
- The article emphasizes that most large models can generate text that resembles research, but very few can actually conduct research by proposing hypotheses, collecting evidence, and iteratively validating until conclusions are reached [1]

Group 1: Research Capability
- Many models performing "research tasks" merely mimic scientific writing without genuine validation, often falling into logical traps and lacking reproducibility [4][5]
- UniPat AI's UniScientist, with only 30 billion parameters, claims "autonomous scientific research" capability, continuously proposing, falsifying, and refining hypotheses until the evidence stabilizes [6][7]

Group 2: Data Bottleneck
- The article identifies a significant bottleneck in constructing high-quality research training data, with existing methods being either purely human or purely synthetic [8][11]
- Human experts excel at verification, while large language models are better at generating diverse research questions and solutions, suggesting a more efficient division of labor [12]

Group 3: Formalized Scientific Research
- UniScientist models the open scientific process as a dynamic system built on two operations: Active Evidence Integration and Model Abduction [15]
- The system executes three actions: generating hypotheses, acquiring external evidence, and updating hypotheses until the evidence is sufficiently stable [17][18]

Group 4: Verifiable Research Units
- UniScientist introduces Evolving Polymathic Synthesis, which expands validated scientific claims into research-level questions and synthesizes evaluation rubrics that assess scientific discovery rather than superficial qualities [20][21]
- The dataset includes over 4,700 research-level instances, each with 20+ rubric items, covering more than 50 disciplines and 400 research directions [22]

Group 5: Collective Intelligence
- UniScientist incorporates a results-aggregation training goal, allowing the model to merge strengths from multiple candidate research outputs into a more robust final result [30][31]

Group 6: Performance Metrics
- The evaluation results are notable: UniScientist-30B-A3B scored 28.3 on FrontierScience-Research, surpassing several leading closed-source models [33]
- Even without tools, the model's performance improved significantly, indicating enhanced research reasoning capabilities through training [35][36]

Group 7: Future Directions
- The next steps involve integrating real-world experimental capabilities and transitioning from narrative reasoning to a "test-correct" cycle in which hypotheses are instantiated as computational experiments [39][41]
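Breaking an open research question into independent, verifiable rubric items (20+ per instance, per the article) amounts to scoring a candidate answer against a checklist. A minimal sketch — the rubric items and the naive string checks below are invented illustrations standing in for real expert- or model-based grading:

```python
# Minimal sketch of rubric-based evaluation: a research instance carries
# independent yes/no rubric items, and a candidate output is scored by
# the fraction it satisfies. The checks are made-up illustrations, not
# UniScientist's actual rubrics.
def score_against_rubric(answer: str, rubric) -> float:
    """Return the fraction of rubric items the answer satisfies."""
    return sum(1 for check in rubric if check(answer)) / len(rubric)

rubric = [
    lambda a: "hypothesis" in a.lower(),   # states a hypothesis
    lambda a: "evidence" in a.lower(),     # grounds it in evidence
    lambda a: len(a.split()) >= 5,         # non-trivial write-up
]
```

Scoring each item independently is what makes the evaluation verifiable: every point of credit traces back to a single checkable criterion rather than an overall impression.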
Leaderboard Update: ByteDance's Seed2.0 Shines, and We Also Tested the Viral "Lobster" | xbench Monthly Report
红杉汇· 2026-03-04 02:49
Core Insights
- The article discusses the latest updates from xbench on various AI models, focusing in particular on the BabyVision benchmark and the competitive landscape among leading models [1][14]

Group 1: Model Performance and Rankings
- The latest leaderboard shows Doubao-Seed-2.0-pro ranking first among domestic models with an average score of 69.2, while its output token cost is only one-fourth of Gemini 3 Pro's [5]
- Qwen3.5-plus scored 65.6, a notable 10.6-point improvement over its predecessor, indicating a shift in focus toward stability and cost-effectiveness [7]
- GLM-5 scored 65.0, a 4.2-point increase over GLM-4.7, while maintaining high inference efficiency [8][9]

Group 2: Benchmarking and Evaluation
- The BabyVision benchmark, developed by xbench in collaboration with various AI companies and researchers, has been adopted by several new models, showcasing its relevance in the industry [14]
- Doubao-Seed-2.0-pro leads the BabyVision leaderboard with a score of 62.60%, demonstrating strong capabilities on multimodal visual understanding tasks [12]
- The competitive landscape is evolving, with models increasingly focusing on real-world agent tasks rather than single-point benchmarks [28]

Group 3: Technological Advancements
- Seed2.0, launched by ByteDance, enhances visual perception and reasoning, significantly improving the processing of complex documents and multimedia content [29][30]
- Qwen3.5 incorporates a hybrid attention mechanism and a sparse architecture, allowing efficient deployment and improved inference throughput [33]
- GLM-5 introduces advanced capabilities in automated code generation and complex system reconstruction, marking a significant evolution in AI model functionality [34]
Top AI Models Lose to Three-Year-Olds: The BabyVision Test Exposes a Hard Flaw in Multimodal Models
机器之心· 2026-01-12 05:01
Core Viewpoint
- The article discusses the limitations of current large models in visual understanding, emphasizing that while they excel at language and text reasoning, their visual capabilities remain underdeveloped, akin to those of a three-year-old child [3][4][49]

Group 1: BabyVision Overview
- UniPat AI, in collaboration with Sequoia China and various research teams, has launched a new multimodal understanding evaluation set called BabyVision to assess the visual capabilities of AI models [3][4]
- BabyVision aims to create a new paradigm for AI training, evaluation, and real-world application, focusing on making visual capabilities measurable and iterable [4][49]

Group 2: Evaluation Methodology
- BabyVision includes a direct comparison experiment in which 20 vision-centric tasks were given to children of different ages (3, 6, 10, and 12 years) and to top multimodal models [7]
- The evaluation strictly controls language dependency, requiring answers to be derived solely from visual information [8]

Group 3: Results and Findings
- Most models score significantly below the average performance of three-year-old children; the best model, Gemini3-Pro-Preview, achieved only 49.7%, still 20 percentage points below six-year-olds [15][21]
- Human participants reached 94.1% accuracy on the BabyVision-Full test, highlighting the substantial gap between human and model performance [20][21]

Group 4: Challenges Identified
- The study identifies four core challenges in visual reasoning for AI models: observing non-verbal details, maintaining visual tracking, lacking spatial imagination, and difficulty with visual pattern induction [27][30][36][39]
- These challenges point to a systemic lack of foundational visual capabilities in current models, rather than isolated deficiencies [23]

Group 5: Future Directions
- The article suggests that shifting visual reasoning tasks to visual operations, as demonstrated in BabyVision-Gen, may help bridge the gap in visual understanding [42]
- The ongoing development of BabyVision aims to guide the evolution of multimodal large models by breaking visual understanding down into 22 measurable atomic capabilities [49]
Multimodal Large Models Lose to Three-Year-Olds? xbench x UniPat Jointly Release New Evaluation Set BabyVision
Xin Lang Cai Jing· 2026-01-12 01:57
Core Insights
- The core issue is the significant gap in the visual understanding of multimodal large models when they cannot rely on language prompts, with performance comparable to that of a three-year-old child [2][34]
- The BabyVision assessment framework dissects visual capability into four main categories (fine-grained discrimination, visual tracking, spatial perception, visual pattern recognition) comprising 22 sub-tasks, to identify specific weaknesses in model performance [2][34]
- Evaluation results reveal a stark contrast between human and model performance: the human baseline accuracy is 94.1%, while the best closed-source model, Gemini3-Pro-Preview, achieved only 49.7%, followed by GPT-5.2 at 34.8%, Doubao-1.8 at 30.2%, and the best open-source model, Qwen3VL-235B-Thinking, at 22.2% [2][34]
- A key reason for this disparity is that many tasks cannot be fully expressed in language, leading to the notion of "unspeakable" tasks in which critical visual details are lost when compressed into tokens [2][34]
- BabyVision introduces a new direction by allowing models to generate visual outputs; BabyVision-Gen re-labels 280 tasks suitable for generative responses and achieves a 96% consistency rate with human evaluations [2][34]

Assessment Framework
- The BabyVision framework aims to break understanding of the world down into measurable, diagnosable, and iterable atomic capabilities, providing a roadmap for addressing the visual shortcomings of multimodal and embodied intelligence [3][35]
- In a direct comparison experiment, 20 vision-centric tasks were given to children of various ages and to top multimodal models, revealing that most models scored significantly below the average performance of three-year-old children [4][36]
- The only model to consistently exceed the three-year-old baseline was Gemini3-Pro-Preview, which still lagged approximately 20 percentage points behind six-year-old children [4][36]

Visual Capability Breakdown
- The visual capabilities were organized into four core areas, each with several sub-tasks [10][42]:
  - Fine-grained Discrimination: 8 sub-tasks focused on distinguishing subtle visual differences
  - Visual Tracking: 5 sub-tasks aimed at following paths, lines, and motion trajectories
  - Spatial Perception: 5 sub-tasks related to understanding three-dimensional structures and their relationships
  - Visual Pattern Recognition: 4 sub-tasks for identifying logical and geometric patterns
- The data collection process strictly adhered to copyright regulations, ensuring that only suitable images were used, and each question underwent a rigorous double-blind quality check [11][43]

Challenges Identified
- The research identified four typical challenges that models face in visual tasks:
  1. Non-verbal details: Models struggle with tasks requiring subtle visual distinctions that humans recognize easily [14][48]
  2. Tracking errors: Models often misinterpret paths and connections, leading to incorrect answers [16][51]
  3. Lack of spatial imagination: Models fail to accurately visualize and manipulate three-dimensional structures [19][53]
  4. Difficulty with pattern induction: Models tend to focus on superficial attributes rather than underlying structural rules [23][55]

Future Directions
- BabyVision-Gen represents a promising new approach, allowing models to perform visual reasoning by drawing and tracing, which may help address existing shortcomings [24][60]
- The importance of BabyVision lies in its potential to guide the development of multimodal models by identifying gaps in visual understanding and suggesting areas for improvement [29][61]
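The four-category, 22-sub-task taxonomy above lends itself directly to per-category scoring, which is how a benchmark like this diagnoses specific weaknesses rather than reporting a single number. A minimal sketch — the category names and sub-task counts come from the article, but the result records in the test are invented purely for illustration:

```python
# Per-category accuracy aggregation over BabyVision-style results.
from collections import defaultdict

CATEGORIES = {                         # category -> number of sub-tasks
    "Fine-grained Discrimination": 8,
    "Visual Tracking": 5,
    "Spatial Perception": 5,
    "Visual Pattern Recognition": 4,
}
assert sum(CATEGORIES.values()) == 22  # the 22 atomic capabilities

def per_category_accuracy(results):
    """results: iterable of (category, correct: bool) records."""
    totals, hits = defaultdict(int), defaultdict(int)
    for category, correct in results:
        totals[category] += 1
        hits[category] += int(correct)
    return {c: hits[c] / totals[c] for c in totals}
```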
Multimodal Large Models Lose to Three-Year-Olds? xbench x UniPat Jointly Release New Evaluation Set BabyVision
红杉汇· 2026-01-12 01:04
Core Insights
- The article discusses the advancements of large models in language and text reasoning, highlighting the need for models to understand visual information without relying on language; the BabyVision evaluation set was introduced to assess this capability [1][2]

Group 1: Evaluation of Visual Understanding
- BabyVision ran a direct comparison between children of various ages (3, 6, 10, and 12 years) and top multimodal models on 20 vision-centric tasks, revealing that most models scored below the average of 3-year-old children [2][4]
- The only model that consistently exceeded the 3-year-old baseline was Gemini3-Pro-Preview, which still lagged approximately 20 percentage points behind 6-year-old children [4]

Group 2: Breakdown of Visual Abilities
- The research team categorized visual abilities into four core categories: Visual Pattern Recognition, Fine-grained Discrimination, Visual Tracking, and Spatial Perception, with a total of 22 sub-tasks designed to quantify foundational visual skills [9][11]
- BabyVision was built through a rigorous data collection process, referencing children's cognitive materials and visual development tests, resulting in 388 high-quality visual questions [10][11]

Group 3: Performance Results
- In the BabyVision-Full evaluation, human participants achieved 94.1% accuracy, while the best-performing model, Gemini3-Pro-Preview, scored only 49.7%, with most models falling in the 12-19% range [13]
- The performance gap was consistent across all four categories, indicating a systemic lack of foundational visual capabilities in the models [13]

Group 4: Challenges Identified
- The article identifies several challenges faced by models, including the inability to process visual information without losing detail, leading to errors in tasks that require spatial imagination and visual pattern induction [15][23][26]
- Many tasks in BabyVision are described as "unspeakable," meaning they cannot be fully captured in language without losing critical visual information [15]

Group 5: Future Directions
- BabyVision-Gen was introduced to explore whether models can perform visual tasks like children do, by generating images or videos as answers; it shows some improvement in human-like behavior but still lacks consistent accuracy [27][28]
- The importance of BabyVision lies in its ability to break visual understanding down into measurable components, guiding the development of multimodal models toward true general intelligence and embodied intelligence [31]