Large Language Models (LLMs)
Microsoft Unveils the Deep Video Discovery Agent, Topping Multiple Long-Video Understanding Benchmarks
机器之心· 2025-06-30 03:18
Core Viewpoint
- The article discusses the limitations of large language models (LLMs) and large vision-language models (VLMs) in processing information-dense long videos, and introduces a novel agent called Deep Video Discovery (DVD) that significantly improves long-video understanding through advanced reasoning capabilities [1][3].

Group 1: Deep Video Discovery (DVD) Overview
- DVD segments long videos into shorter clips and treats them as an environment, using LLMs for reasoning and planning to answer questions effectively [3][6].
- The system achieved 74.2% accuracy on the challenging LVBench dataset, significantly surpassing previous models [3][17].
- DVD will be open-sourced as an MCP Server, improving accessibility for further research and development [3].

Group 2: System Components
- The system consists of three core components: a multi-granularity video database, a search-centric toolset, and an LLM acting as the agent coordinator [7][10].
- The multi-granularity video database converts long videos into a structured format, extracting information at multiple levels, such as global summaries and segment-level details [10].
- The agent employs three main tools: Global Browse for high-level context, Clip Search for efficient semantic retrieval, and Frame Inspect for detailed pixel-level information [11][12][13].

Group 3: Performance Evaluation
- DVD was evaluated across multiple long-video benchmarks and consistently outperformed existing models, including a 13.4% improvement over MR. Video and a 32.9% improvement over VCA [17].
- With auxiliary transcripts, accuracy increased further to 76.0%, demonstrating the system's robustness [17].
- Analysis of different foundation models revealed significant behavioral differences, underscoring the importance of reasoning capability for agent performance [18].
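The three-tool design described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the Global Browse / Clip Search / Frame Inspect pattern, not Microsoft's actual implementation: the database, clip summaries, and keyword matching (standing in for real semantic retrieval over embeddings) are all assumptions for the sake of the example.

```python
# Hypothetical sketch of a DVD-style multi-granularity video database.
# All names and data here are illustrative, not the real system.
from dataclasses import dataclass, field


@dataclass
class Clip:
    index: int
    summary: str                                  # segment-level description
    frames: list = field(default_factory=list)    # stand-in for pixel data


@dataclass
class VideoDatabase:
    global_summary: str
    clips: list

    def global_browse(self) -> str:
        """High-level context: the whole-video summary."""
        return self.global_summary

    def clip_search(self, query: str) -> list:
        """Keyword match over clip summaries; a real system would use
        semantic (embedding-based) retrieval instead."""
        terms = query.lower().split()
        return [c for c in self.clips
                if any(t in c.summary.lower() for t in terms)]

    def frame_inspect(self, clip_index: int) -> list:
        """Detailed, frame-level information for a single clip."""
        return self.clips[clip_index].frames


# Usage: an LLM coordinator would call these tools in a reasoning loop,
# drilling down from coarse context to fine-grained evidence.
db = VideoDatabase(
    global_summary="A cooking show: prep, frying, plating.",
    clips=[
        Clip(0, "chef chops onions", frames=["frame0a", "frame0b"]),
        Clip(1, "pan frying the onions", frames=["frame1a"]),
        Clip(2, "plating the dish", frames=["frame2a"]),
    ],
)
hits = db.clip_search("frying")
```

The point of the layering is cost control: the agent consults the cheap global summary first, narrows to candidate clips via search, and only pays for expensive frame-level inspection on the few clips that matter.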
Highlights from Karpathy's Latest Talk: In the Software 3.0 Era, Everyone Is a Programmer
歸藏的AI工具箱· 2025-06-19 08:20
Core Insights
- The software industry is undergoing a paradigm shift from traditional coding (Software 1.0) to neural networks (Software 2.0), and now to Software 3.0, driven by large language models (LLMs) [1][11][35]

Group 1: Software Development Paradigms
- Software 1.0 is traditional code written directly by programmers in languages like Python and C++, where each line represents explicit instructions to the computer [5][6]
- Software 2.0 centers on neural-network weights: programming means curating datasets and running optimizers to produce parameters, which makes the result less human-readable [7][10]
- Software 3.0 introduces programming through natural-language prompts, allowing users to instruct LLMs without specialized coding knowledge [11][12]

Group 2: Characteristics and Challenges
- Software 1.0 faces challenges such as computational heterogeneity and difficulties with portability and modularity [9][10]
- Software 2.0 offers advantages such as data-driven development and ease of hardware implementation, but has limitations like non-constant runtime and memory usage [10][11]
- Software 3.0, while user-friendly, suffers from poor interpretability, non-intuitive failures, and susceptibility to adversarial attacks [11][12]

Group 3: LLMs and Their Implications
- LLMs are likened to utilities: training requires significant capital expenditure, services are delivered through APIs, and the focus is on low latency and high availability [16]
- Training LLMs is compared to semiconductor fabs, highlighting the need for substantial investment and deep technological expertise [17]
- LLMs are becoming complex software ecosystems, akin to operating systems, where applications can run on various LLM backends [18]

Group 4: Opportunities and Future Directions
- LLMs enable partially autonomous applications that integrate LLM capabilities while keeping the user in control [25][26]
- The concept of "Vibe Coding" suggests that LLMs can democratize programming by letting anyone code through natural language [30]
- Human oversight of LLM applications remains essential; the talk advocates a rapid generation-validation cycle to catch and mitigate errors [12][27]

Group 5: Building for Agents
- The focus is on building infrastructure for "Agents": human-like computational entities that interact with software systems [33]
- Agent-friendly documentation and tools are crucial for improving how LLMs understand and interact with complex systems [34]
- The future is framed as a new era of human-machine collaboration, with 2025 marking the beginning of a major transformation in digital interactions [33][35]
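The generation-validation cycle mentioned above can be sketched as a simple retry loop: a generator proposes a draft, an automated validator accepts or rejects it, and rejected drafts trigger regeneration. This is a minimal illustration under stated assumptions; `fake_llm_generate` is a deliberate stand-in (a real application would call an actual LLM), and parsing the draft with `compile` is just one cheap example of a validation check.

```python
# Minimal sketch of a generation-validation loop for LLM output.
# fake_llm_generate is a hypothetical stand-in for a real LLM call;
# its first draft is deliberately malformed to exercise the retry path.

def fake_llm_generate(prompt: str, attempt: int) -> str:
    drafts = ["print('hello'", "print('hello')"]
    return drafts[min(attempt, len(drafts) - 1)]


def validate(code: str) -> bool:
    # Cheap automated check: does the draft at least parse as Python?
    try:
        compile(code, "<draft>", "exec")
        return True
    except SyntaxError:
        return False


def generate_validated(prompt: str, max_attempts: int = 3) -> str:
    """Regenerate until a draft passes validation or the budget runs out."""
    for attempt in range(max_attempts):
        draft = fake_llm_generate(prompt, attempt)
        if validate(draft):
            return draft
    raise RuntimeError("no valid draft within budget")


result = generate_validated("write hello world")
```

The design point is that validation must be much faster than human review: the tighter and cheaper the check, the more of the oversight burden can be automated, leaving humans to review only drafts that already pass mechanical scrutiny.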
Brief | AvatarOS Raises $7M Seed Round to Build AI-Driven High-End 3D Avatars
Z Potentials· 2025-03-11 03:27
Core Viewpoint
- The article discusses the emergence of AvatarOS, a startup focused on creating high-quality virtual personas, leveraging advances in generative AI to revive interest in virtual identities after the initial metaverse hype faded [1][2].

Company Overview
- AvatarOS was founded by Isaac Bratzel, who has a strong background in the virtual-influencer space, having previously worked at IPsoft and Brud [2].
- The company has raised $7 million in seed funding led by M13, with participation from Andreessen Horowitz Games Fund, HF0, Valia Ventures, and Mento VC [2][3].

Product Development
- AvatarOS aims to create high-end virtual personas in 3D, distinguishing itself from existing one-click content-generation tools [4].
- The company is currently recruiting test users and has released a simple API that lets clients embed virtual personas in their websites [5].
- Future plans include tools for clients to create and customize their own virtual personas, with a focus on unique, human-like movement [5][6].

Market Positioning
- The company sees a need for high-quality virtual figures that stand out in a saturated content market, aiming to build lasting virtual entities that accumulate value over time [4].
- The investment from M13 is viewed as an exploratory bet to find the right business model and clarify AvatarOS's future direction [3].