Just In: AI Heavyweight Wei Liu's Video Startup Video Rebirth Raises $50 Million
机器之心· 2025-11-07 03:06
Core Insights
- Video Rebirth, an AI video startup, has raised $50 million to advance its video generation technology and expand its market reach [1][3]
- The company aims to address significant gaps in existing AI video models, particularly in precision, controllability, and physical realism for professional creators [3][4]

Funding and Investment
- The round attracted a strong lineup of investors, including leading dollar funds globally and in Singapore, internet giants, established gaming companies from China and South Korea, top chip manufacturers, and renowned family offices [1]
- The capital will primarily fund continuous iteration of the video models, recruitment of top talent, and global market expansion [1]

Company Vision and Technology
- Founded by Dr. Wei Liu, a former Tencent scientist, Video Rebirth is focused on building a "video-native world model" [1][3]
- Its core innovations are an advanced diffusion architecture and a Physics Native Attention mechanism, which improve generation of content that follows complex instructions while maintaining physical realism [3]
- The company plans to release its version 1.0 product by December 2025, shifting from consumer tools toward high-fidelity video generation platforms for professional creators in advertising, e-commerce, film, animation, and gaming [1][3]

Industry Context
- The AI video generation sector is growing rapidly in 2025, yet substantial room remains for meeting the demands of professional creators [3]
- Video Rebirth's stated mission is to leverage original technology, a focused organization, and efficient execution to drive industry development and build an ecosystem for the next generation of AI-generated entertainment [4]
Feed-Forward 3D Survey: How 3D Vision Gets There "in One Step"
机器之心· 2025-11-06 08:58
Core Insights
- The article surveys advances in 3D vision, focusing on the transition from traditional per-scene optimization methods to Feed-Forward 3D approaches, which improve efficiency and generalization [2][4]

Summary by Sections

Overview of Feed-Forward 3D
- The survey traces the evolution of 3D reconstruction from Structure-from-Motion (SfM) to Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), emphasizing the shift toward Feed-Forward 3D methods that eliminate per-scene optimization [2][6]

Key Technological Branches
- Five main architectural categories of Feed-Forward 3D methods are identified, each contributing significantly to the field's progress [6][7]
- Neural Radiance Fields (NeRF) introduced a differentiable volume-rendering framework but suffered efficiency problems from scene-specific optimization; conditional NeRF variants have spawned branches that predict radiance fields directly [7][9]
- PointMap models, led by DUSt3R, predict pixel-aligned 3D point clouds directly within a Transformer framework, improving efficiency and memory capability [9][10]
- 3D Gaussian Splatting (3DGS) represents scenes as Gaussian point clouds, balancing rendering quality and speed; recent work outputs Gaussian parameters directly in a feed-forward pass [10][12]
- Mesh, Occupancy, and SDF models integrate traditional geometric modeling with modern techniques, enabling high-precision surface modeling [14][19]

Applications and Benchmarking
- The paper summarizes Feed-Forward models across tasks including camera pose estimation, point map estimation, and single-image view synthesis, with a comprehensive benchmark spanning more than 30 common 3D datasets [16][18][22]
- Evaluation metrics such as PSNR, SSIM, and Chamfer Distance are established to facilitate model comparison and performance assessment (a minimal Chamfer Distance sketch follows this entry) [18][23]

Future Challenges and Trends
- The survey identifies four major open questions for future research, including integration of Diffusion Transformers, scalable 4D memory mechanisms, and construction of multimodal large-scale datasets [27][28]
- Remaining challenges include the predominance of RGB-only data, the need for better reconstruction accuracy, and difficulties in free-viewpoint rendering [29]
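The benchmarking discussion above leans on standard geometry metrics. As one concrete example, here is a minimal NumPy sketch of the symmetric Chamfer Distance between two point clouds, in its brute-force O(N·M) form (for illustration only, not any benchmark's official implementation):

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point clouds p (N,3) and q (M,3).

    For each point in one cloud, take the squared distance to its nearest
    neighbor in the other cloud; average both directions and sum them.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # Nearest-neighbor terms in each direction.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Toy usage: two noisy samplings of the same random point set.
rng = np.random.default_rng(0)
pts = rng.uniform(size=(512, 3))
print(chamfer_distance(pts, pts + rng.normal(scale=0.01, size=pts.shape)))
```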
Google's AlphaEvolve Impresses: Terence Tao Even Published a Paper, Inspired by Its New Mathematical Constructions
机器之心· 2025-11-06 08:58
Core Insights
- The paper showcases how AlphaEvolve, a tool developed by Google DeepMind, autonomously discovers new mathematical constructions and deepens understanding of long-standing mathematical problems [2][8]
- AlphaEvolve represents a significant advance in mathematical discovery, combining large language models (LLMs) with evolutionary computation and automated evaluation [8][16]
- AlphaEvolve rediscovered known optimal solutions and improved on them in several cases, demonstrating its potential to match or exceed existing best results [10][11]

Group 1: AlphaEvolve's Capabilities
- AlphaEvolve can autonomously explore mathematical spaces and generate new structures, significantly reducing problem-setup time compared to traditional methods [11][12]
- The system operates at multiple levels of abstraction, optimizing both specific mathematical constructions and the algorithms used to discover them, a new form of recursive evolution [12][13]
- The research team tested AlphaEvolve on 67 problems across mathematical domains including analysis, combinatorics, geometry, and number theory [9]

Group 2: Methodology and Design
- AlphaEvolve employs a search algorithm that iteratively refines candidate solutions, akin to hill climbing (a schematic sketch of such a loop follows this entry) [18][19]
- The system evolves entire code files rather than single functions, enabling it to handle more complex mathematical problems [20]
- A dedicated search mode lets AlphaEvolve evolve heuristic algorithms that can explore vast numbers of candidate constructions efficiently [28][29]

Group 3: Integration of AI Tools
- The research highlights a workflow integrating multiple AI tools, such as Deep Think and AlphaProof, to achieve a complete cycle from intuitive discovery to formal verification [34]
- This integration demonstrates how specialized AI systems can collaborate in mathematical research, enhancing the overall discovery process [34]

Group 4: Observations and Limitations
- While AlphaEvolve excels at discovering constructions within current mathematical capabilities, it may struggle with problems requiring genuinely novel insights [43][44]
- The design of the verification system significantly affects result quality, underscoring the need for robust evaluation environments [39]
- AlphaEvolve's performance improves when trained on related problems, indicating benefits from cross-problem training [42]
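AlphaEvolve itself is not open source, so the following is only a schematic sketch of the generate-evaluate-select loop described above: a proposer mutates candidates, an automated evaluator scores them, and selection pressure keeps the best. The `propose_mutation` callable stands in for the LLM proposer and is purely hypothetical:

```python
import random

def evolve(seed, evaluate, propose_mutation, population_size=16, generations=100):
    """Minimal evolutionary search loop in the spirit of AlphaEvolve:
    a proposer mutates candidate constructions, an automated evaluator
    scores them, and only the top candidates survive each generation.
    """
    population = [seed]
    for _ in range(generations):
        parent = random.choice(population)   # sample a parent candidate
        child = propose_mutation(parent)     # LLM-style mutation (placeholder)
        population.append(child)
        # Keep only the top-scoring candidates (hill-climbing pressure).
        population.sort(key=evaluate, reverse=True)
        population = population[:population_size]
    return population[0]

# Toy usage: maximize the number of 1-bits in a fixed-length bitstring.
best = evolve(
    seed=[0] * 32,
    evaluate=sum,
    propose_mutation=lambda c: [b ^ (random.random() < 0.05) for b in c],
)
print(sum(best))
```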
RLinf Adds πRL: Online Reinforcement Learning Fine-Tuning for π0 and π0.5
机器之心· 2025-11-06 08:58
Core Insights
- The article covers advances in robotics, focusing on the VLA models π0 and π0.5 from Physical Intelligence, which use flow matching to generate high-dimensional, smooth, continuous action sequences and show clear advantages on complex manipulation tasks [2][3]

Group 1: VLA Models and Challenges
- VLA models rely heavily on large-scale, high-quality human demonstration data, which is costly and time-consuming to collect and annotate [2]
- Reinforcement learning (RL) lets agents explore and improve through real interaction with the environment, reducing dependence on demonstration data and raising the performance ceiling of supervised fine-tuning (SFT) [2]

Group 2: πRL Framework
- A collaboration among institutions including Tsinghua University, Peking University, and CMU produced the πRL framework for online reinforcement learning fine-tuning of flow-matching VLA models [3]
- On the LIBERO benchmark, πRL achieved average success rates of 97.6% for π0 and 98.3% for π0.5, validating the fine-tuning approach [3]

Group 3: Technical Innovations
- πRL introduces two technical routes, Flow-Noise and Flow-SDE, to address the difficulty of directly computing the log-likelihood of output actions in flow-matching VLAs [8][10]
- Flow-Noise models the denoising process as a discrete-time Markov process, enabling direct computation of the joint probability density of the denoised sequence (see the sketch after this entry) [10]
- Flow-SDE couples the denoising process with environment interaction, constructing a two-layer Markov decision process (MDP) [20]

Group 4: Performance Improvements
- πRL raised success rates by over 40% across 4,352 pick-and-place task combinations, with final success rates exceeding 80% [3][24]
- On LIBERO, πRL improved π0's average success rate from 57.6% to 97.6% and π0.5's from 77.1% to 98.3%, surpassing flow-matching VLAs trained on the full dataset [19]

Group 5: Generalization and Robustness
- πRL significantly improves both models' generalization in new environments, as shown by domain-randomization tests [26]
- The framework also reduces the average number of steps needed to complete tasks, indicating better efficiency than supervised fine-tuning [28]

Group 6: Future Directions
- Future work on πRL will add more benchmarks, deeper analysis of out-of-distribution (OOD) generalization, and further exploration of critic design for improved stability [35][36]
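To make the Flow-Noise idea concrete, here is a minimal PyTorch sketch under our own simplifying assumption (for illustration, not the paper's implementation): each Euler denoising step is turned into a Gaussian transition by injecting noise, so the log-likelihood of the whole denoising chain is a sum of per-step Gaussian log-probs. `velocity_fn` and `sigmas` are hypothetical stand-ins for the policy's velocity network and noise schedule:

```python
import torch

def noisy_denoise_with_logprob(x_T, velocity_fn, sigmas, dt):
    """Treat each flow-matching denoising step as a Gaussian transition
    of a discrete Markov chain; the chain's joint log-likelihood is the
    sum of per-step Gaussian log-probs, usable in an RL objective."""
    x, total_logp = x_T, 0.0
    for t, sigma in enumerate(sigmas):
        mean = x + dt * velocity_fn(x, t)      # deterministic Euler step
        dist = torch.distributions.Normal(mean, sigma)
        x = dist.sample()                      # stochastic transition
        total_logp = total_logp + dist.log_prob(x).sum()  # chain log-prob
    return x, total_logp

# Toy usage: a trivial "velocity network" on 2-D actions over 5 steps.
vel = lambda x, t: -x
action, logp = noisy_denoise_with_logprob(torch.randn(2), vel, sigmas=[0.1] * 5, dt=0.2)
print(action, logp)
```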
Extending the External Test-Time Scaling Law, a New Finding from Zhongguancun Academy: Lightweight Verifiers Unlock Optimal Selection in LLM Reasoning
机器之心· 2025-11-06 05:28
Core Insights
- The article discusses Test-Time Scaling (TTS), which improves the reasoning of large language models (LLMs) by allocating more compute during the model's response phase [4][6]
- It introduces TrajSelector, a lightweight yet effective Best-of-N strategy that uses the hidden states of the large policy model to evaluate reasoning paths, without expensive process annotations or large reward models [7][10]

Summary by Sections

Research Background
- TTS divides into internal and external methods; the latter focus on parallel reasoning, generating multiple candidate paths before producing a final answer [4][6]

Existing Methods and Their Limitations
- Traditional Best-of-N methods include Majority Voting and Process Reward Models (PRMs), both of which suffer significant drawbacks such as instability and inefficiency [5][10]

TrajSelector Methodology
- TrajSelector runs a three-step pipeline of parallel sampling, step scoring, and aggregation to select the optimal reasoning path (a minimal sketch follows this entry) [12][14]
- A lightweight scoring model (0.6B parameters) assesses reasoning steps from the hidden states of the larger policy model, achieving better scoring performance at a much smaller parameter count [13][14]

Training Approach
- TrajSelector uses a weakly supervised training scheme that avoids extensive manual annotation, letting the model learn effectively from large datasets [16][17]

Experimental Results
- Performance across a range of N values in Best-of-N tasks shows TrajSelector outperforming traditional methods on multiple benchmarks [19][20]

Conclusion
- TrajSelector advances reasoning optimization for large models, underscoring that effectively exploiting existing model capabilities matters more than simply increasing model size [22][23]
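Below is a minimal sketch of the three-step pipeline as summarized above. It assumes, hypothetically, that each sampled trajectory arrives as a tensor of per-step hidden states from the policy model, that `scorer` is the lightweight scoring head, and that step scores are aggregated by a simple mean (the actual aggregation rule is not detailed here):

```python
import torch

def select_best_trajectory(hidden_states_per_traj, scorer):
    """TrajSelector-style Best-of-N selection sketch:
    1) the policy model samples N reasoning trajectories in parallel,
    2) a small scorer maps each step's hidden state to a step score,
    3) per-trajectory scores are aggregated and the argmax wins.

    hidden_states_per_traj: list of (num_steps, hidden_dim) tensors.
    """
    traj_scores = []
    for h in hidden_states_per_traj:
        step_scores = scorer(h).squeeze(-1)      # (num_steps,)
        traj_scores.append(step_scores.mean())   # mean aggregation (assumed)
    return int(torch.stack(traj_scores).argmax())

# Toy usage with a random linear scorer over a 16-dim hidden space.
scorer = torch.nn.Linear(16, 1)
trajs = [torch.randn(torch.randint(3, 8, (1,)).item(), 16) for _ in range(4)]
print(select_best_trajectory(trajs, scorer))
```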
From Street-Scanning Rankings to Robotaxi: Spatial Intelligence Fully Opens Up Gaode's Possibilities
机器之心· 2025-11-06 05:28
Core Viewpoint
- Gaode (Amap) is moving beyond traditional map navigation toward broader applications of spatial intelligence, integrating its capabilities into diverse scenarios with a focus on the automotive sector and Robotaxi services [3][4][5]

Group 1: Gaode's Strategic Shift
- Gaode announced a partnership with Xiaopeng (Xpeng) Motors to provide Robotaxi services globally, a significant step in integrating spatial intelligence with transportation services [5][7]
- The collaboration is a key move in turning the concept of spatial intelligence into practical reality and enhancing Gaode's service offerings [7][12]

Group 2: Spatial Intelligence Capabilities
- Gaode's spatial intelligence emphasizes spatial positioning, time prediction, and physical interaction, capabilities essential for understanding and navigating the real world [9][10]
- The system closes a "prediction - action - verification" loop, feeding real-time data back to refine its understanding of spatial context, something pure language models struggle to achieve [12][23]

Group 3: Impact on the Robotaxi Industry
- Spatial intelligence brings new possibilities to the Robotaxi sector, particularly in improving vehicle perception of and response to complex traffic situations [14][15]
- Gaode's "super-distance" capability allows early detection of traffic incidents, proactively alerting vehicles before they reach congested areas and improving safety and efficiency [15][17]

Group 4: Broader Applications of Spatial Intelligence
- Beyond Robotaxi, Gaode is integrating spatial intelligence into applications such as personalized travel planning and real-time navigation assistance, demonstrating its versatility [22][21]
- The technology is also applied in B2B contexts such as smart glasses and low-altitude economy platforms, indicating its potential to redefine interaction with the physical world across industries [22][21]
An ICML 2026 New-Rules "Pitfall-Avoidance" Guide: Attendance Not Required, Original Submissions to Be Made Public, Caps on Reciprocal Reviewing
机器之心· 2025-11-06 05:28
Core Points
- ICML 2026 will take place July 7-12, 2026, in Seoul, South Korea, with double-blind review for all submitted papers [4]
- Authors of accepted papers may choose whether to attend the conference in person or only have their papers included in the proceedings [7]
- The originally submitted versions of accepted papers will be made public, and authors of rejected papers may also choose to make their original submissions public [10]

Submission Requirements
- Papers must be submitted as a single file, with the main text capped at 8 pages; references, impact statements, and appendices have no page limit [5]
- There is no separate deadline for supplementary materials, and authors may add one extra page to the final version of accepted papers [6]
- Papers that do not comply with the submission requirements will be rejected without review [11]

Important Dates
- The submission site opens January 8, 2026; the abstract deadline is January 23, 2026, and the full paper deadline is January 28, 2026 [14][15]

Review Process
- Authors are required to participate in reviewing, with specific reciprocal-review requirements for both papers and authors [17]
- Simultaneous submission to other conferences or journals is prohibited [18]
- All submissions must be anonymized and must not contain information that could reveal the authors' identities [21]

Ethical Guidelines
- Each paper must include a statement of potential societal impact, placed at the end of the paper and not counted toward the page limit [23]
- Authors must submit a plain-language summary communicating the significance of their research to the public [24]
- Violations of the review process or ethical guidelines may result in sanctions or rejection of the submission [22][23]
Open-Sourced to Instant Acclaim: NVIDIA Unveils OmniVinci, an Omni-Modal Large Model
机器之心· 2025-11-06 05:28
Core Insights
- The article covers NVIDIA's launch of OmniVinci, a new open-source omni-modal large language model (LLM) that unifies visual, audio, and language understanding in a shared latent space, enabling AI to perceive and generate content across modalities [2][10][42]
- OmniVinci posts significant gains over competitors on multimodal benchmarks while using roughly one-sixth the training data, demonstrating superior data efficiency [6][10][22]

Multimodal Understanding
- OmniVinci excels at key multimodal tasks, including video-audio cross-modal understanding and audio comprehension, outperforming other models on benchmark tests [6][10]
- The architecture rests on three core innovations: OmniAlignNet for cross-modal semantic alignment, Temporal Embedding Grouping (TEG) for understanding event order, and Constrained Rotary Time Embedding (CRTE) for absolute time perception (an illustrative rotary time-encoding sketch follows this entry) [10][12][13]

Data Engine
- The team built a multimodal data engine of 24 million dialogue samples across images, videos, audio, and speech: 36% images, 38% audio and speech, 11% video, and 15% mixed multimodal data [15]
- Two learning methods are used: implicit learning, which exploits existing video-audio Q&A data, and explicit learning, which generates separate visual and audio descriptions for cross-correction [15][19]

Key Insights from Experiments
- Single-modality labeling can induce "modal hallucinations," underscoring the need for integrated approaches to comprehensive understanding [17]
- Combining audio and visual data significantly improves performance, with each additional learning stage yielding further gains [19][20]
- Reinforcement learning (RL) further strengthens OmniVinci, with audio providing a substantial boost to training efficiency [22]

Real-World Applications
- OmniVinci handles real-world scenarios such as following complex podcast discussions, transcribing speech, and executing voice commands for robotic actions [25][31][33]
- It can analyze medical imaging while understanding professional commentary, suggesting potential in healthcare [35]
- In sports broadcasting, it interprets visual action and commentary simultaneously, proving useful for live event analysis [39]

Future Implications
- OmniVinci points toward unified multimodal perception systems that cut training costs and speed iteration for broader applications [43][44]
- Potential applications range from command-following robots to healthcare AI that interprets medical data, signaling a rapidly approaching smarter future [43][44]
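NVIDIA's exact CRTE formulation is not reproduced here; the sketch below only illustrates the general rotary-style mechanism for encoding absolute time that CRTE is described as building on, rotating feature pairs by angles proportional to each token's timestamp so absolute time is carried in the phase:

```python
import torch

def rotary_time_embedding(x, timestamps, max_period=10000.0):
    """Illustrative rotary-style absolute-time encoding (not NVIDIA's
    exact CRTE): rotate each consecutive feature pair of x by an angle
    proportional to the token's timestamp.

    x: (seq, dim) features with even dim; timestamps: (seq,) in seconds.
    """
    seq, dim = x.shape
    freqs = 1.0 / max_period ** (torch.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = timestamps[:, None] * freqs[None, :]                # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # standard 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Toy usage: 4 tokens with irregular timestamps, 8-dim features.
feats = torch.randn(4, 8)
print(rotary_time_embedding(feats, torch.tensor([0.0, 0.5, 2.0, 3.5])).shape)
```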
The Whole Internet Is Arguing: Is Xpeng's Humanoid Robot Actually a Human in Disguise?
机器之心· 2025-11-06 03:28
Core Viewpoint
- Xpeng Motors has transitioned from being solely an automotive company to an AI company, showcasing its humanoid robot IRON, which has sparked significant discussion globally [8][9]

Group 1: Robot Development and Features
- Xpeng has been developing humanoid robots for 7 years, evolving from quadrupedal forms to a fully humanoid design [10]
- The new IRON features a human-like skeletal structure, a bionic muscle system, and fully flexible skin, significantly reducing its mechanical appearance [11]
- At approximately 1.78 meters tall and 70 kg, IRON is taller than robots from competitors such as NEO [12]
- IRON has 22 degrees of freedom in its hands and 65 degrees of freedom in total, enabling complex tasks such as folding clothes and cleaning [15][16]
- Its advanced movement is supported by a sophisticated control system, though specific details have not been disclosed [18]

Group 2: AI and Interaction
- IRON is powered by Xpeng's self-developed AI brain, built on three Turing AI chips with a combined 2,250 TOPS of computing power [26]
- It integrates three cognitive models (VLT, VLA, VLM) for visual perception, language understanding, and action decision-making, enabling seamless interaction [27]
- The head carries a 3D curved display serving as both a face and an interactive interface for more natural human-robot communication [28]

Group 3: Market Strategy and Future Plans
- Xpeng plans to mass-produce IRON by 2026, primarily for its own commercial scenarios such as showroom guides and sales assistants [35][37]
- The company acknowledges the limits of deploying robots in manufacturing, citing lower efficiency than human workers [35]
- CEO He Xiaopeng expects humanoid robots to enter factories and homes gradually, estimating 3-5 years for industrial adoption and 5-10 years for household use [36]
- Xpeng will also launch an IRON SDK inviting third-party developers to build applications, with initial partners including major companies such as Baosteel [38]
NeurIPS 2025 Spotlight | Selective Knowledge Distillation with Precise Filtering: AdaSPEC, an Accelerator for Speculative Decoding, Arrives
机器之心· 2025-11-06 03:28
Core Insights
- The article introduces AdaSPEC, a selective knowledge distillation method designed to improve speculative decoding for large language models (LLMs) [3][9][16]
- AdaSPEC improves alignment between draft and target models by filtering out hard-to-learn tokens during distillation, raising the overall token acceptance rate without degrading generation quality [3][11][16]

Research Background
- LLMs excel at reasoning and generation but suffer high inference latency and compute cost from their autoregressive decoding mechanism [6]
- Traditional acceleration methods such as model compression and knowledge distillation often trade generation quality for speed [6]

Method Overview
- AdaSPEC uses a selective token-filtering mechanism that lets the draft model concentrate on "easy-to-learn" tokens, improving its alignment with the target model [3][9]
- Training runs in two stages: a reference model first identifies difficult tokens, then the training set is filtered and the draft model is distilled on the remainder (a minimal sketch follows this entry) [11][12]

Experimental Evaluation
- Systematic evaluations across model families (Pythia, CodeGen, Phi-2) and tasks (GSM8K, Alpaca, MBPP, CNN/DailyMail, XSUM) show consistent, robust gains in token acceptance rate [14]
- AdaSPEC outperforms the previous best method, DistillSpec, with token acceptance rates up to 15% higher across tasks [15]

Summary and Outlook
- AdaSPEC offers a precise, efficient, and broadly applicable paradigm for accelerating speculative decoding, paving the way for research and industrial deployment of efficient LLM inference [16]
- Two avenues merit further exploration: dynamic estimation of token difficulty, and applying AdaSPEC to multimodal and reasoning-focused large models [17]
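Here is a minimal PyTorch sketch of the selective-distillation step as described above. The per-token difficulty proxy (KL divergence between the target and reference distributions) and the fixed keep ratio are our assumptions for illustration, not the paper's exact filtering criterion:

```python
import torch
import torch.nn.functional as F

def selective_distill_loss(draft_logits, target_logits, ref_logits, keep_ratio=0.8):
    """AdaSPEC-style selective distillation sketch: a reference model
    flags tokens the draft is unlikely to master, and the KL distillation
    loss is computed only on the easier remainder.

    All logits: (num_tokens, vocab).
    """
    with torch.no_grad():
        # Per-token difficulty proxy: KL(target || reference).
        difficulty = F.kl_div(
            F.log_softmax(ref_logits, dim=-1),
            F.softmax(target_logits, dim=-1),
            reduction="none",
        ).sum(-1)
        # Keep the easiest `keep_ratio` fraction of tokens.
        k = max(1, int(keep_ratio * difficulty.numel()))
        keep = difficulty.topk(k, largest=False).indices
    # Distill the draft toward the target on the filtered tokens only.
    return F.kl_div(
        F.log_softmax(draft_logits[keep], dim=-1),
        F.softmax(target_logits[keep], dim=-1),
        reduction="batchmean",
    )

# Toy usage with random logits over a 50-token batch, vocab size 100.
d, t, r = (torch.randn(50, 100) for _ in range(3))
print(selective_distill_loss(d, t, r).item())
```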