量子位
An upper-body 3D avatar from a single image in 0.1 seconds! New Tsinghua-IDEA framework accepted to ICCV 2025
量子位· 2025-08-21 02:36
Core Viewpoint
- The article introduces GUAVA, a novel framework developed by researchers from Tsinghua University and IDEA that creates upper-body 3D avatars from a single image in just 0.1 seconds, without multi-view videos or per-subject training [1][5][36].

Summary by Sections

Introduction
- GUAVA is recognized for creating realistic and expressive upper-body avatars, which is valuable in fields such as film, gaming, and virtual meetings [4].

Challenges and Innovations
- Creating avatars from a single image has been a significant challenge, particularly in achieving real-time rendering and ease of creation. GUAVA addresses these challenges by reconstructing the avatar within seconds at inference time and supporting real-time animation [5][6].

Methodology
- GUAVA introduces the Expressive Human Model (EHM) to enhance facial-expression capture, overcoming limitations of existing models [12][36].
- The framework employs a two-branch reconstruction model that combines a "template Gaussian" and a "UV Gaussian", maintaining geometric structure while capturing detailed textures [14][15].
- Real-time animation is achieved by deforming the Ubody Gaussians according to new pose parameters, followed by a neural refiner that enhances rendering quality [16][17].

Experimental Results
- The experimental dataset included over 620,000 frames of upper-body video, with evaluations based on ID consistency, efficiency, and viewpoint control [18][20].
- GUAVA outperformed existing 2D and 3D methods in rendering quality and efficiency, achieving approximately 50 FPS and a reconstruction time of around 0.1 seconds [22][23].
- In self-reenactment, GUAVA showed superior performance across all metrics compared to 2D methods, while also maintaining ID consistency in cross-reenactment [22][25].

Conclusion
- GUAVA represents a significant advancement in 3D avatar reconstruction, demonstrating improved rendering quality and efficiency over existing methods, with a reconstruction time of approximately 0.1 seconds and support for real-time animation [36][37].
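To make the animation step more concrete, here is a minimal sketch, assuming hypothetical module names and shapes (this is not the authors' released code): canonical Gaussian parameters are deformed by per-joint pose transforms in a linear-blend-skinning style, and a small neural refiner upgrades a coarse feature render into the final frame.

```python
# Illustrative sketch only: animating a reconstructed Gaussian avatar by deforming
# per-Gaussian positions with new pose parameters, then refining a coarse render
# with a small neural network. All names, shapes, and layers are assumptions.
import torch
import torch.nn as nn

class GaussianAvatar(nn.Module):
    def __init__(self, num_gaussians=20000, feat_dim=32):
        super().__init__()
        # Canonical Gaussian parameters, predicted once from the input image
        self.xyz = nn.Parameter(torch.randn(num_gaussians, 3))        # positions
        self.features = nn.Parameter(torch.randn(num_gaussians, feat_dim))
        # Hypothetical refiner: turns a coarse feature render into an RGB frame
        self.refiner = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def deform(self, pose_transforms, skin_weights):
        # Linear-blend-skinning-style deformation: each Gaussian moves by a
        # weighted mix of per-joint rigid transforms from the new pose.
        # pose_transforms: (J, 4, 4), skin_weights: (N, J)
        homo = torch.cat([self.xyz, torch.ones_like(self.xyz[:, :1])], dim=-1)   # (N, 4)
        per_joint = torch.einsum('jab,nb->nja', pose_transforms, homo)           # (N, J, 4)
        return torch.einsum('nj,nja->na', skin_weights, per_joint)[:, :3]        # (N, 3)

    def render(self, coarse_feature_image):
        # Placeholder for splatting: refine an already-rasterized feature image
        return self.refiner(coarse_feature_image)

# Toy usage: 22 joints, one 256x256 feature render
avatar = GaussianAvatar()
pose = torch.eye(4).repeat(22, 1, 1)
weights = torch.softmax(torch.randn(20000, 22), dim=-1)
posed_xyz = avatar.deform(pose, weights)
frame = avatar.render(torch.randn(1, 32, 256, 256))
print(posed_xyz.shape, frame.shape)
```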
Small embodied-intelligence wins, claimed by high schoolers at Tencent!
量子位· 2025-08-20 10:21
Core Insights
- The article highlights the achievements of high school students during Tencent's 2025 Spark Challenge Week, focusing on advances in embodied intelligence and robotics [5][13][14]
- The event showcased various projects, including a robot capable of navigating indoor environments based on human instructions, which has potential applications in delivery services [3][19]
- The challenge also emphasized long text understanding, multi-modal perception, and quantum computing, indicating a strong focus on cutting-edge technology [13][26]

Group 1: Embodied Intelligence and Robotics
- A humanoid robot successfully completed the task of finding a soccer ball in an indoor environment based on human commands, demonstrating the potential for solving the "last 10 meters" delivery problem [3][19]
- The project faced challenges in adapting from virtual environments to real-world testing, particularly in coordinate-system conversion [17]
- Future improvements will include fine-tuning large language models (LLMs), enhancing the facial recognition system, and optimizing navigation capabilities [19]

Group 2: Long Text Understanding and Multi-Modal Perception
- A long-novel re-creation agent was developed, capable of style transfer, prologue creation, and text refinement, with a user interface that displays the AI's thought process [21]
- The multi-modal perception system for visually impaired users integrates image, voice, and text inputs to assist with obstacle avoidance and object retrieval [24]
- The system includes real-time distance measurement, voice interaction, and object-grabbing modules, enhancing usability in complex environments [24]

Group 3: Quantum Computing and Security
- The quantum technology segment involved designing a structured approach to quantum algorithm challenges, divided into multiple phases [26]
- In the security domain, an automated attack agent was created to make penetration testing more efficient, significantly reducing manual labor [26]

Group 4: Talent Development and Future Opportunities
- Out of 82 participants, 68 received offers from prestigious universities, indicating the high caliber of talent involved [27]
- The Spark Challenge Week is part of Tencent's broader initiative to nurture young talent in computer science, with opportunities for further industry engagement and project involvement [31][32]
- Previous participants have gone on to become influential figures in technology innovation, showcasing the program's long-term impact [32]
Unitree's 180 cm "ballet" robot: how good is it, really?
量子位· 2025-08-20 10:21
Core Viewpoint
- The article discusses Unitree's upcoming "Ballet Dancer" humanoid robot, highlighting its advanced features and its positioning within the humanoid robot market [2][4][13].

Group 1: Product Overview
- The "Ballet Dancer" is Unitree's fourth humanoid robot, following the H1, G1, and R1, and is expected to be released by the end of October [12][13].
- The robot stands 180 cm tall and has 31 degrees of freedom, a significant improvement in flexibility and movement capability over previous models [5][18][41].
- The design emphasizes a slender and elegant form, aligning with the company's branding of "agility" and "elegance" [16][20].

Group 2: Comparison with Previous Models
- Previous models include:
  - H1: released in August 2023, priced at 650,000 yuan, with 19 degrees of freedom.
  - G1: released in May 2024, priced at 99,000 yuan, with 23 degrees of freedom.
  - R1: scheduled for July 2025, priced at 39,900 yuan, with 24 degrees of freedom [14][15].
- The "Ballet Dancer" has 63% more degrees of freedom than the H1, enhancing its upper-limb movement capabilities [41][42].

Group 3: Market Positioning
- Unitree's humanoid robots are strategically segmented across applications, including industrial, research, and entertainment uses, catering to both B2B and B2C markets [27][28].
- The company aims to build a comprehensive product line of full-size and half-size robots at different price points, reflecting a "full-size + full-scenario + full-price" strategy [27].

Group 4: Industry Context
- The humanoid robot market is becoming increasingly competitive, with various companies introducing models of similar height and degrees of freedom [30][39].
- The "Ballet Dancer" sits in the second tier of humanoid robots, with its 31 degrees of freedom placing it alongside competitors such as XPeng's Iron and Zhiyuan's LingXi X2 [49][50].
Breaking the efficiency bottleneck of long-horizon agent reasoning! MIT and the National University of Singapore jointly release a new reinforcement-learning training method
量子位· 2025-08-20 10:21
Core Viewpoint
- The MEM1 framework, developed by MIT and the National University of Singapore, addresses the challenges AI agents face in managing complex tasks and memory efficiently, achieving significant improvements in inference speed and memory usage compared to traditional models [2][22].

Group 1: Framework Overview
- The MEM1 framework allows AI agents to autonomously manage their working memory and reasoning processes, akin to how humans organize their thoughts after a period of work [4][10].
- The framework introduces a near-constant memory usage model, significantly reducing the computational cost associated with increasing dialogue rounds [6][12].

Group 2: Performance Metrics
- The MEM1-7B model demonstrates 3.5 times faster inference than a traditional 14B model, while maintaining a peak token count roughly one-fourth of the latter's [2][3].
- In a complex 16-target task, MEM1 outperformed larger models and models with external memory modules in accuracy, context length, and inference speed [17][18].

Group 3: Training Methodology
- MEM1 employs an end-to-end reinforcement learning approach, using an attention-masking mechanism that lets the agent focus on relevant historical information while compressing it efficiently [12][22].
- The training process involves three key operations: extracting key information, integrating it with internal memory, and pruning redundant content [14][20]. A minimal hand-written stand-in for this loop is sketched below.

Group 4: Practical Applications
- The MEM1 framework has been tested in document-retrieval QA, open-domain web QA, and multi-round online shopping scenarios, showcasing its adaptability and effectiveness in real-world applications [19][20].

Group 5: Industry Implications
- The traditional industry approach has been to integrate external memory modules, which can be cumbersome and less effective; MEM1 suggests a shift toward self-managed memory learned through reinforcement learning [22].
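As a rough illustration of the extract-integrate-prune loop described above — not MEM1's learned policy, which is trained end-to-end with reinforcement learning — a hand-written consolidation step might look like this sketch; `call_llm`, the prompts, and the environment interface are hypothetical stand-ins.

```python
# Illustrative stand-in: an agent loop that, after each turn, compresses its
# working context into a short internal memory instead of appending the full
# history, keeping prompt size roughly constant across turns.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion API here")

def consolidate(memory: str, new_observation: str, max_chars: int = 2000) -> str:
    # Ask the model to (1) extract key facts from the new turn, (2) integrate
    # them with the existing memory, and (3) prune anything redundant.
    prompt = (
        "You maintain a compact working memory for a long-horizon task.\n"
        f"Current memory:\n{memory}\n\nNew information:\n{new_observation}\n\n"
        f"Rewrite the memory in at most {max_chars} characters, keeping only "
        "facts needed to finish the task."
    )
    return call_llm(prompt)[:max_chars]

def run_agent(task: str, environment_step, max_turns: int = 16) -> str:
    memory = f"Task: {task}"
    for _ in range(max_turns):
        # The prompt contains only the compressed memory, not the full dialogue,
        # so peak token usage stays near-constant as the number of turns grows.
        action = call_llm(f"{memory}\n\nDecide the next action (or final answer).")
        observation, done = environment_step(action)
        memory = consolidate(memory, f"Action: {action}\nResult: {observation}")
        if done:
            return action
    return memory
```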
The Black Myth universe opens! Feng Ji and Yang Qi hit it off and skip the DLC; new title "Zhong Kui" teaser shoots straight to No. 1 on trending
量子位· 2025-08-20 07:48
Core Viewpoint
- The article discusses the announcement of the new game "Black Myth: Zhong Kui", the second installment in the Black Myth series, highlighting its debut at the 2025 Cologne Game Show and the excitement it has generated among players and the gaming community [1][3][40].

Group 1: Game Announcement and Reception
- "Black Myth: Zhong Kui" is the second game in the Black Myth series; its first CG teaser was released at the 2025 Cologne Game Show [3][40].
- The teaser quickly gained popularity, reaching the top of Weibo's trending topics and accumulating over 8 million views shortly after its release [10][12].
- The game has drawn positive reactions from both domestic and international audiences, building on the success of its predecessor, "Black Myth: Wukong" [12][13].

Group 2: Game Development Insights
- The decision to develop "Black Myth: Zhong Kui" grew out of a conversation between CEO Feng Ji and art director Yang Qi, who chose to create a new game instead of a DLC for "Black Myth: Wukong" [30][32].
- The creative team aims to introduce new heroes, gameplay mechanics, visuals, technology, and stories in this new installment [32].
- The game's development began with a dream of a character riding a tiger, which inspired the design of Zhong Kui [24][25].

Group 3: Historical Context and Future Prospects
- The developer, Game Science, was founded in 2014 by Feng Ji and Yang Qi; "Black Myth: Wukong" was conceptualized in 2016 and officially started development in 2018 [34][36].
- "Black Myth: Wukong" received critical acclaim, winning the Best Visual Effects award at the Cologne Game Show in 2023, and was set to release on August 20, 2024 [37][38].
- The article expresses optimism for the future of "Black Myth: Zhong Kui", encouraging anticipation of its eventual release [42].
Hands-on with DeepSeek V3.1: more than just a longer context window
量子位· 2025-08-20 07:48
Core Insights
- The article compares DeepSeek V3.1 with its predecessor V3, highlighting improvements in programming performance, creative writing, translation quality, and response tone [2][6][40].

Group 1: Model Features
- DeepSeek V3.1 expands the context length to 128K tokens, whereas V3 tops out at 65K tokens [8][7].
- The new version supports multiple tensor formats, enhancing its usability across different platforms [1][6].
- The API for V3 still operates with a maximum context length of 65K tokens, underscoring the significance of the V3.1 upgrade [7][8].

Group 2: Performance Comparison
- In programming tasks, V3.1 took a more comprehensive approach, providing detailed code and usage instructions compared to V3 [12][13].
- For creative writing, V3.1 produced a more poetic and emotional response, in contrast to V3's straightforward style [20][18].
- Both versions solved a mathematical problem correctly, but their presentation differed, with V3.1 offering the clearer explanation [23][24].

Group 3: Translation Capabilities
- V3.1 showed improved understanding of complex sentences in translation tasks, although it missed translating some simple words [29][26].
- Translating a biology paper's abstract revealed V3.1's stronger handling of specialized terminology compared to V3 [28][27].

Group 4: Knowledge and Reasoning
- In a knowledge question about a specific fruit type, both versions identified it as a drupe, but V3.1's reasoning strayed off-topic [30][36].
- V3.1 scored 71.6% on the Aider benchmark, outperforming V3 and indicating its competitive edge in non-reasoning tasks [42][40].

Group 5: User Feedback and Market Response
- The release of V3.1 has generated significant interest, becoming a trending topic on social media platforms [40][41].
- Users have noted improvements in physical understanding and the introduction of new tokens, although some issues with the online API have been reported [45][49].
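For readers who want to reproduce these side-by-side tests, a minimal call through DeepSeek's OpenAI-compatible API might look like the sketch below; the base URL and model name are assumptions taken from DeepSeek's public documentation and may differ for V3.1, so check the current docs before relying on them.

```python
# Minimal sketch of calling DeepSeek via its OpenAI-compatible API to try the
# kinds of prompts tested above (code, writing, translation). Endpoint and model
# name are assumptions; the API key is a placeholder.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",        # placeholder
    base_url="https://api.deepseek.com",    # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",                  # assumed name of the non-thinking chat model
    messages=[
        {"role": "user", "content": "Write a short, poetic paragraph about autumn rain."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```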
DiT suddenly comes under fire; Xie Saining responds calmly
量子位· 2025-08-20 07:48
Core Viewpoint
- The article discusses recent criticism of DiT (Diffusion Transformers), a cornerstone model in the diffusion field, and highlights the importance of scientific scrutiny and empirical validation in research [3][10].

Group 1: Criticism of DiT
- A user raised multiple concerns about DiT, claiming it is flawed both mathematically and structurally, and even questioning whether DiT contains genuine Transformer elements [4][12].
- The criticisms are based on the paper "TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training", which introduces a strategy that passes early-layer tokens to deeper layers without modifying the architecture or adding parameters [12][14].
- The critic argues that the rapid decrease in FID (Fréchet Inception Distance) during training indicates that DiT's architecture has inherent properties that let it learn the dataset too easily [15].
- TREAD reportedly trains 14 times faster than DiT after 400,000 iterations and 37 times faster at its best performance after 7 million iterations; the critic takes such large gains as evidence that undermines the earlier method [16][17].
- The critic also suggests that disabling parts of the network during training should, in principle, render the network ineffective [19].
- It is noted that the more network units in DiT are replaced with identity mappings during training, the better the model's evaluation results [20].
- DiT's architecture is said to require logarithmic scaling to represent the signal-to-noise-ratio differences across the diffusion process, indicating potential issues with its output dynamics [23].
- Concerns are raised about the adaptive layer normalization method, suggesting that DiT processes conditional inputs through a standard MLP (multi-layer perceptron) without clear Transformer characteristics [25][26].

Group 2: Response from Xie Saining
- Xie Saining, an author of DiT, responded to the criticism, asserting that the TREAD findings do not invalidate DiT [27].
- He acknowledges TREAD's contribution but emphasizes that its effectiveness comes from regularization enhancing feature robustness, not from DiT being incorrect [28].
- Xie notes that Lightning DiT, an upgraded version of DiT, remains a powerful option and should be prioritized when conditions allow [29].
- He also states that there is no evidence the post-layer normalization in DiT causes issues [30].
- Xie summarizes improvements made over the past year, focusing on internal representation learning and various methods for enhancing model training [32].
- He points out that the sd-vae (the Stable Diffusion VAE used to encode images into latents) is the bigger concern for DiT, particularly its high computational cost when processing 256×256 images [34].
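For context on the adaptive layer normalization point, the sketch below shows adaLN-style modulation as commonly described for DiT blocks: the conditioning vector (timestep/class embedding) passes through a small MLP that emits per-block shift, scale, and gate terms applied around the attention and MLP sublayers. Layer sizes and names here are illustrative, not the exact configuration of the original DiT implementation.

```python
# Minimal sketch of adaLN-Zero-style conditioning in a DiT-like block.
# Shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Conditioning MLP: emits 6 modulation vectors (shift/scale/gate for attn and mlp)
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, cond):
        # x: (B, N, dim) token sequence; cond: (B, dim) timestep/class embedding
        shift1, scale1, gate1, shift2, scale2, gate2 = self.modulation(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x

block = AdaLNBlock()
tokens = torch.randn(2, 256, 384)   # e.g. 16x16 latent patches
cond = torch.randn(2, 384)
print(block(tokens, cond).shape)    # torch.Size([2, 256, 384])
```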
Post-2000 Chinese woman drops out of MIT to start a company, has already raised 150 million yuan
量子位· 2025-08-20 04:33
Core Viewpoint
- The article highlights the rise of AI startups led by the post-2000 generation, focusing on Jessica Wu's company, Sola Solutions, which has raised $21 million in funding and aims to revolutionize robotic process automation (RPA) through AI [1][5][19].

Company Overview
- Sola Solutions was founded in 2023 by Jessica Wu and Neil Deshmukh, both of whom dropped out of MIT to pursue the venture [6][9][33].
- The company positions itself as a "Copilot" for the RPA space, using large language models (LLMs) and computer vision to help clients automate complex repetitive tasks [11][17].

Funding and Growth
- Sola Solutions has raised a total of $21 million: $3.5 million in a seed round led by Conviction and $17.5 million in a Series A round led by Andreessen Horowitz [19][20].
- Since the beginning of the year, Sola's revenue has grown fivefold and its workflow volume has doubled [16].

Target Market and Applications
- The company serves a diverse range of industries, including financial services, legal, insurance, and healthcare, and counts Fortune 100 companies among its clients [17][18].
- Sola's technology lets users record their operational processes and automatically generates robot scripts for data extraction and validation, with no programming skills required [13][14].

Leadership and Expertise
- Jessica Wu brings a blend of experience in mathematics, computer science, and finance, having previously worked in quantitative research and founded a clothing design company [6][30][32].
- Neil Deshmukh focuses on the technical side, with a background in computer vision and AI innovation [34][37].

Industry Context
- Sola Solutions emerges amid a global trend of increased investment in back-office automation, with AI software services potentially reducing workloads by 20% to 40% in traditional industries [37].
- The article notes a broader wave of successful AI startups founded by young entrepreneurs, particularly dropouts from prestigious institutions like MIT [38][39].
Home-grown AI routing system pulls off an open-source comeback! Matches Gemini-2.5-Pro performance at just 19% of the cost
量子位· 2025-08-20 04:33
Core Viewpoint
- The article discusses the launch of Avengers-Pro, a multi-model scheduling and routing solution that balances performance and cost for users of large models, making advanced AI capabilities more accessible [3][12].

Group 1: Performance and Cost Efficiency
- Avengers-Pro integrates eight leading large models and achieves superior performance on six challenging datasets, surpassing GPT-5-medium by 7% and Gemini-2.5-Pro by 19% [5].
- The solution cuts costs by 27% while matching GPT-5-medium's performance, and needs only 19% of the cost to match Gemini-2.5-Pro [5][20].
- Avengers-Pro achieves Pareto optimality, delivering the highest accuracy at any given cost level and the lowest cost for any specified accuracy target [5][23].

Group 2: Technical Mechanism
- The core mechanism embeds and clusters user requests, then dynamically matches each task to the most suitable model [15][25].
- The framework consists of three main steps: embedding user requests into high-dimensional vectors, clustering similar tasks, and scoring models based on performance-cost evaluations [16][25].
- A tunable parameter α lets the system switch flexibly between performance optimization and cost optimization, catering to diverse application needs [17][30]. A minimal routing sketch is shown below.

Group 3: Competitive Landscape
- Avengers-Pro outperforms any single model in its pool, achieving an average accuracy of 0.66 compared with GPT-5-medium's 0.62 [20].
- The solution delivers significant cost savings while maintaining performance, demonstrating its effectiveness in the current large-model ecosystem [30][32].
- The intelligent-routing approach is expected to lead to further breakthroughs in large-model applications [32].
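A minimal sketch of the embed-cluster-score routing idea follows; the scoring formula, embedding choice, and all cluster statistics are illustrative assumptions rather than the released implementation.

```python
# Illustrative sketch of performance-cost routing: embed the query, find the
# nearest cluster of similar past tasks, and pick the model with the best
# alpha-weighted accuracy/cost score for that cluster.
import numpy as np

def route(query_embedding: np.ndarray,
          cluster_centers: np.ndarray,   # (K, d) centroids of past queries
          perf: np.ndarray,              # (K, M) accuracy of model m on cluster k, in [0, 1]
          cost: np.ndarray,              # (K, M) normalized cost of model m on cluster k, in [0, 1]
          alpha: float = 0.5) -> int:
    """Return the index of the model to call for this query."""
    # 1) Nearest cluster by cosine similarity
    q = query_embedding / np.linalg.norm(query_embedding)
    c = cluster_centers / np.linalg.norm(cluster_centers, axis=1, keepdims=True)
    k = int(np.argmax(c @ q))
    # 2) Performance-cost score; alpha -> 1 favors accuracy, alpha -> 0 favors cheapness
    scores = alpha * perf[k] - (1 - alpha) * cost[k]
    return int(np.argmax(scores))

# Toy usage: 3 clusters, 2 candidate models, 8-dim embeddings
rng = np.random.default_rng(0)
model_idx = route(rng.normal(size=8), rng.normal(size=(3, 8)),
                  perf=np.array([[0.9, 0.6], [0.5, 0.7], [0.8, 0.8]]),
                  cost=np.array([[0.9, 0.2], [0.8, 0.3], [0.7, 0.4]]),
                  alpha=0.3)
print("route to model", model_idx)
```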
Impressive: Zhipu builds the world's first general-purpose mobile agent! Free for everyone, and the app can even directly control a cloud computer
量子位· 2025-08-20 04:33
Core Viewpoint
- The article introduces AutoGLM, billed as the world's first general-purpose mobile agent, developed by Zhipu AI; it lets users carry out tasks on their phones through voice commands, significantly enhancing convenience and intelligence [5][6][9].

Group 1: Product Features
- AutoGLM operates in the cloud, enabling seamless task execution without affecting the performance of other applications on the user's device [9][33].
- The agent handles tasks grouped into "lifestyle assistant" and "office assistant" categories, allowing users to interact with it as if they were using a normal smartphone [11][15].
- Users can initiate complex tasks, such as comparing prices across multiple e-commerce platforms, with minimal input required [19][20].

Group 2: Technological Advancements
- AutoGLM represents a significant upgrade over traditional chatbots by executing tasks autonomously rather than merely providing instructions [31].
- The cloud-execution model offloads work from the local device, ensuring that users can continue using their devices without interruption [36][37].
- The integrated cloud computer lets AutoGLM perform high-complexity tasks that local devices may struggle with due to limited processing power [36][41].

Group 3: Industry Implications
- The launch of AutoGLM aligns with a growing industry trend toward cloud-based agents, as seen with other major players such as Alibaba Cloud [38][40].
- The product validates the feasibility and reliability of cloud execution for agents, potentially setting a new standard for future development [53][54].
- AutoGLM's capabilities reflect a shift in how users interact with machines, moving from simple communication to direct task execution [55][56].
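As a conceptual illustration of how such a cloud-executed agent loop can be organized — not Zhipu's actual implementation — the sketch below shows an observe-decide-act cycle on a remote device; every class and function name here is hypothetical.

```python
# Conceptual sketch of a cloud-executed GUI agent loop: a remote device exposes
# "observe" and "act" primitives, and a model picks the next action until done.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "tap", "type", "scroll", "finish"
    argument: str = ""

class CloudDevice:
    """Stand-in for a remote phone/PC the agent controls; replace with a real API."""
    def observe(self) -> str:
        return "<ui snapshot placeholder>"
    def act(self, action: Action) -> None:
        print(f"executing {action.kind}({action.argument}) on the cloud device")

def decide(task: str, observation: str) -> Action:
    # Placeholder for the multimodal model call mapping (task, screen) -> action.
    return Action("finish", "task complete (placeholder decision)")

def run(task: str, device: CloudDevice, max_steps: int = 20) -> None:
    # Because everything runs on the cloud device, the user's own phone stays free.
    for _ in range(max_steps):
        action = decide(task, device.observe())
        if action.kind == "finish":
            print(action.argument)
            return
        device.act(action)

run("compare prices for the same headphones across shopping apps", CloudDevice())
```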