Nano Banana adds two major features and opens its API; one image costs less than 0.3 yuan
量子位· 2025-10-03 04:19
Core Insights
- Nano Banana has officially opened its API, allowing developers to integrate it into their products and enabling large-scale content production for enterprises [9][10]
- API pricing works out to approximately $0.039 per output image, or about 0.28 yuan, based on a rate of $30 per 1 million image output tokens (a quick arithmetic check appears at the end of this summary) [2][15][16]
- Google has introduced two new features, customizable aspect ratios and a pure image generation mode, enhancing the tool's utility for content creators [3][8]

Pricing and Cost Structure
- Each generated image costs about $0.039 (approximately 0.28 yuan); the maximum image size is 1024x1024 pixels, consuming around 1290 tokens [16]
- Image generation is priced at 12 times the rate of Gemini 2.5 Flash text output [17]

New Features
- The first new feature lets users customize aspect ratios, with more than ten options including 16:9, 9:16, 4:3, and 3:2, covering a wide range of visual content needs [4][18]
- The second feature is a pure image output mode that returns only images without accompanying text, saving tokens and reducing contextual interference; it is well suited to real-time previews and e-commerce displays [7][8]

Application and Usability
- Users can build their own applications directly in Google AI Studio by entering prompts, making the tool accessible to non-developers [13][14]
- The new features target the practical needs of content creators, positioning Nano Banana as a more practical tool [8]
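As a minimal sketch of the per-image cost quoted above: the dollar rate and token count per image come from the article, while the yuan conversion uses an assumed exchange rate.

```python
# Back-of-the-envelope cost for one Nano Banana image, using the figures
# quoted above: $30 per 1M output tokens and ~1290 tokens per 1024x1024 image.
PRICE_PER_MILLION_TOKENS_USD = 30.0
TOKENS_PER_IMAGE = 1290            # approximate, for a 1024x1024 output
USD_TO_CNY = 7.15                  # assumed exchange rate, not from the article

cost_usd = PRICE_PER_MILLION_TOKENS_USD * TOKENS_PER_IMAGE / 1_000_000
print(f"~${cost_usd:.3f} per image")           # ~$0.039
print(f"~{cost_usd * USD_TO_CNY:.2f} yuan")    # ~0.28 yuan
```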
Dual SOTA in segmentation and understanding with two simple modules! Xiang Bai's team at Huazhong University of Science and Technology and collaborators release a new multimodal framework
量子位· 2025-10-03 04:19
Core Insights
- The article discusses the evolution of multimodal large models from text-to-image generation to pixel-level tasks such as image segmentation, highlighting the challenges of imprecise segmentation results and hallucinations during understanding [1][2]

Group 1: Model Development
- Research teams from Huazhong University of Science and Technology and Kingsoft Office proposed two core modules, the Semantic Enhanced Feature Extractor (SEFE) and Interleaved Local Visual Coupling (ILVC), to address segmentation accuracy and hallucination issues [3][24]
- SEFE improves object attribute reasoning by integrating semantic features with pixel-level features, leading to more precise segmentation results (a schematic sketch appears at the end of this summary) [4][25]
- ILVC provides fine-grained supervision by generating local descriptions from segmentation masks, effectively reducing hallucinations [5][26]

Group 2: Model Performance
- The resulting multimodal large model, LIRA, achieved state-of-the-art (SOTA) performance on both segmentation and understanding tasks [6]
- Compared with InternVL2, LIRA maintains understanding performance while additionally supporting image segmentation; it improves on OMG-LLaVA by an average of 8.5% on segmentation tasks and by 33.2% on MMBench [7]

Group 3: Experimental Results
- LIRA showed superior performance across multiple understanding and segmentation datasets, with a performance drop of only 0.2% when jointly trained on both understanding and segmentation data [40]
- Integrating SEFE and ILVC reduced hallucination rates by 3.0% and 4.8% for the 1.8B and 7B model sizes, respectively [38]

Group 4: Future Directions
- The article suggests that future research should explore the relationship between text and visual tokens, which may offer new insights for improving the understanding and segmentation capabilities of multimodal large models [43]
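To make the SEFE idea concrete, here is a minimal, hypothetical sketch of fusing a pooled semantic embedding with pixel-level feature maps before segmentation decoding; the layer layout, class name, and dimensions are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class SemanticEnhancedFusion(nn.Module):
    """Illustrative only: combine a global semantic embedding with
    pixel-level features so the segmentation head sees both."""
    def __init__(self, sem_dim: int = 1024, pix_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, out_dim)      # project semantic vector
        self.pix_proj = nn.Conv2d(pix_dim, out_dim, 1)   # align pixel channels
        self.fuse = nn.Conv2d(out_dim * 2, out_dim, 1)   # merge the two streams

    def forward(self, sem_feat: torch.Tensor, pix_feat: torch.Tensor) -> torch.Tensor:
        # sem_feat: (B, sem_dim) pooled semantic embedding
        # pix_feat: (B, pix_dim, H, W) dense visual features
        b, _, h, w = pix_feat.shape
        sem = self.sem_proj(sem_feat).view(b, -1, 1, 1).expand(-1, -1, h, w)
        return self.fuse(torch.cat([sem, self.pix_proj(pix_feat)], dim=1))

# Example: fused = SemanticEnhancedFusion()(torch.randn(2, 1024), torch.randn(2, 256, 64, 64))
```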
The 2025 Artificial Intelligence Annual Awards open for entries! Five award categories across three dimensions, in search of the leaders of the AI+ era
量子位· 2025-10-03 04:19
Organizing Committee, reporting from Aofei Temple
量子位 | WeChat official account QbitAI

To let more practitioners feel the leap of the intelligence wave, and to offer applause and encouragement to more peers on the same road, we are officially opening registration for the "2025 Artificial Intelligence Annual Awards".

This is the 8th year of QbitAI's annual artificial intelligence awards. Over these eight years we have witnessed technological breakthroughs and real-world deployment, industrial convergence and reshaping, and wave after wave of companies, people, and products driving the era forward.

In an era where artificial intelligence is redefining everything, intelligent technology is no longer a standalone tool but a driving force behind the co-evolution of industry and society. Through this annual selection we hope to discover and pay tribute to the explorers and practitioners who truly lead change and push boundaries.

The selection covers three dimensions (companies, products, and people) with five award categories. Companies are warmly invited to register! Let us witness the stars of the year together and light the way forward.

Company list
- 2025 AI Annual Leading Enterprise
- 2025 AI Annual Promising Startup

Product list
- 2025 AI Annual Outstanding Product
- 2025 AI Annual Outstanding Solution

People list
- 2025 AI Annual Focus Figure

Detailed selection criteria and registration instructions are as follows. The Leading Enterprise award will recognize the companies with the strongest overall capability in China's artificial intelligence field. Eligibility: Selection criteria: Focusing on China's ...
LeCun won't put up with it any longer! He admits he intends to resign
量子位· 2025-10-03 04:19
Core Viewpoint
- Yann LeCun, a Turing Award winner and a key figure in AI at Meta, is reportedly considering resigning from his position as Chief Scientist of FAIR due to dissatisfaction with recent organizational changes in Meta's AI department [1][2][3]

Group 1: Organizational Changes and Impact
- Recent months have seen significant organizational turmoil within Meta's AI division, fueling LeCun's growing frustration [3][9]
- A new policy requires additional review from the TBD Lab before FAIR can publish research papers, which LeCun views as a direct challenge to academic freedom [5][7][21]
- Meta has reorganized its AI department four times in just six months, creating instability and confusion among researchers [15][17]

Group 2: Personal Impact on LeCun
- LeCun has reportedly been demoted within the internal power structure; the appointment of a new chief scientist for the MSL Lab effectively reduces his influence [18][20]
- The new review requirement further restricts LeCun's ability to publish and share his work, which has been a core part of FAIR's mission for the past 12 years [23][25]

Group 3: Team Morale and Internal Tensions
- The new policies have led to widespread disappointment within the FAIR team, with some members feeling their academic freedom has been severely limited [27][28]
- Tensions are rising between long-standing employees and new hires, as Meta has aggressively recruited top talent from competitors, creating disparities in resources and treatment [30][34]
- Reports describe a highly competitive and stressful work environment, with a culture of "territorial disputes" emerging within the AI departments [34][35]

Group 4: Broader Implications for Meta
- The internal strife is not limited to the AI teams; employees in other departments, such as Reality Labs, have expressed dissatisfaction with the AI division's direction and management [38]
- The recent launch of Meta AI's new feature "Vibes" has not performed well in the market, further highlighting the challenges the company faces in maintaining its competitive edge [42][43]
New work from Stanford's dishwashing-robot team! A dexterous hand learns tea picking and breakfast making from humans, nominated for best paper at CoRL 2025
量子位· 2025-10-02 05:30
Core Viewpoint
- The article discusses DexUMI, a data collection and policy learning framework that enables robots to learn dexterous tasks from human demonstration, significantly improving data collection efficiency and task success rates [2][35]

Group 1: DexUMI Framework
- DexUMI uses human hands as a natural interface to transfer dexterous skills to various robotic hands, minimizing the embodiment gap between human and robotic manipulation [2][17]
- The framework achieves an average task success rate of 86% across multiple tasks and improves data collection efficiency by 3.2x compared with traditional teleoperation [7][32]

Group 2: Hardware and Software Innovations
- The hardware component is a wearable exoskeleton designed for each type of dexterous hand, with parameters optimized to match human hand motion while remaining wearable [18]
- The software component is a data processing pipeline that ensures visual consistency between human demonstrations and robot deployments, which is crucial for effective skill transfer [22][32]

Group 3: Testing and Results
- DexUMI was tested on two dexterous-hand platforms, achieving high success rates on complex tasks such as opening egg cartons and performing tea ceremonies [32][33]
- The Inspire Hand and XHAND 1 were evaluated, with XHAND 1 performing better thanks to its fully actuated design and advanced tactile sensing [33][39]

Group 4: Future Implications
- The research establishes a new paradigm for efficient data collection and policy learning, potentially leading to a community for data sharing among researchers and industry players and advancing dexterous robotic applications [39][41]
Sora2 can even predict ChatGPT's output
量子位· 2025-10-02 05:30
Core Insights
- Sora2 demonstrates advanced capabilities in predicting ChatGPT outputs and rendering HTML, blurring the line between video generation and interactive AI [2][6]
- The system can simulate interactions, generating audio responses in a ChatGPT-like manner and producing coherent, contextually relevant content [4][5]
- Sora2 exhibits a strong grasp of physical phenomena such as light refraction without explicit prompting, indicating a high level of intelligence and information-processing ability [14][18]

Group 1: Sora2's Capabilities
- Sora2 can generate interactive content, including video scenes and audio responses, effectively simulating a conversation with ChatGPT [4][6]
- The system successfully rendered HTML code, producing results that closely match what a real browser would display [7][12]
- Sora2's ability to understand and simulate physical concepts, such as refraction through glass, was demonstrated in a practical test that impressed users with its accuracy [15][18]

Group 2: Game Simulation and Information Processing
- Sora2 accurately recreated elements from the game "Cyberpunk 2077", including map locations, terrain, and vehicle designs, showing its ability to extract and integrate key information [21][25]
- Despite minor inaccuracies, Sora2's performance in simulating a side quest reflects advanced information processing and an understanding of complex scenarios [24][25]
- There is speculation that Sora2's high-level performance may rest on training with large language models (LLMs), hinting at further undiscovered capabilities [26][27]
The company founded by Murati, Lilian Weng, and Danqi Chen releases its first product, drastically lowering the barrier to fine-tuning large models, and aims to reinvent OpenAI
量子位· 2025-10-02 03:26
Core Insights
- Thinking Machines Lab has launched its first product, Tinker, which simplifies model fine-tuning to the level of modifying Python code [1][12]
- The company has moved past its "zero product, zero revenue" valuation of 84 billion yuan [2]

Product Overview
- Tinker is a flexible API for fine-tuning language models, letting researchers control algorithms and data without managing infrastructure [12][13]
- Initial support covers the Qwen3 and Llama3 model series; switching between small and large models takes only a string change in the Python code [15]
- Tinker's API automates low-level training steps while handling scheduling, scaling, and error recovery [17]

Technical Features
- Tinker uses LoRA to allow multiple training tasks to share the same GPU, reducing cost and enabling more parallel experiments [22]
- Tinker's gradient update rule is: new parameters = original parameters + learning rate × advantage value × gradient of the log probability [28]; a minimal sketch of this step appears at the end of this summary

Industry Reception
- Tinker has drawn significant attention, with beta testers noting that it strikes an excellent balance between abstraction and tunability compared with other fine-tuning tools [30]
- Research teams from prestigious institutions have already achieved notable results using Tinker [30]

Strategic Vision
- Thinking Machines Lab aims to reinvent a version of OpenAI that emphasizes open research sharing and greater freedom for researchers [10][11]
- The company's mission is to make cutting-edge models easier to customize to individual needs [14]
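As a minimal, self-contained sketch of the update rule quoted above (not Tinker's actual API; the tiny network, the advantage value, and the learning rate are placeholder assumptions), a single policy-gradient step in PyTorch could look like this:

```python
import torch

# One policy-gradient step implementing:
#   new_params = params + learning_rate * advantage * grad(log pi(a | s))
# The policy network, state, action, and advantage below are toy placeholders.
policy = torch.nn.Linear(16, 4)       # stand-in policy over 4 actions
state = torch.randn(1, 16)
action = 2                            # action whose log-probability we reinforce
advantage = 1.7                       # assumed to come from the RL pipeline
learning_rate = 1e-3

log_probs = torch.log_softmax(policy(state), dim=-1)
log_probs[0, action].backward()       # fills p.grad with d(log pi)/d(theta)

with torch.no_grad():
    for p in policy.parameters():
        p += learning_rate * advantage * p.grad   # ascent step from the rule above
        p.grad = None
```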
NVIDIA open-sources a raft of robotics technologies in one go, including the physics engine developed with Disney
量子位· 2025-10-02 03:26
Core Viewpoint
- NVIDIA has released multiple open-source robotics technologies, including the Newton physics engine, which strengthens robots' physical intuition and reasoning and addresses key challenges in robot development [1][4][10]

Group 1: Newton Physics Engine
- The Newton physics engine targets the challenge of transferring skills learned in simulation to the real world, particularly for humanoid robots with complex joint structures [4]
- It is an open-source project managed by the Linux Foundation, built on NVIDIA's Warp and OpenUSD frameworks and using GPU acceleration to simulate intricate robot movements [4]
- Leading institutions such as ETH Zurich and Peking University have already begun using the Newton engine, indicating adoption by top-tier robotics companies and universities [4][3]

Group 2: Isaac GR00T N1.6 Model
- The Isaac GR00T N1.6 model integrates the Cosmos Reason vision-language model, enabling robots to understand and execute vague commands, a long-standing challenge in the industry [5][6]
- The model lets robots turn ambiguous instructions into actionable plans while moving and manipulating objects at the same time [6]
- The Cosmos Reason model has surpassed 1 million downloads, and the accompanying open-source physical AI dataset has exceeded 4.8 million downloads, showing its popularity and utility [6]

Group 3: Training Innovations
- The Isaac Lab 2.3 developer preview introduces a new workflow for teaching robots to grasp objects, using an "automated curriculum" that gradually increases task difficulty (a toy sketch appears at the end of this summary) [8]
- This approach has been used by Boston Dynamics' Atlas robot to improve its manipulation capabilities [8]
- NVIDIA has worked with partners to develop the Isaac Lab Arena, a framework for large-scale experiments and standardized testing that streamlines evaluation for developers [8]

Group 4: Hardware Infrastructure
- NVIDIA has also invested in hardware, including the GB200 NVL72 system, which integrates 36 Grace CPUs and 72 Blackwell GPUs and has already been adopted by major cloud service providers [9]
- Jetson Thor, equipped with Blackwell GPUs, supports multiple AI workflows for real-time intelligent interaction, with several partners already using the platform [9]
- Nearly half of the papers presented at CoRL referenced NVIDIA technologies, highlighting the company's influence in the robotics research community [9]

Group 5: Comprehensive Strategy
- NVIDIA's full-stack approach, spanning the open-source physics engine, foundation models, training workflows, and hardware infrastructure, is redefining the landscape of robotics development [10]
- These advances suggest that robots may enter everyday life sooner than expected [11]
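A toy sketch of the kind of automated curriculum mentioned above: difficulty is promoted only when the recent success rate clears a threshold. The class, thresholds, and window size are illustrative assumptions and are unrelated to Isaac Lab's actual implementation.

```python
from collections import deque

class AutoCurriculum:
    """Promote task difficulty once the rolling success rate is high enough."""
    def __init__(self, max_level: int = 10, window: int = 100, promote_at: float = 0.8):
        self.level = 0
        self.max_level = max_level
        self.promote_at = promote_at
        self.recent = deque(maxlen=window)   # rolling window of episode outcomes

    def record(self, success: bool) -> int:
        """Log one episode outcome and return the (possibly promoted) level."""
        self.recent.append(success)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) >= self.promote_at:
            self.level = min(self.level + 1, self.max_level)
            self.recent.clear()              # restart the window at the new level
        return self.level

# Usage: curriculum = AutoCurriculum(); level = curriculum.record(success=True)
```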
A video of a robot that "can't be kicked over" goes viral! Space capsules spread across city streets as Galaxy General shows off its moves
量子位· 2025-10-02 02:12
Core Viewpoint
- The article discusses advances in robotics, focusing on the Any2Track framework from Galaxy General Robotics (银河通用), which lets robots accurately imitate human movements while staying stable in real-world environments [2][7][29]

Group 1: Any2Track Framework
- Any2Track is a two-stage reinforcement learning framework that balances precise motion imitation with disturbance resistance, overcoming the difficulty of achieving both generality and adaptability in robot motion [7][8][12]
- The framework has two main components: AnyTracker, which handles general motion tracking, and AnyAdapter, which enables dynamic adaptation to environmental changes [10][17][28]
- Experiments show that Any2Track significantly outperforms traditional methods in motion-tracking accuracy and in adaptability under various disturbances [30][32][36]

Group 2: Practical Applications
- Galaxy General Robotics has developed several end-to-end embodied models, such as GraspVLA and TrackVLA, with significant breakthroughs in core tasks like precise manipulation and navigation [38][50]
- The "Galaxy Space Capsule" serves as a platform for deploying these technologies in real-world scenarios, expanding service capabilities in urban environments [40][50]
- The company aims to bring intelligent robots into everyday life, with applications ranging from retail to tourism, presenting humanoid robots as a new technological hallmark for China [59][60]

Group 3: Technological Innovations
- The company's data paradigm prioritizes synthetic data generation complemented by real data, addressing the scarcity of real-world data in embodied intelligence [54][58]
- This approach enables fast, low-cost production of high-quality data, accelerating robot training and deployment across diverse scenarios [55][56]
- The overarching goal is to enable humanoid robots to perform complex tasks across industries, improving productivity and service quality [58][59]
ByteDance Seed releases PXDesign: protein design efficiency improved tenfold, entering a new practical stage
量子位· 2025-10-01 03:03
Core Insights
- The article discusses advances in AI protein design, particularly the PXDesign method introduced by ByteDance's Seed team, which significantly improves the efficiency and success rates of protein design tasks [1][3][10]

Summary by Sections

Introduction to PXDesign
- PXDesign is a scalable protein design method that can generate hundreds of high-quality candidate proteins within 24 hours, with a generation efficiency roughly 10 times that of mainstream methods [3][10]
- The method achieves wet-lab success rates of 20%–73% across multiple targets, exceeding the 9%–33% reported for existing models such as DeepMind's AlphaProteo [3][10]

Background and Significance
- Proteins are fundamental to life processes, and recent Nobel Prizes in Chemistry highlight the importance of both protein structure prediction and protein design [6]
- The challenge lies not only in predicting structures but also in designing proteins in reverse from functional requirements, which is crucial for developing new therapies for diseases such as cancer and infections [7][8]

Methodology of PXDesign
- PXDesign follows a "generation + filtering" approach: a large number of candidate designs are generated quickly, then filtered to identify the most promising ones (a schematic sketch appears at the end of this summary) [13][21]
- The team explored two main technical routes, Hallucination and Diffusion, with PXDesign-d (the Diffusion variant) producing higher-quality and more diverse structures [15][16]

Advantages of PXDesign
- PXDesign-d uses a DiT network architecture, allowing efficient training on larger datasets and improving generation speed and quality over other methods [17]
- The filtering step uses structure-prediction models to select the most viable candidates, with Protenix outperforming AlphaFold 2 in accuracy and efficiency [25][26]

Tools and Services
- The Protenix team has built the PXDesign Server, a user-friendly web service that lets researchers design and evaluate binder candidates without complex setup [28][29]
- The server offers two modes, Preview for quick debugging and Extended for in-depth research, significantly shortening the design cycle compared with traditional methods [30][32]

Evaluation Standards
- To address the lack of unified evaluation standards in the field, the Protenix team introduced PXDesignBench, a comprehensive evaluation toolbox that integrates various assessment metrics and procedures [32]

Industry Context
- Other tech giants such as Microsoft and Apple are also making moves in the biology space, reflecting a broader trend of AI applications in biotechnology and pharmaceuticals [33]
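A minimal sketch of a generic "generate + filter" binder-design loop, assuming a generative model that proposes candidate sequences and a structure predictor that scores them; the stand-in functions, sequence length, and thresholds are illustrative only and do not reflect PXDesign's internals.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_candidate(length: int = 60) -> str:
    """Stand-in generator: a random amino-acid sequence."""
    return "".join(random.choices(AMINO_ACIDS, k=length))

def predicted_confidence(sequence: str) -> float:
    """Stand-in for a structure-prediction confidence score in [0, 1]."""
    return random.random()

def design_binders(n_candidates: int = 500, keep_top: int = 20) -> list[str]:
    candidates = [propose_candidate() for _ in range(n_candidates)]     # generation
    ranked = sorted(candidates, key=predicted_confidence, reverse=True)  # filtering
    return ranked[:keep_top]

if __name__ == "__main__":
    shortlist = design_binders()
    print(f"Kept {len(shortlist)} candidates for wet-lab validation")
```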