Nano Banana Pro Hands-On Test: We Had a Blast
机器之心· 2025-11-21 10:17
Core Insights
- The article discusses the capabilities of the newly released AI tool, Nano Banana Pro, particularly in generating images and understanding complex prompts related to engineering structures like the Huajiang Canyon Bridge [4][12][13].

Group 1: AI Capabilities
- Nano Banana Pro demonstrated exceptional control and accuracy in generating images based on detailed prompts, including the ability to incorporate specific logos and contextual information from the internet [10][12].
- The AI was tested with challenging scenarios, such as transforming a night image of the Huajiang Canyon Bridge into a daytime scene, showcasing its ability to maintain detail and realism [16][19].
- The model's performance was further evaluated by asking it to describe the bridge's structure and principles, where it successfully identified and labeled various components, although some minor inaccuracies were noted [24][27].

Group 2: Testing Challenges
- The AI faced increased difficulty when tasked with generating detailed blueprints and technical illustrations of the bridge, revealing some limitations in accurately placing data markers [32][33].
- Despite some errors, Nano Banana Pro was able to provide a general understanding of the construction process, indicating its potential as an educational tool [36][33].

Group 3: User Experience
- The AI's ability to understand prompts in Chinese and generate high-quality results on the first attempt was highlighted as a significant advantage for users [36][37].
- The article also included lighter content, showcasing the AI's versatility in generating fun and creative images, such as transforming characters into different settings [50][64].
Forget Tree Models! Tackling Structured Data Head-On, a Tsinghua Team Pushes Large-Model Table Understanding to the Limit
机器之心· 2025-11-21 04:50
Core Insights
- The article discusses the significance of structured data processing in the context of AI advancements, particularly highlighting the introduction of the LimiX model, which represents a paradigm shift in handling structured data [2][31][35].

Group 1: LimiX Model Introduction
- LimiX is a groundbreaking model that successfully integrates structured data processing into the era of large models, achieving what previous models could not [3][12][31].
- It is capable of performing multiple tasks such as classification, regression, missing value imputation, and causal inference without the need for retraining [12][22] (an illustrative sketch of this in-context interface follows the summary).

Group 2: Performance and Benchmarking
- LimiX-16M has demonstrated superior performance across various benchmarks, outperforming traditional models like XGBoost and CatBoost and achieving the best results on 58.6% of datasets [13][15].
- In regression tasks, LimiX models secured the top two positions, with a combined win rate of 62% [15].
- LimiX excels in missing value imputation, achieving state-of-the-art results in this area [18].

Group 3: Real-World Applications
- The model has been successfully implemented in industrial settings, such as food production, where it predicts complex relationships between process parameters and product quality, reducing average deviation to less than 9% [21].
- In the electricity market, LimiX reduced an internal model's error from 46.93% MAPE to 25.27% MAPE, showcasing its practical utility [21].

Group 4: Accessibility and Community Engagement
- LimiX-2M, a lightweight version of the model, has been made open-source, allowing researchers and small teams to utilize it effectively [22][29].
- The model's community is active, with quick responses on GitHub, facilitating user engagement and support [30].

Group 5: Future Implications
- The introduction of LimiX signifies a shift towards a new paradigm in AI, emphasizing the importance of structured data in industrial applications [31][34].
- The model's success positions China at the forefront of structured data modeling, with potential global implications for industrial AI [35][36].
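The article does not show how LimiX is called, but the "no retraining" property above matches the in-context style of tabular foundation models, where fitting only caches the labelled table and prediction is a single forward pass. The snippet below is purely illustrative: the class and the predict_in_context callback are hypothetical stand-ins, not the released LimiX API.

```python
# Purely illustrative sketch of an in-context tabular predictor.
# Nothing here is the actual LimiX interface; the predict_in_context callback
# stands in for one forward pass of a pretrained tabular foundation model.
import numpy as np

class InContextTabularRegressor:
    """'Fitting' only stores the labelled table; no gradient updates happen."""

    def __init__(self, predict_in_context):
        # predict_in_context(X_train, y_train, X_test) -> y_pred
        self.predict_in_context = predict_in_context

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.X_ctx, self.y_ctx = X, y   # cache the context, nothing is retrained
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.predict_in_context(self.X_ctx, self.y_ctx, X)

# Trivial stand-in "model" (nearest-neighbour lookup) so the sketch runs end to end:
def toy_model(X_ctx, y_ctx, X_new):
    idx = np.argmin(((X_new[:, None, :] - X_ctx[None, :, :]) ** 2).sum(-1), axis=1)
    return y_ctx[idx]

reg = InContextTabularRegressor(toy_model).fit(np.random.rand(100, 8), np.random.rand(100))
print(reg.predict(np.random.rand(5, 8)))
```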
Surpassing VTM-RA! Kuaishou's Bi-Directional Intelligent Video Codec BRHVC Debuts at NeurIPS 2025
机器之心· 2025-11-21 03:56
Core Viewpoint
- The article discusses the challenges and advancements in bi-directional video coding, particularly focusing on the new BRHVC method developed by Kuaishou's audio and video technology team, which significantly improves compression performance over existing standards [2][29].

Video Coding Challenges
- Video coding is essential for addressing the conflict between massive video data and limited transmission and storage resources, with uncompressed 4K video reaching up to 20 GB per minute [4].
- Current video coding techniques can reduce video bitrate by 1/100 to 1/1000, enabling applications like short videos, live streaming, and cloud gaming [4].

Bi-directional Coding
- Bi-directional coding (RA mode) has been a "secret weapon" for efficient compression but faces challenges in deep learning-based intelligent video coding due to complex reference structures [2][7].
- The RA mode can save over 20% bitrate compared to low-latency modes while maintaining high quality, making it suitable for on-demand and storage scenarios [7].

Key Issues in RA Mode
- Long-span frame motion processing is complicated by the exponential growth of frame intervals, which can reach up to 32 frames, leading to significant motion complexity [8].
- There is a notable imbalance in the contribution of reference frames, where the value of information from two reference frames can differ significantly, affecting encoding efficiency [9][11].

BRHVC Framework
- The BRHVC framework introduces two innovative modules, Bi-directional Motion Converge (BMC) and Bi-directional Contextual Fusion (BCF), addressing the challenges of long-span motion processing and reference contribution imbalance [13][20].
- BMC enhances motion estimation by aggregating multi-scale optical flow into a single latent variable, improving motion compensation accuracy in large-displacement scenarios [16][17].
- BCF generates spatially adaptive weight maps to re-weight reference features based on their importance, effectively addressing occlusion issues in long-span frames [20][22] (a minimal sketch of this fusion idea follows the summary).

Experimental Results
- BRHVC achieved an average bitrate saving of 32.0% compared to traditional encoders like VTM-LDB, with a peak saving of 44.7% on Class D sequences [25].
- The framework also surpassed the VTM-RA encoder in encoding efficiency, demonstrating its effectiveness in bi-directional intelligent video compression [25].

Conclusion
- The research highlights the core challenges in bi-directional intelligent video compression and presents the BRHVC framework as a significant advancement, providing a new direction for future developments in intelligent video coding [29].
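The BCF module described above re-weights the two reference contexts with a spatially adaptive map before fusing them. The PyTorch snippet below is a minimal sketch of that idea under assumed layer sizes; the ContextFusion class and its architecture are illustrative stand-ins, not the paper's implementation.

```python
# Minimal sketch of spatially adaptive fusion of two reference contexts,
# in the spirit of BCF. Layer sizes and structure are illustrative assumptions.
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Small conv head that looks at both references and outputs one weight per pixel.
        self.weight_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, ctx_fwd: torch.Tensor, ctx_bwd: torch.Tensor) -> torch.Tensor:
        # w close to 1 favors the forward reference, close to 0 the backward one,
        # so occluded regions can lean on whichever reference actually sees them.
        w = self.weight_head(torch.cat([ctx_fwd, ctx_bwd], dim=1))
        return w * ctx_fwd + (1.0 - w) * ctx_bwd

# Hypothetical usage on feature maps extracted from the two reference frames:
fusion = ContextFusion(channels=64)
fused = fusion(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64))
print(fused.shape)
```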
Meta Superintelligence Labs Publishes Another Paper: Mix a Few Models Together and Performance Goes Straight to SOTA
机器之心· 2025-11-21 03:56
Core Insights
- The article discusses the concept of Model Souping, which involves averaging the weights of multiple models of the same architecture to create a new, stronger model. This method is more lightweight and cost-effective than training a large unified model, while also leveraging the complementary capabilities of different models [1][2].

Group 1: Model Souping Methodology
- Traditional Model Souping typically uses simple uniform averaging, which directly combines the parameters of candidate models with equal weights. The article introduces a systematic approach called Soup of Category Experts (SoCE), which selects optimal model candidates based on benchmark category composition and employs non-uniform weighted averaging to maximize overall performance [2][5].
- SoCE is based on the observation that model performance across different benchmark categories often shows weak correlation. This allows SoCE to strategically select expert models for each weakly correlated category cluster and combine them through optimized weighting [8][11].

Group 2: Experimental Results
- The authors conducted extensive experiments to evaluate the effectiveness of SoCE across multiple dimensions. On the Berkeley Function Calling Leaderboard (BFCL), the 70 billion parameter model achieved an accuracy of 80.68%, setting a new state-of-the-art (SOTA) and improving by 2.7% over the previous best single model [14].
- For the 8 billion parameter model, SoCE reached an accuracy of 76.50%, surpassing the previous 8 billion parameter model by 5.7%. The optimal weight configuration for the 8 billion model was identified as xLAM-2-8b-fc-r (0.7), ToolACE-2-8B (0.2), and watt-tool-8B (0.1) [16][18] (see the weighted-souping sketch after this summary).
- The article presents a correlation heatmap illustrating the performance relationships among different categories, highlighting that strong correlations exist among multi-turn tasks, while weak or negative correlations are observed in other areas [6][8].

Group 3: Performance Improvement
- The analysis indicates that the linear correlation between categories improves significantly after Model Souping. In 37 model souping experiments, the candidates showed higher scores in over 20 categories, with net performance gains across all categories [22][23].
- SoCE successfully identifies specialized models for different categories, leading to substantial performance enhancements [25].
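The recipe at the heart of SoCE, a non-uniform weighted average of same-architecture checkpoints, is simple enough to sketch directly. The snippet below is a minimal illustration rather than the paper's code: the checkpoint paths and the souping helper are hypothetical, while the 0.7/0.2/0.1 weights are the configuration reported above for the 8B models.

```python
# Minimal sketch of non-uniform weighted model souping in the spirit of SoCE.
# Checkpoint paths are hypothetical; the weights mirror the 8B configuration above.
from collections import OrderedDict
import torch

def soup_state_dicts(state_dicts, weights):
    """Return the weighted average of several same-architecture state dicts."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    souped = OrderedDict()
    for key in state_dicts[0]:
        # Only float tensors are averaged; integer buffers (e.g. step counters) are copied.
        if state_dicts[0][key].is_floating_point():
            souped[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        else:
            souped[key] = state_dicts[0][key].clone()
    return souped

# Hypothetical usage with the weight configuration reported for the 8B models:
# sds = [torch.load(p, map_location="cpu") for p in
#        ["xLAM-2-8b-fc-r.pt", "ToolACE-2-8B.pt", "watt-tool-8B.pt"]]
# model.load_state_dict(soup_state_dicts(sds, weights=[0.7, 0.2, 0.1]))
```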
Results of the Two Academies' Academician Elections Announced: Zhou Zhihua and Liu Yunhao Elected to the Chinese Academy of Sciences
机器之心· 2025-11-21 02:04
Core Points
- The Chinese Academy of Sciences and the Chinese Academy of Engineering announced the results of the 2025 academician elections, electing 73 academicians to the former and 71 to the latter, further optimizing the structure of the academician team in China [2][3].
- The average age of newly elected academicians of the Chinese Academy of Sciences is 57.2 years, with 67.1% being 60 years old or younger, and 5 female scientists among the elected [2][3].
- Notably, several scholars in the field of artificial intelligence were elected, highlighting China's ongoing breakthroughs and emphasis on cutting-edge technology [4][7].

Summary by Sections

Chinese Academy of Sciences
- A total of 908 academicians are now in the Chinese Academy of Sciences after this election [3].
- The newly elected academicians include prominent figures in computer science and artificial intelligence, indicating a focus on advanced technology [7].
- Notable elected members include Liu Yunhao, a professor at Tsinghua University recognized for his research in computer system architecture and IoT [10][11], and Zhou Zhihua, a professor at Nanjing University known for his work in machine learning theory and methods [12][15].

Chinese Academy of Engineering
- The Chinese Academy of Engineering elected 71 academicians and 24 foreign academicians in 2025 [25].
- The election reflects a diverse range of expertise across various engineering disciplines, including mechanical, electronic, and environmental engineering [26][27][29][30].
- The elected academicians are affiliated with prestigious institutions, contributing to advancements in their respective fields [26][27][29][30].
No Training Needed, Just a Better Decoding Strategy: The DTS Framework Raises Large-Model Reasoning Accuracy by 6% and Shortens Reasoning Length by 23%
机器之心· 2025-11-21 02:04
Core Insights
- The article discusses the advancements in Large Reasoning Models (LRMs) and introduces DTS (Decoding Tree Sketching), a new inference framework that addresses the issue of "overthinking," which leads to longer and often incorrect reasoning paths [2][8][26].

Group 1: Problem Identification
- The "overthinking" problem in reasoning models results in longer reasoning chains that are more prone to errors and self-repetition, decreasing accuracy [8][11].
- Existing methods to mitigate this issue often rely on additional training or aggressive pruning, which can be costly and unstable [8][11].

Group 2: DTS Framework
- DTS employs two key strategies, branching at high-uncertainty tokens and stopping early upon the first completed path, aiming to approximate the shortest correct reasoning path [2][8][26] (a minimal sketch of this decoding loop follows the summary).
- The framework does not require additional training or modifications to model weights, making it a plug-and-play solution [8][26].

Group 3: Empirical Results
- On AIME 2024/2025, DTS achieved an average accuracy improvement of 6% and a reduction in average reasoning length of approximately 23%, along with a 10% decrease in endless-repetition rates [4][20].
- The empirical findings indicate a significant negative correlation between reasoning chain length and accuracy, with shorter reasoning chains often yielding higher correctness rates [9][11].

Group 4: Methodology
- The reasoning process is conceptualized as a decoding tree, where nodes represent generated tokens and paths represent complete chains of thought (CoT) [12][13].
- DTS branches only at "key tokens" where uncertainty is high, thereby avoiding unnecessary growth of the decoding tree [15][16].

Group 5: Conclusion and Future Directions
- DTS provides a lightweight optimization route for reasoning models, allowing them to "think less but more accurately" [26][27].
- The approach is expected to integrate with multi-step reasoning, calibration, and uncertainty estimation, paving the way for more efficient and reliable reasoning in LRMs [27].
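Conceptually, the DTS decoding loop described above only needs two ingredients: an uncertainty test that decides whether the current token is a branch point, and an early exit as soon as any path reaches the end-of-sequence token. The snippet below is a rough sketch of that loop; the step() callback, entropy threshold, and branching factor are illustrative assumptions rather than the paper's settings.

```python
# Rough sketch of entropy-gated tree decoding with first-finish early stopping.
# step(path) is a hypothetical callback returning next-token logits for a prefix.
import torch
import torch.nn.functional as F

def dts_decode(step, prefix, eos_id, max_len=512, entropy_thresh=2.0, top_k=2):
    frontier = [list(prefix)]                 # paths waiting to be expanded
    while frontier:
        path = frontier.pop(0)
        while len(path) < max_len:
            logits = step(path)                                  # shape: (vocab,)
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
            if entropy > entropy_thresh:
                # Key token: keep the best candidate, defer the alternatives.
                cands = torch.topk(probs, top_k).indices.tolist()
                for tok in cands[1:]:
                    frontier.append(path + [tok])
                path = path + [cands[0]]
            else:
                path = path + [int(probs.argmax())]
            if path[-1] == eos_id:
                return path        # early stop: the first finished path is returned
        # this branch hit max_len without finishing; try the next deferred branch
    return None
```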
AAAI 2026 Oral | Volcano Engine Multimedia Lab Proposes VQ-Insight, a Large Model for AIGC Video Quality Understanding
机器之心· 2025-11-20 15:13
Core Insights
- The article discusses the advancements made by ByteDance's Volcano Engine Multimedia Lab in the field of multimedia technology, particularly focusing on the VQ-Insight model for AI-generated video quality assessment [2][4][19].
- VQ-Insight has been recognized at the AAAI 2026 conference, highlighting its significance in the artificial intelligence research community [2].

Research and Development
- The Volcano Engine Multimedia Lab collaborates with Peking University and has produced a paper titled "VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning," which was selected for oral presentation at AAAI 2026 [2][4].
- The lab has achieved multiple accolades in international technical competitions and has published numerous papers in top-tier journals [2].

Methodology
- VQ-Insight employs a progressive visual quality reinforcement learning framework, which includes phases for image scoring, task-driven temporal learning, and joint fine-tuning with video generation models [6][19].
- The model aims to enhance the understanding of video quality by focusing on temporal coherence and multi-dimensional quality assessment, addressing challenges in AI-generated content evaluation [4][6].

Performance Metrics
- VQ-Insight has demonstrated superior performance in various tasks, including AIGC video preference comparison and multi-dimensional scoring, outperforming state-of-the-art methods on multiple datasets [10][12][19].
- In the AIGC preference comparison task, VQ-Insight achieved a performance score of 50.80 in VOAScore and 75.71 in VideoReward, indicating its effectiveness in evaluating video quality [11].

Application and Impact
- The model's capabilities can be directly applied to optimize video generation models, enhancing the quality of generated content by providing accurate preference data for training [17][19].
- VQ-Insight serves as a plug-and-play reward and preference module for video generation training, contributing to the development of next-generation AIGC video technologies [19].
Google's Nano Banana Pro Goes Live, Deeply Integrated with Gemini 3, and Now It Generates Whole Worlds
机器之心· 2025-11-20 15:13
Core Viewpoint
- Google has launched Nano Banana Pro (Gemini 3 Pro Image), an advanced image generation model that enhances creative control, text rendering, and world knowledge, enabling users to create studio-level design work with unprecedented capabilities [3][4][6].

Group 1: Model Capabilities
- Nano Banana Pro can generate high-resolution images at 2K and 4K, significantly improving detail, precision, stability, consistency, and controllability [8][9].
- The model supports a wide range of aspect ratios, addressing previous limitations in controlling image proportions [9][11].
- Users can combine up to 14 reference images while maintaining consistency among up to 5 characters, enhancing the model's ability to create visually coherent compositions [12][13][23].

Group 2: Creative Control
- The model allows for "molecular-level" control over images, enabling users to select and reshape any part of an image for precise adjustments [25][26].
- Users can switch camera angles, generate different perspectives, and apply cinematic color grading, providing a high degree of narrative control [32][26].

Group 3: Text Generation
- Nano Banana Pro features strong text generation capabilities, producing clear, readable, and multilingual text that integrates seamlessly with images [34][40].
- The model can translate text into different languages while maintaining high-quality detail and font style [41].

Group 4: Knowledge Integration
- The model leverages Gemini 3's advanced reasoning to produce visually accurate content, incorporating a vast knowledge base into the generation process [44].
- It can connect to real-time web content to generate outputs based on the latest data, enhancing the accuracy of visual representations [45][46].

Group 5: User Accessibility
- Nano Banana Pro will be available across various Google products, targeting consumers, professionals, and developers, with different access levels based on subscription type [59][60][61].
- The model will also be integrated into Google Workspace applications, enhancing productivity tools like Google Slides and Google Vids [62].

Group 6: Verification and Transparency
- Google has introduced a new feature allowing users to verify whether an image was generated or edited by Google AI, enhancing content transparency [56][57].
- This capability is powered by SynthID, a digital watermarking technology that embeds imperceptible signals into AI-generated content [57].
DeepSeek Quietly Open-Sources LPLB: Using Linear Programming to Fix MoE Load Imbalance
机器之心· 2025-11-20 15:13
Core Insights
- DeepSeek has launched a new code repository called LPLB (Linear-Programming-Based Load Balancer) on GitHub, which aims to optimize workload distribution in Mixture of Experts (MoE) models [2][5].
- The project is currently in the early research stage, and its performance improvements are still under evaluation [8][15].

Project Overview
- LPLB is designed to address dynamic load imbalance during MoE training by utilizing linear programming algorithms [5][9].
- The load balancing process involves three main steps: dynamic reordering of experts based on workload statistics, constructing replicas of experts, and solving for the optimal token distribution for each batch of data [5][6] (a toy linear-programming formulation is sketched after this summary).

Technical Mechanism
- The expert reordering process is assisted by EPLB (Expert Parallel Load Balancer), and real-time workload statistics can be collected from various sources [6][11].
- LPLB employs a lightweight solver that uses NVIDIA's cuSolverDx and cuBLASDx libraries for efficient linear algebra operations, ensuring minimal resource consumption during optimization [6][11].

Limitations
- LPLB currently focuses on dynamic fluctuations in workload, while EPLB addresses static imbalances [11][12].
- The system has some limitations, including ignoring nonlinear computation costs and potential delays in solving the optimization problem, which may affect performance under certain conditions [11][12].

Application and Value
- The LPLB library aims to solve the "bottleneck effect" in large model training, where training speed is often limited by the slowest GPU [15].
- It introduces linear programming as a mathematical tool for real-time optimal allocation and leverages NVSHMEM technology to overcome communication bottlenecks, making it a valuable reference for developers researching MoE training acceleration [15].
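The third step above, solving for the optimal token distribution per batch, can be illustrated as a small linear program: given each expert's token count and the GPUs that hold its replicas, minimize the maximum per-GPU load. The SciPy example below is a toy formulation for intuition only; the actual repository solves the problem on-GPU with cuSolverDx/cuBLASDx and integrates with NVSHMEM, neither of which this sketch models.

```python
# Toy LP: assign each expert's tokens across its replicas to minimize the max GPU load.
import numpy as np
from scipy.optimize import linprog

tokens   = np.array([300, 120, 80])        # tokens routed to experts 0..2 this batch
replicas = {0: [0, 1], 1: [1], 2: [0]}     # expert -> GPUs holding a replica
num_gpus = 2

# Variables: one x[e, g] per (expert, replica GPU), plus the max-load variable L (last).
var_index = [(e, g) for e, gpus in replicas.items() for g in gpus]
n = len(var_index) + 1

c = np.zeros(n)
c[-1] = 1.0                                # objective: minimize L

# Equality constraints: each expert's tokens are fully assigned across its replicas.
A_eq = np.zeros((len(tokens), n))
b_eq = tokens.astype(float)
for i, (e, g) in enumerate(var_index):
    A_eq[e, i] = 1.0

# Inequality constraints: total load on each GPU must not exceed L.
A_ub = np.zeros((num_gpus, n))
b_ub = np.zeros(num_gpus)
for i, (e, g) in enumerate(var_index):
    A_ub[g, i] = 1.0
A_ub[:, -1] = -1.0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
print("max per-GPU load:", res.x[-1])      # 250.0 for this toy instance
print("assignment:", {eg: round(x, 1) for eg, x in zip(var_index, res.x[:-1])})
```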
The Biggest Gaming YouTuber Is Playing with Local AI Too? Parallax Arrives, Letting Even a Laptop Run Large Models
机器之心· 2025-11-20 09:35
Core Viewpoint
- PewDiePie, a prominent gaming influencer, has created a local AI system, sparking widespread discussion about the potential of local AI deployments versus cloud-based solutions [1][5][6].

Group 1: Local AI System Development
- PewDiePie invested $20,000 to assemble a local AI system with 10 NVIDIA GPUs, including 8 modified RTX 4090 cards and 2 RTX 4000 Ada cards, capable of running models with 70 billion to 245 billion parameters [4].
- The local AI system allows for complete control over the AI environment, in contrast to traditional cloud-based AI models where users rent resources without ownership [10][11].
- The local AI's key advantages include privacy, performance, and composability, making it an attractive option for users concerned about data security and control [12][18].

Group 2: Rise of Local AI Projects
- The emergence of local AI projects like Parallax has gained significant attention, with endorsements from various AI communities and platforms [16][23].
- Parallax is described as a fully autonomous local AI operating system, challenging the notion that AI must be cloud-based [24][25].
- The system supports cross-platform deployment across different devices, allowing users to maintain control over their models and data [26].

Group 3: Performance and Scalability
- Parallax offers three operational modes: single device, local cluster, and global cluster, enabling flexible deployment options [29].
- Performance tests indicate that Parallax can significantly enhance inference speed and throughput compared to existing solutions, achieving up to 3.2 times higher throughput in GPU pool configurations [31].
- The system is compatible with over 40 open-source models and can run seamlessly on various operating systems, enhancing its accessibility [31].

Group 4: Getting Started with Parallax
- The Parallax GitHub repository provides clear guidance for users to start deploying models on their devices [33].
- Users have successfully run models like Qwen 235B on personal devices, indicating the practicality of local AI setups [34].
- An ongoing event encourages users to showcase their local AI setups, with attractive prizes, further promoting engagement with the Parallax platform [37][38].