Multimodal Agents
Watch Upstream Price Fluctuations in Black-Series and Agricultural Commodities
Hua Tai Qi Huo· 2025-08-19 03:22
Report Summary
1. Industry Investment Rating
- No information provided in the content.
2. Core Viewpoints
- The report focuses on price fluctuations in the upstream of the black-series (ferrous) and agricultural sectors, and also notes midstream developments in the production and service industries [1].
- It emphasizes the need to watch the implementation of new real-estate policies and the evolution of demand driven by artificial intelligence technology [1].
3. Summary by Industry Segment
Upstream
- In the black-series sector, the glass price has declined significantly year-on-year [2].
- In the agricultural sector, egg and palm oil prices are rising. The spot price of eggs on August 18 was 6.7 yuan/kg, up 5.02% year-on-year, and the spot price of palm oil was 9,626.0 yuan/ton, up 6.39% year-on-year [2][47].
Midstream
- In the chemical industry, the PX operating rate is increasing [3].
Downstream
- In the real-estate industry, sales of commercial housing in first- and second-tier cities have declined [4].
- In the service industry, growth in the number of domestic flights has moderated [4].
4. Key Data Charts
- The report includes data charts on coal consumption, inventories, operating rates of various industries (such as PTA, PX, and polyester), peak congestion indices of major cities, movie box office, flight execution, and real-estate transaction data [6].
5. Key Industry Price Index Tracking
- The report tracks prices across agriculture, non-ferrous metals, black-series metals, non-metals, energy, chemicals, and real estate. For example, the spot price of corn on August 18 was 2,317.1 yuan/ton, down 0.18% year-on-year; in the black-series segment, the spot price of glass on August 18 was 14.3 yuan/square meter, down 5.12% year-on-year [47].
ByteDance Seed Open-Sources a Long-Term-Memory Multimodal Agent That Can Hear and See Like a Human
量子位· 2025-08-18 06:55
Core Insights
- The article covers the launch of M3-Agent, a new multimodal agent framework from ByteDance Seed that can process real-time visual and auditory inputs, build and update long-term memory, and accumulate semantic memory over time [2][7].
Group 1: M3-Agent Features
- M3-Agent is capable of human-like perception, including hearing and seeing, and is free and open-source [2].
- It is evaluated on M3-Bench, a new long-video question-answering benchmark developed jointly by ByteDance Seed, Zhejiang University, and Shanghai Jiao Tong University to measure memory effectiveness and memory-grounded reasoning [2][22].
Group 2: Performance Metrics
- Experimental results show that M3-Agent significantly outperforms baseline models, including commercial models such as Gemini-1.5-Pro and GPT-4o, across multiple benchmark tests [3][30].
- On the M3-Bench-robot benchmark, M3-Agent improved accuracy by 6.3% over the strongest baseline, MA-LMM, while on M3-Bench-web and VideoMME-long it surpassed the top baseline, Gemini-GPT4o-Hybrid, by 7.7% and 5.3% respectively [34][35].
Group 3: Memory and Reasoning Capabilities
- M3-Agent runs two parallel processes: a memory process that continuously perceives real-time multimodal inputs to build and update long-term memory, and a control process that interprets external instructions and reasons over stored memories to execute tasks [8][9].
- The memory process produces two types of memory: event memory, which records specific events observed in videos, and semantic memory, which distills general knowledge from those segments [11][12].
Group 4: Benchmarking and Evaluation
- M3-Bench consists of two subsets: M3-Bench-robot, with 100 real-world videos recorded from a robot's first-person perspective, and M3-Bench-web, with 920 videos from various online sources [26].
- The benchmark evaluates the agent's ability to recall past observations and reason over memory through several question types, including multi-detail, multi-hop, cross-modal reasoning, and general-knowledge extraction [24][27].
Group 5: Conclusion
- The results indicate that M3-Agent excels at maintaining character consistency, improving human understanding, and effectively integrating multimodal information [36].
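As a rough illustration of the dual-process design described above, here is a minimal Python sketch. All names (MemoryStore, memory_process, control_process) are hypothetical stand-ins, not the open-sourced API, and the multimodal-LLM perception and retrieval steps are replaced by trivial stubs.

```python
# Minimal sketch of M3-Agent's dual-process design as described above.
# All names are illustrative, not the released API; the real system
# delegates perception, memory extraction, and reasoning to multimodal LLMs.
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    events: list[str] = field(default_factory=list)    # "event memory": concrete observations
    semantic: list[str] = field(default_factory=list)  # "semantic memory": distilled knowledge

    def add_clip(self, observation: str, inferred_fact: str | None) -> None:
        self.events.append(observation)
        if inferred_fact:
            self.semantic.append(inferred_fact)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Stand-in for embedding-based retrieval over both memory types.
        pool = self.events + self.semantic
        return [m for m in pool if any(w in m for w in query.lower().split())][:k]

def memory_process(video_clips: list[str], store: MemoryStore) -> None:
    """Continuously perceive incoming clips and update long-term memory."""
    for clip in video_clips:
        # A multimodal model would caption the clip and infer facts here.
        store.add_clip(observation=f"saw: {clip}", inferred_fact=None)

def control_process(instruction: str, store: MemoryStore) -> str:
    """Interpret an instruction and reason over retrieved memories."""
    memories = store.retrieve(instruction)
    return f"answer({instruction!r}) grounded in {len(memories)} memories"

store = MemoryStore()
memory_process(["person enters kitchen", "person pours coffee"], store)
print(control_process("what did the person do?", store))
```

The point of the two processes running in parallel is that memory keeps accumulating even while no task is active, so the control process can later answer questions about events it never saw directly.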
AI Coding Disruption Is Here: What Should Programmers Do? IDEA Research Institute's Zhang Lei: Low-Level Systems Skills Are the Real Moat
AI前线· 2025-08-10 05:33
Core Insights
- The article discusses the challenges and opportunities in artificial intelligence, focusing on the integration of visual understanding, spatial intelligence, and action execution in multimodal agents [2][5][10].
Group 1: Multi-Modal Intelligence
- The transition to a new era of multimodal agents requires overcoming significant challenges in visual understanding, spatial modeling, and the integration of perception, cognition, and action [2][4].
- Effective integration of language models, robotics, and visual technologies is crucial for the advancement of AI [5][9].
Group 2: Visual Understanding
- Visual input is high-dimensional and requires understanding of three-dimensional structure and interaction, which is complex and often overlooked [6][7].
- Visual understanding is essential for robots to perform tasks accurately, as it directly determines their operational success rates [7][8].
Group 3: Spatial Intelligence
- Spatial intelligence is vital for robots to identify objects, judge distances, and understand structure for effective action planning [7][10].
- Current models, such as vision-language-action (VLA) models, struggle to understand and locate objects precisely, which limits their practical application [8][9].
Group 4: Research and Application Balance
- Researchers in industry must balance foundational research with practical application, focusing on solving real-world problems rather than merely publishing papers [12][14].
- The ideal research outcome combines research value and application value, avoiding work that lacks significance in either dimension [12][13].
Group 5: Recommendations for Young Professionals
- Young professionals should build solid foundations in computer science, including operating systems and distributed systems, rather than relying solely on experience with large models [17][20].
- Emphasis should be placed on understanding the principles behind AI technologies and their applications, rather than just parameter tuning [19][20].
HiDream.ai Debuts at WAIC: Multimodal Agents Reshape the Future Landscape of Creation
Cai Fu Zai Xian· 2025-07-29 03:28
Core Insights
- The article highlights the technological breakthroughs and commercialization practice of multimodal AI in content creation, as presented by the CTO of HiDream.ai at the 2025 World Artificial Intelligence Conference (WAIC) [1].
Multimodal AI Development
- HiDream.ai focuses on addressing real creative pain points, pursuing a commercialization path of "technology foundation, scenario breakthrough, and value closure" [1].
- The company holds that true AI commercialization means end-to-end empowerment, from model capabilities to service forms to final outcomes [1].
Commercialization Framework
- The company has built a progressive "MaaS-SaaS-RaaS" commercialization system:
  - MaaS (Model as a Service) is the foundation, aiming to create a multimodal base model worth billions that supports the generation and understanding of images, videos, audio, and text [1].
  - SaaS (Software as a Service) is the bridge, developing products for vertical scenarios and building platforms for individual creators to lower the barriers to creation [2].
  - RaaS (Result as a Service) is the end goal, delivering tangible results to clients through commercial video marketing services and new-media creation agents, positioning AI as a true productivity tool [3].
Technological Advancements
- HiDream.ai's multimodal model has gone through three major iterations, strengthening its core advantages of deep understanding, precise control, and high-quality output [4].
- The model's evolution: version 1.0 launched in August 2023, achieving multimodal alignment; version 2.0 in June 2024, enhancing spatiotemporal modeling; version 3.0 in December 2024, adding multi-scenario learning and memory enhancement [4].
Performance Metrics
- HiDream's open-source models have been downloaded more than 600,000 times and rank highly on international leaderboards [6].
- The HiDream-I1 model topped the Artificial Analysis leaderboard within 24 hours of its open-source release, a milestone for Chinese self-developed models [6].
Product Offerings
- The company has built a comprehensive agent-centered toolchain for content creation, covering image generation, video creation, and marketing communication [11].
- The vivago agent focuses on short-video creation, letting users supply various media inputs for automatic analysis and content generation [11].
- HiClip, a long-video editing agent, tackles content overload and inefficient distribution by extracting key segments and generating audio summaries [12].
Ecosystem Collaboration
- HiDream.ai is building an ecosystem network across industries including cross-border commerce, internet, film, new media, and cultural tourism, creating a win-win "technology-scenario-ecosystem" loop [13].
Vision for Creators
- The company aims to let every creator unleash their creative potential, ensuring that AI truly understands and assists the creative process [15].
High-Level Policy Directs, Low-Level Policy Executes: "Frame Transfer Interface" Enables Generalized Learning From a Single Demonstration | ICML 2025
量子位· 2025-07-22 04:35
Core Viewpoint
- The HEP (Hierarchical Equivariant Policy via Frame Transfer) framework, developed by Northeastern University and Boston Dynamics RAI, aims to let robots adapt to complex real-world scenarios from minimal demonstrations, improving the efficiency and flexibility of robot learning [1][4].
Summary by Sections
HEP Framework Highlights
- The framework expresses 3D visual information efficiently while balancing detail restoration against computational speed [2].
Core Innovations
- It addresses the long-standing problems of data scarcity and poor generalization by combining a hierarchical policy-learning structure with a frame transfer interface, which imposes a strong inductive bias while preserving flexibility [4].
Simplified and Efficient Hierarchical Structure
- The high-level policy sets global objectives, while the low-level policy optimizes actions in a local coordinate frame, significantly improving operational flexibility and efficiency [5].
- The model automatically adapts to spatial transformations such as translation and rotation, greatly reducing the amount of data needed for generalization [5].
Key Concepts
- HEP rests on two core ideas: a hierarchical policy structure and the frame transfer interface, in which the high-level policy hands the low-level policy a reference coordinate frame in which to optimize execution details [7].
- The frame transfer interface increases the low-level policy's flexibility while passing down the high-level policy's generalization and robustness [9].
Effectiveness Demonstration
- The research team tested HEP on 30 simulated tasks in RLBench, including high-precision and long-horizon tasks, and further validated it on three real-world robotic tasks [10].
- The high-level policy predicts a "key pose" for global planning, while the low-level policy generates detailed motion trajectories relative to that key pose [11].
Results
- The hierarchical strategy shows clear advantages on complex long-horizon tasks: HEP learns robust multi-step collaborative tasks from only 30 demonstrations, outperforming non-hierarchical methods [14].
- On the Pick & Place task, HEP achieved one-shot generalization from a single demonstration, greatly improving data efficiency [15].
- The frame transfer interface successfully passes the high-level policy's adaptability to spatial changes down to the low-level policy, making the overall strategy easier to extend to new scenarios [16].
- HEP's success rate improved by up to 60% over traditional methods under environmental changes and distraction from unrelated objects [17].
Future Implications
- The frame transfer interface imposes only soft constraints on the low-level policy, preserving flexibility and providing a natural interface for future integration of multimodal and cross-platform high-level strategies [19].
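To make the frame-transfer idea concrete, here is a minimal sketch under stated assumptions: a 2D planar pose instead of the full 3D poses used in the paper, and hypothetical function names standing in for the learned high- and low-level policies.

```python
# Minimal sketch of the frame-transfer idea behind HEP (hypothetical names,
# 2D planar pose for brevity; the actual method works with full 3D poses).
import numpy as np

def high_level_policy(observation: np.ndarray) -> tuple[np.ndarray, float]:
    """Predict a 'key pose' (position + heading) that defines a local frame.
    Stand-in for a learned, equivariant high-level policy."""
    target_xy = observation[:2]      # e.g., detected object location
    heading = float(observation[2])  # e.g., approach angle
    return target_xy, heading

def low_level_policy(local_goal: np.ndarray) -> np.ndarray:
    """Produce a motion step expressed in the key pose's local frame.
    Stand-in for a learned trajectory generator."""
    return 0.1 * local_goal / (np.linalg.norm(local_goal) + 1e-8)

def frame_transfer(world_point: np.ndarray, key_xy: np.ndarray,
                   key_heading: float) -> np.ndarray:
    """Express a world-frame point in the key pose's local frame."""
    c, s = np.cos(key_heading), np.sin(key_heading)
    rot_inv = np.array([[c, s], [-s, c]])  # inverse 2D rotation
    return rot_inv @ (world_point - key_xy)

# Because the low-level policy only ever sees local coordinates, translating
# or rotating the whole scene leaves its inputs unchanged -- the source of
# the generalization the article describes.
obs = np.array([2.0, 1.0, np.pi / 4])  # toy observation
key_xy, key_heading = high_level_policy(obs)
gripper_world = np.array([1.5, 0.5])
local = frame_transfer(gripper_world, key_xy, key_heading)
print("local goal:", local, "local step:", low_level_policy(local))
```

The design choice to pass a reference frame rather than a full trajectory is what keeps the constraint "soft": the low-level policy retains freedom over execution details while inheriting the high-level policy's spatial invariances.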
A Breakthrough in Presentation Generation: PresentAgent, From Text to Presentation Video
机器之心· 2025-07-18 08:18
Core Viewpoint
- PresentAgent is introduced as a multimodal agent that transforms long documents into narrated presentation videos, overcoming the limitations of existing methods that generate only static slides or text summaries [1][9].
Group 1: System Overview
- PresentAgent uses a modular process: systematic document segmentation, slide-style planning and rendering, context-aware voice-narration generation with large language models, and precise audio-visual alignment to assemble the complete video [3][5][19].
- The system accepts various document types (e.g., web pages, PDFs) as input and outputs a presentation video combining slides with synchronized narration [17][19].
Group 2: Evaluation Framework
- PresentEval is introduced as a unified evaluation framework driven by vision-language models, assessing content fidelity, visual clarity, and audience comprehension [6][10].
- Evaluation on a carefully curated dataset of 30 document-presentation pairs shows that PresentAgent performs close to human level on all metrics [7][12].
Group 3: Contributions
- The paper defines a new task, document-to-presentation-video generation, which automatically produces structured slide videos with voice narration from various long texts [12].
- A high-quality benchmark dataset, Doc2Present Benchmark, is constructed to support evaluation of the task [12].
- PresentAgent's modular design enables controllable, interpretable, multimodal alignment, balancing high-quality generation with fine-grained evaluation [19][27].
Group 4: Experimental Results
- Most PresentAgent variants achieve test accuracy comparable or superior to the human benchmark, with Claude-3.7-sonnet reaching the highest accuracy of 0.64 [22][25].
- Subjective quality assessments show that human-made presentations still lead in overall video and audio ratings, but some PresentAgent variants are competitive, particularly in content and visual appeal [26][27].
Group 5: Case Study
- A fully automated presentation video generated by PresentAgent illustrates the system's ability to identify structural segments and produce slides with conversational subtitles and synchronized voice, effectively conveying technical information [29].
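As a rough illustration of the modular pipeline described above, the following sketch stubs each stage. The function names are hypothetical, and the LLM, TTS, and rendering components the paper relies on are replaced by trivial stand-ins.

```python
# Minimal sketch of a PresentAgent-style pipeline (hypothetical names;
# the real system delegates each step to LLM/TTS/rendering components).
from dataclasses import dataclass

@dataclass
class SlideSegment:
    title: str
    bullets: list[str]
    narration: str
    audio_seconds: float

def segment_document(text: str) -> list[str]:
    """Step 1: split the document into semantically coherent chunks
    (stub: paragraph split; the real system segments by structure)."""
    return [p for p in text.split("\n\n") if p.strip()]

def plan_slide(chunk: str) -> tuple[str, list[str]]:
    """Step 2: plan a slide title and bullets (stub for an LLM call)."""
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    return sentences[0][:60], sentences[:3]

def write_narration(chunk: str) -> str:
    """Step 3: rewrite the chunk as spoken narration (stub for an LLM call)."""
    return "In this section, " + chunk[:120]

def synthesize(narration: str) -> float:
    """Step 4: TTS stub; returns audio duration assuming ~2.5 words/sec."""
    return len(narration.split()) / 2.5

def build_presentation(document: str) -> list[SlideSegment]:
    slides = []
    for chunk in segment_document(document):
        title, bullets = plan_slide(chunk)
        narration = write_narration(chunk)
        # The synthesized audio duration drives how long each slide stays
        # on screen, which is the audio-visual alignment step.
        slides.append(SlideSegment(title, bullets, narration, synthesize(narration)))
    return slides

doc = "PresentAgent turns documents into videos.\n\nIt aligns slides with narration."
for s in build_presentation(doc):
    print(f"{s.title!r}: {s.audio_seconds:.1f}s of narration")
```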
AI Coding Disruption Is Here: What Should Programmers Do? IDEA Research Institute's Zhang Lei: Low-Level Systems Skills Are the Real Moat
AI前线· 2025-07-13 04:12
Core Viewpoint
- The article discusses the challenges and opportunities in developing multimodal agents, emphasizing the need to integrate perception, cognition, and action effectively in AI systems [1][2][3].
Multi-modal Intelligent Agents
- The three essential components of an intelligent agent are "seeing" (understanding input), "thinking" (processing information), and "doing" (executing actions), all critical for advancing AI capabilities [2][3].
- Work should focus on practical problems with real-world applications rather than purely academic pursuits [2][3].
Visual Understanding and Spatial Intelligence
- Visual input is complex and high-dimensional, requiring a deep understanding of three-dimensional structure and interaction with objects [3][5].
- Current models, such as vision-language-action (VLA) models, struggle with precise object understanding and localization, leading to low operational success rates [5][6].
- Achieving high accuracy in robotic operation is crucial, as even a small failure rate can lead to user dissatisfaction [5][8].
Research and Product Balance
- Researchers in industry must balance conducting foundational research against ensuring their findings are applied in practice [10][11].
- The ideal research outcome combines research value and application value, avoiding work that lacks significance in either dimension [11][12].
Recommendations for Young Professionals
- Young professionals should build solid foundations in computer science, including operating systems and distributed systems, rather than focusing solely on model tuning [16][17].
- The ability to optimize systems and understand underlying principles is more valuable than merely adjusting parameters in AI models [17][18].
- A strong foundation in the basic disciplines provides a competitive advantage in the evolving AI landscape [19][20].
Grok-4: The "Strongest AI on Earth," According to Musk
Sou Hu Cai Jing· 2025-07-11 12:58
Core Insights
- Musk's xAI has launched Grok-4, which it bills as the "smartest AI in the world" and which has excelled across AI benchmark tests [1][8][10].
Company Overview
- xAI was founded on July 12, 2023, with the goal of tackling deeper scientific questions and helping solve complex scientific and mathematical problems [3].
- Grok-4 is available by subscription: Grok-4 at $30 per month and Grok-4 Heavy at $300 per month, currently the most expensive AI subscription plan [5].
Performance Metrics
- Grok-4 posted strong scores across benchmarks: 88.9% on GPQA (graduate-level question answering), 100% on AIME25 (American Invitational Mathematics Examination), 79.4% on LiveCodeBench (programming), 96.7% on HMMT25 (Harvard-MIT Mathematics Tournament), and 61.9% on USAMO25 (USA Mathematical Olympiad) [8][10].
- On Humanity's Last Exam (HLE), Grok-4 Heavy reached 44.4% accuracy, demonstrating doctoral-level performance across all fields [10].
Technological Advancements
- Grok-4's training volume is 100 times that of Grok-2 and 10 times that of Grok-3, with significant gains in reasoning and tool-use capabilities [15][16].
- The model is expected to integrate with Tesla-like tools later this year, improving its ability to interact with the real world [16].
Future Prospects
- Musk anticipates that Grok could discover useful new technologies as early as next year, with a strong possibility of uncovering new physics within two years [13][15].
- The company plans to develop AI-generated video games and films, with the first AI movie expected next year [23][25].
Economic Potential
- In a simulated business scenario, Grok-4 out-earned other models, creating double the value of its closest competitor [22].
- Musk claimed that with 1 million vending machines, the AI could generate $4.7 billion annually, which works out to roughly $4,700 per machine per year [22].
Documents Become Narrated Presentation Videos in Seconds! Open-Source Agent Handles Business Reports and Academic Papers at Near-Human Level
量子位· 2025-07-11 04:00
Core Viewpoint
- PresentAgent is a multimodal AI agent that automatically converts structured or unstructured documents into presentation videos with synchronized voiceover and slides, aiming to replicate human-style information delivery [1][3][22].
Group 1: Functionality and Process
- PresentAgent generates tightly synchronized visual content and spoken explanation, effectively simulating a human-style presentation for document types such as business reports, technical manuals, policy briefs, and academic papers [3][21].
- The system uses a modular generation framework: semantic chunking of the input document, layout-guided slide generation, rewriting key information into spoken text, and synchronizing voice with slides to produce a coherent presentation video [11][20].
- The process runs through document processing, structured slide generation, synchronized subtitle creation, and voice synthesis, ultimately outputting a presentation video that combines slides and voice [13][14].
Group 2: Evaluation and Performance
- The team evaluated the system on a test set of 30 pairs of human-made document-presentation videos across fields, using a dual-path evaluation strategy that scores content understanding and quality with vision-language models [21][22].
- PresentAgent performed close to human level on all metrics, including content fidelity, visual clarity, and audience comprehension, showing its potential for turning static text into dynamic, accessible presentation formats [21][22].
- The results indicate that combining language models, visual layout generation, and multimodal synthesis can yield an explainable and scalable automated presentation-generation system [23].
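The voice-slide synchronization step lends itself to a small illustration: a sketch that assigns each slide an on-screen interval matching its narration. All names here are hypothetical, and duration is estimated from word count; a real system would read exact durations from the TTS engine's output audio.

```python
# Minimal sketch of the voice-slide synchronization step described above
# (hypothetical names; durations are estimated, not measured from TTS output).
def slide_timeline(narrations: list[str], words_per_second: float = 2.5,
                   min_seconds: float = 3.0) -> list[tuple[float, float]]:
    """Assign each slide a (start, end) interval matching its narration.
    A minimum duration keeps very short slides readable."""
    timeline, t = [], 0.0
    for text in narrations:
        duration = max(len(text.split()) / words_per_second, min_seconds)
        timeline.append((t, t + duration))
        t += duration
    return timeline

narrations = [
    "This report converts documents into presentation videos.",
    "Each slide stays on screen exactly as long as its narration plays.",
]
for i, (start, end) in enumerate(slide_timeline(narrations), 1):
    print(f"slide {i}: {start:5.1f}s -> {end:5.1f}s")
```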
2025 Audi A7L Launches From 418,700 Yuan; Alibaba Cloud Partners With BYD to Bring Mobile-Agent Into BYD Cockpits | Auto & Transport Daily
创业邦· 2025-06-10 10:26
Group 1
- Zeekr has announced a patent for a vehicle anti-tailgating alert system, intended to ease driving anxiety by giving following vehicles real-time distance alerts [1].
- Alibaba Cloud and BYD are collaborating to integrate the Mobile-Agent AI system into BYD's vehicle cockpit, enhancing user interaction through visual recognition and multimodal capabilities [1].
- Lynk & Co has launched a refreshed Lynk 01, priced from 118,800 yuan, featuring a new floating central control screen and updated design elements [1].
Group 2
- The 2025 Audi A7L has been released at a price range of 418,700 to 666,200 yuan, keeping the design of its predecessor while making minor configuration adjustments [3].