Everything Can Be Segmented: Can Meta SAM 3D Help AI Understand This Complex, Chaotic World? | Jinqiu AI Lab (锦秋AI实验室)

Core Viewpoint
- The article evaluates Meta's SAM 3D AI model, highlighting its strengths in 3D understanding and generation, while also identifying significant limitations in complex real-world scenarios [3][7][57].

Group 1: Testing Scenarios
- Round 1 focuses on SAM 3D's ability to infer human body structures under various conditions, revealing impressive capabilities in complex occlusion scenarios, such as accurately identifying individuals in Raphael's painting "The School of Athens" [9][10][11].
- However, in scenarios involving close physical contact, like arm wrestling, the model struggles to distinguish between overlapping body parts, leading to incorrect 3D representations [16].
- The model also fails to recognize non-standard body types, such as infants, particularly in mirrored images, indicating a reliance on adult body templates and a lack of understanding of proportions [19][21][23][29].

Group 2: Object Recognition and Segmentation
- Round 2 assesses SAM 3D's semantic segmentation and labeling capabilities, particularly with stacked objects like delivery boxes and fruit platters. The model performs adequately with clear boundaries but falters when faced with reflective or obscured surfaces [35][37][40].
- The model exhibits significant confusion in categorizing similar objects, misidentifying fruits and failing to accurately label them, which impacts subsequent 3D generation [42].

Group 3: Architectural Understanding
- The architectural testing phase evaluates SAM 3D's comprehension of rigid structures and spatial relationships. The model can reconstruct simple buildings but produces rough outputs lacking detail [44][50].
- When presented with complex architectural designs, such as the CCTV headquarters, the model recognizes basic topological features but fails to accurately represent intricate structures in 3D [53][56].

Conclusion
- The evaluation concludes that while SAM 3D demonstrates advanced capabilities in understanding and generating 3D representations, it struggles with complex scenarios, indicating a gap between theoretical potential and practical application [57][60].
- Because the model prioritizes semantic information over detailed visual aesthetics, it is better suited to applications in robotics and augmented reality than to traditional artistic rendering [64].