Without good data sets AI algorithms won't be productive: Gecko Robotics CEO Loosararian
CNBC Television· 2025-08-29 21:26
AI Adoption Challenges in Industrial Sectors
- Many companies, especially in energy, manufacturing, defense, and public infrastructure, lack the data needed for effective AI implementation [3]
- Companies are frustrated with AI's ROI, as the promised returns are often not practical in industrial settings [5]
- The impact AI has had in everyday applications like ChatGPT has not yet been realized in industrial sectors [6]

Gecko Robotics' Solution and Value Proposition
- Gecko Robotics focuses on providing data solutions to improve efficiency, safety, and job quality in industrial sectors [3]
- Gecko's robots collect data on infrastructure integrity, enabling predictive maintenance and operational efficiencies [4][8]
- By providing this data, Gecko aims to help clients improve heat rates, reduce shutdowns, and eliminate outages [7]
- Gecko offers a 10x increase in data sets by deploying robots and sensors while leveraging existing client data [8]

The Importance of Data in AI
- Winning the data race is crucial for winning the AI race, as data fuels algorithms [10]
- The physical world lacks readily available data, hindering the application of AI [11]
- Decoding the physical world is essential to realize AI's potential in previously overlooked sectors [11][12]

Market Dynamics and Future Implications
- Companies that adopt AI pragmatically will likely outperform those that don't, potentially shifting industry leadership [13]
- Investment in data collection is needed to unlock economic efficiencies through AI [9]
OpenThoughts: Data Recipes for Reasoning Models — Ryan Marten, Bespoke Labs
AI Engineer· 2025-07-19 21:10
Open Thoughts Project Overview
- Bespoke Labs released Open Thoughts 3, aiming to create the best open-source reasoning datasets [1][9]
- The Open Thoughts project focuses on reasoning data recipes, addressing a key missing piece in building strong reasoning models [6][9]
- Open Thoughts 3 outperforms the DeepSeek-R1-Distill-Qwen-7B model across science, code, and math [13]

Dataset Creation and Optimization
- The dataset pipeline comprises question sourcing, mixing, filtering, answer generation, and answer filtering [17]
- Experiments produced over 5,000 datasets and nearly 3,000 models to rigorously evaluate the different choices at each pipeline step [18]
- Sampling multiple reasoning traces per question works well: at a fixed question count, performance does not degrade, allowing the data to scale up 16x [19][20]
- Synthetic questions are scalable and can further improve accuracy [22]
- Question filtering uses a language model to assess question difficulty and answer length, selecting high-quality questions [23]

Key Learnings and Findings
- A small number of high-quality data sources beats a large number of diverse sources [25]
- For SFT and knowledge distillation, filtering on answers or verifying answers does not appear to help [26]
- A model that scores higher on evaluation benchmarks is not necessarily a better teacher; for example, Qwen 32B is a better teacher model than DeepSeek R1 [21]
- Through knowledge distillation, a model can surpass its teacher in certain domains, for example legal reasoning [35][36][37]

Practical Recommendations
- Adapt the data recipe to your specific domain, iterating from the Open Thoughts recipe [29]
- For different domains such as code, science, and math, study each pipeline step separately [29][30]
- If domain-specific data is scarce, convert existing data into questions and use in-context examples to generate more [32]
- Evaluation is critical; use open-source libraries such as Evalchemy to confirm that model changes are real improvements [33][34]
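The pipeline steps the talk describes (question filtering, then sampling several reasoning traces per question from a teacher model) can be sketched roughly as below. This is a minimal illustration, not the actual Open Thoughts code: the function names, the stored `difficulty` scores standing in for an LLM judge, and the `stub_teacher` standing in for a distillation teacher are all hypothetical.

```python
def filter_questions(questions, min_difficulty=3):
    """Keep only questions judged sufficiently difficult.
    In the real recipe an LLM rates difficulty; here the score is
    precomputed and stored on each question (an assumption)."""
    return [q for q in questions if q["difficulty"] >= min_difficulty]


def generate_traces(question, teacher, samples=4):
    """Sample several reasoning traces per question from a teacher model.
    The talk reports that reusing each question this way lets the data
    scale up (e.g. 16x) without hurting downstream performance."""
    return [teacher(question["text"], seed=s) for s in range(samples)]


def build_dataset(questions, teacher, samples=4):
    """Filter questions, then expand each survivor into multiple SFT rows."""
    filtered = filter_questions(questions)
    return [
        {"question": q["text"], "trace": t}
        for q in filtered
        for t in generate_traces(q, teacher, samples)
    ]


# Stub teacher standing in for a real distillation model such as Qwen 32B.
def stub_teacher(text, seed=0):
    return f"reasoning for '{text}' (sample {seed})"


questions = [
    {"text": "Prove sqrt(2) is irrational", "difficulty": 4},
    {"text": "What is 2 + 2?", "difficulty": 1},
]
dataset = build_dataset(questions, stub_teacher, samples=4)
# The easy question is filtered out; 1 question x 4 traces = 4 rows.
```

The point of the structure is that filtering and trace sampling are independent knobs: tightening `min_difficulty` trades volume for quality, while raising `samples` scales volume from a fixed question pool, matching the 16x finding above.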