Large Language Models
Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith
AI Engineer· 2025-07-27 16:15
LLM Evaluation Challenges
- Traditional benchmarks often fail to reflect real-world LLM performance, reliability, and user satisfaction [1]
- Evaluating reasoning quality, agent consistency, MCP integration, and user-focused outcomes requires going beyond standard benchmarks [1]
- Benchmarks and leaderboards rarely reflect the realities of production AI [1]

Evaluation Strategies & Frameworks
- The industry needs tangible evaluation strategies using open-source frameworks like GuideLLM and lm-eval-harness [1]
- Custom eval suites tailored to specific use cases are crucial for accurate assessment [1]
- Integrating human-in-the-loop feedback is essential for better user-aligned outcomes [1]

Key Evaluation Areas
- Evaluating reasoning skills, consistency, and reliability in agentic AI applications is critical [1]
- Validating MCP (Model Context Protocol) and agent interactions with practical reliability tests is necessary [1]
- Agent reliability checks should reflect production conditions [1]

Deployment Considerations
- Robust evaluation is critical for confidently deploying LLMs in real-world applications like chatbots, copilots, or autonomous AI agents [1]
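The "custom eval suite" idea above can be sketched in a few lines. This is a minimal illustration, not GuideLLM or lm-eval-harness code: `EvalCase`, `run_suite`, and `fake_model` are all hypothetical names, and the stand-in model is a lookup table playing the role of an LLM call.

```python
# Minimal sketch of a custom, use-case-specific eval suite using
# exact-match scoring. Names are illustrative, not from any framework.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    prompt: str
    expected: str

def run_suite(model_fn: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases the model answers exactly right."""
    hits = sum(1 for c in cases if model_fn(c.prompt).strip() == c.expected)
    return hits / len(cases)

# Stand-in "model": a lookup table in place of a real LLM call.
def fake_model(prompt: str) -> str:
    return {"2+2=": " 4", "Capital of France?": "Paris"}.get(prompt, "")

cases = [EvalCase("2+2=", "4"), EvalCase("Capital of France?", "Paris")]
score = run_suite(fake_model, cases)
```

In practice the scoring function is where the real work lies: production evals often replace exact match with semantic similarity, rubric grading by a judge model, or the human-in-the-loop feedback the talk emphasizes.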
Waymo's EMMA: Teaching Cars to Think - Jyh Jing Hwang, Waymo
AI Engineer· 2025-07-26 17:00
Autonomous Driving History and Challenges
- Autonomous driving research started in the 1980s with simple neural networks and evolved to end-to-end driving models by 2020 [2]
- Scaling autonomous driving presents challenges, requiring solutions for long-tail events and rare scenarios [5][7]
- Foundation models, like Gemini, show promise in generalizing to rare driving events and providing appropriate responses [8][9][10][11]

EMMA: A Multimodal Large Language Model for Autonomous Driving
- Waymo is exploring EMMA, a driving system leveraging Gemini, which uses routing text and camera input to predict future waypoints [11][12][13][14]
- EMMA is self-supervised, camera-only, and high-definition (HD) map-free, achieving state-of-the-art quality on the nuScenes benchmark [15][16][17]
- Chain-of-thought reasoning is incorporated into EMMA, allowing the model to explain its driving decisions and improve performance on a 100k dataset [17]

Evaluation and Validation
- Evaluation is crucial for the success of autonomous driving models, including open-loop evaluation, simulations, and real-world testing [25]
- Generative models are being explored for sensor simulation to evaluate the planner under various conditions like rain and different times of day [26][27][28]

Future Directions
- Waymo aims to improve generalization and scale autonomous driving by leveraging foundation models [30]
- Training on larger datasets improves the quality of the planner [19][20]
- Waymo is exploring training on various tasks, such as 3D detection and road graph estimation, to create a more generalizable model [21][22][23][24]
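A model that "predicts future waypoints" from text and images has to emit trajectories in some token-friendly form. The sketch below shows one plausible serialization, waypoints as plain text that a language model could generate and a controller could parse back; the format is entirely hypothetical, since the summary does not specify EMMA's actual encoding.

```python
# Hypothetical sketch: round-tripping driving waypoints through a plain
# text format so a language model can emit them as ordinary tokens.
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in meters, ego-centric frame

def waypoints_to_text(points: List[Waypoint]) -> str:
    """Serialize waypoints, e.g. [(1.5, 0.2)] -> "(1.5, 0.2)"."""
    return "; ".join(f"({x:.1f}, {y:.1f})" for x, y in points)

def text_to_waypoints(text: str) -> List[Waypoint]:
    """Parse the serialized form back into coordinate pairs."""
    points = []
    for chunk in text.split(";"):
        x, y = chunk.strip().strip("()").split(",")
        points.append((float(x), float(y)))
    return points

trajectory = [(0.0, 0.0), (1.5, 0.2), (3.1, 0.5)]
encoded = waypoints_to_text(trajectory)
decoded = text_to_waypoints(encoded)
```

The design question such a scheme raises, and one reason evaluation matters so much here, is that a free-form text decoder can emit malformed or physically implausible trajectories, so parsing and validity checks sit between the model and the vehicle.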
X @The Wall Street Journal
Vulnerability & Mitigation
- Large language models like Grok have vulnerabilities that need to be addressed immediately [1]
- Addressing vulnerabilities is crucial as these models gain capabilities beyond language generation [1]
X @The Wall Street Journal
Large language models aren’t replacing traditional browsers anytime soon, but they have become another responsibility for brands https://t.co/n8m7uemRHr ...
X @Bloomberg
Bloomberg· 2025-07-22 11:22
Technology & Finance Convergence
- Large language models are predicted to possess the technical capability to make real investment decisions for clients within five years [1]
X @Avi Chawla
Avi Chawla· 2025-07-21 06:39
LLM Development Stages
- The document outlines four stages for building Large Language Models (LLMs) from scratch for real-world applications [1]
- These stages include pre-training, instruction fine-tuning, preference fine-tuning, and reasoning fine-tuning [1]

Techniques Overview
- The document indicates that these techniques are visually summarized [1]
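The four stages above form a fixed pipeline, each taking the previous stage's checkpoint as its starting point. The sketch below just encodes that ordering; the stage names come from the post, while the one-line objectives are common descriptions of each stage, not quotes from it.

```python
# Illustrative encoding of the four-stage LLM training pipeline.
# Objectives are conventional summaries, not the author's wording.
from dataclasses import dataclass
from typing import List

@dataclass
class Stage:
    name: str
    objective: str

PIPELINE: List[Stage] = [
    Stage("pre-training",
          "next-token prediction on large unlabeled corpora"),
    Stage("instruction fine-tuning",
          "supervised loss on (instruction, response) pairs"),
    Stage("preference fine-tuning",
          "optimize against a human preference signal (e.g. RLHF or DPO)"),
    Stage("reasoning fine-tuning",
          "reward chains of thought that reach verifiably correct answers"),
]

stage_names = [s.name for s in PIPELINE]
```

Each later stage typically uses orders of magnitude less data than pre-training, which is why the stages are run in this order rather than jointly.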
X @Avi Chawla
Avi Chawla· 2025-07-20 06:34
Expertise & Focus
- The author has 9 years of experience training neural networks [1]
- The content focuses on optimizing model training in the fields of Data Science (DS), Machine Learning (ML), Large Language Models (LLMs), and Retrieval-Augmented Generation (RAG) [1]

Content Type
- The author shares tutorials and insights daily on DS, ML, LLMs, and RAG [1]
- The content includes 16 ways to actively optimize model training [1]
360Brew: LLM-based Personalized Ranking and Recommendation - Hamed and Maziar, LinkedIn AI
AI Engineer· 2025-07-16 17:59
Model Building and Training
- LinkedIn leverages large language models (LLMs) for personalization and ranking tasks, aiming to use one model for all tasks [2][3]
- The process involves converting user information into prompts, a method called "promptification" [8]
- LinkedIn builds a large foundation model, Brew XL, with 150 billion parameters, then distills it to smaller, more efficient models like a 3B model for production [12]
- Distillation from a large model is more effective than training a small model from scratch [14]
- Increasing data, model size (up to 8x22B), and context length can improve model performance, but longer contexts may require model adjustments [17][18][19]

Model Performance and Generalization
- The model improves performance for cold-start users, showing a growing gap over production models as interactions decrease [21]
- The model demonstrates generalization to new domains, performing on par with or better than task-specific production models in out-of-domain tasks [23]

Model Serving and Optimization
- LinkedIn focuses on model sparsification, pruning, and quantization to improve throughput and reduce latency in production [26]
- Gradual pruning and distillation are more effective than aggressive pruning, minimizing information loss [29][30]
- Mixed precision, including FP8 for activations and model parameters but FP32 for the LM head, is crucial for maintaining prediction precision [31][32]
- Sparsifying attention scores can reduce latency by allowing multiple item recommendations without each item attending to the others [34][35]
- LinkedIn achieved a 7x reduction in latency and a 30x increase in throughput per GPU through these optimization techniques [36]
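"Promptification", converting structured member and item data into text a decoder-only LLM can score, can be sketched as a simple template. The field names and template below are purely illustrative; LinkedIn's actual prompt format is not described in this summary.

```python
# Hypothetical sketch of "promptification": structured recommendation
# features rendered as a single text prompt for an LLM ranker.
from typing import Dict, List

def promptify(member: Dict[str, str], history: List[str], candidate: str) -> str:
    """Render member profile, interaction history, and one candidate item
    as a yes/no question an LLM can answer (illustrative format)."""
    lines = [
        f"Member headline: {member['headline']}",
        f"Member industry: {member['industry']}",
        "Recently interacted with:",
    ]
    lines += [f"- {item}" for item in history]
    lines.append(f"Candidate item: {candidate}")
    lines.append("Question: will the member engage with the candidate item? Answer yes or no.")
    return "\n".join(lines)

prompt = promptify(
    {"headline": "ML Engineer", "industry": "Software"},
    ["Post about vector databases", "Job: Senior ML Engineer"],
    "Post about LLM serving",
)
```

Serving cost is the catch with this formulation: every candidate item produces its own long prompt, which is why the optimizations above (distillation, pruning, quantization, sparsified attention across candidates) matter so much in production.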
What We Learned from Using LLMs in Pinterest — Mukuntha Narayanan, Han Wang, Pinterest
AI Engineer· 2025-07-16 17:58
Hi everyone, thanks for joining the talk today. We're super excited to be here and share some of the learnings we have from integrating LLMs into Pinterest search. My name is Han, and today I'll be presenting with Mukuntha; we are both machine learning engineers on the search relevance team at Pinterest. To start with a brief introduction to Pinterest: Pinterest is a visual discovery platform where Pinners can come to find inspiration to create a life they love. And there are ...
How LLMs work for Web Devs: GPT in 600 lines of Vanilla JS - Ishan Anand
AI Engineer· 2025-07-13 17:30
Core Technology & Architecture
- The workshop focuses on a GPT-2 inference implementation in Vanilla JS, providing a foundation for understanding modern AI systems like ChatGPT, Claude, DeepSeek, and Llama [1]
- It covers key concepts such as converting raw text into tokens, representing semantic meaning through vector embeddings, training neural networks through gradient descent, and generating text with sampling algorithms [1]

Educational Focus & Target Audience
- The workshop is designed for web developers entering the field of ML and AI, aiming to provide a "missing AI degree" in two hours [1]
- Participants will gain an intuitive understanding of how Transformers work, applicable to LLM-powered projects [1]

Speaker Expertise
- Ishan Anand, an AI consultant and technology executive, specializes in Generative AI and LLMs, and created "Spreadsheets-are-all-you-need" [1]
- He has a background as former CTO and co-founder of Layer0 (acquired by Edgio) and VP of Product Management for Edgio, with expertise in web performance, edge computing, and AI/ML [1]
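Of the concepts listed above, "generating text with sampling algorithms" is the easiest to show compactly. The workshop's own code is in Vanilla JS; the sketch below is a Python illustration of standard temperature sampling over next-token logits, not the workshop's implementation.

```python
# Minimal sketch of temperature sampling over next-token logits.
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=1.0, rng=random):
    """Draw one token index according to the softmax distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
# Lowering the temperature concentrates mass on the argmax token.
sharp = softmax(logits, temperature=0.1)
```

At temperature approaching zero this degenerates to greedy decoding (always pick the argmax), while higher temperatures flatten the distribution and make generations more diverse.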