AI Engineer
Why should anyone care about Evals? — Manu Goyal, Braintrust
AI Engineer · 2025-06-27 10:51
An introduction to the evals track. About Manu Goyal: Manu Goyal is the founding engineer at Braintrust. Previously, he developed autonomous systems at Nuro. He has an 8-year-old Pomeranian named Hendrix. Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter ...
To the moon! Navigating deep context in legacy code with Augment Agent — Forrest Brazeal, Matt Ball
AI Engineer · 2025-06-27 10:46
Welcome everyone. Thank you so much for coming. My name is Forrest. This is Matt. And we're going to be talking to you today about Augment Agent and, specifically, legacy code: how we get the most out of gnarly legacy code bases using an AI agent. So I do not work for Augment Code. I am a friend and partner of Augment Code, so I helped to put this talk together. Matt is from Augment Code, so he's going to be your best person to come to with your most detailed technical questions after the session. M ...
Ship it! Building Production Ready Agents — Mike Chambers, AWS
AI Engineer · 2025-06-27 10:45
Generative AI and Agent Technology
- Amazon Web Services (AWS) specializes in generative AI, evolving from machine learning [1]
- The presentation focuses on deploying generative AI agents to cloud scale, targeting both developers and leaders [1]
- The core components of an agent include a model for natural language understanding, a prompt defining the agent's role, an agentic loop for processing input and using tools, history for maintaining context, and tools for external interaction [1][2]
- AWS Bedrock offers a suite of capabilities for building generative AI components, including models from Anthropic, Meta, and Mistral [2]
- Amazon Bedrock Agents is a fully managed service for deploying agents without infrastructure management [2]

Practical Implementation and Tools
- The demonstration uses a simple Python agent with a dice-rolling tool, initially running locally on a laptop with the Llama 3 8-billion-parameter model [1]
- The agent is configured with instructions (similar to a prompt) and action groups, which connect to tools [2]
- Lambda functions are used to host the tools, enabling them to perform various actions, including interacting with other AWS services [2]
- The AWS console provides a user interface for creating and configuring agents, including defining parameters and descriptions for tools [3][4][5][6][7][8][9][10][11][12][13][14][15]
- Amazon Q Developer is integrated into the console's code editor, offering code suggestions [17][18][19][20][21]

Deployment and Scalability
- The presentation emphasizes deploying agents to a production-ready, cloud-scale environment [1]
- Infrastructure-as-code frameworks like Terraform, Pulumi, and CloudFormation can be used for deployment [3]
- AWS offers free courses on deeplearning.ai with AWS environments for experimenting with Amazon Bedrock Agents [25]
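The agent anatomy the talk describes (model, prompt, agentic loop, history, tools) can be sketched in a few lines of Python. This is a toy standalone loop with a dice-rolling tool in the spirit of the demo, not Bedrock's actual API; `fake_model` is a hypothetical stand-in for the LLM (e.g. Llama 3 8B in the demo).

```python
import random
import re

# Hypothetical tool registry: the demo agent exposes one dice-rolling tool.
def roll_dice(sides: int) -> int:
    """Roll a single die with the given number of sides."""
    return random.randint(1, sides)

TOOLS = {"roll_dice": roll_dice}

def fake_model(prompt: str, history: list[str]) -> str:
    """Stand-in for the LLM: decides whether to call a tool or answer directly."""
    match = re.search(r"roll a (\d+)-sided die", prompt)
    if match:
        return f"TOOL:roll_dice:{match.group(1)}"
    return "I can only roll dice."

def agentic_loop(user_input: str) -> str:
    """One pass of the loop: model -> (optional) tool call -> final answer."""
    history: list[str] = [user_input]          # context carried across steps
    decision = fake_model(user_input, history)
    if decision.startswith("TOOL:"):
        _, name, arg = decision.split(":")
        result = TOOLS[name](int(arg))          # invoke the selected tool
        history.append(f"tool {name} returned {result}")
        return f"You rolled a {result}."
    return decision
```

In the talk's production setup, `roll_dice` would live behind a Lambda function wired up via an action group rather than a local dict.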
Data is Your Differentiator: Building Secure and Tailored AI Systems — Mani Khanuja, AWS
AI Engineer · 2025-06-27 10:42
As organizations seek to harness their proprietary data while maintaining security and compliance, Amazon Bedrock provides a comprehensive framework for building tailored AI applications. Using Amazon Bedrock Knowledge Bases and Amazon Bedrock Data Automation, organizations can create AI solutions that truly understand their unique business context, terminology, and requirements. Combined with Amazon Bedrock Guardrails, these capabilities enhance the accuracy and relevance of AI-generated responses, while e ...
Milliseconds to Magic: Real‑Time Workflows using the Gemini Live API and Pipecat
AI Engineer · 2025-06-27 10:31
Product Updates
- Gemini Live API GA is now powered by Google's cost-effective thinking model, Gemini 2.5 Flash [1]
- An experimental version of the Live API powered by Google's native audio offering is available for trial, enabling seamless, emotive, steerable, multilingual dialogue [1]

Key Capabilities
- The Gemini Live API combined with Pipecat unlocks capabilities for developers, focusing on session management, turn detection, tool use (including async function calls), proactivity, multilinguality, and integration with telephony and other infrastructure [1]
- Pipecat extends realtime multimodal capabilities to client-side applications such as customer support agents, gaming agents, and tutoring agents [1]

Industry Impact
- Pipecat is a widely used, open-source, vendor-neutral voice agent framework supported by NVIDIA, Google, and AWS, and used by hundreds of startups [1]

Personnel
- Kwindla Kramer (Kwin) from Daily is the originator of Pipecat [1]
- Shrestha Basu Mallick is Group Product Manager and product lead for the Gemini API at Google DeepMind [1]
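A rough illustration of why async function calls matter in a realtime voice session: the agent can launch a slow tool call and keep the conversation moving while it runs. Names like `lookup_order` and `stream_filler` are hypothetical; this is a plain-asyncio sketch, not the Gemini Live API or Pipecat surface.

```python
import asyncio

async def lookup_order(order_id: str) -> str:
    """Slow backend call the agent must not block speech on."""
    await asyncio.sleep(0.05)  # stands in for a real network round-trip
    return f"order {order_id}: shipped"

async def stream_filler(spoken: list[str]) -> None:
    """Keep the turn alive while the tool runs."""
    spoken.append("One moment while I check that...")

async def handle_turn(order_id: str) -> list[str]:
    spoken: list[str] = []
    # Launch the tool call without awaiting it, so the session keeps talking.
    task = asyncio.create_task(lookup_order(order_id))
    await stream_filler(spoken)
    spoken.append(await task)  # join the tool result when it arrives
    return spoken
```

The same shape applies whether the "filler" is streamed speech, a typing indicator, or continued turn handling.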
Realtime Conversational Video with Pipecat and Tavus — Chad Bailey and Brian Johnson, Daily & Tavus
AI Engineer · 2025-06-27 10:30
Core Technology & Products
- Tavus offers a conversational video interface, an end-to-end pipeline for conversations with AI replicas, with a response time around 600 milliseconds [9]
- Tavus's proprietary models, Sparrow Zero and Raven Zero, are being integrated into Pipecat [10][11]
- Pipecat is an open-source framework designed as an orchestration layer for real-time AI, handling input, processing, and output of media [15][18]
- Pipecat uses frames, processors, and pipelines to manage data flow, with processors handling frames of audio, video, or voice activity detection [23][24]

Strategic Partnership & Integration
- Tavus and Pipecat are partnering to enhance conversational AI, leveraging Pipecat's capabilities for real-time observability and control [8]
- Enterprise customers are using Pipecat and want to integrate Tavus's technology within it, leading Tavus to move its best models into Pipecat [39]
- Tavus is integrating its Phoenix rendering model, turn-taking, response timing, and perception models into Pipecat [39][40]

Future Development & Deployment
- Tavus is developing a multilingual turn detection model to improve conversational AI speed and prevent interruptions [41]
- Tavus is working on a response timing model to adjust response speed based on conversation context [42][43]
- Tavus's multimodal perception model will analyze emotions and surroundings to provide more nuanced conversational flow [44]
- Pipecat Cloud offers a solution for deploying bots at scale, simplifying the process without requiring Kubernetes expertise [49]
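The frames/processors/pipelines model described in the talk can be sketched as a toy in Python. Class and method names here mirror the concepts, not Pipecat's real API: frames flow through an ordered chain of processors, each transforming the frame kinds it cares about and passing the rest through.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    kind: str   # e.g. "audio", "text", "vad"
    data: str

class Processor:
    def process(self, frame: Frame) -> Frame:
        return frame  # pass-through by default

class Uppercase(Processor):
    """Example processor: transform text frames, pass others through."""
    def process(self, frame: Frame) -> Frame:
        if frame.kind == "text":
            return Frame("text", frame.data.upper())
        return frame

class Pipeline:
    """Push each frame through the processors in order."""
    def __init__(self, processors: list[Processor]):
        self.processors = processors

    def run(self, frames: list[Frame]) -> list[Frame]:
        out = []
        for frame in frames:
            for proc in self.processors:
                frame = proc.process(frame)
            out.append(frame)
        return out
```

In a real voice agent, the chain would be something like transport input, VAD, STT, LLM, TTS, transport output, with each stage as a processor.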
Vector Search Benchmark[eting] - Philipp Krenn, Elastic
AI Engineer · 2025-06-27 10:28
Vector Database Benchmarking Challenges
- The vector database market is filled with misleading benchmarks, where every database claims to be both faster and slower than its competitors [1]
- Meaningful vector search benchmarks are uniquely tricky to build [1]
- It is crucial to tailor benchmarks to specific use cases to get useful results [1]
- Benchmarks should be tweaked and verified independently to avoid blindly trusting marketing claims [1]

Recommendations for Benchmarking
- Avoid trusting glossy charts and marketing materials when evaluating vector databases [1]
- Build meaningful benchmarks tailored to specific use cases to get accurate performance assessments [1]
- Independently verify and tweak benchmarks to ensure they reflect real-world performance [1]

About the Speaker
- Philipp Krenn leads Developer Relations at Elastic, the company behind Elasticsearch, Kibana, Beats, and Logstash [1]
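One way to follow this advice is to benchmark on your own data: measure recall@k of a candidate index against an exact scan, alongside wall-clock latency, rather than trusting vendor charts. The sketch below is pure Python; the "approximate" search is a crude subsampling placeholder standing in for a real ANN index you would plug in.

```python
import random
import time

def dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_top_k(vectors, query, k):
    """Ground truth: brute-force scan over all vectors."""
    return sorted(range(len(vectors)), key=lambda i: dist(vectors[i], query))[:k]

def approx_top_k(vectors, query, k, sample_frac=0.5, rng=None):
    """Placeholder 'index': search a random subsample (swap in a real ANN index)."""
    rng = rng or random.Random(0)
    idx = rng.sample(range(len(vectors)), int(len(vectors) * sample_frac))
    return sorted(idx, key=lambda i: dist(vectors[i], query))[:k]

def recall_at_k(truth, found):
    return len(set(truth) & set(found)) / len(truth)

def benchmark(vectors, queries, k=10):
    """Average recall@k and total elapsed time over a query set."""
    recalls, t0 = [], time.perf_counter()
    for q in queries:
        truth = exact_top_k(vectors, q, k)
        found = approx_top_k(vectors, q, k)
        recalls.append(recall_at_k(truth, found))
    elapsed = time.perf_counter() - t0
    return sum(recalls) / len(recalls), elapsed
```

The point is the harness shape, not the numbers: run it with your own vectors, your own queries, and your own k, and compare candidates under identical conditions.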
Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo
AI Engineer · 2025-06-27 10:27
AI Agent Evaluation & Observability
- The industry emphasizes the necessity of observability in AI development, particularly for evaluation-driven development [1]
- AI trustworthiness is a significant concern, highlighting the need for robust evaluation methods [1]
- Detecting problems in AI is challenging due to its non-deterministic nature, making traditional unit testing difficult [1]

AI-Driven Evaluation
- The industry suggests using AI to evaluate AI, leveraging its ability to understand and identify issues in AI systems [1]
- LLMs can be used to score the performance of other LLMs, with the recommendation to use a better (potentially more expensive or custom-trained) LLM for evaluation than the one used in the primary application [2]
- Galileo offers a custom-trained small language model (SLM) designed for effective AI evaluations [2]

Implementation & Metrics
- Evaluations should be integrated from the beginning of the AI application development process, including prompt engineering and model selection [2]
- Granularity in evaluation is crucial, requiring analysis at each step of the AI workflow to identify failure points [2]
- Key metrics for evaluation include action completion (did it complete the task) and action advancement (did it move towards the goal) [2]

Continuous Improvement & Human Feedback
- AI can provide insights and suggestions for improving AI agent performance based on evaluation data [3]
- Human feedback is essential to validate and refine AI-generated metrics, ensuring accuracy and continuous learning [4]
- Real-time prevention and alerting are necessary to address rogue AI agents and prevent issues in production [8]
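A minimal harness for the LLM-judges-LLM idea might look like the sketch below. `stub_judge` is a hypothetical stand-in for the stronger (or custom-trained) evaluation model the talk recommends, and the metric names mirror the talk's action completion and action advancement; scoring each transcript prefix gives the per-step granularity it calls for.

```python
def stub_judge(task: str, transcript: list[str]) -> dict:
    """Stand-in for an LLM judge. Scores completion and advancement 0..1.
    A real judge would prompt a stronger model with the task and transcript."""
    completed = any("DONE" in step for step in transcript)
    advanced = len(transcript) > 0
    return {
        "action_completion": 1.0 if completed else 0.0,
        "action_advancement": 1.0 if advanced else 0.0,
    }

def evaluate_agent_run(task: str, transcript: list[str], judge=stub_judge) -> dict:
    """Score one agent run at each step, then overall (granularity matters)."""
    per_step = [judge(task, transcript[: i + 1]) for i in range(len(transcript))]
    overall = judge(task, transcript)
    return {"per_step": per_step, "overall": overall}
```

Per-step scores are what let you pinpoint the workflow stage where a run went off the rails, instead of only seeing a failed end state.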
Building agent fleet architectures your CISO doesn't hate — Lou Bichard, Gitpod
AI Engineer · 2025-06-27 10:25
Security is the biggest blocker for agent orchestration adoption in regulated industries for SWE agents. Gitpod's agent orchestration went from an originally self-hosted Kubernetes architecture to the current 'bring your own cloud' model, which enables deployment of our SWE agent orchestration platform in secure environments. The architecture allows customers to securely connect their foundational models and agent memory solutions, and comes with features like auto-suspend and resume for agent fleets. In this tal ...
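The auto-suspend and resume idea could work along these lines: suspend any agent idle past a timeout, and resume it on demand when activity returns. Everything here is an illustrative sketch with made-up names, not Gitpod's implementation.

```python
class Agent:
    def __init__(self, name: str):
        self.name = name
        self.state = "running"
        self.last_active = 0.0

    def touch(self, now: float) -> None:
        """Record activity; a suspended agent resumes on demand."""
        self.last_active = now
        if self.state == "suspended":
            self.state = "running"

class Fleet:
    def __init__(self, idle_timeout: float):
        self.idle_timeout = idle_timeout
        self.agents: dict[str, Agent] = {}

    def add(self, name: str, now: float) -> None:
        agent = Agent(name)
        agent.last_active = now
        self.agents[name] = agent

    def sweep(self, now: float) -> None:
        """Periodic pass: suspend agents idle longer than the timeout."""
        for agent in self.agents.values():
            if agent.state == "running" and now - agent.last_active > self.idle_timeout:
                agent.state = "suspended"
```

The payoff in a fleet setting is cost and attack-surface reduction: idle agents hold no running compute until something touches them again.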
Don’t get one-shotted: Use AI to test, review, merge, and deploy code — Tomas Reimers, Graphite
AI Engineer · 2025-06-27 10:25
Industry Trends
- Software development has two loops: an inner loop focused on development and an outer loop focused on review [1]
- AI adoption is increasing among developers, with nearly every developer surveyed using AI tools [2]
- 46% of code on GitHub is being written by AI, indicating a significant shift in code generation [3]
- The inner loop is changing due to AI, making developers more productive and producing higher volumes of code [3][4]
- The outer loop is becoming a bottleneck as developers have to review, test, merge, and deploy higher volumes of code [5]

Graphite's Solution (Diamond)
- Graphite aims to create a new outer loop to address the challenges posed by increased code volume [6]
- Graphite's AI code review platform, Diamond, focuses on high signal, low noise, and deep understanding of the codebase and change history [13]
- Diamond summarizes, prioritizes, and reviews each change, integrating with CI and testing infrastructure [13]
- Diamond aims to reduce code review cycles, enforce quality and consistency, and keep code private and secure [13]
- Diamond's AI-generated review comments are accepted at a 52% rate, higher than the rate for human comments (45-50%) [15][16]