Benchmarks
Benchmarks Are Memes: How What We Measure Shapes AI—and Us - Alex Duffy
AI Engineer · 2025-07-15 17:05
Benchmarks as Memes in AI
- Benchmarks are presented as memes that shape AI development, influencing what models are trained and tested on [1][3][8]
- The AI industry faces a problem of benchmark saturation: as models become too good at existing benchmarks, their value diminishes [5][6]
- There is an opportunity for individuals to create new benchmarks that define what AI models should excel at, shaping the future of AI capabilities [7][13]

The Lifecycle and Impact of Benchmarks
- The typical benchmark lifecycle involves an idea spreading, becoming a meme, and eventually being saturated as models train on it [8]
- Benchmarks can have unintended consequences, such as reinforcing biases if not designed thoughtfully, as seen with ChatGPT's thumbs-up/thumbs-down feedback [14]
- The industry should focus on creating benchmarks that empower people and promote agency, rather than treating them as mere data points [16]

Qualities of Effective Benchmarks
- Great benchmarks should be multifaceted, reward creativity, be accessible to both small and large models, and be generative, evolutionary, and experiential [17][18][19]
- The industry needs more "squishy," non-static benchmarks for areas like ethics, society, and art, which require subject matter expertise [34][35]
- Benchmarks can build trust in AI by allowing people to define goals, provide feedback, and see AI improve, fostering a sense of importance and control [37]

AI Diplomacy Benchmark
- AI Diplomacy is presented as an example of a benchmark that mimics real-world situations, testing models' abilities to negotiate, form alliances, and betray one another (see the sketch after this list) [20][22][23]
- The AI Diplomacy benchmark revealed distinct personality traits in different models, such as o3 being a schemer and Claude models being naively optimistic [24][25][30]
- The AI Diplomacy benchmark highlighted the importance of social skill and persuasion, with models like Llama performing well because of their social play [31]
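The following is a minimal sketch of how a Diplomacy-style negotiation loop between LLM agents could be structured; the Agent dataclass, the query_model helper, and the two-phase turn are illustrative assumptions, not Alex Duffy's actual harness.

```python
# A hedged sketch of a Diplomacy-style benchmark loop for LLM agents.
# query_model is a hypothetical stand-in for a real model API call.
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str                         # e.g. "o3", "claude", "llama"
    supply_centers: int = 3           # simple proxy for board strength
    inbox: list = field(default_factory=list)


def query_model(agent: Agent, prompt: str) -> str:
    """Hypothetical call into the underlying LLM; swap in a real client here."""
    return f"{agent.name}: proposes an alliance"  # placeholder response


def play_round(agents: list[Agent]) -> None:
    # Negotiation phase: every agent messages every other agent.
    for speaker in agents:
        for listener in agents:
            if listener is not speaker:
                msg = query_model(
                    speaker,
                    f"Negotiate with {listener.name}. Messages so far: {listener.inbox}",
                )
                listener.inbox.append(msg)
    # Order phase: each agent commits moves privately; promises are not enforced,
    # so the benchmark rewards persuasion and punishes naive trust.
    for agent in agents:
        query_model(agent, f"Submit secret orders. Centers held: {agent.supply_centers}")


agents = [Agent("o3"), Agent("claude"), Agent("llama")]
play_round(agents)
```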
The Becoming Benchmark | Chimezie Nwabueze | TEDxBAU Cyprus
TEDx Talks · 2025-06-25 15:56
Personal Development & Success Measurement
- Traditional benchmarks like KPIs, productivity trackers, social media metrics, and achievements are often used as measures of personal success, but they can be misleading [8]
- The speaker proposes shifting the focus from "doing" and "achieving" to "becoming," emphasizing internal growth in areas like capacity, compassion, courage, integrity, grit, and kindness [9][11][12]
- The "becoming scorecard" involves reflecting on daily growth in areas like self-awareness, courage, compassion, and learning [17][18][19][20][21]

Overcoming Fear of Failure
- Fear of failure can keep individuals from pursuing ventures that might not succeed, limiting their opportunities [4]
- The speaker's own choice to avoid challenging subjects in university qualification exams, out of fear of failing them, led to difficulties later on [5][6][7]
- Choosing courses relevant to personal growth, even challenging ones, aligns with the development one is pursuing [19]

Fulfillment & Long-Term Impact
- External validations and benchmarks provide fleeting happiness, while internal character development leads to lasting fulfillment [14][26]
- Focusing on "becoming" ensures that even if goals are not met, the personal growth achieved along the way provides satisfaction [25][26]
- The speaker's mentor advised reflecting daily on what was learned and how one grew, creating an internal compass for measuring what matters [15][16]
Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis
AI Engineer · 2025-06-11 15:40
AI Safety Benchmarks Analysis
- The paper analyzes AI safety benchmarks, a novel approach, as no prior work has examined the semantic extent these benchmarks cover [13]
- The research identifies six primary harm categories within AI safety benchmarks, noting varying coverage and breadth across benchmarks [54]
- Semantic coverage gaps exist across recent benchmarks and will evolve as the definition of harm changes [55]

Methodology and Framework
- The study introduces an optimized clustering configuration framework applicable to other benchmarks with similar topical use or LLM applications, demonstrating its scalability [55]
- The framework involves appending benchmarks into a single dataset, cleaning the data, and iteratively developing unsupervised learning clusters to identify harms (a code sketch of this pipeline follows below) [17][18]
- The methodology uses embedding models (such as MiniLM and MPNet), dimensionality reduction techniques (t-SNE and UMAP), and distance metrics (Euclidean and Mahalanobis) to optimize cluster separation [18][19][41]

Evaluation and Insights
- Plotting the semantic space offers a transparent evaluation approach, providing more actionable insights than traditional ROUGE and BLEU scores [56]
- The research highlights the potential for bias in clusters and variation in prompt lengths across benchmarks [51]
- The study acknowledges limitations, including methodological constraints, information loss from dimensionality reduction, biases in embedding models, and the limited scope of analyzed benchmarks [51][52]

Future Research Directions
- Future research could explore harm benchmarks across diverse cultural contexts and investigate prompt-response relationships [53]
- Applying the methodology to domain-specific datasets could further reveal differences and insights [53]
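The following is a minimal sketch of the pooling-embedding-clustering pipeline described above, assuming the benchmark prompts have already been merged and cleaned into a list of strings; the MiniLM checkpoint, the cluster-count grid, and the load_benchmark_prompts helper are illustrative assumptions rather than the paper's exact configuration.

```python
# A hedged sketch of the benchmark-clustering pipeline: embed pooled safety
# prompts, reduce dimensionality, and pick the cluster count with the best
# silhouette separation.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import umap


def cluster_safety_prompts(prompts, n_clusters_grid=(4, 6, 8, 10)):
    # 1. Embed the pooled benchmark prompts (MiniLM here; MPNet is a drop-in swap).
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(prompts, show_progress_bar=False)

    # 2. Project into a low-dimensional semantic space (UMAP; t-SNE is an alternative).
    reduced = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)

    # 3. Iterate over candidate cluster counts and keep the configuration that
    #    maximizes silhouette score (Euclidean here; Mahalanobis is another option).
    best = None
    for k in n_clusters_grid:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(reduced)
        score = silhouette_score(reduced, labels, metric="euclidean")
        if best is None or score > best[0]:
            best = (score, k, labels)
    return reduced, best


# Usage: pool prompts from several safety benchmarks into one list first.
# prompts = load_benchmark_prompts(["benchmark_a.jsonl", "benchmark_b.jsonl"])  # hypothetical helper
# coords, (score, k, labels) = cluster_safety_prompts(prompts)
```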