Quantization
After Five Years, Transformers v5 Finally Arrives
机器之心· 2025-12-02 06:47
Core Insights
- The article discusses the release of the first release candidate, v5.0.0rc0, of the Transformers library, marking the transition from version 4 to version 5 after a five-year technical cycle [2]
- The library has seen a dramatic increase in usage, with daily downloads rising from 20,000 at the time of v4's release to over 3 million today, and total installations surpassing 1.2 billion [2]
- The core focus of the v5 update is simplicity, pre-training, interoperability with high-performance inference engines, and making quantization a core feature [2][3]

Evolution and Features
- The v5 release establishes PyTorch as the sole core backend and emphasizes four key dimensions of evolution: extreme simplicity, a shift from fine-tuning to pre-training, interoperability with high-performance inference engines, and enhanced quantization capabilities [2]
- The team aims for a clean and clear model integration approach, promoting broader standardization and stronger generality [4]
- Over the past five years, an average of 1-3 new models have been added weekly, with the goal of becoming the single trusted source for model definitions [4]

Modular Design and Tools
- Hugging Face has advanced a modular design approach, simplifying maintenance and speeding up integration while fostering community collaboration [6]
- The introduction of the AttentionInterface provides a centralized abstraction layer for attention mechanisms, streamlining the management of common auxiliary functions [8]
- Tools are being developed to identify similarities between new models and existing architectures, aiming to automate the conversion of models into the Transformers format [9][10]

Training Enhancements
- The v5 release expands support for pre-training, with redesigned model initialization and support for optimized forward- and backward-pass operators [15][16]
- Hugging Face continues to collaborate closely with fine-tuning tools in the Python ecosystem and ensures compatibility with tools in the JAX ecosystem [17]

Inference Improvements
- Inference is a key focus of the v5 update, introducing dedicated kernels, cleaner defaults, new APIs, and optimized support for inference engines [18][19]
- The v5 release aims to complement specialized inference engines rather than replace them, ensuring compatibility with engines such as vLLM, SGLang, and TensorRT-LLM [21]

Local Deployment and Quantization
- The team collaborates with popular inference engines so that Transformers can serve as a backend, increasing the value of models added to Transformers [23]
- Quantization is positioned as a core capability of Transformers, ensuring compatibility with major functionality and providing a reliable framework for training and inference [27] (a quantized-loading sketch follows below)
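To make the "quantization as a core feature" point concrete, here is a minimal sketch of quantized model loading using the quantization-config API that already exists in the v4 series; the exact v5 surface may differ, and the model id below is purely illustrative.

```python
# Minimal sketch: 4-bit quantized loading via BitsAndBytesConfig.
# The model id is illustrative; v5 API details may differ from v4.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bf16
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```

The quantized model can then go through the usual generation and training entry points like any other loaded model.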
X @Polyhedra
Polyhedra· 2025-10-31 12:00
4/ Gemma3 Quantization: Introduced front-end padding for variable-length inputs to enable flexible inference. Next steps (circuit optimization): plan to prune Gemma3 nodes during circuitization to reduce redundancy and improve proof efficiency. Stay tuned for more updates. ...
X @Avi Chawla
Avi Chawla· 2025-10-30 06:31
voyage-3-large embedding model just topped the RTEB leaderboard! It's a big deal because it:
- ranks first across 33 eval datasets
- outperforms OpenAI and Cohere models
- supports quantization to reduce storage costs
Here's another reason that makes this model truly superior: most retrieval benchmarks test models on academic datasets that don't reflect real-world data. RTEB, on the other hand, is a newly released leaderboard on Hugging Face that evaluates retrieval models across enterprise domains like finance, la ...
Hard Work is Useless. This is What Matters. | Manav Gupta | TEDxGHRCEMN
TEDx Talks· 2025-10-24 15:23
Career Advice for College Students in the AI Era
- Traditional hard work alone is incomplete; direction is crucial for success, emphasizing a shift toward treating effort as a vector quantity in career planning [1][2]
- Focus on "why" and the desired outcome before acting, aligning actions with career goals [2]
- The speaker shares personal experiences and advice on navigating the tech landscape, particularly in AI [2]

Key Trends in AI
- The AI sector is currently dominated by business-to-consumer (B2C) companies such as Lovable, Bolt, Perplexity, and OpenAI [2]
- A critical missing element in the AI boom is the infrastructure to support its growth [2][3]
- Infrastructure and distribution are the key areas for college students to focus on to gain an advantage [3]

Technical Skills and Opportunities
- Quantization, a technique for running heavy systems on fewer resources, is crucial given upcoming resource shortages in AI [5][9]
- Learning AI infrastructure, especially quantization, requires dedicated effort (6-10 months) to understand the underlying engine [9][10]
- Distribution, or effectively reaching people, is vital in today's AI landscape, where creation has become easier [10][11]

Personal Branding and Networking
- Actively engage on social media platforms like LinkedIn to build a personal brand and leverage distribution [15]
- Asymmetrical returns are possible through distribution, where efforts can lead to disproportionately large outcomes [15][16]
- Off-campus efforts are essential to stand out, since relying solely on college placements is insufficient given the large number of students [17]
X @Avi Chawla
Avi Chawla· 2025-10-18 06:31
Model Quantization
- Keras enables model quantization with a single line of code [1]
- Supports quantization to int4, int8, float8, and GPTQ modes [1]
- Can quantize a user's own models or pre-trained models from KerasHub [1] (a minimal sketch of the one-line call follows below)
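A minimal sketch of what that one-line call looks like in Keras 3. The toy model below is purely illustrative, and which modes are accepted depends on the installed Keras/KerasHub version.

```python
# Minimal sketch of one-line quantization in Keras 3 (toy model, illustrative).
import keras

inputs = keras.Input(shape=(128,))
x = keras.layers.Dense(256, activation="relu")(inputs)
outputs = keras.layers.Dense(10)(x)
model = keras.Model(inputs, outputs)

# Post-training quantization in one call; per the post, "int4", "int8",
# "float8", and "gptq" modes are available (version-dependent).
model.quantize("int8")
```

A pre-trained KerasHub model loaded via its `from_preset(...)` constructor can, in principle, be quantized the same way, since it is an ordinary Keras model.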
X @Polyhedra
Polyhedra· 2025-09-25 12:00
6/ Currently working on Gemma3 quantization, focusing on:
- Learning the new model architecture
- Adding KV cache support (which accelerates inference; a toy illustration follows below)
- Implementing quantization support for some new operators
Full operator support will require at least one more day, plus additional time for accuracy testing. Stay tuned for more updates 🔥 ...
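For readers unfamiliar with why a KV cache speeds up decoding, here is a toy numpy illustration (entirely separate from Polyhedra's circuit work): keys and values for past tokens are cached once, so each step only computes attention for the newly generated token.

```python
# Toy single-head attention step with a growing KV cache (illustrative only).
import numpy as np

d = 64
cache_k, cache_v = [], []          # one cached key/value per generated token

def attend_step(x_new, Wq, Wk, Wv):
    q = x_new @ Wq                 # query for the new token only
    cache_k.append(x_new @ Wk)     # past keys/values are never recomputed
    cache_v.append(x_new @ Wv)
    K, V = np.stack(cache_k), np.stack(cache_v)
    scores = (K @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V             # attention output for the new token

rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
for _ in range(5):                 # decode a few toy steps
    out = attend_step(rng.standard_normal(d), Wq, Wk, Wv)
```

Without the cache, every decoding step would recompute the key/value projections for the entire prefix rather than just the new token.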
X @Avi Chawla
Avi Chawla· 2025-08-14 06:33
Model Capabilities
- Voyage-context-3 supports 2048, 1024, 512, and 256 dimensions and offers quantization [1]

Cost Efficiency
- Voyage-context-3 (int8, 2048 dimensions) reduces vector database costs by 83% compared with OpenAI-v3-large (float, 3072 dimensions) [1]

Performance
- Voyage-context-3 delivers 860% better retrieval quality [1]
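To make the storage-cost argument concrete, here is a rough sketch of scalar int8 quantization of embedding vectors (my own illustration, not Voyage's actual scheme): storing int8 codes plus per-dimension scales shrinks the payload roughly 4x relative to float32, on top of any savings from using fewer dimensions.

```python
# Rough sketch: per-dimension symmetric int8 quantization of embeddings
# (illustrative only; not the scheme used by voyage-context-3).
import numpy as np

def quantize_int8(embeddings: np.ndarray):
    scales = np.abs(embeddings).max(axis=0) / 127.0   # one scale per dimension
    scales[scales == 0] = 1.0                          # guard against all-zero dims
    codes = np.clip(np.round(embeddings / scales), -127, 127).astype(np.int8)
    return codes, scales

def dequantize_int8(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scales

vectors = np.random.randn(1000, 1024).astype(np.float32)  # stand-in embeddings
codes, scales = quantize_int8(vectors)
print(vectors.nbytes, codes.nbytes)  # ~4x smaller stored payload
```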
360Brew: LLM-based Personalized Ranking and Recommendation - Hamed and Maziar, LinkedIn AI
AI Engineer· 2025-07-16 17:59
Model Building and Training
- LinkedIn leverages large language models (LLMs) for personalization and ranking tasks, aiming to use one model for all tasks [2][3]
- The process involves converting user information into prompts, a method called "promptification" [8]
- LinkedIn builds a large foundation model, Blue XL, with 150 billion parameters, then distills it into smaller, more efficient models, such as a 3B model, for production [12]
- Distillation from a large model is more effective than training a small model from scratch [14]
- Increasing data, model size (up to 8x22B), and context length can improve model performance, but longer contexts may require model adjustments [17][18][19]

Model Performance and Generalization
- The model improves performance for cold-start users, showing a growing gap over production models as the number of interactions decreases [21]
- The model generalizes to new domains, performing on par with or better than task-specific production models on out-of-domain tasks [23]

Model Serving and Optimization
- LinkedIn focuses on model specification, pruning, and quantization to improve throughput and reduce latency in production [26]
- Gradual pruning and distillation are more effective than aggressive pruning, minimizing information loss [29][30]
- Mixed precision, with FP8 for activations and model parameters but FP32 for the LM head, is crucial for maintaining prediction precision [31][32] (a rough sketch follows below)
- Sparsifying attention scores can reduce latency by allowing multiple item recommendations without each item attending to every other item [34][35]
- LinkedIn achieved a 7x reduction in latency and a 30x increase in throughput per GPU through these optimization techniques [36]
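The FP32-LM-head point is easy to sketch. Below is a rough PyTorch illustration (not LinkedIn's code, and bfloat16 stands in for FP8): the bulk of the network runs in low precision, while activations are upcast before an FP32 LM head so the final predictions keep full precision.

```python
# Rough sketch of mixed precision with an FP32 LM head (bf16 stands in for FP8).
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab=1000, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.backbone = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, ids):
        h = self.backbone(self.embed(ids))   # low-precision compute
        return self.lm_head(h.float())       # upcast before the FP32 head

model = TinyLM()
model.embed.to(torch.bfloat16)      # low precision for the bulk of the model
model.backbone.to(torch.bfloat16)
model.lm_head.to(torch.float32)     # LM head kept in full precision

ids = torch.randint(0, 1000, (2, 16))
with torch.no_grad():
    logits = model(ids)             # shape (2, 16, 1000), FP32 logits
```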
X @Avi Chawla
Avi Chawla· 2025-06-11 06:30
If you found it insightful, reshare it with your network. Find me → @_avichawla. Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
Avi Chawla (@_avichawla): A great tool to estimate how much VRAM your LLMs actually need. Alter the hardware config, quantization, etc., and get to know:
- Generation speed (tokens/sec)
- Precise memory allocation
- System throughput, etc.
No more VRAM guessing! https://t.co/lZbIink12f ...
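The shortened link is not expanded here, but the quantity such a tool estimates can be approximated by hand. Below is a rough back-of-envelope sketch (my own approximation, not the tool's formula): model weights plus KV cache, ignoring activations and runtime overhead.

```python
# Rough VRAM estimate: weights + KV cache (my approximation, not the tool's formula).
def estimate_vram_gb(params_billions: float, bytes_per_param: float,
                     n_layers: int, kv_dim: int, context_len: int,
                     batch: int = 1, kv_bytes: float = 2.0) -> float:
    weights = params_billions * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, one kv_dim-sized vector per
    # cached token, kv_bytes per element (2 bytes = fp16/bf16 cache).
    kv_cache = 2 * n_layers * context_len * kv_dim * kv_bytes * batch
    return (weights + kv_cache) / 1e9

# Example: a 7B model in 4-bit (~0.5 bytes/param), 32 layers, 8k context,
# and a grouped-query KV dimension of 1024.
print(round(estimate_vram_gb(7, 0.5, 32, 1024, 8192), 2), "GB")
```

Lowering bytes_per_param (e.g. 2.0 for fp16, 1.0 for int8, roughly 0.5 for 4-bit) shows directly how quantization shrinks the weight term, which is the kind of setting the post describes toggling.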