[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without degrading end-to-end metrics across language, image, and video models.
Updated Jan 17, 2026 - Cuda
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch
Implementation of the conditionally routed attention in the CoLT5 architecture, in PyTorch
Official implementation of 'Transformer-VQ: Linear-Time Transformers via Vector Quantization'
PyTorch implementation of Efficient Infinite Context Transformers with Infini-attention, plus a QwenMoE implementation, a training script, and 1M-context passkey retrieval
The official PyTorch implementation for CascadedGaze: Efficiency in Global Context Extraction for Image Restoration, TMLR'24.
Unofficial PyTorch implementation of the paper "cosFormer: Rethinking Softmax In Attention".
Official repository for "SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space"
PyTorch implementation of "Compact Global Descriptor for Neural Networks" (CGD).
Implementation of "Hydra Attention: Efficient Attention with Many Heads" (https://arxiv.org/abs/2209.07484)
Official Implementation of SEA: Sparse Linear Attention with Estimated Attention Mask (ICLR 2024)
Nonparametric Modern Hopfield Models
Two small-scale research threads with pre-registered falsifiable bars + adversarial referee audits: Prizma-Seq (a parameter-free quadratic delta-state sequence mixer, an efficient-attention-replacement candidate) and Prizma (backprop-free, fully-local continual learning).
O(N) attention with a bounded inference KV cache. D4 Daubechies wavelet field + content-gated Q·K gather at dyadic offsets.
HiCI: Hierarchical Construction-Integration for Long-Context Attention
Minimal implementation of Samba by Microsoft in PyTorch
Unofficial PyTorch reproduction for Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.
Tabular foundation model experiments with learned context sampling and efficient attention
Run local AI models on your machine with a secure, Rust-based inference engine that keeps your data private and provides controlled system access.