These papers focus on reducing, quantizing, or reusing the KV cache to improve generative inference efficiency and lower memory consumption.
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference - Muhammad Adnan et al.
Prompt Cache: Modular Attention Reuse for Low-Latency Inference - In Gim et al.
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache - Zhenyu Zhang et al.
These papers target multi-tenant and large-scale serving of LoRA adapters, offering scalable deployment solutions.
Punica: Multi-Tenant LoRA Serving - Lequn Chen et al.
SLoRA: Scalable Serving of Thousands of LoRA Adapters - Ying Sheng et al.
These papers accelerate LLM inference through asynchronous, parallel, or heterogeneous methods, especially in resource-constrained environments or for MoE models.
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics - Ke Hong et al.
HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices - Xuanlei Zhao et al.
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models - Zhixu Du et al.
These papers use quantization to compress LLMs for efficient serving and acceleration.
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving - Yilong Zhao et al.
These papers provide unified frameworks, simulators, and data-manipulation tools for LLM inference, supporting evaluation and deployment.
Vidur: A Large-Scale Simulation Framework for LLM Inference - Amey Agrawal et al.
UniDM: A Unified Framework for Data Manipulation with Large Language Models - Yichen Qian et al.
These papers explore model compression and quantization methods to reduce memory and compute overhead.
JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training - Mohamed Ibrahim et al.
Schrodinger's FP: Training Neural Networks with Dynamic Floating-Point Containers - Milos Nikolic et al.
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration - Ji Lin et al.
Does Compressing Activations Help Model Parallel Training? - Song Bian et al.
QMoE: Sub-1-Bit Compression of Trillion Parameter Models - Elias Frantar et al.
Efficient Post-training Quantization with FP8 Formats - Haihao Shen et al.
L-GreCo: Layerwise-adaptive Gradient Compression for Efficient Data-parallel Deep Learning - Ilia Markov et al.
These papers address heterogeneity, efficiency, and platform design in federated learning.
HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning - Gyudong Kim et al.
FedTrans: Efficient Federated Learning via Multi-Model Transformation - Yuxuan Zhu et al.
LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning - Shixiong Qi et al.
These papers optimize the model training process, covering batching, computation-communication overlap, and pipeline design.
ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time - Pratik Fegade et al.
Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping - Chenyu Jiang et al.
DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines - Ye Tian et al.
Distributed Matrix-Based Sampling for Graph Neural Network Training - Alok Tripathy et al.
These papers cover privacy-preserving inference, homomorphic encryption, and model confidentiality.
Accelerating ReLU for MPC-Based Private Inference with a Communication-Efficient Sign Estimation - Kiwan Maeng et al.
Accurate Low-Degree Polynomial Approximation of Non-Polynomial Operators for Fast Private Inference in Homomorphic Encryption - Jingtian Dang et al.
Proteus: Preserving Model Confidentiality during Graph Optimizations - Yubo Gao et al.
These papers optimize inference and deployment for specific hardware, such as MCUs and accelerators.
vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs - Size Zheng et al.
Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design - Jian Meng et al.
These papers provide benchmarks, simulation frameworks, and explanation tools.
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation - Yifei Xu et al.
COMET: Neural Cost Model Explanation Framework - Isha Chaudhary et al.
On Latency Predictors for Neural Architecture Search - Yash Akhauri et al.
FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms - Haoran Qiu et al.
These papers target specific application domains such as recommendation systems, video analytics, GNNs, and autonomous systems.
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation - Liang Luo et al.
VQPy: An Object-Oriented Approach to Modern Video Analytics - Shan Yu et al.
Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems - Yunhao Yang et al.