This list collects high-quality blog posts and papers I have gathered, both ones I have read and ones I have not yet read.
Site: 科学空间 (Scientific Spaces)
The Amazing Johnson-Lindenstrauss Lemma: Theory
The Amazing Johnson-Lindenstrauss Lemma: Applications
A First Look at Stochastic Tokenization: From Viterbi Decoding to Viterbi Sampling
Stochastic Tokenization Revisited: From Viterbi Sampling to a Perfect Sampling Algorithm
Lion, the Optimizer Google Found by Search: A "Training Lion" That Is Both Efficient and Effective
Transformer Upgrade Path: 1. Tracing Sinusoidal Position Encoding to Its Roots
Transformer Upgrade Path: 2. Rotary Position Embedding, Drawing on Many Strengths
Transformer Upgrade Path: 3. From Performer to Linear Attention
Transformer Upgrade Path: 4. Rotary Position Embedding for 2D Positions
Transformer Upgrade Path: 5. Attention as Infinite-Dimensional Linear Attention
Transformer Upgrade Path: 6. A Completeness Analysis of Rotary Position Embedding
Transformer Upgrade Path: 9. A New Approach to Global Length Extrapolation
Transformer Upgrade Path: 10. RoPE Is a Base-β Encoding
Transformer Upgrade Path: 12. ReRoPE for Unbounded Extrapolation?
Transformer Upgrade Path: 13. Using Leaky ReRoPE in Reverse
Transformer Upgrade Path: 14. When HWFA Meets ReRoPE
Transformer Upgrade Path: 15. Key Normalization Helps Length Extrapolation
Transformer Upgrade Path: 17. Some Simple Thoughts on Multimodal Position Encoding
Transformer Upgrade Path: 18. Principles for Choosing the RoPE Base
Transformer Upgrade Path: 20. What Makes MLA Good? (Part 1)
Transformer Upgrade Path: 21. What Makes MLA Good? (Part 2)
Encoder:
You May Not Need BERT-flow: A Linear Transformation Rivals BERT-flow
CoSENT (Part 1): A More Effective Sentence-Embedding Scheme than Sentence-BERT
Decoder:
The Surprising Role of the Bias Term: RoPE + Bias = Better Length Extrapolation
FAQ for "Why Are Current LLMs All Decoder-only Architectures?"
FLASH: Perhaps the Most Interesting Efficient Transformer Design of Late
Mini-SGLang: Efficient Inference Engine in a Nutshell
FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
Sorting-Free GPU Kernels for LLM Sampling
FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding
Accelerating Self-Attentions for LLM Serving with FlashInfer
陈巍: A 20,000-Word Analysis of the Architecture and Training Techniques of DeepSeek V3/R1 (Part 1)
陈巍: A 20,000-Word Analysis of the Architecture and Training Techniques of DeepSeek V3/R1 (Part 2)
陈巍: DeepSeek Open-Source Day 1: An In-Depth Analysis of FlashMLA
陈巍: DeepSeek Open-Source Day 2: An In-Depth Analysis of DeepEP
陈巍: DeepSeek Open-Source Day 3: An In-Depth Analysis of DeepGEMM
陈巍: DeepSeek Open-Source Day 4: An In-Depth Analysis of DualPipe & EPLB
陈巍: DeepSeek Open-Source Day 5: An In-Depth Analysis of 3FS & smallpond
CUTLASS CuTe GEMM Details (Part 1): Choosing ldmatrix
CUTLASS CuTe GEMM Details (Part 2): TiledCopy and cp.async
CUTLASS CuTe GEMM Details (Part 3): Choosing the Swizzle<B, M, S> Template Parameters
CUTLASS CuTe GEMM Details (Part 4): Common Misconceptions About the B and S Swizzle Template Parameters
Performance Analysis of LLM Decode GQA & GEMV Kernels (Part 1)
Performance Analysis of LLM Decode GQA & GEMV Kernels (Part 2)
An Analysis of the Alignment Parameter in CUTLASS Grouped GEMM
Some Conjectures About the Memory Consistency Model of Modern GPU Architectures (Part 1)
Some Conjectures About the Memory Consistency Model of Modern GPU Architectures (Part 2): Synchronization Performance
How to Build High-Performance Memory-Bound Kernels on B200, Illustrated with an MXFP8 Quantization Kernel
Analyzing the Prefetch Behavior of cp.async with CUTLASS CuTe
Understanding the Tensor Pipe Utilization Reported by Compute Workload Analysis in Nsight Compute
Semantic Data Modeling, Graph Query, and SQL, Together at Last?
Statistical Separations: When do Transformers outperform feed forward and recurrent networks?
Fast ACS: Low-Latency File-Based Ordered Message Delivery at Scale
The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
Vortex: A Stream-oriented Storage Engine For Big Data Analytics
InstructPipe: Generating Visual Blocks Pipelines with Human Instructions and LLMs
Contextual Agent Security: A Policy for Every Purpose
Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
New Future of Work Report 2025
Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale
DroidSpeak: Efficient Context Sharing for Multiple-LLM Inference
Exqutor: Extended Query Optimizer for Vector-augmented Analytical Queries
SIT-Graph: State Integrated Tool Graph for Multi-Turn Agents
ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models
DocReward: A Document Reward Model for Structuring and Stylizing
xKV: Cross-Layer SVD for KV-Cache Compression
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
Palu: KV-Cache Compression with Low-Rank Projection
Catalyst's papers tend to come out in batches; it seems that, apart from a few particular top conferences, they no longer publish at the other venues.
MagicPIG: LSH Sampling for Efficient LLM Generation
MagicDec: Breaking the Latency-Throughput Tradeoff for Long-Context Generation with Speculative Decoding
AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding
Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow
Transitive Array: An Efficient GEMM Accelerator with Result Reuse
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven
PipeRAG: Fast Retrieval-Augmented Generation via Adaptive Pipeline Parallelism
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
Demystifying Chains, Trees, and Graphs of Thoughts
Affordable AI Assistants with Knowledge Graph of Thoughts
Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs