KV Cache Quantization

2 days ago

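Surpassing TurboQuant: a true 2-bit KV quantization algorithm for long-context inference arrives

The author, Zhongzhu Zhou, is a Senior Research Scientist at TogetherAI and holds a PhD from the University of Sydney, with research on efficient machine learning systems covering algorithm-system co-design for model training and inference as well as LLM compression and quantization. The team members all come from ...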

腾讯网

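Surpassing TurboQuant! OSCAR: 2-bit KV Cache Quantization for Real-World Serving

Authors | Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu. Starting from the KV cache bottleneck: long-context model capabilities keep advancing, but the pressure on online inference serving is often no longer just the amount of computation itself. Every time a new token is generated, the system must repeatedly access an ever-longer history of Key and V ...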

腾讯网

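The Evolution of KV Cache Management Architecture: From Contiguous Allocation to a Unified Hybrid Memory Architecture

Anyone who has deployed an LLM in production knows that model weights are only half the problem. The other half is the KV cache: the runtime memory that stores attention state so the model does not have to recompute from scratch when generating tokens. How well this memory is managed determines whether the system is a stuttering demo or a usable inference service. This article walks through the five eras of KV cache management ...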

快科技

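Google's new paper sends memory stocks tumbling! KV cache compressed 6x

2026-03-26 23:31:06 | Source: 量子位 (QbitAI) | Author: 梦晨 | Editor: 若风. Shares of two memory-chip giants plunged, with no earnings blowup and no supply-chain disruption; Google had simply showcased a paper due to be formally presented at ICLR 2026. Google Research has introduced the TurboQuant compression algorithm, which compresses the KV cache, the most memory-hungry part of AI inference ...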

电子工程专辑

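KV Cache Explained in One Article: The Essential Technique Behind "Tens-of-Times Faster" LLM Inference

You type in a few hundred characters, and the output gets squeezed out slowly, like toothpaste from a tube. Is the model itself short on compute? Not entirely. There is a very basic efficiency problem hidden here, and the core technique for solving it is the one this article sets out to explain: KV Cache. 1. Some groundwork first: the basic terms you need to know. Before discussing KV Cache, we first need to get some of the most basic ...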

环球老虎财经 on MSN

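How are SK Hynix and Samsung eating into NVIDIA's profits?

In the AI inference era, storage cost has become a core part of the cost of compute, and giants such as SK Hynix and Samsung are taking a share of NVIDIA's profits through HBM and SSDs.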

Morning Overview on MSN

Google’s TurboQuant algorithm slashes the memory bottleneck that limits how many AI ...

Running a large language model is expensive, and a surprising amount of that cost comes down to memory, not computation. Every time a model like Gemini or GPT-4 processes a long document or sustains a ...

Business Wire

Penguin Solutions Introduces Industry's First Production-Ready CXL-Based KV Cache Server

FREMONT, Calif.--(BUSINESS WIRE)--Penguin Solutions, Inc. (Nasdaq: PENG), the AI factory platform company, today announced the industry's first production-ready KV cache server that utilizes CXL ...

heise online

TurboQuant: Google aims to curb the memory hunger of large LLMs

Google's TurboQuant reduces the KV cache of large language models to 3 bits. Accuracy is said to be preserved, while speed is said to multiply. Google Research has published new technical details about its compression ...

Hackaday

vector quantization

Large language models (LLMs) aren’t actually giant computer brains. Instead, they are massive vector spaces in which the probabilities of tokens occurring in a specific order are encoded. Billions of ...
