mirror of
https://github.com/claude-code-best/claude-code.git
synced 2026-06-15 21:05:51 +00:00
Session: vLLM Inference Optimization
- Level: Beginner (Target: Inference Optimization)
- Started: 2026-04-24
- Status: Mastered
Concepts
- ✅ The two phases of LLM inference (Prefill vs Decode)
- ✅ KV Cache
- ✅ GPU memory bottlenecks and fragmentation
- ✅ PagedAttention
- ✅ vLLM architecture (Scheduler, Worker)
- ✅ Hands-on deployment (--dtype, OpenAI-compatible API)
- ✅ Quantization (AWQ/GPTQ vs brute-force dtype casting)
- ✅ Tensor Parallel (TP, NCCL)
- ✅ Performance parameters (--gpu-memory-utilization)
- ✅ Chunked Prefill
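The KV Cache, fragmentation, and PagedAttention items above can be tied together with some rough arithmetic. The sketch below is not vLLM code: the model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the 16-token block size are illustrative assumptions, and the helper names are made up for this note. It compares how many token slots sit unused when each request reserves a contiguous maximum-length buffer versus when blocks are handed out on demand.

```python
import math

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Factor 2: every layer stores one K and one V vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def contiguous_waste(seq_lens, max_len):
    # Contiguous scheme: reserve max_len slots per request up front.
    # Every slot beyond the actual sequence length is wasted.
    return sum(max_len - n for n in seq_lens)

def paged_waste(seq_lens, block_size=16):
    # Paged scheme: allocate fixed-size blocks only as tokens arrive.
    # Only the unused tail of each request's last block is wasted.
    return sum(math.ceil(n / block_size) * block_size - n for n in seq_lens)

# Assumed 7B-class shape in fp16 (hypothetical numbers for illustration).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

seq_lens = [100, 37, 512]  # actual lengths of three in-flight requests
wasted_contig = contiguous_waste(seq_lens, max_len=2048)
wasted_paged = paged_waste(seq_lens, block_size=16)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"wasted slots, contiguous: {wasted_contig}")
print(f"wasted slots, paged:      {wasted_paged}")
```

Under these assumptions each cached token costs 512 KiB, so the slots wasted by contiguous reservation add up to gigabytes, while the paged scheme wastes less than one block per request.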
Misconceptions
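- Correction (Chunked Prefill): it does lower peak activation memory, but its core purpose is to reduce latency (the perceived stutter during generation).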
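A toy model may help show why latency is the main point. The sketch below is a simplification and not vLLM's scheduler: it assumes the cost of a step is proportional to the number of tokens it processes, and the function names and the 512-token budget are hypothetical. Without chunking, running decode requests wait for one very large prefill step; with chunking, every step serves the decodes first and then a slice of the prompt.

```python
def unchunked_steps(prefill_tokens):
    # The whole prompt runs as a single step; running decodes wait for it.
    return [prefill_tokens]

def chunked_steps(prefill_tokens, num_decodes, token_budget):
    # Each step serves every running decode (1 token each) first,
    # then fills the rest of the token budget with a slice of the prompt.
    chunk_cap = token_budget - num_decodes
    if chunk_cap <= 0:
        raise ValueError("token_budget must exceed the number of decodes")
    steps = []
    remaining = prefill_tokens
    while remaining > 0:
        chunk = min(remaining, chunk_cap)
        steps.append(num_decodes + chunk)
        remaining -= chunk
    return steps

# Hypothetical workload: one 4096-token prompt arrives while 8 requests decode.
unchunked = unchunked_steps(4096)
chunked = chunked_steps(4096, num_decodes=8, token_budget=512)

# The largest step approximates the longest pause a decoding user can see.
print("largest step without chunking:", max(unchunked))
print("largest step with chunking:   ", max(chunked))
print("steps needed with chunking:   ", len(chunked))
```

In this model the total prefill work is unchanged; chunking only caps the size of any single step, which is what bounds the gap between consecutive output tokens for the requests already decoding.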
Log
- Diagnosed: Beginner
- Mastery: Intuitive understanding of memory constraints and fragmentation is strong.
- Final Quiz: 3/3 correct (with minor clarification needed on TP params).