claude-code/teach-me/vllm/session.md
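claude-code-best 047c85fcbf fix: fix the 400 error caused by sending DeepSeek V4 reasoning_content back
- Broaden model-name detection to match all deepseek models (V4, R1, etc.)
- Always keep thinking blocks and send them back to the API as reasoning_content
- Remove the buggy turn boundary stripping logic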

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 20:33:43 +08:00

Session: vLLM Inference Optimization

  • Level: Beginner (Target: Inference Optimization)
  • Started: 2026-04-24
  • Status: Mastered

Concepts

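  1. The two phases of LLM inference (Prefill vs Decode)
  2. KV Cache
  3. GPU memory bottlenecks and fragmentation
  4. PagedAttention
  5. vLLM architecture (Scheduler, Worker)
  6. Hands-on deployment (--dtype, OpenAI API)
  7. Quantization (AWQ/GPTQ vs. brute-force dtype casting)
  8. Tensor Parallel (TP, NCCL)
  9. Performance parameters (--gpu-memory-utilization)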
  10. Chunked Prefill
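
As a rough illustration of concepts 2 to 4, the sketch below estimates how much GPU memory the KV cache takes per token, then compares the slots wasted by contiguous pre-allocation with those wasted by fixed-size paged blocks. The model dimensions (32 layers, 32 KV heads, head size 128, 2-byte dtype) and the block size of 16 are assumptions picked for illustration, not values from this session, and the function names are hypothetical.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def contiguous_waste_tokens(seq_len, max_len):
    """Token slots left idle when max_len slots are reserved up front."""
    return max_len - seq_len


def paged_waste_tokens(seq_len, block_size=16):
    """Token slots left idle when memory is handed out in fixed-size blocks."""
    # Only the last, partially filled block contains unused slots.
    blocks = math.ceil(seq_len / block_size)
    return blocks * block_size - seq_len


# Assumed 7B-class dimensions, for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token / 1024, "KiB per token")

# A 100-token sequence with a 2048-token reservation vs. 16-token blocks.
print(contiguous_waste_tokens(100, 2048), "slots wasted (contiguous)")
print(paged_waste_tokens(100), "slots wasted (paged)")
```

With paged blocks the waste is bounded by one block per sequence, which is the intuition behind PagedAttention easing fragmentation.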

Misconceptions

    • Correction: it does lower peak activation memory, but the core purpose is to reduce latency (the perceived stutter)
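
Assuming this correction refers to Chunked Prefill (concept 10), the toy model below shows why latency is the main point. It splits a long prompt into per-step chunks, so a decode request waiting in the same batch sits behind at most one chunk of prefill work rather than the whole prompt. The function name, prompt length, and chunk size are hypothetical, and this is a simplification, not vLLM's actual scheduler.

```python
def prefill_steps(prompt_tokens, chunk_size=None):
    """Number of prefill tokens processed in each scheduler step (toy model)."""
    if chunk_size is None:
        # No chunking: the whole prompt is prefilled in a single step.
        return [prompt_tokens]
    # Chunking: split the prompt into pieces of at most chunk_size tokens.
    return [
        min(chunk_size, prompt_tokens - start)
        for start in range(0, prompt_tokens, chunk_size)
    ]


unchunked = prefill_steps(8192)
chunked = prefill_steps(8192, chunk_size=2048)

# The largest single step approximates the stall seen by waiting decode requests.
print("unchunked worst step:", max(unchunked), "tokens")
print("chunked worst step:", max(chunked), "tokens over", len(chunked), "steps")
```

The total prefill work is unchanged. What shrinks is the longest single step, which is what a user perceives as stutter in the token stream.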

Log

  • Diagnosed: Beginner
  • Mastery: Intuitive understanding of memory constraints and fragmentation is strong.
  • Final Quiz: 3/3 correct (with minor clarification needed on TP params).