claude-code/spec/feature_20260510_F001_multi-encoding-file-tools/spec-plan-task-1.md at 43c20a43c23dc0a68b7d3f8dec3dfa98b84ed22b

mirror of https://github.com/claude-code-best/claude-code.git synced 2026-06-15 12:55:51 +00:00

Files

claude-code-best 0ce8f7a1cb feat: 添加 GBK 编码自动检测支持，文件读写工具透明处理非 UTF-8 文件

新增 encoding.ts 核心模块实现三层编码检测（BOM → UTF-8 fatal → GBK 回退），
改造同步/异步读取路径和写入路径，使 FileReadTool/FileEditTool/FileWriteTool
能正确处理 GBK 编码文件。包含完整单元测试和 spec 文档。

Co-Authored-By: glm-5-turbo <zai-org@claude-code-best.win>

2026-05-10 20:50:12 +08:00

6.7 KiB

Raw Blame History

Task 1: 编码检测核心模块

背景: 当前 src/utils/fileRead.ts 的 detectEncodingForResolvedPath 仅通过 BOM 头识别 UTF-8 和 UTF-16LE，其他所有文件一律返回 utf8，导致 GBK 等非 UTF-8 编码文件读取乱码。本 Task 新建独立的编码检测工具模块 src/utils/encoding.ts，实现三层编码检测算法（BOM → UTF-8 fatal 验证 → GBK 回退），为后续 Task 2/3/4 的读写路径改造提供统一的编码检测和解码能力。本 Task 无前置依赖，是后续所有 Task 的基础。

涉及文件:

新建: src/utils/encoding.ts
新建: src/utils/__tests__/encoding.test.ts

执行步骤:

创建 src/utils/encoding.ts，定义类型

位置: 文件顶部

导出以下类型:

/** 扩展编码类型，覆盖最常见的非 UTF-8 CJK 编码 */
export type FileEncoding = BufferEncoding | 'gbk'

/** TextDecoder 接受的编码名（string），比 FileEncoding 更宽泛 */
export type DetectedEncoding = string

原因: 后续 Task 2/3/4 需要这些类型来做编码标注和类型收窄

实现 detectEncoding(buffer: Buffer): FileEncoding 函数

位置: src/utils/encoding.ts，类型定义之后

三层检测逻辑:

export function detectEncoding(buffer: Buffer): FileEncoding {
  // Layer 1: BOM 检测（与现有 fileRead.ts 逻辑一致）
  if (buffer.length >= 2 && buffer[0] === 0xff && buffer[1] === 0xfe) {
    return 'utf-16le'
  }
  if (
    buffer.length >= 3 &&
    buffer[0] === 0xef &&
    buffer[1] === 0xbb &&
    buffer[2] === 0xbf
  ) {
    return 'utf-8'
  }

  // Layer 2: UTF-8 fatal 验证
  // fatal: true 模式下，无效 UTF-8 字节序列会抛出 TypeError
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(buffer)
    return 'utf-8'
  } catch {
    // 不是合法 UTF-8，进入 Layer 3
  }

  // Layer 3: GBK 回退
  try {
    new TextDecoder('gbk', { fatal: true }).decode(buffer)
    return 'gbk'
  } catch {
    // 不是合法 GBK，latin1 作为最终兜底
  }

  return 'latin1'
}

原因: BOM 必须优先于 fatal 验证；GBK 作为唯一回退避免了多编码链的字节歧义问题；latin1 单字节编码永远成功

实现 decodeBuffer(buffer: Buffer, encoding: DetectedEncoding): string 函数
- 位置: src/utils/encoding.ts，detectEncoding 之后
- 逻辑:
```
export function decodeBuffer(
  buffer: Buffer,
  encoding: DetectedEncoding,
): string {
  return new TextDecoder(encoding).decode(buffer)
}
```
- 原因: 统一解码入口，后续 Task 2/3 的读取路径都调用此函数

实现 encodeString(content: string, encoding: DetectedEncoding): { buffer: Buffer; converted: boolean } 函数

位置: src/utils/encoding.ts，decodeBuffer 之后

逻辑:

export function encodeString(
  content: string,
  encoding: DetectedEncoding,
): { buffer: Buffer; converted: boolean } {
  if (encoding === 'utf-8' || encoding === 'utf8') {
    return { buffer: Buffer.from(content, 'utf-8'), converted: false }
  }
  if (encoding === 'utf-16le') {
    return { buffer: Buffer.from(content, 'utf-16le'), converted: false }
  }

  // 其他编码（如 gbk）：尝试 Buffer.from，失败则回退为 UTF-8
  try {
    const buf = Buffer.from(content, encoding as BufferEncoding)
    return { buffer: buf, converted: false }
  } catch {
    return { buffer: Buffer.from(content, 'utf-8'), converted: true }
  }
}

原因: Buffer.from 在 Bun 中可能支持 GBK 编码名，但 Node.js 不支持。try-catch 策略兼容两种运行时；converted 标志让 Task 4 的写入路径能向用户报告编码转换

为编码检测和解码函数编写单元测试
- 测试文件: src/utils/__tests__/encoding.test.ts
- 测试场景:
  - BOM 检测 — UTF-16LE: 输入 Buffer.from([0xff, 0xfe, 0x48, 0x00]) → 预期返回 'utf-16le'
  - BOM 检测 — UTF-8 BOM: 输入 Buffer.from([0xef, 0xbb, 0xbf, 0x48, 0x65]) → 预期返回 'utf-8'
  - UTF-8 验证: 输入 Buffer.from('Hello, 世界', 'utf-8') → 预期返回 'utf-8'
  - GBK 检测: 输入 Buffer.from([0xc4, 0xe3, 0xba, 0xc3]) → 预期返回 'gbk'
  - 空 buffer: 输入 Buffer.alloc(0) → 预期返回 'utf-8'
  - latin1 兜底: 输入随机字节 Buffer.from([0x80, 0x81, 0x82, 0x83, 0x84, 0x85]) → 预期返回 'latin1'
  - BOM 优先于内容分析: 输入带 UTF-8 BOM 的数据 → 预期返回 'utf-8'
  - decodeBuffer — UTF-8: 输入 UTF-8 编码的 buffer + encoding 'utf-8' → 预期返回正确的中文字符串
  - decodeBuffer — GBK: 输入 GBK 编码的 buffer + encoding 'gbk' → 预期返回正确的中文字符串
  - decodeBuffer — UTF-16LE: 输入 UTF-16LE 编码的 buffer + encoding 'utf-16le' → 预期返回正确字符串
  - decodeBuffer — 空 buffer: 输入空 buffer → 预期返回空字符串
  - encodeString — UTF-8: 输入字符串 + encoding 'utf-8' → 预期 { converted: false }
  - encodeString — utf8 别名: 输入字符串 + encoding 'utf8' → 预期 { converted: false }
  - encodeString — UTF-16LE: 输入字符串 + encoding 'utf-16le' → 预期 { converted: false }
  - encodeString — GBK: 输入字符串 + encoding 'gbk' → 预期返回有效的 Buffer（converted 视运行时而定）
- 运行命令: bun test src/utils/__tests__/encoding.test.ts
- 预期: 所有测试通过

检查步骤:

验证 encoding.ts 文件存在且导出正确
- grep -c "export" src/utils/encoding.ts
- 预期: 输出 >= 4（至少导出 FileEncoding, DetectedEncoding, detectEncoding, decodeBuffer, encodeString 共 5 个导出）
验证类型检查通过
- bunx tsc --noEmit src/utils/encoding.ts 2>&1 | head -5
- 预期: 无类型错误输出
运行编码检测单元测试
- bun test src/utils/__tests__/encoding.test.ts
- 预期: 所有测试通过，无失败用例

认知变更:

[CLAUDE.md] src/utils/encoding.ts 是文件编码检测的唯一入口，提供 detectEncoding（三层检测：BOM → UTF-8 fatal → GBK 回退）和 decodeBuffer/encodeString 函数。检测基于文件头部 4KB，零外部依赖，仅使用 TextDecoder API。FileEncoding 类型为 BufferEncoding | 'gbk'，覆盖最常见非 UTF-8 CJK 编码。latin1 作为最终兜底编码（单字节编码永远成功）。

6.7 KiB Raw Blame History Unescape Escape

Task 1: 编码检测核心模块

6.7 KiB

Raw Blame History