
Agent Development in Practice 03 | Cutting Costs by 80% with Context Engineering

7 min read · 2026-06-04

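In the previous two installments we built our first Agent skeleton (Part 01) and gave it "operating system" capabilities such as retries, timeouts, and step limits (Part 02).

It now runs reliably. But open your billing dashboard and you'll see your wallet bleeding.

A single round of conversation burns tens of thousands of tokens
The model rereads the full set of tool definitions on every call
The longer the history, the slower and more expensive each call gets
You wanted the Agent to save time, and instead you're paying it all back in token fees

The problem isn't that the model is too expensive; it's that the context is out of control. That is the problem Context Engineering sets out to solve.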

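1. What Context Engineering Is

If Prompt Engineering is about "writing one good sentence," Context Engineering is about "designing the entire conversation space."

Context Engineering isn't concerned with how a single sentence is phrased, but with:

Context budget: how to divide a 200K window among system instructions, conversation history, and tool output
Context compression: what to keep, what to discard, and what to summarize
Context layering: managing priority among global rules, task instructions, and immediate messages
Model routing: which tasks go to an expensive model and which to a cheap one

Get these right, and a single Agent call goes from "stuff everything in and hope for the best" to precise feeding, with exactly what the task needs and nothing more. Token cost can drop to roughly 20% of what it was.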
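To make the layering idea concrete, here is a minimal sketch of assembling a context by priority under a token budget. The `Layer` type, the priority numbers, and the `assemble` helper are all hypothetical names invented for illustration, and the token estimate is the rough 4-characters-per-token heuristic used later in this article, not a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str      # e.g. "global_rules", "task", "immediate"
    priority: int  # lower number = more important (an assumed convention)
    text: str


def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token.
    return max(1, len(text) // 4)


def assemble(layers: list[Layer], budget_tokens: int) -> list[Layer]:
    """Keep the most important layers that fit; skip whatever does not fit."""
    kept, used = [], 0
    for layer in sorted(layers, key=lambda item: item.priority):
        cost = estimate_tokens(layer.text)
        if used + cost <= budget_tokens:
            kept.append(layer)
            used += cost
    return kept


layers = [
    Layer("immediate", 2, "x" * 400),     # about 100 tokens
    Layer("global_rules", 0, "r" * 200),  # about 50 tokens
    Layer("task", 1, "t" * 800),          # about 200 tokens
]
selected = [layer.name for layer in assemble(layers, budget_tokens=160)]
print(selected)
```

With a 160-token budget, the global rules go in first, the oversized task layer is skipped, and the immediate message still fits. A real system would more likely compress the task layer than drop it; skipping is just the simplest policy to show.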

2. The Biggest Culprit: The Context Tax

Start with a common scenario. Your Agent looks like this:

# Anti-pattern: everything lives in the context
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},  # 3000 tokens
    *history,      # grows with every turn, already 50000 tokens
    {"role": "user", "content": "Help me look at the bug in this function"},
    {"role": "user", "content": file_content}     # the whole file, 20000 tokens
]
# Total: ~75000 tokens, paid again on every call

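You think you're asking the Agent to fix a bug, but you're actually paying it to reread the whole system dozens of times. This is the "context tax": context that is never used still gets stuffed into the window, burning tokens for nothing.

In practice, the following three items account for 80% of wasted context: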
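A quick back-of-the-envelope sketch shows where the tax goes. The token counts come from the anti-pattern above (roughly 73,000 before the short user question, in line with the ~75,000 estimate); the number of calls per run is a made-up figure for illustration.

```python
# Token counts taken from the anti-pattern example.
parts = {"system_prompt": 3000, "history": 50000, "file_content": 20000}
per_call = sum(parts.values())


def share(name: str) -> float:
    """Fraction of the per-call input taken by one component."""
    return parts[name] / per_call


calls = 30  # hypothetical number of model calls in one Agent run
total = per_call * calls

print(f"per call: {per_call} tokens")
print(f"over {calls} calls: {total} tokens")
print(f"history share: {share('history'):.0%}")
```

Under these assumptions the history alone is about two thirds of every request, which is why the later sections focus on compressing it first.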

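3. In Practice | Context Budget Management

Think of the 200K context window as your desk. You don't pile every document on it; you keep out only what the current task needs most.

3.1 Budget Allocation Model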

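For a 200K window, a suggested split (the figures match the budget used in the code in 3.2):

System prompt: 10K tokens (5%)
Tool definitions: 20K tokens (10%)
Current task: 30K tokens (15%)
Tool results: 60K tokens (30%)
History summary: 40K tokens (20%)
Buffer: 40K tokens (20%)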

3.2 Implementation: A Context Manager

class ContextManager:
    """Manages the Agent's context budget."""
    
    def __init__(self, max_tokens=200000):
        self.max_tokens = max_tokens
        self.budget = {
            "system": 10000,
            "tools": 20000,
            "current": 30000,
            "results": 60000,
            "summary": 40000,
            "buffer": 40000
        }
        self.system_prompt = ""
        self.active_tools = []
        self.current_task = []
        self.tool_results = []
        self.history_summary = ""
    
    def set_system_prompt(self, prompt: str):
        """Set the system prompt (fixed budget)."""
        max_chars = self.budget["system"] * 4  # rough estimate: ~4 chars per token
        if len(prompt) > max_chars:
            prompt = prompt[:max_chars]
        self.system_prompt = prompt
    
    def register_tool(self, tool_schema: dict):
        """Register a tool definition (loaded on demand)."""
        self.active_tools.append(tool_schema)

    def get_active_tools(self):
        """Return the tools needed for the current step."""
        total = sum(len(str(t)) for t in self.active_tools)
        if total > self.budget["tools"] * 4:
            return self._prune_tools()
        return self.active_tools
    
    def add_result(self, content: str):
        """Add a tool result."""
        max_chars = self.budget["results"] * 4
        self.tool_results.append(content[:max_chars])
        # When over budget, drop the oldest results first
        total = sum(len(r) for r in self.tool_results)
        while total > max_chars and len(self.tool_results) > 1:
            total -= len(self.tool_results.pop(0))
    
    def compress_history(self, history: list):
        """Compress the conversation history into a summary."""
        self.history_summary = self._summarize(history)

    def build_context(self) -> list:
        """Assemble the final context."""
        context = [{"role": "system", "content": self.system_prompt}]
        if self.history_summary:
            context.append({
                "role": "system",
                "content": f"[Conversation history summary]\n{self.history_summary}"
            })
        context.extend(self.current_task)
        return context
    
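    def _prune_tools(self):
        """Trim the tool list to fit the budget.

        Tools are kept in registration order, so register the most important ones first.
        """
        char_limit = self.budget["tools"] * 4
        pruned = []
        total = 0
        for t in self.active_tools:
            cost = len(str(t))
            if total + cost <= char_limit:
                pruned.append(t)
                total += cost
        # Return only after every tool has been considered
        return pruned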
    
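    def _summarize(self, history: list) -> str:
        """Use the model to compress the history (illustrative stub)."""
        # A real implementation would make one LLM summarization call here
        return f"[Compressed] {len(history)} history messages, summarized…"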
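The pruning step above keeps tools in registration order. If you would rather prune by explicit importance, one possible variation is sketched below. The `priority` field and the wrapper dict shape are hypothetical and not part of any Function Calling schema; you would strip them before sending the schemas to the API.

```python
def prune_tools(tools: list[dict], token_budget: int) -> list[dict]:
    """Greedily keep the highest-priority tool schemas that fit the budget."""
    char_limit = token_budget * 4  # same ~4 chars/token heuristic as above
    kept, used = [], 0
    # "priority" is a hypothetical field: lower value = more important.
    for tool in sorted(tools, key=lambda t: t.get("priority", 99)):
        cost = len(str(tool["schema"]))
        if used + cost <= char_limit:
            kept.append(tool)
            used += cost
    return kept


tools = [
    {"priority": 2, "schema": {"name": "web_search", "description": "d" * 300}},
    {"priority": 0, "schema": {"name": "read_file", "description": "d" * 100}},
    {"priority": 1, "schema": {"name": "run_tests", "description": "d" * 100}},
]
kept_names = [t["schema"]["name"] for t in prune_tools(tools, token_budget=100)]
print(kept_names)
```

Here the two small, high-priority tools fit into the 100-token budget and the large low-priority one is left out, regardless of the order in which they were registered.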

3.3 Integrating It into the Agent

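from openai import AsyncOpenAI

# Assumption: an OpenAI-compatible SDK client; pass api_key and base_url for your provider
client = AsyncOpenAI()

class Agent:
    def __init__(self, model="deepseek-chat"):
        self.model = model
        self.ctx = ContextManager()
        self.conversation_history = []

    async def run(self, user_input: str):
        user_msg = {"role": "user", "content": user_input}
        self.ctx.current_task.append(user_msg)
        # Check whether compression is needed (threshold is in characters)
        ctx_size = sum(len(str(m)) for m in self.ctx.build_context())
        if ctx_size > 100000:
            self.ctx.compress_history(self.conversation_history)
            self.conversation_history = []
            # The older turns now live in the summary; keep only the current request
            self.ctx.current_task = [user_msg]

        messages = self.ctx.build_context()
        tools = self.ctx.get_active_tools()

        kwargs = {"model": self.model, "messages": messages}
        if tools:  # an empty tools list may be rejected, so omit it when unused
            kwargs["tools"] = tools
        # The method is async, so the API call must be awaited
        response = await client.chat.completions.create(**kwargs)

        reply = {"role": "assistant", "content": response.choices[0].message.content}
        self.conversation_history += [user_msg, reply]
        # Keep the reply in the live context so the next turn can see it
        self.ctx.current_task.append(reply)
        return response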
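Before optimizing anything, it helps to measure what each call actually consumes. The sketch below accumulates the `usage.prompt_tokens` and `usage.completion_tokens` fields that OpenAI-compatible chat completion responses report. `UsageTracker` is a hypothetical helper, and the fake responses exist only so the example runs without network access.

```python
from types import SimpleNamespace


class UsageTracker:
    """Accumulates token usage reported by OpenAI-compatible responses."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.calls = 0

    def record(self, response) -> None:
        usage = response.usage
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens
        self.calls += 1

    def average_prompt_tokens(self) -> float:
        return self.prompt_tokens / self.calls if self.calls else 0.0


def fake_response(prompt_tokens: int, completion_tokens: int):
    # Stand-in for a real API response; only the usage fields are modeled.
    usage = SimpleNamespace(prompt_tokens=prompt_tokens,
                            completion_tokens=completion_tokens)
    return SimpleNamespace(usage=usage)


tracker = UsageTracker()
tracker.record(fake_response(73000, 400))  # an unoptimized call
tracker.record(fake_response(12000, 350))  # a call after trimming the context
print(tracker.average_prompt_tokens())
```

In the Agent above, you would call `tracker.record(response)` right after the API call returns. Prompt tokens usually dominate in Agent workloads, so that is the number to watch as you apply the techniques in the next section.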

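4. Core Techniques | Compression, Layering, Routing

4.1 Slimming Down Tool Definitions

Tool definitions are one of the biggest sources of token consumption. A typical Function Calling schema has descriptions written like an academic paper, when in practice the model rarely needs that much text to pick the right tool.

Golden rule: keep each description under 80 characters, and declare only the parameters you need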

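# ❌ Too long: 500+ tokens per tool
TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Use this tool when the user needs to look up the latest information, news, or updates on the internet… (200 characters omitted here) …",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The keyword or question the user wants to search for, usually taken from the core of the user's question…"
                }
            }
        }
    }
}]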

# ✅ Slimmed down: 80 tokens
TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the internet for the latest information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"}
            },
            "required": ["query"]
        }
    }
}]

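Each tool saves about 400 tokens. If the Agent has 6 tools, that is 2,400 tokens per call, and 2.4 million tokens over 1,000 calls. The money saved scales with your model's input price: at p yuan per million input tokens, it comes to 2.4 × p yuan, so check your provider's current price list before quoting a figure.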
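The arithmetic is simple enough to keep as a small helper, so you can rerun it with your own numbers. The function names are made up for this sketch, and the price passed in at the end is a placeholder, not a real quote from any provider.

```python
def tokens_saved(saving_per_tool: int, n_tools: int, n_calls: int) -> int:
    """Total input tokens saved by slimming every tool definition."""
    return saving_per_tool * n_tools * n_calls


def money_saved(tokens: int, price_per_million: float) -> float:
    """Convert saved tokens into money at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


saved = tokens_saved(saving_per_tool=400, n_tools=6, n_calls=1000)
print(saved)

# 2.0 is a placeholder price per million input tokens; substitute your provider's rate.
print(money_saved(saved, price_per_million=2.0))
```

The point of separating the two functions is that the token saving is a property of your Agent, while the money saving changes every time the price list does.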

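4.2 Conversation Compression Strategy

Once a conversation runs past 5 rounds, stop sending the full history into the context. Replace it with "summary + key messages":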

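def smart_compress(history, max_rounds=5):
    """Compress the conversation history, keeping recent rounds verbatim."""
    # Assumes history is a flat list of alternating user/assistant messages,
    # so one round is two messages
    keep = max_rounds * 2
    if len(history) <= keep:
        return history  # nothing to compress

    # Keep the most recent N rounds in full
    recent = history[-keep:]

    # Compress everything older into a summary
    older = history[:-keep]
    summary_prompt = "Compress the following conversation into a 1-2 sentence summary, keeping key decisions and conclusions:"

    # llm_summarize is a placeholder for your own LLM call; it is not defined here
    summary = llm_summarize(summary_prompt, older)

    return [
        {"role": "system",
         "content": f"[History summary] {summary}"}
    ] + recent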
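To see the shape of the output without calling a model, here is a self-contained variant with a stub summarizer plugged in. `stub_summarize` and `compress` are hypothetical names for this sketch, and it assumes the history is a flat list of alternating user and assistant messages, so one round equals two messages.

```python
def stub_summarize(messages: list[dict]) -> str:
    # Placeholder for an LLM call: just reports how many messages were folded.
    return f"{len(messages)} earlier messages summarized"


def compress(history: list[dict], max_rounds: int = 5,
             summarize=stub_summarize) -> list[dict]:
    keep = max_rounds * 2  # one round = one user message + one assistant reply
    if len(history) <= keep:
        return history
    older, recent = history[:-keep], history[-keep:]
    summary_msg = {"role": "system",
                   "content": f"[History summary] {summarize(older)}"}
    return [summary_msg] + recent


history = []
for i in range(8):  # 8 rounds = 16 messages
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

compressed = compress(history, max_rounds=5)
print(len(compressed))
print(compressed[0]["content"])
print(compressed[1]["content"])
```

Sixteen messages shrink to eleven: one summary message followed by the last five rounds verbatim, starting at "question 3". Passing the summarizer in as a parameter makes it easy to swap the stub for a real model call later.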

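4.3 Model Routing: Tiered Model Use

Don't try to solve every problem with a single model. Set up tiered routing:

Once the Agent adopts "expensive model for complex planning, cheap model for routine execution" as its default strategy, the cost of a typical task shifts from "the most expensive model for everything" to allocation by need, and can fall to about one tenth of the original.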
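A minimal routing sketch, under loudly stated assumptions: the model identifiers, the prices, and the two task kinds are all invented for illustration, and a real router would classify tasks with more than a string comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    model: str                # hypothetical model identifier
    price_per_million: float  # placeholder price, not a real quote


TIERS = {
    "planning": Tier("expensive-model", 10.0),
    "execution": Tier("cheap-model", 1.0),
}


def route(task_kind: str) -> Tier:
    """Send complex planning to the expensive tier, everything else to the cheap one."""
    return TIERS["planning"] if task_kind == "planning" else TIERS["execution"]


def cost(steps: list[tuple[str, int]], routed: bool) -> float:
    """Total cost of a run; without routing, every step uses the expensive tier."""
    total = 0.0
    for kind, tokens in steps:
        tier = route(kind) if routed else TIERS["planning"]
        total += tokens / 1_000_000 * tier.price_per_million
    return total


# One planning step followed by nine routine execution steps of equal size.
steps = [("planning", 10_000)] + [("execution", 10_000)] * 9
print(cost(steps, routed=False), cost(steps, routed=True))
```

With these made-up numbers the routed run costs about 19% of the unrouted one. How close you get to one tenth depends on the price gap between tiers and on how much of the work can really go to the cheap model.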

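5. Advanced | Prompt Caching

With most model APIs, the same system prompt is resent and reprocessed on every call. Prompt Caching works somewhat like a browser cache: you still send the full prompt, but the provider reuses its work on the unchanged prefix and bills those cached tokens at a lower rate, so only the part that changed is processed at full price.

Taking DeepSeek as an example, with caching in effect the cost of the system-instruction portion can drop by as much as 90%. The actual effect depends on how continuously the context prefix is hit: across consecutive turns, as long as the system instructions stay unchanged, the cache keeps working.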

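import hashlib
import time

# Note: this class is a client-side response cache. It is a different mechanism from
# provider-side prompt caching, which is handled by the provider.
class PromptCache:
    def __init__(self, ttl_seconds=300):
        self.cache = {}
        self.ttl = ttl_seconds

    def get_or_compute(self, key: str, compute_fn):
        now = time.time()
        if key in self.cache:
            cached_time, cached_value = self.cache[key]
            if now - cached_time < self.ttl:
                return cached_value, True  # cache hit
        value = compute_fn()
        self.cache[key] = (now, value)
        return value, False  # cache miss

    def invalidate(self, key: str):
        self.cache.pop(key, None)

# Usage example (client and user_input are assumed to be defined elsewhere)
cache = PromptCache(ttl_seconds=600)
system_prompt = "You are a professional AI coding assistant… (20K tokens)"

# The key covers the user input too; keying on the system prompt alone would
# return the same cached answer for every question
key = hashlib.md5(f"{system_prompt}\x00{user_input}".encode()).hexdigest()

# First call: computed normally
result, cached = cache.get_or_compute(
    key,
    lambda: client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_input}]
    )
)
# Later calls with the same system prompt and user input hit the cache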
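Since provider-side caching depends on the prompt prefix staying unchanged, it is worth checking that your message builder does not accidentally change the prefix on every call. The sketch below compares two requests message by message. That is a simplification (providers match at the token level, and details vary by provider), and all function names here are hypothetical.

```python
def shared_prefix_len(a: list[dict], b: list[dict]) -> int:
    """Number of leading messages that are identical in both requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = {"role": "system", "content": "You are a coding assistant."}


def build_stable(user_text: str, timestamp: str) -> list[dict]:
    # Static content first; volatile details go at the end.
    return [SYSTEM,
            {"role": "user", "content": f"{user_text}\n(time: {timestamp})"}]


def build_unstable(user_text: str, timestamp: str) -> list[dict]:
    # A timestamp inside the system message changes the prefix on every call.
    return [{"role": "system", "content": f"{SYSTEM['content']} Now: {timestamp}"},
            {"role": "user", "content": user_text}]


stable = shared_prefix_len(build_stable("fix bug", "10:00"),
                           build_stable("add test", "10:05"))
unstable = shared_prefix_len(build_unstable("fix bug", "10:00"),
                             build_unstable("add test", "10:05"))
print(stable, unstable)
```

The stable builder shares its system message across calls, while the unstable one shares nothing, because the timestamp sits at the very front. Keeping volatile values such as dates, request IDs, or user names out of the system prompt is usually the cheapest way to keep the prefix cacheable.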

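6. Best Practices Summary

Putting all the techniques above together, an optimized Agent call flow looks like this:

[Budget first] Split the 200K window into budget shares up front; don't wait until it's full to fix it
[Slim tools] Keep each tool description under 80 characters and trim parameters to the minimum
[Load on demand] Don't load tool definitions the current step doesn't need
[Precise feeding] Use grep to find the relevant code and pass only that, rather than stuffing in the whole file
[History compression] After 5 rounds, replace the full history with a summary
[Tiered routing] Cheap models for simple tasks, stronger models for complex ones
[Enable caching] Route system instructions through Prompt Caching to cut up to 90% of the repeated cost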
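The "precise feeding" item is the one most often skipped, so here is a small grep-like sketch of it: instead of sending a whole file, send only the lines around each match. `extract_snippets` is a hypothetical helper, and plain substring matching stands in for whatever search you actually use.

```python
def extract_snippets(source: str, keyword: str, context_lines: int = 2) -> str:
    """Return only the lines around each keyword match, with line numbers."""
    lines = source.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if keyword in line:
            lo = max(0, i - context_lines)
            hi = min(len(lines), i + context_lines + 1)
            keep.update(range(lo, hi))
    return "\n".join(f"{i + 1}: {lines[i]}" for i in sorted(keep))


# A 100-line stand-in file with one interesting line at position 50.
source = "\n".join(f"line {n}" for n in range(1, 101))
source = source.replace("line 50", "def parse_config(): ...")

snippet = extract_snippets(source, "parse_config", context_lines=2)
print(snippet)
```

Out of 100 lines, only lines 48 to 52 reach the model. Keeping the original line numbers in the snippet lets the model refer to exact locations when it proposes a fix.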

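Following these practices, the context efficiency of a typical Agent:

An 80% cost reduction is not an exaggeration; it is real data from a before-and-after comparison.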

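7. Coming Up Next

Part 04: Tool calling, so the Agent can actually "get to work"
Part 05: Agent memory systems, from "forgets the moment it turns around" to "never forgets what it has seen"
Part 06: Multi-Agent patterns, turning the Agent into a team

Tip: Context Engineering is one of the most "valuable" skills in AI Agent engineering.

Models themselves keep getting cheaper, but an out-of-control context will make your bill soar, counterintuitively. Put the 7 practices from this article to work and your Agent will not only run faster but also cost less. Many people treat context optimization as "something to do later," but in reality the context budget should be designed on day one. It's like building a house: if the foundation is poor, no amount of renovation afterward will save it.

In the next installment, we give the Agent "hands": tool calling. The Agent will be able not only to "talk" but also to "do."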

Note: This article was produced by 码农大坚果. Feel free to share it; please credit the source when reposting.

References: Weaviate, "Context Engineering" e-book; Shanghai Jiao Tong University, "Context Engineering 2.0" paper; knowledge base concepts/context-engineering | Compiled by 码农大坚果