3.4 · Deduplication and Resolution

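Temporal knowledge graphs and dynamic fact memory · This chapter is a standalone chapter page of the translated Graphiti DeepWiki. It preserves the original links, source anchors, module tags, and chapter hierarchy.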

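Project: Graphiti · Chapter: 3.4 · Status: full translation · Modules: Graph and Relations; UI and Interaction; System Architecture; Testing, Release, and Operations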
Module tags
  • Graph and Relations
  • UI and Interaction
  • System Architecture
  • Testing, Release, and Operations
  • Retrieval, Recall, and Indexing

Original DeepWiki page: https://deepwiki.com/getzep/graphiti/3.4-deduplication-and-resolution
Translated at: 2026-05-27T08:44:55.887Z
Translation model: deepseek-chat
Source character count: 12429
Project: Graphiti (graphiti)

---

Deduplication and Resolution

Relevant source files

The following files were used as context for generating this wiki page:

  • graphiti_core/prompts/dedupe_edges.py
  • graphiti_core/prompts/dedupe_nodes.py
  • graphiti_core/prompts/extract_edges.py
  • graphiti_core/prompts/extract_nodes.py
  • graphiti_core/prompts/summarize_nodes.py
  • graphiti_core/utils/maintenance/edge_operations.py
  • graphiti_core/utils/maintenance/node_operations.py
  • tests/utils/maintenance/test_edge_operations.py
  • tests/utils/maintenance/test_node_operations.py

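This document explains how Graphiti resolves duplicate entities and relationships during episode ingestion. The system uses a three-tier strategy for node deduplication (exact matching, fuzzy similarity, and LLM reasoning). It handles edge deduplication together with contradiction detection, and supports time-based invalidation.

Overview

During episode ingestion, extracted nodes and edges must be resolved against existing graph entities to prevent duplicates and maintain consistency. The deduplication system runs in two stages:

  1. Node resolution: resolve_extracted_nodes (graphiti_core/utils/maintenance/node_operations.py:31-31) first matches extracted entity nodes against existing nodes using deterministic heuristics, then escalates to large language model (LLM) resolution for any node those heuristics cannot resolve.
  2. Edge resolution: resolve_extracted_edges (graphiti_core/utils/maintenance/edge_operations.py:225-225) checks extracted edges for duplicates and contradictions, and applies time-based invalidation to outdated information.

The system prefers fast deterministic methods for performance and calls the LLM only when necessary. This keeps API cost low while preserving accuracy.

Sources: graphiti_core/utils/maintenance/node_operations.py:31-31, graphiti_core/utils/maintenance/edge_operations.py:225-225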

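Node Deduplication: Three-Tier Strategy

Architecture Overview

Node deduplication follows a tiered escalation strategy in which each tier handles the cases the previous tier could not resolve. The main entry point for this logic is resolve_extracted_nodes (graphiti_core/utils/maintenance/node_operations.py:31-31).

Tiered resolution flow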

graph TB
    Extract["Extracted nodes<br/>(from extract_nodes)"]

    Collect["_collect_candidate_nodes()<br/>Search for similar existing nodes"]

    BuildIdx["_build_candidate_indexes()<br/>normalized_existing<br/>shingles_by_candidate<br/>lsh_buckets"]

    Tier1["Tier 1: exact match<br/>_normalize_string_exact()<br/>Case-insensitive, whitespace-normalized"]

    Tier2["Tier 2: fuzzy similarity<br/>_resolve_with_similarity()<br/>MinHash + LSH + Jaccard"]

    Tier3["Tier 3: LLM reasoning<br/>_resolve_with_llm()<br/>dedupe_nodes.nodes prompt"]

    State["DedupResolutionState<br/>resolved_nodes<br/>uuid_map<br/>unresolved_indices<br/>duplicate_pairs"]

    Final["Resolved nodes<br/>+ UUID map<br/>+ duplicate pairs"]

    Extract --> Collect
    Collect --> BuildIdx
    BuildIdx --> State

    State --> Tier1
    Tier1 -->|"Exact match found"| State
    Tier1 -->|"No match or multiple matches"| Tier2

    Tier2 -->|"High entropy + fuzzy match"| State
    Tier2 -->|"Low entropy or no match"| Tier3

    Tier3 --> State

    State --> Final

Sources: graphiti_core/utils/maintenance/node_operations.py:31-31, graphiti_core/utils/maintenance/dedup_helpers.py:161-175, graphiti_core/utils/maintenance/node_operations.py:217-222
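
The escalation pattern in the diagram can be illustrated with a minimal sketch. It is not Graphiti's implementation: `resolve_tiered`, `exact_tier`, and `fallback_tier` are hypothetical names, each tier is reduced to a function that returns a match or `None`, and names are processed one at a time rather than in batches.

```python
from typing import Callable, Optional

# A tier maps an extracted name to a matched existing name, or None to escalate.
Tier = Callable[[str], Optional[str]]


def resolve_tiered(names: list[str], tiers: list[Tier]) -> dict[str, Optional[str]]:
    """Try each tier in order; a name stops at the first tier that resolves it."""
    resolved: dict[str, Optional[str]] = {}
    for name in names:
        resolved[name] = None  # stays None if no tier finds a match
        for tier in tiers:
            match = tier(name)
            if match is not None:
                resolved[name] = match
                break
    return resolved


existing = {"alice smith": "Alice Smith"}


def exact_tier(name: str) -> Optional[str]:
    # Cheap deterministic check: lowercase and collapse whitespace, then look up.
    return existing.get(" ".join(name.lower().split()))


def fallback_tier(name: str) -> Optional[str]:
    # Stand-in for the more expensive tiers (fuzzy matching, LLM reasoning).
    return "Alice Smith" if name.lower().startswith("alice") else None


result = resolve_tiered(["ALICE  smith", "Alice S.", "Bob"], [exact_tier, fallback_tier])
```

Ordering the tiers from cheapest to most expensive means the costly fallback only runs for names the earlier tiers leave unresolved.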

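Tier 1: Exact String Matching

Tier 1 normalizes entity names and performs a case-insensitive, whitespace-normalized comparison. This happens inside _resolve_with_similarity (graphiti_core/utils/maintenance/dedup_helpers.py:196-196).

| Normalization function | Operation | Example |
| --- | --- | --- |
| _normalize_string_exact() | Lowercase, collapse whitespace | " Alice Smith " → "alice smith" |
| _normalize_name_for_fuzzy() | Remove punctuation, lowercase | "Alice-Smith!" → "alice smith" |

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:39-49, graphiti_core/utils/maintenance/dedup_helpers.py:52-64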
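
The two normalization steps in the table could be sketched as follows. The function names and the regular expression are assumptions for illustration; the real helpers may treat apostrophes or non-ASCII characters differently.

```python
import re


def normalize_exact(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace into single spaces."""
    return " ".join(name.lower().split())


def normalize_for_fuzzy(name: str) -> str:
    """Lowercase, replace punctuation with spaces, then collapse whitespace."""
    lowered = normalize_exact(name)
    # Assumption: anything other than ASCII letters, digits, and spaces is punctuation.
    cleaned = re.sub(r"[^a-z0-9 ]", " ", lowered)
    return " ".join(cleaned.split())


exact_key = normalize_exact("  Alice   Smith ")
fuzzy_key = normalize_for_fuzzy("Alice-Smith!")
```

Both calls yield "alice smith", so the two spellings end up under the same lookup key.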

The exact-match logic checks whether exactly one candidate node has the same normalized name:

normalized_key = _normalize_string_exact(extracted_node.name)
candidates = indexes.normalized_existing[normalized_key]

if len(candidates) == 1:
    # Single exact match: resolve immediately
    state.resolved_nodes[idx] = candidates[0]
    state.uuid_map[extracted_node.uuid] = candidates[0].uuid

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:214-222
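
A self-contained sketch of this lookup is shown below. `Node` is a hypothetical stand-in for EntityNode, and `build_exact_index` and `exact_match` are illustrative names, not functions from the library. An ambiguous key (more than one candidate) is deliberately left unresolved so that a later tier can decide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:  # hypothetical stand-in for EntityNode
    uuid: str
    name: str


def _key(name: str) -> str:
    return " ".join(name.lower().split())


def build_exact_index(existing: list[Node]) -> defaultdict:
    """Group existing nodes by their normalized name."""
    index: defaultdict = defaultdict(list)
    for node in existing:
        index[_key(node.name)].append(node)
    return index


def exact_match(extracted: Node, index: defaultdict) -> Optional[Node]:
    """Return the unique candidate with the same normalized name, else None."""
    # .get() avoids inserting an empty list for unknown keys in the defaultdict.
    candidates = index.get(_key(extracted.name), [])
    return candidates[0] if len(candidates) == 1 else None


index = build_exact_index([Node("u1", "Alice Smith"), Node("u2", "Bob"), Node("u3", "bob")])
hit = exact_match(Node("x", "alice  SMITH"), index)
ambiguous = exact_match(Node("y", "BOB"), index)
```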

Tier 2: Fuzzy Similarity with MinHash and LSH

For entities with no exact match, _resolve_with_similarity (graphiti_core/utils/maintenance/dedup_helpers.py:196-196) uses probabilistic hashing to find near-duplicates.

Fuzzy resolution pipeline

graph LR
    Name["Entity name:<br/>'Alice Smith'"]

    Shingles["_shingles()<br/>Character 3-grams"]

    MinHash["_minhash_signature()<br/>32 hash values"]

    LSH["_lsh_bands()<br/>8 bands, 4 hashes per band"]

    Buckets["lsh_buckets<br/>band_hash → [candidate_uuids]"]

    Jaccard["_jaccard_similarity()"]

    Match["Fuzzy match<br/>if similarity ≥ 0.9"]

    Name --> Shingles
    Shingles --> MinHash
    MinHash --> LSH
    LSH --> Buckets

    Buckets --> Jaccard
    Jaccard --> Match

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:88-140
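
The pipeline above can be approximated with the standard library alone. This is a sketch under stated assumptions, not the library's code: the hash function (`blake2b` with a seed prefix), the removal of spaces before shingling, and all function names are illustrative. Only the parameters come from the diagram (3-gram shingles, 32 hash values, 8 bands of 4).

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # signature length from the diagram
BAND_SIZE = 4    # 32 / 4 = 8 bands


def shingles(name: str, n: int = 3) -> set[str]:
    """Character n-grams of the name with spaces removed (an assumption)."""
    compact = name.replace(" ", "")
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def _hash(seed: int, shingle: str) -> int:
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(sh: set[str]) -> tuple[int, ...]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    if not sh:
        return ()
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def lsh_bands(signature: tuple[int, ...]) -> list[tuple[int, ...]]:
    """Split the signature into consecutive bands of BAND_SIZE hashes."""
    return [signature[i:i + BAND_SIZE] for i in range(0, len(signature), BAND_SIZE)]


def jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_buckets(names_by_uuid: dict[str, str]) -> defaultdict:
    """Map (band index, band) to the UUIDs whose signatures share that band."""
    buckets: defaultdict = defaultdict(list)
    for uuid, name in names_by_uuid.items():
        for band_index, band in enumerate(lsh_bands(minhash_signature(shingles(name)))):
            buckets[(band_index, band)].append(uuid)
    return buckets


buckets = build_buckets({"u1": "alice smith"})
query_bands = lsh_bands(minhash_signature(shingles("alice smith")))
candidates = {u for i, band in enumerate(query_bands) for u in buckets.get((i, band), [])}
score = jaccard(shingles("alice smith"), shingles("alice smyth"))
```

The bands only nominate candidates. The final decision still compares the actual shingle sets, and with a 0.9 threshold a one-letter change in a short name (a score of 5/11 here) is not accepted as a fuzzy match.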

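Entropy-based gating: low-entropy names (for example, "Joe") skip fuzzy matching and escalate directly to LLM reasoning to avoid false positives. This is controlled by _has_high_entropy (graphiti_core/utils/maintenance/dedup_helpers.py:79-79), which computes the Shannon entropy of the name's characters.

| Constant | Value | Purpose |
| --- | --- | --- |
| _NAME_ENTROPY_THRESHOLD | 1.5 | Minimum Shannon entropy for fuzzy matching |
| _MIN_NAME_LENGTH | 6 | Minimum length for trusting a fuzzy match |
| _FUZZY_JACCARD_THRESHOLD | 0.9 | Minimum Jaccard similarity to resolve |

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:31-36, graphiti_core/utils/maintenance/dedup_helpers.py:79-85, graphiti_core/utils/maintenance/dedup_helpers.py:52-76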
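
A minimal sketch of the entropy gate, using the threshold values from the table. How the real helper combines the length and entropy checks is not shown on this page, so the simple conjunction below is an assumption, as are the function names.

```python
import math
from collections import Counter

NAME_ENTROPY_THRESHOLD = 1.5  # value from the table above
MIN_NAME_LENGTH = 6           # value from the table above


def name_entropy(normalized_name: str) -> float:
    """Shannon entropy, in bits, of the character distribution (spaces ignored)."""
    chars = normalized_name.replace(" ", "")
    if not chars:
        return 0.0
    total = len(chars)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chars).values())


def trust_fuzzy_match(normalized_name: str) -> bool:
    """Assumed gate: the name must be long enough and varied enough."""
    if len(normalized_name) < MIN_NAME_LENGTH:
        return False
    return name_entropy(normalized_name) >= NAME_ENTROPY_THRESHOLD


short_name_ok = trust_fuzzy_match("joe")
full_name_ok = trust_fuzzy_match("alice smith")
```

Under these assumptions "joe" is rejected on length alone, while "alice smith" passes both checks and may be resolved by fuzzy matching.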

Tier 3: LLM-Based Resolution

Entities still unresolved after Tiers 1 and 2 are sent in a batch to the LLM for semantic reasoning by the _resolve_with_llm function (graphiti_core/utils/maintenance/node_operations.py:29-29).

LLM resolution sequence

sequenceDiagram
    participant Sys as _resolve_with_llm
    participant Prompt as prompt_library.dedupe_nodes.nodes
    participant LLM as LLMClient
    participant State as DedupResolutionState

    Sys->>Prompt: Build context containing:<br/>extracted_nodes (IDs 0-N)<br/>existing_nodes
    Prompt-->>Sys: System message + user message

    Sys->>LLM: generate_response()<br/>response_model=NodeResolutions

    LLM-->>Sys: NodeResolutions object

    loop For each resolution
        Sys->>Sys: Validate ID and name
        alt Valid resolution
            Sys->>State: Update resolved_nodes<br/>Update uuid_map
        end
    end

Sources: graphiti_core/utils/maintenance/node_operations.py:29-29, graphiti_core/prompts/dedupe_nodes.py:117-117, graphiti_core/llm_client/llm_client.py:42-42

The LLM prompt receives extracted_nodes, existing_nodes, and the original episode_content to provide disambiguation context (graphiti_core/prompts/dedupe_nodes.py:117-179).

Response model: NodeDuplicate (graphiti_core/prompts/dedupe_nodes.py:25-34)

from pydantic import BaseModel

class NodeDuplicate(BaseModel):
    id: int  # integer ID of the newly extracted entity
    name: str  # the most complete name for the entity
    duplicate_candidate_id: int  # candidate ID of the matching existing entity, or -1 if none
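
The validation loop from the sequence diagram might look like the sketch below. `Resolution` mirrors the three fields of NodeDuplicate using a plain dataclass so the example has no third-party dependency, and `apply_resolutions` is a hypothetical helper. Treating an out-of-range candidate ID as "no match" is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:  # mirrors the fields of NodeDuplicate shown above
    id: int
    name: str
    duplicate_candidate_id: int


def apply_resolutions(
    resolutions: list[Resolution], extracted: list[str], candidates: list[str]
) -> dict[int, Optional[str]]:
    """Return {extracted index: matched candidate or None}, skipping invalid entries."""
    applied: dict[int, Optional[str]] = {}
    for res in resolutions:
        if not 0 <= res.id < len(extracted):
            continue  # the model referenced an ID that was never sent
        if res.id in applied:
            continue  # keep the first answer for each extracted entity
        cid = res.duplicate_candidate_id
        # -1 means "new entity"; an out-of-range candidate is treated the same way here.
        applied[res.id] = candidates[cid] if 0 <= cid < len(candidates) else None
    return applied


applied = apply_resolutions(
    [
        Resolution(0, "Alice Smith", 0),
        Resolution(1, "Bob", -1),
        Resolution(5, "Ghost", 0),   # invalid extracted ID, ignored
        Resolution(0, "Alice", -1),  # repeated ID, ignored
        Resolution(2, "Carol", 9),   # invalid candidate ID, treated as no match
    ],
    extracted=["Alice", "Bob", "Carol"],
    candidates=["Alice Smith"],
)
```

Validating the IDs before touching any state guards against malformed or hallucinated entries in the model's response.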

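Edge Deduplication and Contradiction Detection

Integrated Resolution

Edge resolution is managed by resolve_extracted_edges (graphiti_core/utils/maintenance/edge_operations.py:225-225), which combines duplicate detection and contradiction identification into a single LLM call using the resolve_edge prompt (graphiti_core/prompts/dedupe_edges.py:43-43).

Edge resolution flow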

graph TB
    Extract["Extracted edge<br/>(EntityEdge)"]

    FastPath["Fast path:<br/>exact fact + endpoint match?"]

    GetRelated["EntityEdge.get_between_nodes()<br/>Edges with the same endpoints"]

    SearchSimilar["search() using EDGE_HYBRID_SEARCH_RRF"]

    LLMCall["LLM: resolve_edge prompt<br/>Existing facts (indices 0-N)<br/>Invalidation candidates (indices N-M)"]

    Resolve["resolve_edge_contradictions()<br/>Time-based invalidation logic"]

    Result["Resolved edge<br/>+ invalidated edges"]

    Extract --> FastPath
    FastPath -->|"Match found"| Result
    FastPath -->|"No match"| GetRelated

    GetRelated --> SearchSimilar
    SearchSimilar --> LLMCall
    LLMCall --> Resolve
    Resolve --> Result

Sources: graphiti_core/utils/maintenance/edge_operations.py:107-107, graphiti_core/prompts/dedupe_edges.py:43-91, graphiti_core/utils/maintenance/edge_operations.py:225-225

Fast Path: Exact Fact Matching

In resolve_extracted_edge (graphiti_core/utils/maintenance/edge_operations.py:107-107), the system first checks whether the new fact, after string normalization, exactly matches an existing edge between the same pair of nodes. If it does, the LLM call is short-circuited and the new episode's UUID is appended to the existing edge (tests/utils/maintenance/test_edge_operations.py:108-152).

Sources: graphiti_core/utils/maintenance/edge_operations.py:107-152, tests/utils/maintenance/test_edge_operations.py:108-152
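
The fast path can be sketched as below. `Edge` is a hypothetical stand-in carrying only the fields the check needs, `fast_path` is an illustrative name, and the normalization (lowercasing and collapsing whitespace) is an assumption about what "normalized strings" means here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Edge:  # hypothetical stand-in for EntityEdge
    source_node_uuid: str
    target_node_uuid: str
    fact: str
    episodes: list[str] = field(default_factory=list)


def _norm(text: str) -> str:
    return " ".join(text.lower().split())


def fast_path(new_edge: Edge, existing: list[Edge], episode_uuid: str) -> Optional[Edge]:
    """Reuse an existing edge with the same endpoints and the same normalized fact."""
    for edge in existing:
        same_endpoints = (
            edge.source_node_uuid == new_edge.source_node_uuid
            and edge.target_node_uuid == new_edge.target_node_uuid
        )
        if same_endpoints and _norm(edge.fact) == _norm(new_edge.fact):
            if episode_uuid not in edge.episodes:
                edge.episodes.append(episode_uuid)  # record the new supporting episode
            return edge  # no LLM call is needed
    return None  # fall through to search and LLM resolution


stored = Edge("a", "b", "Alice works at Acme", episodes=["ep1"])
reused = fast_path(Edge("a", "b", "alice  works at ACME"), [stored], "ep2")
missed = fast_path(Edge("a", "c", "Alice works at Acme"), [stored], "ep3")
```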

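Time-Based Contradiction Resolution

When the LLM identifies contradictions (via contradicted_facts in the EdgeDuplicate response, graphiti_core/prompts/dedupe_edges.py:24-33), the system performs time-based invalidation. If the new fact is more recent according to the valid_at timestamps, the existing fact's invalid_at is set to the new fact's valid_at, which effectively retires the old information.

Sources: graphiti_core/prompts/dedupe_edges.py:24-33, graphiti_core/prompts/dedupe_edges.py:79-84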
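
The invalidation rule can be written as a short sketch. `Fact` and `invalidate_contradicted` are hypothetical names. The page only describes the case where the new fact is more recent, so this sketch leaves every other case (older new fact, missing timestamps) untouched, which is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:  # hypothetical stand-in for an edge with temporal fields
    text: str
    valid_at: Optional[datetime] = None
    invalid_at: Optional[datetime] = None


def invalidate_contradicted(new_fact: Fact, contradicted: list[Fact]) -> list[Fact]:
    """Retire contradicted facts that became valid before the new fact did."""
    invalidated = []
    for old in contradicted:
        if new_fact.valid_at is None or old.valid_at is None:
            continue  # the two facts cannot be ordered in time
        if old.valid_at < new_fact.valid_at:
            old.invalid_at = new_fact.valid_at  # old fact stops holding when the new one starts
            invalidated.append(old)
    return invalidated


t2023 = datetime(2023, 1, 1, tzinfo=timezone.utc)
t2024 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2025 = datetime(2025, 1, 1, tzinfo=timezone.utc)

older = Fact("Alice works at Acme", valid_at=t2023)
later = Fact("Alice works at Initech", valid_at=t2025)
new = Fact("Alice works at Globex", valid_at=t2024)
retired = invalidate_contradicted(new, [older, later])
```

Setting invalid_at instead of deleting the edge keeps the old fact available for queries about the past.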

Data Structures

DedupResolutionState

Tracks resolution progress for a batch of nodes (graphiti_core/utils/maintenance/dedup_helpers.py:161-168).

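from dataclasses import dataclass, field

from graphiti_core.nodes import EntityNode

@dataclass
class DedupResolutionState:
    resolved_nodes: list[EntityNode | None]  # one slot per extracted node, None until resolved
    uuid_map: dict[str, str]  # extracted node UUID -> resolved node UUID
    unresolved_indices: list[int]  # indices of nodes still awaiting resolution
    duplicate_pairs: list[tuple[EntityNode, EntityNode]] = field(default_factory=list)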

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:161-168

DedupCandidateIndexes

Stores precomputed lookup structures for exact and fuzzy matching (graphiti_core/utils/maintenance/dedup_helpers.py:150-158).

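from collections import defaultdict
from dataclasses import dataclass

from graphiti_core.nodes import EntityNode

@dataclass
class DedupCandidateIndexes:
    existing_nodes: list[EntityNode]
    nodes_by_uuid: dict[str, EntityNode]
    normalized_existing: defaultdict[str, list[EntityNode]]  # normalized name -> nodes
    shingles_by_candidate: dict[str, set[str]]  # candidate UUID -> character shingles
    lsh_buckets: defaultdict[tuple[int, tuple[int, ...]], list[str]]  # LSH band key -> candidate UUIDs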

Sources: graphiti_core/utils/maintenance/dedup_helpers.py:150-158

Integration with Episode Processing

Deduplication is a core part of the processing pipeline. After nodes are extracted by extract_nodes (graphiti_core/utils/maintenance/node_operations.py:69-69), they are resolved. The resulting uuid_map is then used to update the source_node_uuid and target_node_uuid of the extracted edges before those edges are resolved by resolve_extracted_edges (graphiti_core/utils/maintenance/edge_operations.py:225-225).

Sources: graphiti_core/utils/maintenance/node_operations.py:69-148, graphiti_core/utils/maintenance/edge_operations.py:225-232
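
The remapping step described above can be sketched as follows. `Edge` is a hypothetical stand-in for EntityEdge and `remap_edge_endpoints` is an illustrative name. UUIDs missing from the map are assumed to belong to nodes that were not merged and are left unchanged.

```python
from dataclasses import dataclass


@dataclass
class Edge:  # hypothetical stand-in for EntityEdge
    source_node_uuid: str
    target_node_uuid: str
    fact: str


def remap_edge_endpoints(edges: list[Edge], uuid_map: dict[str, str]) -> list[Edge]:
    """Point each edge at the resolved node UUIDs before edge resolution runs."""
    for edge in edges:
        edge.source_node_uuid = uuid_map.get(edge.source_node_uuid, edge.source_node_uuid)
        edge.target_node_uuid = uuid_map.get(edge.target_node_uuid, edge.target_node_uuid)
    return edges


# "tmp-1" was extracted from the episode and resolved to the existing node "node-42".
uuid_map = {"tmp-1": "node-42"}
edges = remap_edge_endpoints([Edge("tmp-1", "tmp-2", "Alice works at Acme")], uuid_map)
```

Running this before edge resolution matters because duplicate detection compares edges between the same pair of nodes, which only works once both endpoints refer to canonical UUIDs.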