4 · Knowledge Base and RAG System

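Application orchestration and external knowledge integration · This chapter is a standalone chapter page of the Chinese translation of the Dify DeepWiki. It preserves the original links, source anchors, module tags, and chapter hierarchy.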

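Project: Dify · Chapter: 4 · Status: full-text translation · Modules: retrieval, recall and indexing; model invocation and provider adaptation; graphs and relations; document objects and metadata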
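Module tags
  • Retrieval, recall and indexing
  • Model invocation and provider adaptation
  • Graphs and relations
  • Document objects and metadata
  • System architecture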

Translated text

Original DeepWiki page: https://deepwiki.com/langgenius/dify/4-knowledge-base-and-rag-system
Translated at: 2026-05-27T08:44:37.874Z
Translation model: deepseek-chat
Source character count: 18738
Project: Dify (dify)

---

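Knowledge Base and Retrieval-Augmented Generation (RAG) System

Relevant source files

The following files were used to generate this wiki page: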

  • api/controllers/console/app/annotation.py
  • api/controllers/console/datasets/data_source.py
  • api/controllers/console/datasets/datasets_document.py
  • api/controllers/console/datasets/datasets_segments.py
  • api/controllers/console/datasets/external.py
  • api/controllers/console/datasets/hit_testing.py
  • api/controllers/console/datasets/hit_testing_base.py
  • api/controllers/console/datasets/metadata.py
  • api/controllers/service_api/app/annotation.py
  • api/controllers/service_api/dataset/dataset.py
  • api/controllers/service_api/dataset/document.py
  • api/controllers/service_api/dataset/hit_testing.py
  • api/controllers/service_api/dataset/metadata.py
  • api/controllers/service_api/dataset/segment.py
  • api/core/callback_handler/index_tool_callback_handler.py
  • api/core/helper/moderation.py
  • api/core/helper/ssrf_proxy.py
  • api/core/indexing_runner.py
  • api/core/rag/datasource/retrieval_service.py
  • api/core/rag/extractor/pdf_extractor.py
  • api/core/rag/extractor/word_extractor.py
  • api/core/rag/index_processor/index_processor_base.py
  • api/core/rag/index_processor/processor/paragraph_index_processor.py
  • api/core/rag/index_processor/processor/parent_child_index_processor.py
  • api/core/rag/index_processor/processor/qa_index_processor.py
  • api/core/rag/retrieval/dataset_retrieval.py
  • api/core/tools/utils/dataset_retriever/dataset_multi_retriever_tool.py
  • api/core/tools/utils/dataset_retriever/dataset_retriever_tool.py
  • api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py
  • api/services/annotation_service.py
  • api/services/dataset_service.py
  • api/services/entities/knowledge_entities/knowledge_entities.py
  • api/services/hit_testing_service.py
  • api/services/knowledge_service.py
  • api/services/summary_index_service.py
  • api/tests/test_containers_integration_tests/models/test_conversation_status_count.py
  • api/tests/unit_tests/controllers/console/datasets/test_hit_testing_base.py
  • api/tests/unit_tests/controllers/service_api/app/test_annotation.py
  • api/tests/unit_tests/controllers/service_api/dataset/test_dataset_segment.py
  • api/tests/unit_tests/controllers/service_api/dataset/test_document.py
  • api/tests/unit_tests/controllers/service_api/dataset/test_hit_testing.py
  • api/tests/unit_tests/controllers/web/test_web_login.py
  • api/tests/unit_tests/core/helper/test_ssrf_proxy.py
  • api/tests/unit_tests/core/rag/datasource/test_datasource_retrieval.py
  • api/tests/unit_tests/core/rag/extractor/test_word_extractor.py
  • api/tests/unit_tests/core/rag/indexing/processor/test_paragraph_index_processor.py
  • api/tests/unit_tests/core/rag/indexing/processor/test_parent_child_index_processor.py
  • api/tests/unit_tests/core/rag/indexing/processor/test_qa_index_processor.py
  • api/tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py
  • api/tests/unit_tests/core/tools/test_signature.py
  • api/tests/unit_tests/models/test_app_models.py
  • api/tests/unit_tests/services/test_annotation_service.py
  • api/tests/unit_tests/services/test_batch_indexing_base.py
  • api/tests/unit_tests/services/test_hit_testing_service_dump_records.py
  • api/tests/unit_tests/services/test_knowledge_service.py
  • api/tests/unit_tests/services/test_metadata_bug_complete.py
  • api/tests/unit_tests/services/test_metadata_nullable_bug.py

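This document describes the knowledge base and retrieval-augmented generation (RAG) system in Dify, which supports semantic search over document collections. The system handles document ingestion, segmentation, vectorization, storage, and retrieval. For how RAG retrieval is integrated into workflow execution, see Workflow Engine and Node Execution. For the configuration of the large language model (LLM) providers used in RAG operations, see LLM Integration and Model Management.

The RAG system is made up of the components described in the sections below.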

---

Dataset and Document Data Model

The knowledge base system organizes content as a hierarchy: Dataset → Document → DocumentSegment → (optionally) ChildChunk.

Core entity relationships
erDiagram
    "Dataset" ||--o{ "Document" : "contains"
    "Dataset" ||--o{ "DocumentSegment" : "owns"
    "Dataset" ||--o{ "DatasetPermission" : "has"
    "Dataset" ||--o{ "ExternalKnowledgeBindings" : "may have"
    "Dataset" }o--|| "DatasetCollectionBinding" : "uses"

    "Document" ||--o{ "DocumentSegment" : "split into"
    "Document" }o--|| "DatasetProcessRule" : "follows"
    "Document" }o--|| "UploadFile" : "references"

    "DocumentSegment" ||--o{ "ChildChunk" : "may have"
    "DocumentSegment" ||--o{ "SegmentAttachmentBinding" : "has"
    "DocumentSegment" ||--o{ "DocumentSegmentSummary" : "has summary"

    "Dataset" {
        uuid "id" PK
        string "name"
        string "indexing_technique"
        string "embedding_model"
        string "embedding_model_provider"
        jsonb "retrieval_model"
        string "provider"
        string "doc_form"
        string "permission"
        jsonb "summary_index_setting"
    }

    "Document" {
        uuid "id" PK
        uuid "dataset_id" FK
        string "data_source_type"
        string "doc_form"
        string "indexing_status"
    }

    "DocumentSegment" {
        uuid "id" PK
        uuid "document_id" FK
        string "content"
        string "answer"
        string "index_node_id"
        string "index_node_hash"
        boolean "enabled"
        int "position"
    }

    "ChildChunk" {
        uuid "id" PK
        uuid "segment_id" FK
        string "content"
        string "index_node_id"
        int "position"
    }

Sources: api/models/dataset.py:38-54, api/services/dataset_service.py:39-54
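
The hierarchy above can be pictured with a few plain dataclasses. This is only an illustrative sketch, not Dify's ORM models: field names follow the diagram, while the nested lists, the default `indexing_status` value, the `"upload_file"` source type, and the `enabled_segments` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from uuid import uuid4


def _new_id() -> str:
    return str(uuid4())


@dataclass
class ChildChunk:
    content: str
    position: int
    id: str = field(default_factory=_new_id)


@dataclass
class DocumentSegment:
    content: str
    position: int
    enabled: bool = True
    # Optional level of the hierarchy: stays empty unless child chunks are used.
    child_chunks: list[ChildChunk] = field(default_factory=list)
    id: str = field(default_factory=_new_id)


@dataclass
class Document:
    data_source_type: str
    doc_form: str
    indexing_status: str = "waiting"  # assumed placeholder status
    segments: list[DocumentSegment] = field(default_factory=list)
    id: str = field(default_factory=_new_id)


@dataclass
class Dataset:
    name: str
    indexing_technique: str
    documents: list[Document] = field(default_factory=list)
    id: str = field(default_factory=_new_id)

    def enabled_segments(self) -> list[DocumentSegment]:
        """Walk Dataset -> Document -> DocumentSegment, skipping disabled segments."""
        return [s for d in self.documents for s in d.segments if s.enabled]


dataset = Dataset(name="handbook", indexing_technique="high_quality")
doc = Document(data_source_type="upload_file", doc_form="text_model")
doc.segments.append(DocumentSegment(content="Intro paragraph", position=0))
doc.segments.append(DocumentSegment(content="Draft notes", position=1, enabled=False))
dataset.documents.append(doc)
print([s.content for s in dataset.enabled_segments()])
```

In the real schema the levels are linked by foreign keys (`dataset_id`, `document_id`, `segment_id`) rather than nested lists; nesting is used here only to keep the example self-contained.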

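Dataset configuration structure

A Dataset carries the key configuration that controls indexing and retrieval behavior:

| Field | Type | Purpose |
| --- | --- | --- |
| indexing_technique | "high_quality" or "economy" | High-quality mode uses vector embeddings; economy mode uses keywords only (api/core/rag/index_processor/constant/index_type.py:24-25) |
| embedding_model | string | Name of the model used to generate vectors (e.g. "text-embedding-3-small") |
| embedding_model_provider | string | Provider ID (e.g. "openai", "anthropic") |
| retrieval_model | JSONB | Search method, reranking configuration, top-k, score threshold (api/services/entities/knowledge_entities/knowledge_entities.py:73-75) |
| provider | "vendor" or "external" | Dify-internal storage versus an external API (api/services/dataset_service.py:50-51) |
| doc_form | string | "text_model", "qa_model", or "parent_child" (api/core/rag/index_processor/constant/index_type.py:18-20) |

Sources: api/models/dataset.py:38-54, api/services/entities/knowledge_entities/knowledge_entities.py:70-76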
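
To make the configuration fields concrete, here is a minimal sketch of a config object with a sanity check. It is not Dify's validation code. The allowed values for `indexing_technique` and `provider` come from the fields described above; the `DatasetConfig` class, the rule that high-quality indexing requires an embedding model, and the key names inside `retrieval_model` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

ALLOWED_TECHNIQUES = {"high_quality", "economy"}
ALLOWED_PROVIDERS = {"vendor", "external"}


@dataclass
class DatasetConfig:
    indexing_technique: str
    provider: str = "vendor"
    embedding_model: Optional[str] = None
    embedding_model_provider: Optional[str] = None
    # Free-form JSON-like settings; the keys used below are hypothetical.
    retrieval_model: dict[str, Any] = field(default_factory=dict)

    def validate(self) -> list[str]:
        """Return a list of human-readable problems (empty when the config looks sane)."""
        problems = []
        if self.indexing_technique not in ALLOWED_TECHNIQUES:
            problems.append(f"unknown indexing_technique: {self.indexing_technique!r}")
        if self.provider not in ALLOWED_PROVIDERS:
            problems.append(f"unknown provider: {self.provider!r}")
        # Assumed rule: vector indexing is pointless without an embedding model.
        if self.indexing_technique == "high_quality" and not (
            self.embedding_model and self.embedding_model_provider
        ):
            problems.append("high_quality indexing needs an embedding model and provider")
        return problems


config = DatasetConfig(
    indexing_technique="high_quality",
    embedding_model="text-embedding-3-small",
    embedding_model_provider="openai",
    retrieval_model={
        "search_method": "hybrid_search",  # hypothetical key and value
        "reranking_enable": False,
        "top_k": 4,
        "score_threshold": 0.5,
    },
)
print(config.validate())
```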

---

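Document Indexing Pipeline (ETL)

IndexingRunner orchestrates the extract-transform-load (ETL) pipeline for document ingestion. The process runs in three distinct stages. For details, see Document Indexing Pipeline.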

Indexing pipeline flow
flowchart TD
    Start["Document upload<br/>(UploadFile, Notion, Website)"] --> Extract["Extract stage<br/>IndexingRunner._extract()"]

    Extract --> ExtractFile["FileExtractor<br/>parses PDF/DOCX/TXT, etc."]
    Extract --> ExtractNotion["NotionExtractor<br/>fetches pages via API"]
    Extract --> ExtractWeb["WebsiteExtractor<br/>crawls and scrapes"]

    ExtractFile --> Transform
    ExtractNotion --> Transform
    ExtractWeb --> Transform

    Transform["Transform stage<br/>IndexingRunner._transform()"] --> IndexProcessor["IndexProcessorFactory<br/>selects the processor by doc_form"]

    IndexProcessor --> ParagraphProc["ParagraphIndexProcessor"]
    IndexProcessor --> ParentChildProc["ParentChildIndexProcessor"]
    IndexProcessor --> QAProc["QAIndexProcessor"]

    ParagraphProc --> Splitter["TextSplitter"]
    ParentChildProc --> Splitter
    QAProc --> Splitter

    Splitter --> Clean["CleanProcessor<br/>preprocessing rules"]
    Clean --> SaveSegments["_load_segments()<br/>saves DocumentSegment records"]

    SaveSegments --> Load["Load stage<br/>IndexingRunner._load()"]

    Load --> VectorLoad{"indexing_technique?"}
    VectorLoad -->|"high_quality"| VectorIndex["Vector.create()<br/>generates embeddings and inserts them into the vector database"]
    VectorLoad -->|"economy"| KeywordIndex["Keyword.create()<br/>inserts into the Jieba keyword table"]

Sources: api/core/indexing_runner.py:50-128, api/core/indexing_runner.py:93-119
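
The three stages in the flow above can be sketched as three small functions. This is a toy model under stated assumptions, not the logic of IndexingRunner: the function names, the blank-line splitter, the whitespace-only cleaning rule, and the placeholder "embedding" (two numbers derived from the text length) are all invented for illustration. Only the branching on `indexing_technique` mirrors the diagram.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    position: int
    content: str


def extract(raw_sources: list[str]) -> list[str]:
    """Extract stage: stand-in for the file, Notion, and website extractors."""
    return [text for text in raw_sources if text.strip()]


def transform(texts: list[str]) -> list[Segment]:
    """Transform stage: split on blank lines, then clean each chunk."""
    segments: list[Segment] = []
    for text in texts:
        for chunk in text.split("\n\n"):
            cleaned = re.sub(r"\s+", " ", chunk).strip()
            if cleaned:
                segments.append(Segment(position=len(segments), content=cleaned))
    return segments


def load(segments: list[Segment], indexing_technique: str) -> dict:
    """Load stage: build a vector index or a keyword table, depending on the technique."""
    if indexing_technique == "high_quality":
        # Placeholder embedding; a real pipeline would call an embedding model here.
        return {
            "vectors": {
                s.position: [float(len(s.content)), float(len(s.content.split()))]
                for s in segments
            }
        }
    if indexing_technique == "economy":
        table: dict[str, set[int]] = {}
        for s in segments:
            for word in re.findall(r"\w+", s.content.lower()):
                table.setdefault(word, set()).add(s.position)
        return {"keywords": table}
    raise ValueError(f"unsupported indexing_technique: {indexing_technique!r}")


segments = transform(extract(["Alpha  beta\n\nGamma", "   "]))
print(load(segments, "economy"))
```

Keeping the stages as separate functions means the segment records exist before any index is built, which matches the diagram's step of saving DocumentSegment records ahead of the load stage.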

---

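Index Processors and Segmentation Strategies

Dify supports three document segmentation strategies, each implemented as an IndexProcessor. For details, see Document Indexing Pipeline.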

Processor architecture
classDiagram
    class "BaseIndexProcessor" {
        <<abstract>>
        +extract(extract_setting) list
        +transform(documents, **kwargs) list
        +load(dataset, documents, **kwargs) None
    }

    class "ParagraphIndexProcessor" {
        +transform() list
    }

    class "ParentChildIndexProcessor" {
        +transform() list
    }

    class "QAIndexProcessor" {
        +transform() list
    }

    class "IndexProcessorFactory" {
        +init_index_processor() BaseIndexProcessor
    }

    "BaseIndexProcessor" <|-- "ParagraphIndexProcessor"
    "BaseIndexProcessor" <|-- "ParentChildIndexProcessor"
    "BaseIndexProcessor" <|-- "QAIndexProcessor"
    "IndexProcessorFactory" ..> "BaseIndexProcessor" : "creates"

Sources: api/core/rag/index_processor/index_processor_base.py:25-30, api/core/rag/index_processor/index_processor_factory.py:26-30
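
The class diagram describes an abstract base with one subclass per strategy and a factory that picks the subclass. A minimal sketch of that pattern follows. The class names are borrowed from the diagram, but the bodies, the simplified `transform` signature, the constructor argument, and the registry dictionary are assumptions; the parent-child processor is left out for brevity.

```python
from abc import ABC, abstractmethod


class BaseIndexProcessor(ABC):
    @abstractmethod
    def transform(self, documents: list[str]) -> list[str]:
        """Turn raw document texts into segment texts."""


class ParagraphIndexProcessor(BaseIndexProcessor):
    def transform(self, documents: list[str]) -> list[str]:
        # Toy rule: one segment per blank-line-separated paragraph.
        return [p.strip() for d in documents for p in d.split("\n\n") if p.strip()]


class QAIndexProcessor(BaseIndexProcessor):
    def transform(self, documents: list[str]) -> list[str]:
        # Toy rule: keep only lines that look like "Q: ... A: ..." pairs.
        return [
            line.strip()
            for d in documents
            for line in d.splitlines()
            if line.strip().startswith("Q:")
        ]


class IndexProcessorFactory:
    # Hypothetical registry keyed by doc_form.
    _registry = {
        "text_model": ParagraphIndexProcessor,
        "qa_model": QAIndexProcessor,
    }

    def __init__(self, doc_form: str) -> None:
        self._doc_form = doc_form

    def init_index_processor(self) -> BaseIndexProcessor:
        try:
            return self._registry[self._doc_form]()
        except KeyError:
            raise ValueError(f"no index processor for doc_form {self._doc_form!r}") from None


processor = IndexProcessorFactory("text_model").init_index_processor()
print(processor.transform(["First paragraph.\n\nSecond paragraph."]))
```

Callers depend only on `BaseIndexProcessor`, so adding a strategy means registering one more class rather than touching the pipeline code.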

---

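Vector Database Integration

Dify supports more than 23 vector database implementations through a factory-pattern abstraction. For details, see Vector Database Integration Architecture.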

Vector database factory architecture
classDiagram
    class "Vector" {
        -dataset: Dataset
        +create(documents) None
        +search_by_vector(query, **kwargs) list
        +search_by_full_text(query, **kwargs) list
    }

    class "BaseVector" {
        <<abstract>>
        +create(texts, embeddings, **kwargs)*
        +search_by_vector(query, **kwargs)*
    }

    class "WeaviateVector" {
        +search_by_vector()
    }

    class "QdrantVector" {
        +search_by_vector()
    }

    "Vector" --> "BaseVector" : "instantiated via VectorFactory"
    "BaseVector" <|-- "WeaviateVector"
    "BaseVector" <|-- "QdrantVector"

Sources: api/core/rag/datasource/vdb/vector_factory.py:15-20, api/core/rag/datasource/retrieval_service.py:15-23
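
As a rough illustration of the abstraction, the sketch below defines an abstract store with the two methods named in the diagram and plugs in a toy in-memory backend. It is not one of Dify's backends: `InMemoryVector`, `init_vector`, the `VECTOR_BACKENDS` registry, and the cosine-similarity ranking are assumptions chosen so the example runs without any database.

```python
import math
from abc import ABC, abstractmethod


class BaseVector(ABC):
    @abstractmethod
    def create(self, texts: list[str], embeddings: list[list[float]]) -> None:
        """Store texts together with their embeddings."""

    @abstractmethod
    def search_by_vector(self, query_vector: list[float], top_k: int = 4) -> list[tuple[str, float]]:
        """Return the top_k (text, score) pairs, best first."""


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVector(BaseVector):
    """Toy backend that keeps everything in a Python list."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float]]] = []

    def create(self, texts: list[str], embeddings: list[list[float]]) -> None:
        self._rows.extend(zip(texts, embeddings))

    def search_by_vector(self, query_vector: list[float], top_k: int = 4) -> list[tuple[str, float]]:
        scored = [(text, _cosine(query_vector, emb)) for text, emb in self._rows]
        return sorted(scored, key=lambda row: row[1], reverse=True)[:top_k]


# Hypothetical registry: a real deployment would map names to database clients.
VECTOR_BACKENDS = {"memory": InMemoryVector}


def init_vector(vector_type: str) -> BaseVector:
    if vector_type not in VECTOR_BACKENDS:
        raise ValueError(f"unsupported vector store: {vector_type!r}")
    return VECTOR_BACKENDS[vector_type]()


store = init_vector("memory")
store.create(["cat", "dog"], [[1.0, 0.0], [0.0, 1.0]])
print(store.search_by_vector([0.9, 0.1], top_k=1))
```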

---

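Retrieval Service and Search Methods

RetrievalService provides a unified interface to multiple search methods across all vector databases. For details, see Retrieval Strategies and Metadata Filtering.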

Retrieval method types
flowchart TD
    Query["User query"] --> Router{"retrieval_method"}

    Router -->|SEMANTIC_SEARCH| Semantic["embedding_search()"]
    Router -->|KEYWORD_SEARCH| Keyword["keyword_search()"]
    Router -->|HYBRID_SEARCH| Hybrid["Weighted fusion"]

    Semantic --> VectorSearch["Vector.search_by_vector()"]
    Keyword --> JiebaSearch["Keyword.search()"]

    Hybrid --> Semantic
    Hybrid --> Keyword
    Hybrid --> Fusion["Weighted score fusion"]

    VectorSearch --> Rerank{reranking_enable?}
    Fusion --> Rerank

    Rerank -->|true| RerankModel["DataPostProcessor.invoke()"]

Sources: api/core/rag/datasource/retrieval_service.py:92-171, api/core/rag/retrieval/retrieval_methods.py:25-30
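
The hybrid branch combines semantic and keyword scores into one ranking. One common way to do that is sketched below; it should be read as an assumption about how such a fusion could work rather than as Dify's formula. The function name, the `vector_weight` parameter, and the max-normalization of each score list before weighting are all choices made for this example.

```python
def weighted_fusion(
    semantic: dict[str, float],
    keyword: dict[str, float],
    vector_weight: float = 0.7,
) -> list[tuple[str, float]]:
    """Fuse two {segment_id: score} maps into one ranked list.

    Each map is scaled so its best hit scores 1.0, then the two are blended
    with vector_weight for semantic scores and the remainder for keyword scores.
    """

    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        top = max(scores.values())
        return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}

    sem, kw = normalize(semantic), normalize(keyword)
    fused = {
        seg_id: vector_weight * sem.get(seg_id, 0.0)
        + (1.0 - vector_weight) * kw.get(seg_id, 0.0)
        for seg_id in sem.keys() | kw.keys()
    }
    # Highest score first; ties are broken by segment id for a stable order.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


ranking = weighted_fusion(
    semantic={"a": 0.8, "b": 0.4},
    keyword={"b": 3.0, "c": 1.5},
    vector_weight=0.5,
)
print(ranking)
```

Normalizing first matters because keyword scores and cosine similarities live on different scales; without it the larger-valued list would dominate regardless of the weight.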

---

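Dataset Retrieval and Routing

For applications that contain multiple datasets, the DatasetRetrieval class orchestrates retrieval and routing. For details, see Retrieval Strategies and Metadata Filtering.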

Single-dataset versus multi-dataset retrieval
flowchart TD
    Start["Application query"] --> CheckStrategy{retrieval_mode}

    CheckStrategy -->|SINGLE| SingleRouter["Single-dataset router"]
    CheckStrategy -->|MULTIPLE| MultiSearch["Multi-dataset search"]

    SingleRouter --> RouterType{planning_strategy}
    RouterType -->|REACT_ROUTER| React["ReactMultiDatasetRouter"]
    RouterType -->|FUNCTION_CALL| FuncCall["FunctionCallMultiDatasetRouter"]

    MultiSearch --> Parallel["Parallel RetrievalService.retrieve()"]
    Parallel --> Merge["Merge and deduplicate"]

Sources: api/core/rag/retrieval/dataset_retrieval.py:157-170, api/core/rag/retrieval/dataset_retrieval.py:101-118
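
The MULTIPLE branch (query every dataset in parallel, then merge and deduplicate) can be sketched with a thread pool. Everything here is hypothetical scaffolding: `retrieve_from_dataset` is a stand-in for a per-dataset retrieval call and scores hits by word overlap, and deduplicating on `segment_id` while keeping the highest score is an assumed merge rule.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve_from_dataset(dataset: dict, query: str) -> list[dict]:
    """Hypothetical per-dataset search: score = fraction of query words found."""
    words = query.lower().split()
    hits = []
    for seg in dataset["segments"]:
        content = seg["content"].lower()
        score = sum(w in content for w in words) / len(words)
        if score > 0:
            hits.append({"segment_id": seg["id"], "content": seg["content"], "score": score})
    return hits


def multiple_retrieve(datasets: list[dict], query: str, top_k: int = 3) -> list[dict]:
    if not query.strip():
        return []
    # One worker per dataset; pool.map preserves the order of `datasets`.
    with ThreadPoolExecutor(max_workers=max(1, len(datasets))) as pool:
        per_dataset = list(pool.map(lambda d: retrieve_from_dataset(d, query), datasets))

    # Merge and deduplicate: keep the best-scoring hit for each segment id.
    best: dict[str, dict] = {}
    for hits in per_dataset:
        for hit in hits:
            seen = best.get(hit["segment_id"])
            if seen is None or hit["score"] > seen["score"]:
                best[hit["segment_id"]] = hit
    return sorted(best.values(), key=lambda h: (-h["score"], h["segment_id"]))[:top_k]


docs_a = {"segments": [{"id": "s1", "content": "vector search basics"},
                       {"id": "s2", "content": "billing"}]}
docs_b = {"segments": [{"id": "s1", "content": "vector search basics"},
                       {"id": "s3", "content": "keyword search"}]}
print([h["segment_id"] for h in multiple_retrieve([docs_a, docs_b], "vector search")])
```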

---

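Metadata Filtering

Metadata filtering restricts retrieval to document segments that match specific metadata conditions. For details, see Retrieval Strategies and Metadata Filtering.

Metadata filtering modes

Three filtering modes are supported:

  1. disabled: no metadata filtering is applied.
  2. manual: the user defines exact filter conditions through the UI or API.
  3. automatic: a large language model (LLM) analyzes the query and generates filter conditions automatically.

Sources: api/core/rag/retrieval/dataset_retrieval.py:130-157, api/core/app/app_config/entities.py:15-20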
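
The three modes can be illustrated with a small filter function. This is a sketch under stated assumptions, not Dify's implementation: the `Condition` class, the operator names, and the AND-combination of conditions are invented for the example, and in automatic mode a caller-supplied `condition_generator` function stands in for the LLM that would propose conditions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical operator set; each takes (actual_value, expected_value).
OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "is": lambda actual, expected: actual == expected,
    "is not": lambda actual, expected: actual != expected,
    "contains": lambda actual, expected: isinstance(actual, str) and expected in actual,
    ">": lambda actual, expected: actual is not None and actual > expected,
    "<": lambda actual, expected: actual is not None and actual < expected,
}


@dataclass
class Condition:
    name: str
    operator: str
    value: Any


def apply_metadata_filter(
    segments: list[dict],
    mode: str,
    conditions: Optional[list[Condition]] = None,
    query: str = "",
    condition_generator: Optional[Callable[[str], list[Condition]]] = None,
) -> list[dict]:
    if mode == "disabled":
        return list(segments)
    if mode == "automatic":
        # Stand-in for the LLM step: derive conditions from the query text.
        conditions = condition_generator(query) if condition_generator else []
    elif mode != "manual":
        raise ValueError(f"unknown metadata filtering mode: {mode!r}")
    conditions = conditions or []
    # A segment passes only if every condition holds (assumed AND semantics).
    return [
        seg for seg in segments
        if all(OPERATORS[c.operator](seg["metadata"].get(c.name), c.value) for c in conditions)
    ]


segments = [
    {"id": 1, "metadata": {"lang": "en", "year": 2023}},
    {"id": 2, "metadata": {"lang": "zh", "year": 2024}},
]
manual = apply_metadata_filter(segments, "manual", [Condition("lang", "is", "en")])
print([s["id"] for s in manual])
```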

---

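Summary Index System

The summary index system generates concise summaries of document segments and supports summary-based retrieval. For details, see Summary Index Generation.

Sources: api/services/dataset_service.py:103, api/tasks/regenerate_summary_index_task.py:1-50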

---

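External Knowledge Base Integration

External knowledge bases let Dify retrieve content from third-party APIs without managing the documents internally. For details, see External Knowledge Base Integration.

Sources: api/services/external_knowledge_service.py:85-89, api/core/rag/datasource/retrieval_service.py:173-195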

---

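API Endpoints

The knowledge base system exposes APIs for both console (administration) and service (application) use. For details, see Dataset Service and Document Management.

Sources: api/controllers/console/datasets/datasets_document.py:1-100, api/controllers/service_api/dataset/dataset.py:129-185, api/controllers/service_api/dataset/document.py:128-193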

---

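DatasetService Core Operations

The DatasetService class provides high-level operations for dataset management. For details, see Dataset Service and Document Management.

Sources: api/services/dataset_service.py:120-154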