16.3 · Embedchain Data Sources

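Long-Term Memory and Context Management · This page is a standalone chapter from the translated Mem0 DeepWiki. It preserves the original links, source anchors, module tags, and chapter hierarchy.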


Source references

embedchain/docs/api-reference/app/add.mdx
embedchain/docs/mint.json
embedchain/docs/use-cases/chatbots.mdx
embedchain/embedchain/chunkers/base_chunker.py
embedchain/embedchain/data_formatter/data_formatter.py
embedchain/embedchain/deployment/gradio.app/requirements.txt
embedchain/embedchain/embedchain.py
embedchain/embedchain/loaders/base_loader.py
embedchain/embedchain/loaders/web_page.py
embedchain/embedchain/vectordb/chroma.py

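Module tags

Memory and context
Testing, release, and operations
Storage and persistence
Interface and interaction
Model invocation and provider adapters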

Translation

Original DeepWiki page: https://deepwiki.com/mem0ai/mem0/16.3-embedchain-data-sources

Translated at: 2026-05-27T08:45:04.786Z

Translation model: deepseek-chat

Source character count: 7894

Project: Mem0 (mem0)

---

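Embedchain Data Sources

Database Model Overview

Embedchain uses SQLAlchemy to persist two main kinds of data:

Data sources: track every source added to the application (a URL, a file path, or raw text).
Chat history: records conversation turns so that chat can take earlier context into account.

Both models are defined in embedchain/embedchain/core/db/models.py. The EmbedChain and BaseLlm classes use them to manage application state and memory.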

Entity-Relationship Diagram

erDiagram
    "DataSource" {
        string id PK "UUID primary key"
        text app_id "Indexed, application identifier"
        text hash "Indexed, content hash used for deduplication"
        text type "Indexed, data source type"
        text value "Data source value or URL"
        text metadata "JSON metadata"
        integer is_uploaded "Upload status flag"
    }

    "ChatHistory" {
        string app_id PK "Application identifier"
        string id PK "Chat message identifier"
        string session_id PK "Indexed, session identifier"
        text question "User question"
        text answer "System answer"
        text metadata "JSON metadata"
        timestamp created_at "Indexed, auto-generated timestamp"
    }

    "DataSource" ||--o{ "ChatHistory" : "shares app_id"

Sources: embedchain/embedchain/core/db/models.py:10-31, embedchain/embedchain/embedchain.py:18-18
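To make the entity diagram concrete, the two tables can be sketched as plain SQLite DDL using only the Python standard library. This is a minimal illustration, not the project's SQLAlchemy code: the table names `data_sources` and `chat_history` and all index names are assumptions chosen for the example, and only the columns shown in the diagram are included.

```python
import sqlite3

# Hypothetical DDL that mirrors the entity diagram. Table and index names are
# illustrative assumptions and are not taken from models.py.
SCHEMA = """
CREATE TABLE data_sources (
    id          TEXT PRIMARY KEY,        -- UUID string
    app_id      TEXT,
    hash        TEXT,                    -- content hash used for deduplication
    type        TEXT,
    value       TEXT,
    metadata    TEXT,                    -- JSON-encoded metadata
    is_uploaded INTEGER DEFAULT 0
);
CREATE INDEX ix_data_sources_app_id ON data_sources (app_id);
CREATE INDEX ix_data_sources_hash   ON data_sources (hash);
CREATE INDEX ix_data_sources_type   ON data_sources (type);

CREATE TABLE chat_history (
    app_id     TEXT,
    id         TEXT,
    session_id TEXT,
    question   TEXT,
    answer     TEXT,
    metadata   TEXT,                     -- JSON-encoded metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (app_id, id, session_id) -- composite key, as in the diagram
);
CREATE INDEX ix_chat_history_session_id ON chat_history (session_id);
CREATE INDEX ix_chat_history_created_at ON chat_history (created_at);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create both tables and their indexes on the given connection."""
    conn.executescript(SCHEMA)


def list_tables(conn: sqlite3.Connection) -> list:
    """Return the names of all user tables, sorted alphabetically."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return sorted(row[0] for row in rows)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = list_tables(conn)
```

The two tables have no foreign key between them. As the diagram indicates, they are related only by sharing the same `app_id` value.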

Data Loading and Processing Flow

When a user calls app.add(), the system runs the source through a loader and chunker pipeline. It then persists the metadata to the database and the vectors to the vector store.

Data Processing Pipeline

graph TD
    "User"["User: EmbedChain.add(source)"] --> "Detector"["detect_datatype()"]
    "Detector" --> "DataFormatter"["DataFormatter class"]

    subgraph "Ingestion Engine"["Ingestion Engine"]
        "DataFormatter" --> "BaseLoader"["BaseLoader (e.g. WebPageLoader)"]
        "BaseLoader" --> "BaseChunker"["BaseChunker (e.g. WebPageChunker)"]
    end

    subgraph "Persistence Layer"["Persistence Layer"]
        "BaseChunker" --> "BaseVectorDB"["Vector store (e.g. ChromaDB)"]
        "BaseChunker" --> "SQLAlchemy"["SQLAlchemy (DataSource model)"]
    end

    "BaseVectorDB" --> "Search"["Semantic search"]
    "SQLAlchemy" --> "Dedup"["Hash-based deduplication"]

Sources: embedchain/embedchain/embedchain.py:117-182, embedchain/embedchain/data_formatter/data_formatter.py:12-35, embedchain/embedchain/chunkers/base_chunker.py:18-74, embedchain/embedchain/utils/misc.py:30-30
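The shape of this pipeline can be sketched with a few toy classes. Everything below is hypothetical and exists only to show the order of the stages (load, chunk, then persist to two places): `ToyLoader`, `ToyChunker`, and `ToyApp` are stand-ins invented for this example, the fixed-size character split is an assumption, and no real embedding or database call is made.

```python
from dataclasses import dataclass, field


class ToyLoader:
    """Stand-in for a BaseLoader subclass: turns a source into raw documents."""

    def load_data(self, source: str) -> list:
        # A real loader would fetch and clean the source. Here the source
        # string itself is treated as the document content.
        return [{"content": source, "meta_data": {"url": "local"}}]


@dataclass
class ToyChunker:
    """Stand-in for a BaseChunker subclass: splits documents into chunks."""

    chunk_size: int = 20

    def create_chunks(self, loader: ToyLoader, source: str) -> list:
        chunks = []
        for doc in loader.load_data(source):
            text = doc["content"]
            # Fixed-size character windows; an assumption made for brevity.
            for start in range(0, len(text), self.chunk_size):
                chunks.append(text[start:start + self.chunk_size])
        return chunks


@dataclass
class ToyApp:
    """Stand-in for the application object that owns both stores."""

    vector_store: list = field(default_factory=list)  # would hold embeddings
    data_sources: list = field(default_factory=list)  # would be SQL rows

    def add(self, source: str, data_type: str = "text") -> int:
        chunks = ToyChunker().create_chunks(ToyLoader(), source)
        self.vector_store.extend(chunks)  # vectors go to the vector store
        self.data_sources.append({"type": data_type, "value": source})  # metadata goes to SQL
        return len(chunks)


app = ToyApp()
chunk_count = app.add("a" * 45)  # 45 characters split into windows of 20
```

The point of the sketch is the split at the end of `add`: chunk content flows to one store and a single metadata record per source flows to the other.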

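Data Source Deduplication

BaseChunker generates doc_id and chunk_id from SHA-256 hashes of the content and the source URL embedchain/embedchain/chunkers/base_chunker.py:42-62. WebPageLoader also computes a doc_id from the cleaned content and the URL embedchain/embedchain/loaders/web_page.py:53-53. This hashing works together with the DataSource model to avoid embedding the same data twice embedchain/embedchain/core/db/models.py:14-14.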
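A minimal sketch of content-plus-URL hashing with the standard library follows. The helper names and the exact concatenation order (`content + url`) are assumptions made for illustration; the referenced source lines define the actual inputs and ordering.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the hexadecimal SHA-256 digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_doc_id(content: str, url: str) -> str:
    # Assumption: the document ID hashes the full content followed by the URL.
    return sha256_hex(content + url)


def make_chunk_ids(chunks: list, url: str) -> list:
    # Assumption: each chunk ID hashes the chunk text followed by the URL.
    return [sha256_hex(chunk + url) for chunk in chunks]


doc_id = make_doc_id("hello world", "https://example.com/a")
chunk_ids = make_chunk_ids(["hello", "world"], "https://example.com/a")
```

Because the URL is part of the hashed input, identical text loaded from two different URLs yields different IDs, while reloading the same page with unchanged content yields the same ID and can be skipped.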

DataSource Model

The DataSource model tracks metadata for all ingested data.

Schema Definition

Column	Type	Description
`id`	String	UUID v4 primary key `embedchain/embedchain/core/db/models.py:12-12`
`app_id`	Text	Identifier of a specific application instance `embedchain/embedchain/core/db/models.py:13-13`
`hash`	Text	Content hash used for deduplication `embedchain/embedchain/core/db/models.py:14-14`
`type`	Text	`DataType` (e.g. WEB_PAGE, PDF_FILE) `embedchain/embedchain/core/db/models.py:15-15`
`value`	Text	Original source URL or path `embedchain/embedchain/core/db/models.py:16-16`
`meta_data`	Text	JSON-encoded metadata associated with the source `embedchain/embedchain/core/db/models.py:17-17`

Sources: embedchain/embedchain/core/db/models.py:10-20
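One way a hash column like this can gate ingestion is a lookup before insert. The sketch below uses `sqlite3` rather than SQLAlchemy to stay dependency-free, and the table name `data_sources` and the function `add_data_source` are hypothetical. It is not the project's code path, only an illustration of hash-based deduplication against a metadata table.

```python
import json
import sqlite3
import uuid


def init_db(conn: sqlite3.Connection) -> None:
    """Create a minimal data source table with an index on the hash column."""
    conn.execute(
        "CREATE TABLE data_sources ("
        "id TEXT PRIMARY KEY, app_id TEXT, hash TEXT, "
        "type TEXT, value TEXT, metadata TEXT)"
    )
    conn.execute("CREATE INDEX ix_data_sources_hash ON data_sources (hash)")


def add_data_source(conn, app_id, source_hash, data_type, value, metadata=None) -> bool:
    """Insert a row unless this app already has a source with the same hash.

    Returns True when a row was inserted and False when it was a duplicate.
    """
    existing = conn.execute(
        "SELECT id FROM data_sources WHERE app_id = ? AND hash = ?",
        (app_id, source_hash),
    ).fetchone()
    if existing is not None:
        return False
    conn.execute(
        "INSERT INTO data_sources (id, app_id, hash, type, value, metadata) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), app_id, source_hash, data_type, value,
         json.dumps(metadata or {})),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
init_db(conn)
first = add_data_source(conn, "app-1", "abc123", "web_page", "https://example.com")
second = add_data_source(conn, "app-1", "abc123", "web_page", "https://example.com")
```

Scoping the lookup by `app_id` as well as `hash` means two applications can ingest the same source independently.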

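Chat History and Mem0 Integration

Embedchain manages conversation state through two mechanisms: the legacy ChatHistory model and the newer Mem0 integration.

Legacy Chat History

By default, conversations are stored in a local SQLite database. During initialization, the EmbedChain instance sets up the history by calling self.llm.update_history(app_id=self.config.id) embedchain/embedchain/embedchain.py:90-90.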
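As a rough picture of what per-application, per-session history in SQLite can look like, consider the sketch below. The table layout follows the ChatHistory entity shown earlier, but the table name and the helpers `add_turn` and `load_history` are hypothetical and do not correspond to methods on BaseLlm.

```python
import sqlite3
import uuid


def init_history(conn: sqlite3.Connection) -> None:
    """Create a minimal chat history table keyed by app, message, and session."""
    conn.execute(
        "CREATE TABLE chat_history ("
        "app_id TEXT, id TEXT, session_id TEXT, question TEXT, answer TEXT, "
        "created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, "
        "PRIMARY KEY (app_id, id, session_id))"
    )


def add_turn(conn, app_id: str, session_id: str, question: str, answer: str) -> None:
    """Record one question and answer pair for a session."""
    conn.execute(
        "INSERT INTO chat_history (app_id, id, session_id, question, answer) "
        "VALUES (?, ?, ?, ?, ?)",
        (app_id, str(uuid.uuid4()), session_id, question, answer),
    )
    conn.commit()


def load_history(conn, app_id: str, session_id: str, limit: int = 10) -> list:
    """Return up to `limit` most recent turns for a session, oldest first."""
    rows = conn.execute(
        "SELECT question, answer FROM chat_history "
        "WHERE app_id = ? AND session_id = ? "
        "ORDER BY rowid DESC LIMIT ?",  # rowid preserves insertion order
        (app_id, session_id, limit),
    ).fetchall()
    return list(reversed(rows))


conn = sqlite3.connect(":memory:")
init_history(conn)
add_turn(conn, "app-1", "s1", "q1", "a1")
add_turn(conn, "app-1", "s1", "q2", "a2")
add_turn(conn, "app-1", "s2", "other", "answer")
history = load_history(conn, "app-1", "s1")
```

Ordering by `rowid` rather than `created_at` is a choice made for the sketch, since SQLite's default timestamp has one-second resolution and consecutive turns can tie.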

Mem0 Memory Integration

Newer Embedchain applications can use Mem0 for long-term memory. The EmbedChain class includes memory_config and mem0_memory attributes embedchain/embedchain/embedchain.py:65-66.

Memory Flow in Chat

sequenceDiagram
    participant U as "User"
    participant A as "EmbedChain app"
    participant M as "Mem0 memory"
    participant L as "BaseLlm"

    U->>A: "app.chat(query)"
    A->>M: "Search for relevant memories"
    M-->>A: "Return memories"
    A->>L: "generate_prompt(query, contexts, memories)"
    L->>L: "Format prompt template"
    L-->>A: "Prompt including long-term memories"
    A->>U: "Final answer"

Sources: embedchain/embedchain/embedchain.py:65-66, embedchain/embedchain/llm/base.py:22-22
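The sequence above can be mimicked with a toy memory store and a stub model. Nothing here calls Mem0 or Embedchain: `ToyMemory`, its word-overlap search, the prompt template, and the `chat` function are all assumptions invented for the example. Only the name `generate_prompt` and its three arguments are borrowed from the diagram.

```python
PROMPT_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Relevant memories:\n{memories}\n\n"
    "Question: {query}\nAnswer:"
)


class ToyMemory:
    """Hypothetical stand-in for a long-term memory store."""

    def __init__(self) -> None:
        self._items = []

    def add(self, text: str) -> None:
        self._items.append(text)

    def search(self, query: str, limit: int = 3) -> list:
        # Naive relevance: number of lowercase words shared with the query.
        words = set(query.lower().split())
        scored = [(len(words & set(item.lower().split())), item)
                  for item in self._items]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [item for score, item in ranked if score > 0][:limit]


def generate_prompt(query: str, contexts: list, memories: list) -> str:
    """Fill the template with retrieved contexts and memories."""
    return PROMPT_TEMPLATE.format(
        context="\n".join(contexts),
        memories="\n".join(memories) or "(none)",
        query=query,
    )


def chat(query: str, contexts: list, memory: ToyMemory, llm) -> str:
    memories = memory.search(query)                      # search memories
    prompt = generate_prompt(query, contexts, memories)  # build the prompt
    return llm(prompt)                                   # final answer


memory = ToyMemory()
memory.add("user likes green tea")
memory.add("user lives in Lisbon")
# The stub model echoes its prompt so the injected memory is visible.
answer = chat("what tea do I like", ["Tea guide excerpt"], memory, llm=lambda p: p)
```

In a real deployment the search would be semantic rather than word overlap. The sketch only shows where retrieved memories enter the prompt relative to the retrieved contexts.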

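Database Management and Migrations

Embedchain uses a centralized engine setup and Alembic for schema migrations.

Engine Initialization

The database engine is typically initialized through SQLAlchemyManager. For legacy components, BaseVectorDB implementations (e.g. ChromaDB) manage their own persistence layer. By default, ChromaDB uses a local directory named db embedchain/embedchain/vectordb/chroma.py:57-61.

Vector Store Persistence (ChromaDB)

Metadata is stored in SQL, while vectors are managed by a provider such as ChromaDB.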

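# embedchain/embedchain/vectordb/chroma.py (excerpt)
# Fall back to a local "db" directory when no directory is configured.
if self.config.dir is None:
    self.config.dir = "db"
# Persist the store on disk at the resolved directory.
self.settings.persist_directory = self.config.dir
self.settings.is_persistent = True
self.client = chromadb.Client(self.settings)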

Sources: embedchain/embedchain/vectordb/chroma.py:29-64, embedchain/embedchain/embedchain.py:78-83