【AI Agent 知识库】16-LlamaIndex框架详解

内容纲要

LlamaIndex 框架详解

目标:精通 LlamaIndex 数据框架,构建高性能 RAG 系统


目录

- LlamaIndex 概述
- 核心概念
- 数据加载与索引
- 查询引擎
- 高级特性
- 集成案例
- 面试高频问法
- 记忆要点

LlamaIndex 概述

1. 简介

【定义】

LlamaIndex 是一个数据框架,专为 LLM 应用设计。
它连接您的私有数据与大语言模型(LLM),构建可查询的索引。

【核心价值】

1. 数据摄入简单
   - 支持多种数据源
   - 自动解析和处理
   - 灵活的数据转换

2. 高效索引
   - 向量索引
   - 混合索引(向量 + 关键词)
   - 层级索引

3. 灵活查询
   - 多种检索策略
   - 流式输出
   - 模板化响应

4. 易于集成
   - 与 LangChain 无缝集成
   - 支持多种向量数据库
   - 支持多种嵌入模型
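
下面用一个最小可运行的示例,把"加载 → 索引 → 查询"的主流程串起来(基于 0.9.x 风格 API,目录与问题仅为示意):

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# 1. 数据摄入:从目录加载文档
documents = SimpleDirectoryReader("./data").load_data()

# 2. 索引:默认按句子切分并生成向量索引
index = VectorStoreIndex.from_documents(documents)

# 3. 查询:自然语言提问,返回带引用来源的回答
response = index.as_query_engine().query("这些文档主要讲了什么?")
print(response)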

2. 架构

┌─────────────────────────────────────────────────┐
│                   LlamaIndex                    │
├─────────────────────────────────────────────────┤
│                                                 │
│   ┌──────────────┐        ┌──────────────┐      │
│   │   Loaders    │        │  Documents   │      │
│   │  (数据加载)  │        │  (文档管理)  │      │
│   └──────┬───────┘        └──────┬───────┘      │
│          │                       │              │
│          ▼                       ▼              │
│   ┌─────────────────────────────────────┐       │
│   │           Nodes & Indexes           │       │
│   │  NodeParser / Chunking / Embedding  │       │
│   │  Vector / Hybrid / Graph Index      │       │
│   └──────────────────┬──────────────────┘       │
│                      ▼                          │
│   ┌─────────────────────────────────────┐       │
│   │            Query Engines            │       │
│   │    Vector / Hybrid / Graph Query    │       │
│   └──────────────────┬──────────────────┘       │
│                      ▼                          │
│   ┌─────────────────────────────────────┐       │
│   │        Response Synthesizer         │       │
│   └─────────────────────────────────────┘       │
└─────────────────────────────────────────────────┘

3. 与 LangChain 对比

维度     | LlamaIndex                     | LangChain
---------|--------------------------------|--------------------------
定位     | 数据框架                        | 应用框架
核心     | 索引构建                        | 链式编排
优势     | 数据处理、检索灵活               | Agent、链式编排
集成     | 可独立使用,或与 LangChain 集成   | 可将 LlamaIndex 作为检索组件
最佳场景 | RAG 优化                        | Agent 开发

核心概念

1. Document(文档)

from llama_index import Document

# 单个文档
document = Document(
    text="这是文档内容",
    doc_id="doc_001",
    metadata={
        "source": "file.pdf",
        "page": 1,
        "author": "张三"
    },
    excluded_embed_metadata_keys=["page"]  # 排除嵌入的元数据
)

# 批量文档
documents = [
    Document(text="文档1", metadata={"category": "技术"}),
    Document(text="文档2", metadata={"category": "产品"}),
    Document(text="文档3", metadata={"category": "运营"}),
]

2. Node(节点)

【节点类型】

1. Document
   - 完整文档(本身也是一种节点)
   - 携带文档级元数据

2. TextNode
   - 切分后的文本块
   - 记录与源文档、相邻节点的关系

3. IndexNode
   - 引用型节点,指向另一个索引或对象
   - 常用于递归检索、文档摘要等场景

from llama_index.schema import TextNode, NodeRelationship, RelatedNodeInfo

# 文本节点
node = TextNode(
    text="这是节点文本",
    id_="node_001",
    metadata={"source": "file.pdf", "chunk": 1},
    relationships={
        NodeRelationship.SOURCE: RelatedNodeInfo(node_id="doc_001"),
        NodeRelationship.NEXT: RelatedNodeInfo(node_id="node_002"),
        NodeRelationship.PREVIOUS: RelatedNodeInfo(node_id="node_000"),
    }
)

【节点关系】

- SOURCE: 源文档
- NEXT: 下一个节点
- PREVIOUS: 上一个节点
- PARENT: 父节点
- CHILD: 子节点
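
切分器在生成节点时会自动写入 SOURCE / NEXT / PREVIOUS 等关系,可以直接读取验证(示意代码,基于 0.9.x 风格 API,文本与参数仅为演示):

from llama_index import Document
from llama_index.node_parser import SentenceSplitter
from llama_index.schema import NodeRelationship

# 构造一篇较长的文档,保证会被切成多个节点
nodes = SentenceSplitter(chunk_size=128, chunk_overlap=0).get_nodes_from_documents(
    [Document(text="第一句。" * 100, doc_id="doc_001")]
)
print(len(nodes))  # 长文档被切成多个节点

first = nodes[0]
print(first.relationships.get(NodeRelationship.SOURCE))  # 指向 doc_001
print(first.relationships.get(NodeRelationship.NEXT))    # 指向下一个切块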

3. NodeParser(节点解析器)

from llama_index.node_parser import (
    SentenceSplitter,
    TokenTextSplitter,
    CodeSplitter,
    MarkdownNodeParser,
    SemanticSplitterNodeParser
)

# 句子分割器
sentence_splitter = SentenceSplitter(
    separator=" ",
    chunk_size=512,
    chunk_overlap=50
)

# Token 分割器
token_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=100
)

# Markdown 分割器(按标题结构切分,不需要 chunk_size 参数)
md_splitter = MarkdownNodeParser()

# 语义分割器(需要嵌入模型,按语义断点切分)
from llama_index.embeddings import OpenAIEmbedding

semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding()
)

# 代码分割器
code_splitter = CodeSplitter(
    language="python",
    chunk_lines=40,
    chunk_lines_overlap=5,
    max_chars=1500
)

# 使用示例
nodes = sentence_splitter.get_nodes_from_documents(documents)
print(f"生成了 {len(nodes)} 个节点")

4. Embedding(嵌入模型)

from llama_index.embeddings import (
    OpenAIEmbedding,
    HuggingFaceEmbedding,
    CohereEmbedding
)
from llama_index import ServiceContext

# OpenAI 嵌入
openai_embedding = OpenAIEmbedding(
    model="text-embedding-3-large",
    embed_batch_size=100
)

# HuggingFace 嵌入
hf_embedding = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-zh-v1.5"
)

# 本地嵌入(HuggingFaceEmbedding 同样支持加载本地模型目录)
local_embedding = HuggingFaceEmbedding(
    model_name="your-local-model-path"
)

# 配置服务上下文
service_context = ServiceContext.from_defaults(
    embed_model=openai_embedding
)
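
配置好嵌入模型后,可以直接调用它做一次嵌入,确认向量维度与向量库的配置一致(以上文的 openai_embedding 为例):

vector = openai_embedding.get_text_embedding("LlamaIndex 是一个数据框架")
print(len(vector))  # text-embedding-3-large 默认输出 3072 维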

5. VectorStore(向量存储)

from llama_index import VectorStoreIndex
from llama_index.vector_stores import (
    ChromaVectorStore,
    PineconeVectorStore,
    MilvusVectorStore,
    WeaviateVectorStore,
    QdrantVectorStore
)

# Chroma(本地)
import chromadb

chroma_client = chromadb.HttpClient(host="localhost", port=8000)
chroma_store = ChromaVectorStore(
    chroma_collection=chroma_client.get_or_create_collection("documents")
)

# Pinecone(云端)
pinecone_store = PineconeVectorStore(
    api_key="your-api-key",
    environment="us-west1-gcp",
    index_name="your-index-name"
)

# Milvus
milvus_store = MilvusVectorStore(
    host="localhost",
    port=19530,
    collection_name="documents",
    dim=1536  # 嵌入维度
)

# Qdrant
import qdrant_client

qdrant_store = QdrantVectorStore(
    client=qdrant_client.QdrantClient(url="http://localhost:6333", prefer_grpc=True),
    collection_name="documents"
)
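
向量存储本身不会自动生效,需要通过 StorageContext 挂接到索引上(以上文的 chroma_store 和前文加载的 documents 为例):

from llama_index import VectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults(vector_store=chroma_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)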

数据加载与索引

1. Loaders(数据加载器)

from llama_index import SimpleDirectoryReader, download_loader
from llama_index.readers import (
    SimpleWebPageReader,
    WikipediaReader,
    NotionPageReader
)
from llama_index.readers.file import PDFReader

# 目录加载
documents = SimpleDirectoryReader(
    input_dir="./data",
    recursive=True,
    required_exts=[".txt", ".md", ".pdf"]
).load_data()

# Web 加载
documents = SimpleWebPageReader(html_to_text=True).load_data([
    "https://example.com/article1",
    "https://example.com/article2"
])

# PDF 加载
from llama_index.readers.file import PyMuPDFReader
documents = PyMuPDFReader().load_data("document.pdf")

# Wikipedia 加载(需要安装 wikipedia 包)
documents = WikipediaReader().load_data(
    pages=["Artificial Intelligence", "Machine Learning"]
)

# Notion 加载(需要指定页面 ID 或数据库 ID)
documents = NotionPageReader(
    integration_token="your-token"
).load_data(page_ids=["your-page-id"])

# GitHub 仓库加载(GithubRepositoryReader 来自 LlamaHub,参数随版本略有差异)
GithubRepositoryReader = download_loader("GithubRepositoryReader")
documents = GithubRepositoryReader(
    owner="user",
    repo="repo"
).load_data(branch="main")

# download_loader:按名称从 LlamaHub 下载 Reader 类(而非下载文件)
SimpleWebPageReader = download_loader("SimpleWebPageReader")
documents = SimpleWebPageReader(html_to_text=True).load_data([
    "https://raw.githubusercontent.com/user/repo/main/README.md"
])

2. Transformation(数据转换)

from llama_index.node_parser import TokenTextSplitter
from llama_index.postprocessor import (
    SimilarityPostprocessor,
    KeywordNodePostprocessor
)

# Token 分割(索引前的节点切分)
parser = TokenTextSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=100
)

# 元数据裁剪:用 Document 自带字段控制哪些元数据参与嵌入 / 提示词
document.excluded_embed_metadata_keys = ["page_number", "source"]
document.excluded_llm_metadata_keys = ["page_number"]

# 相似度后处理(查询阶段过滤低于阈值的节点)
similarity_postprocessor = SimilarityPostprocessor(
    similarity_cutoff=0.8
)

# 关键词后处理(查询阶段按关键词保留节点)
keyword_postprocessor = KeywordNodePostprocessor(
    required_keywords=["重要", "重点"]
)

# 自定义转换(继承 NodeParser,实现 _parse_nodes)
from typing import Any, List, Sequence
from llama_index.node_parser import NodeParser
from llama_index.schema import BaseNode, TextNode

class CustomTransformation(NodeParser):
    def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any
    ) -> List[BaseNode]:
        result = []
        for node in nodes:
            # 自定义转换逻辑
            custom_text = self._transform(node.get_content())
            result.append(TextNode(text=custom_text))
        return result

    def _transform(self, text: str) -> str:
        # 实现转换(示例:全部转为大写)
        return text.upper()
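
用法与内置切分器一致(示意,具体以所装版本的 NodeParser 基类为准):

nodes = CustomTransformation().get_nodes_from_documents(documents)
print(f"自定义转换后得到 {len(nodes)} 个节点")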

3. 构建索引

from llama_index import Document, ServiceContext, VectorStoreIndex
from llama_index.storage import StorageContext
from llama_index.node_parser import SentenceSplitter
from llama_index.embeddings import OpenAIEmbedding

# 配置(chunk_size / chunk_overlap 直接交给节点解析器)
service_context = ServiceContext.from_defaults(
    node_parser=SentenceSplitter(chunk_size=512, chunk_overlap=50),
    embed_model=OpenAIEmbedding()
)

# 创建文档
documents = [
    Document(text="Python 是一种编程语言"),
    Document(text="Java 是另一种编程语言"),
    Document(text="Go 是 Google 开发的语言"),
]

# 构建索引
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context,
    show_progress=True
)

# 保存索引
index.storage_context.persist(persist_dir="./storage")

# 加载索引(从已持久化的目录恢复)
from llama_index import load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(
    storage_context=storage_context,
    service_context=service_context
)

4. 索引类型

from llama_index import (
    VectorStoreIndex,
    ListIndex,
    TreeIndex,
    KeywordTableIndex
)

# 向量索引
vector_index = VectorStoreIndex.from_documents(documents)

# 列表索引(简单列表)
list_index = ListIndex.from_documents(documents)

# 树索引(层级结构)
tree_index = TreeIndex.from_documents(documents)

# 关键词索引
keyword_index = KeywordTableIndex.from_documents(documents)

# 混合检索(向量 + 关键词):用 RouterQueryEngine 在两类索引间路由
from llama_index.query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector
from llama_index.tools import QueryEngineTool

query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        QueryEngineTool.from_defaults(
            query_engine=vector_index.as_query_engine(),
            description="语义相似度检索,适合开放式问题"
        ),
        QueryEngineTool.from_defaults(
            query_engine=keyword_index.as_query_engine(),
            description="关键词精确匹配,适合术语、名称类查询"
        ),
    ]
)

查询引擎

1. 基本查询

# 同步查询
query_engine = index.as_query_engine()
response = query_engine.query("什么是 Python?")

print("答案:", response.response)
print("来源:", response.source_nodes)

# 异步查询
import asyncio

async def async_query():
    query_engine = index.as_query_engine()
    response = await query_engine.aquery("什么是 Python?")
    return response

asyncio.run(async_query())

# 流式查询(创建查询引擎时开启 streaming)
streaming_engine = index.as_query_engine(streaming=True)
streaming_response = streaming_engine.query("介绍一下 Python?")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)

2. 查询配置

from llama_index.postprocessor import SimilarityPostprocessor

# 查询相关配置直接通过 as_query_engine 传入
query_engine = index.as_query_engine(
    similarity_top_k=5,                   # 检索数量
    vector_store_query_mode="hybrid",     # 查询模式(需向量库支持)
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.8)
    ],
    streaming=False
)

response = query_engine.query("Python 的特性有哪些?")

3. 查询模式

from llama_index.vector_stores.types import VectorStoreQueryMode

# 默认模式(稠密向量检索)
default_mode = VectorStoreQueryMode.DEFAULT

# 稀疏检索(关键词 / BM25 类,需向量库支持)
sparse_mode = VectorStoreQueryMode.SPARSE

# 混合检索(向量 + 关键词)
hybrid_mode = VectorStoreQueryMode.HYBRID

# 使用:在 as_retriever / as_query_engine 中指定
retriever = index.as_retriever(
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    similarity_top_k=5
)

4. 多次查询

# 并行查询:开启异步后用 asyncio.gather 并发执行
query_engine = index.as_query_engine(use_async=True)

queries = [
    "Python 是什么?",
    "Java 的特点是什么?",
    "Go 适合什么场景?"
]

# 并行执行
import asyncio
async def parallel_queries():
    tasks = [
        query_engine.aquery(q) for q in queries
    ]
    responses = await asyncio.gather(*tasks)
    return responses

asyncio.run(parallel_queries())

高级特性

1. 流式响应

from llama_index.response_synthesizers import get_response_synthesizer

# 流式响应生成器
response_synthesizer = get_response_synthesizer(
    response_mode="compact",
    streaming=True
)

query_engine = index.as_query_engine(
    response_synthesizer=response_synthesizer
)

# 流式输出
streaming_response = query_engine.query("介绍一下 Python 的历史")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)

2. 子查询

# 子问题查询:把复杂问题拆成若干子问题,分别检索后再汇总
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool(
            query_engine=query_engine,
            metadata=ToolMetadata(
                name="docs",
                description="编程语言相关文档"
            ),
        )
    ]
)

response = sub_question_engine.query("对比 Python 和 Go 在并发方面的差异")

3. 查询重写

# 查询重写提示词
query_rewrite_prompt = """
请将以下查询重写,使其更加清晰和具体:

原查询:{query_str}

重写后的查询:
"""

# 简单封装:先用 LLM 重写查询,再交给底层查询引擎
class QueryRewriteEngine:
    def __init__(self, llm, base_engine):
        self.llm = llm
        self.base_engine = base_engine

    def query(self, query_str):
        # 先重写查询
        rewritten = self.llm.complete(
            query_rewrite_prompt.format(query_str=query_str)
        ).text

        # 再用重写后的查询检索
        return self.base_engine.query(rewritten)

from llama_index.llms import OpenAI

rewrite_engine = QueryRewriteEngine(OpenAI(model="gpt-4"), query_engine)
response = rewrite_engine.query("Python")

4. 元数据过滤

# 在索引时添加元数据
documents = [
    Document(
        text="Python 3.12 的新特性",
        metadata={
            "version": "3.12",
            "category": "release_notes",
            "year": 2023
        }
    ),
]

# 查询时过滤
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter

metadata_filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="version", value="3.12"),
        ExactMatchFilter(key="year", value=2023),
    ]
)

# 执行过滤查询(过滤器在创建查询引擎时传入)
query_engine = index.as_query_engine(filters=metadata_filters)
response = query_engine.query("有哪些新特性?")

5. Reranking(重排)

from llama_index.postprocessor import LLMRerank
from llama_index.llms import OpenAI
from llama_index import ServiceContext

# LLM 重排
rerank_postprocessor = LLMRerank(
    service_context=ServiceContext.from_defaults(llm=OpenAI(model="gpt-4")),
    top_n=3  # 保留前 N 个
)

# 应用重排:作为节点后处理器挂到查询引擎上
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[rerank_postprocessor]
)

response = query_engine.query("Python 的并发特性")

集成案例

1. 完整 RAG 流程

from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    StorageContext,
    SimpleDirectoryReader
)
from llama_index.llms import OpenAI
from llama_index.embeddings import OpenAIEmbedding
from llama_index.node_parser import SentenceSplitter
from llama_index.vector_stores import ChromaVectorStore
import chromadb

# 1. 配置
llm = OpenAI(model="gpt-4")
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
node_parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=node_parser
)

# 2. 加载数据
documents = SimpleDirectoryReader(
    input_dir="./docs",
    recursive=True
).load_data()

# 3. 配置向量存储
chroma_client = chromadb.HttpClient(host="localhost", port=8000)
chroma_collection = chroma_client.get_or_create_collection("rag-documents")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# 4. 构建索引
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context,
    storage_context=storage_context,
    show_progress=True
)

# 5. 创建查询引擎
query_engine = index.as_query_engine(
    response_mode="compact"
)

# 6. 查询
response = query_engine.query("Python 的并发编程如何实现?")

print(f"答案:{response.response}")
print(f"引用来源数量:{len(response.source_nodes)}")
for i, node in enumerate(response.source_nodes, 1):
    print(f"{i}. {node.metadata.get('source')}: {node.text[:100]}...")

2. 多文档源 RAG

from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.readers import SimpleWebPageReader
from llama_index.readers.file import PDFReader

# 加载多个数据源
all_documents = []

# 1. 本地 Markdown 文件
docs_dir = SimpleDirectoryReader(
    input_dir="./markdown",
    required_exts=[".md"]
).load_data()
all_documents.extend(docs_dir)

# 2. PDF 文件
pdf_docs = PDFReader().load_data("./report.pdf")
all_documents.extend(pdf_docs)

# 3. 在线文档
web_docs = SimpleWebPageReader(html_to_text=True).load_data([
    "https://example.com/doc1",
    "https://example.com/doc2"
])
all_documents.extend(web_docs)

# 构建统一索引
service_context = ServiceContext.from_defaults()
index = VectorStoreIndex.from_documents(
    all_documents,
    service_context=service_context,
    show_progress=True
)

# 查询
query_engine = index.as_query_engine()
response = query_engine.query("你的问题是")

3. 与 LangChain 集成

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# 复用同一个 Chroma 服务,在 LangChain 侧构建向量库
lc_vector_store = Chroma(
    client=chroma_client,
    collection_name="langchain-documents",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large")
)

# 填充数据(直接取 LlamaIndex 加载的 Document 文本与元数据)
texts = [doc.text for doc in documents]
metadatas = [doc.metadata for doc in documents]
lc_vector_store.add_texts(texts, metadatas=metadatas)

# 创建 LangChain 检索问答链
retriever = lc_vector_store.as_retriever()
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=retriever,
    return_source_documents=True
)

# 查询
result = qa_chain.invoke({
    "query": "什么是 LlamaIndex?"
})

print(result["result"])
print("来源:", result["source_documents"])

面试高频问法

Q1: LlamaIndex 和 LangChain 的区别?

【标准回答】

LlamaIndex:
- 定位:数据框架
- 核心:索引构建、数据处理
- 优势:灵活的数据处理、高效检索
- 场景:RAG 优化

LangChain:
- 定位:应用框架
- 核心:链式编排、Agent 开发
- 优势:丰富的组件、易用性
- 场景:Agent 应用

协作:
- LlamaIndex 处理数据和索引
- LangChain 处理应用逻辑
- 两者可无缝集成
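
一种常见的协作方式是把 LlamaIndex 的查询引擎包装成 LangChain 工具,交给 Agent 编排(示意代码,假设 index 已按前文构建好):

from langchain.tools import Tool

query_engine = index.as_query_engine()

llamaindex_tool = Tool(
    name="knowledge_base",
    func=lambda q: str(query_engine.query(q)),  # Response 对象转为字符串
    description="查询内部知识库,输入自然语言问题"
)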

记忆要点

【LlamaIndex】

数据框架
RAG 优化
灵活索引

【核心组件】

Loaders:数据加载
Documents:文档管理
Nodes:节点解析
Embeddings:嵌入模型
VectorStore:向量存储

【索引类型】

Vector:向量索引
List:列表索引
Tree:树索引
Keyword:关键词索引
Hybrid:混合索引

【查询引擎】

同步查询
异步查询
流式查询
多次查询

文档版本: 1.0
最后更新: 2026-01-21
