Knowledge Graph Construction and Applications
1. Knowledge Graph Overview
1.1 Definition of a Knowledge Graph
A knowledge graph is a system that represents knowledge as a graph structure, composed of entities (nodes) and relations (edges).
┌─────────────────────────────────────────────────────┐
│           Basic knowledge-graph structure           │
├─────────────────────────────────────────────────────┤
│                                                     │
│   (Entity)              (Entity)         (Entity)   │
│    Guido ──created──► Python ◄──depends_on── NumPy  │
│              (Relation)           (Relation)        │
│                                                     │
│   Entity + Relation + Property                      │
│                                                     │
└─────────────────────────────────────────────────────┘
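Before any tooling, the node-and-edge structure above can be captured with nothing more than (subject, relation, object) triples. A minimal standalone sketch (entity and relation names are illustrative):

```python
# A knowledge graph as a set of (subject, relation, object) triples,
# plus a property map keyed by entity name.
triples = {
    ("Guido van Rossum", "created", "Python"),
    ("NumPy", "depends_on", "Python"),
}
properties = {"Python": {"type": "product"}}

def objects_of(subject: str, relation: str) -> set:
    """All objects linked to `subject` via `relation`."""
    return {o for s, r, o in triples if s == subject and r == relation}

print(objects_of("Guido van Rossum", "created"))
```

Everything later in this document (storage, reasoning, retrieval) is ultimately bookkeeping and traversal over this triple structure.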
1.2 Types of Knowledge Graphs
| Type | Characteristics | Typical applications |
|---|---|---|
| Domain graph | Focused on one vertical domain | Domain-specific knowledge |
| General graph | Broad knowledge coverage | Open-domain QA, search |
| Commonsense graph | Everyday commonsense knowledge | Commonsense QA |
| Enterprise graph | Internal company knowledge | Enterprise knowledge management |
| Temporal graph | Carries time information | Event tracking, provenance |
1.3 The Value of Knowledge Graphs
┌─────────────────────────────────────────────────┐
│           Value of knowledge graphs             │
├─────────────────────────────────────────────────┤
│                                                 │
│  ✓ Structured knowledge storage                 │
│  ✓ Support for complex reasoning                │
│  ✓ Explainable results                          │
│  ✓ Knowledge fusion across sources              │
│  ✓ Enhanced retrieval                           │
│  ✓ Knowledge evolution over time                │
│                                                 │
└─────────────────────────────────────────────────┘
2. Entity Recognition and Extraction
2.1 Entity Type Definitions
```python
from typing import List, Dict, Optional
from dataclasses import dataclass
from enum import Enum

class EntityType(Enum):
    """Entity types."""
    PERSON = "person"
    ORGANIZATION = "organization"
    LOCATION = "location"
    PRODUCT = "product"
    CONCEPT = "concept"
    EVENT = "event"
    DATE = "date"
    NUMBER = "number"
    URL = "url"
    EMAIL = "email"
    PHONE = "phone"    # needed by the rule-based extractor below
    CUSTOM = "custom"

@dataclass
class Entity:
    """An entity (graph node)."""
    id: str
    text: str                            # surface text
    type: EntityType                     # entity type
    start_pos: int                       # start offset in the source text
    end_pos: int                         # end offset in the source text
    properties: Optional[Dict] = None    # extra attributes
    aliases: Optional[List[str]] = None  # alternative names
    confidence: float = 1.0              # extraction confidence
    source: Optional[str] = None         # provenance

    def __post_init__(self):
        if self.properties is None:
            self.properties = {}
        if self.aliases is None:
            self.aliases = []
```
2.2 Rule-Based Entity Recognition
```python
import re
import uuid

class RuleBasedEntityExtractor:
    """Entity extractor driven by regular-expression rules."""

    def __init__(self):
        # Regex rules per entity type
        self.rules = {
            EntityType.EMAIL: [
                r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
            ],
            EntityType.URL: [
                r'https?://[^\s<>"{}|\\^`\[\]]+',
                r'www\.[^\s<>"{}|\\^`\[\]]+',
                # Very permissive: also matches bare domains
                r'[A-Za-z0-9.-]+\.[A-Za-z]{2,}[^\s]*'
            ],
            EntityType.NUMBER: [
                r'\b\d+\.?\d*\b',
                r'\b\d{1,3}(,\d{3})*(\.\d+)?\b'
            ],
            EntityType.DATE: [
                r'\b\d{4}-\d{1,2}-\d{1,2}\b',     # YYYY-MM-DD
                r'\b\d{1,2}/\d{1,2}/\d{4}\b',     # MM/DD/YYYY
                r'\b\d{4}年\d{1,2}月\d{1,2}日\b'  # Chinese date format
            ],
            EntityType.PHONE: [
                r'\b1[3-9]\d{9}\b',               # Chinese mobile number
                r'\b\d{3}-\d{4}-\d{4}\b'
            ]
        }

    def extract(self, text: str) -> List[Entity]:
        """
        Extract entities.

        Args:
            text: input text
        Returns:
            list of entities
        """
        entities = []
        for entity_type, patterns in self.rules.items():
            for pattern in patterns:
                for match in re.finditer(pattern, text):
                    entity = Entity(
                        id=self._generate_id(),
                        text=match.group(),
                        type=entity_type,
                        start_pos=match.start(),
                        end_pos=match.end(),
                        properties=self._extract_properties(match.group(), entity_type)
                    )
                    entities.append(entity)
        # Deduplicate
        return self._deduplicate(entities)

    def _generate_id(self) -> str:
        """Generate an entity ID."""
        return str(uuid.uuid4())

    def _extract_properties(self, text: str, entity_type: EntityType) -> Dict:
        """Derive type-specific properties."""
        properties = {}
        if entity_type == EntityType.EMAIL:
            # Extract the mail domain
            if '@' in text:
                properties['domain'] = text.split('@')[1]
        elif entity_type == EntityType.URL:
            # Extract the URL host
            from urllib.parse import urlparse
            try:
                parsed = urlparse(text)
                properties['domain'] = parsed.netloc
            except ValueError:
                pass
        elif entity_type == EntityType.DATE:
            # Try to normalize the date (dateutil is an optional dependency)
            try:
                from dateutil.parser import parse
                properties['date'] = parse(text).isoformat()
            except (ImportError, ValueError):
                pass
        return properties

    def _deduplicate(self, entities: List[Entity]) -> List[Entity]:
        """Deduplicate by text and position."""
        seen = set()
        unique = []
        for entity in entities:
            key = (entity.text, entity.start_pos, entity.end_pos)
            if key not in seen:
                seen.add(key)
                unique.append(entity)
        return unique
```
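To see the rule-based approach in isolation, the standalone sketch below runs two of the patterns above (email and ISO-style date) over an illustrative sentence:

```python
import re

# A tiny subset of the rules above, keyed by a plain type label
PATTERNS = {
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "date": r"\b\d{4}-\d{1,2}-\d{1,2}\b",
}

def extract(text: str):
    """Return (entity_type, matched_text, start, end) tuples."""
    found = []
    for etype, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            found.append((etype, m.group(), m.start(), m.end()))
    return found

sample = "Contact alice@example.com before 2024-01-15."
print(extract(sample))
```

The character offsets are what allow downstream steps (deduplication, relation extraction) to anchor an entity back into the source text.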
2.3 Model-Based Entity Recognition
```python
class ModelBasedEntityExtractor:
    """Entity extractor backed by an NER model or an LLM."""

    def __init__(self, model_name: str = None, use_llm: bool = False):
        self.use_llm = use_llm
        if use_llm:
            # LLM-based extraction (openai>=1.0 client API)
            import openai
            self.client = openai.OpenAI()
        else:
            # Transformer NER pipeline
            try:
                from transformers import pipeline
                model_name = model_name or "dbmdz/bert-large-cased-finetuned-conll03-english"
                self.ner_pipeline = pipeline(
                    "ner",
                    model=model_name,
                    tokenizer=model_name,
                    aggregation_strategy="simple"
                )
            except ImportError:
                raise ImportError("transformers not installed")

    def extract(self, text: str) -> List[Entity]:
        """Extract entities."""
        if self.use_llm:
            return self._extract_with_llm(text)
        return self._extract_with_ner(text)

    def _extract_with_llm(self, text: str) -> List[Entity]:
        """Extract entities with an LLM."""
        prompt = f"""Extract entities from the following text.

Text:
{text}

Identify entities of these types:
- person
- organization
- location
- product
- concept
- date
- url
- email

Output JSON in this format:
{{
  "entities": [
    {{
      "text": "entity text",
      "type": "entity type",
      "start_pos": start offset,
      "end_pos": end offset
    }}
  ]
}}"""
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        import json
        data = json.loads(response.choices[0].message.content)
        entities = []
        for entity_data in data.get("entities", []):
            try:
                entity_type = EntityType(entity_data["type"])
            except ValueError:
                entity_type = EntityType.CUSTOM
            entities.append(Entity(
                id=self._generate_id(),
                text=entity_data["text"],
                type=entity_type,
                start_pos=entity_data["start_pos"],
                end_pos=entity_data["end_pos"]
            ))
        return entities

    def _extract_with_ner(self, text: str) -> List[Entity]:
        """Extract entities with the NER pipeline."""
        results = self.ner_pipeline(text)
        entities = []
        # Map CoNLL-style labels to our types
        type_mapping = {
            'PER': EntityType.PERSON,
            'ORG': EntityType.ORGANIZATION,
            'LOC': EntityType.LOCATION,
            'MISC': EntityType.CONCEPT,
            'DATE': EntityType.DATE,
            'NUMBER': EntityType.NUMBER
        }
        for result in results:
            if result['entity_group'] == 'O':
                continue
            entity_type = type_mapping.get(result['entity_group'], EntityType.CUSTOM)
            entities.append(Entity(
                id=self._generate_id(),
                text=result['word'],
                type=entity_type,
                start_pos=result['start'],
                end_pos=result['end'],
                confidence=result['score']
            ))
        return entities

    def _generate_id(self) -> str:
        import uuid
        return str(uuid.uuid4())
```
2.4 Entity Linking
```python
class EntityLinker:
    """Links extracted mentions to entities in a knowledge base."""

    def __init__(self, knowledge_base=None):
        self.knowledge_base = knowledge_base

    def link_entities(
        self,
        entities: List[Entity],
        candidates: List[Entity] = None
    ) -> List[Entity]:
        """
        Link entities to the knowledge base.

        Args:
            entities: mentions to link
            candidates: candidate entities (from the knowledge base)
        Returns:
            the entities, annotated with link information
        """
        if candidates is None and self.knowledge_base:
            candidates = self.knowledge_base.search_all_entities()
        for entity in entities:
            # Find the most similar candidates
            matches = self._find_matches(entity, candidates)
            if matches:
                # Link to the best match (link_* attributes are set dynamically)
                best = matches[0]
                entity.linked_id = best["candidate"].id
                entity.linked_text = best["candidate"].text
                entity.link_confidence = best["score"]
        return entities

    def _find_matches(
        self,
        entity: Entity,
        candidates: List[Entity],
        top_k: int = 3
    ) -> List[Dict]:
        """Rank candidate entities that match the mention."""
        matches = []
        for candidate in candidates:
            # Text similarity
            text_sim = self._text_similarity(entity.text, candidate.text)
            # Type agreement
            type_match = entity.type == candidate.type
            if type_match and text_sim > 0.8:
                matches.append({"candidate": candidate, "score": text_sim})
        # Sort by score, best first
        matches.sort(key=lambda x: x['score'], reverse=True)
        return matches[:top_k]

    def _text_similarity(self, text1: str, text2: str) -> float:
        """Token-level Jaccard similarity."""
        set1 = set(text1.lower().split())
        set2 = set(text2.lower().split())
        if not set1 or not set2:
            return 0.0
        return len(set1 & set2) / len(set1 | set2)
```
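The token-level Jaccard score used by `_text_similarity` can be tried on its own. A standalone sketch, with an illustrative 0.8 threshold matching the linker above:

```python
def jaccard(text1: str, text2: str) -> float:
    """Token-level Jaccard similarity in [0, 1]."""
    set1, set2 = set(text1.lower().split()), set(text2.lower().split())
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

def best_link(mention: str, candidates: list, threshold: float = 0.8):
    """Return the best-scoring candidate at or above the threshold, or None."""
    if not candidates:
        return None
    score, cand = max((jaccard(mention, c), c) for c in candidates)
    return cand if score >= threshold else None

print(best_link("Guido van Rossum", ["Guido van Rossum", "Tim Peters"]))
```

Note that whole-token Jaccard is brittle for short mentions ("Guido" vs "Guido van Rossum" scores only 1/3), which is why production linkers usually add alias tables or embedding similarity.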
3. Relation Extraction
3.1 Relation Definitions
```python
class RelationType(Enum):
    """Relation types."""
    # Generic
    RELATED_TO = "related_to"
    PART_OF = "part_of"
    INSTANCE_OF = "instance_of"
    SIMILAR_TO = "similar_to"
    # Person
    WORKS_AT = "works_at"
    FOUNDED = "founded"
    COLLABORATES_WITH = "collaborates_with"
    STUDENT_OF = "student_of"
    # Organization
    SUBSIDIARY_OF = "subsidiary_of"
    PARTNERS_WITH = "partners_with"
    COMPETITOR_OF = "competitor_of"
    # Product
    VERSION_OF = "version_of"
    DEPENDS_ON = "depends_on"
    COMPATIBLE_WITH = "compatible_with"
    # Temporal
    HAPPENED_BEFORE = "happened_before"
    HAPPENED_AFTER = "happened_after"
    # Fallback
    CUSTOM = "custom"

@dataclass
class Relation:
    """A relation (graph edge)."""
    id: str
    subject_id: str                    # subject entity ID
    object_id: str                     # object entity ID
    relation_type: RelationType        # relation type
    properties: Optional[Dict] = None  # edge attributes
    confidence: float = 1.0            # extraction confidence
    source: Optional[str] = None       # provenance

    def __post_init__(self):
        if self.properties is None:
            self.properties = {}
```
3.2 Rule-Based Relation Extraction
```python
class RuleBasedRelationExtractor:
    """Relation extractor driven by textual patterns."""

    def __init__(self):
        # Relation patterns; `{}` marks the subject / object slots
        self.patterns = [
            # Person works at organization
            {
                "type": RelationType.WORKS_AT,
                "patterns": [
                    r'({})\s+(is|are|was|were)\s+(a|an|the|at)\s+({})',
                    r'({})\s+(works|worked)\s+(at|for)\s+({})'
                ],
                "subject_type": EntityType.PERSON,
                "object_type": EntityType.ORGANIZATION
            },
            # Founding
            {
                "type": RelationType.FOUNDED,
                "patterns": [
                    r'({})\s+founded\s+({})',
                    # Note: in this passive form the captured subject/object
                    # are reversed and should be swapped downstream
                    r'({})\s+was\s+founded\s+by\s+({})'
                ],
                "subject_type": EntityType.PERSON,
                "object_type": EntityType.ORGANIZATION
            },
            # Dependency
            {
                "type": RelationType.DEPENDS_ON,
                "patterns": [
                    r'({})\s+(uses|used|depends\s+on)\s+({})'
                ],
                "subject_type": EntityType.PRODUCT,
                "object_type": EntityType.PRODUCT
            },
            # Versioning
            {
                "type": RelationType.VERSION_OF,
                "patterns": [
                    r'({})\s+is\s+(a|an)\s+version\s+of\s+({})',
                    r'({})\s+v\d+(\.\d+)*\s+-\s+({})'
                ],
                "subject_type": EntityType.PRODUCT,
                "object_type": EntityType.PRODUCT
            }
        ]

    def extract(
        self,
        text: str,
        entities: List[Entity]
    ) -> List[Relation]:
        """
        Extract relations.

        Args:
            text: input text
            entities: previously recognized entities
        Returns:
            list of relations
        """
        relations = []
        # Group entities by type
        entities_by_type = {}
        for entity in entities:
            entities_by_type.setdefault(entity.type, []).append(entity)
        # Try each pattern group
        for relation_config in self.patterns:
            subject_type = relation_config["subject_type"]
            object_type = relation_config["object_type"]
            # Skip if no entities of the required types exist
            if subject_type not in entities_by_type or \
               object_type not in entities_by_type:
                continue
            for pattern in relation_config["patterns"]:
                matches = self._match_pattern(text, pattern, relation_config["type"])
                relations.extend(matches)
        # Deduplicate
        return self._deduplicate(relations)

    def _match_pattern(
        self,
        text: str,
        pattern: str,
        relation_type: RelationType
    ) -> List[Relation]:
        """Match one relation pattern against the text."""
        # Replace the `{}` slots with non-greedy capture groups
        regex_pattern = pattern.replace('{}', '(.+?)')
        matches = []
        for match in re.finditer(regex_pattern, text, re.IGNORECASE):
            # group(1) is the subject; the last group is the object
            # (the patterns contain extra groups for verb alternations)
            subject_text = match.group(1)
            object_text = match.group(match.lastindex)
            relation = Relation(
                id=self._generate_id(),
                # Placeholder IDs; resolve to real entity IDs downstream
                subject_id=self._generate_id(),
                object_id=self._generate_id(),
                relation_type=relation_type,
                properties={
                    "subject_text": subject_text,
                    "object_text": object_text,
                    "pattern": pattern
                },
                confidence=0.8
            )
            matches.append(relation)
        return matches

    def _deduplicate(self, relations: List[Relation]) -> List[Relation]:
        """Deduplicate relations."""
        seen = set()
        unique = []
        for relation in relations:
            key = (relation.subject_id, relation.object_id, relation.relation_type)
            if key not in seen:
                seen.add(key)
                unique.append(relation)
        return unique

    def _generate_id(self) -> str:
        import uuid
        return str(uuid.uuid4())
```
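Stripped of the class machinery, the pattern-matching step reduces to a regex with subject and object capture groups. A standalone sketch (the sentence and pattern are illustrative):

```python
import re

def match_relation(text: str, pattern: str, relation_type: str):
    """Extract (subject, relation, object) triples via capture groups."""
    triples = []
    for m in re.finditer(pattern, text, re.IGNORECASE):
        # group(1) is the subject; the last group is the object, so
        # patterns may contain extra groups for verb alternations.
        triples.append((m.group(1), relation_type, m.group(m.lastindex)))
    return triples

text = "Larry Page founded Google."
print(match_relation(text, r"(\w[\w ]*?) founded (\w+)", "founded"))
```

Using `m.lastindex` rather than a fixed group number is what lets one extraction routine serve patterns with different numbers of intermediate groups.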
3.3 Model-Based Relation Extraction
```python
class ModelBasedRelationExtractor:
    """Relation extractor backed by an LLM."""

    def __init__(self, llm=None):
        # `llm` is assumed to expose an OpenAI-style create() method
        self.llm = llm

    def extract(
        self,
        text: str,
        entities: List[Entity]
    ) -> List[Relation]:
        """
        Extract relations with an LLM.

        Args:
            text: input text
            entities: previously recognized entities
        Returns:
            list of relations
        """
        if not entities:
            return []
        # Enumerate the entities for the prompt
        entity_list = "\n".join(
            f"{i+1}. {e.text} ({e.type.value})"
            for i, e in enumerate(entities)
        )
        prompt = f"""Extract the relations between entities in the following text.

Text:
{text}

Entities:
{entity_list}

Identify the relations between the entities and state the relation type.
Relation types:
- works_at
- founded
- collaborates_with
- part_of
- depends_on
- version_of
- related_to

Output JSON in this format:
{{
  "relations": [
    {{
      "subject": "subject entity text",
      "object": "object entity text",
      "relation_type": "relation type",
      "confidence": 0.95
    }}
  ]
}}"""
        response = self.llm.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        import json
        data = json.loads(response.choices[0].message.content)
        relations = []
        # Map entity text back to entity IDs
        entity_map = {e.text: e.id for e in entities}
        for rel_data in data.get("relations", []):
            try:
                relation_type = RelationType(rel_data["relation_type"])
            except ValueError:
                relation_type = RelationType.CUSTOM
            subject_id = entity_map.get(rel_data["subject"])
            object_id = entity_map.get(rel_data["object"])
            if subject_id and object_id:
                relations.append(Relation(
                    id=self._generate_id(),
                    subject_id=subject_id,
                    object_id=object_id,
                    relation_type=relation_type,
                    confidence=rel_data.get("confidence", 1.0)
                ))
        return relations

    def _generate_id(self) -> str:
        import uuid
        return str(uuid.uuid4())
```
4. Knowledge Graph Storage
4.1 In-Memory Storage
```python
from typing import Dict, List, Set, Tuple
from collections import deque

class InMemoryKnowledgeGraph:
    """A simple in-memory knowledge graph."""

    def __init__(self):
        self.entities: Dict[str, Entity] = {}
        self.relations: Dict[str, Relation] = {}
        self.adjacency_list: Dict[str, Set[str]] = {}  # subject -> objects

    def add_entity(self, entity: Entity) -> bool:
        """Add an entity; returns False if the ID already exists."""
        if entity.id in self.entities:
            return False
        self.entities[entity.id] = entity
        self.adjacency_list[entity.id] = set()
        return True

    def add_relation(self, relation: Relation) -> bool:
        """Add a relation; both endpoints must already exist."""
        if relation.id in self.relations:
            return False
        if relation.subject_id not in self.entities or \
           relation.object_id not in self.entities:
            return False
        self.relations[relation.id] = relation
        self.adjacency_list[relation.subject_id].add(relation.object_id)
        return True

    def get_entity(self, entity_id: str) -> Optional[Entity]:
        """Look up an entity by ID."""
        return self.entities.get(entity_id)

    def get_entities(
        self,
        entity_type: EntityType = None
    ) -> List[Entity]:
        """List entities, optionally filtered by type."""
        entities = list(self.entities.values())
        if entity_type:
            entities = [e for e in entities if e.type == entity_type]
        return entities

    def get_relations(
        self,
        subject_id: str = None,
        object_id: str = None,
        relation_type: RelationType = None
    ) -> List[Relation]:
        """List relations, optionally filtered by endpoint or type."""
        relations = list(self.relations.values())
        if subject_id:
            relations = [r for r in relations if r.subject_id == subject_id]
        if object_id:
            relations = [r for r in relations if r.object_id == object_id]
        if relation_type:
            relations = [r for r in relations if r.relation_type == relation_type]
        return relations

    def get_neighbors(self, entity_id: str) -> List[Entity]:
        """Entities directly reachable from the given entity."""
        if entity_id not in self.adjacency_list:
            return []
        neighbor_ids = self.adjacency_list[entity_id]
        return [self.entities[eid] for eid in neighbor_ids if eid in self.entities]

    def find_path(
        self,
        start_id: str,
        end_id: str,
        max_depth: int = 5
    ) -> List[str]:
        """Shortest path between two entities via BFS."""
        if start_id not in self.entities or end_id not in self.entities:
            return []
        queue = deque([(start_id, [start_id])])
        visited = {start_id}
        while queue:
            current_id, path = queue.popleft()
            if current_id == end_id:
                return path
            if len(path) >= max_depth:
                continue
            for neighbor_id in self.adjacency_list.get(current_id, []):
                if neighbor_id not in visited:
                    visited.add(neighbor_id)
                    queue.append((neighbor_id, path + [neighbor_id]))
        return []

    def to_json(self) -> Dict:
        """Export the graph as a JSON-serializable dict."""
        return {
            "entities": [
                {
                    "id": e.id,
                    "text": e.text,
                    "type": e.type.value,
                    "properties": e.properties
                }
                for e in self.entities.values()
            ],
            "relations": [
                {
                    "id": r.id,
                    "subject_id": r.subject_id,
                    "object_id": r.object_id,
                    "type": r.relation_type.value,
                    "properties": r.properties
                }
                for r in self.relations.values()
            ]
        }
```
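The `find_path` BFS can be exercised independently on a plain adjacency dict. A minimal standalone sketch (the graph contents are illustrative):

```python
from collections import deque

def find_path(adj: dict, start: str, end: str, max_depth: int = 5):
    """Shortest path from start to end via BFS, or [] if none exists."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == end:
            return path
        if len(path) >= max_depth:
            continue  # depth cap keeps large graphs tractable
        for nxt in adj.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [nxt]))
    return []

adj = {"Guido": {"Python"}, "Python": {"NumPy"}, "NumPy": set()}
print(find_path(adj, "Guido", "NumPy"))
```

BFS is the right default here because the first time the end node is dequeued, the path found is guaranteed to be a shortest one.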
4.2 Neo4j Storage
```python
import json

class Neo4jKnowledgeGraph:
    """Knowledge graph backed by Neo4j."""

    def __init__(
        self,
        uri: str = "bolt://localhost:7687",
        username: str = "neo4j",
        password: str = None
    ):
        from neo4j import GraphDatabase
        self.driver = GraphDatabase.driver(uri, auth=(username, password))

    def add_entity(self, entity: Entity) -> bool:
        """Add an entity as a node."""
        try:
            with self.driver.session() as session:
                cypher = """
                MERGE (e:Entity {id: $id})
                SET e.text = $text,
                    e.type = $type,
                    e.properties = $properties
                RETURN e
                """
                result = session.run(
                    cypher,
                    id=entity.id,
                    text=entity.text,
                    type=entity.type.value,
                    # Neo4j property values must be primitives or arrays,
                    # so serialize the dict as a JSON string
                    properties=json.dumps(entity.properties)
                )
                return result.single() is not None
        except Exception as e:
            print(f"Failed to add entity: {e}")
            return False

    def add_relation(self, relation: Relation) -> bool:
        """Add a relation as an edge."""
        try:
            with self.driver.session() as session:
                cypher = """
                MATCH (s:Entity {id: $subject_id})
                MATCH (o:Entity {id: $object_id})
                MERGE (s)-[r:RELATION {
                    id: $id,
                    type: $type,
                    properties: $properties
                }]->(o)
                RETURN r
                """
                result = session.run(
                    cypher,
                    subject_id=relation.subject_id,
                    object_id=relation.object_id,
                    id=relation.id,
                    type=relation.relation_type.value,
                    properties=json.dumps(relation.properties)
                )
                return result.single() is not None
        except Exception as e:
            print(f"Failed to add relation: {e}")
            return False

    def get_entity(self, entity_id: str) -> Optional[Dict]:
        """Fetch an entity node."""
        try:
            with self.driver.session() as session:
                cypher = """
                MATCH (e:Entity {id: $id})
                RETURN e
                """
                record = session.run(cypher, id=entity_id).single()
                return dict(record["e"]) if record else None
        except Exception as e:
            print(f"Failed to get entity: {e}")
            return None

    def get_relations(
        self,
        subject_id: str = None,
        relation_type: str = None
    ) -> List[Dict]:
        """Fetch relations, optionally filtered."""
        try:
            with self.driver.session() as session:
                cypher = "MATCH (s:Entity)-[r:RELATION]->(o:Entity)"
                conditions = []
                params = {}
                if subject_id:
                    conditions.append("s.id = $subject_id")
                    params["subject_id"] = subject_id
                if relation_type:
                    conditions.append("r.type = $type")
                    params["type"] = relation_type
                if conditions:
                    cypher += " WHERE " + " AND ".join(conditions)
                cypher += " RETURN s, r, o"
                result = session.run(cypher, **params)
                return [dict(record) for record in result]
        except Exception as e:
            print(f"Failed to get relations: {e}")
            return []

    def query(self, cypher: str, **params) -> List[Dict]:
        """Run an arbitrary Cypher query."""
        try:
            with self.driver.session() as session:
                result = session.run(cypher, **params)
                return [dict(record) for record in result]
        except Exception as e:
            print(f"Query failed: {e}")
            return []

    def close(self):
        """Close the driver connection."""
        self.driver.close()
```
5. Knowledge Reasoning
5.1 Rule-Based Reasoning
```python
class RuleBasedReasoner:
    """Infers new relations from hand-written rules."""

    def __init__(self):
        # Inference rules
        self.rules = [
            # Transitivity
            {
                "name": "transitivity",
                "description": "If A related_to B and B related_to C, then A related_to C",
                "pattern": {
                    "relation1": RelationType.RELATED_TO,
                    "relation2": RelationType.RELATED_TO,
                    "inferred": RelationType.RELATED_TO
                },
                "confidence": 0.6
            },
            # Hierarchy
            {
                "name": "hierarchy",
                "description": "If A part_of B and B part_of C, then A part_of C",
                "pattern": {
                    "relation1": RelationType.PART_OF,
                    "relation2": RelationType.PART_OF,
                    "inferred": RelationType.PART_OF
                },
                "confidence": 0.8
            }
        ]

    def infer(
        self,
        kg: InMemoryKnowledgeGraph
    ) -> List[Relation]:
        """
        Run the rules against the graph.

        Args:
            kg: the knowledge graph
        Returns:
            newly inferred relations (not yet added to the graph)
        """
        inferred_relations = []
        for rule in self.rules:
            pattern = rule["pattern"]
            # First hop
            relations = kg.get_relations(relation_type=pattern["relation1"])
            for rel1 in relations:
                # Second hop, starting from the object of the first
                rel2_candidates = kg.get_relations(
                    subject_id=rel1.object_id,
                    relation_type=pattern["relation2"]
                )
                for rel2 in rel2_candidates:
                    # Skip if the inferred relation already exists
                    existing = kg.get_relations(
                        subject_id=rel1.subject_id,
                        object_id=rel2.object_id,
                        relation_type=pattern["inferred"]
                    )
                    if not existing:
                        inferred_relations.append(Relation(
                            id=self._generate_id(),
                            subject_id=rel1.subject_id,
                            object_id=rel2.object_id,
                            relation_type=pattern["inferred"],
                            properties={
                                "inferred": True,
                                "rule": rule["name"],
                                "source_relations": [rel1.id, rel2.id]
                            },
                            confidence=rule["confidence"]
                        ))
        return inferred_relations

    def _generate_id(self) -> str:
        import uuid
        return str(uuid.uuid4())
```
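The transitivity rule can be demonstrated in isolation: given a set of `part_of` pairs, one pass of the rule joins every (A, B), (B, C) into a new (A, C). A standalone sketch (pair contents are illustrative):

```python
def infer_transitive(pairs: set) -> set:
    """One application of the transitivity rule: (a, b), (b, c) => (a, c)."""
    inferred = set()
    for a, b in pairs:
        for b2, c in pairs:
            # Join on the shared middle element; skip facts already known
            # and trivial self-loops.
            if b == b2 and (a, c) not in pairs and a != c:
                inferred.add((a, c))
    return inferred

part_of = {("wheel", "car"), ("car", "fleet")}
print(infer_transitive(part_of))
```

A full transitive closure would repeat this pass, feeding inferred pairs back in, until no new pairs appear (a fixpoint).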
5.2 Path-Based Reasoning
```python
class PathBasedReasoner:
    """Follows typed paths through the graph."""

    def infer_by_path(
        self,
        kg: InMemoryKnowledgeGraph,
        start_id: str,
        path_pattern: List[RelationType],
        max_depth: int = 5
    ) -> List[Tuple[str, List[Relation]]]:
        """
        Follow a sequence of relation types from a start entity.

        Args:
            kg: the knowledge graph
            start_id: start entity ID
            path_pattern: [RELATION_TYPE1, RELATION_TYPE2, ...]
            max_depth: maximum traversal depth
        Returns:
            [(end_id, path_relations), ...]
        """
        if not path_pattern:
            return []
        results = []

        def dfs(current_id, current_depth, current_path):
            if current_depth >= len(path_pattern):
                # The full pattern has been matched
                results.append((current_id, current_path))
                return
            if current_depth >= max_depth:
                return
            # Follow the next relation type in the pattern
            target_relation = path_pattern[current_depth]
            relations = kg.get_relations(
                subject_id=current_id,
                relation_type=target_relation
            )
            for relation in relations:
                dfs(relation.object_id, current_depth + 1, current_path + [relation])

        dfs(start_id, 0, [])
        return results

    def find_common_type(
        self,
        kg: InMemoryKnowledgeGraph,
        entity_ids: List[str]
    ) -> Optional[str]:
        """
        Find a type shared by all the given entities.

        Args:
            kg: the knowledge graph
            entity_ids: entity IDs
        Returns:
            the common type's entity ID, if any
        """
        if len(entity_ids) < 2:
            return None
        # Types of the first entity, via instance_of edges
        first_entity = kg.get_entity(entity_ids[0])
        if not first_entity:
            return None
        type_relations = kg.get_relations(
            subject_id=first_entity.id,
            relation_type=RelationType.INSTANCE_OF
        )
        candidate_type_ids = [r.object_id for r in type_relations]
        # Check whether every other entity has the same type
        for type_id in candidate_type_ids:
            all_have_type = True
            for entity_id in entity_ids[1:]:
                relations = kg.get_relations(
                    subject_id=entity_id,
                    object_id=type_id,
                    relation_type=RelationType.INSTANCE_OF
                )
                if not relations:
                    all_have_type = False
                    break
            if all_have_type:
                return type_id
        return None
```
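`infer_by_path` follows a fixed sequence of relation types. The same idea reduces nicely to a frontier expansion over a plain edge list, where each edge is a (subject, relation, object) tuple (edge contents are illustrative):

```python
def follow_pattern(edges, start, pattern):
    """End nodes reachable from `start` via the given relation-type sequence."""
    frontier = {start}
    for rel_type in pattern:
        # Expand the frontier one typed hop at a time
        frontier = {o for s, r, o in edges if s in frontier and r == rel_type}
    return frontier

edges = [
    ("Guido", "created", "Python"),
    ("Python", "depends_on", "C"),
    ("NumPy", "depends_on", "Python"),
]
print(follow_pattern(edges, "Guido", ["created", "depends_on"]))
```

The set-based frontier is the breadth-first counterpart of the recursive DFS in `infer_by_path`; it loses the per-path relation lists but is shorter and never recurses.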
6. Knowledge Graph Applications
6.1 Knowledge-Enhanced Retrieval
```python
class KnowledgeEnhancedRetriever:
    """Retriever that expands queries with graph knowledge."""

    def __init__(self, kg: InMemoryKnowledgeGraph):
        self.kg = kg

    def retrieve_with_knowledge(
        self,
        query: str,
        top_k: int = 10
    ) -> List[Dict]:
        """
        Retrieve with help from the knowledge graph:
        1. recognize entities in the query
        2. collect related entities via graph edges
        3. score and rank the expanded candidates
        """
        # 1. Recognize entities in the query.
        # Note: extracted mentions get fresh IDs, so in practice they must
        # first be linked to graph entity IDs (see EntityLinker).
        extractor = ModelBasedEntityExtractor()
        entities = extractor.extract(query)
        if not entities:
            return []
        # 2. Collect related entities and paths
        related_entities = []
        for entity in entities:
            # Direct neighbors
            related_entities.extend(self.kg.get_neighbors(entity.id))
            # Objects of outgoing relations
            for rel in self.kg.get_relations(subject_id=entity.id):
                neighbor = self.kg.get_entity(rel.object_id)
                if neighbor:
                    related_entities.append(neighbor)
        # Deduplicate by ID
        unique_entities = {}
        for e in related_entities:
            unique_entities.setdefault(e.id, e)
        # 3. Score and rank
        results = []
        for entity in unique_entities.values():
            score = self._calculate_relevance(query, entity)
            if score > 0.5:
                results.append({
                    "entity": entity,
                    "score": score,
                    "context": self._get_entity_context(entity)
                })
        results.sort(key=lambda x: x["score"], reverse=True)
        return results[:top_k]

    def _calculate_relevance(self, query: str, entity: Entity) -> float:
        """Relevance score (simplified: token-level Jaccard similarity)."""
        query_words = set(query.lower().split())
        entity_words = set(entity.text.lower().split())
        if not query_words or not entity_words:
            return 0.0
        return len(query_words & entity_words) / len(query_words | entity_words)

    def _get_entity_context(self, entity: Entity) -> str:
        """Short textual context for an entity."""
        relations = self.kg.get_relations(subject_id=entity.id)
        context_parts = [f"{entity.text} ({entity.type.value})"]
        for rel in relations[:3]:  # at most 3 relations
            neighbor = self.kg.get_entity(rel.object_id)
            if neighbor:
                context_parts.append(f"{rel.relation_type.value} {neighbor.text}")
        return ", ".join(context_parts)
```
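The retrieval idea above boils down to expanding the query with the text of graph neighbors before handing it to a ranking or vector-search step. A standalone sketch (the neighbor map is illustrative):

```python
def expand_query(query: str, query_entities: list, neighbors: dict) -> str:
    """Append the names of graph neighbors of each query entity."""
    extra = []
    for entity in query_entities:
        extra.extend(neighbors.get(entity, []))
    if not extra:
        return query
    return query + " " + " ".join(extra)

neighbors = {"Python": ["Guido van Rossum", "NumPy"]}
print(expand_query("Who created Python?", ["Python"], neighbors))
```

Naive concatenation can drift the query off-topic when an entity has many neighbors, which is why the class above scores the expanded candidates instead of blindly appending all of them.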
6.2 Knowledge-Based Question Answering
```python
class KnowledgeQA:
    """Question answering over the knowledge graph."""

    def __init__(self, kg: InMemoryKnowledgeGraph, llm=None):
        self.kg = kg
        self.llm = llm

    def answer(
        self,
        question: str,
        use_reasoning: bool = True
    ) -> Dict:
        """
        Answer a question from the knowledge graph.

        Args:
            question: the question
            use_reasoning: whether to apply rule-based inference
        Returns:
            {
                "answer": str,
                "entities": List[Entity],
                "relations": List[Relation],
                "reasoning_path": List[str]
            }
        """
        # 1. Recognize entities in the question (mentions must be linked
        #    to graph entity IDs for the lookups below to hit)
        extractor = ModelBasedEntityExtractor()
        entities = extractor.extract(question)
        if not entities:
            return {
                "answer": "No relevant entities recognized",
                "entities": [],
                "relations": []
            }
        # 2. Collect relations of the main entity
        main_entity = entities[0]
        relations = self.kg.get_relations(subject_id=main_entity.id)
        # 3. Optional inference
        reasoning_path = []
        if use_reasoning:
            reasoner = RuleBasedReasoner()
            inferred = reasoner.infer(self.kg)
            # Keep only inferences touching the main entity
            relevant_inferred = [
                r for r in inferred
                if r.subject_id == main_entity.id or
                   r.object_id == main_entity.id
            ]
            relations.extend(relevant_inferred)
            if relevant_inferred:
                reasoning_path = [r.relation_type.value for r in relevant_inferred]
        # 4. Build the context
        context = self._build_context(main_entity, relations)
        # 5. Generate the answer
        if self.llm:
            answer = self._generate_with_llm(question, context)
        else:
            answer = self._generate_simple(context)
        return {
            "answer": answer,
            "entities": [main_entity] + [
                self.kg.get_entity(r.object_id)
                for r in relations[:5]
                if self.kg.get_entity(r.object_id)
            ],
            "relations": relations[:5],
            "reasoning_path": reasoning_path
        }

    def _build_context(
        self,
        entity: Entity,
        relations: List[Relation]
    ) -> str:
        """Render the entity and its relations as text."""
        context_parts = [f"{entity.text} ({entity.type.value})"]
        for rel in relations[:10]:
            neighbor = self.kg.get_entity(rel.object_id)
            if neighbor:
                context_parts.append(
                    f"- {rel.relation_type.value}: {neighbor.text} ({neighbor.type.value})"
                )
        return "\n".join(context_parts)

    def _generate_with_llm(self, question: str, context: str) -> str:
        """Generate the answer with an LLM."""
        prompt = f"""Answer the question using the knowledge below.

Knowledge:
{context}

Question: {question}

Answer based on the knowledge above; if it is insufficient, say so."""
        response = self.llm.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    def _generate_simple(self, context: str) -> str:
        """Fallback answer without an LLM."""
        return f"According to the knowledge base: {context}"
```
7. Knowledge Graph Quality Assessment
7.1 Quality Metrics
```python
class KnowledgeGraphQualityAssessor:
    """Computes simple quality metrics for a knowledge graph."""

    def assess(self, kg: InMemoryKnowledgeGraph) -> Dict:
        """
        Assess graph quality.

        Returns:
            {
                "completeness": ...,
                "consistency": ...,
                "accuracy": ...,
                "connectivity": ...,
                "overall": ...
            }
        """
        metrics = {
            "completeness": self._assess_completeness(kg),
            "consistency": self._assess_consistency(kg),
            "accuracy": self._assess_accuracy(kg),
            "connectivity": self._assess_connectivity(kg),
        }
        metrics["overall"] = sum(metrics.values()) / len(metrics)
        return metrics

    def _assess_completeness(self, kg: InMemoryKnowledgeGraph) -> float:
        """Share of entities carrying the required fields."""
        entities = kg.get_entities()
        valid_count = sum(1 for e in entities if e.text and e.properties)
        return valid_count / len(entities) if entities else 1.0

    def _assess_consistency(self, kg: InMemoryKnowledgeGraph) -> float:
        """Share of relations whose endpoints both exist."""
        relations = kg.get_relations()
        valid_count = sum(
            1 for rel in relations
            if kg.get_entity(rel.subject_id) and kg.get_entity(rel.object_id)
        )
        return valid_count / len(relations) if relations else 1.0

    def _assess_accuracy(self, kg: InMemoryKnowledgeGraph) -> float:
        """Simplified accuracy check on text quality."""
        entities = kg.get_entities()
        valid_count = sum(1 for e in entities if len(e.text) >= 2)  # at least 2 chars
        return valid_count / len(entities) if entities else 1.0

    def _assess_connectivity(self, kg: InMemoryKnowledgeGraph) -> float:
        """Share of entities reachable from the first entity."""
        entities = kg.get_entities()
        if len(entities) <= 1:
            return 1.0
        visited = set()

        def dfs(entity_id):
            if entity_id in visited:
                return
            visited.add(entity_id)
            for neighbor in kg.get_neighbors(entity_id):
                dfs(neighbor.id)

        # DFS from the first entity
        dfs(entities[0].id)
        return len(visited) / len(entities)
```
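The connectivity metric, reduced to its essentials on an adjacency dict, is the share of nodes reachable from an arbitrary start node (here an iterative DFS, which avoids recursion limits on large graphs):

```python
def connectivity(adj: dict) -> float:
    """Fraction of nodes reachable via DFS from the first node."""
    nodes = list(adj)
    if len(nodes) <= 1:
        return 1.0
    visited, stack = set(), [nodes[0]]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        stack.extend(adj.get(node, ()))
    return len(visited) / len(nodes)

print(connectivity({"a": ["b"], "b": [], "c": []}))  # "c" is isolated
```

Because the edges are directed and the traversal starts from one arbitrary node, this understates connectivity on graphs where the start node has few outgoing paths; counting weakly connected components would be the more robust measure.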
8. Implementation Example
8.1 End-to-End Construction Pipeline
```python
"""
End-to-end knowledge graph construction:
1. data collection
2. entity recognition
3. relation extraction
4. knowledge storage
5. knowledge reasoning
6. quality assessment
"""

class KnowledgeGraphBuilder:
    """Knowledge graph builder."""

    def __init__(
        self,
        storage_type: str = "memory"  # "memory" or "neo4j"
    ):
        self.storage_type = storage_type
        # Extraction components
        self.entity_extractor = ModelBasedEntityExtractor()
        self.relation_extractor = ModelBasedRelationExtractor()
        # Storage backend
        if storage_type == "memory":
            self.kg = InMemoryKnowledgeGraph()
        elif storage_type == "neo4j":
            self.kg = Neo4jKnowledgeGraph()
        else:
            raise ValueError(f"Unknown storage type: {storage_type}")

    def build_from_text(self, text: str) -> Dict:
        """
        Build a knowledge graph from raw text.

        Returns:
            {
                "entities": number of entities,
                "relations": number of relations,
                "quality": quality metrics
            }
        """
        print("=== Building knowledge graph ===")
        # 1. Entity recognition
        print("\n[1/5] Entity recognition")
        entities = self.entity_extractor.extract(text)
        print(f"  Recognized {len(entities)} entities")
        for entity in entities:
            print(f"  - {entity.text} ({entity.type.value})")
        # 2. Add entities to the graph
        print("\n[2/5] Adding entities")
        for entity in entities:
            self.kg.add_entity(entity)
        # 3. Relation extraction
        print("\n[3/5] Relation extraction")
        relations = self.relation_extractor.extract(text, entities)
        print(f"  Extracted {len(relations)} relations")
        for relation in relations[:5]:
            subject = self.kg.get_entity(relation.subject_id)
            object_entity = self.kg.get_entity(relation.object_id)
            print(f"  - {subject.text} -> {relation.relation_type.value} -> {object_entity.text}")
        # 4. Add relations to the graph
        print("\n[4/5] Adding relations")
        for relation in relations:
            self.kg.add_relation(relation)
        # 5. Quality assessment (in-memory backend only)
        print("\n[5/5] Quality assessment")
        if self.storage_type == "memory":
            assessor = KnowledgeGraphQualityAssessor()
            quality = assessor.assess(self.kg)
            print(f"  Completeness: {quality['completeness']:.2f}")
            print(f"  Consistency:  {quality['consistency']:.2f}")
            print(f"  Accuracy:     {quality['accuracy']:.2f}")
            print(f"  Connectivity: {quality['connectivity']:.2f}")
            print(f"  Overall:      {quality['overall']:.2f}")
        else:
            quality = None
        print("\n=== Knowledge graph built ===")
        return {
            "entities": len(entities),
            "relations": len(relations),
            "quality": quality
        }

    def query(self, query: str) -> Dict:
        """Query the knowledge graph."""
        # Recognize entities in the query (in practice these mentions
        # must first be linked to graph entity IDs)
        entities = self.entity_extractor.extract(query)
        if not entities:
            return {"answer": "No relevant entities recognized"}
        main_entity = entities[0]
        # Neighbors and relations of the main entity
        neighbors = self.kg.get_neighbors(main_entity.id)
        relations = self.kg.get_relations(subject_id=main_entity.id)
        return {
            "query": query,
            "entity": {
                "text": main_entity.text,
                "type": main_entity.type.value
            },
            "neighbors": [
                {"text": n.text, "type": n.type.value}
                for n in neighbors
            ],
            "relations": len(relations)
        }

# ============== Usage example ==============
if __name__ == "__main__":
    # Create the builder
    builder = KnowledgeGraphBuilder(storage_type="memory")
    # Sample text
    text = """
    Python is a high-level programming language created by Guido van Rossum in 1989.
    Guido developed Python at the CWI research institute in the Netherlands.
    Python is widely used for web development, data science, and artificial intelligence.
    Guido worked at Google and later joined Dropbox.
    The latest version of Python is 3.12.
    """
    # Build the knowledge graph
    result = builder.build_from_text(text)
    # Query it
    query = "Where does Guido work?"
    print(f"\nQuery: {query}")
    answer = builder.query(query)
    print(f"Result: {answer}")
```
Frequently Asked Interview Questions
Q1: How do you build a knowledge graph?
Model answer:
Knowledge graph construction pipeline:
1. Data collection
   - Documents
   - Structured data (databases, APIs)
   - Web data
2. Entity recognition
   - Rule-based: regex matching
   - Model-based: NER models, LLMs
   - Entity linking: disambiguation, linking to a knowledge base
3. Relation extraction
   - Rule-based: pattern matching
   - Model-based: relation classifiers, LLMs
   - Multi-hop relations: path extraction
4. Knowledge storage
   - In-memory: small scale, fast
   - Graph databases: Neo4j, JanusGraph
   - Relational databases: PostgreSQL
5. Knowledge reasoning
   - Rule-based: transitivity, hierarchy
   - Path-based: path finding
   - Logical: predicate logic
Implementation:
```python
# 1. Entity recognition
extractor = ModelBasedEntityExtractor()
entities = extractor.extract(text)
# 2. Relation extraction
relation_extractor = ModelBasedRelationExtractor()
relations = relation_extractor.extract(text, entities)
# 3. Storage
kg = InMemoryKnowledgeGraph()
for entity in entities:
    kg.add_entity(entity)
for relation in relations:
    kg.add_relation(relation)
```
Q2: How are knowledge graphs used in RAG?
Model answer:
Knowledge-graph-enhanced RAG:
1. Entity recognition
   - Recognize entities in the query
   - Recognize entities in the documents
2. Relation expansion
   - Find relations between entities
   - Expand to related entities
   - Build an entity graph
3. Path retrieval
   - Find paths between entities
   - Collect information along the paths
   - Provide multi-hop context
4. Tiered retrieval
   - Entity level: direct matches
   - Relation level: related entities
   - Neighbor level: expanded retrieval
Implementation:
```python
def kg_enhanced_rag(query, kg, vector_db):
    # Step 1: recognize entities in the query
    entities = extract_entities(query)
    # Step 2: collect related entities from the graph
    related_entities = []
    for entity in entities:
        related_entities.extend(
            kg.get_neighbors(entity.id)
        )
    # Step 3: expand the query with related entity text
    expanded_query = query
    for entity in related_entities:
        expanded_query += " " + entity.text
    # Step 4: vector retrieval over the expanded query
    results = vector_db.search(expanded_query)
    return results
```
Q3: How do you evaluate the quality of a knowledge graph?
Model answer:
Quality assessment dimensions:
- Completeness
  - Entity coverage
  - Relation coverage
  - Property completeness
- Consistency
  - No contradictory relations
  - Consistent data types
  - Constraints satisfied
- Accuracy
  - Entity recognition precision
  - Relation extraction precision
  - Property value accuracy
- Connectivity
  - Number of connected components
  - Proportion of isolated nodes
  - Average path length
- Usability
  - Query performance
  - Update performance
  - Storage efficiency
Implementation:
```python
def assess_kg_quality(kg):
    metrics = {}
    # Completeness: entities with text and properties
    entities = kg.get_entities()
    valid_entities = sum(
        1 for e in entities
        if e.text and e.properties
    )
    metrics["completeness"] = valid_entities / len(entities)
    # Consistency: relations whose endpoints both exist
    relations = kg.get_relations()
    valid_relations = sum(
        1 for r in relations
        if kg.get_entity(r.subject_id) and
           kg.get_entity(r.object_id)
    )
    metrics["consistency"] = valid_relations / len(relations)
    # Connectivity: share of entities reachable from the first one
    visited = set()
    dfs(entities[0].id, visited)  # dfs: traversal helper as in Section 7
    metrics["connectivity"] = len(visited) / len(entities)
    # Overall score
    metrics["overall"] = sum(metrics.values()) / len(metrics)
    return metrics
```
---
## Summary
### Key Points
| Aspect | Strategy |
|------|------|
| **Entity recognition** | Combine rules and models |
| **Relation extraction** | Pattern matching + LLM |
| **Knowledge storage** | Prefer a graph database |
| **Knowledge reasoning** | Rules + paths |
| **Quality assessment** | Multi-dimensional metrics |
### Best Practices
1. **Build in stages**: entities → relations → validation
2. **Fuse multiple sources**: integrate different data sources
3. **Update incrementally**: support knowledge evolution
4. **Control quality**: assess and refine continuously
5. **Design for the application**: shape the graph to the use case