
[Python Code] Implement a similarity calculation algorithm that evaluates similarity between nodes based on semantic distance or vector space

Thinking Process:

1. **Analyze the request:**
   * **Role:** Professional programming assistant.
   * **Goal:** Implement a similarity calculation algorithm that evaluates similarity between nodes based on semantic distance or vector space.
   * **Rules:** (1) If the goal can be directly implemented in Python, i.e. running the code achieves the goal, return *only* the complete Python code, no explanations. (2) If it cannot be directly implemented (e.g., too complex, or dependent on external data/models not provided), split it into 2–5 independent sub-goals, each a complete, self-contained target, one per line, in the format `子目标:XXXXXX` ("Sub-goal: …"). (3) No extra text; strict adherence to the output format.

2. **Evaluate feasibility:**
   * Computing semantic or vector-space similarity requires vector representations (embeddings) for the nodes, but the goal does not specify the data source, the embedding model, or the metric (cosine, Euclidean, etc.).
   * True semantic embeddings would require an external NLP model (e.g. `sentence-transformers`, which downloads model weights), which may not be available in the target environment. To keep the script runnable out of the box, the node vectors can be mocked with numpy while the metric logic itself is implemented fully.
   * A self-contained script that defines the algorithm and demonstrates it on sample vectors satisfies "running the code achieves the goal", so Rule 1 applies: return code rather than sub-goals.

3. **Design the code:**
   * **Dependencies:** numpy only; it is standard enough for vector math and avoids model downloads.
   * **Structure:** a `Node` class holding an identifier and a vector; a `SimilarityCalculator` class supporting cosine similarity (vector space) and Euclidean distance converted to a similarity score, with a semantic mode that treats the stored vectors as embeddings; type hints for professionalism.
   * **Demo:** a `main()` block under `if __name__ == "__main__":` creates mock node vectors with a fixed random seed, computes pairwise similarities, and prints the results.
   * **Output format:** the raw code only, with explanatory comments inside the code and no text outside it.
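The cosine metric settled on above can be checked in isolation before building the full class (a minimal sketch using numpy; the test vectors are arbitrary examples):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(a, b) = dot(a, b) / (|a| * |b|); 1.0 means identical direction,
    # 0.0 means orthogonal. Guard against zero-length vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # → 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # → 0.0
```

Identical-direction vectors score 1.0 and orthogonal vectors score 0.0, which matches the behavior the full implementation should reproduce.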
* *Okay, generating.*

```python
import numpy as np


class Node:
    def __init__(self, node_id: str, vector: np.ndarray):
        """
        A node with a unique identifier and its vector representation
        (semantic embedding or feature-space vector).
        :param node_id: unique node identifier
        :param vector: node vector as a numpy array
        """
        self.node_id = node_id
        self.vector = vector


class SimilarityCalculator:
    def __init__(self, method: str = 'cosine'):
        """
        Similarity calculator for node vectors.
        :param method: 'cosine' (vector space), 'euclidean' (distance-based),
                       or 'semantic' (embedding-based; falls back to cosine)
        """
        self.method = method

    def calculate_similarity(self, node_a: Node, node_b: Node) -> float:
        """
        Compute the similarity between two nodes.
        :return: similarity score
        """
        if self.method == 'cosine':
            # Cosine similarity in the vector space
            dot_product = np.dot(node_a.vector, node_b.vector)
            norm_a = np.linalg.norm(node_a.vector)
            norm_b = np.linalg.norm(node_b.vector)
            if norm_a == 0 or norm_b == 0:
                return 0.0
            return dot_product / (norm_a * norm_b)
        elif self.method == 'euclidean':
            # Euclidean distance mapped to a similarity score:
            # smaller distance -> higher similarity, bounded in (0, 1]
            dist = np.linalg.norm(node_a.vector - node_b.vector)
            return 1.0 / (1.0 + dist)
        elif self.method == 'semantic':
            # Semantic distance normally relies on an embedding model
            # (e.g. sentence-transformers). Here the stored vectors are
            # treated as semantic embeddings and cosine similarity is reused.
            return SimilarityCalculator('cosine').calculate_similarity(node_a, node_b)
        else:
            raise ValueError(f"Unknown method: {self.method}")


def main():
    # Mock node vectors (in practice, vectors would come from an NLP model);
    # random vectors stand in for semantic embeddings or feature vectors.
    np.random.seed(42)
    vector_dim = 4
    node1 = Node("node_A", np.random.rand(vector_dim))
    node2 = Node("node_B", np.random.rand(vector_dim))
    # node3 is constructed to be close to node1
    node3 = Node("node_C", node1.vector * 0.9 + np.random.rand(vector_dim) * 0.1)

    calculator = SimilarityCalculator(method='cosine')

    sim_1_2 = calculator.calculate_similarity(node1, node2)
    sim_1_3 = calculator.calculate_similarity(node1, node3)

    print(f"Similarity between {node1.node_id} and {node2.node_id}: {sim_1_2:.4f}")
    print(f"Similarity between {node1.node_id} and {node3.node_id}: {sim_1_3:.4f}")
    # node1 and node3 should be more similar than node1 and node2;
    # this is a demonstration only, and actual scores depend on the vectors.


if __name__ == "__main__":
    main()
```
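The pairwise calculation above extends naturally to many nodes: normalizing every vector once and taking a single matrix product yields the full pairwise cosine-similarity matrix. A minimal numpy sketch (the three mock vectors are arbitrary stand-ins for node embeddings, not output from any model):

```python
import numpy as np

# Mock embeddings for three nodes; each row is one node's vector
vectors = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])

# Normalize each row to unit length, then one matrix product computes
# every pairwise cosine similarity at once.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim_matrix = unit @ unit.T
print(np.round(sim_matrix, 4))
```

The diagonal is 1.0 (each node is identical to itself), the first two rows are orthogonal (similarity 0), and each is at 45° to the third (similarity ≈ 0.7071). For large node sets, this vectorized form is far cheaper than calling a pairwise function in a double loop.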