Serf initial implementation: Semantic Entity Resolution Framework#17
Serf initial implementation: Semantic Entity Resolution Framework#17
Conversation
Gemini 2.0 Flash only for all ER pipeline operations. Gemini 2.5 Pro allowed only for validation data generation with < 2K API calls.
Do not reuse Abzu's BAML-generated types. Build fresh domain-agnostic Pydantic classes for DSPy signatures. Add auto-generation of entity types from PySpark DataFrame schemas via type_generator module. Add Spark-to-Python type mapping and DatasetProfile-driven field descriptions.
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…te config and module structure Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ipeline with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ger with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…tion with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…s with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…, download commands Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…s, fix text column detection Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…r F1=0.90 (embedding baseline) Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…flag, remove scripts/ directory Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…h optional YAML config Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…M for all matching Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…tput flag Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…th validation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…fix analyze LLM guidance, add auto-convergence Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
|
/gemini review |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request represents the initial, comprehensive implementation of the Semantic Entity Resolution Framework (SERF). It establishes a robust, agentic pipeline for identifying and disambiguating entities across datasets. The changes lay the groundwork for a scalable and intelligent entity resolution system by integrating modern AI techniques like large language models and embeddings with efficient data processing frameworks. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
…arison test, fix analyze config generation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request provides an impressive and comprehensive initial implementation of the SERF framework, successfully migrating the project to a modern Python toolchain with uv and ruff, and building out core components for a sophisticated, multi-stage entity resolution pipeline. No security vulnerabilities were found. My review highlights critical correctness issues in data handling and the UUID mapping recovery logic, along with suggestions to improve design consistency and robustness.
src/serf/cli/main.py
Outdated
| ] | ||
| entities.append( | ||
| Entity( | ||
| id=int(row_dict.get("id", idx)), # type: ignore[arg-type] |
There was a problem hiding this comment.
The use of int(row_dict.get("id", idx)) is likely to cause a crash when processing datasets where the 'id' column is not a numeric type. For example, the DBLP-ACM benchmark dataset uses string identifiers like "conf/sigmod/Abadi05". Attempting to cast this to an int will raise a ValueError.
To make this helper more robust, you should handle non-integer IDs gracefully. One approach is to create an internal mapping from the original string IDs to integer IDs if they are not already integers, or simply use the pandas index idx consistently as the primary integer ID for processing.
| id=int(row_dict.get("id", idx)), # type: ignore[arg-type] | |
| id=idx, # Use the DataFrame index for a consistent integer ID |
src/serf/match/uuid_mapper.py
Outdated
| # Phase 1: Add missing IDs to first resolved entity's source_ids | ||
| if missing_ids and resolution.resolved_entities: | ||
| first = resolution.resolved_entities[0] | ||
| existing_sources = set(first.source_ids or []) | ||
| first_sources = list(existing_sources | missing_ids) | ||
| resolution.resolved_entities[0] = first.model_copy(update={"source_ids": first_sources}) |
There was a problem hiding this comment.
The Phase 1 recovery logic for missing entities makes a strong and potentially incorrect assumption. If the LLM fails to return some entities from a block, this code assumes they were all merged into the first resolved entity. This can corrupt the merge lineage (source_ids), which is critical for a multi-round ER system.
For example, if a block contains {A, B, C} and the LLM only returns a merged A+B, this logic would incorrectly record that C was also merged into A+B.
A safer approach would be to rely solely on the Phase 2 recovery mechanism for all missing entities. This would treat any entity not returned by the LLM as an un-merged singleton, preserving the integrity of the data lineage. I recommend removing this Phase 1 logic.
pyproject.toml
Outdated
| packages = ["src/serf"] | ||
|
|
||
| [tool.black] | ||
| [dependency-groups] |
There was a problem hiding this comment.
While [dependency-groups] is supported by uv, it's a uv-specific extension. The standard way to define optional dependencies according to PEP 621 is using [project.optional-dependencies]. Using the standard format would improve portability with other packaging tools like pip.
For example:
[project.optional-dependencies]
dev = [
"pytest>=8.0",
"pytest-asyncio>=1.0",
"ruff>=0.11",
"zuban>=0.0.23",
"pre-commit>=4.0",
"types-pyyaml>=6.0",
]| index.train(embeddings) # type: ignore[call-arg] | ||
| index.add(embeddings) # type: ignore[call-arg] |
There was a problem hiding this comment.
The type: ignore[call-arg] comments suggest a potential type mismatch between the embeddings array and what the faiss library expects. While the code may work, these ignores can hide underlying issues, especially with C-extension libraries where type stubs might be incomplete or out of sync. It would be beneficial to investigate the exact cause of the type error and resolve it, which might involve a minor type cast or adjustment, to ensure type safety and remove the need for the ignores.
src/serf/dspy/types.py
Outdated
| def text_for_embedding(self) -> str: | ||
| """Return text representation for embedding. | ||
|
|
||
| Returns | ||
| ------- | ||
| str | ||
| Concatenation of name and description for embedding | ||
| """ | ||
| parts = [self.name] | ||
| if self.description: | ||
| parts.append(self.description) | ||
| for _key, val in self.attributes.items(): | ||
| if isinstance(val, str) and val: | ||
| parts.append(val) | ||
| return " ".join(parts) |
There was a problem hiding this comment.
The current implementation of text_for_embedding concatenates the name, description, and all string attributes. This can lead to noisy and overly long text for the embedding model, potentially degrading the quality of the semantic blocking. For instance, it might include irrelevant metadata or long boilerplate text from attributes.
Consider making this more configurable. A better approach would be to use a specified list of fields for embedding, similar to the text_fields defined in ERConfig. This would give the user more control over the embedding representation and likely improve blocking performance.
src/serf/pipeline.py
Outdated
| def _llm_match_and_merge(blocks: list[EntityBlock], cfg: ERConfig) -> list[Entity]: | ||
| """Run LLM-based matching on blocks and return resolved entities. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| blocks : list[EntityBlock] | ||
| Blocks to process | ||
| cfg : ERConfig | ||
| Pipeline configuration | ||
|
|
||
| Returns | ||
| ------- | ||
| list[Entity] | ||
| Resolved entities after LLM matching and merging |
There was a problem hiding this comment.
This function's name, _llm_match_and_merge, implies that it performs both matching and merging. However, the implementation only calls the EntityMatcher, which in turn uses the BlockMatch signature. The BlockMatch prompt instructs the LLM to perform the merge as part of its task. This means the serf.merge.merger.EntityMerger class is not actually used in this pipeline.
This isn't a bug, but it's a design discrepancy worth noting. The current approach (merging within the LLM prompt) is valid and efficient. However, it renders merger.py as dead code in the context of this main pipeline. Consider either renaming this function to better reflect that it only orchestrates the LLM call (e.g., _resolve_blocks_with_llm) or refactoring to use the EntityMerger for a more modular, two-step match-then-merge process, which might offer more control and testability.
…ig for agentic override Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
|
I have completed the initial review and provided a summary of the changes. I will now proceed with a more detailed code review and post my findings as review comments shortly. Please let me know if you have any specific areas you'd like me to focus on, or any questions regarding the summary. |
…All matching via LLM. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…hing Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…g (Gemini Flash, 30 concurrent) Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…king Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
- pipeline.py: Assign UUIDs at entity creation, track all historical UUIDs - uuid_mapper.py: Transitive source_ids/source_uuids collection, dedup, exclude self - merger.py: Dedup source_ids/source_uuids, exclude master's own ID/UUID - matcher.py: Add iteration param to resolve_block/resolve_blocks, set match_skip_history - metrics.py: Add validate_source_uuids function - Tests for dedup, self-exclusion, transitive accumulation, and UUID validation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…analysis, UUID validation, PASS/FAIL checks Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
|
It seems Copilot has opened a new pull request to address some changes. I will await the review request on that new pull request (#18) once it's ready. |
review now |
…ss-cpu Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request presents an impressive and comprehensive initial implementation of the Semantic Entity Resolution Framework (SERF), utilizing modern tooling like uv and ruff, and DSPy for LLM interaction. While the code quality is high, a critical security vulnerability related to Prompt Injection has been identified. Untrusted data from datasets or user input is being included in LLM prompts without sufficient sanitization or structural isolation, potentially allowing attackers to manipulate the entity resolution process or cause denial of service. To mitigate this, consider strengthening prompt engineering with robust delimiters, validating LLM outputs against strict schemas, and implementing safe bounds for LLM-generated configuration parameters. Additionally, address minor code quality issues such as incorrect version specifiers for the ruff tool and some code duplication in the CLI module.
.pre-commit-config.yaml
Outdated
| types: [python] | ||
| - repo: local | ||
| - repo: https://github.com/astral-sh/ruff-pre-commit | ||
| rev: v0.11.0 |
There was a problem hiding this comment.
The ruff-pre-commit repository version v0.11.0 appears to be invalid. The astral-sh/ruff-pre-commit repository does not have a tag with this version, which will cause pre-commit install or pre-commit autoupdate to fail. Please update this to a recent, valid version. The latest version is v0.4.4.
rev: v0.4.4
pyproject.toml
Outdated
| dev = [ | ||
| "pytest>=8.0", | ||
| "pytest-asyncio>=1.0", | ||
| "ruff>=0.11", |
There was a problem hiding this comment.
The version specifier ruff>=0.11 is not a valid specifier for the ruff package, as its versions follow a 0.x.y scheme (e.g., 0.4.4). This will likely cause dependency resolution to fail. Please correct this to a valid version specifier, for example, by pinning to a recent version.
| "ruff>=0.11", | |
| "ruff>=0.4.4", |
| self, | ||
| dataset_description: str, | ||
| entity_type: str = "entity", | ||
| ) -> dspy.Prediction: | ||
| """Run the agentic ER pipeline. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| dataset_description : str | ||
| Summary of the dataset to resolve | ||
| entity_type : str | ||
| Type of entities being resolved | ||
|
|
||
| Returns | ||
| ------- | ||
| dspy.Prediction | ||
| The agent's action plan and reasoning | ||
| """ | ||
| return cast( | ||
| dspy.Prediction, | ||
| self.react( | ||
| dataset_description=dataset_description, |
There was a problem hiding this comment.
The ERAgent.forward method takes a dataset_description parameter which is directly passed to the dspy.ReAct agent. This input is not sanitized or validated, allowing a user to provide a crafted description that could manipulate the agent's reasoning and tool usage (Prompt Injection). In a ReAct loop, this could lead to the agent performing unintended actions or returning misleading results. Consider validating the input or using a more restrictive prompt template that clearly separates user input from instructions.
| def generate_er_config( | ||
| profile: DatasetProfile, | ||
| sample_records: list[dict[str, Any]], | ||
| model: str = "gemini/gemini-2.0-flash", | ||
| ) -> str: | ||
| """Use an LLM to generate an ER config YAML from a dataset profile. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| profile : DatasetProfile | ||
| Statistical profile of the dataset | ||
| sample_records : list[dict[str, Any]] | ||
| Sample records from the dataset (5-10 records) | ||
| model : str | ||
| LLM model to use for config generation | ||
|
|
||
| Returns | ||
| ------- | ||
| str | ||
| YAML string with the recommended ER configuration | ||
| """ | ||
| api_key = os.environ.get("GEMINI_API_KEY", "") | ||
| lm = dspy.LM(model, api_key=api_key) | ||
|
|
||
| predictor = dspy.ChainOfThought(GenerateERConfig) | ||
|
|
||
| profile_json = profile.model_dump_json(indent=2) | ||
| samples_json = json.dumps(sample_records[:10], indent=2, default=str) | ||
|
|
||
| logger.info("Generating ER config with LLM...") | ||
| with dspy.context(lm=lm, adapter=BAMLAdapter()): | ||
| result = predictor( |
There was a problem hiding this comment.
The generate_er_config function constructs an LLM prompt using a statistical profile and sample records from the input dataset. If the dataset contains malicious content, it could influence the LLM to generate an insecure or incorrect entity resolution configuration. This 'Indirect Prompt Injection' could lead to denial of service (e.g., by setting extremely high iteration counts) or manipulation of the ER process. Ensure that the LLM output is validated against a strict schema and that sensitive configuration parameters have safe upper bounds.
| def resolve_block(self, block: EntityBlock, iteration: int = 1) -> BlockResolution: | ||
| """Process a single block through the LLM. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| block : EntityBlock | ||
| Block of entities to resolve | ||
| iteration : int | ||
| Current pipeline iteration number | ||
|
|
||
| Returns | ||
| ------- | ||
| BlockResolution | ||
| Resolution with merged and non-matched entities | ||
| """ | ||
| mapper = UUIDMapper() | ||
| mapped_block = mapper.map_block(block) | ||
|
|
||
| block_records = json.dumps( | ||
| [e.model_dump(mode="json") for e in mapped_block.entities], | ||
| indent=2, | ||
| ) | ||
| few_shot = get_default_few_shot_examples() | ||
|
|
||
| try: | ||
| lm = self._ensure_lm() | ||
| with dspy.context(lm=lm, adapter=self._adapter): | ||
| result = self.predictor( |
There was a problem hiding this comment.
The EntityMatcher.resolve_block method serializes entity data from the dataset into JSON and includes it directly in the LLM prompt for matching. An attacker could include malicious data within entity attributes (e.g., names or descriptions) designed to subvert the matching logic, cause the LLM to ignore other entities, or produce false matches. This is a form of Indirect Prompt Injection. Consider using robust delimiters and explicit instructions to the LLM to treat the data as untrusted content.
| async def resolve_edge_block( | ||
| self, block_key: str, edges: list[dict[str, Any]] | ||
| ) -> list[dict[str, Any]]: | ||
| """Resolve a single block of duplicate edges. | ||
|
|
||
| On error, return original edges unchanged. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| block_key : str | ||
| Key identifying this block (e.g. JSON of [src, dst, type]) | ||
| edges : list[dict[str, Any]] | ||
| Edges in this block | ||
|
|
||
| Returns | ||
| ------- | ||
| list[dict[str, Any]] | ||
| Resolved edges (deduplicated/merged), or original on error | ||
| """ | ||
| if len(edges) <= 1: | ||
| return edges | ||
|
|
||
| try: | ||
| edge_block_json = json.dumps(edges) | ||
| result = await asyncio.to_thread(self._predictor, edge_block=edge_block_json) |
There was a problem hiding this comment.
Similar to the EntityMatcher, the EdgeResolver.resolve_edge_block method passes edge data from the dataset to an LLM prompt. Malicious edge attributes could be used to manipulate the edge deduplication and merging logic via prompt injection. Ensure that the prompt instructions clearly distinguish between the task logic and the data being processed.
src/serf/cli/main.py
Outdated
| def _detect_name_column(columns: list[str]) -> str: | ||
| """Detect the primary name column from a list of column names. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| columns : list[str] | ||
| Column names to search | ||
|
|
||
| Returns | ||
| ------- | ||
| str | ||
| The detected name column | ||
| """ | ||
| name_candidates = [ | ||
| "title", | ||
| "name", | ||
| "product_name", | ||
| "company_name", | ||
| "entity_name", | ||
| ] | ||
| for candidate in name_candidates: | ||
| if candidate in columns: | ||
| return candidate | ||
| for col in columns: | ||
| if col != "id": | ||
| return col | ||
| return columns[0] | ||
|
|
||
|
|
||
| def _dataframe_to_entities(df: Any) -> list[Any]: | ||
| """Convert a pandas DataFrame to a list of Entity objects. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| df : pd.DataFrame | ||
| Input DataFrame with entity records | ||
|
|
||
| Returns | ||
| ------- | ||
| list[Entity] | ||
| List of Entity objects | ||
| """ | ||
|
|
||
| from serf.dspy.types import Entity | ||
|
|
||
| entities = [] | ||
| name_col = _detect_name_column(df.columns.tolist()) | ||
| for i, (_idx, row) in enumerate(df.iterrows()): | ||
| row_dict = row.to_dict() | ||
| name = str(row_dict.get(name_col, f"entity_{i}")) | ||
| desc_parts = [ | ||
| str(v) for k, v in row_dict.items() if k != name_col and isinstance(v, str) and v | ||
| ] | ||
| entities.append( | ||
| Entity( | ||
| id=i, # Use sequential index — original IDs may be strings | ||
| name=name, | ||
| description=" ".join(desc_parts), | ||
| attributes=row_dict, | ||
| ) | ||
| ) | ||
| return entities | ||
|
|
There was a problem hiding this comment.
The helper functions _detect_name_column and _dataframe_to_entities in this module duplicate logic that is also implemented in a more comprehensive way in src/serf/pipeline.py. This can lead to maintenance challenges and inconsistencies if one implementation is updated but the other is not. To improve code reuse and maintainability, consider refactoring to use a single, shared implementation for loading data and converting it to Entity objects across all relevant CLI commands. For example, the functions from pipeline.py could be moved to a shared utility module.
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… FAISS compatibility Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… processes to fix macOS MPS segfault Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…NING.md from Eridu lessons Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ngual-e5-base, remove all pip references Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…P=0.885 R=0.581 F1=0.701 Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…rvice profiles Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… defenses, validate LLM config output, deduplicate CLI helpers Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
The primary SERF interface is DataFrame-in, DataFrame-out. Pydantic types are auto-generated internally from df.schema + DatasetProfile. Document the full flow: DataFrame → Pydantic → JSON → DSPy/LLM → Pydantic → DataFrame. Users never need to define types unless they want custom control.
Add true_positives and false_positives to evaluate_resolution() return dict. Display benchmark results as a pd.DataFrame table instead of individual click.echo lines.
…raphlet-AI/serf into cursor/serf-long-shot-plan-system-b0d4
…ing, deprecate connected components Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
- Add mlflow[genai]>=3.10.1 dependency - Add serf mlflow command to start local MLflow server with SQLite backend - Create serf.tracking module with setup_mlflow() for DSPy autologging - Enable MLflow tracing in run, match, benchmark, benchmark-all commands - Use click.Choice for --dataset in benchmark/download commands - Fix benchmark tests for proper mock patching and type annotations
Cursor Agent long-shot prompts overnight to build out the core features of SERF.