Serf initial implementation: Semantic Entity Resolution Framework by rjurney · Pull Request #17 · Graphlet-AI/serf

rjurney · 2026-03-08T20:23:11Z

Cursor Agent long-shot prompts overnight to build out the core features of SERF.

Gemini 2.0 Flash only for all ER pipeline operations. Gemini 2.5 Pro allowed only for validation data generation with < 2K API calls.

Do not reuse Abzu's BAML-generated types. Build fresh domain-agnostic Pydantic classes for DSPy signatures. Add auto-generation of entity types from PySpark DataFrame schemas via type_generator module. Add Spark-to-Python type mapping and DatasetProfile-driven field descriptions.

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…te config and module structure Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…ipeline with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…ger with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…tion with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…s with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…, download commands Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…s, fix text column detection Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…r F1=0.90 (embedding baseline) Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…flag, remove scripts/ directory Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…h optional YAML config Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…M for all matching Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…tput flag Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…th validation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…fix analyze LLM guidance, add auto-convergence Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

rjurney · 2026-03-08T20:23:25Z

/gemini review
@gemini review

gemini-code-assist · 2026-03-08T20:24:31Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents the initial, comprehensive implementation of the Semantic Entity Resolution Framework (SERF). It establishes a robust, agentic pipeline for identifying and disambiguating entities across datasets. The changes lay the groundwork for a scalable and intelligent entity resolution system by integrating modern AI techniques like large language models and embeddings with efficient data processing frameworks.

Highlights

Project Infrastructure Overhaul: Migrated the project's dependency management from Poetry to uv, updated pre-commit hooks to utilize ruff for linting and formatting, and expanded .gitignore to include common development artifacts and IDE-specific files. An Apache License 2.0 was added to the repository.
Core Entity Resolution Pipeline Implementation: Introduced foundational data models including a generic Entity base class, EntityBlock, MatchDecision, and BlockResolution. Implemented semantic blocking using Qwen3 sentence embeddings and FAISS for efficient clustering, along with a SemanticBlockingPipeline to orchestrate the process.
Extensive DSPy Integration: Leveraged DSPy for various LLM-powered operations, including BlockMatch for entity matching and merging, EdgeResolve for deduplicating graph edges, AnalyzeDataset for profiling, and GenerateERConfig for LLM-driven configuration. An ERAgent was introduced for agentic control of the ER pipeline.
Comprehensive CLI and Benchmarking: Developed a full-featured command-line interface (serf) with commands for run, analyze, block, match, eval, edges, download, benchmark, and benchmark-all. Integrated with standard ER benchmark datasets (Leipzig, DeepMatcher formats) and evaluation metrics.
Spark Integration and Utilities: Added modules for PySpark integration, including schema conversion utilities (Pydantic to Spark StructType), graph algorithms (connected_components), and initial Iceberg table support.
Detailed Documentation and Planning: Created a comprehensive BUILD_PLAN.md outlining the development phases and tasks. Updated CLAUDE.md and README.md to reflect the new tooling, architecture, and usage instructions for the SERF framework.

Changelog

.gitignore
- Added Python-specific build and environment ignores.
- Added IDE-specific ignores for .idea and .vscode directories.
.pre-commit-config.yaml
- Replaced Black, Flake8, and Isort hooks with Ruff and Ruff-format.
- Updated Zuban hook to use uv run and disabled filename passing.
- Removed Pytest hook.
CLAUDE.md
- Updated development instructions to use uv for dependency management and command execution.
- Replaced references to Black, Isort, and Flake8 with Ruff for linting and formatting.
- Updated LLM-related guidance from BAML to DSPy and BAMLAdapter.
- Removed the BAML Client Generation section.
LICENSE
- Added the Apache License 2.0.
README.md
- Updated project title and added license and Python version badges.
- Rewrote the project description to detail SERF's agentic approach and phases.
- Added an architecture table outlining key technologies used.
- Updated Quick Start and Development sections to reflect uv and Ruff tooling.
- Included Python API and DSPy Interface usage examples.
- Added a Benchmark Results section with baseline performance data.
- Detailed the project's directory structure and configuration access.
config.yml
- Added extensive configuration for models (embedding, LLM), entity resolution parameters (blocking, matching, evaluation, paths), and benchmark datasets.
docs/BUILD_PLAN.md
- Added a detailed, multi-phase build plan for the SERF project, covering infrastructure, type system, DSPy signatures, core modules, evaluation, CLI, and PyPI preparation.
docs/SERF_LONG_SHOT_PLAN.md
- Updated the plan to emphasize fresh DSPy Pydantic types over BAML-generated ones.
- Included a task for auto-generating entity types from DataFrames.
- Replaced Poetry with uv in the evolution patterns.
- Added a section detailing the overnight build budget constraint for Gemini API usage.
pyproject.toml
- Migrated project metadata from Poetry format to PEP 621 [project] format.
- Updated and added dependencies including dspy-ai, pyspark, sentence-transformers, faiss-cpu, cleanco, tqdm, numpy, and pandas.
- Replaced Black and Isort configurations with Ruff linting and formatting settings.
- Added project URLs and classifiers for PyPI publication.
src/serf/analyze/init.py
- Added module initialization for the analyze package.
src/serf/analyze/field_detection.py
- Added a new module for detecting field types based on name and sample values using heuristics and regex.
src/serf/analyze/profiler.py
- Added a new module for DatasetProfiler to analyze dataset statistics and generate_er_config to create ER configurations using an LLM.
src/serf/block/embeddings.py
- Added a new module for EntityEmbedder to compute sentence embeddings using sentence-transformers.
src/serf/block/faiss_blocker.py
- Added a new module for FAISSBlocker to cluster entity embeddings into blocks using FAISS IVF.
src/serf/block/normalize.py
- Added a new module for name normalization utilities, including corporate suffix removal, acronym generation, and multilingual stop word filtering.
src/serf/block/pipeline.py
- Added a new module for SemanticBlockingPipeline to orchestrate the embedding, clustering, and block splitting process.
src/serf/cli/main.py
- Refactored and expanded the CLI with new commands for run, analyze, block, match, eval, edges, download, benchmark, and benchmark-all.
- Implemented data loading, entity conversion, and pipeline execution logic within CLI commands.
src/serf/config.py
- Improved error handling for file operations and type hints in the Config class.
src/serf/dspy/agents.py
- Added a new module for ERAgent, a DSPy ReAct agent, to control the entity resolution pipeline dynamically.
src/serf/dspy/baml_adapter.py
- Updated field structure formatting to correctly iterate over input fields.
src/serf/dspy/signatures.py
- Added new DSPy signatures for BlockMatch, EntityMerge, EdgeResolve, AnalyzeDataset, and GenerateERConfig.
src/serf/dspy/type_generator.py
- Added a new module for entity_type_from_spark_schema to dynamically generate Pydantic Entity subclasses from Spark schemas.
src/serf/dspy/types.py
- Rewrote the module to define new, domain-agnostic Pydantic types for the ER pipeline, including Entity, Publication, Product, EntityBlock, MatchDecision, BlockResolution, and various metrics types.
src/serf/edge/init.py
- Added module initialization for the edge package.
src/serf/edge/resolver.py
- Added a new module for EdgeResolver to deduplicate graph edges using LLM-guided merging.
src/serf/eval/init.py
- Added module initialization for the eval package.
src/serf/eval/benchmarks.py
- Added a new module for BenchmarkDataset to manage downloading, loading, and evaluating standard entity resolution benchmarks.
src/serf/eval/metrics.py
- Added a new module for evaluation metrics including precision, recall, F1 score, pair completeness, and reduction ratio.
src/serf/match/init.py
- Added module initialization for the match package.
src/serf/match/few_shot.py
- Added a new module for generating default and custom few-shot examples for LLM matching.
src/serf/match/matcher.py
- Added a new module for EntityMatcher to resolve entity blocks using DSPy BlockMatch signatures with async processing.
src/serf/match/uuid_mapper.py
- Added a new module for UUIDMapper to map entity IDs to integers for LLM processing and restore original IDs/UUIDs.
src/serf/merge/init.py
- Added module initialization for the merge package.
src/serf/merge/merger.py
- Added a new module for EntityMerger to combine multiple entities into a single canonical record.
src/serf/pipeline.py
- Added a new module for the run_pipeline function, orchestrating the end-to-end entity resolution process with iterative blocking and LLM matching.
src/serf/spark/init.py
- Added module initialization for the spark package.
src/serf/spark/graph.py
- Added a new module for connected_components function to find connected components in Spark DataFrames.
src/serf/spark/iceberg.py
- Added a new module for Iceberg integration, including session creation and data read/write functions.
src/serf/spark/schemas.py
- Added a new module for Pydantic to Spark schema conversion and schema validation utilities.
src/serf/spark/utils.py
- Added a new module for Spark utility functions, including splitting large blocks and selecting most common properties.
tests/test_agents.py
- Added tests for the ERAgent and its tool functions.
tests/test_analyze_benchmark.py
- Added integration tests for LLM-powered dataset analysis and benchmark config generation.
tests/test_benchmarks.py
- Added tests for benchmark dataset loading, entity conversion, and evaluation.
tests/test_cli.py
- Added tests for the serf CLI commands and their help messages.
tests/test_dspy.py
- Updated DSPy integration tests.
tests/test_edge_resolver.py
- Added tests for the EdgeResolver module.
tests/test_embeddings.py
- Added tests for the EntityEmbedder module.
tests/test_faiss_blocker.py
- Added tests for the FAISSBlocker module.
tests/test_few_shot.py
- Added tests for few-shot example generation functions.
tests/test_field_detection.py
- Added tests for the field type detection utility.
tests/test_graph.py
- Added tests for Spark graph connected components.
tests/test_merger.py
- Added tests for the EntityMerger module.
tests/test_metrics.py
- Added tests for evaluation metrics functions.
tests/test_normalize.py
- Added tests for name normalization utilities.
tests/test_profiler.py
- Added tests for the DatasetProfiler module.
tests/test_schemas.py
- Added tests for Spark schema conversion and validation utilities.
tests/test_signatures.py
- Added tests for DSPy signature definitions.
tests/test_type_generator.py
- Added tests for dynamically generating Pydantic Entity subclasses from Spark schemas.

Activity

The author, rjurney, initiated this pull request as the 'Serf initial implementation', indicating a foundational and extensive set of changes.
The description mentions that 'Cursor Agent long-shot prompts overnight to build out the core features of SERF', suggesting that a significant portion of this implementation was generated through an automated process.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

…arison test, fix analyze config generation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

gemini-code-assist

Code Review

This pull request provides an impressive and comprehensive initial implementation of the SERF framework, successfully migrating the project to a modern Python toolchain with uv and ruff, and building out core components for a sophisticated, multi-stage entity resolution pipeline. No security vulnerabilities were found. My review highlights critical correctness issues in data handling and the UUID mapping recovery logic, along with suggestions to improve design consistency and robustness.

gemini-code-assist · 2026-03-08T20:30:11Z

src/serf/cli/main.py

+        ]
+        entities.append(
+            Entity(
+                id=int(row_dict.get("id", idx)),  # type: ignore[arg-type]


The use of int(row_dict.get("id", idx)) is likely to cause a crash when processing datasets where the 'id' column is not a numeric type. For example, the DBLP-ACM benchmark dataset uses string identifiers like "conf/sigmod/Abadi05". Attempting to cast this to an int will raise a ValueError.

To make this helper more robust, you should handle non-integer IDs gracefully. One approach is to create an internal mapping from the original string IDs to integer IDs if they are not already integers, or simply use the pandas index idx consistently as the primary integer ID for processing.

Suggested change

id=int(row_dict.get("id", idx)), # type: ignore[arg-type]

id=idx, # Use the DataFrame index for a consistent integer ID

gemini-code-assist · 2026-03-08T20:30:11Z

src/serf/match/uuid_mapper.py

+        # Phase 1: Add missing IDs to first resolved entity's source_ids
+        if missing_ids and resolution.resolved_entities:
+            first = resolution.resolved_entities[0]
+            existing_sources = set(first.source_ids or [])
+            first_sources = list(existing_sources | missing_ids)
+            resolution.resolved_entities[0] = first.model_copy(update={"source_ids": first_sources})


The Phase 1 recovery logic for missing entities makes a strong and potentially incorrect assumption. If the LLM fails to return some entities from a block, this code assumes they were all merged into the first resolved entity. This can corrupt the merge lineage (source_ids), which is critical for a multi-round ER system.

For example, if a block contains {A, B, C} and the LLM only returns a merged A+B, this logic would incorrectly record that C was also merged into A+B.

A safer approach would be to rely solely on the Phase 2 recovery mechanism for all missing entities. This would treat any entity not returned by the LLM as an un-merged singleton, preserving the integrity of the data lineage. I recommend removing this Phase 1 logic.

gemini-code-assist · 2026-03-08T20:30:11Z

pyproject.toml

+packages = ["src/serf"]

-[tool.black]
+[dependency-groups]


While [dependency-groups] is supported by uv, it's a uv-specific extension. The standard way to define optional dependencies according to PEP 621 is using [project.optional-dependencies]. Using the standard format would improve portability with other packaging tools like pip.

For example:

[project.optional-dependencies] dev = [ "pytest>=8.0", "pytest-asyncio>=1.0", "ruff>=0.11", "zuban>=0.0.23", "pre-commit>=4.0", "types-pyyaml>=6.0", ]

gemini-code-assist · 2026-03-08T20:30:11Z

src/serf/block/faiss_blocker.py

+        index.train(embeddings)  # type: ignore[call-arg]
+        index.add(embeddings)  # type: ignore[call-arg]


The type: ignore[call-arg] comments suggest a potential type mismatch between the embeddings array and what the faiss library expects. While the code may work, these ignores can hide underlying issues, especially with C-extension libraries where type stubs might be incomplete or out of sync. It would be beneficial to investigate the exact cause of the type error and resolve it, which might involve a minor type cast or adjustment, to ensure type safety and remove the need for the ignores.

gemini-code-assist · 2026-03-08T20:30:11Z

src/serf/dspy/types.py

+    def text_for_embedding(self) -> str:
+        """Return text representation for embedding.
+
+        Returns
+        -------
+        str
+            Concatenation of name and description for embedding
+        """
+        parts = [self.name]
+        if self.description:
+            parts.append(self.description)
+        for _key, val in self.attributes.items():
+            if isinstance(val, str) and val:
+                parts.append(val)
+        return " ".join(parts)


The current implementation of text_for_embedding concatenates the name, description, and all string attributes. This can lead to noisy and overly long text for the embedding model, potentially degrading the quality of the semantic blocking. For instance, it might include irrelevant metadata or long boilerplate text from attributes.

Consider making this more configurable. A better approach would be to use a specified list of fields for embedding, similar to the text_fields defined in ERConfig. This would give the user more control over the embedding representation and likely improve blocking performance.

gemini-code-assist · 2026-03-08T20:30:11Z

src/serf/pipeline.py

+def _llm_match_and_merge(blocks: list[EntityBlock], cfg: ERConfig) -> list[Entity]:
+    """Run LLM-based matching on blocks and return resolved entities.
+
+    Parameters
+    ----------
+    blocks : list[EntityBlock]
+        Blocks to process
+    cfg : ERConfig
+        Pipeline configuration
+
+    Returns
+    -------
+    list[Entity]
+        Resolved entities after LLM matching and merging


This function's name, _llm_match_and_merge, implies that it performs both matching and merging. However, the implementation only calls the EntityMatcher, which in turn uses the BlockMatch signature. The BlockMatch prompt instructs the LLM to perform the merge as part of its task. This means the serf.merge.merger.EntityMerger class is not actually used in this pipeline.

This isn't a bug, but it's a design discrepancy worth noting. The current approach (merging within the LLM prompt) is valid and efficient. However, it renders merger.py as dead code in the context of this main pipeline. Consider either renaming this function to better reflect that it only orchestrates the LLM call (e.g., _resolve_blocks_with_llm) or refactoring to use the EntityMerger for a more modular, two-step match-then-merge process, which might offer more control and testability.

…ig for agentic override Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

gemini-code-assist · 2026-03-08T20:32:37Z

I have completed the initial review and provided a summary of the changes. I will now proceed with a more detailed code review and post my findings as review comments shortly. Please let me know if you have any specific areas you'd like me to focus on, or any questions regarding the summary.

…All matching via LLM. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…hing Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…g (Gemini Flash, 30 concurrent) Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…king Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

- pipeline.py: Assign UUIDs at entity creation, track all historical UUIDs - uuid_mapper.py: Transitive source_ids/source_uuids collection, dedup, exclude self - merger.py: Dedup source_ids/source_uuids, exclude master's own ID/UUID - matcher.py: Add iteration param to resolve_block/resolve_blocks, set match_skip_history - metrics.py: Add validate_source_uuids function - Tests for dedup, self-exclusion, transitive accumulation, and UUID validation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…analysis, UUID validation, PASS/FAIL checks Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

rjurney · 2026-03-08T23:42:17Z

/gemini review
@gemini review
@claude review
@copilot review

Copilot · 2026-03-08T23:42:23Z

@rjurney I've opened a new pull request, #18, to work on those changes. Once the pull request is ready, I'll request review from you.

gemini-code-assist · 2026-03-08T23:42:39Z

It seems Copilot has opened a new pull request to address some changes. I will await the review request on that new pull request (#18) once it's ready.

rjurney · 2026-03-08T23:44:30Z

It seems Copilot has opened a new pull request to address some changes. I will await the review request on that new pull request (#18) once it's ready.

review now
/gemini review

…ss-cpu Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

gemini-code-assist

Code Review

This pull request presents an impressive and comprehensive initial implementation of the Semantic Entity Resolution Framework (SERF), utilizing modern tooling like uv and ruff, and DSPy for LLM interaction. While the code quality is high, a critical security vulnerability related to Prompt Injection has been identified. Untrusted data from datasets or user input is being included in LLM prompts without sufficient sanitization or structural isolation, potentially allowing attackers to manipulate the entity resolution process or cause denial of service. To mitigate this, consider strengthening prompt engineering with robust delimiters, validating LLM outputs against strict schemas, and implementing safe bounds for LLM-generated configuration parameters. Additionally, address minor code quality issues such as incorrect version specifiers for the ruff tool and some code duplication in the CLI module.

gemini-code-assist · 2026-03-08T23:52:12Z

.pre-commit-config.yaml

-        types: [python]
-  - repo: local
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.11.0


The ruff-pre-commit repository version v0.11.0 appears to be invalid. The astral-sh/ruff-pre-commit repository does not have a tag with this version, which will cause pre-commit install or pre-commit autoupdate to fail. Please update this to a recent, valid version. The latest version is v0.4.4.

rev: v0.4.4

gemini-code-assist · 2026-03-08T23:52:12Z

pyproject.toml

+dev = [
+    "pytest>=8.0",
+    "pytest-asyncio>=1.0",
+    "ruff>=0.11",


The version specifier ruff>=0.11 is not a valid specifier for the ruff package, as its versions follow a 0.x.y scheme (e.g., 0.4.4). This will likely cause dependency resolution to fail. Please correct this to a valid version specifier, for example, by pinning to a recent version.

Suggested change

"ruff>=0.11",

"ruff>=0.4.4",

gemini-code-assist · 2026-03-08T23:52:12Z

src/serf/dspy/agents.py

+        self,
+        dataset_description: str,
+        entity_type: str = "entity",
+    ) -> dspy.Prediction:
+        """Run the agentic ER pipeline.
+
+        Parameters
+        ----------
+        dataset_description : str
+            Summary of the dataset to resolve
+        entity_type : str
+            Type of entities being resolved
+
+        Returns
+        -------
+        dspy.Prediction
+            The agent's action plan and reasoning
+        """
+        return cast(
+            dspy.Prediction,
+            self.react(
+                dataset_description=dataset_description,


The ERAgent.forward method takes a dataset_description parameter which is directly passed to the dspy.ReAct agent. This input is not sanitized or validated, allowing a user to provide a crafted description that could manipulate the agent's reasoning and tool usage (Prompt Injection). In a ReAct loop, this could lead to the agent performing unintended actions or returning misleading results. Consider validating the input or using a more restrictive prompt template that clearly separates user input from instructions.

gemini-code-assist · 2026-03-08T23:52:12Z

src/serf/analyze/profiler.py

+def generate_er_config(
+    profile: DatasetProfile,
+    sample_records: list[dict[str, Any]],
+    model: str = "gemini/gemini-2.0-flash",
+) -> str:
+    """Use an LLM to generate an ER config YAML from a dataset profile.
+
+    Parameters
+    ----------
+    profile : DatasetProfile
+        Statistical profile of the dataset
+    sample_records : list[dict[str, Any]]
+        Sample records from the dataset (5-10 records)
+    model : str
+        LLM model to use for config generation
+
+    Returns
+    -------
+    str
+        YAML string with the recommended ER configuration
+    """
+    api_key = os.environ.get("GEMINI_API_KEY", "")
+    lm = dspy.LM(model, api_key=api_key)
+
+    predictor = dspy.ChainOfThought(GenerateERConfig)
+
+    profile_json = profile.model_dump_json(indent=2)
+    samples_json = json.dumps(sample_records[:10], indent=2, default=str)
+
+    logger.info("Generating ER config with LLM...")
+    with dspy.context(lm=lm, adapter=BAMLAdapter()):
+        result = predictor(


The generate_er_config function constructs an LLM prompt using a statistical profile and sample records from the input dataset. If the dataset contains malicious content, it could influence the LLM to generate an insecure or incorrect entity resolution configuration. This 'Indirect Prompt Injection' could lead to denial of service (e.g., by setting extremely high iteration counts) or manipulation of the ER process. Ensure that the LLM output is validated against a strict schema and that sensitive configuration parameters have safe upper bounds.

gemini-code-assist · 2026-03-08T23:52:12Z

src/serf/match/matcher.py

+    def resolve_block(self, block: EntityBlock, iteration: int = 1) -> BlockResolution:
+        """Process a single block through the LLM.
+
+        Parameters
+        ----------
+        block : EntityBlock
+            Block of entities to resolve
+        iteration : int
+            Current pipeline iteration number
+
+        Returns
+        -------
+        BlockResolution
+            Resolution with merged and non-matched entities
+        """
+        mapper = UUIDMapper()
+        mapped_block = mapper.map_block(block)
+
+        block_records = json.dumps(
+            [e.model_dump(mode="json") for e in mapped_block.entities],
+            indent=2,
+        )
+        few_shot = get_default_few_shot_examples()
+
+        try:
+            lm = self._ensure_lm()
+            with dspy.context(lm=lm, adapter=self._adapter):
+                result = self.predictor(


The EntityMatcher.resolve_block method serializes entity data from the dataset into JSON and includes it directly in the LLM prompt for matching. An attacker could include malicious data within entity attributes (e.g., names or descriptions) designed to subvert the matching logic, cause the LLM to ignore other entities, or produce false matches. This is a form of Indirect Prompt Injection. Consider using robust delimiters and explicit instructions to the LLM to treat the data as untrusted content.

gemini-code-assist · 2026-03-08T23:52:12Z

src/serf/edge/resolver.py

+    async def resolve_edge_block(
+        self, block_key: str, edges: list[dict[str, Any]]
+    ) -> list[dict[str, Any]]:
+        """Resolve a single block of duplicate edges.
+
+        On error, return original edges unchanged.
+
+        Parameters
+        ----------
+        block_key : str
+            Key identifying this block (e.g. JSON of [src, dst, type])
+        edges : list[dict[str, Any]]
+            Edges in this block
+
+        Returns
+        -------
+        list[dict[str, Any]]
+            Resolved edges (deduplicated/merged), or original on error
+        """
+        if len(edges) <= 1:
+            return edges
+
+        try:
+            edge_block_json = json.dumps(edges)
+            result = await asyncio.to_thread(self._predictor, edge_block=edge_block_json)


Similar to the EntityMatcher, the EdgeResolver.resolve_edge_block method passes edge data from the dataset to an LLM prompt. Malicious edge attributes could be used to manipulate the edge deduplication and merging logic via prompt injection. Ensure that the prompt instructions clearly distinguish between the task logic and the data being processed.

gemini-code-assist · 2026-03-08T23:52:12Z

src/serf/cli/main.py

+def _detect_name_column(columns: list[str]) -> str:
+    """Detect the primary name column from a list of column names.
+
+    Parameters
+    ----------
+    columns : list[str]
+        Column names to search
+
+    Returns
+    -------
+    str
+        The detected name column
+    """
+    name_candidates = [
+        "title",
+        "name",
+        "product_name",
+        "company_name",
+        "entity_name",
+    ]
+    for candidate in name_candidates:
+        if candidate in columns:
+            return candidate
+    for col in columns:
+        if col != "id":
+            return col
+    return columns[0]
+
+
+def _dataframe_to_entities(df: Any) -> list[Any]:
+    """Convert a pandas DataFrame to a list of Entity objects.
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        Input DataFrame with entity records
+
+    Returns
+    -------
+    list[Entity]
+        List of Entity objects
+    """
+
+    from serf.dspy.types import Entity
+
+    entities = []
+    name_col = _detect_name_column(df.columns.tolist())
+    for i, (_idx, row) in enumerate(df.iterrows()):
+        row_dict = row.to_dict()
+        name = str(row_dict.get(name_col, f"entity_{i}"))
+        desc_parts = [
+            str(v) for k, v in row_dict.items() if k != name_col and isinstance(v, str) and v
+        ]
+        entities.append(
+            Entity(
+                id=i,  # Use sequential index — original IDs may be strings
+                name=name,
+                description=" ".join(desc_parts),
+                attributes=row_dict,
+            )
+        )
+    return entities
+


The helper functions _detect_name_column and _dataframe_to_entities in this module duplicate logic that is also implemented in a more comprehensive way in src/serf/pipeline.py. This can lead to maintenance challenges and inconsistencies if one implementation is updated but the other is not. To improve code reuse and maintainability, consider refactoring to use a single, shared implementation for loading data and converting it to Entity objects across all relevant CLI commands. For example, the functions from pipeline.py could be moved to a shared utility module.

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

… FAISS compatibility Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

… processes to fix macOS MPS segfault Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…NING.md from Eridu lessons Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…ngual-e5-base, remove all pip references Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…P=0.885 R=0.581 F1=0.701 Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…rvice profiles Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

… defenses, validate LLM config output, deduplicate CLI helpers Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

The primary SERF interface is DataFrame-in, DataFrame-out. Pydantic types are auto-generated internally from df.schema + DatasetProfile. Document the full flow: DataFrame → Pydantic → JSON → DSPy/LLM → Pydantic → DataFrame. Users never need to define types unless they want custom control.

Add true_positives and false_positives to evaluate_resolution() return dict. Display benchmark results as a pd.DataFrame table instead of individual click.echo lines.

…raphlet-AI/serf into cursor/serf-long-shot-plan-system-b0d4

…ing, deprecate connected components Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

- Add mlflow[genai]>=3.10.1 dependency - Add serf mlflow command to start local MLflow server with SQLite backend - Create serf.tracking module with setup_mlflow() for DSPy autologging - Enable MLflow tracing in run, match, benchmark, benchmark-all commands - Use click.Choice for --dataset in benchmark/download commands - Fix benchmark tests for proper mock patching and type annotations

rjurney and others added 19 commits March 7, 2026 19:24

Add $100 overnight budget constraint for Gemini API usage

1e2c786

Gemini 2.0 Flash only for all ER pipeline operations. Gemini 2.5 Pro allowed only for validation data generation with < 2K API calls.

Add build plan for SERF implementation

21414d2

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Convert from Poetry to uv, replace black/isort/flake8 with Ruff, upda…

61ffbbe

…te config and module structure Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add pipeline types, type generator, and DSPy signatures with tests

a4cfc4b

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add blocking module: embeddings, FAISS blocker, name normalization, p…

ae87c7a

…ipeline with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add matching and merging modules: UUID mapper, matcher, few-shot, mer…

998f491

…ger with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add evaluation metrics, benchmarks, dataset analysis, and edge resolu…

d4c5378

…tion with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add Spark integration (schemas, utils, iceberg, graph) and DSPy agent…

21c8c84

…s with tests Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add complete CLI with analyze, block, match, eval, resolve, benchmark…

9300375

…, download commands Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Update benchmarks to use Leipzig dataset source, add benchmark script…

68ec92b

…s, fix text column detection Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add README, LICENSE, PyPI packaging setup

fb9707f

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add benchmark results: DBLP-ACM F1=0.83, Abt-Buy F1=0.46, DBLP-Schola…

81c823e

…r F1=0.90 (embedding baseline) Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Port scripts into CLI: add benchmark-all command, --use-llm/--no-llm …

ae8f8c7

…flag, remove scripts/ directory Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add serf run command for end-to-end ER on any CSV/Parquet/Iceberg wit…

68d1ecf

…h optional YAML config Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Remove embedding-based matching: use embeddings only for blocking, LL…

4070f39

…M for all matching Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Enhance serf analyze to generate LLM-powered ER config YAML with --ou…

de0c251

…tput flag Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add integration tests: LLM-generated ER config from benchmark data wi…

b3aaea1

…th validation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add Publication/Product types, fix block sizes (target=30, max=100), …

c9b3487

…fix analyze LLM guidance, add auto-convergence Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

rjurney self-assigned this Mar 8, 2026

Fix DSPy threading with dspy.context, add head-to-head benchmark comp…

65d8876

…arison test, fix analyze config generation Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

gemini-code-assist bot reviewed Mar 8, 2026

View reviewed changes

Default to name-only embedding for blocking, add blocking_fields conf…

c09cd83

…ig for agentic override Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

cursoragent and others added 4 commits March 8, 2026 20:48

Add critical rule: embeddings for blocking only, never for matching. …

9685616

…All matching via LLM. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Set max_tokens=8192 for LLM matcher to prevent output truncation

13b95a1

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add --limit and --concurrency options, tqdm progress bar for LLM matc…

ca166d7

…hing Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Benchmark results: DBLP-ACM P=0.895 R=0.625 F1=0.736 with LLM matchin…

b3f8cb0

…g (Gemini Flash, 30 concurrent) Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

cursoragent and others added 4 commits March 8, 2026 21:59

Convert async tests to use pytest-asyncio with @pytest.mark.asyncio

32e7555

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add SCALABILITY.md: vector engine recommendations for beyond-RAM bloc…

e83ec35

…king Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Port Abzu er_eval.py rigor: comprehensive evaluator with dedup, skip …

bc27560

…analysis, UUID validation, PASS/FAIL checks Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Copilot AI mentioned this pull request Mar 8, 2026

Code review fixes: config centralization, dead code removal, import cleanup, threshold normalization #18

Draft

Improve install instructions: add pip and conda paths, note about fai…

3d7a3d0

…ss-cpu Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

gemini-code-assist bot reviewed Mar 8, 2026

View reviewed changes

cursoragent and others added 18 commits March 8, 2026 23:58

Add pyspark-mcp dependency

d2bd6d0

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Fix FAISS segfault on macOS: force CPU encoding, contiguous array for…

fe6d385

… FAISS compatibility Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Subprocess isolation for PyTorch/FAISS: embed and cluster in separate…

23193fb

… processes to fix macOS MPS segfault Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Switch to intfloat/multilingual-e5-large embedding model, add FINE_TU…

490893d

…NING.md from Eridu lessons Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Externalize all model names to config.yml, switch to intfloat/multili…

8a18631

…ngual-e5-base, remove all pip references Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Document: never use pip or uv pip, only uv add/sync/run

18721eb

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Benchmark results with multilingual-e5-base + Gemini Flash: DBLP-ACM …

fb14ea5

…P=0.885 R=0.581 F1=0.701 Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Dockerize: Dockerfile on Ubuntu 24.04 with uv, docker-compose with se…

c2bca43

…rvice profiles Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Address Gemini review round 2: fix ruff version, add prompt injection…

a42d5fa

… defenses, validate LLM config output, deduplicate CLI helpers Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add QUICKSTART.md: end-to-end guide for using SERF

9bde1a4

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Add TP/FP counts to benchmark and eval output as pandas table

77d781b

Add true_positives and false_positives to evaluate_resolution() return dict. Display benchmark results as a pd.DataFrame table instead of individual click.echo lines.

Merge branch 'cursor/serf-long-shot-plan-system-b0d4' of github.com:G…

aa4d480

…raphlet-AI/serf into cursor/serf-long-shot-plan-system-b0d4

Add PySpark-native pipeline with mapInPandas, salt-based block splitt…

2f5732b

…ing, deprecate connected components Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

--max-iterations argument for benchmarks

7cdd728

chore: config for mlflow

63e873a

chore: mlflow[genai]>=3.10.1

e53f0a2

rjurney merged commit c01d504 into main Mar 14, 2026

	id=int(row_dict.get("id", idx)), # type: ignore[arg-type]
	id=idx, # Use the DataFrame index for a consistent integer ID

		index.train(embeddings) # type: ignore[call-arg]
		index.add(embeddings) # type: ignore[call-arg]

Conversation

rjurney commented Mar 8, 2026

Uh oh!

rjurney commented Mar 8, 2026

Uh oh!

gemini-code-assist bot commented Mar 8, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot commented Mar 8, 2026

Uh oh!

rjurney commented Mar 8, 2026

Uh oh!

Copilot AI commented Mar 8, 2026

Uh oh!

gemini-code-assist bot commented Mar 8, 2026

Uh oh!

rjurney commented Mar 8, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants