Skip to content

Serf initial implementation: Semantic Entity Resolution Framework#17

Merged
rjurney merged 51 commits intomainfrom
cursor/serf-long-shot-plan-system-b0d4
Mar 14, 2026
Merged

Serf initial implementation: Semantic Entity Resolution Framework#17
rjurney merged 51 commits intomainfrom
cursor/serf-long-shot-plan-system-b0d4

Conversation

@rjurney
Copy link
Contributor

@rjurney rjurney commented Mar 8, 2026

Cursor Agent long-shot prompts overnight to build out the core features of SERF.

rjurney and others added 19 commits March 7, 2026 19:24
Gemini 2.0 Flash only for all ER pipeline operations. Gemini 2.5 Pro
allowed only for validation data generation with < 2K API calls.
Do not reuse Abzu's BAML-generated types. Build fresh domain-agnostic
Pydantic classes for DSPy signatures. Add auto-generation of entity
types from PySpark DataFrame schemas via type_generator module. Add
Spark-to-Python type mapping and DatasetProfile-driven field descriptions.
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…te config and module structure

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ipeline with tests

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ger with tests

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…tion with tests

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…s with tests

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…, download commands

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…s, fix text column detection

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…r F1=0.90 (embedding baseline)

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…flag, remove scripts/ directory

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…h optional YAML config

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…M for all matching

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…tput flag

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…th validation

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…fix analyze LLM guidance, add auto-convergence

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
@rjurney rjurney self-assigned this Mar 8, 2026
@rjurney
Copy link
Contributor Author

rjurney commented Mar 8, 2026

/gemini review
@gemini review

@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents the initial, comprehensive implementation of the Semantic Entity Resolution Framework (SERF). It establishes a robust, agentic pipeline for identifying and disambiguating entities across datasets. The changes lay the groundwork for a scalable and intelligent entity resolution system by integrating modern AI techniques like large language models and embeddings with efficient data processing frameworks.

Highlights

  • Project Infrastructure Overhaul: Migrated the project's dependency management from Poetry to uv, updated pre-commit hooks to utilize ruff for linting and formatting, and expanded .gitignore to include common development artifacts and IDE-specific files. An Apache License 2.0 was added to the repository.
  • Core Entity Resolution Pipeline Implementation: Introduced foundational data models including a generic Entity base class, EntityBlock, MatchDecision, and BlockResolution. Implemented semantic blocking using Qwen3 sentence embeddings and FAISS for efficient clustering, along with a SemanticBlockingPipeline to orchestrate the process.
  • Extensive DSPy Integration: Leveraged DSPy for various LLM-powered operations, including BlockMatch for entity matching and merging, EdgeResolve for deduplicating graph edges, AnalyzeDataset for profiling, and GenerateERConfig for LLM-driven configuration. An ERAgent was introduced for agentic control of the ER pipeline.
  • Comprehensive CLI and Benchmarking: Developed a full-featured command-line interface (serf) with commands for run, analyze, block, match, eval, edges, download, benchmark, and benchmark-all. Integrated with standard ER benchmark datasets (Leipzig, DeepMatcher formats) and evaluation metrics.
  • Spark Integration and Utilities: Added modules for PySpark integration, including schema conversion utilities (Pydantic to Spark StructType), graph algorithms (connected_components), and initial Iceberg table support.
  • Detailed Documentation and Planning: Created a comprehensive BUILD_PLAN.md outlining the development phases and tasks. Updated CLAUDE.md and README.md to reflect the new tooling, architecture, and usage instructions for the SERF framework.
Changelog
  • .gitignore
    • Added Python-specific build and environment ignores.
    • Added IDE-specific ignores for .idea and .vscode directories.
  • .pre-commit-config.yaml
    • Replaced Black, Flake8, and Isort hooks with Ruff and Ruff-format.
    • Updated Zuban hook to use uv run and disabled filename passing.
    • Removed Pytest hook.
  • CLAUDE.md
    • Updated development instructions to use uv for dependency management and command execution.
    • Replaced references to Black, Isort, and Flake8 with Ruff for linting and formatting.
    • Updated LLM-related guidance from BAML to DSPy and BAMLAdapter.
    • Removed the BAML Client Generation section.
  • LICENSE
    • Added the Apache License 2.0.
  • README.md
    • Updated project title and added license and Python version badges.
    • Rewrote the project description to detail SERF's agentic approach and phases.
    • Added an architecture table outlining key technologies used.
    • Updated Quick Start and Development sections to reflect uv and Ruff tooling.
    • Included Python API and DSPy Interface usage examples.
    • Added a Benchmark Results section with baseline performance data.
    • Detailed the project's directory structure and configuration access.
  • config.yml
    • Added extensive configuration for models (embedding, LLM), entity resolution parameters (blocking, matching, evaluation, paths), and benchmark datasets.
  • docs/BUILD_PLAN.md
    • Added a detailed, multi-phase build plan for the SERF project, covering infrastructure, type system, DSPy signatures, core modules, evaluation, CLI, and PyPI preparation.
  • docs/SERF_LONG_SHOT_PLAN.md
    • Updated the plan to emphasize fresh DSPy Pydantic types over BAML-generated ones.
    • Included a task for auto-generating entity types from DataFrames.
    • Replaced Poetry with uv in the evolution patterns.
    • Added a section detailing the overnight build budget constraint for Gemini API usage.
  • pyproject.toml
    • Migrated project metadata from Poetry format to PEP 621 [project] format.
    • Updated and added dependencies including dspy-ai, pyspark, sentence-transformers, faiss-cpu, cleanco, tqdm, numpy, and pandas.
    • Replaced Black and Isort configurations with Ruff linting and formatting settings.
    • Added project URLs and classifiers for PyPI publication.
  • src/serf/analyze/init.py
    • Added module initialization for the analyze package.
  • src/serf/analyze/field_detection.py
    • Added a new module for detecting field types based on name and sample values using heuristics and regex.
  • src/serf/analyze/profiler.py
    • Added a new module for DatasetProfiler to analyze dataset statistics and generate_er_config to create ER configurations using an LLM.
  • src/serf/block/embeddings.py
    • Added a new module for EntityEmbedder to compute sentence embeddings using sentence-transformers.
  • src/serf/block/faiss_blocker.py
    • Added a new module for FAISSBlocker to cluster entity embeddings into blocks using FAISS IVF.
  • src/serf/block/normalize.py
    • Added a new module for name normalization utilities, including corporate suffix removal, acronym generation, and multilingual stop word filtering.
  • src/serf/block/pipeline.py
    • Added a new module for SemanticBlockingPipeline to orchestrate the embedding, clustering, and block splitting process.
  • src/serf/cli/main.py
    • Refactored and expanded the CLI with new commands for run, analyze, block, match, eval, edges, download, benchmark, and benchmark-all.
    • Implemented data loading, entity conversion, and pipeline execution logic within CLI commands.
  • src/serf/config.py
    • Improved error handling for file operations and type hints in the Config class.
  • src/serf/dspy/agents.py
    • Added a new module for ERAgent, a DSPy ReAct agent, to control the entity resolution pipeline dynamically.
  • src/serf/dspy/baml_adapter.py
    • Updated field structure formatting to correctly iterate over input fields.
  • src/serf/dspy/signatures.py
    • Added new DSPy signatures for BlockMatch, EntityMerge, EdgeResolve, AnalyzeDataset, and GenerateERConfig.
  • src/serf/dspy/type_generator.py
    • Added a new module for entity_type_from_spark_schema to dynamically generate Pydantic Entity subclasses from Spark schemas.
  • src/serf/dspy/types.py
    • Rewrote the module to define new, domain-agnostic Pydantic types for the ER pipeline, including Entity, Publication, Product, EntityBlock, MatchDecision, BlockResolution, and various metrics types.
  • src/serf/edge/init.py
    • Added module initialization for the edge package.
  • src/serf/edge/resolver.py
    • Added a new module for EdgeResolver to deduplicate graph edges using LLM-guided merging.
  • src/serf/eval/init.py
    • Added module initialization for the eval package.
  • src/serf/eval/benchmarks.py
    • Added a new module for BenchmarkDataset to manage downloading, loading, and evaluating standard entity resolution benchmarks.
  • src/serf/eval/metrics.py
    • Added a new module for evaluation metrics including precision, recall, F1 score, pair completeness, and reduction ratio.
  • src/serf/match/init.py
    • Added module initialization for the match package.
  • src/serf/match/few_shot.py
    • Added a new module for generating default and custom few-shot examples for LLM matching.
  • src/serf/match/matcher.py
    • Added a new module for EntityMatcher to resolve entity blocks using DSPy BlockMatch signatures with async processing.
  • src/serf/match/uuid_mapper.py
    • Added a new module for UUIDMapper to map entity IDs to integers for LLM processing and restore original IDs/UUIDs.
  • src/serf/merge/init.py
    • Added module initialization for the merge package.
  • src/serf/merge/merger.py
    • Added a new module for EntityMerger to combine multiple entities into a single canonical record.
  • src/serf/pipeline.py
    • Added a new module for the run_pipeline function, orchestrating the end-to-end entity resolution process with iterative blocking and LLM matching.
  • src/serf/spark/init.py
    • Added module initialization for the spark package.
  • src/serf/spark/graph.py
    • Added a new module for connected_components function to find connected components in Spark DataFrames.
  • src/serf/spark/iceberg.py
    • Added a new module for Iceberg integration, including session creation and data read/write functions.
  • src/serf/spark/schemas.py
    • Added a new module for Pydantic to Spark schema conversion and schema validation utilities.
  • src/serf/spark/utils.py
    • Added a new module for Spark utility functions, including splitting large blocks and selecting most common properties.
  • tests/test_agents.py
    • Added tests for the ERAgent and its tool functions.
  • tests/test_analyze_benchmark.py
    • Added integration tests for LLM-powered dataset analysis and benchmark config generation.
  • tests/test_benchmarks.py
    • Added tests for benchmark dataset loading, entity conversion, and evaluation.
  • tests/test_cli.py
    • Added tests for the serf CLI commands and their help messages.
  • tests/test_dspy.py
    • Updated DSPy integration tests.
  • tests/test_edge_resolver.py
    • Added tests for the EdgeResolver module.
  • tests/test_embeddings.py
    • Added tests for the EntityEmbedder module.
  • tests/test_faiss_blocker.py
    • Added tests for the FAISSBlocker module.
  • tests/test_few_shot.py
    • Added tests for few-shot example generation functions.
  • tests/test_field_detection.py
    • Added tests for the field type detection utility.
  • tests/test_graph.py
    • Added tests for Spark graph connected components.
  • tests/test_merger.py
    • Added tests for the EntityMerger module.
  • tests/test_metrics.py
    • Added tests for evaluation metrics functions.
  • tests/test_normalize.py
    • Added tests for name normalization utilities.
  • tests/test_profiler.py
    • Added tests for the DatasetProfiler module.
  • tests/test_schemas.py
    • Added tests for Spark schema conversion and validation utilities.
  • tests/test_signatures.py
    • Added tests for DSPy signature definitions.
  • tests/test_type_generator.py
    • Added tests for dynamically generating Pydantic Entity subclasses from Spark schemas.
Activity
  • The author, rjurney, initiated this pull request as the 'Serf initial implementation', indicating a foundational and extensive set of changes.
  • The description mentions that 'Cursor Agent long-shot prompts overnight to build out the core features of SERF', suggesting that a significant portion of this implementation was generated through an automated process.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

…arison test, fix analyze config generation

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides an impressive and comprehensive initial implementation of the SERF framework, successfully migrating the project to a modern Python toolchain with uv and ruff, and building out core components for a sophisticated, multi-stage entity resolution pipeline. No security vulnerabilities were found. My review highlights critical correctness issues in data handling and the UUID mapping recovery logic, along with suggestions to improve design consistency and robustness.

]
entities.append(
Entity(
id=int(row_dict.get("id", idx)), # type: ignore[arg-type]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The use of int(row_dict.get("id", idx)) is likely to cause a crash when processing datasets where the 'id' column is not a numeric type. For example, the DBLP-ACM benchmark dataset uses string identifiers like "conf/sigmod/Abadi05". Attempting to cast this to an int will raise a ValueError.

To make this helper more robust, you should handle non-integer IDs gracefully. One approach is to create an internal mapping from the original string IDs to integer IDs if they are not already integers, or simply use the pandas index idx consistently as the primary integer ID for processing.

Suggested change
id=int(row_dict.get("id", idx)), # type: ignore[arg-type]
id=idx, # Use the DataFrame index for a consistent integer ID

Comment on lines +95 to +100
# Phase 1: Add missing IDs to first resolved entity's source_ids
if missing_ids and resolution.resolved_entities:
first = resolution.resolved_entities[0]
existing_sources = set(first.source_ids or [])
first_sources = list(existing_sources | missing_ids)
resolution.resolved_entities[0] = first.model_copy(update={"source_ids": first_sources})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The Phase 1 recovery logic for missing entities makes a strong and potentially incorrect assumption. If the LLM fails to return some entities from a block, this code assumes they were all merged into the first resolved entity. This can corrupt the merge lineage (source_ids), which is critical for a multi-round ER system.

For example, if a block contains {A, B, C} and the LLM only returns a merged A+B, this logic would incorrectly record that C was also merged into A+B.

A safer approach would be to rely solely on the Phase 2 recovery mechanism for all missing entities. This would treat any entity not returned by the LLM as an un-merged singleton, preserving the integrity of the data lineage. I recommend removing this Phase 1 logic.

pyproject.toml Outdated
packages = ["src/serf"]

[tool.black]
[dependency-groups]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While [dependency-groups] is supported by uv, it's a uv-specific extension. The standard way to define optional dependencies according to PEP 621 is using [project.optional-dependencies]. Using the standard format would improve portability with other packaging tools like pip.

For example:

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=1.0",
    "ruff>=0.11",
    "zuban>=0.0.23",
    "pre-commit>=4.0",
    "types-pyyaml>=6.0",
]

Comment on lines +100 to +101
index.train(embeddings) # type: ignore[call-arg]
index.add(embeddings) # type: ignore[call-arg]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The type: ignore[call-arg] comments suggest a potential type mismatch between the embeddings array and what the faiss library expects. While the code may work, these ignores can hide underlying issues, especially with C-extension libraries where type stubs might be incomplete or out of sync. It would be beneficial to investigate the exact cause of the type error and resolve it, which might involve a minor type cast or adjustment, to ensure type safety and remove the need for the ignores.

Comment on lines +56 to +70
def text_for_embedding(self) -> str:
"""Return text representation for embedding.

Returns
-------
str
Concatenation of name and description for embedding
"""
parts = [self.name]
if self.description:
parts.append(self.description)
for _key, val in self.attributes.items():
if isinstance(val, str) and val:
parts.append(val)
return " ".join(parts)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of text_for_embedding concatenates the name, description, and all string attributes. This can lead to noisy and overly long text for the embedding model, potentially degrading the quality of the semantic blocking. For instance, it might include irrelevant metadata or long boilerplate text from attributes.

Consider making this more configurable. A better approach would be to use a specified list of fields for embedding, similar to the text_fields defined in ERConfig. This would give the user more control over the embedding representation and likely improve blocking performance.

Comment on lines +439 to +452
def _llm_match_and_merge(blocks: list[EntityBlock], cfg: ERConfig) -> list[Entity]:
"""Run LLM-based matching on blocks and return resolved entities.

Parameters
----------
blocks : list[EntityBlock]
Blocks to process
cfg : ERConfig
Pipeline configuration

Returns
-------
list[Entity]
Resolved entities after LLM matching and merging

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This function's name, _llm_match_and_merge, implies that it performs both matching and merging. However, the implementation only calls the EntityMatcher, which in turn uses the BlockMatch signature. The BlockMatch prompt instructs the LLM to perform the merge as part of its task. This means the serf.merge.merger.EntityMerger class is not actually used in this pipeline.

This isn't a bug, but it's a design discrepancy worth noting. The current approach (merging within the LLM prompt) is valid and efficient. However, it renders merger.py as dead code in the context of this main pipeline. Consider either renaming this function to better reflect that it only orchestrates the LLM call (e.g., _resolve_blocks_with_llm) or refactoring to use the EntityMerger for a more modular, two-step match-then-merge process, which might offer more control and testability.

…ig for agentic override

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
@gemini-code-assist
Copy link

I have completed the initial review and provided a summary of the changes. I will now proceed with a more detailed code review and post my findings as review comments shortly. Please let me know if you have any specific areas you'd like me to focus on, or any questions regarding the summary.

cursoragent and others added 4 commits March 8, 2026 20:48
…All matching via LLM.

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…hing

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…g (Gemini Flash, 30 concurrent)

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
cursoragent and others added 4 commits March 8, 2026 21:59
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…king

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
- pipeline.py: Assign UUIDs at entity creation, track all historical UUIDs
- uuid_mapper.py: Transitive source_ids/source_uuids collection, dedup, exclude self
- merger.py: Dedup source_ids/source_uuids, exclude master's own ID/UUID
- matcher.py: Add iteration param to resolve_block/resolve_blocks, set match_skip_history
- metrics.py: Add validate_source_uuids function
- Tests for dedup, self-exclusion, transitive accumulation, and UUID validation

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…analysis, UUID validation, PASS/FAIL checks

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
@rjurney
Copy link
Contributor Author

rjurney commented Mar 8, 2026

/gemini review
@gemini review
@claude review
@copilot review

Copy link

Copilot AI commented Mar 8, 2026

@rjurney I've opened a new pull request, #18, to work on those changes. Once the pull request is ready, I'll request review from you.

@gemini-code-assist
Copy link

It seems Copilot has opened a new pull request to address some changes. I will await the review request on that new pull request (#18) once it's ready.

@rjurney
Copy link
Contributor Author

rjurney commented Mar 8, 2026

It seems Copilot has opened a new pull request to address some changes. I will await the review request on that new pull request (#18) once it's ready.

review now
/gemini review

…ss-cpu

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request presents an impressive and comprehensive initial implementation of the Semantic Entity Resolution Framework (SERF), utilizing modern tooling like uv and ruff, and DSPy for LLM interaction. While the code quality is high, a critical security vulnerability related to Prompt Injection has been identified. Untrusted data from datasets or user input is being included in LLM prompts without sufficient sanitization or structural isolation, potentially allowing attackers to manipulate the entity resolution process or cause denial of service. To mitigate this, consider strengthening prompt engineering with robust delimiters, validating LLM outputs against strict schemas, and implementing safe bounds for LLM-generated configuration parameters. Additionally, address minor code quality issues such as incorrect version specifiers for the ruff tool and some code duplication in the CLI module.

types: [python]
- repo: local
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.11.0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The ruff-pre-commit repository version v0.11.0 appears to be invalid. The astral-sh/ruff-pre-commit repository does not have a tag with this version, which will cause pre-commit install or pre-commit autoupdate to fail. Please update this to a recent, valid version. The latest version is v0.4.4.

    rev: v0.4.4

pyproject.toml Outdated
dev = [
"pytest>=8.0",
"pytest-asyncio>=1.0",
"ruff>=0.11",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The version specifier ruff>=0.11 is not a valid specifier for the ruff package, as its versions follow a 0.x.y scheme (e.g., 0.4.4). This will likely cause dependency resolution to fail. Please correct this to a valid version specifier, for example, by pinning to a recent version.

Suggested change
"ruff>=0.11",
"ruff>=0.4.4",

Comment on lines +160 to +181
self,
dataset_description: str,
entity_type: str = "entity",
) -> dspy.Prediction:
"""Run the agentic ER pipeline.

Parameters
----------
dataset_description : str
Summary of the dataset to resolve
entity_type : str
Type of entities being resolved

Returns
-------
dspy.Prediction
The agent's action plan and reasoning
"""
return cast(
dspy.Prediction,
self.react(
dataset_description=dataset_description,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The ERAgent.forward method takes a dataset_description parameter which is directly passed to the dspy.ReAct agent. This input is not sanitized or validated, allowing a user to provide a crafted description that could manipulate the agent's reasoning and tool usage (Prompt Injection). In a ReAct loop, this could lead to the agent performing unintended actions or returning misleading results. Consider validating the input or using a more restrictive prompt template that clearly separates user input from instructions.

Comment on lines +113 to +144
def generate_er_config(
profile: DatasetProfile,
sample_records: list[dict[str, Any]],
model: str = "gemini/gemini-2.0-flash",
) -> str:
"""Use an LLM to generate an ER config YAML from a dataset profile.

Parameters
----------
profile : DatasetProfile
Statistical profile of the dataset
sample_records : list[dict[str, Any]]
Sample records from the dataset (5-10 records)
model : str
LLM model to use for config generation

Returns
-------
str
YAML string with the recommended ER configuration
"""
api_key = os.environ.get("GEMINI_API_KEY", "")
lm = dspy.LM(model, api_key=api_key)

predictor = dspy.ChainOfThought(GenerateERConfig)

profile_json = profile.model_dump_json(indent=2)
samples_json = json.dumps(sample_records[:10], indent=2, default=str)

logger.info("Generating ER config with LLM...")
with dspy.context(lm=lm, adapter=BAMLAdapter()):
result = predictor(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The generate_er_config function constructs an LLM prompt using a statistical profile and sample records from the input dataset. If the dataset contains malicious content, it could influence the LLM to generate an insecure or incorrect entity resolution configuration. This 'Indirect Prompt Injection' could lead to denial of service (e.g., by setting extremely high iteration counts) or manipulation of the ER process. Ensure that the LLM output is validated against a strict schema and that sensitive configuration parameters have safe upper bounds.

Comment on lines +81 to +108
def resolve_block(self, block: EntityBlock, iteration: int = 1) -> BlockResolution:
"""Process a single block through the LLM.

Parameters
----------
block : EntityBlock
Block of entities to resolve
iteration : int
Current pipeline iteration number

Returns
-------
BlockResolution
Resolution with merged and non-matched entities
"""
mapper = UUIDMapper()
mapped_block = mapper.map_block(block)

block_records = json.dumps(
[e.model_dump(mode="json") for e in mapped_block.entities],
indent=2,
)
few_shot = get_default_few_shot_examples()

try:
lm = self._ensure_lm()
with dspy.context(lm=lm, adapter=self._adapter):
result = self.predictor(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The EntityMatcher.resolve_block method serializes entity data from the dataset into JSON and includes it directly in the LLM prompt for matching. An attacker could include malicious data within entity attributes (e.g., names or descriptions) designed to subvert the matching logic, cause the LLM to ignore other entities, or produce false matches. This is a form of Indirect Prompt Injection. Consider using robust delimiters and explicit instructions to the LLM to treat the data as untrusted content.

Comment on lines +58 to +82
async def resolve_edge_block(
self, block_key: str, edges: list[dict[str, Any]]
) -> list[dict[str, Any]]:
"""Resolve a single block of duplicate edges.

On error, return original edges unchanged.

Parameters
----------
block_key : str
Key identifying this block (e.g. JSON of [src, dst, type])
edges : list[dict[str, Any]]
Edges in this block

Returns
-------
list[dict[str, Any]]
Resolved edges (deduplicated/merged), or original on error
"""
if len(edges) <= 1:
return edges

try:
edge_block_json = json.dumps(edges)
result = await asyncio.to_thread(self._predictor, edge_block=edge_block_json)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Similar to the EntityMatcher, the EdgeResolver.resolve_edge_block method passes edge data from the dataset to an LLM prompt. Malicious edge attributes could be used to manipulate the edge deduplication and merging logic via prompt injection. Ensure that the prompt instructions clearly distinguish between the task logic and the data being processed.

Comment on lines +845 to +907
def _detect_name_column(columns: list[str]) -> str:
"""Detect the primary name column from a list of column names.

Parameters
----------
columns : list[str]
Column names to search

Returns
-------
str
The detected name column
"""
name_candidates = [
"title",
"name",
"product_name",
"company_name",
"entity_name",
]
for candidate in name_candidates:
if candidate in columns:
return candidate
for col in columns:
if col != "id":
return col
return columns[0]


def _dataframe_to_entities(df: Any) -> list[Any]:
"""Convert a pandas DataFrame to a list of Entity objects.

Parameters
----------
df : pd.DataFrame
Input DataFrame with entity records

Returns
-------
list[Entity]
List of Entity objects
"""

from serf.dspy.types import Entity

entities = []
name_col = _detect_name_column(df.columns.tolist())
for i, (_idx, row) in enumerate(df.iterrows()):
row_dict = row.to_dict()
name = str(row_dict.get(name_col, f"entity_{i}"))
desc_parts = [
str(v) for k, v in row_dict.items() if k != name_col and isinstance(v, str) and v
]
entities.append(
Entity(
id=i, # Use sequential index — original IDs may be strings
name=name,
description=" ".join(desc_parts),
attributes=row_dict,
)
)
return entities

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The helper functions _detect_name_column and _dataframe_to_entities in this module duplicate logic that is also implemented in a more comprehensive way in src/serf/pipeline.py. This can lead to maintenance challenges and inconsistencies if one implementation is updated but the other is not. To improve code reuse and maintainability, consider refactoring to use a single, shared implementation for loading data and converting it to Entity objects across all relevant CLI commands. For example, the functions from pipeline.py could be moved to a shared utility module.

cursoragent and others added 18 commits March 8, 2026 23:58
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… FAISS compatibility

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… processes to fix macOS MPS segfault

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…NING.md from Eridu lessons

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…ngual-e5-base, remove all pip references

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…P=0.885 R=0.581 F1=0.701

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…rvice profiles

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
… defenses, validate LLM config output, deduplicate CLI helpers

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
The primary SERF interface is DataFrame-in, DataFrame-out. Pydantic
types are auto-generated internally from df.schema + DatasetProfile.
Document the full flow: DataFrame → Pydantic → JSON → DSPy/LLM →
Pydantic → DataFrame. Users never need to define types unless they
want custom control.
Add true_positives and false_positives to evaluate_resolution() return
dict. Display benchmark results as a pd.DataFrame table instead of
individual click.echo lines.
…raphlet-AI/serf into cursor/serf-long-shot-plan-system-b0d4
…ing, deprecate connected components

Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
- Add mlflow[genai]>=3.10.1 dependency
- Add serf mlflow command to start local MLflow server with SQLite backend
- Create serf.tracking module with setup_mlflow() for DSPy autologging
- Enable MLflow tracing in run, match, benchmark, benchmark-all commands
- Use click.Choice for --dataset in benchmark/download commands
- Fix benchmark tests for proper mock patching and type annotations
@rjurney rjurney merged commit c01d504 into main Mar 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants