SERF is an open-source framework for semantic entity resolution — identifying when two or more records refer to the same real-world entity using large language models, sentence embeddings, and agentic AI.
SERF runs multiple rounds of entity resolution until the dataset converges to a stable state. DSPy ReAct agents dynamically orchestrate the entire pipeline: adjusting blocking parameters, selecting matching strategies, and deciding when convergence has been reached.
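The iterate-until-convergence idea can be sketched as a fixed-point loop. This is a toy illustration, not SERF's API: the blocking and matching helpers below are trivial stand-ins (first-letter grouping and case-insensitive dedup) for the embedding and LLM stages.

```python
# Toy sketch of iterating block -> match -> merge until the record count
# stops shrinking. Helper functions are illustrative stand-ins, not SERF's.

def block(entities):
    # Toy blocking: group records by the first letter of their name.
    groups = {}
    for e in entities:
        groups.setdefault(e[0].lower(), []).append(e)
    return list(groups.values())

def match_and_merge(blocks):
    # Toy matching: keep one record per case-insensitive name within a block.
    merged = []
    for b in blocks:
        seen = {}
        for e in b:
            seen.setdefault(e.lower(), e)
        merged.extend(seen.values())
    return merged

def resolve_until_stable(entities, max_iterations=5):
    """Repeat block -> match -> merge until a fixed point is reached."""
    for iteration in range(1, max_iterations + 1):
        merged = match_and_merge(block(entities))
        if len(merged) == len(entities):  # nothing merged: converged
            return merged, iteration
        entities = merged
    return entities, max_iterations

records = ["Acme Corp", "acme corp", "ACME Corp", "Beta LLC"]
resolved, rounds = resolve_until_stable(records)
# -> (["Acme Corp", "Beta LLC"], 2)
```

In SERF the stopping decision is made by the agent rather than a simple record-count check, but the overall loop shape is the same.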
Clusters records using multilingual-e5-base sentence embeddings and FAISS IVF to create efficient blocks for comparison, auto-scaling block size across iterations.
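The blocking idea — embed records, then group them by nearest cluster centroid — can be shown without heavy dependencies. SERF uses multilingual-e5-base embeddings and FAISS `IndexIVFFlat` (which trains its centroids with k-means); this sketch substitutes random vectors and a plain NumPy nearest-centroid assignment to illustrate the mechanics only.

```python
import numpy as np

# Toy illustration of embedding-based blocking. Real SERF blocking uses
# multilingual-e5-base embeddings + FAISS IndexIVFFlat; random vectors and
# brute-force distances stand in here so the example is dependency-light.

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 8)).astype("float32")

# Pick k centroids (FAISS IVF learns these via k-means over the corpus).
k = 4
centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]

# Assign each record to its nearest centroid; each centroid's members
# form one candidate block for the LLM matching stage.
dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)
blocks = {i: np.flatnonzero(assignments == i) for i in range(k)}
```

With an IVF index, block size is tuned via the number of centroids (`nlist`), which is what auto-scaling across iterations adjusts.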
All three operations run in a single LLM prompt via DSPy signatures, with the BAMLAdapter handling structured output formatting. Block-level matching lets the LLM see all records in a block simultaneously and make holistic decisions.
For knowledge graphs: deduplicates edges that result from merging nodes, using LLM-guided intelligent merging.
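The mechanical core of edge resolution is easy to picture: once two nodes merge, edges that previously pointed at different node IDs can become identical and must be collapsed. SERF does this with LLM-guided merging; the sketch below (hypothetical helper, not SERF's API) shows only the remap-and-dedupe step.

```python
# Toy illustration of edge deduplication after node merges. SERF uses
# LLM-guided merging; here duplicates are collapsed purely mechanically.

def remap_and_dedupe(edges, node_mapping):
    """Rewrite edge endpoints through the merge mapping, then drop duplicates."""
    seen = set()
    result = []
    for src, rel, dst in edges:
        key = (node_mapping.get(src, src), rel, node_mapping.get(dst, dst))
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

# "acme_inc" was merged into "acme", so its edge now collides and collapses.
edges = [("acme", "located_in", "nyc"), ("acme_inc", "located_in", "nyc")]
deduped = remap_and_dedupe(edges, {"acme_inc": "acme"})
# -> [("acme", "located_in", "nyc")]
```

The LLM-guided part comes in when colliding edges carry conflicting properties and a simple set-union is not enough.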
| Component | Technology |
|---|---|
| Package Manager | uv |
| Data Processing | PySpark 4.x |
| LLM Framework | DSPy 3.x with BAMLAdapter |
| Embeddings | multilingual-e5-base via sentence-transformers |
| Vector Search | FAISS IndexIVFFlat |
| Linting/Formatting | Ruff |
| Type Checking | zuban (mypy-compatible) |
git clone https://github.com/Graphlet-AI/serf.git
cd serf
uv sync --extra dev

# Build
docker compose build
# Run any serf command
docker compose run serf benchmark --dataset dblp-acm
# Run benchmarks
docker compose --profile benchmark up
# Run tests
docker compose --profile test up
# Analyze a dataset (put your file in data/)
docker compose run serf analyze --input data/input.csv --output data/er_config.yml

Set your API key in a .env file or export it:
echo "GEMINI_API_KEY=your-key" > .env

- Python 3.12+
- Java 11/17/21 (for PySpark)
- 4GB+ RAM recommended
# Profile a dataset
serf analyze --input data/companies.parquet
# Run the full ER pipeline
serf resolve --input data/entities.csv --output data/resolved/ --iteration 1
# Run individual phases
serf block --input data/entities.csv --output data/blocks/ --method semantic
serf match --input data/blocks/ --output data/matches/ --iteration 1
serf eval --input data/matches/
# Benchmark against standard datasets
serf download --dataset dblp-acm
serf benchmark --dataset dblp-acm --output data/results/

from serf.block.pipeline import SemanticBlockingPipeline
from serf.match.matcher import EntityMatcher
from serf.eval.metrics import evaluate_resolution
# Block
pipeline = SemanticBlockingPipeline(target_block_size=50)
blocks, metrics = pipeline.run(entities)
# Match
matcher = EntityMatcher(model="gemini/gemini-2.0-flash")
resolutions = await matcher.resolve_blocks(blocks)  # run inside an async context
# Evaluate
metrics = evaluate_resolution(predicted_pairs, ground_truth_pairs)

import dspy
from serf.dspy.signatures import BlockMatch
from serf.dspy.baml_adapter import BAMLAdapter
lm = dspy.LM("gemini/gemini-2.0-flash", api_key=GEMINI_API_KEY)
dspy.configure(lm=lm, adapter=BAMLAdapter())
matcher = dspy.ChainOfThought(BlockMatch)
result = matcher(block_records=block_json, schema_info=schema, few_shot_examples=examples)

Performance on standard ER benchmarks from the Leipzig Database Group. Blocking uses multilingual-e5-base name-only embeddings + FAISS IVF. Matching uses Gemini 2.0 Flash via DSPy BlockMatch.
| Dataset | Domain | Left | Right | Matches | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| DBLP-ACM | Bibliographic | 2,616 | 2,294 | 2,224 | 0.8849 | 0.5809 | 0.7014 |
Blocking uses name-only embeddings for tighter semantic clusters. All matching decisions are made by the LLM — no embedding similarity thresholds.
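The precision/recall/F1 figures above are standard pairwise ER metrics: precision and recall over the set of predicted matched pairs versus the ground-truth pairs. A minimal computation (presumably similar in spirit to `evaluate_resolution` in `serf.eval.metrics`, though this is a sketch, not SERF's code) looks like:

```python
def pairwise_metrics(predicted_pairs, ground_truth_pairs):
    """Pairwise precision/recall/F1 over sets of matched record-id pairs."""
    # Normalize pair order so (a, b) and (b, a) compare equal.
    pred = {tuple(sorted(p)) for p in predicted_pairs}
    truth = {tuple(sorted(p)) for p in ground_truth_pairs}
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = pairwise_metrics({("a", "b"), ("c", "d")}, {("b", "a"), ("e", "f")})
# tp = 1 -> precision 0.5, recall 0.5, f1 0.5
```

As a sanity check on the table: 2 x 0.8849 x 0.5809 / (0.8849 + 0.5809) ≈ 0.7014, matching the reported F1.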
src/serf/
├── cli/ # Click CLI commands
├── dspy/ # DSPy types, signatures, agents, adapter
├── block/ # Semantic blocking (embeddings, FAISS, normalization)
├── match/ # UUID mapping, LLM matching, few-shot examples
├── merge/ # Field-level entity merging
├── edge/ # Edge resolution for knowledge graphs
├── eval/ # Metrics, benchmark datasets
├── analyze/ # Dataset profiling, field detection
├── spark/ # PySpark schemas, utils, Iceberg, graph components
├── config.py # Configuration management
└── logs.py # Logging
All configuration is centralized in config.yml:
from serf.config import config
model = config.get("models.llm") # "gemini/gemini-2.0-flash"
block_size = config.get("er.blocking.target_block_size")  # 50

# Install dependencies
uv sync
# Run tests
uv run pytest tests/
# Lint and format
uv run ruff check --fix src tests
uv run ruff format src tests
# Type check
uv run zuban check src tests
# Pre-commit hooks
pre-commit install
pre-commit run --all-files

- Jurney, R. (2024). "The Rise of Semantic Entity Resolution." Towards Data Science.
- Khattab, O. et al. (2024). "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." ICLR 2024.
- Li, Y. et al. (2021). "Ditto: A Simple and Efficient Entity Matching Framework." VLDB 2021.
- Mudgal, S. et al. (2018). "Deep Learning for Entity Matching: A Design Space Exploration." SIGMOD 2018.
- Papadakis, G. et al. (2020). "Blocking and Filtering Techniques for Entity Resolution: A Survey." ACM Computing Surveys.
Apache License 2.0. See LICENSE for details.
