SERF is an open-source framework for semantic entity resolution — identifying when two or more records refer to the same real-world entity using large language models, sentence embeddings, and agentic AI.
SERF runs multiple rounds of entity resolution until the dataset converges to a stable state. DSPy ReAct agents dynamically orchestrate the entire pipeline: adjusting blocking parameters, selecting matching strategies, and deciding when convergence has been reached.
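The iterate-until-convergence idea can be sketched as a fixed-point loop. This is a toy illustration, not SERF's API: the blocking and matching helpers below are trivial stand-ins (first-letter grouping and case-insensitive dedup) for the embedding and LLM stages.

```python
# Toy sketch of iterating block -> match -> merge until the record count
# stops shrinking. Helper functions are illustrative stand-ins, not SERF's.

def block(entities):
    # Toy blocking: group records by the first letter of their name.
    groups = {}
    for e in entities:
        groups.setdefault(e[0].lower(), []).append(e)
    return list(groups.values())

def match_and_merge(blocks):
    # Toy matching: keep one record per case-insensitive name within a block.
    merged = []
    for b in blocks:
        seen = {}
        for e in b:
            seen.setdefault(e.lower(), e)
        merged.extend(seen.values())
    return merged

def resolve_until_stable(entities, max_iterations=5):
    """Repeat block -> match -> merge until a fixed point is reached."""
    for iteration in range(1, max_iterations + 1):
        merged = match_and_merge(block(entities))
        if len(merged) == len(entities):  # nothing merged: converged
            return merged, iteration
        entities = merged
    return entities, max_iterations

records = ["Acme Corp", "acme corp", "ACME Corp", "Beta LLC"]
resolved, rounds = resolve_until_stable(records)
# -> (["Acme Corp", "Beta LLC"], 2)
```

In SERF the stopping decision is made by the agent rather than a simple record-count check, but the overall loop shape is the same.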
Clusters records using multilingual-e5-base sentence embeddings and FAISS IVF to create efficient blocks for comparison, auto-scaling block size across iterations.
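The blocking idea — embed records, then group them by nearest cluster centroid — can be shown without heavy dependencies. SERF uses multilingual-e5-base embeddings and FAISS `IndexIVFFlat` (which trains its centroids with k-means); this sketch substitutes random vectors and a plain NumPy nearest-centroid assignment to illustrate the mechanics only.

```python
import numpy as np

# Toy illustration of embedding-based blocking. Real SERF blocking uses
# multilingual-e5-base embeddings + FAISS IndexIVFFlat; random vectors and
# brute-force distances stand in here so the example is dependency-light.

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 8)).astype("float32")

# Pick k centroids (FAISS IVF learns these via k-means over the corpus).
k = 4
centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]

# Assign each record to its nearest centroid; each centroid's members
# form one candidate block for the LLM matching stage.
dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)
blocks = {i: np.flatnonzero(assignments == i) for i in range(k)}
```

With an IVF index, block size is tuned via the number of centroids (`nlist`), which is what auto-scaling across iterations adjusts.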
All three operations run in a single LLM prompt via DSPy signatures, with the BAMLAdapter handling structured output formatting. Block-level matching lets the LLM see all records in a block simultaneously and make holistic decisions.
For knowledge graphs: deduplicates edges that result from merging nodes, using LLM-guided intelligent merging.
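The mechanical core of edge resolution is easy to picture: once two nodes merge, edges that previously pointed at different node IDs can become identical and must be collapsed. SERF does this with LLM-guided merging; the sketch below (hypothetical helper, not SERF's API) shows only the remap-and-dedupe step.

```python
# Toy illustration of edge deduplication after node merges. SERF uses
# LLM-guided merging; here duplicates are collapsed purely mechanically.

def remap_and_dedupe(edges, node_mapping):
    """Rewrite edge endpoints through the merge mapping, then drop duplicates."""
    seen = set()
    result = []
    for src, rel, dst in edges:
        key = (node_mapping.get(src, src), rel, node_mapping.get(dst, dst))
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

# "acme_inc" was merged into "acme", so its edge now collides and collapses.
edges = [("acme", "located_in", "nyc"), ("acme_inc", "located_in", "nyc")]
deduped = remap_and_dedupe(edges, {"acme_inc": "acme"})
# -> [("acme", "located_in", "nyc")]
```

The LLM-guided part comes in when colliding edges carry conflicting properties and a simple set-union is not enough.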
| Component | Technology |
|---|---|
| Package Manager | uv |
| Data Processing | PySpark 4.x |
| LLM Framework | DSPy 3.x with BAMLAdapter |
| Embeddings | multilingual-e5-base via sentence-transformers |
| Vector Search | FAISS IndexIVFFlat |
| Linting/Formatting | Ruff |
| Type Checking | zuban (mypy-compatible) |
git clone https://github.com/Graphlet-AI/serf.git
cd serf
uv sync --extra dev

# Build
docker compose build
# Run any serf command
docker compose run serf benchmark --dataset dblp-acm
# Run benchmarks
docker compose --profile benchmark up
# Run tests
docker compose --profile test up
# Analyze a dataset (put your file in data/)
docker compose run serf analyze --input data/input.csv --output data/er_config.yml

Set your API key in a .env file or export it:
echo "GEMINI_API_KEY=your-key" > .env

- Python 3.12+
- Java 11/17/21 (for PySpark)
- 4GB+ RAM recommended
# Profile a dataset
serf analyze --input data/companies.parquet
# Run the full ER pipeline
serf resolve --input data/entities.csv --output data/resolved/ --iteration 1
# Run individual phases
serf block --input data/entities.csv --output data/blocks/ --method semantic
serf match --input data/blocks/ --output data/matches/ --iteration 1
serf eval --input data/matches/
# Benchmark against standard datasets
serf download --dataset dblp-acm
serf benchmark --dataset dblp-acm --output data/results/

from serf.block.pipeline import SemanticBlockingPipeline
from serf.match.matcher import EntityMatcher
from serf.eval.metrics import evaluate_resolution
# Block
pipeline = SemanticBlockingPipeline(target_block_size=50)
blocks, metrics = pipeline.run(entities)
# Match
matcher = EntityMatcher(model="gemini/gemini-2.0-flash")
resolutions = await matcher.resolve_blocks(blocks)  # run inside an async context
# Evaluate
metrics = evaluate_resolution(predicted_pairs, ground_truth_pairs)

import dspy
from serf.dspy.signatures import BlockMatch
from serf.dspy.baml_adapter import BAMLAdapter
lm = dspy.LM("gemini/gemini-2.0-flash", api_key=GEMINI_API_KEY)
dspy.configure(lm=lm, adapter=BAMLAdapter())
matcher = dspy.ChainOfThought(BlockMatch)
result = matcher(block_records=block_json, schema_info=schema, few_shot_examples=examples)

Performance on standard ER benchmarks from the Leipzig Database Group. Blocking uses multilingual-e5-base name-only embeddings + FAISS IVF. Matching uses Gemini 2.0 Flash via DSPy BlockMatch.
| Dataset | Domain | Left | Right | Matches | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| DBLP-ACM | Bibliographic | 2,616 | 2,294 | 2,224 | 0.8849 | 0.5809 | 0.7014 |
Blocking uses name-only embeddings for tighter semantic clusters. All matching decisions are made by the LLM — no embedding similarity thresholds.
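The precision/recall/F1 figures above are standard pairwise ER metrics: precision and recall over the set of predicted matched pairs versus the ground-truth pairs. A minimal computation (presumably similar in spirit to `evaluate_resolution` in `serf.eval.metrics`, though this is a sketch, not SERF's code) looks like:

```python
def pairwise_metrics(predicted_pairs, ground_truth_pairs):
    """Pairwise precision/recall/F1 over sets of matched record-id pairs."""
    # Normalize pair order so (a, b) and (b, a) compare equal.
    pred = {tuple(sorted(p)) for p in predicted_pairs}
    truth = {tuple(sorted(p)) for p in ground_truth_pairs}
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = pairwise_metrics({("a", "b"), ("c", "d")}, {("b", "a"), ("e", "f")})
# tp = 1 -> precision 0.5, recall 0.5, f1 0.5
```

As a sanity check on the table: 2 x 0.8849 x 0.5809 / (0.8849 + 0.5809) ≈ 0.7014, matching the reported F1.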
src/serf/
├── cli/ # Click CLI commands
├── dspy/ # DSPy types, signatures, agents, adapter
├── block/ # Semantic blocking (embeddings, FAISS, normalization)
├── match/ # UUID mapping, LLM matching, few-shot examples
├── merge/ # Field-level entity merging
├── edge/ # Edge resolution for knowledge graphs
├── eval/ # Metrics, benchmark datasets
├── analyze/ # Dataset profiling, field detection
├── spark/ # PySpark schemas, utils, Iceberg, graph components
├── config.py # Configuration management
└── logs.py # Logging
All configuration is centralized in config.yml:
from serf.config import config
model = config.get("models.llm") # "gemini/gemini-2.0-flash"
block_size = config.get("er.blocking.target_block_size")  # 50

# Install dependencies
uv sync
# Run tests
uv run pytest tests/
# Lint and format
uv run ruff check --fix src tests
uv run ruff format src tests
# Type check
uv run zuban check src tests
# Pre-commit hooks
pre-commit install
pre-commit run --all-files

- Jurney, R. (2024). "The Rise of Semantic Entity Resolution." Towards Data Science.
- Khattab, O. et al. (2024). "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." ICLR 2024.
- Li, Y. et al. (2021). "Ditto: A Simple and Efficient Entity Matching Framework." VLDB 2021.
- Mudgal, S. et al. (2018). "Deep Learning for Entity Matching: A Design Space Exploration." SIGMOD 2018.
- Papadakis, G. et al. (2020). "Blocking and Filtering Techniques for Entity Resolution: A Survey." ACM Computing Surveys.
Apache License 2.0. See LICENSE for details.
