This repository is the official implementation of the paper:
KeySG: Hierarchical Keyframe-Based 3D Scene Graphs
Abdelrhman Werby, Dennis Rotondi, Fabio Scaparro, and Kai O. Arras.
arXiv preprint arXiv:2510.01049, 2025
(Accepted for IEEE International Conference on Robotics and Automation (ICRA), Vienna, Austria, 2026.)
```bash
# Clone
git clone https://github.com/keysg-lab/KeySG.git
cd keysg

# Conda environment
conda env create -f environment.yaml
conda activate keysg

# Install KeySG as a package (editable)
pip install -e .
```

Download the following checkpoints and place them in `checkpoints/`:
```bash
mkdir -p checkpoints

# SAM 2.1 Large
wget -P checkpoints https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt

# RAM++
wget -P checkpoints https://huggingface.co/xinyu1205/recognize-anything-plus-model/resolve/main/ram_plus_swin_large_14m.pth
```

Create a `.env` file at the repo root:
```bash
OPENAI_API_KEY=sk-...  # Required for VLM descriptions and RAG queries
```

KeySG supports ScanNet, Replica, and HM3DSem. All three require posed RGB-D sequences as input. We follow the same preparation procedure as HOV-SG; please refer to their repository for full download and pre-processing instructions.
Download from the official ScanNet website and extract .sens files using the SensReader tool. Each scene directory should contain RGB frames, depth frames, and camera poses.
Download the scanned RGB-D trajectories from the Nice-SLAM project (not the original Replica dataset). The directory should contain results/ with frame*.jpg / depth*.png and a traj.txt pose file.
Download hm3d-val-habitat-v0.2.tar, hm3d-val-semantic-annots-v0.2.tar, and hm3d-val-semantic-configs-v0.2.tar from Matterport. Then generate posed RGB-D sequences with the habitat-sim renderer; see HOV-SG's gen_hm3dsem_walks_from_poses.py.
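For orientation, the posed RGB-D inputs above can be consumed with a few lines of NumPy. This is a minimal sketch, assuming the Nice-SLAM-style `traj.txt` stores one flattened 4x4 camera-to-world matrix per line and that raw depth values are in millimeters (`depth_scale: 1000.0`, as in the default config); the intrinsics below are placeholders, not real sensor values.

```python
import numpy as np

def parse_pose_line(line: str) -> np.ndarray:
    """One traj.txt line (16 floats) -> 4x4 camera-to-world matrix (assumed layout)."""
    vals = np.array(line.split(), dtype=np.float64)
    return vals.reshape(4, 4)

def backproject(u, v, raw_depth, fx, fy, cx, cy, depth_scale=1000.0):
    """Pixel (u, v) with raw depth -> 3D point in the camera frame, in meters."""
    z = raw_depth / depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Identity pose and placeholder pinhole intrinsics for illustration
pose = parse_pose_line("1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1")
pt_cam = backproject(320, 240, 2000.0, fx=600.0, fy=600.0, cx=319.5, cy=239.5)
pt_world = (pose @ np.append(pt_cam, 1.0))[:3]  # camera frame -> world frame
```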
```bash
# ScanNet scene
keysg-build dataset.kind=scannet dataset.root_dir=/data/ScanNet/scans/scene0011_00

# Replica scene
keysg-build dataset.kind=replica dataset.root_dir=/data/Replica/room0

# HM3DSem scene
keysg-build dataset.kind=hm3dsem dataset.root_dir=/data/HM3DSem/val/00824-Dd4bFSTQ8gi
```

Outputs land in `output/keysg_rag1/{Dataset}/{Scene}/` by default (configurable in `config/main_pipeline.yaml`).
You can also run the pipeline directly:

```bash
python main_pipeline.py dataset.kind=scannet dataset.root_dir=/data/ScanNet/scans/scene0011_00
```

Launch the interactive visualizer:

```bash
keysg-vis --scene_dir output/keysg_rag1/ScanNet/scene0011_00
# Open http://localhost:8080
```

The visualizer shows:
- Floor / room / object point clouds: per-instance colors, toggleable layers
- Camera frustums at each keyframe's world-space pose with RGB thumbnails
- Object Grounding panel: type a natural-language query and the matching object highlights in 3D
- Open-Ended Q&A panel: ask anything about the scene; the LLM answers with cited reasoning
Custom port:

```bash
keysg-vis --scene_dir <path> --port 8090
```

```python
from hovfun.graph import KeySGGraph

# Load scene graph (RAG index built from cache if available)
graph = KeySGGraph.from_output_dir("output/keysg_rag1/ScanNet/scene0011_00")

# --- Object grounding ---
result = graph.query("red chair near the window")
print(result.target_object.label, result.bbox_3d, result.confidence)

# --- Open-ended Q&A ---
answer = graph.answer_question("What appliances are in this kitchen?")
print(answer["answer"])
print(answer["reasoning"])
print(answer["relevant_object_ids"])

# --- Browse the hierarchy ---
for floor in graph.floors:
    print(f"Floor {floor.id}: {floor.summary}")
    for room in floor.rooms:
        print(f"  Room {room.id}: {len(room.objects)} objects, {len(room.keyframes)} keyframes")
```

All settings live in `config/main_pipeline.yaml`. Any key can be overridden from the CLI using Hydra syntax (`key=value`).
```yaml
dataset:
  kind: scannet              # scannet | hm3dsem | replica
  root_dir: /path/to/scene
  depth_scale: 1000.0
  depth_min: 0.3
  depth_max: 4.0

output_dir: output/keysg_rag1

load:
  scene_segmentation: false  # true = skip re-segmentation
  scene_description: false   # true = skip re-description
  nodes: false               # true = skip re-extraction

vlm:
  provider: openai           # openai | ollama
  model: gpt-5-mini

segmentation:
  fuse_every_k: 10
  voxel_size: 0.05

nodes:
  segmentor: gsam2           # segmentation backend
  object_tags: vlm           # vlm | ram
  use_keyframes_only: true   # run detection only on sampled keyframes
  skip_frames: 15

build_rag: false             # build RAG index at end of pipeline
```
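To make the `voxel_size` knob concrete, here is a toy illustration (not KeySG's actual fusion code) of what a 0.05 m voxel grid does during point fusion: points that fall in the same voxel collapse into one.

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float = 0.05) -> np.ndarray:
    """Keep one representative point per occupied voxel (illustrative sketch)."""
    keys = np.floor(points / voxel_size).astype(np.int64)  # voxel index per point
    _, idx = np.unique(keys, axis=0, return_index=True)    # first point per voxel
    return points[np.sort(idx)]

pts = np.array([[0.00, 0.00, 0.00],
                [0.01, 0.01, 0.00],    # same 5 cm voxel as the first point
                [0.10, 0.00, 0.00]])   # a different voxel
reduced = voxel_downsample(pts)        # two points survive
```

A smaller `voxel_size` preserves more geometric detail at the cost of memory and fusion time.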
```bash
# Skip re-segmentation and re-description, re-run node extraction
keysg-build \
    dataset.root_dir=/data/ScanNet/scans/scene0011_00 \
    load.scene_segmentation=true \
    load.scene_description=true \
    load.nodes=false
```

```text
output/{run_name}/{Dataset}/{Scene}/
├── config.yaml                    # Copy of run config
├── floor_summaries.json           # Floor-level text summaries
├── keysg_graph.json               # Scene graph metadata
├── scene_description_index.json   # Room description index
├── hovfun.log                     # Run log
├── rag_cache/                     # Cached embeddings & FAISS indices
│   ├── graph_chunks_meta.json
│   ├── graph_embeddings.npy
│   ├── graph_faiss.index
│   ├── graph_frame_visual_*.{npy,index}
│   └── graph_object_visual_*.{npy,index}
└── segmentation/
    └── floor_{id}/
        └── room_{fid}_{rid}/
            ├── {rid}.pkl              # Room geometry
            ├── {rid}.pcd              # Room point cloud
            ├── room_{rid}_vlm.json    # VLM descriptions + keyframe data
            ├── keyframe_poses.json    # Camera poses for visualizer
            ├── keyframes/             # Saved keyframe RGB images
            ├── nodes/                 # Extracted object nodes (*.pkl)
            └── labeled_keyframes/     # Keyframes annotated with object IDs
```
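The `rag_cache/` directory holds the chunk embeddings and FAISS indices that back grounding and Q&A queries. As a toy sketch of the retrieval step they accelerate, the snippet below ranks graph chunks by cosine similarity to a query embedding; plain NumPy stands in here for the FAISS index, and the 2-D vectors are made up for illustration.

```python
import numpy as np

def top_k_chunks(query: np.ndarray, chunks: np.ndarray, k: int = 2) -> np.ndarray:
    """Indices of the k chunks most cosine-similar to the query embedding."""
    q = query / np.linalg.norm(query)
    c = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity per chunk
    return np.argsort(-scores)[:k]

chunks = np.array([[1.0, 0.0],     # e.g. a room-description chunk
                   [0.0, 1.0],
                   [0.7, 0.7]])
query = np.array([1.0, 0.1])
print(top_k_chunks(query, chunks))  # chunk 0 ranks first
```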
In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM's context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLMs to extract scene information, alleviating the need to explicitly model relationship edges between objects and enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical multi-modal retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across three distinct benchmarks (3D object semantic segmentation, functional element segmentation, and complex query retrieval), KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.
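The abstract's "keyframes selected to optimize geometric and visual coverage" can be sketched as a greedy maximum-coverage selection: repeatedly pick the frame that sees the most not-yet-covered scene points. The frame-to-points visibility sets below are invented for illustration, and KeySG's actual selection criterion may differ.

```python
def greedy_keyframes(visibility: dict, budget: int) -> list:
    """Greedy max-coverage: pick frames that add the most newly covered points."""
    covered = set()
    chosen = []
    for _ in range(budget):
        frame = max(visibility, key=lambda f: len(visibility[f] - covered))
        gain = visibility[frame] - covered
        if not gain:
            break  # every remaining frame adds nothing new
        chosen.append(frame)
        covered |= gain
    return chosen

# Toy visibility: frame id -> set of visible scene-point ids
vis = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6}}
print(greedy_keyframes(vis, budget=2))  # [0, 2]
```

Frame 1 is skipped because frames 0 and 2 together already cover its points.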
```bibtex
@article{werby2025keysg,
  title={KeySG: Hierarchical Keyframe-Based 3D Scene Graphs},
  author={Werby, Abdelrhman and Rotondi, Dennis and Scaparro, Fabio and Arras, Kai O.},
  journal={arXiv preprint arXiv:2510.01049},
  year={2025}
}
```