MME-VLA Policy Learning and Evaluation

🚀 Join Our Community: WeChat Group | Discord

RoboMME Bench

Outline

Updates

  • [03/2026] 🚀 We release MME-VLA Suite, a family of memory-augmented vision-language-action (VLA) models based on the $\pi_{0.5}$ backbone. See our paper and leaderboard for more details and analysis.

Installation

Install with UV

Install Policy Learning Repo

GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

Set the OPENPI_DATA_HOME path in your ~/.bashrc, e.g. export OPENPI_DATA_HOME=<your_openpi_homedir>. For more details, please refer to OpenPi.

Install RoboMME Simulator

Clone the RoboMME submodule:

git submodule update --init

Then install the RoboMME environment following the documentation here. We use separate environments for VLA training/inference and the RoboMME simulator. During evaluation, we use a WebSocket connection between them, following OpenPi.

Install with Docker

After downloading the data into the data directory and setting up the runs directory in the structure shown below, update the RoboMME submodule with git submodule update --init, then build the Docker image following this.

Repository Structure

.
├── data
│   ├── robomme_h5_data                 # download robomme raw h5 files here
│   └── robomme_preprocessed_data
│   │   ├── data                        # pickle files
│   │   ├── features                    # precompute siglip token embeddings
│   │   ├── meta                        # statistics for robomme
│   │   ├── memer                       # VLM subgoal training data for MemER
│   │   └── qwenvl                      # VLM subgoal training data for QwenVL
├── examples
│   └── robomme                         # RoboMME simulator evaluation code
├── packages
│   └── openpi-client                   # VLA client & server interface
├── runs
│   ├── assets                          # save norm_stats json files
│   ├── ckpts                           # fine-tuned checkpoints
│   └── evaluation                      # evaluation results
├── scripts                             # train/eval/data_generation scripts
├── src
│   ├── mme_vla_suite                   # MME_VLA code, follows openpi structure 
│   └── openpi                          # original OpenPi code with minor changes
└── third_party

This repository is built on top of OpenPi. We highly recommend becoming familiar with OpenPi first before working with this repo.

Download

Download Training Data

Place all data under the data directory:

mkdir data && cd data

Download the raw RoboMME training files here:

git clone git@hf.co:Yinpei/robomme_data_h5 data/robomme_data_h5

(Optional) Download preprocessed RoboMME data here:

git clone git@hf.co:datasets/Yinpei/robomme_preprocessed_data data/robomme_preprocessed_data

and run uv run scripts/unzip_data.py data/robomme_preprocessed_data to unzip the files.
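The unzip step above can be sketched as follows. This is a hypothetical stand-in for what scripts/unzip_data.py is expected to do (extract every zip archive next to itself); the real script may behave differently.

```python
import zipfile
from pathlib import Path


def unzip_all(root: str) -> list[str]:
    """Extract every .zip under `root` into its parent directory.

    Illustrative sketch only; the actual scripts/unzip_data.py
    may use a different layout or delete archives after extraction.
    """
    extracted = []
    for zip_path in sorted(Path(root).rglob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(zip_path.parent)  # unpack next to the archive
            extracted.extend(zf.namelist())
    return extracted
```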

Alternatively, you can run uv run scripts/build_dataset.py to generate the preprocessed pickle files (takes about 2–3 hours) and/or the VLM subgoal predictor training data (takes about 30–60 minutes).

We also provide data in the LeRobot format here. In our experiments, however, the LeRobot dataloader significantly increased CPU memory usage during training, which can be a bottleneck in shared training environments (e.g., on HPC clusters). For this reason, we use our custom data format and dataloader in this repository.
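The custom dataset format is not documented here, but one common way to avoid the CPU-memory growth described above is to keep only file paths resident and unpickle one episode per access. The class below is a minimal hypothetical sketch of that pattern, not the repo's actual dataloader.

```python
import pickle
from pathlib import Path


class LazyPickleDataset:
    """Keep only file paths in memory; load one episode per access.

    Hypothetical sketch of a memory-friendly dataloader; the dataset
    format used in this repo may differ.
    """

    def __init__(self, data_dir: str):
        self.paths = sorted(Path(data_dir).glob("*.pkl"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int):
        # Episodes are deserialized on demand, so resident memory stays
        # proportional to the batch, not the whole dataset.
        with open(self.paths[idx], "rb") as f:
            return pickle.load(f)
```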

Download Pre-trained Models

Download the $\pi_{0.5}$-base backbone:

uv run scripts/download_pi05_base.py

Download the pi05_vision_encoder, which is a subset of the $\pi_{0.5}$ parameters used for dataset feature construction without loading the full model. Visual token embeddings are computed and cached for training, and the vision encoder remains frozen in our experiments:

cd $OPENPI_DATA_HOME
git clone git@hf.co:Yinpei/pi05_vision_encoder
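The compute-once-then-cache pattern behind the precomputed token embeddings can be sketched as below. Here `encode_fn` is a stand-in for the frozen vision encoder, and the per-image pickle cache layout is an assumption for illustration.

```python
import pickle
from pathlib import Path


def cached_embedding(image_id: str, encode_fn, cache_dir: str):
    """Return the embedding for `image_id`, running `encode_fn` at most once.

    `encode_fn` stands in for the frozen pi05 vision encoder; the cache
    layout (one pickle per image) is an illustrative assumption.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    path = cache / f"{image_id}.pkl"
    if path.exists():  # cache hit: skip the encoder entirely
        with open(path, "rb") as f:
            return pickle.load(f)
    emb = encode_fn(image_id)  # cache miss: encode once and store
    with open(path, "wb") as f:
        pickle.dump(emb, f)
    return emb
```

Because the encoder is frozen, cached embeddings never go stale, so training epochs after the first one read features from disk instead of re-running the vision tower.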

Download Fine-tuned VLA/VLM Checkpoints (Optional)

Fine-tuned models and evaluation results are stored under the runs directory. Create it if needed:

mkdir runs
mkdir runs/ckpts        # save all trained models here
mkdir runs/evaluation   # evaluation results
mkdir runs/assets       # save all normalization statistics files here

You can skip the following steps if you plan to fine-tune your own VLA/VLM models directly; see Model Training.

Download MME-VLA variants here:

git clone git@hf.co:Yinpei/mme_vla_suite runs/ckpts/mme_vla_suite

We release all checkpoints for symbolic and perceptual memory, and a subset of recurrent memory variants for research. Recurrent memory is still underperforming; we will release more recurrent variants as results improve.

Download VLM subgoal predictors here:

git clone git@hf.co:Yinpei/vlm_subgoal_predictor runs/ckpts/vlm_subgoal_predictor

Download the fine-tuned $\pi_{0.5}$ baseline here:

git clone git@hf.co:Yinpei/pi05_baseline runs/ckpts/pi05_baseline

After downloading fine-tuned checkpoints, you can run

uv run ./scripts/unzip_ckpt.py runs/ckpts

to unzip all of them.

Model Training

Data Preparation

Prepare training data by either downloading preprocessed files or running:

uv run scripts/build_robomme_dataset.py   --dataset_type robomme_pkl  --raw_data_path=<downloaded_h5_data_dir> --preprocessed_data_path=<your_target_dir>

Then compute normalization statistics (this takes about 3 minutes):

uv run scripts/compute_norm_stats.py --config-name mme_vla_suite --repo-id robomme --dataset-path="data/robomme_preprocessed_data"
uv run scripts/compute_norm_stats.py --config-name pi05_baseline --repo-id robomme --dataset-path="data/robomme_preprocessed_data"

This produces the following structure under runs:

.
├── assets
│   ├── mme_vla_suite
│   │   └── robomme
│   │       └── norm_stats.json
│   └── pi05_baseline
│       └── robomme
│           └── norm_stats.json

You can also compare against our reference norm_stats.json provided here to check whether your processing is correct. Small differences are acceptable.
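The statistics behind norm_stats.json are per-dimension means and standard deviations over the dataset's state/action vectors. A minimal sketch of that computation (the actual JSON schema and any quantile fields the repo uses are not shown here):

```python
import math


def compute_norm_stats(vectors: list[list[float]]) -> dict:
    """Per-dimension mean and (population) std over a list of vectors.

    Illustrates the statistics only; the real norm_stats.json schema
    produced by scripts/compute_norm_stats.py may contain more fields.
    """
    dims = len(vectors[0])
    n = len(vectors)
    mean = [sum(v[d] for v in vectors) / n for d in range(dims)]
    std = [
        math.sqrt(sum((v[d] - mean[d]) ** 2 for v in vectors) / n)
        for d in range(dims)
    ]
    return {"mean": mean, "std": std}
```

The resulting dictionary would then be serialized into the norm_stats.json files shown in the tree above and used to normalize states and actions at train and inference time.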

Train π₀.₅ baseline

This variant does not use history; it fine-tunes the $\pi_{0.5}$ checkpoint with the vision encoder frozen (for comparison with MME-VLA):

bash scripts/finetune_pi05_baseline.sh

You can change --exp-name to suit your own experiment naming.

Train MME-VLA policies

bash scripts/finetune_mme_vla_suite.sh

Set MME_VLA_TYPE to train a specific model variant. You can also change --exp-name to suit your own experiment naming.

Train VLM subgoal predictor

robomme_preprocessed_data already contains VLM subgoal prediction data, but you can also generate it with:

uv run scripts/build_robomme_dataset.py  --dataset_type vlm_subgoal_qwenvl  --raw_data_path=<downloaded_h5_data_dir> --preprocessed_data_path=<your_target_dir>
uv run scripts/build_robomme_dataset.py  --dataset_type vlm_subgoal_memer  --raw_data_path=<downloaded_h5_data_dir> --preprocessed_data_path=<your_target_dir>

After the data is ready, run:

micromamba activate robomme
bash scripts/finetune_vlm_subgoal_predictor.sh

Set DATASET_PATH according to which VLM you are training: (1) simple subgoals, (2) grounded subgoals, or (3) MemER-style subgoals.

Evaluation

Evaluation with the integrated script

After downloading the fine-tuned checkpoints, run:

bash scripts/eval.sh

Set the MODEL_TYPE variable to one of the following:

  1. Prior methods: pi05_baseline, MemER
  2. Symbolic MME-VLA: symbolic_simpleSG_oracle, symbolic_simpleSG_gemini, symbolic_simpleSG_qwenvl, symbolic_groundedSG_oracle, symbolic_groundedSG_gemini, symbolic_groundedSG_qwenvl
  3. Perceptual MME-VLA: perceptual-framesamp-context, perceptual-framesamp-modul, perceptual-framesamp-expert, perceptual-tokendrop-context, perceptual-tokendrop-modul, perceptual-tokendrop-expert
  4. Recurrent MME-VLA: recurrent-rmt-context, recurrent-rmt-modul, recurrent-rmt-expert, recurrent-ttt-context, recurrent-ttt-modul, recurrent-ttt-expert

Running eval.sh automatically starts two tmux windows: one for the policy server and one for RoboMME evaluation. If the evaluation is interrupted, you can rerun the script; it will automatically resume from the generated progress.json.
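The resume behavior can be sketched as follows. The file name matches the progress.json mentioned above, but the schema assumed here ({"done": [...]} listing completed episode ids) is an illustration, not the script's actual format.

```python
import json
from pathlib import Path


def remaining_episodes(all_episodes: list[str], progress_path: str) -> list[str]:
    """Return the episodes not yet marked done in progress.json.

    Assumes a hypothetical {"done": [...]} schema; the real eval.sh
    progress file may record results differently.
    """
    p = Path(progress_path)
    done = set(json.loads(p.read_text()).get("done", [])) if p.exists() else set()
    # Preserve the original evaluation order for the episodes left to run.
    return [ep for ep in all_episodes if ep not in done]
```

With logic like this, rerunning the script after an interruption simply skips everything already recorded and continues from the first unfinished episode.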

Manual evaluation (per model)

Details are provided here.

Troubleshooting

Q1: Vulkan installation fails.
A1: Please refer to the ManiSkill solution. If it still does not work, we recommend reinstalling the NVIDIA driver and Vulkan packages. We use NVIDIA driver 570.211.01 and Vulkan 1.3.275. You can also switch to CPU rendering:

os.environ['SAPIEN_RENDER_DEVICE'] = 'cpu'
os.environ['MUJOCO_GL'] = 'osmesa'

Q2: Why does the evaluation stop?
A2: We observed that, on long-horizon tasks such as VideoPlaceButton, the WebSocket connection can break due to large video frames. If the evaluation process is interrupted, you can rerun scripts/eval.sh, and the program will resume based on the generated progress.json.

Q3: CUDA runs out of memory when training VLA models.
A3: You can set the environment variable XLA_PYTHON_CLIENT_MEM_FRACTION=0.95 to allow JAX to use more GPU memory.
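Note that the variable must be visible before JAX initializes its GPU allocator; setting it after `import jax` has no effect. One way to guarantee this from inside a script:

```python
import os

# JAX preallocates 75% of GPU memory by default; raising the fraction
# to 0.95 gives training more headroom. This must run before JAX is
# imported, since the allocator reads the variable at initialization.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.95"

# import jax  # import JAX only after the variable is set
```

Equivalently, prefix the launch command in the shell, e.g. `XLA_PYTHON_CLIENT_MEM_FRACTION=0.95 bash scripts/finetune_mme_vla_suite.sh`.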

Acknowledgement

This work was supported in part by NSF SES-2128623, NSF CAREER #2337870, NSF NRI #2220876, NSF NAIRR250085, and NSF IIS-1949634. We would also like to thank the excellent OpenPi codebase from Physical-Intelligence.

Citation

@article{dai2026robomme,
  title={RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies},
  author={Dai, Yinpei and Fu, Hongze and Lee, Jayjun and Liu, Yuejiang and Zhang, Haoran and Yang, Jianing and Finn, Chelsea and Fazeli, Nima and Chai, Joyce},
  journal={arXiv preprint arXiv:2603.04639},
  year={2026}
}
