World-Model-Based Evaluation of Robot Policies

A scalable evaluator that predicts real-world robot policy outcomes using a learned world model, eliminating the need for time-consuming manual rollouts.

Sentience Inc.

Abstract

Evaluating robot policies in the real world is expensive, slow, and often unsafe. We present a world-model-based policy evaluator that rolls out a policy inside a learned dynamics model instead of on the physical robot. This allows rapid, automated assessment of policy success and failure modes without physical rollouts. Across diverse manipulation tasks, the world-model predictions align closely, in qualitative terms, with the corresponding real-world executions, substantially reducing evaluation time while preserving fidelity.
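The closed-loop evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function evaluate_in_world_model, the RolloutResult type, and the toy one-dimensional policy, world model, and success check are all hypothetical stand-ins, chosen only to show the structure of rolling a policy out inside a dynamics model and checking the outcome.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types: a real system would use images or latent states.
State = float
Action = float


@dataclass
class RolloutResult:
    success: bool            # did the success check fire within the budget?
    steps: int               # number of simulated steps taken
    trajectory: List[State]  # predicted states, including the initial one


def evaluate_in_world_model(
    policy: Callable[[State], Action],
    world_model: Callable[[State, Action], State],
    is_success: Callable[[State], bool],
    initial_state: State,
    max_steps: int = 50,
) -> RolloutResult:
    """Roll the policy out inside the world model; no physical robot is used."""
    state = initial_state
    trajectory = [state]
    for step in range(1, max_steps + 1):
        action = policy(state)              # policy acts on the predicted state
        state = world_model(state, action)  # model predicts the next state
        trajectory.append(state)
        if is_success(state):
            return RolloutResult(True, step, trajectory)
    return RolloutResult(False, max_steps, trajectory)


# Toy stand-ins (assumptions for illustration only).
TARGET = 1.0


def toy_policy(state: State) -> Action:
    # Move toward the target, with a bounded step size.
    return max(-0.25, min(0.25, TARGET - state))


def toy_world_model(state: State, action: Action) -> State:
    # Stand-in for learned dynamics: only 80% of the commanded motion occurs.
    return state + 0.8 * action


def toy_success(state: State) -> bool:
    return abs(state - TARGET) < 0.05


result = evaluate_in_world_model(toy_policy, toy_world_model, toy_success, 0.0)
print(result.success, result.steps)
```

In this sketch the success check is a simple threshold on the predicted state. How success is actually judged from world-model rollouts is not specified here, so that part should be read as a placeholder.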

Results

Task 1
Instruction: Pick up the white tissue ball and place it in the black mug.
Videos: Real-World Rollout, World Model Rollout

Task 2
Instruction: Pick up the white tissue ball and place it in the black mug.
Videos: Real-World Rollout, World Model Rollout

Task 3
Instruction: Pick up the white tissue ball and place it in the black mug.
Videos: Real-World Rollout, World Model Rollout

Task 4
Instruction: Pick up the white tissue ball and place it in the black mug.
Videos: Real-World Rollout, World Model Rollout

Demo

The demo shows the world-model-based policy evaluator predicting real-world robot behavior by simulating policy rollouts in a learned dynamics model. Because no physical execution is required, robot policies can be evaluated efficiently and at scale.