EC-Flow

Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow

ICCV 2025

TL;DR:

We propose a method for learning robotic manipulation policies solely from action-unlabeled videos, enabling versatile manipulation over deformable objects, occluded environments, and non-object-displacement tasks.

Abstract

Current language-guided robotic manipulation systems often require low-level action-labeled datasets for imitation learning. While object-centric flow prediction methods mitigate this issue, they remain limited to scenarios involving rigid objects with clear displacement and minimal occlusion. In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation from action-unlabeled videos by predicting embodiment-centric flow. Our key insight is that incorporating the embodiment's inherent kinematics significantly enhances generalization to versatile manipulation scenarios, including deformable object handling, occlusions, and non-object-displacement tasks. To connect EC-Flow with language instructions and object interactions, we further introduce a goal-alignment module that jointly optimizes movement consistency and goal-image prediction. Moreover, translating EC-Flow to executable robot actions requires only a standard robot URDF (Unified Robot Description Format) file to specify kinematic constraints across joints, which makes it easy to use in practice. We validate EC-Flow on both simulation (Meta-World) and real-world tasks, demonstrating state-of-the-art performance in occluded object handling (62% improvement), deformable object manipulation (45% improvement), and non-object-displacement tasks (80% improvement) over prior object-centric flow methods.

Method

The proposed EC-Flow overcomes key limitations of prior object-centric work through two modules:

  • Embodiment-Centric Flow Prediction: Instead of tracking objects, we predict pixel-wise flow (i.e., future locations) for randomly sampled points of the embodiment;
  • Kinematic-Aware Action Calculation: Leveraging only the embodiment's URDF model and consecutive predicted EC-Flow, we compute executable actions without requiring action labels for supervision.

Embodiment-Centric Flow Prediction

To predict embodiment-centric flow, we create a training dataset from RGB videos and build a flow prediction model. A key challenge is ensuring that the predicted motions are consistent with both the robot's body and the task-relevant objects. To address this, we add an auxiliary goal-image prediction task that encourages the motion to obey kinematic constraints and attend to the right objects. See Figure 1 for an overview.
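Although the paper's exact data pipeline is not reproduced here, the idea of generating embodiment-centric flow supervision from action-free videos can be sketched as follows. `track_fn` stands in for any off-the-shelf point tracker, and the embodiment `mask` is assumed to come from a segmentation step; both names are illustrative placeholders, not components specified by the paper.

```python
import numpy as np

def sample_embodiment_points(mask, n_points, rng):
    """Uniformly sample pixel coordinates from a binary embodiment mask."""
    ys, xs = np.nonzero(mask)
    idx = rng.choice(len(xs), size=n_points, replace=len(xs) < n_points)
    return np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)  # (N, 2)

def make_flow_labels(track_fn, video, mask, n_points=256, horizon=8, seed=0):
    """Build (query points, future displacements) supervision from one video.

    track_fn(video, points) -> (T, N, 2) trajectories; a hypothetical
    off-the-shelf point tracker interface, assumed here for illustration.
    """
    rng = np.random.default_rng(seed)
    pts = sample_embodiment_points(mask, n_points, rng)
    traj = track_fn(video, pts)              # (T, N, 2) tracked positions
    flow = traj[1:horizon + 1] - pts[None]   # displacements relative to frame 0
    return pts, flow
```

The flow model is then trained to regress these displacements from the first frame and a language instruction.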


Figure 1: Embodiment-centric flow prediction network architecture. Branch (a): prediction of embodiment flow. Branch (b): prediction of the goal image which is used as an auxiliary task for aligning flow to object interactions and language instruction.

Kinematic-Aware Action Calculation

Robots move with articulated joints, not as one rigid piece. Our method handles this by first determining which joint each predicted point motion belongs to (see Figure 2), then computing the corresponding end-effector motion. This process is described in Figure 3.

Figure 2

Figure 2: Process of allocating sampled points to specific joints. We only keep points that belong to a single joint.
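The point-to-joint allocation can be illustrated with a small sketch. Here the per-link masks are assumed to be rendered from the URDF model and camera calibration (hypothetical inputs); points covered by more than one link mask are ambiguous and discarded, matching the single-joint rule above.

```python
import numpy as np

def allocate_points_to_joints(points, joint_masks):
    """Assign each sampled pixel to a joint; keep only unambiguous points.

    points:      (N, 2) integer pixel coordinates (x, y).
    joint_masks: (J, H, W) boolean masks, one per link (assumed rendered
                 from the URDF model and camera pose).
    Returns indices of kept points and their joint ids.
    """
    xs, ys = points[:, 0], points[:, 1]
    hits = joint_masks[:, ys, xs]        # (J, N): which link masks cover each point
    n_hits = hits.sum(axis=0)
    keep = n_hits == 1                   # drop points lying on overlapping links
    joint_ids = hits[:, keep].argmax(axis=0)
    return np.nonzero(keep)[0], joint_ids
```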

Figure 3

Figure 3: Kinematic-aware action calculation algorithm.
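One way to turn consecutive EC-Flow predictions into a rigid motion for a link is a least-squares (Kabsch) fit between the link's 3D point set before and after the predicted displacement. This is a generic sketch under stated assumptions, not the paper's exact algorithm: it assumes the 2D flow has already been lifted to 3D (e.g., with a depth map and camera intrinsics).

```python
import numpy as np

def rigid_transform(p, q):
    """Least-squares rigid transform (R, t) such that q ≈ R @ p + t.

    p, q: (N, 3) corresponding 3D points on one link before/after the
    predicted flow. Classic Kabsch/SVD solution.
    """
    cp, cq = p.mean(axis=0), q.mean(axis=0)
    H = (p - cp).T @ (q - cq)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])           # guard against reflections
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t
```

The recovered per-link transforms can then be mapped to joint commands through the kinematic chain specified in the URDF.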

Experiments

Meta-World

We evaluate our method on 9 tasks from the Meta-World benchmark using a Sawyer robot. Each task involves different object interactions, with randomized setups. We collect 5 demo videos per task and test the learned policy 25 times per task. The results are shown in Tab. 1.

Notably, EC-Flow significantly outperforms baselines on challenging tasks like btn-top-press and hammer-strike, thanks to its robustness to occlusions during manipulation.


Table 1: Simulation results on Meta-World benchmark.

Meta-World Task Demonstrations

Door Open

Door Close

Shelf Place

Button Press

Button Press Top-Down

Faucet Close

Faucet Open

Handle Press

Hammer Strike

Real-World

We evaluate 7 tasks across rigid, deformable, and non-object-displacement categories, each trained on 5 action-free videos. Each task is tested 10 times with randomized object positions. The results are shown in Tab. 2.

EC-Flow uniquely enables successful manipulation of deformable objects and tasks without direct object displacement—scenarios previous object-centric flow methods failed to handle.


Table 2: Results on real-world manipulation tasks.

Real-World Task Demonstrations

Flow Prediction

Open Fridge

Open Drawer

Open Oven

Task Execution

Open Fridge

Open Drawer

Open Oven

Flow Prediction

Fold Clothes

Fold Towel

Rotate Switch

Task Execution

Fold Clothes

Fold Towel

Rotate Switch

Citation

@misc{chen2025ecflowenablingversatilerobotic,
  title={EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow}, 
  author={Yixiang Chen and Peiyan Li and Yan Huang and Jiabing Yang and Kehan Chen and Liang Wang},
  year={2025},
  eprint={2507.06224},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}