Humanoid Whole-Body Manipulation

1Robotics Institute, 2School of Computer Science, Carnegie Mellon University
CoRL 2026, Austin TX, USA

We investigate FlowRL and propose SNQF (Self-Normalizing Q-weighted Flow), a simpler flow-based RL method, evaluating both on the g1-window-v0 whole-body manipulation task from HumanoidBench, which features the Unitree G1 humanoid with two dexterous Shadow Hands.

Abstract

Humanoid robots hold promise for versatile locomotion and manipulation in human environments, yet learning to control their high-dimensional dynamics remains challenging. Recent benchmarks show that state-of-the-art reinforcement learning algorithms require millions of samples and struggle on complex tasks. We investigate FlowRL, a flow-based off-policy actor-critic method, on locomotion and manipulation tasks from HumanoidBench. We propose two modifications: (1) replacing FlowRL's auxiliary behavior-optimal critic with an AWR-style exponential Q-weighting that eliminates two networks and a hyperparameter, and (2) upgrading all network components to a SimBa-inspired residual architecture for more stable training at greater depth. Our experiments demonstrate that these modifications maintain or improve performance across humanoid locomotion tasks while substantially reducing algorithmic complexity. On the challenging g1-window-v0 window-cleaning task, FlowRL achieves the highest sustained return (~70) among all evaluated methods, while our SNQF reaches ~35 with significantly lower complexity. None of PPO, SAC, TD-MPC2, DreamerV3, or our proposed methods achieves task success, motivating further work on high-dimensional whole-body manipulation.

Environment

We use HumanoidBench, a MuJoCo-based simulation environment featuring a Unitree G1 humanoid robot with two dexterous Shadow Hands. Key specifications:

  1. Observation space: 151D (51D body proprioception + 50D per hand)
  2. Action space: 61D position control (19D body + 21D per hand) at 50 Hz
  3. Physics: MuJoCo MJX with realistic contact dynamics, runs at ~1000 FPS

We focus on the g1-window-v0 task: a whole-body manipulation scenario where the robot must grasp a window wiping tool and keep its tip parallel to a window by following a prescribed vertical velocity of 0.5 m/s. Success requires precise whole-body coordination to maintain tool-surface contact while simultaneously adjusting posture to execute the wiping motion. The reward function has two components: a manipulation term that rewards stable upright stance, hands close to the tool, and tool velocity tracking; and a window contact term that is only active when the tool touches the glass, incentivizing the robot to keep five distinct contact points on the wiper flush against the window pane.

The g1-window-v0 window-cleaning task in HumanoidBench.

Method

FlowRL

FlowRL parameterizes the policy as a state-conditioned velocity field, generating actions via ODE integration from Gaussian noise. It derives a constrained policy search objective that jointly maximizes Q-values while bounding the Wasserstein-2 distance to a behavior-optimal policy implicit in the replay buffer. However, FlowRL's constraint relies on two auxiliary networks — a behavior-optimal Q-network and V-network estimated via expectile regression — adding algorithmic complexity and an additional hyperparameter τ.
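Action generation from a flow policy can be sketched as fixed-step Euler integration of the learned velocity field, starting from Gaussian noise. The step count, integrator, and `velocity_field` signature below are illustrative assumptions, not FlowRL's exact implementation:

```python
import numpy as np

def sample_action(velocity_field, state, action_dim, n_steps=10, rng=None):
    """Sample an action by Euler-integrating a state-conditioned velocity
    field from Gaussian noise (hypothetical helper; FlowRL's actual
    integrator and step count may differ)."""
    rng = rng or np.random.default_rng()
    a = rng.standard_normal(action_dim)          # a_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        a = a + dt * velocity_field(state, a, t)  # one Euler step along the flow
    return a
```

In practice the integrator and number of steps trade off action quality against inference cost; more sophisticated ODE solvers are a drop-in replacement for the Euler loop.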

Modification 1: SNQF — Eliminating the Behavior-Optimal Critic

Motivated by AWR/AWAC, we show that FlowRL's behavior-optimal critic can be replaced by a single exponential reweighting of buffer actions using only the current Q-function. This yields the Self-Normalizing Q-weighted Flow (SNQF) objective, which reweights buffer actions by exp(Q(s,a)/α): high-Q actions are amplified, low-Q actions suppressed. SNQF requires only the standard twin Q-network; the behavior-optimal Q^{π*_β} and V^{π*_β} networks and the expectile hyperparameter τ are eliminated entirely, reducing the total network count from five to three.
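The self-normalizing reweighting can be sketched as a batch-wise softmax over Q/α; the temperature α and per-batch normalization here are assumptions about typical AWR-style implementations, not a verbatim transcription of SNQF:

```python
import numpy as np

def snqf_weights(q_values, alpha=1.0):
    """Self-normalized exponential weights exp(Q/alpha) over a batch of
    buffer actions (sketch). Subtracting the max before exponentiating is
    the standard numerical-stability trick; the weights then sum to 1
    within the batch, i.e. a softmax over Q/alpha."""
    z = (q_values - q_values.max()) / alpha
    w = np.exp(z)
    return w / w.sum()
```

High-Q actions thus dominate the flow-matching loss, while low-Q actions receive near-zero weight; smaller α sharpens the distribution toward the batch argmax.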

Modification 2: SimBa-Inspired Residual Networks

We replace sequential LayerNorm–Dense blocks with pre-norm residual feedforward units initialized near the identity: x ← x + W₂ SiLU(W₁ LN(x)), where W₂ is initialized at scale 0.1. All three networks follow the blueprint input → Dense → [ResBlock]n → LN → Dense with n=3 for critics and n=2 for the actor. Skip connections allow gradients to flow cleanly through deep stacks, reducing vanishing-gradient issues in online RL with larger networks.
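A minimal NumPy sketch of the residual unit described above; the hidden width and initialization distributions are assumptions, and only the 0.1 output scale and pre-norm structure come from the text:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean, unit variance (no learned affine)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

class ResBlock:
    """Pre-norm residual unit x <- x + W2 SiLU(W1 LN(x)).
    W2 is drawn at scale 0.1 so the block starts near the identity
    (hidden width and init distribution are illustrative assumptions)."""
    def __init__(self, dim, hidden, rng):
        self.w1 = rng.standard_normal((dim, hidden)) / np.sqrt(dim)
        self.w2 = 0.1 * rng.standard_normal((hidden, dim)) / np.sqrt(hidden)

    def __call__(self, x):
        return x + silu(layer_norm(x) @ self.w1) @ self.w2
```

Because the residual branch is small at initialization, a stack of these blocks initially behaves close to the identity map, which is what makes deeper critics trainable.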

Experiment Results

We evaluate six methods on the g1-window-v0 task: PPO, SAC, TD-MPC2, DreamerV3, FlowRL, and our SNQF. All results report mean episode return versus environment steps. PPO, SAC, and TD-MPC2 are evaluated over three random seeds; DreamerV3, FlowRL, and SNQF over one seed. A random policy baseline achieves 2.87 ± 0.66 over 100 episodes.

Evaluation returns for g1-window-v0 with all methods. FlowRL achieves the highest sustained return (~70), followed by SAC (peaks at 60–75 but collapses), TD-MPC2 (~45–57), DreamerV3 (~40–50), SNQF (~35), and PPO (~19–21).

PPO

PPO exhibits slow but remarkably consistent learning throughout the entire 20M step budget. All three seeds converge to returns of 19–21, with inter-seed variance never exceeding 2–3 return units. This reproducibility likely reflects PPO's conservative on-policy updates. However, PPO never progresses beyond basic postural stability: it never reaches the tool, never makes wiper contact, and the task remains essentially unexplored within the 20M step budget.

PPO rollout on the g1-window-v0 window-cleaning task.

SAC

SAC is far more sample-efficient than PPO, surpassing PPO's 5M-step performance in under 500K steps. Two seeds reach peak returns of 60–75 around 2–2.5M steps, 3–4× higher than PPO's final performance. However, none of the three seeds sustain these peaks: all either oscillate sharply or collapse almost entirely after roughly 4.5M steps, a pattern consistent with Q-value overestimation and the robot's inability to maintain a stable upright posture while executing the wiping motion.

SAC rollout on the g1-window-v0 window-cleaning task.

FlowRL

FlowRL learns quickly relative to all other baselines, rising to a smoothed return of ~50 by 1M steps and plateauing around 70 by 2M steps — the highest sustained return among all evaluated methods. The constrained policy search objective prevents the catastrophic policy collapse observed in SAC. The persistent high variance (instantaneous spikes of 150–200) suggests the flow policy continues to explore diverse action modes without collapsing to a single consistent behavioral strategy.

FlowRL rollout on the g1-window-v0 window-cleaning task.

SNQF (Ours)

SNQF learns steadily from the outset, rising to a smoothed return of ~30–35 by 1M steps and plateauing around 35 through 2M steps. The plateau is roughly half that of FlowRL on this task, indicating that removing the behavior-optimal critic machinery carries a performance cost on contact-dependent tasks where the Q-function converges slowly. Importantly, on easier locomotion tasks (e.g., h1-balance-simple-v0), SNQF matches or outperforms FlowRL despite using two fewer networks.

SNQF rollout on the g1-window-v0 window-cleaning task.

Random Policy Baseline

The random policy achieves a mean return of 2.87 ± 0.66 over 100 episodes with no learning trend, serving as a lower-bound reference.

Random policy rollout — mean return 2.87 ± 0.66 over 100 episodes.