Humanoid robots hold promise for versatile locomotion and manipulation in human
environments, yet learning to control their high-dimensional dynamics remains challenging.
Recent benchmarks show that state-of-the-art reinforcement learning algorithms require
millions of samples and struggle on complex tasks. We investigate FlowRL, a flow-based
off-policy actor-critic method, on locomotion and manipulation tasks from HumanoidBench.
We propose two modifications: (1) replacing FlowRL's auxiliary behavior-optimal critic
with an AWR-style exponential Q-weighting that eliminates two networks and a
hyperparameter, and (2) upgrading all network components to a SimBa-inspired residual
architecture for more stable training at greater depth. Our experiments demonstrate that
these modifications maintain or improve performance across humanoid locomotion tasks while
substantially reducing algorithmic complexity. On the challenging g1-window-v0
window-cleaning task, FlowRL achieves the highest sustained return (~70) among all
evaluated methods, while our SNQF reaches ~35 with significantly lower complexity.
None of PPO, SAC, TD-MPC2, DreamerV3, or our proposed methods achieves full task success,
motivating further work on high-dimensional whole-body manipulation.
We use HumanoidBench, a MuJoCo-based simulation environment featuring a Unitree G1 humanoid robot equipped with two dexterous Shadow Hands.
We focus on the g1-window-v0 task: a whole-body manipulation scenario where
the robot must grasp a window-wiping tool, keep its tip parallel to the window, and
track a prescribed vertical velocity of 0.5 m/s. Success requires precise whole-body
coordination to maintain tool-surface contact while simultaneously adjusting posture to
execute the wiping motion. The reward function has two components: a manipulation term
that rewards stable upright stance, hands close to the tool, and tool velocity tracking;
and a window contact term that is only active when the tool touches the glass,
incentivizing the robot to keep five distinct contact points on the wiper flush against
the window pane.
The g1-window-v0 window-cleaning task in HumanoidBench.
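For concreteness, the sketch below instantiates the task and rolls out one episode. It assumes HumanoidBench registers its tasks with Gymnasium under the IDs used in this paper; the humanoid_bench import name is our assumption.

```python
# Minimal sketch: load g1-window-v0 and roll out a random episode,
# assuming HumanoidBench registers its tasks with Gymnasium.
import gymnasium as gym
import humanoid_bench  # noqa: F401 -- assumed import name; registers the tasks

env = gym.make("g1-window-v0")
obs, info = env.reset(seed=0)

episode_return = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
print(f"episode return: {episode_return:.2f}")
```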
FlowRL parameterizes the policy as a state-conditioned velocity field, generating actions via ODE integration from Gaussian noise. It derives a constrained policy search objective that jointly maximizes Q-values while bounding the Wasserstein-2 distance to a behavior-optimal policy implicit in the replay buffer. However, FlowRL's constraint relies on two auxiliary networks — a behavior-optimal Q-network and V-network estimated via expectile regression — adding algorithmic complexity and an additional hyperparameter τ.
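To make the sampling procedure concrete, the sketch below generates an action by Euler-integrating a learned velocity field from Gaussian noise. The velocity_field signature, step count, and plain Euler scheme are illustrative assumptions, not FlowRL's exact implementation.

```python
import torch

@torch.no_grad()
def sample_action(velocity_field, state, action_dim, num_steps=10):
    """Euler-integrate a state-conditioned velocity field from noise to action.

    `velocity_field(state, a_t, t)` is assumed to return da/dt; the step
    count and integration scheme are illustrative choices.
    """
    a = torch.randn(state.shape[0], action_dim)   # a_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((state.shape[0], 1), k * dt)
        a = a + dt * velocity_field(state, a, t)  # a_{t+dt} = a_t + v * dt
    return a
```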
Motivated by AWR/AWAC, we show that FlowRL's behavior-optimal critic can be replaced by a single exponential reweighting of buffer actions using only the current Q-function. This yields the Self-Normalizing Q-weighted Flow (SNQF) objective, which reweights buffer actions by exp(Q(s,a)/α): high-Q actions are amplified, low-Q actions suppressed. SNQF requires only the standard twin Q-network; the behavior-optimal Q- and V-networks and the expectile hyperparameter τ are eliminated entirely, reducing the total network count from five to three.
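A minimal sketch of the reweighting is shown below, with self-normalization implemented as a softmax over the sampled minibatch; the temperature alpha, the per-batch normalization, and the q_net call signature are assumptions about implementation detail, and the flow-matching loss itself is elided.

```python
import torch
import torch.nn.functional as F

def snqf_weights(q_net, states, actions, alpha=1.0):
    """Exponential Q-weights for buffer actions, normalized over the batch.

    w_i ∝ exp(Q(s_i, a_i) / alpha); the softmax makes the weights
    self-normalizing, so no separate behavior-optimal critic is needed.
    """
    with torch.no_grad():
        q = q_net(states, actions).squeeze(-1)        # (B,) -- assumed signature
        w = F.softmax(q / alpha, dim=0) * q.shape[0]  # mean weight = 1
    return w

# The actor's flow-matching loss is then weighted per sample, e.g.:
#   loss = (snqf_weights(q_net, s, a) * per_sample_flow_loss).mean()
```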
We replace sequential LayerNorm–Dense blocks with pre-norm residual feedforward units
initialized near the identity: x ← x + W₂ SiLU(W₁ LN(x)), where W₂ is
initialized at scale 0.1. All three networks follow the blueprint
input → Dense → [ResBlock]×n → LN → Dense, with n=3 for the critics
and n=2 for the actor. Skip connections allow gradients to flow cleanly through deep
stacks, reducing vanishing-gradient issues in online RL with larger networks.
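A sketch of the residual unit and the network blueprint in PyTorch follows. The 0.1 output scale follows the text; the hidden width (4× the block dimension) and the zeroed output bias are our assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Pre-norm residual feedforward unit: x <- x + W2 SiLU(W1 LN(x))."""

    def __init__(self, dim, hidden=None, out_scale=0.1):
        super().__init__()
        hidden = hidden or 4 * dim  # hidden width is an assumption
        self.norm = nn.LayerNorm(dim)
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)
        # Near-identity init: shrink the output layer so the block
        # starts close to the identity map x -> x.
        with torch.no_grad():
            self.w2.weight.mul_(out_scale)
            self.w2.bias.zero_()

    def forward(self, x):
        return x + self.w2(nn.functional.silu(self.w1(self.norm(x))))

def make_network(in_dim, width, out_dim, n_blocks):
    """Blueprint: input -> Dense -> [ResBlock] x n -> LN -> Dense."""
    return nn.Sequential(
        nn.Linear(in_dim, width),
        *[ResBlock(width) for _ in range(n_blocks)],  # n=3 critics, n=2 actor
        nn.LayerNorm(width),
        nn.Linear(width, out_dim),
    )
```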
We evaluate six methods on the g1-window-v0 task: PPO, SAC, TD-MPC2,
DreamerV3, FlowRL, and our SNQF. All results report mean episode return versus
environment steps. PPO, SAC, and TD-MPC2 are evaluated over three random seeds;
DreamerV3, FlowRL, and SNQF over one seed. A random policy baseline achieves
2.87 ± 0.66 over 100 episodes.
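The random baseline can be reproduced with a short evaluation loop; the sketch below reuses the Gymnasium interface assumed in the setup above.

```python
import numpy as np

def random_baseline(env, num_episodes=100, seed=0):
    """Mean and std of episode return under uniform random actions."""
    returns = []
    for ep in range(num_episodes):
        obs, info = env.reset(seed=seed + ep)
        done, total = False, 0.0
        while not done:
            obs, r, terminated, truncated, info = env.step(env.action_space.sample())
            total += r
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns)), float(np.std(returns))
```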
Evaluation returns for g1-window-v0 with all methods.
FlowRL achieves the highest sustained return (~70), followed by SAC (peaks at 60–75 but collapses),
TD-MPC2 (~45–57), DreamerV3 (~40–50), SNQF (~35), and PPO (~19–21).
PPO exhibits slow but remarkably consistent learning throughout the entire 20M-step budget. All three seeds converge to returns of 19–21, with the spread across seeds never exceeding 2–3 return units. This reproducibility is consistent with PPO's conservative on-policy updates. However, PPO never progresses beyond basic postural stability: it never reaches the tool, never makes wiper contact, and the task remains essentially unexplored within the budget.
PPO rollout on the g1-window-v0 window-cleaning task.
SAC is far more sample-efficient than PPO, surpassing PPO's 5M-step performance in under 500K steps. Two seeds reach peak returns of 60–75 around 2–2.5M steps, 3–4× higher than PPO's final performance. However, none of the three seeds sustains these peaks: after 4.5M steps, the seeds oscillate violently or collapse almost entirely, consistent with Q-value overestimation and the difficulty of maintaining a stable upright posture while simultaneously executing the wiping motion.
SAC rollout on the g1-window-v0 window-cleaning task.
FlowRL learns quickly relative to all other baselines, rising to a smoothed return of ~50 by 1M steps and plateauing around 70 by 2M steps, the highest sustained return among all evaluated methods. The constrained policy search objective prevents the catastrophic policy collapse observed in SAC. The persistent high variance (instantaneous returns spiking to 150–200) suggests the flow policy continues to explore diverse action modes rather than collapsing to a single behavioral strategy.
FlowRL rollout on the g1-window-v0 window-cleaning task.
SNQF learns steadily from the outset, rising to a smoothed return of ~30–35 by 1M steps
and plateauing around 35 through 2M steps. This plateau is roughly half of FlowRL's on
this task, indicating that removing the behavior-optimal critic machinery carries a
performance cost on contact-dependent tasks where the Q-function converges slowly.
Importantly, on easier locomotion tasks (e.g., h1-balance-simple-v0), SNQF
matches or outperforms FlowRL despite eliminating the auxiliary behavior-optimal critic.
SNQF rollout on the g1-window-v0 window-cleaning task.
The random policy achieves a mean return of 2.87 ± 0.66 over 100 episodes with no learning trend, serving as a lower-bound reference.
Random policy rollout — mean return 2.87 ± 0.66 over 100 episodes.