LimX Dynamics

ARM: Advantage Reward Modeling for Long-Horizon Manipulation


Abstract

Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement often relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. We propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy—Progressive, Regressive, and Stagnant—that significantly reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables robust, automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating superior stability and data efficiency over state-of-the-art VLA baselines with near-zero human intervention during policy training.

  • Accuracy 99.4% success rate on long-horizon towel folding
  • Efficiency ×2.5 higher annotation efficiency than subtask-style labeling

Overview

The Reward Engineering Bottleneck—
ARM resolves it.

Existing reward models for long-horizon manipulation predicate progress on strict temporal monotonicity, expose quantization ambiguity in failure states, and depend on brittle subtask partitions that miss critical intra-stage transitions such as recovery and corrective maneuvers. ARM decouples progress rewards from global temporal anchors by estimating relative advantage (interval gain) instead of absolute progress—a task-agnostic primitive that is intuitive to annotate and naturally accommodates regressive behaviors.

01

Tri-state Labeling Strategy

Progressive, Regressive, and Stagnant labels impose minimal cognitive load, achieve high cross-annotator consistency, and are natively compatible with heterogeneous and fragmented DAgger-style datasets, without requiring subtask boundaries or numeric progress values.

02

MIMO Temporal Advantage Transformer

A multimodal model integrating temporal video sequences with proprioceptive states. The MIMO architecture predicts the entire advantage sequence in a single forward pass and anchors predictions with a task-completion head for globally consistent progress reconstruction.

03

Advantage-Weighted Behavior Cloning

Length-invariant interval gains adaptively reweight action chunks, filtering suboptimal samples and prioritizing high-value recovery trajectories—extending RA-BC with adaptive scaling coefficients for fragmented DAgger data.

Comparison between MISO and MIMO architectures used for reward modeling.

Project Video

System demo and long-horizon manipulation behavior

Method

From lightweight labels to globally consistent reward trajectories

ARM consists of three linked stages: lightweight tri-state annotation, advantage reward modeling with historical context, and AW-BC policy optimization using length-invariant gains. This design removes the need for task-specific heuristics that define absolute progress.

  • Task-agnostic labeling • Causal temporal modeling • Offline RL compatible • Robust to backtracking

Overview of the proposed ARM framework and model architecture.
Step A

Lightweight Tri-state Advantage Labeling

Annotators inspect short temporal segments instead of assigning dense global progress scores. Each segment only needs a directional judgment, which makes supervision faster, more consistent, and more robust to non-monotonic behaviors such as recovery, hesitation, and backtracking.

Core Label Space
$$y \in \{-1, 0, +1\}$$

The annotation target is reduced to a tri-state directional signal rather than a dense scalar progress value.

Lightweight tri-state advantage labeling strategy.

  • +1 Progressive: the state effectively advances toward the task goal.
  • 0 Stagnant: no substantial progress is made, corresponding to waiting or idle behavior.
  • −1 Regressive: the state deviates from the goal, encounters an error, or results in failure.

Input

Short observation segments sampled from long-horizon demonstrations.

Why It Matters

It removes the need for fragile subtask boundaries and dense scalar progress annotations.
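As a minimal sketch of the label space (hypothetical helper names, not the authors' tooling), the tri-state judgment can be expressed directly in code; the `tol` threshold is an illustrative assumption that absorbs noise so near-zero changes map to Stagnant:

```python
from enum import IntEnum

class AdvantageLabel(IntEnum):
    """Tri-state directional labels for a short temporal segment."""
    REGRESSIVE = -1   # state deviates from the goal, errors, or fails
    STAGNANT = 0      # waiting or idle behavior, no substantial progress
    PROGRESSIVE = +1  # state effectively advances toward the task goal

def label_segment(progress_delta: float, tol: float = 0.02) -> AdvantageLabel:
    """Map a hypothetical scalar progress change over a segment to a tri-state label.

    In practice annotators make this judgment visually from short clips;
    the point is that only the sign of the change must be decided.
    """
    if progress_delta > tol:
        return AdvantageLabel.PROGRESSIVE
    if progress_delta < -tol:
        return AdvantageLabel.REGRESSIVE
    return AdvantageLabel.STAGNANT
```

Because annotators only decide a sign rather than a scalar, the same protocol applies unchanged to fragmented DAgger-style segments that lack global temporal context.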

Step B

Advantage Reward Modeling with Historical Context

ARM uses causal temporal context to infer interval gains from local observations and tri-state supervision. Rather than scoring each state in isolation, the model reasons about whether the system is moving forward, stalling, or regressing given what has happened before.

Modeling Choice

Historical observations help distinguish genuine progress from noisy motion, tangles, or temporary regressions.

Output

Predicted interval advantages that can be accumulated into a globally consistent progress trajectory.

Efficiency

The sequence-oriented ARM formulation supports parallel inference and avoids the heavy redundancy of sliding-window reward evaluation.
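The reconstruction idea behind the output above can be sketched in a few lines (an illustrative numpy sketch, not the authors' implementation): if each predicted interval advantage estimates the progress gained over that interval, cumulative summation recovers a globally consistent progress curve.

```python
import numpy as np

def reconstruct_progress(interval_advantages: np.ndarray, p0: float = 0.0) -> np.ndarray:
    """Accumulate predicted per-interval advantages into a global progress trajectory.

    Progress is assumed normalized to [0, 1], so the accumulated curve is
    clipped to that range; regressive intervals (negative advantages)
    naturally pull the curve back down.
    """
    progress = p0 + np.cumsum(interval_advantages)
    return np.clip(progress, 0.0, 1.0)
```

Because the whole sequence of advantages is predicted in one forward pass, this accumulation runs once per trajectory rather than once per sliding window.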

Step C

Global Reconstruction and Advantage-Weighted Behavior Cloning

The predicted interval gains are reconstructed into trajectory-level progress signals and then converted into length-invariant weights for behavior cloning. This makes the final policy focus on demonstrations and action chunks that create meaningful task advancement.

Normalized Interval Gain
$$\Delta G_t = (P_{t+H} - P_t) \cdot \frac{L_{seq}}{\bar{L}}$$

ARM rescales interval progress by episode length to remove the bias induced by heterogeneous trajectory durations.
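The rescaling can be sketched directly from the formula (a numpy sketch under the stated definitions, where $L_{seq}$ is the episode length and $\bar{L}$ the mean length over the dataset):

```python
import numpy as np

def normalized_interval_gain(progress: np.ndarray, horizon: int, mean_len: float) -> np.ndarray:
    """Length-invariant interval gain: dG_t = (P_{t+H} - P_t) * L_seq / mean_len.

    Without the L_seq / mean_len factor, a slow but successful long episode
    would yield systematically smaller per-interval gains than a short one,
    biasing the downstream weighting toward short trajectories.
    """
    L_seq = len(progress)
    t = np.arange(L_seq - horizon)
    return (progress[t + horizon] - progress[t]) * (L_seq / mean_len)
```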

Reconstruction

Interval-level rewards are accumulated into a more coherent notion of overall progress across the trajectory.

Policy Signal

Higher-value chunks receive stronger supervision, while low-quality or non-productive behavior is down-weighted.

Statistical Weighting
$$\tilde{w}_i = \operatorname{clamp}\!\left(\frac{\Delta G_i - b_{lower}}{b_{upper} - b_{lower} + \epsilon}, 0, 1\right)$$

Batch statistics clamp regressive samples toward zero while preventing a few outliers from dominating training.
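A minimal sketch of this weighting, assuming the batch bounds $b_{lower}$ and $b_{upper}$ are taken as percentiles of the batch gains (the 10th/90th choice here is illustrative, not specified by the paper):

```python
import numpy as np

def batch_weights(gains: np.ndarray, lower_pct: float = 10.0,
                  upper_pct: float = 90.0, eps: float = 1e-6) -> np.ndarray:
    """Clamp normalized gains into [0, 1] weights using batch statistics.

    Mirrors w_i = clamp((dG_i - b_lower) / (b_upper - b_lower + eps), 0, 1):
    regressive samples fall below b_lower and are clamped to zero, while
    outlier gains above b_upper saturate at one instead of dominating.
    """
    b_lower = np.percentile(gains, lower_pct)
    b_upper = np.percentile(gains, upper_pct)
    return np.clip((gains - b_lower) / (b_upper - b_lower + eps), 0.0, 1.0)
```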

AW-BC Objective
$$L_{AW-BC}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[-\tilde{w}(s,a)\log \pi_\theta(a\mid s)\right]$$

The policy is optimized by weighting action likelihood with the reconstructed advantage signal from ARM.
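The objective itself reduces to a weighted negative log-likelihood; a self-contained numpy sketch (framework-agnostic, with log-probabilities supplied by whatever policy network is used):

```python
import numpy as np

def aw_bc_loss(log_probs: np.ndarray, weights: np.ndarray) -> float:
    """Advantage-weighted behavior cloning loss.

    log_probs: log pi_theta(a | s) for each (s, a) sample in the batch.
    weights:   reconstructed advantage weights w~(s, a) in [0, 1].
    Returns the batch mean of -w * log pi, matching the AW-BC objective;
    zero-weight (regressive) samples contribute no gradient signal.
    """
    return float(np.mean(-weights * log_probs))
```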

Result

The final AW-BC stage improves offline policy learning under long-horizon manipulation with recovery and backtracking.

Results

Higher policy quality with less annotation burden

The paper evaluates ARM on a demanding long-horizon towel-folding task. Compared with prior baselines, ARM handles backtracking and corrective phases more reliably while significantly improving final policy success and data efficiency.

99.4%

Long-horizon towel-folding success rate with GR00T N1.5 + ARM (AW-BC).

32 episodes/h

Task throughput (episodes completed per hour) of the final policy under the evaluation setting reported in the paper.

3.6

Average folding precision reported for the final policy in the main experiment table.

Annotation

Faster than subtask-style supervision

ARM replaces brittle progress scoring with a label space that is easier for humans to apply consistently across long-horizon episodes.

Learning

Higher-value action chunks receive more weight

Reconstructed gains translate directly into action reweighting, helping the policy focus on demonstrations that contribute useful progress.

Reliability

Better handling of recovery and regression

Relative advantage remains informative when the robot backtracks, tangles, or corrects itself during execution.

Quantitative Evidence

The following tables consolidate the main quantitative evidence behind ARM, covering annotation efficiency, inference cost, and downstream policy performance in long-horizon manipulation.

Table 1.  Quantitative Evaluation of Reward Models. All models are evaluated on a validation set of 50 trajectories. “MSE” measures the trajectory reconstruction fidelity against Ground Truth (GT) progress (normalized to [0, 1]). The bottom section reports the Success Identification Accuracy, assessing the Completion Head’s ability to correctly classify the final state of Standard (SE, 12 successful episodes), and Failure (FE, 12 failed episodes) trajectories. Best performances are highlighted in bold.

| Metric | SARM | ARM (Ours) |
|---|---|---|
| MSE ↓ | 0.0059 | **0.0014** |
| Success Identification Accuracy (%), Standard (SE) | 83.3 (10/12) | **100.0 (12/12)** |
| Success Identification Accuracy (%), Failure (FE) | 91.7 (11/12) | **100.0 (12/12)** |

Table 2.  Quantitative Comparison of Downstream Policy Performance. We report the success rate, operational task throughput (episodes completed per hour), and folding precision (final edge alignment score; detailed annotation protocol provided in the Supplementary Material) on the long-horizon towel-folding task. Our proposed AW-BC (ARM) framework significantly outperforms both standard Behavior Cloning and prior reward-aware baselines across all metrics.

| Model | Success Rate (%) | Task Throughput (Episodes / hr) | Folding Precision (Score) |
|---|---|---|---|
| BC-Baseline (GR00T N1.5) | 62.1 | 18 | 2.2 |
| RA-BC (GR00T + SARM) | 78.5 | 24 | 2.7 |
| AW-BC (GR00T + ARM) | **99.4** | **32** | **3.6** |

Table 3.  Labeling Efficiency Comparison. We compare the sample throughput per 8-hour shift across different protocols.

| Annotation Protocol | Rate (Samples/8h) |
|---|---|
| Human Baseline (Seg.) | 100 |
| Human Tri-state (Ours) | 250 |
| VLM (Qwen3-VL) | ∼400 |
| Auto Tri-state (Ours) | >2,000 |

Human rates are measured per single annotator; VLM and Auto Tri-state rates are inference throughput on a single NVIDIA A100 GPU.

Qualitative Analysis

Qualitative comparison of ground-truth labeling and reconstructed progress behavior

Beyond the quantitative tables, the paper also compares how different supervision schemes and reconstructed progress signals behave over real long-horizon episodes. The figures below highlight differences between reference labeling, SARM-style progress estimation, and the smoother trajectory reconstructed by ARM.

GT Comparison

Ground-truth comparison across three labeling schemes

This figure compares the ground-truth progress targets under three labeling strategies, exposing how different supervision protocols distribute progress over the same long-horizon manipulation episode.

Ground-truth comparison across three progress labeling methods
Progress Comparison

Tri-state ground truth versus SARM and ARM outputs

This figure compares the tri-state ground truth with the progress signals predicted by SARM and ARM, illustrating how ARM produces a more coherent and temporally aligned reconstruction under complex transitions.

Tri-state ground truth compared with SARM and ARM reconstructed progress

"By shifting reward learning from absolute progress to relative advantage, ARM enables cost-effective annotation, robust progress reconstruction, and stronger long-horizon policy optimization."

Citation

Reference

The following BibTeX reflects the current submission venue and project page.

@inproceedings{mao2026arm,
  title     = {ARM: Advantage Reward Modeling for Long-Horizon Manipulation},
  author    = {Yiming Mao and Zixi Yu and Weixin Mao and Yinhao Li and
               Qirui Hu and Zihan Lan and Minzhao Zhu and Hua Chen},
  booktitle = {CVPR 2026 Workshop GigaBrain Challenge Submission},
  year      = {2026},
  url       = {https://aiming1998.github.io/ARM/}
}