Abstract
Long-horizon robotic manipulation remains challenging for reinforcement
learning (RL) because sparse rewards provide limited guidance for credit
assignment. Practical policy improvement often relies on richer
intermediate supervision, such as dense progress rewards, which are
costly to obtain and ill-suited to non-monotonic behaviors like
backtracking and recovery. We propose Advantage Reward Modeling (ARM),
a framework that shifts from hard-to-quantify absolute progress to
estimating relative advantage. We introduce a cost-effective tri-state
labeling strategy—Progressive, Regressive, and Stagnant—that
significantly reduces human cognitive overhead while ensuring high
cross-annotator consistency. By training on these intuitive signals,
ARM enables robust, automated progress annotation for both complete
demonstrations and fragmented DAgger-style data. Integrating ARM into
an offline RL pipeline allows for adaptive action-reward reweighting,
effectively filtering suboptimal samples. Our approach achieves a
99.4% success rate on a challenging long-horizon
towel-folding task, demonstrating superior stability and data efficiency
over state-of-the-art vision-language-action (VLA) baselines with
near-zero human intervention
during policy training.
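
The reweighting idea described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the label-to-advantage mapping, the exponential weighting (in the style of advantage-weighted regression), and all names (`LABELS`, `advantage_weight`, `reweight`, `beta`) are assumptions for exposition.

```python
import math

# Hypothetical mapping from ARM's tri-state progress labels to
# relative-advantage values (names and values assumed).
LABELS = {"Progressive": 1.0, "Stagnant": 0.0, "Regressive": -1.0}

def advantage_weight(label: str, beta: float = 1.0) -> float:
    """Turn a tri-state label into a sample weight via an exponential
    advantage transform; beta controls how sharply suboptimal
    (Regressive) samples are down-weighted."""
    return math.exp(beta * LABELS[label])

def reweight(batch):
    """Attach a weight to each (observation, action, label) sample so
    that Regressive transitions contribute less to the offline-RL
    policy update and Progressive ones contribute more."""
    return [(obs, act, advantage_weight(lbl)) for obs, act, lbl in batch]

batch = [("o1", "a1", "Progressive"),
         ("o2", "a2", "Stagnant"),
         ("o3", "a3", "Regressive")]
weighted = reweight(batch)
# Progressive samples receive the largest weight, Regressive the smallest.
```

Under this sketch, filtering suboptimal samples amounts to down-weighting (or thresholding) transitions the reward model labels Regressive, without any per-step human progress annotation.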