Long-horizon, contact-rich manipulation is inherently partially observable, because a single visual observation rarely captures a robot's full action context, including prior attempts, interactions, and progress. Consequently, standard visuomotor policies and vision-language-action models tend to struggle on such tasks because they lack memory. To address this, we introduce the Compressed Action Memory Policy (CAMP), based on the insight that a robot's own action history is a highly informative, self-supervised signal from which the policy can learn a robust, compact history representation. In our approach, we train a memory module to maintain a compressed representation of past actions, forcing it to encode a latent behavioral memory of all the robot's past interactions that can then be used to better contextualize future actions. This allows our approach to implicitly track generalized task progress and learn from failed attempts without any additional supervision or external oversight. We evaluate CAMP across four real-robot setups and two novel simulation benchmarks: Memory-T-Bench and Memory-Manip-Bench. CAMP achieves substantial gains over state-of-the-art baselines and is, to our knowledge, the first policy to demonstrate substantial success on contact-rich, partially observable manipulation tasks purely through learned memory.
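To make the idea of a compressed action memory concrete, here is a minimal toy sketch of folding an arbitrarily long action history into a fixed-size latent. It is an illustration only and not CAMP's architecture or training objective: the class `ActionMemory`, the helper `encode_history`, the tanh recurrence, and the random untrained weights are all assumptions introduced for this example.

```python
import math
import random


class ActionMemory:
    """Toy recurrent compressor that folds past actions into a fixed-size latent.

    Hypothetical stand-in for a learned memory module: the weights are random
    and untrained, and the update rule is an assumption made for illustration.
    """

    def __init__(self, action_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        # Random projection matrices for the previous latent and the new action.
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(latent_dim)]
                    for _ in range(latent_dim)]
        self.w_a = [[rng.uniform(-0.5, 0.5) for _ in range(action_dim)]
                    for _ in range(latent_dim)]
        self.latent = [0.0] * latent_dim

    def update(self, action):
        """Fold one executed action into the latent: h <- tanh(W_h h + W_a a)."""
        new_latent = []
        for i in range(self.latent_dim):
            s = sum(w * h for w, h in zip(self.w_h[i], self.latent))
            s += sum(w * a for w, a in zip(self.w_a[i], action))
            new_latent.append(math.tanh(s))
        self.latent = new_latent
        return self.latent


def encode_history(actions, action_dim=2, latent_dim=4):
    """Compress a whole action history into a latent of fixed length."""
    memory = ActionMemory(action_dim, latent_dim)
    for action in actions:
        memory.update(action)
    return memory.latent
```

The latent keeps the same size whether the history holds one action or fifty, and two histories containing the same actions in a different order map to different latents. That order sensitivity is what would let a downstream policy tell apart situations that look identical in the current frame.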
Memory-T-Bench comprises four PushT-derived tasks that share the same contact-rich dynamics but cannot be solved from a single frame. Three are track-task-progress tasks -- the policy must remember what it has already accomplished -- and one is a learn-from-failure task, where it must remember what did not work.
Push the T-block into three goal regions in any order -- without ever pushing it into the same region twice.
Success: all three goals reached, each exactly once.
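The success criterion above can be sketched as a small predicate over the sequence of goal regions the T-block has entered. The region labels `"A"`, `"B"`, `"C"` and both function names are hypothetical and not taken from the benchmark code.

```python
from collections import Counter


def remaining_goals(visited, goals=("A", "B", "C")):
    """Goals not yet reached: progress that a single frame does not show."""
    return [g for g in goals if g not in visited]


def multi_goal_success(visited, goals=("A", "B", "C")):
    """True only if every goal region was entered exactly once, in any order."""
    counts = Counter(visited)
    return all(counts[g] == 1 for g in goals)
```

For example, `multi_goal_success(["C", "A", "B"])` holds, while a sequence that revisits a region, such as `["A", "B", "A", "C"]`, fails even though all three goals were reached.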
Two T-blocks must exchange positions, but a mid-episode frame no longer reveals which block started where.
Success: each block ends in the other's starting position.
The same swap as Swap-Direct, but the starting assignment is re-randomized every episode.
Success: the two blocks exchange positions despite the shuffled start.
Three tracks lead to the goal, but two are high-friction and block the T. Reach the goal through the open track -- without retrying a track that already failed.
Success: the T reaches the goal via the low-friction track, with no repeated failed attempt.
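As a rough illustration of what learn-from-failure demands, the sketch below shows a hand-written reference behavior that remembers which tracks have already blocked the T and never retries them. It is not the learned policy; the track indices and the functions `choose_track` and `run_episode` are assumptions made for this example.

```python
def choose_track(failed_tracks, tracks=(0, 1, 2)):
    """Pick the first track not yet known to be blocked, or None if none is left."""
    for track in tracks:
        if track not in failed_tracks:
            return track
    return None


def run_episode(open_track, tracks=(0, 1, 2)):
    """Try tracks until the open one is found, remembering every failure."""
    failed = set()      # memory of tracks that already blocked the T
    attempts = []       # order in which tracks were tried
    while True:
        track = choose_track(failed, tracks)
        if track is None:
            return attempts, False
        attempts.append(track)
        if track == open_track:
            return attempts, True
        failed.add(track)
```

With the open track at index 2, `run_episode(2)` tries tracks 0, 1, and 2 in turn and succeeds without repeating an attempt. A policy without the `failed` set sees the same scene after each failure and has no basis for choosing a different track.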
Memory-Manip-Bench extends the study to seven partially observable 3D manipulation tasks spanning a range of contact richness. Six are learn-from-failure tasks -- the policy must remember which options already failed -- and one (Swap-Block) requires tracking task progress.
Swap two blocks via a buffer spot while recalling their initial positions.
Success: each block ends in the other's initial position.
Probe buttons to infer a hidden one-to-one button--bulb mapping, then light the bulb with the red base.
Success: the lightbulb is activated using the inferred mapping.
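The probing logic this task calls for can be sketched as follows, assuming each button press lights exactly one bulb and the mapping is one-to-one. The callback `press`, the button and bulb labels, and both function names are hypothetical; the benchmark's actual interface may differ.

```python
def infer_mapping(press, buttons, bulbs):
    """Probe all but the last button, then infer the last one by elimination."""
    mapping = {}
    unassigned = set(bulbs)
    for button in buttons[:-1]:
        bulb = press(button)          # observe which bulb this button lights
        mapping[button] = bulb
        unassigned.discard(bulb)
    # One-to-one assumption: the last button controls the only bulb left.
    if len(unassigned) == 1:
        mapping[buttons[-1]] = unassigned.pop()
    return mapping


def button_for_bulb(mapping, target_bulb):
    """Return the button that lights the target bulb, or None if unknown."""
    for button, bulb in mapping.items():
        if bulb == target_bulb:
            return button
    return None
```

Here `mapping` plays the role of the memory: once the probes are over, nothing in the scene records which button lit which bulb, so the outcome of each press has to be carried forward to select the right button.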
Lift each look-alike cover once, moving on the moment nothing is found beneath it.
Success: the hidden target is uncovered without re-checking a cover.
Attempt a stack to discover which of the two blocks has a stackable bottom.
Success: the stackable block is identified and stacked.
Probe three holes until finding the one that accepts the peg at full depth.
Success: the peg is fully inserted into the accepting hole.
Open closed drawers one at a time, skipping any already found empty, until the soda is located.
Success: the soda is found without re-opening an empty drawer.
Only one of the three doors is unlocked; try the doors not yet opened, skipping any already found locked, until one opens.
Success: the unlocked door is opened without re-trying a locked door.
We compare CAMP (Ours) against ACT, DP, π0.5, and MemoryVLA across four real-robot manipulation tasks.
Swap two cans via a buffer spot while recalling their initial positions.
Success: each can ends in the other's initial position.
Wipe a random plate with a random brush and return it, then wipe the other plate with the other brush -- a choice the current frame doesn't reveal.
Success: both plates are wiped, each with a different brush.
Push the T-block into three goal regions in any order, without repeating one.
Success: all three goals reached, each exactly once.
Probe three holes until finding the one that accepts the peg at full depth.
Success: the peg is fully inserted into the accepting hole.
For more details and insights, please refer to the paper.
We introduced CAMP, a memory-augmented visuomotor policy for long-horizon, contact-rich manipulation under partial observability that turns the robot's own action history into a scalable, self-supervised learning signal. CAMP demonstrates consistent gains across Memory-T-Bench, Memory-Manip-Bench, and multiple challenging real-robot tasks.
Limitations remain. Tasks requiring extremely long-horizon memory are still challenging for our method. Additionally, extending CAMP to dynamic and dexterous manipulation is a natural direction that we leave for future work.