Figure: Our self-supervised visual affordance model guides the robot to the vicinity of actionable regions in the environment with a model-based policy. Once inside this area, we switch to a local reinforcement learning policy, in which we embed our affordance model to favor the same object regions favored by people and to boost sample-efficiency.

Robots operating in human-centered environments should have the ability to understand how objects function: what can be done with each object, where this interaction may occur, and how the object is used to achieve a goal. To this end, we propose a novel approach that extracts a self-supervised visual affordance model from human teleoperated play data and leverages it to enable efficient policy learning and motion planning. We combine model-based planning with model-free deep reinforcement learning (RL) to learn grasping policies that favor the same object regions favored by people, while requiring minimal robot interactions with the environment. We find that our policies train 4x faster than the baselines and generalize better to novel objects because our visual affordance model can anticipate their affordance regions.

We evaluate our algorithm, Visual Affordance-guided Policy Optimization (VAPO), on both diverse simulated manipulation tasks and real-world robot tidy-up experiments to demonstrate the effectiveness of our affordance-guided policies.

Technical Approach

We propose a method for sample-efficient policy learning of complex manipulation tasks that is guided by a self-supervised visual affordance model. Concretely, we learn affordances that are grounded in real human behavior from teleoperated "play" data. Play data is not random, but rather structured by human knowledge of object affordances (e.g., if people see a drawer in a scene, they tend to open it). Moreover, affordances discovered from unlabeled play are functional affordances, priming a robot to approach an object the way a human would. Finally, unlike recordings of direct human demonstrations, teleoperated play data avoids the correspondence problem, since the actions are already executed in the robot's own embodiment.

Labeling Figure
Figure: Visualization of our self-supervised object affordance labeling. We leverage a self-supervised signal, the robot's gripper opening and closing during human teleoperation, to project the 3D tool-center-point into the static and gripper cameras. We label the neighboring pixels within a radius of the afforded region with a binary segmentation mask and with direction vectors pointing from each pixel toward the affordance region center.
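The labeling step above can be sketched as follows. This is a minimal illustration of the idea, not the released code: the function name and the pinhole projection setup (a 3x3 intrinsic matrix and a 4x4 world-to-camera extrinsic matrix) are assumptions.

```python
import numpy as np

def label_affordance(tcp_3d, intrinsics, extrinsics, img_shape, radius=10):
    """Create a binary affordance mask and per-pixel unit direction vectors
    pointing toward the projected tool-center-point (TCP).

    tcp_3d: 3D TCP position in world frame, recorded when the gripper closed.
    intrinsics: 3x3 camera matrix K; extrinsics: 4x4 world-to-camera transform.
    """
    # Project the 3D TCP into the image plane (pinhole model).
    cam_pt = extrinsics @ np.append(tcp_3d, 1.0)   # world -> camera frame
    uvw = intrinsics @ cam_pt[:3]                  # camera -> homogeneous pixel
    center = (uvw[:2] / uvw[2]).astype(int)        # (u, v) pixel coordinates

    h, w = img_shape
    vs, us = np.mgrid[0:h, 0:w]
    dist = np.sqrt((us - center[0]) ** 2 + (vs - center[1]) ** 2)

    # Binary segmentation label: pixels within `radius` of the afforded region.
    mask = (dist < radius).astype(np.float32)
    # Unit direction vectors from each labeled pixel toward the region center.
    dirs = np.stack([center[0] - us, center[1] - vs], axis=-1).astype(np.float32)
    norms = np.maximum(np.linalg.norm(dirs, axis=-1, keepdims=True), 1e-8)
    dirs = dirs / norms * mask[..., None]
    return mask, dirs, center
```

Because the gripper-closing signal comes for free during teleoperation, these labels require no manual annotation.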

Our approach decomposes object manipulation into a sample-efficient combination of model-based planning and model-free reinforcement learning. Concretely, we first predict object affordances and drive the end-effector from free space to the vicinity of the afforded region with a model-based method. Once inside this area, where contact-rich dynamics make the model unreliable, we switch to a reinforcement learning policy in which the agent is rewarded for interacting with the afforded regions. This way, the local policy has a "human prior" for how to approach an object, but is free to discover its exact grasping strategy. The contribution of our visual affordance model to boosting sample-efficiency is two-fold: 1) driving the model-based planner to the vicinity of afforded regions, and 2) guiding a local grasping RL policy to favor the same object regions favored by people. Standard model-free RL faces a number of challenges, since the policy must solve two problems, representation learning and task learning, from high-dimensional raw observations in a single end-to-end training procedure. As solving both problems together is difficult in practice, embedding our visual affordance model within the reinforcement learning loop alleviates the representation-learning challenge.
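The hybrid control loop described above can be sketched as follows. This is an illustrative outline under assumed interfaces (the `env`, `planner`, `rl_policy`, and `affordance_model` objects and the neighborhood radius are placeholders, not the released API):

```python
import numpy as np

NEIGHBORHOOD_RADIUS = 0.1  # meters; illustrative switching threshold

def run_episode(env, planner, rl_policy, affordance_model):
    """Drive toward a predicted affordance center with a model-based planner,
    then hand over to a local RL grasping policy inside the neighborhood."""
    obs = env.reset()
    # 1) Predict an affordance center from the static camera image.
    target = affordance_model.predict_center(obs["static_rgb"])
    # 2) Model-based phase: move the end-effector toward the target.
    while np.linalg.norm(obs["tcp_pos"] - target) > NEIGHBORHOOD_RADIUS:
        obs, _, done, _ = env.step(planner.move_to(obs["tcp_pos"], target))
        if done:
            return obs
    # 3) Local RL phase: the learned policy discovers the grasping strategy.
    done = False
    while not done:
        obs, _, done, _ = env.step(rl_policy.act(obs))
    return obs
```

The switching threshold defines the region in which the RL policy acts, which keeps its exploration problem local and therefore tractable.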

Network architecture
Figure: Overview of the full approach. The affordance model takes an image from either camera as input to predict object affordance masks and center pixel predictions (top left). The static camera affordances are used to select a position that the model-based policy moves towards (bottom left). We then switch to an RL policy which takes as input the predictions of the gripper camera affordance, the robot's proprioception, the distance to the predicted center, and the current RGB-D image (right).

The reward function should not only signal a successful object interaction, but also guide the exploration process to focus on actionable object regions. To realize this, we leverage the visual affordance model to guide the agent close to the affordance centers. Given the detected affordance center and the fact that the RL policy only acts locally within a neighborhood, we normalize the Euclidean distance between the end-effector and the affordance center to create a positive reward that increases as the agent approaches the detected center. Additionally, the agent receives a negative reward if it leaves the neighborhood and a positive reward if it successfully lifts an object.
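A minimal sketch of this reward shaping is shown below. The scale constants are illustrative placeholders; the exact coefficients used in our experiments may differ.

```python
import numpy as np

def affordance_reward(tcp_pos, center, radius, lifted, outside):
    """Shaped reward: closeness to the detected affordance center,
    a penalty for leaving the local neighborhood, a bonus for lifting.

    All constants below are illustrative, not the paper's exact values.
    """
    if lifted:
        return 200.0   # sparse bonus for a successful lift
    if outside:
        return -10.0   # penalty for leaving the RL policy's neighborhood
    # Distance normalized by the neighborhood radius yields a reward
    # in [0, 1] that increases as the end-effector nears the center.
    dist = np.linalg.norm(np.asarray(tcp_pos) - np.asarray(center))
    return 1.0 - min(dist / radius, 1.0)
```

Because the dense term is bounded by the neighborhood radius, it shapes local exploration without overwhelming the sparse lifting bonus.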

Qualitative Results

Real-world tidy-up 1

Real-world tidy-up 2

Real-world tidy-up 3

Real-world tidy-up local-SAC baseline

VAPO Generalization to novel objects



A PyTorch implementation of Visual Affordance-guided Policy Optimization (VAPO), together with the datasets used, is available in our GitHub repository for academic use and is released under the MIT license.


Affordance Learning from Play for Sample-Efficient Policy Learning
Jessica Borja, Oier Mees, Gabriel Kalweit, Lukas Hermann, Joschka Boedecker, Wolfram Burgard
IEEE International Conference on Robotics and Automation (ICRA), 2022