The ability to segment unseen objects is critical for robots to autonomously perform manipulation tasks in new environments. While previous works train deep neural networks on large datasets to learn RGB/RGB-D feature embeddings for segmentation, cluttered scenes often result in inaccurate masks, especially undersegmentation. We introduce RISeg, a novel approach that improves upon static image-based segmentation masks through robot interaction and a designed body frame-invariant feature (BFIF). The key insight is that the spatial twists of frames randomly attached to the same rigid object, when transformed to a fixed reference frame, will be equal despite varying linear and angular velocities. By identifying regions of segmentation uncertainty, strategically introducing object motion through minimal interactions (2-3 per scene), and matching BFIFs, RISeg is able to significantly improve segmentation without relying on object singulation. On cluttered real-world tabletop scenes, RISeg achieves an average object segmentation accuracy of 80.7%, an increase of 28.2% over state-of-the-art static methods.
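The body frame-invariant feature rests on a standard result from screw theory: frames rigidly attached to the same object report different body twists, but mapping each twist to a common fixed frame via the adjoint of the frame's pose recovers one shared spatial twist. Below is a minimal numerical sketch of that invariance, not the paper's implementation; it assumes numpy, the (angular, linear) twist ordering, and made-up poses and velocities chosen purely for illustration.

```python
import numpy as np

def skew(p):
    """3x3 skew-symmetric matrix such that skew(p) @ q == np.cross(p, q)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def adjoint(T):
    """6x6 adjoint of a homogeneous transform T, acting on twists (omega, v)."""
    R, p = T[:3, :3], T[:3, 3]
    Ad = np.zeros((6, 6))
    Ad[:3, :3] = R
    Ad[3:, :3] = skew(p) @ R
    Ad[3:, 3:] = R
    return Ad

def pose(R, p):
    """Assemble a 4x4 homogeneous transform from rotation R and translation p."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, p
    return T

def rot_z(theta):
    return np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical motion: the object rotates about the fixed-frame z-axis at
# 0.5 rad/s while translating at 0.1 m/s along x. Its spatial twist in the
# fixed reference frame {o} is:
V_o = np.array([0.0, 0.0, 0.5, 0.1, 0.0, 0.0])  # (omega, v)

# Two frames {a} and {b} attached to the same object at arbitrary poses.
T_oa = pose(rot_z(0.3),  np.array([0.2, -0.1, 0.05]))
T_ob = pose(rot_z(-1.1), np.array([-0.3, 0.4, 0.0]))

# Each attached frame observes a different body twist (its own angular and
# linear velocity components)...
V_a = np.linalg.inv(adjoint(T_oa)) @ V_o
V_b = np.linalg.inv(adjoint(T_ob)) @ V_o

# ...but transforming both back to the fixed frame {o} yields the same
# spatial twist. This shared quantity is the invariant exploited for
# grouping regions that belong to one rigid object.
assert np.allclose(adjoint(T_oa) @ V_a, adjoint(T_ob) @ V_b)
print(adjoint(T_oa) @ V_a)  # equals V_o for both frames
```

In this illustrative setup, two image regions whose estimated twists agree after being mapped into the fixed frame would be merged into a single object mask, while disagreeing twists indicate distinct objects.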