Learning Correspondence from the Cycle-Consistency of Time

CVPR 2019

Xiaolong Wang*
Allan Jabri*
Alexei A. Efros
[GitHub]
[Slides]
[Paper]





Abstract

We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as a free supervisory signal for learning visual representations from scratch. At training time, our model optimizes a spatial feature representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods. Overall, we find that the learned representation generalizes surprisingly well, despite being trained only on indoor videos and without fine-tuning.
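Because the learned features are used at test time purely through nearest-neighbor matching, propagating labels (masks, keypoints, texture) from a reference frame to a target frame reduces to a soft attention over reference locations in feature space. Below is a minimal PyTorch sketch of that test-time step. The function name, the softmax temperature, and the single-reference-frame setup are illustrative assumptions for this sketch, not the released implementation (which may, for example, restrict matches to a local window or to the top-k neighbors, or propagate from multiple past frames).

    import torch
    import torch.nn.functional as F

    def propagate_labels(feat_ref, feat_tgt, labels_ref, temperature=0.07):
        # feat_ref, feat_tgt: (C, H, W) spatial feature maps from the learned encoder
        # labels_ref:         (K, H, W) one-hot (or soft) label maps for the reference frame
        # returns:            (K, H, W) propagated label maps for the target frame
        C, H, W = feat_ref.shape
        K = labels_ref.shape[0]
        # Flatten spatial dimensions and L2-normalize each location's feature vector.
        f_ref = F.normalize(feat_ref.reshape(C, -1), dim=0)   # (C, H*W)
        f_tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)   # (C, H*W)
        # Affinity of every target location to every reference location.
        affinity = torch.softmax(f_tgt.t() @ f_ref / temperature, dim=1)  # (H*W, H*W)
        # Each target location takes a weighted average of the reference labels.
        labels_tgt = affinity @ labels_ref.reshape(K, -1).t()             # (H*W, K)
        return labels_tgt.t().reshape(K, H, W)

    # Example usage with random stand-in features (e.g. a stride-8 encoder, 2 classes):
    feat_ref, feat_tgt = torch.randn(256, 30, 30), torch.randn(256, 30, 30)
    labels_ref = F.one_hot(torch.randint(0, 2, (30, 30)), 2).permute(2, 0, 1).float()
    labels_tgt = propagate_labels(feat_ref, feat_tgt, labels_ref)  # (2, 30, 30)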



Results on Tracking Texture, Mask and Pose


Video for More Results


Paper

Xiaolong Wang*, Allan Jabri*, Alexei A. Efros.
Learning Correspondence from the Cycle-Consistency of Time.
In CVPR, 2019 (Oral Presentation).
(hosted on arXiv)

[Bibtex]

Acknowledgements

We thank members of the BAIR community for helpful discussions and feedback, and Sasha Sax and Michael Janner for comments on drafts. AJ is supported by the PD Soros Fellowship. XW is supported by the Facebook PhD Fellowship. This work was also supported, in part, by NSF grant IIS-1633310 and Berkeley DeepDrive.