Learning Correspondence from the Cycle-Consistency of Time

CVPR 2019

Xiaolong Wang*
Allan Jabri*
Alexei A. Efros


We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as a free supervisory signal for learning visual representations from scratch. At training time, our model optimizes a spatial feature representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods. Overall, we find that the learned representation generalizes surprisingly well, despite being trained only on indoor videos and without fine-tuning.
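The test-time procedure described above (finding nearest neighbors across space and time) can be sketched as a label-propagation step: given learned per-pixel features for a labeled reference frame and an unlabeled target frame, each target location copies the label of its best-matching reference location. This is a minimal illustrative sketch, not the paper's released code; the function name, shapes, and use of a single nearest neighbor are assumptions.

```python
import numpy as np

def propagate_labels(feat_ref, feat_tgt, labels_ref):
    """Propagate per-pixel labels from a reference frame to a target frame
    by nearest-neighbor matching in the learned feature space.

    feat_ref, feat_tgt : (H*W, C) arrays of L2-normalized features,
                         one row per spatial location (shapes assumed).
    labels_ref         : (H*W,) integer labels for the reference frame.
    Returns            : (H*W,) propagated labels for the target frame.
    """
    # Cosine similarity between every target and reference location
    # (rows are already unit-normalized, so a dot product suffices).
    affinity = feat_tgt @ feat_ref.T          # (H*W, H*W)
    # Each target location takes the label of its most similar
    # reference location (top-1 nearest neighbor, an assumption here).
    nearest = affinity.argmax(axis=1)
    return labels_ref[nearest]
```

In practice one might average over the top-k matches or restrict matching to a local spatial window, but the core inference step is this single feature-space lookup.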

Results on Tracking Texture, Mask, and Pose




Video for More Results


Xiaolong Wang*, Allan Jabri*, Alexei A. Efros.
Learning Correspondence from the Cycle-Consistency of Time.
In CVPR, 2019 (Oral Presentation).
(hosted on arXiv)



We thank members of the BAIR community for helpful discussions and feedback, and Sasha Sax and Michael Janner for comments on drafts. AJ is supported by the PD Soros Fellowship. XW is supported by the Facebook PhD Fellowship. This work was also supported, in part, by NSF grant IIS-1633310 and Berkeley DeepDrive.