Space-Time Correspondence as a Contrastive Random Walk

Allan A. Jabri
Andrew Owens
Alexei A. Efros

NeurIPS 2020


This paper proposes a simple self-supervised approach for learning representations for visual correspondence from raw video. We cast correspondence as link prediction in a space-time graph constructed from a video. In this graph, the nodes are patches sampled from each frame, and nodes adjacent in time can share a directed edge. We learn a node embedding in which pairwise similarity defines transition probabilities of a random walk. Prediction of long-range correspondence is efficiently computed as a walk along this graph. The embedding learns to guide the walk by placing high probability along paths of correspondence. Targets are formed without supervision, by cycle-consistency: we train the embedding to maximize the likelihood of returning to the initial node when walking along a graph constructed from a 'palindrome' of frames. We demonstrate that the approach allows for learning representations from large unlabeled video. Despite its simplicity, the method outperforms the self-supervised state-of-the-art on a variety of label propagation tasks involving objects, semantic parts, and pose. Moreover, we show that self-supervised adaptation at test-time and edge dropout improve transfer for object-level correspondence.
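The walk and its cycle-consistency objective can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the softmax temperature of 0.07, and the use of precomputed patch embeddings (one `(N, D)` array per frame) are assumptions made for illustration, and the encoder that would produce those embeddings is omitted.

```python
import numpy as np


def transition_matrix(src, dst, temperature=0.07):
    """Row-stochastic transition probabilities from nodes in `src` to nodes in `dst`.

    src, dst: (N, D) arrays of patch embeddings for two adjacent frames.
    The temperature value is an assumption for this sketch.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    dst = dst / np.linalg.norm(dst, axis=1, keepdims=True)
    logits = src @ dst.T / temperature
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def palindrome_walk(frames, temperature=0.07):
    """Chain transitions along frames 0..T and back to 0.

    frames: list of (N, D) arrays. Returns an (N, N) matrix whose entry
    (i, j) is the probability that a walker starting at node i of the
    first frame ends at node j of the first frame.
    """
    # Build the palindrome, e.g. [f0, f1, f2] -> [f0, f1, f2, f1, f0].
    sequence = frames + frames[-2::-1]
    walk = np.eye(frames[0].shape[0])
    for a, b in zip(sequence[:-1], sequence[1:]):
        walk = walk @ transition_matrix(a, b, temperature)
    return walk


def cycle_loss(frames, temperature=0.07):
    """Negative log-likelihood of each walker returning to its start node."""
    walk = palindrome_walk(frames, temperature)
    return -np.mean(np.log(np.clip(np.diag(walk), 1e-12, None)))
```

In training, this loss would be minimized with respect to the parameters of the encoder that produces the embeddings. The sketch only evaluates the loss for fixed embeddings.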





Allan Jabri, Andrew Owens, Alexei A. Efros.
Space-Time Correspondence as a Contrastive Random Walk.
NeurIPS 2020, Oral Presentation.



We thank Amir Zamir, Ashish Kumar, Tim Brooks, Bill Peebles, Dave Epstein, Armand Joulin, and Jitendra Malik for very helpful feedback. We are also grateful to the wonderful members of VGG for hosting us during a dreamy semester at Oxford. This work would not have been possible without the hospitality of Port Meadow and the swimming pool on Iffley Road. Research was supported, in part, by NSF grant IIS-1633310, the DARPA MCS program, and NSF IIS-1522904. AJ is supported by the PD Soros Fellowship.