With advances in modern technology, especially increases in network speed, video is becoming an increasingly important media type. Given its wide range of potential applications, video recognition has received great attention. However, video recognition is a non-trivial task: complex neural networks require large amounts of training data, but annotated data are hard to acquire. As a result, there is a growing tendency to rely on self-supervised learning approaches that can make use of unlabeled data.