If you're planning to use information provided on this site, please keep in mind that all numbers and papers are added by the authors without double-checking. We of course try to keep the results as accurate as possible, and whenever we are notified of an error it will be fixed, but this does not release you from the obligation of reading the papers and double-checking the numbers listed here before using them.


Paper : Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Author : Joao Carreira, Andrew Zisserman

Dataset URL

Description : Kinetics is a large-scale, high-quality dataset of YouTube video URLs covering a diverse range of human-focused actions. The dataset consists of approximately 300,000 video clips and covers 400 human action classes, with at least 400 video clips for each action class. Each clip lasts around 10 seconds and is labeled with a single class. All of the clips have been through multiple rounds of human annotation, and each is taken from a unique YouTube video. The actions cover a broad range of classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging.

Number of Videos : 300,000

Number of Classes : 400


Evaluation: Kinetics-val

Description: Top-1 accuracy on the validation or test set of the Kinetics dataset. Results on the validation and test sets should be comparable.
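The Top-1 metric reported in the table below can be sketched in a few lines: a prediction counts as correct only if the model's single highest-scoring class matches the clip's ground-truth label. This is a minimal illustrative sketch, not part of any official Kinetics tooling; the function name `top1_accuracy` and the example clip ids and labels are assumptions chosen for illustration.

```python
def top1_accuracy(predictions, labels):
    """Fraction of clips whose top-scoring predicted class matches the label.

    predictions: dict mapping clip id -> predicted class (the model's top choice)
    labels:      dict mapping clip id -> ground-truth class (one label per clip)
    """
    correct = sum(1 for clip, pred in predictions.items() if labels[clip] == pred)
    return correct / len(labels)

# Hypothetical three-clip example: two of three predictions are correct.
labels = {"clip_a": "hugging", "clip_b": "shaking hands", "clip_c": "playing drums"}
predictions = {"clip_a": "hugging", "clip_b": "hugging", "clip_c": "playing drums"}
print(round(top1_accuracy(predictions, labels), 3))  # 2/3 correct -> 0.667
```

A score of 71.6 in the table thus means the model's top prediction matched the ground truth for 71.6% of the evaluated clips.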


Result | Paper [Authors] | Description | URL | Peer Reviewed | Year
71.6 | Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [Joao Carreira, Andrew Zisserman] | Two-stream I3D on test set | URL | Yes | 2017
58 | Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition [Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh] | 3D ResNet-34 | URL | Yes | 2017
65.1 | Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? [Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh] | ResNeXt-101 | URL | No | 2017
79.4 | Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification [Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen] | Attention Cluster (RGB + Flow + Audio) | URL | No | 2017
72.4 | Appearance-and-Relation Networks for Video Classification [Limin Wang, Wei Li, Wen Li, Luc Van Gool] | ARTNet with TSN | URL | No | 2017
74.2 | Attend and Interact: Higher-Order Object Interactions for Video Understanding [Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf] | SINet (dot-product attention) | URL | No | 2017
77.7 | Non-local Neural Networks [Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He] | NL I3D (RGB) | URL | No | 2017
62.2 | Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification [Ali Diba, Mohsen Fayyaz, Vivek Sharma, Amir Hossein Karami, Mohammad Mahdi Arzani, Rahman Yousefzadeh, Luc Van Gool] | T3D | URL | No | 2017
74.2 | Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [Joao Carreira, Andrew Zisserman] | I3D on the test set (w. ImageNet pretraining) | URL | Yes | 2017
75.4 | A Closer Look at Spatiotemporal Convolutions for Action Recognition [Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri] | | URL | Yes | 2018
47 | What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets [De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, Juan Carlos Niebles] | | URL | Yes | 2018
71.5 | Recognize Actions by Disentangling Components of Dynamics [Yue Zhao, Yuanjun Xiong, Dahua Lin] | Disen. RGB only | URL | Yes | 2018

If you want to embed this result data in your own web page, please insert the following HTML code: