If you're planning to use information provided on this site, please keep in mind that all numbers and papers are added by the authors without double-checking. We of course try to keep the results as accurate as possible, and whenever we are notified of an error it will be fixed, but this does not release you from the obligation of reading the papers and double-checking the numbers listed here before using them.


Paper : Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Author : Joao Carreira, Andrew Zisserman

Dataset URL

Description : Kinetics is a large-scale, high-quality dataset of YouTube video URLs covering a diverse range of human-focused actions. The dataset consists of approximately 300,000 video clips and covers 400 human action classes, with at least 400 video clips for each action class. Each clip lasts around 10 seconds and is labeled with a single class. All of the clips have been through multiple rounds of human annotation, and each is taken from a unique YouTube video. The actions cover a broad range of classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging.

Number of Videos : 300,000

Number of Classes : 400


Evaluation: Kinetics-val

Description: Top-1 accuracy on the validation or test set of the Kinetics dataset. Results on the validation and test sets should be comparable.
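The Top-1 metric reported in the table below can be sketched in a few lines: a prediction counts as correct only if the model's single highest-scoring class matches the clip's ground-truth label. This is a minimal illustrative sketch, not part of any official Kinetics tooling; the function name `top1_accuracy` and the example clip ids and labels are assumptions chosen for illustration.

```python
def top1_accuracy(predictions, labels):
    """Fraction of clips whose top-scoring predicted class matches the label.

    predictions: dict mapping clip id -> predicted class (the model's top choice)
    labels:      dict mapping clip id -> ground-truth class (one label per clip)
    """
    correct = sum(1 for clip, pred in predictions.items() if labels[clip] == pred)
    return correct / len(labels)

# Hypothetical three-clip example: two of three predictions are correct.
labels = {"clip_a": "hugging", "clip_b": "shaking hands", "clip_c": "playing drums"}
predictions = {"clip_a": "hugging", "clip_b": "hugging", "clip_c": "playing drums"}
print(round(top1_accuracy(predictions, labels), 3))  # 2/3 correct -> 0.667
```

A score of 71.6 in the table thus means the model's top prediction matched the ground truth for 71.6% of the evaluated clips.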


Result | Paper [Authors] | Description | URL | Peer Reviewed | Year
71.6 | Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [Joao Carreira, Andrew Zisserman] | Two-stream I3D on test set | URL | Yes | 2017
58 | Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition [Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh] | 3D ResNet-34 | URL | Yes | 2017
65.1 | Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? [Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh] | ResNeXt-101 | URL | No | 2017
79.4 | Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification [Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen] | Attention Cluster (RGB + Flow + Audio) | URL | No | 2017
72.4 | Appearance-and-Relation Networks for Video Classification [Limin Wang, Wei Li, Wen Li, Luc Van Gool] | ARTNet with TSN | URL | No | 2017
74.2 | Attend and Interact: Higher-Order Object Interactions for Video Understanding [Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf] | SINet (dot-product attention) | URL | No | 2017
77.7 | Non-local Neural Networks [Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He] | NL I3D (RGB) | URL | No | 2017
62.2 | Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification [Ali Diba, Mohsen Fayyaz, Vivek Sharma, Amir Hossein Karami, Mohammad Mahdi Arzani, Rahman Yousefzadeh, Luc Van Gool] | T3D | URL | No | 2017
74.2 | Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset [Joao Carreira, Andrew Zisserman] | I3D on the test set (w. ImageNet pretraining) | URL | Yes | 2017
75.4 | A Closer Look at Spatiotemporal Convolutions for Action Recognition [Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri] | | URL | Yes | 2018
47 | What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets [De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, Juan Carlos Niebles] | | URL | Yes | 2018
71.5 | Recognize Actions by Disentangling Components of Dynamics [Yue Zhao, Yuanjun Xiong, Dahua Lin] | Disen. RGB only | URL | Yes | 2018

If you want to embed this result data in your own web page, please insert the following HTML code: