83.5 |
Multi-view super vector for action recognition[Cai, Z., Wang, L., Peng, X., Qiao, Y]
|
MVSV |
URL
|
Yes
|
2014
|
87.9 |
Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice[Peng, X., Wang, L., Wang, X., Qiao, Y]
|
|
URL
|
Yes
|
2016
|
88.3 |
A multi-level representation for action recognition[Wang, L., Qiao, Y., Tang, X]
|
|
URL
|
Yes
|
2016
|
88 |
Two-stream convolutional networks for action recognition in videos[Simonyan, K., Zisserman, A]
|
|
URL
|
Yes
|
2014
|
88.1 |
Human action recognition using factorized spatio-temporal convolutional networks[Sun, L., Jia, K., Yeung, D., Shi, B.E]
|
|
URL
|
Yes
|
2015
|
90.3 |
Action recognition with trajectory-pooled deep-convolutional descriptors[Wang, L., Qiao, Y., Tang, X]
|
|
URL
|
Yes
|
2015
|
91.7 |
Long-term temporal convolutions for action recognition[Varol, G., Laptev, I., Schmid, C]
|
|
URL
|
Yes
|
2016
|
93.1 |
A key volume mining deep framework for action recognition[Zhu, W., Hu, J., Sun, G., Cao, X., Qiao, Y]
|
|
URL
|
Yes
|
2016
|
94.2 |
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition[Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool]
|
|
URL
|
Yes
|
2016
|
88.6 |
Beyond short snippets: Deep networks for video classification[Ng, J.Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G]
|
|
URL
|
Yes
|
2015
|
85.2 |
Learning spatiotemporal features with 3D convolutional networks[Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M]
|
|
URL
|
No
|
2015
|
90.3 |
Hidden Two-Stream Convolutional Networks for Action Recognition[Yi Zhu, Zhenzhong Lan, Shawn Newsam, Alexander G. Hauptmann]
|
|
URL
|
No
|
2017
|
94.6 |
Action Representation Using Classifier Decision Boundaries[Jue Wang, Anoop Cherian, Fatih Porikli, Stephen Gould]
|
|
URL
|
No
|
2017
|
93.6 |
ActionVLAD: Learning spatio-temporal aggregation for action classification[Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell]
|
|
URL
|
Yes
|
2017
|
94.6 |
Spatiotemporal Pyramid Network for Video Action Recognition[Yunbo Wang, Mingsheng Long, Jianmin Wang, Philip S. Yu]
|
Spatiotemporal Pyramid Network / BN-Inception |
URL
|
Yes
|
2017
|
94.9 |
Spatiotemporal Multiplier Networks for Video Action Recognition[Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes]
|
Spatiotemporal Multiplier Networks + IDT |
URL
|
Yes
|
2017
|
92.3 |
Generalized Rank Pooling for Activity Recognition[Anoop Cherian, Basura Fernando, Mehrtash Harandi, Stephen Gould]
|
Generalized Rank Pooling + IDT-FV |
URL
|
Yes
|
2017
|
76.3 |
Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition[An-An Liu, Yu-Ting Su, Wei-Zhi Nie, Mohan Kankanhalli]
|
HC-MTL with STIP + BOW |
URL
|
Yes
|
2017
|
93.4 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset[Joao Carreira, Andrew Zisserman]
|
Two-Stream I3D, ImageNet pre-training |
URL
|
Yes
|
2017
|
98 |
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset[Joao Carreira, Andrew Zisserman]
|
Two-Stream I3D, Kinetics pre-training |
URL
|
Yes
|
2017
|
89.8 |
Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor[Aaron Chadha, Alhabib Abbas and Yiannis Andreopoulos]
|
Codec Based |
URL
|
No
|
2017
|
94.5 |
Learning Gating ConvNet for Two-Stream based Methods in Action Recognition[Jiagang Zhu, Wei Zou, Zheng Zhu]
|
Gated TSN |
URL
|
No
|
2017
|
95.4 |
Learning Long-Term Dependencies for Action Recognition With a Biologically-Inspired Deep Network[Yemin Shi, Yonghong Tian, Yaowei Wang, Wei Zeng, Tiejun Huang]
|
shuttleNet |
URL
|
Yes
|
2017
|
95.8 |
Eigen Evolution Pooling for Human Action Recognition[Yang Wang, Vinh Tran, Minh Hoai]
|
Eigen TSN + DTD |
URL
|
No
|
2017
|
93.6 |
Lattice Long Short-Term Memory for Human Action Recognition[Lin Sun, Kui Jia, Kevin Chen, Dit Yan Yeung, Bertram E. Shi, Silvio Savarese]
|
Lattice LSTM |
URL
|
Yes
|
2017
|
91.1 |
Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection[Mohammadreza Zolfaghari, Gabriel L. Oliveira, Nima Sedaghat, Thomas Brox]
|
Chained Multi-stream Networks |
URL
|
Yes
|
2017
|
98 |
End-to-end Video-level Representation Learning for Action Recognition[Jiagang Zhu, Wei Zou, Zheng Zhu, Lin Li]
|
DTPP (Kinetics pre-training) |
URL
|
No
|
2017
|
94.3 |
Action Recognition with Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion[Weiyao Lin, Yang Mi, Jianxin Wu, Ke Lu, Hongkai Xiong]
|
CO2FI + ASYN |
URL
|
No
|
2017
|
95.2 |
Action Recognition with Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion[Weiyao Lin, Yang Mi, Jianxin Wu, Ke Lu, Hongkai Xiong]
|
CO2FI + ASYN + IDT |
URL
|
No
|
2017
|
94.5 |
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?[Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh]
|
ResNeXt-101 (64f) |
URL
|
No
|
2017
|
94.6 |
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification[Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen]
|
Attention Cluster RGB+Flow |
URL
|
No
|
2017
|
94.3 |
Appearance-and-Relation Networks for Video Classification[Limin Wang, Wei Li, Wen Li, Luc Van Gool]
|
ARTNet with TSN (Pre-train dataset Kinetics) |
URL
|
No
|
2017
|
93.2 |
Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification[Ali Diba, Mohsen Fayyaz, Vivek Sharma, Amir Hossein Karami, Mohammad Mahdi Arzani, Rahman Yousefzadeh, Luc Van Gool]
|
T3D+TSN (three splits) |
URL
|
No
|
2017
|
91.6 |
Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification[Xiaodong Yang, Pavlo Molchanov, Jan Kautz]
|
|
URL
|
Yes
|
2016
|
94.9 |
Compressed Video Action Recognition[Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, Philipp Kraehenbuehl]
|
CoViAR + optical flow |
URL
|
No
|
2017
|
94.3 |
Making Convolutional Networks Recurrent for Visual Sequence Learning[Xiaodong Yang, Pavlo Molchanov, Jan Kautz]
|
|
URL
|
Yes
|
2018
|
97.3 |
A Closer Look at Spatiotemporal Convolutions for Action Recognition[Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri]
|
|
URL
|
Yes
|
2018
|
79 |
What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets[De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, Juan Carlos Niebles]
|
|
URL
|
Yes
|
2018
|
93.8 |
Activity Recognition based on a Magnitude-Orientation Stream Network[Caetano, C., de Melo, V. H. C., dos Santos, J. A., Schwartz, W. R.]
|
According to the results, using only our Magnitude-Orientation Stream (MOS), we outperform many methods. Compared with C3D, we outperform it by 5.3 p.p. using our temporal stream and by 8.6 p.p. when combining it with Very Deep Two-Stream. This indicates that our magnitude-orientation approach learns temporal information better than approaches that perform 3D convolution operations directly. It is worth mentioning that we also improved the results achieved by the original two-stream by |
URL
|
Yes
|
2017
|
98.2 |
PoTion: Pose MoTion Representation for Action Recognition[Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, Cordelia Schmid]
|
I3D + PoTion |
URL
|
Yes
|
2018
|
88.9 |
MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition[Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, Wenjun Zeng]
|
MiCT-Net |
URL
|
Yes
|
2018
|
94.7 |
MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition[Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, Wenjun Zeng]
|
Two-stream MiCT-Net |
URL
|
Yes
|
2018
|
96 |
Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition[Shuyang Sun, Zhanghui Kuang, Wanli Ouyang, Lu Sheng, Wei Zhang]
|
RGB + OFF(RGB) + OFF(optical flow) + OFF(raw-OFF) |
URL
|
Yes
|
2018
|
95.9 |
Recognize Actions by Disentangling Components of Dynamics[Yue Zhao, Yuanjun Xiong, Dahua Lin]
|
Disen w. pretrained ImageNet+Kinetics |
URL
|
Yes
|
2018
|
66.1 |
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning[Chuang Gan, Boqing Gong, Kun Liu, Hao Su, Leonidas J. Guibas]
|
GG-CNN ImageNet pretraining |
URL
|
Yes
|
2018
|
94.3 |
Making Convolutional Networks Recurrent for Visual Sequence Learning[Xiaodong Yang, Pavlo Molchanov, Jan Kautz]
|
PreRNN-SIH |
URL
|
Yes
|
2018
|
95.4 |
End-to-End Learning of Motion Representation for Video Understanding[Lijie Fan, Wenbing Huang, Chuang Gan, Stefano Ermon, Boqing Gong, Junzhou Huang]
|
TVNets + IDT |
URL
|
Yes
|
2018
|
87.8 |
Learning and Using the Arrow of Time[Donglai Wei, Joseph Lim, Andrew Zisserman, William T. Freeman]
|
AoT (flow only) |
URL
|
Yes
|
2018
|
94.2 |
Procedural Generation of Videos to Train Deep Action Recognition Networks[César Roberto de Souza, Adrien Gaidon, Yohann Cabon, Antonio Manuel López Peña]
|
Leveraging our synthetic dataset and multi-task models, we increase the performance from 93.6 to 94.2 |
URL
|
No
|
2017
|
95.1 |
Unsupervised Universal Attribute Modelling for Action Recognition[Debaditya Roy, K. Sri Rama Murty, C. Krishna Mohan]
|
|
URL
|
No
|
2018
|
94.5 |
IF-TTN: Information Fused Temporal Transformation Network for Video Action Recognition[Ke Yang, Peng Qiao, Dongsheng Li, Yong Dou]
|
MV-IF-TTN |
URL
|
No
|
2019
|
96.2 |
IF-TTN: Information Fused Temporal Transformation Network for Video Action Recognition[Ke Yang, Peng Qiao, Dongsheng Li, Yong Dou]
|
Full IF-TTN |
URL
|
No
|
2019
|
97.7 |
Holistic Large Scale Video Understanding[Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jurgen Gall, Rainer Stiefelhagen, Luc Van Gool]
|
HATNet (32 frames) |
URL
|
No
|
2019
|
97.1 |
Hidden Two-Stream Convolutional Networks for Action Recognition[Yi Zhu, Zhenzhong Lan, Shawn Newsam, Alexander Hauptmann]
|
Hidden Two-stream (I3D) |
URL
|
Yes
|
2018
|
96 |
Spatial-Temporal Pyramid Based Convolutional Neural Network for Action Recognition[Zhenxing Zheng, Gaoyun An, Dapeng Wu, Qiuqi Ruan]
|
S-TPNet + iDT |
URL
|
No
|
2019
|
91.9 |
Moments in Time Dataset: one million videos for event understanding[Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, Aude Oliva]
|
ResNet50 I3D pretrained on Moments and Kinetics |
URL
|
Yes
|
2019
|
96.5 |
Spatio-Temporal Channel Correlation Networks for Action Classification[Ali Diba, Mohsen Fayyaz, Vivek Sharma, M Mahdi Arzani, Rahman Yousefzadeh, Juergen Gall, Luc Van Gool]
|
STC-ResNext 101 (64 frames) RGB Only |
URL
|
Yes
|
2018
|
93.2 |
Temporal 3D ConvNets Using Temporal Transition Layer[Ali Diba, Mohsen Fayyaz, Vivek Sharma, A Hossein Karami, M Mahdi Arzani, Rahman Yousefzadeh, Luc Van Gool]
|
|
URL
|
No
|
2018
|
97.6 |
MARS: Motion-Augmented RGB Stream for Action Recognition[Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, Cordelia Schmid]
|
input = RGB frames (Pretrained on Kinetics) |
URL
|
Yes
|
2019
|
98.2 |
Global and Local Knowledge-Aware Attention Network for Action Recognition[Zhenxing Zheng, Gaoyun An, Dapeng Wu, Qiuqi Ruan]
|
global and local attention + I3D |
URL
|
No
|
2019
|
96 |
Multi-Fiber Networks for Video Recognition[Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng]
|
|
URL
|
No
|
2018
|