Keywords: video understanding · video representations · video classification · action recognition · untrimmed video · trimmed video

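Knowing what is happening in a video: this covers video classification and action recognition / behavior recognition / behavior analysis;
that is, assigning an action type to a sequence that has already been segmented in time;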

1 Surveys

  1. Digital Video Processing
    1995 paper
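    The book Digital Video Processing covers:
    Denoising and restoration. Motion estimation: image formation; motion models;
    Motion estimation: differential, matching, optimization, and transform-domain methods; 3D motion and shape estimation;
    Tracking:
    Segmentation: color and motion segmentation, change detection, shot boundary detection, video matting;
    Video filtering: video quality; image filtering: gradient estimation, edge detection, scaling, multi-resolution representations, enhancement; multi-frame filtering, motion-compensated filtering, multi-frame standards conversion, multi-frame noise filtering;
    Compression: lossless compression, JPEG, wavelets and JPEG2000, video compression;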

  2. Delving Deeper into Convolutional Networks for Learning Video Representations
    ICLR 2016 2015-11-19 paper

  3. The THUMOS Challenge on Action Recognition for Videos
    2016-04-21 paper

  4. Going Deeper into Action Recognition: A Survey
    2016-05-16 paper

  5. Deep Learning for Video Classification and Captioning
    2016-09-22 paper

  6. A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences
    2018-01-08 paper

  7. A Survey of Video Based Action Recognition in Sports
    2018-09 paper

  8. A study on deep learning spatiotemporal models and feature extraction techniques for video understanding
    2020-01-24

2 Theory

  1. On the Integration of Optical Flow and Action Recognition
    CVPR 2018 2017-12-22 paper | blog | blog
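    Examines why optical flow is useful in two-stream methods;
    • The authors argue that optical flow in two-stream networks helps not because it represents motion, but because it represents appearance invariance;
    • Training (fine-tuning) the optical flow with the action-recognition classification error gives better action recognition than training it with the EPE (end-point error);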
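For reference, the EPE mentioned above is commonly defined as the average Euclidean distance between predicted and ground-truth flow vectors. The following is a minimal sketch under that definition; the function name `endpoint_error` and the list-of-tuples flow layout are illustrative assumptions, not taken from the paper's code.

```python
import math


def endpoint_error(pred, truth):
    """Average end-point error between two flow fields.

    Each field is a flat list of (u, v) displacement vectors, one per pixel.
    This layout is an assumption made for the sketch.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("flow fields must be non-empty and the same size")
    total = 0.0
    for (pu, pv), (tu, tv) in zip(pred, truth):
        # Euclidean distance between the predicted and true flow vector.
        total += math.hypot(pu - tu, pv - tv)
    return total / len(pred)


# Toy example with three pixels.
pred_flow = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
true_flow = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
epe = endpoint_error(pred_flow, true_flow)
```

A low EPE measures flow accuracy only; the paper's point is that it is not necessarily the best training signal when the flow feeds an action classifier.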

3 Classics

  1. StNet: Local and Global Spatial-Temporal Modeling for Action Recognition
    AAAI 2019 2018-11-05 Baidu MIT paper | pytorch | pytorch-full | paddlepaddle | 机器之心 | explainer

  2. Tiny Video Networks
    2019-10-15 paper
    $\bullet \bullet \bullet$
    Processing time per video clip: 37 ms on CPU, 10 ms on GPU;

4 General

  1. A Variable Size Block Matching Based Descriptor for Human Action Recognition
    2015-11 paper

  2. Deep Temporal Linear Encoding Networks
    CVPR 2017 2016-11-21 paper | zxcvbnm2333
    TLE:

  3. Non-local Neural Networks
    CVPR 2018 2017-11-21 paper | caffe2 | zhihu

  4. ECO: Efficient Convolutional Network for Online Video Understanding
    ECCV 2018 2018-04-24 paper | caffe | pytorch | 林天威

  5. $A^2$-Nets: Double Attention Networks
    NIPS 2018 2018-10-27 paper | Kivee123 | zhihu

  6. Temporal Shift Module for Efficient Video Understanding
    2018-11-20 paper | pytorch-official | pytorch | 琪瑞

  7. Convolutional sparse coding for capturing high speed video content
    2018-06-13 paper

  8. Learning Video Representations from Correspondence Proposals
    CVPR 2019 (Oral) 2019-05-20 paper
    CPNet:

  9. Lightweight Network Architecture for Real-Time Action Recognition
    2019-05-21 paper
    $\bullet \bullet$

  10. DynCNN: An Effective Dynamic Architecture on Convolutional Neural Network for Surveillance Videos
    ICLR 2019 2019 paper | openreview

  11. I Have Seen Enough: A Teacher Student Network for Video Classification Using Fewer Frames
    CVPR 2018 (BIVU) 2018-05-12 paper

  12. TSM: Temporal Shift Module for Efficient Video Understanding
    ICCV 2019 2018-11-20 paper | pytorch-official

5 Basic Networks

5.1 Traditional Methods

  1. Dense Trajectories and Motion Boundary Descriptors for Action Recognition
    2013-01-25 paper
    DT

  2. Action recognition with improved trajectories
    ICCV 2013 2013-10-16 paper
    Traditional method; the dense trajectories (DT) algorithm and its improved version, iDT;

5.2 Single-Frame CNN

5.3 Extended CNN

5.4 Two-Stream Methods

  1. Two-Stream Convolutional Networks for Action Recognition in Videos
    NIPS 2014 2014-06-09 VGG group paper | pytorch | 陈泰红
    TSC

  2. Convolutional Two-Stream Network Fusion for Video Action Recognition
    CVPR 2016 2016-04-22 paper | pytorch | 牛牛存

  3. Real-time Action Recognition with Enhanced Motion Vector CNNs
    CVPR 2016 2016-04-26 paper | caffe | caffe-test | AUTO1993
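    To gain speed, motion vectors replace optical flow, which directly lowers accuracy; transfer learning is then used to carry the knowledge of a trained optical-flow network over to the motion-vector network, which improves accuracy markedly; the speed reaches 390 fps, 27 times that of the optical-flow approach;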

  4. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
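    ECCV 2016 2016-08-02 CUHK · Xiaoou Tang paper | caffe | pytorch | pytorch | AI之路
    TSN, one of the benchmarks among two-stream networks; to address the long-term problem, the authors use multiple two-stream networks, each capturing short-term information at a different temporal position, and then fuse the results;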
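The sampling-and-fusion idea in the TSN note can be sketched as follows: split the video into equal segments, draw one snippet from each, and fuse the per-snippet class scores. This is a simplified illustration, not the official implementation; the function names are hypothetical, and averaging is only one possible fusion (consensus) function.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of num_segments equal-length segments."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Integer arithmetic keeps segment boundaries exact.
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments
        indices.append(rng.randrange(start, end))
    return indices


def average_consensus(snippet_scores):
    """Fuse per-snippet class scores by averaging them class by class."""
    n = len(snippet_scores)
    return [sum(column) / n for column in zip(*snippet_scores)]


# One snippet from each third of a 30-frame video.
indices = sample_segment_indices(30, 3, random.Random(0))
# Two snippets, two classes: the fused score is the class-wise mean.
fused = average_consensus([[0.2, 0.8], [0.6, 0.4]])
```

Because every segment contributes exactly one snippet, the cost stays fixed regardless of video length while the samples still span the whole video.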

  5. Spatiotemporal Residual Networks for Video Action Recognition
    NIPS 2016 2016-11-07 paper | matlab

  6. Deep Local Video Feature for Action Recognition
    CVPR 2017 2017-01-28 paper
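    TSN improvement 1: in the fusion step, different snippets should carry different weights, which the network learns; an SVM then produces the final classification;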

  7. Hidden Two-Stream Convolutional Networks for Action Recognition

  8. Temporal Segment Networks for Action Recognition in Videos
    2017-05-08 paper | caffe | pytorch

  9. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
    CVPR 2017 2017-05-22 DeepMind paper
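    I3D: based on the Inception-V1 model, it inflates 2D convolutions into 3D convolutions and combines the two-stream approach with C3D; accuracy takes a big jump, reaching 80%; 64 GPUs were used;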
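The 2D-to-3D inflation behind I3D can be illustrated with a toy kernel: the 2D weights are repeated along the time axis and divided by the temporal depth, so a video made of one repeated frame produces the same response as the original 2D filter on that frame. The sketch below uses plain nested lists and hypothetical helper names; it is not the DeepMind code.

```python
def inflate_2d_kernel(kernel_2d, time_depth):
    """Repeat a 2D kernel along time and rescale by 1 / time_depth."""
    return [
        [[w / time_depth for w in row] for row in kernel_2d]
        for _ in range(time_depth)
    ]


def response(kernel, patch):
    """Sum of element-wise products for nested-list kernels of any depth."""
    if isinstance(kernel, list):
        return sum(response(k, p) for k, p in zip(kernel, patch))
    return kernel * patch


kernel_2d = [[1.0, 0.0], [-1.0, 2.0]]
frame = [[3.0, 1.0], [2.0, 4.0]]

# Response of the original 2D filter on a single frame.
response_2d = response(kernel_2d, frame)

# A "static video": the same frame repeated three times.
kernel_3d = inflate_2d_kernel(kernel_2d, 3)
static_video = [frame, frame, frame]
response_3d = response(kernel_3d, static_video)
```

The rescaling is what lets ImageNet-pretrained 2D weights serve as a sensible starting point for the 3D network.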

  10. Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor
    ICIP 2017 2017-10-14 paper | tensorflow-official
    Extremely fast; uses motion vectors instead of optical flow, and selective decoding instead of full decoding;

  11. End-to-end Video-level Representation Learning for Action Recognition
    ICPR 2018 2017-11-11 paper | caffe | 山水之间2018
    DTPP: to capture information at different lengths, pooling is applied over both space and time;
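Temporal pyramid pooling of this kind can be sketched generically: frame-level features are pooled over progressively finer temporal bins and the results are concatenated, giving a fixed-length video descriptor for any number of frames. The pyramid levels, max pooling, and function name below are assumptions for illustration and may differ from the paper's exact configuration.

```python
def temporal_pyramid_pool(frame_features, levels=(1, 2, 4)):
    """Max-pool frame features over each temporal bin and concatenate.

    frame_features: list of equal-length feature vectors, one per frame.
    levels: number of temporal bins at each pyramid level (assumed values).
    """
    num_frames = len(frame_features)
    if num_frames == 0:
        raise ValueError("need at least one frame")
    pooled = []
    for bins in levels:
        for b in range(bins):
            start = b * num_frames // bins
            # Guarantee a non-empty window even for very short videos.
            end = max(start + 1, (b + 1) * num_frames // bins)
            window = frame_features[start:end]
            pooled.extend(max(column) for column in zip(*window))
    return pooled


features = [[1, 0], [2, 5], [0, 3], [4, 1]]  # 4 frames, 2-dim features
descriptor = temporal_pyramid_pool(features, levels=(1, 2))

# A longer video yields a descriptor of the same length.
longer = temporal_pyramid_pool(features + features, levels=(1, 2))
```

The coarse level summarizes the whole video, while finer levels keep some ordering information about when each feature peaked.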

  12. Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification
    2017-11-22 paper | pytorch
    T3D: first, it uses a 3D DenseNet, unlike the earlier Inception and ResNet structures; second, its TTL layer uses convolutions at different scales (the Inception idea) to capture information;

  13. Temporal Relational Reasoning in Videos
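    ECCV 2018 2017-11-22 MIT · Bolei Zhou paper | pytorch | Elaine_Bao
    TSN improvement 2: focuses on temporal relational reasoning; actions that cannot be identified from a key frame alone (a single RGB image), such as falling down, can be classified through temporal reasoning; beyond reasoning between two frames, it extends to reasoning across more frames; relations over frame sets of different lengths are computed and then fused to give the result;
    The model builds on TSN and performs temporal reasoning on the input feature maps; three fully connected layers are added to learn the weights for frame sets of different lengths;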
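The multi-scale relation idea can be sketched as: enumerate time-ordered frame tuples at each scale (2 frames, 3 frames, and so on), score each tuple with a relation function, and sum the scores. In the sketch below the relation function is a toy stand-in for the learned fully connected layers, and all names are hypothetical.

```python
from itertools import combinations


def frame_tuples(num_frames, scale):
    """All time-ordered index tuples containing `scale` frames."""
    return list(combinations(range(num_frames), scale))


def multi_scale_relation(frame_features, scales, relation_fn):
    """Sum relation_fn over ordered frame tuples at every requested scale."""
    total = 0.0
    for scale in scales:
        for idx in frame_tuples(len(frame_features), scale):
            total += relation_fn([frame_features[i] for i in idx])
    return total


def displacement(features):
    """Toy relation: change between the first and last frame of a tuple."""
    return features[-1] - features[0]


# Scalar per-frame features, e.g. the height of a person in three frames.
heights = [0.0, 1.0, 3.0]
score = multi_scale_relation(heights, scales=(2, 3), relation_fn=displacement)
pairs = frame_tuples(4, 2)
```

Keeping the tuples in temporal order is what lets the relation distinguish, say, falling from standing up. Practical implementations usually subsample the tuples rather than enumerating all of them.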

  14. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
    ICCV 2017 2017-11-28 Microsoft Research Asia paper | caffe
    P3D: changes the form of the convolutions used inside the ResNet residual units;

  15. A Closer Look at Spatiotemporal Convolutions for Action Recognition
    2017-11-30 facebook paper | S3D-G_pytorch | 张智勐SDU

  16. Rethinking Spatiotemporal Feature Learning For Video Understanding
    ECCV 2018 2017-12-13 google paper | S3D-G_pytorch | 张智勐SDU

  17. DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition
    CVPR 2019 2019-01-11 paper

5.5 C3D

  1. 3D Convolutional Neural Networks for Human Action Recognition
    2013 paper | 陈泰红

  2. C3D: Generic Features for Video Analysis
    ICCV 2015 2014-12-02 facebook paper project | caffe-official | tensorflow | 极市平台
    Learning spatiotemporal features with 3d convolutional networks

5.6 RNN

  1. A Torch Library for Action Recognition and Detection Using CNNs and LSTMs
    2016 paper

  2. Delving Deeper into Convolutional Networks for Learning Video Representations
    ICLR 2016 2015-11-19 paper
    GRU

  3. RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos
    ICCV 2017 oral Shenzhen Institutes of Advanced Technology, CAS · Yu Qiao
    Unlike conventional RNNs trained on video-level categories, this paper also proposes a pose-attention mechanism;
    Contributions:
    • Unlike earlier pose-related action recognition, this is an end-to-end RNN that models the spatial-temporal evolution of human pose;
    • Rather than learning human-joint features independently, the pose-attention mechanism shares attention parameters across semantically-related human joints and then combines them through a human-part pooling layer;
    • Video pose estimation: the method can also give videos a coarse pose annotation;

  4. TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition
    2017-03-30 paper | pytorch | torch-lua | torch-lua

  5. Aggregating Frame-level Features for Large-Scale Video classification
    2017-07-04 paper
    4th-place entry in the 2017 Google Cloud & YouTube-8M challenge;

5.7 Graph Networks

Graph convolutional network resources

  1. Graph-based Spatial-temporal Feature Learning for Neuromorphic Vision Sensing
    2019-10-08 paper
    Preserves event intervals and retains continuity;

  2. Human Action Recognition with Multi-Laplacian Graph Convolutional Networks
    2019-10-15 paper
    Laplacian graph network;

6 Extension Techniques

6.1 Attention

  1. Action Recognition using Visual Attention
    2015-11-12 paper | theano | 张智勐SDU | explainer

  2. Attentional Pooling for Action Recognition

  3. Where and When to Look? Spatio-temporal Attention for Action Recognition in Videos
    ICLR 2019 2018-10- paper | explainer

  4. Semantic Adversarial Network with Multi-scale Pyramid Attention for Video Classification
    2019-03-06 paper
    A two-stream network based on RGB only;

  5. Video Action Transformer Network
    CVPR 2019 2018-12-06 DeepMind paper

  6. Marginalized Average Attentional Network for Weakly-Supervised Learning
    ICML 2019 2019-03-05 paper | pytorch | openreview | explainer

6.2 Symbolic Graphs

  1. Neural Message Passing on Hybrid Spatio-Temporal Visual and Symbolic Graphs for Video Understanding
    2019-05-17 paper

6.3 Key Frames

7 Directions for Improvement

7.1 Input Processing

Spatial

  1. A Key Volume Mining Deep Framework for Action Recognition
    CVPR 2016 2015 paper
    Key Volume Mining: processes the input data by first selecting key frames and then classifying; a two-stage method;

  2. AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos
    CVPR 2017 2016-11-24 paper | tensorflow
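    Unlike key-frame extraction, the whole video is fed in directly and the convolution/pooling operations learn to ignore redundant frames automatically;
    Slightly less accurate than key volume mining, but the model is simpler;

Temporal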

  1. Hidden Two-Stream Convolutional Networks for Action Recognition
    ACCV 2018 2017-04-02 paper | caffe-official | 陈泰红
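    Drops precomputed optical flow and lets the network learn the motion between adjacent frames on its own, which makes it 10 times faster;
    The paper draws mainly on FlowNet: a neural network learns to generate optical-flow maps, which then serve as input to the temporal network; the method improves flow quality, and the model is much smaller than FlowNet; other work has shown that better flow quality, especially for small motions at boundaries, is critical for classification;
    The paper also compares other input formats, such as RGB diff, but none works as well as optical flow;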
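The RGB diff input mentioned above is simply the difference between consecutive frames, a cheap stand-in for motion that needs no flow computation. A minimal sketch follows, using single-channel frames as nested lists for brevity (for RGB the same subtraction would be applied per channel); the function name is hypothetical.

```python
def rgb_diff(frames):
    """Differences between consecutive frames (one fewer than the input)."""
    return [
        [
            [b - a for a, b in zip(row_prev, row_next)]
            for row_prev, row_next in zip(prev, nxt)
        ]
        for prev, nxt in zip(frames, frames[1:])
    ]


# Three 2x2 frames: one pixel changes per step, the rest stay static.
frames = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 2]],
    [[1, 5], [0, 2]],
]
diffs = rgb_diff(frames)
```

Static regions cancel to zero, so the network sees only what changed between frames; unlike optical flow, the result carries no explicit direction or magnitude of motion.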

  2. Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition
    CVPR 2018 2017-11-29 paper | caffe | Elaine_Bao

7.2 Spatio-Temporal Information Fusion

  1. Beyond Gaussian Pyramid: Multi-skip Feature Stacking for Action Recognition
    2014-11-24 paper

  2. Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks
    ICCV 2015 2015-10-02 paper | 张智勐SDU

  3. Spatiotemporal Multiplier Networks for Video Action Recognition
    CVPR 2017 2017 paper | matlab | BojackHorseman
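    Both the spatial and temporal networks use a ResNet backbone, with added interaction from the motion stream to the spatial stream; the paper also explores several ways of doing this;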

  4. Spatiotemporal Pyramid Network for Video Action Recognition
    CVPR 2017 paper
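    The paper argues that the key to action recognition is fusing spatial and temporal features well; the authors observe that although a conventional two-stream network has a fusion step at the end, the two streams are trained separately, so a wrong final prediction often comes from just one of the networks, and the spatial and temporal networks each have their own strengths; the paper analyzes the causes of misclassification: the spatial network tends to fail when video backgrounds are highly similar, and the temporal network tends to fail on long-term actions because of the limited snippet length; can interaction make the two networks complement each other?
    The paper proposes the STCB module; for interaction, it keeps the spatial and temporal streams while also fusing the spatio-temporal information once, and finally fuses all three paths;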

  5. Attentional Pooling for Action Recognition
    NIPS 2017 2017-11-04 paper | project | tensorflow-official | 若羽
    Improves the interaction between the two streams at the pooling level;

  6. ActionVLAD for video action classification
    CVPR 2017 2017-04-10 paper | project | tensorflow | 思考中的哈士奇
    Improves the interaction between the two streams at the pooling level;

  7. On the Connection of Deep Fusion to Ensembling
    2016-11-23 paper | mxnet-deep_fusion | sunkeke
    Latest title: Deep Convolutional Neural Networks with Merge-and-Run Mappings
    Explores new ways of connecting the two streams, based on the ResNet structure

  8. Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence
    CVPR 2019 2019-05-22 paper | pytorch

7.3 Multimodal

  1. Unseen Action Recognition with Multimodal Learning
    2018-06-21 paper | pytorch | explainer

  2. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
    ICCV 2019 2019-08-22 paper
    Fuses images, flow, and audio at an early stage; (open question: what is "flow" here?)

  3. Seeing and Hearing Egocentric Actions: How Much Can We Learn?
    ICCV 2019 2019-10-15 paper
    Images and sound;

7.4 AutoML

  1. Video Action Recognition Via Neural Architecture Searching
    ICIP 2019 2019-07-10 paper | explainer

8 Data Types

8.1 Trimmed Video Recognition

That is, the one-hot case: one action label per clip;

8.2 Untrimmed Video Recognition

  1. UntrimmedNets for Weakly Supervised Action Recognition and Detection
    CVPR 2017 2017-03-09 paper | caffe

9 Applications

9.1 Gesture Recognition

9.1.1 Key Frames

10 Datasets

  1. CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning
    2019-10-10 paper | project (https://rohitgirdhar.github.io/CATER/) | code-official

11 Other

  1. Large-scale video classification with convolutional neural networks
    CVPR 2014 2014 Fei-Fei Li paper | 夏洛的网

  2. Efficient Large Scale Video Classification
    2015-05-22 paper

  3. Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors
    CVPR 2015 2015-05-19 paper | caffe-matlab

  4. Deep Multi Scale Video Prediction Beyond Mean Square Error
    2015-11-17 paper

  5. Long-term temporal convolutions for action recognition
    2016-01-15 paper | project | torch-lua | BojackHorseman
    LTC

  6. Multi-Task Clustering of Human Actions by Sharing Information
    CVPR 2017 paper

  7. Deep Learning on Lie Groups for Skeleton-Based Action Recognition
    CVPR 2017 2016-12-18 paper | matlab
    Combines Lie groups with neural networks;

  8. Trajectory Convolution for Action Recognition
    NIPS 2018 paper | zhihu
    TrajectoryNet: a trajectory convolution network

  9. Adversarial Perturbations Against Real-Time Video Classification Systems
    2018-07-02 paper

  10. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing
    ACM MM 2018 2018-08-02 paper | Keras | LIP-Dataset
    Human segmentation (parsing) in video;

  11. Deep Adaptive Temporal Pooling for Activity Recognition
    ACM Multimedia 2018 2018-08-22 paper


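Appendix

A Datasets

| Dataset | Source | Videos | Actions | Notes |
| --- | --- | --- | --- | --- |
| UCF-101 | YouTube | 13320 | 101 | |
| HMDB51 | YouTube | 7000 | 51 | |
| Kinetics | YouTube | 300K | 400 | |
| ActivityNet | | 10024+4926+5044 | 200 | |
| Sports-1M | | 1.2 million | 487 | 1k-3k videos per class |

  1. INRIA (France) Data Sets & Images: datasets and image collections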

B Researchers

David Forsyth
Michal Irani
Ming Lin-Alibaba 管超
张智勐
Yue Zhao

C References

  1. A roundup of video action detection & classification approaches
    paper with code Awesome Action Recognition
    MVision
    Christoph Feichtenhofer
    Gurkirt Singh

D Open-Source Projects

  1. MMAction-pytorch
