FenixEdu™

Dissertation

Augmentation of Two-stream CNN architectures with context and attention for action detection and recognition (EVALUATED)

Details: Tasks such as action detection and recognition are a promising step in several areas such as retail, security, robotics and recommendation systems. Recently, challenging datasets have been introduced that are representative of multi-person, multi-label spatiotemporal action detection and recognition. We propose to augment the state-of-the-art two-stream CNN architectures for this task. These architectures are limited in that they try to detect actions independently of the background and of other humans in the same video. To address this limitation, three novel contributions are presented: attention filtering, context streams and a combination of both. For attention filtering, with the goal of extracting information not only from a target but also from the image background, we train two-stream CNN architectures with different kinds of filters applied to the RGB and Optical Flow inputs. For context streams, with the goal of predicting the labels of a target using the labels of the surrounding neighbours, we use dataset labels to explicitly encode the relationship between classes performed by multiple humans as context features and train LSTM networks on these features. Finally, we combine these methods by fusing the context streams with the two-stream approaches trained with attention filtering. Results show that the combination of the first two methods outperforms each of them individually and that all augmentations improve on a two-stream CNN baseline.
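The abstract does not specify how neighbour labels are encoded as context features, so the following is only a minimal sketch under an assumed encoding: for one frame, count how many neighbours of the target perform each class. The function name `context_features`, the person identifiers and the class indices are all hypothetical and not taken from the thesis. A sequence of such vectors over consecutive frames is one plausible input for an LSTM.

```python
from typing import Dict, List, Set


def context_features(
    frame_labels: Dict[str, Set[int]],
    target_id: str,
    num_classes: int,
) -> List[int]:
    """Count, per class, how many neighbours of `target_id` perform it.

    `frame_labels` maps each person in a frame to the set of class
    indices annotated for that person (multi-label). This encoding is
    an assumption made for illustration only.
    """
    counts = [0] * num_classes
    for person_id, labels in frame_labels.items():
        if person_id == target_id:
            # Skip the target: its labels are what a model would predict.
            continue
        for cls in labels:
            counts[cls] += 1
    return counts


# Hypothetical frame with three people and four action classes.
frame = {"p1": {0, 2}, "p2": {2}, "p3": {1, 2}}
features = context_features(frame, "p1", num_classes=4)
print(features)  # [0, 1, 2, 0]
```

Counting rather than using a binary indicator keeps information about how many neighbours share an action; whether that helps is an open design choice here, not a result from the thesis.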
Keywords: Action Detection, Action Recognition, Multi-label datasets, Attention filters, Spatiotemporal relationships, Convolutional Neural Networks

Discussion: November 9, 2018, 9:00