FenixEdu™

Dissertation

Action-conditioned disentanglement of agent and objects for video prediction in robotic tasks (EVALUATED)

Details: Based on past experience, the human brain is capable of anticipating the imminent future, which simplifies routine tasks such as reading or driving. Yet prediction may play an even more significant role in human faculties, as suggested by the growing support for prediction-based theories of the brain that connect perception, learning and prediction. If these theories are correct, studying the prediction of sensory signals may turn out to be an important stepping stone on the path towards intelligent systems. Predicting visual input, in the form of video frames, is perhaps the most obvious way of materializing these ideas. The effort to design video prediction models has intensified in recent years, yet current solutions are still hampered by the rapid increase in difficulty that comes with the size of the frames and the length of the prediction horizon. Two ways of mitigating these problems are to disentangle sources of information and to add the agent's future actions as an extra input. Furthermore, if the model is able to interpret the implications of each action, these modifications allow its use in robotic planning tasks. With this in mind, this work offers the following contributions: (i) a new method that evaluates video prediction models based on their ability to guide action decisions; (ii) a model that separates agent information from object information, using knowledge of future actions and the inherent structure of video data; and (iii) an investigation of whether separately predicting object information, conditioned on the actions, can improve the state of the art in video prediction.
Keywords: Video Prediction, Disentangled Representations, Robotics, Benchmarking, Predictive Coding, Deep Learning

Discussion: October 21, 2020, 10:00