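Human Action Recognition and Video Description Generation Based on Deep Learning

Published: 2018-06-14 23:30

Topics: action recognition + video description generation; Source: University of Electronic Science and Technology of China, 2017 master's thesis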

[Abstract]: Intelligent video analysis has long been an active research topic in computer vision. It covers a range of problems, including video semantic segmentation, video tracking, video retrieval, action recognition, and video description generation. To further bridge video content and high-level semantics, this thesis studies two specific video applications in depth: action recognition in video and video description generation. We treat action recognition as a low-level semantic classification problem, and video description as a high-level semantic generation problem, because the latter requires understanding visual content and natural language at the same time. In response to these challenges, the thesis addresses two questions: 1) how to build an algorithm that computes the pattern corresponding to a video; 2) how to build an effective computational framework that connects video content with natural language.

For action recognition in video, traditional methods cast the task as multi-class classification and propose various video feature extraction methods. These methods, however, extract features from low-level information such as visual texture or motion estimates in the video. Because the extracted information is limited to a single kind, it represents the video content poorly, and the classifier trained on it is therefore suboptimal. Convolutional neural networks, a deep learning technique, merge feature learning and classifier learning into a single model and have been applied successfully to action recognition in video. The convolutional network frameworks currently proposed for action recognition still have three limitations: 1) the spatial size of the input video must be fixed; 2) the duration of the input video must be fixed; 3) the network extracts features of short-term temporal structure only. These constraints are severe and hinder use in real-world scenarios. To address them, this thesis proposes an end-to-end recognition model based on a 3D convolutional network that performs action recognition on videos of arbitrary spatial size and duration. First, a video is divided into a sequence of consecutive clips. The clips are then fed into a 3D neural network composed of convolutional layers and a spatio-temporal pyramid pooling layer, which yields a sequence of clip features. A long short-term memory (LSTM) model then computes a global video feature from these clip features, and this feature serves as the action pattern. We evaluate the proposed model on three common datasets: UCF101, HMDB51, and ACT. The experimental results show that the proposed method improves recognition performance over currently popular 2D-based and 3D-based neural network models.

For video description generation, encoder-decoder frameworks are widely used. Temporal attention mechanisms have recently been proposed and shown to improve encoder-decoder description models. Temporal attention, however, only addresses the selection of video content; the context of the sentence is determined by prior semantics. Current video description methods do not consider temporal attention and prior semantic modeling together. To address this, the thesis proposes a new end-to-end neural network model that integrates high-level visual semantic concepts into the temporal attention mechanism and thereby produces more accurate video descriptions. In the proposed framework, the encoder network extracts visual features from the video and predicts semantic concepts from those features, while the decoder network generates coherent natural-language sentences from the visual features and the semantic information. Specifically, the decoder combines visual features with semantic representation features and embeds both the semantic information and the attention mechanism into the GRU unit, so that sentence generation is learned more accurately. The framework is validated on two representative datasets, MSVD and MSRVTT. The experimental results show that the proposed model outperforms previous methods on both the BLEU and METEOR metrics.
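The abstract says a spatio-temporal pyramid pooling layer lets the network accept clips of arbitrary spatial size and duration. The sketch below illustrates only that general idea and is not the thesis's implementation: the function names, the pyramid levels `(1, 2)`, the use of max pooling, and the single-channel toy volume are all assumptions made for illustration. Each level `n` splits the time, height, and width axes into `n` bins, so the output length depends only on the levels, not on the input size.

```python
def bin_edges(size, bins):
    """Return (start, end) index pairs splitting `size` positions into `bins` bins.

    Integer floor/ceil arithmetic keeps every bin non-empty for size >= 1;
    neighbouring bins may overlap by one position when size is not divisible.
    """
    return [(i * size // bins, ((i + 1) * size + bins - 1) // bins)
            for i in range(bins)]


def st_pyramid_pool(volume, levels=(1, 2)):
    """Max-pool a T x H x W nested-list volume into a fixed-length vector.

    Level n contributes n**3 values, so the output has sum(n**3) entries
    regardless of T, H, and W. (Hypothetical helper, not from the thesis.)
    """
    t_len, h_len, w_len = len(volume), len(volume[0]), len(volume[0][0])
    pooled = []
    for n in levels:
        for t0, t1 in bin_edges(t_len, n):
            for h0, h1 in bin_edges(h_len, n):
                for w0, w1 in bin_edges(w_len, n):
                    pooled.append(max(
                        volume[t][h][w]
                        for t in range(t0, t1)
                        for h in range(h0, h1)
                        for w in range(w0, w1)
                    ))
    return pooled


def make_volume(t_len, h_len, w_len):
    """Build a toy feature volume whose values encode their position."""
    return [[[t * 100 + h * 10 + w for w in range(w_len)]
             for h in range(h_len)]
            for t in range(t_len)]


# Two clips with different durations and spatial sizes.
short_clip = st_pyramid_pool(make_volume(4, 6, 8))
long_clip = st_pyramid_pool(make_volume(9, 5, 7))
print(len(short_clip), len(long_clip))  # prints "9 9"
```

Both clips map to 9 values (1 from the first level plus 8 from the second), which is what would allow a fixed-size layer or a recurrent model such as the LSTM mentioned in the abstract to consume the clip features downstream.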
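[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.41;TP181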

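Wang Xuanhan; Human Action Recognition and Video Description Generation Based on Deep Learning [D]; University of Electronic Science and Technology of China; 2017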
