Multimodal Speaker Localization and Identification for Video Processing

(Paperback) | Released: 26 Jan 2017

By: Yongtao Hu (Author) | Publisher: Open Dissertation Press | Publisher Imprint: Open Dissertation Press

AED171

Out of Stock

About the Book

This dissertation, "Multimodal Speaker Localization and Identification for Video Processing" by Yongtao Hu (胡永涛), was obtained from The University of Hong Kong (Pokfulam, Hong Kong) and is being sold pursuant to the Creative Commons Attribution 3.0 Hong Kong License. The content of this dissertation has not been altered in any way. We have altered only the formatting, to make the dissertation easier to print and read. All rights not granted by the above license are retained by the author.

Abstract: With the rapid growth of multimedia data, especially videos, the ability to understand them better and more time-efficiently is becoming increasingly important. In videos, speakers, who are normally what our eyes focus on, play a key role in understanding the content. Detailed information about the speakers, such as their positions and identities, supports many high-level video processing and analysis tasks, such as semantic indexing, retrieval, and summarization. Recently, some multimedia content providers, such as Amazon/IMDb and Google Play, have been able to provide additional cast and character information for movies and TV series during playback, which can be achieved via a combination of face tracking, automatic identification, and crowdsourcing. The main topics include speaker localization, speaker identification, speech recognition, etc. This thesis first investigates the problem of speaker localization. A new algorithm for effectively detecting and localizing speakers based on multimodal visual and audio information is presented. We introduce four new features for speaker detection and localization, namely lip motion, center contribution, length consistency, and audio-visual synchrony, and combine them in a cascade model. Experiments on several movies and TV series indicate that, taken together, they improve speaker detection and localization accuracy by 7.5%-20.5%.
Based on the locations of speakers, an efficient optimization algorithm for determining appropriate locations to place subtitles is proposed. This further enables us to develop an automatic end-to-end system for subtitle placement for TV series and movies. The second part of this thesis studies the speaker identification problem in videos. We propose a novel convolutional neural network (CNN) based learning framework to automatically learn the fusion function of face and audio cues. A systematic multimodal dataset with face and audio samples collected from real-life videos is created. The high variation of the samples in the dataset, including pose, illumination, facial expression, accessory, occlusion, image quality, scene, and aging, closely approximates realistic scenarios and allows us to fully explore the potential of our method in practical applications. Extensive experiments on our new multimodal dataset show that our method achieves state-of-the-art performance (over 90%) in the speaker naming task without using face/person tracking, facial landmark localization, or subtitles/transcripts, thus making it suitable for real-life applications. The speaker-oriented techniques presented in this thesis have many applications in video processing. Through extensive experimental results on multiple real-life videos, including TV series, movies, and online video clips, we demonstrate the ability to extend our previous multimodal speaker localization and speaker identification algorithms to video processing tasks.
In particular, three main categories of applications are introduced, including (1) combining our speaker-following video subtitles and speaker naming work to enhance the video viewing experience, where a comprehensive usability study with 219 users verifies that our subtitle placement method outperforms both conventional fixed-position subtitling and a previous dynamic subtitling method in terms of enhancing the overall viewing experience and reducing eyestrain; (2

Product Details

ISBN-13: 9781361033906
ISBN-10: 1361033908
Publisher: Open Dissertation Press
Publisher Imprint: Open Dissertation Press
Publisher Date: 26 Jan 2017
Binding: Paperback
Language: English
No. of Pages: 122
Height: 279 mm
Width: 216 mm
Spine Width: 7 mm
Weight: 299 g
