This research explores the idea of extracting three-dimensional features from video clips in order to aid various video analysis and mining tasks. Although video analysis problems are well established in the literature, the use of three-dimensional information is scarce because of the inherent difficulty of building such a system. When the only input is a video stream, with no prior knowledge of the scene or camera (a typical scenario in video analysis), extracting meaningful and accurate 3D representations becomes a very difficult task. However, several recently proposed methods have made progress towards this goal by applying techniques from related areas, including simultaneous localization and mapping (SLAM), structure from motion (SFM), and 3D reconstruction. In this research, I present two main contributions towards solving this problem. First, I propose a method capable of generating a three-dimensional representation of a scene as observed by a monocular video, using no prior information. The method exploits the movement of the camera while robustly tracking features over time in order to obtain multiple views of the scene and perform 3D reconstruction. The system performs automatic camera calibration, estimates the three-dimensional structure of the scene, and tracks the scene over time, refining its results as new frames arrive. Additionally, the system can track a scene even in the presence of moving people, a limitation of most SLAM and SFM approaches available in the literature. Second, I present a method for extracting the three-dimensional pose and motion of a person in a video. The method extends previously published work on two-dimensional human pose estimation by incorporating a human motion model, and it lifts the two-dimensional pose into three dimensions using several heuristics.
Together, these methods yield an intrinsic 3D representation of the static background and the people in a scene, which can be used to solve various video analysis tasks. To demonstrate the feasibility of the proposed methods, I show how they can be applied to a selection of such tasks. First, I show how a three-dimensional point cloud of the scene can be used along with robust feature tracking to detect shot boundaries in the video. Next, I present an automatic approach to stereoscopic video conversion that uses no prior knowledge of the input video. Finally, I illustrate how a three-dimensional human model can be combined with simple linear classifiers to perform human action recognition with high classification accuracy.
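As a rough illustration of the shot-boundary idea, the sketch below assumes that a cut shows up as a sudden collapse in the fraction of tracked features that survive from one frame to the next. This is a simplification and not the method described in this work: it ignores the 3D point cloud entirely, and the function name `detect_shot_boundaries`, the `min_survival` threshold, and the toy track IDs are all hypothetical.

```python
from typing import List, Set


def detect_shot_boundaries(tracks_per_frame: List[Set[int]],
                           min_survival: float = 0.3) -> List[int]:
    """Return each index i where a cut is assumed between frame i-1 and frame i.

    tracks_per_frame[i] holds the IDs of the feature tracks alive in frame i.
    min_survival is a hypothetical threshold, not a value from the source.
    """
    boundaries = []
    for i in range(1, len(tracks_per_frame)):
        previous, current = tracks_per_frame[i - 1], tracks_per_frame[i]
        if not previous:
            # Nothing was tracked in the previous frame, so the
            # survival ratio is undefined; skip this pair.
            continue
        # Fraction of the previous frame's tracks still alive now.
        survival = len(previous & current) / len(previous)
        if survival < min_survival:
            boundaries.append(i)
    return boundaries


# Toy input: tracks 1-7 belong to one shot, tracks 10-14 to the next.
frames = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 6},
    {1, 2, 3, 6, 7},
    {10, 11, 12, 13},
    {10, 11, 12, 14},
]
print(detect_shot_boundaries(frames))  # prints [3]
```

In a real pipeline the track IDs would come from a feature tracker, and the threshold would need tuning, since fast camera motion can also lower the survival ratio without a cut.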