|

|
Accurate 3D Pose Estimation From a Single
Depth Image
We present a novel system to estimate body pose
configuration from a single depth map. It combines both pose detection and pose refinement. The input depth map is matched with a
set of pre-captured motion exemplars to generate a body
configuration estimation, as well as semantic labeling of the input point
cloud. The initial estimation is then refined by directly fitting the body
configuration with the observation (e.g., the input depth). In addition to
the new system architecture, our other contributions include a modified
point cloud smoothing technique that handles very noisy input depth maps and a
point cloud alignment and pose search algorithm that is view-independent
and efficient. Experiments on a public dataset show that our approach
achieves significantly higher accuracy than previous state-of-the-art methods.
|
|

|
Automatic Real-Time Video Matting
Using Time-of-Flight Camera and Multichannel Poisson Equations
We present an automatic real-time video matting system.
The proposed system consists of two novel components. In order to
automatically generate trimaps for live videos, we advocate a
Time-of-Flight (TOF) camera-based approach to video bi-layer segmentation.
Our algorithm combines color and depth cues in a probabilistic fusion
framework. The scene depth information returned by the TOF camera is less
sensitive to environment changes, which makes our method robust to
illumination variation, dynamic background and camera motion. For the
second step, we perform alpha matting based on the segmentation result. Our
matting algorithm is based on a set of novel Poisson equations that are
derived for handling multichannel color vectors, as well as the depth
information captured. Real-time processing speed is achieved through
optimizing the algorithm for parallel processing on graphics hardware. We
demonstrate the effectiveness of our matting system through an extensive set of
experiments.
|
|

|
Video Stereolization:
Combining Motion Analysis with User Interaction
We present a semi-automatic system that converts
conventional videos into stereoscopic videos by combining motion analysis
with user interaction, aiming to transfer as much labeling work as possible
from the user to the computer. In addition to the widely used
structure-from-motion (SFM) techniques, we develop two new methods that analyze the
optical flow to provide additional qualitative depth constraints. They
remove the camera movement restriction imposed by SFM so that general
motions can be used in scene depth estimation, the central problem in mono-to-stereo conversion. With
these algorithms, the user’s labeling task is significantly simplified. We
further develop a quadratic programming approach that incorporates both
quantitative depth and qualitative depth (such as that from user
scribbles) to recover dense depth maps for all frames, from which
stereoscopic views can be synthesized. In addition to visual results, we
present user study results showing that our approach is more intuitive and
less labor intensive, while producing 3D effects comparable to those from
current state-of-the-art interactive algorithms.
|
|

|
Interreflection Removal
for Photometric Stereo by Using Spectrum-dependent Albedo
We present a novel method that can separate m-bounced light
and remove the interreflections in a photometric stereo setup. Under the
assumption of a uniformly colored Lambertian surface, the intensity of a
point in the scene is the sum of 1-bounced through m-bounced light
rays. By the law of diffuse reflection, each time a light ray
bounces off the surface, its intensity is attenuated by the
albedo ρ. This implies that the measured
intensity can be written as a polynomial in ρ, in which the contribution of the m-bounced light
rays is the ρ^m term.
Therefore, when we change the surface albedo, the intensity of the
m-bounced light changes with the m-th power of the albedo. This non-linearity makes it
possible to separate the m-bounced light. In practice, we illuminate
the scene with different light colors to effectively simulate different
surface albedos, since albedo is spectrum dependent. Once the m-bounced
light rays are separated, we can run the photometric stereo algorithm
on the 1-bounced (direct lighting) images to recover the 3D shape
without the impact of interreflections. Experiments show that we obtain
significantly improved scene reconstruction with a minimum of two color
images.
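The polynomial argument can be illustrated with a small numerical sketch. It assumes, purely for illustration, that bounces beyond the second are negligible, so each pixel obeys I = rho * a + rho^2 * b, where rho is the albedo and a and b are hypothetical names for the albedo-free 1-bounce and 2-bounce terms. It also assumes the albedos under the two light colors are known and distinct. This is a toy model of the idea, not the authors' implementation.

```python
import numpy as np

def separate_bounces(i1, i2, rho1, rho2):
    """Split two observations into 1-bounce and 2-bounce components.

    Assumes I_k = rho_k * a + rho_k**2 * b at every pixel, with known,
    distinct, non-zero albedos rho1 and rho2 (illustrative assumption).
    Returns (direct, indirect) as seen in the first observation.
    """
    i1 = np.asarray(i1, dtype=float)
    i2 = np.asarray(i2, dtype=float)
    # Per-pixel 2x2 system: [[rho1, rho1^2], [rho2, rho2^2]] @ [a, b] = [I1, I2]
    m = np.array([[rho1, rho1**2], [rho2, rho2**2]], dtype=float)
    inv = np.linalg.inv(m)
    a = inv[0, 0] * i1 + inv[0, 1] * i2
    b = inv[1, 0] * i1 + inv[1, 1] * i2
    return rho1 * a, rho1**2 * b

# Synthetic 2x2 "images" built from known components.
a_true = np.array([[1.0, 2.0], [0.5, 0.0]])
b_true = np.array([[0.2, 0.4], [0.1, 0.3]])
rho1, rho2 = 0.8, 0.4
i1 = rho1 * a_true + rho1**2 * b_true
i2 = rho2 * a_true + rho2**2 * b_true
direct, indirect = separate_bounces(i1, i2, rho1, rho2)
```

With more bounces the same idea needs more distinct albedos (light colors), one per unknown polynomial coefficient, and the system becomes increasingly ill-conditioned as the albedos approach each other.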
|
|

|
Learning-based Face Modeling
from a Single Image
The 3D reconstruction of a face from a single frontal
image is an ill-posed problem. This is further accentuated when the face
image is captured under different poses and/or complex illumination
conditions. We aim to solve the shape recovery problem from a single facial
image under these challenging conditions. The local image models for each
patch of facial images and the local surface models for each patch of 3D
shape are learned using a non-linear dimensionality reduction technique,
and the correspondences between these local models are then learned by a
manifold alignment method. By combining the local shapes, the global shape
of a face can be reconstructed directly using a single least-squares system
of equations. We perform experiments on synthetic and real data, and
validate the algorithm against the ground truth. Experimental results show
that our method can yield accurate shape recovery from out-of-training
samples with a variety of pose and illumination variations.
|
|

|
Semantic Segmentation of
Urban Scenes Using Dense Depth Maps
We present a framework for semantic scene parsing and
object recognition based on dense depth maps. Five view-independent 3D
features that vary with object class are extracted from dense depth maps at
a super-pixel level to train a randomized decision
forest classifier. Our formulation integrates multiple features in a Markov Random
Field (MRF) framework to segment and recognize different object classes in
query street scene images. We evaluate our method both quantitatively and
qualitatively on the challenging Cambridge-driving Labeled Video Database
(CamVid). The results show that, using only dense depth information, we
achieve more accurate segmentation and recognition overall than with
sparse 3D features or appearance, or even the combination of sparse 3D
features and appearance, advancing state-of-the-art performance.
|
|

|
A Volumetric Approach
for Merging Range Images of Semi-Rigid Objects Captured at Different Time
Instances
We present a framework for reconstructing complete 4D
models of semi-rigid objects from a single stereoscopic sequence, extending the powerful structure-from-motion method to
dynamic scenes. We developed a novel volumetric distance field warping
function so that depth maps from different times, even in the presence of
non-rigid deformations, can be mapped to time t and merged together.
|
|

|
Fusion of Passive Stereo and
Time-of-Flight
Time-of-flight range sensors have error characteristics
which are complementary to those of passive stereo. They provide real-time depth
estimates in conditions where passive stereo does not work well, such as on
white walls. However, these sensors are noisy and often perform poorly
on the textured scenes for which stereo excels. We introduce a method for
combining the results from both methods that performs better than either
alone. A depth probability distribution function from each method is
calculated and then merged. In addition, stereo methods have long used
global methods such as belief propagation and graph cuts to improve
results, and we apply these methods to this sensor. Since time-of-flight
devices have primarily been used as individual sensors, they are typically
poorly calibrated. We introduce a method that substantially improves upon
the manufacturer’s calibration. We show that these techniques lead to
improved accuracy and robustness.
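One simple way to merge two per-pixel depth distributions is to multiply them and renormalize, which treats the two sensors as independent. The sketch below assumes this product rule and a discrete set of candidate depths. The function name and array layout are hypothetical, and the text above does not state which merging rule the method actually uses.

```python
import numpy as np

def fuse_depth_distributions(p_stereo, p_tof, eps=1e-12):
    """Fuse two discrete depth distributions per pixel (product rule).

    p_stereo, p_tof: arrays of shape (H, W, D), each summing to 1 along D.
    Returns (fused distribution, index of the most probable depth per pixel).
    """
    fused = np.asarray(p_stereo, dtype=float) * np.asarray(p_tof, dtype=float)
    fused = fused + eps  # guard against an all-zero product
    fused /= fused.sum(axis=-1, keepdims=True)
    return fused, fused.argmax(axis=-1)

# One pixel, four candidate depths: stereo cannot decide between depths 1 and 2
# (e.g. a textureless wall), while time-of-flight has a broad peak at depth 2.
p_s = np.array([[[0.05, 0.45, 0.45, 0.05]]])
p_t = np.array([[[0.10, 0.20, 0.50, 0.20]]])
fused, best = fuse_depth_distributions(p_s, p_t)
```

In this toy case the fused estimate picks depth index 2: the time-of-flight distribution breaks the tie that stereo alone leaves open.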
|
|

|
Modeling Deformable
Objects from a Single Depth Camera
We propose a novel approach to reconstruct complete 3D
deformable models over time with a single depth camera, provided that most parts
of the models are observed by the camera at least once. The core of this
algorithm is based on the assumption that the deformation is continuous and
predictable in a short temporal interval. While the camera can only capture
part of a whole surface at any time instant, partial surfaces reconstructed
from different times are assembled together to form a complete 3D surface
for each time instant, even when the shape is under severe deformation. A
mesh warping algorithm based on linear mesh deformation is used to align
different partial surfaces. A volumetric method is then used to combine
partial surfaces, fill holes, and smooth alignment errors. Our
experiment shows that this approach is able to reconstruct visually
plausible 3D surface deformation results with a single camera.
|
|

|
Multi-Projector
Display Systems
The goal of this research is to create prototypes of
rapidly assembled and calibrated multi-projector display systems capable of
displaying any content in any situation. These displays provide ultra-high
resolution in a large format with a short depth footprint. Due to their
rapid assembly and calibration, they are ideally suited for portable
operations such as mobile command centers for first responders,
field-deployable troop training environments, and conference and trade show
displays. The software driving these displays empowers the user with easy
content management and control of multiple windows in multiple displays.
Still images, video, live capture feeds, remote desktop connections, DVR
servers, and other content can be easily managed and simultaneously
integrated into one display.
|
|

|
Unsupervised
Learning of High-order Structural Semantics from Images
We present a new unsupervised learning algorithm to find
high-order frequently occurring visual patterns (semantics) in images
beyond the spatial proximity assumption. We believe semantics are composed
of image features whose geometric relationships are consistent sufficiently
often. An efficient polynomial-time algorithm is developed to search for
meaningful and strong associations between pair-wise visual clusters over the
entire image space. High-order composite visual structures are extracted by
frequent subgraph mining on an undirected labeled
graph built upon all pair-wise associations.
|
|

|
Physically Guided Liquid Surface
Modeling from Videos
We present an image-based reconstruction framework to
model real water scenes captured by stereoscopic video. The combination of
image-based reconstruction with physically-based simulation allows us to
model complex and dynamic objects such as fluid. Using a depth map sequence
as initial conditions, we use a physically based approach that
automatically fills in missing regions, removes outliers, and refines the
geometric shape so that the final 3D model is consistent with both the input
video data and the laws of physics.
|
|

|
Fusion of Passive Stereo and
Time-of-Flight (Active)
Time-of-flight range sensors have error characteristics which
are complementary to those of passive stereo. They provide real-time depth estimates
in conditions where passive stereo does not work well, such as on white
walls. However, these sensors are noisy and often perform poorly on the
textured scenes for which stereo excels. We introduce a method for
combining the results from both methods that performs better than either
alone. We show that these techniques lead to improved accuracy and
robustness.
|
|
 
|
Pixel Router
Even with the recent
rapid advancement in hardware, the demand from high-end graphics
applications (including video games) seems to always outpace the capability
that a single GPU can offer. As we migrate from a single GPU to multiple
GPUs or eventually GPU clusters, how to effectively assemble the final
image from these distributed rendering nodes becomes an important issue.
Here we propose to develop a flexible pixel compositor to solve this
problem.
|
|

|
Spatial-Depth Super
Resolution for Range Images
We present a new
post-processing step to enhance the resolution of range images. Using one
or two registered and potentially high-resolution color images as
reference, we iteratively refine the input low-resolution range image, in
terms of both its spatial resolution and depth precision. Evaluation using
the Middlebury benchmark shows across-the-board improvement for sub-pixel
accuracy. We also demonstrate its effectiveness for spatial resolution
enhancement up to 100X with a single reference image.
|
|
 
|
Light
Fall-off Stereo
Light Fall-off Stereo (LFS) is a new method for computing depth from scenes beyond
Lambertian reflectance and texture. Compared to previous reconstruction
methods for non-Lambertian scenes, LFS needs as few as two images and does not
require a calibrated camera, calibrated light sources, or reference objects in the
scene.
|
|

|
3D Urban Reconstruction from Video
This project aims at developing a fully automated system for
the accurate and rapid 3D reconstruction of urban environments from video
streams. The system collects multiple video streams, as well as GPS and INS
measurements in order to place the reconstructed models in geo-registered
coordinates. Besides high quality in terms of both geometry and appearance,
we aim at real-time performance on a combination of CPUs and GPUs.
|
|
 
|
BRDF Invariant Stereo
using Light Transport Constancy
Nearly all existing methods for stereo reconstruction
assume that scene reflectance is Lambertian and make use of brightness
constancy as a matching invariant. We introduce a new invariant for stereo
reconstruction called Light Transport Constancy, which allows completely
arbitrary scene reflectance (BRDFs).
|
|
  
|
Towards Space-time Light Field Rendering
In this paper we
propose a novel framework, space-time light field rendering, which allows continuous
exploration of a dynamic scene in both the spatial and temporal domains with
unsynchronized input video sequences.
|
|

|
Toward the Light Field Display:
Autostereoscopic Rendering
via a Cluster of Projectors
Ultimately, a
display device should be capable of reproducing the visual effects that are
produced by reality. In this paper we introduce an autostereoscopic display
that uses a scalable array of digital light projectors and a projection
screen augmented with microlenses to simulate a light field for a given
three-dimensional scene.
|

|
Projector-Whiteboard-Camera
System for Remote Collaboration
In a typical remote collaboration setup, two or
more projector-camera pairs are "cross-wired" to form a full-duplex
system for two-way communication. A whiteboard can be used as the projector
screen, and in that case, the whiteboard serves as an output device as well
as an input device. Users can write on the whiteboard to comment on what is
projected or to add new thoughts in the discussion.
|
|

|
Wide-area Rapid Iris Image Capture with
Pan-tilt-zoom Cameras
In response to the DHS
interest in fast biometric measurement, this project will develop a system
to rapidly capture iris images of moving human subjects at long range.
Working in concert with a currently available commercial iris
identification software package, our system will provide fast, accurate,
and automated biometric identification first for homeland security and also
for other fields requiring identification or authentication.
|
|
 
|
High Quality
and Real-time Stereo Algorithms
We have been designing algorithms for the dense
two-frame stereo matching problem, aiming at both high reconstruction
quality and real-time performance. Evaluation using the benchmark Middlebury stereo database shows
that our algorithms are among the best in terms of both quality and speed.
|
|
|
|
Dr.
Yang's Previous and Current Work:
|
|
3D Reconstruction and View
Synthesis (from May 2001)
|
|
  

|
3D
Physically-based 2D View Synthesis
As part of Dr. Yang's thesis work, he is working on a new
statistical approach for view synthesis. It is particularly effective for
texture-less regions and specular highlights, two major problems that most
existing reconstruction techniques have difficulty with. We are
preparing to report our work to ICCV 2003. Some initial results are
presented on the left: the top row shows several input images, while the
bottom row shows the reconstructed point cloud.
|
|

|
Real-time Stereo
A
multi-resolution stereo algorithm that can be implemented on commodity
graphics hardware. A paper and a live demo appeared in CVPR 2003.
|
|

|
Real-time View Synthesis on
Graphics Hardware
We present a novel use of commodity graphics hardware that
effectively combines a plane-sweeping algorithm with view synthesis for
real-time, on-line 3D scene acquisition and view synthesis. The heart of
our method is to use programmable Pixel Shader technology to square
intensity differences between reference image pixels, and then to choose
final colors that correspond to the minimum difference, i.e. the most
consistent color. We filed an invention disclosure with UNC.
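The per-pixel selection step described above can be sketched on the CPU with NumPy. The sketch assumes the input images have already been warped onto each candidate depth plane (the warping itself is omitted), and it measures consistency as the sum of squared differences from the mean color. The function name, array layout, and this particular cost are illustrative assumptions rather than the shader implementation.

```python
import numpy as np

def plane_sweep_select(warped):
    """Pick the most color-consistent depth plane per pixel.

    warped: array of shape (D, N, H, W, 3), i.e. N input images already
            warped onto each of D candidate depth planes (assumed given).
    Returns (color of shape (H, W, 3), best plane index of shape (H, W)).
    """
    warped = np.asarray(warped, dtype=float)
    mean = warped.mean(axis=1)                               # (D, H, W, 3)
    # Squared differences from the mean, summed over images and channels.
    cost = ((warped - mean[:, None]) ** 2).sum(axis=(1, 4))  # (D, H, W)
    best = cost.argmin(axis=0)                               # (H, W)
    rows, cols = np.indices(best.shape)
    color = mean[best, rows, cols]                           # (H, W, 3)
    return color, best

# Toy input: 3 planes, 2 images, 2x2 pixels. On plane 1 both images agree
# exactly (constant gray), so that plane should win everywhere.
rng = np.random.default_rng(0)
warped = rng.random((3, 2, 2, 2, 3))
warped[1] = 0.5
color, best = plane_sweep_select(warped)
```

On graphics hardware the same loop structure maps naturally to one rendering pass per plane, with the running minimum kept in a buffer; the NumPy version simply materializes all planes at once for clarity.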
|
|
Internship at Microsoft Research
(Mentor: Zhengyou Zhang),
Summer 2001
|
|

|
Eye-Gaze Correction
Dr.
Yang's internship at Microsoft Research (MSR) during summer 2001 focused
on maintaining eye contact for desktop video teleconferencing. The team took a
model-based approach that incorporates a detailed individualized
three-dimensional head model with stereoscopic analysis. This approach is
very effective; it probably achieved the most realistic results in the
published literature for eye-gaze correction. In the process, it also
yields very accurate 3D tracking of the head pose. The images show the
face model projected on the tracked head. MSR has filed two patent applications
for the algorithms and systems.
|
|
Large Format Display
(2000-2001)
|
|

|
PixelFlex: A Reconfigurable
Multi-Projector Display System
The PixelFlex system is composed of ceiling-mounted
projectors, each with computer-controlled pan, tilt, zoom and focus; and a
camera for closed-loop calibration. Working collectively, these
controllable projectors function as a single logical display capable of
being easily modified into a variety of spatial formats. The left image
shows a stacked configuration that can be used for stereo display.
|
|

|
Automatic Projector
Display Surface Estimation Using Every-Day Imagery
We introduce a new method for continuous display surface
auto-calibration. Using a camera that observes the display surface, we
match image features in whatever imagery is being projected, with the
corresponding features that appear on the display surface, to continually
refine an estimate for the display surface geometry. In effect we enjoy the
high signal-to-noise ratio of "structured" light (without getting
to choose the structure) and the unobtrusive nature of passive
correlation-based methods.
|
|
Tele-Immersion
(1998-current)
|
|

|
Group Teleconferencing
We want
to design a system that facilitates many-to-many teleconferencing. Instead
of providing a perspectively correct view for every single user, we strive
to provide the best approximating view for the entire group as a
whole. We demonstrate two real-time acquisition-through-rendering
algorithms: one is based on view-dependent texture mapping with
automatically acquired approximating geometry, and the other uses an array
of cameras to perform Light Field style rendering.
|
|

|
3D Tele-Immersion
The goal
of Tele-Immersion is to enable users at geographically distributed sites to
collaborate in real time in a shared, simulated environment as if they were
in the same physical room. While the entire project was an
interdisciplinary, multi-site collaboration, Dr. Yang was mainly involved
in real-time data capture and distribution.
|
|

|
2D Immersive Teleconferencing
We worked on improving the field
of view and resolution for 2D video teleconferencing. The result is a
simple, yet effective technique for producing geometrically correct imagery
for teleconferencing environments. The necessary image transformations are
derived by finding a direct one-to-one mapping between a capture device and
a display device for a fixed viewer location, thus completely avoiding the
need for any intermediate, complex representations of screen geometry,
capture and display distortions, and viewer location. Using this technique,
we can easily build an immersive teleconferencing system using multiple
projectors and cameras.
|
|
|
Geometrically
Correct Imagery for Teleconferencing
|
|
|
Multi-Projector Displays Using
Camera-Based Registration
|