Michael S. Brown
W. Brent Seales
University of North Carolina at Chapel Hill
Chapel Hill, NC 27516
Current camera-monitor teleconferencing applications produce unrealistic imagery and break any sense of presence for the participants. Other capture-display technologies can be used to provide more compelling teleconferencing. However, complex geometries in capture-display systems make producing geometrically correct imagery difficult. It is usually impractical to detect, model, and compensate for all effects introduced by the capture-display system. Most applications simply ignore these issues and rely on user acceptance of the camera-monitor paradigm.
This paper presents a new and simple technique for producing geometrically correct imagery for teleconferencing environments. The necessary image transformations are derived by finding a mapping between a capture and display device for a fixed viewer location. The capture-display relationship is computed directly in device coordinates and completely avoids the need for any intermediate, complex representations of screen geometry, capture and display distortions, and viewer location. We describe our approach and demonstrate it via several prototype implementations that operate in real-time and provide a substantially more compelling feeling of presence than the standard teleconferencing paradigm.
Keywords Telepresence, immersive display, teleconference, video conference, graphics, real-time.
Video-conferencing is a reality and is already widely available over dedicated lines at a number of institutions and businesses. Emerging technologies within the computer networking community promise to make the Internet a viable medium for high-fidelity real-time video. While there has been wide interest in addressing the underlying transport technologies for real-time video, the issue of how best to capture and then present a video stream has been neglected.
[Figure 1: (a) Typical teleconferencing. (b) Compelling teleconferencing.]
Currently, the majority of video and teleconferencing applications limit the capture of a video stream to a single sensor, and the display to a CRT or flat-panel device. This camera-monitor interface does not provide a compelling or convincing presence to the participants [#!Yamaashi!#]. The display scale and constrained field of view of these interfaces allow for very limited interaction, providing the users with only a small ``through-a-window'' view of each other (Fig. 1).
In this paper, we present a new and simple technique that enables more realistic and compelling teleconferencing applications. We relax constraints on both the capture device and the display environment, and show how to capture and then display the image data in a way that guarantees that the viewer will see perspectively correct imagery. We construct a scalable display environment by tiling a display surface with many light projectors. Image capture is performed with any image sensor that maintains a common center of projection (COP). This can be a single camera with a wide-angle lens, or a panoramic image device composed of many individual cameras. Using an explicit capture-to-display surface mapping, we capture and render perspectively correct imagery in real-time for a stationary user. Our technique compensates for camera and display distortion, and allows the display geometry (surface and projector orientations) to be arbitrary. This means that any existing display surface, such as the corner of an office or a set of white panels leaned against the wall, can be used as the display area. Our technique is easily scalable to very wide field of view (WFOV) environments. We demonstrate this scalability in section 6 by showing a WFOV teleconferencing environment we created using multiple light projectors and a common COP multi-camera rig.
There are numerous video conferencing products available. Hewitt [#!Hewitt!#] provides an excellent review of current video conferencing products and standards. Most video conferencing applications concentrate on networking issues and assume that the capture/display paradigm is that of a single camera and a CRT. Their emphasis is on image coding, network transport and collaborative user interfaces. These applications provide for the users a ``video in a window''[#!Mbone!#,#!mcanne!#] interface. Though they have improved significantly over the years, this through-a-window paradigm inhibits a realistic sense of presence between the participants.
There have been several approaches to making teleconferencing applications seem more natural and compelling. One interesting approach is the CU-SeeMe VR [#!cornell!#] system, which combines teleconferencing with virtual reality. In this system, live video streams from conference participants are projected onto flat surfaces. These flat surfaces representing participants are allowed to roam around within a shared virtual conference room. This is a first-order approach to using avatars for teleconferencing; participants see a 3D view of the 2D avatars within the virtual space. Although the system maintains the collaborative virtual space, the final output is still a desktop application, and the participants see the virtual world and the representation of others in it through a small window on a CRT.
Yamaashi addressed the limitation of the camera/monitor model by providing a user with two separate views: one wide-angle, and the other a controllable, detailed view [#!Yamaashi!#]. The two views are separated into different viewports, and a simple user interface allows the user to pan the detailed view towards regions of interest. Although it enhances the user's awareness of the remote site, the geometric discontinuities of the viewports break any natural sense of presence. Yamaashi also points out the need for correcting the distortion for the wide-angle view.
Raskar et al. [#!OOTF!#] proposed a teleconferencing and tele-collaboration interface that moves away from the desktop metaphor and toward an immersive context. They proposed to extract reflectance and depth information dynamically for all visible pixels in a room including walls, furniture, objects, and people. With an exhaustive representation of the environment, the system could exchange models over the network with a remote site with similar setup. These models would allow very realistic images of people and objects to be rendered and displayed within the correct geometric setting of the environment. A practical implementation of this system is not currently possible and requires solutions to difficult geometric problems as well as systems-level problems.
Teleconferencing applications have three major components: video capture, transport, and display. We concentrate on video capture and its display, and assume a 30 frame-per-second delivery capability of the underlying network.
Traditional video conferencing systems generally have a one-to-one camera-to-display relationship, i.e., a camera at one site captures a video stream and sends it to another site, which displays the video on some display device. This setup is duplicated at both sites for two-way communication. Generally, other than compression and decompression for bandwidth constraints, no other processing is applied to the video stream. Undesirable effects, such as camera lens distortion, are not removed, and appear as distortion in the final imagery. To lessen these effects, video-conferencing systems use narrow field of view (FOV) lenses, which help keep such distortion to a minimum. However, depending on the number of participants, a narrow FOV can limit the interaction between the users. To mitigate this limitation, video conferencing cameras are sometimes mounted on controllable pan-tilt units, or even managed by a camera person, allowing the camera to continuously frame the object of attention.
In many cases capture distortion can be corrected and is not always undesirable. Nayar's Omni-Camera [#!Nayar!#], which uses a parabolic mirror to reflect incoming light toward a single center of projection, is able to capture 360 degree horizontal FOV imagery. Software is available to un-distort the imagery to make it look correct. There are several commercially available devices that produce imagery in this manner [#!OMNI!#,#!Nalwa!#]. Systems that capitalize on introducing a known distortion and provide software/hardware for un-distorting to produce WFOV imagery certainly help to allow more participants to be viewed by the users. A primary drawback is that the teleconferencing application becomes dependent on a very specialized capture device.
Teleconferencing imagery is almost always viewed on a flat display, typically a CRT device. This presents two problems. First, the scale of the imagery is almost always much smaller than real-life (postage-stamp size is a descriptive term heard quite frequently). Second, when WFOV capture devices are used, CRT and flat panel displays are inappropriate because their FOV is usually significantly smaller than the FOV of the capture devices. In order to view WFOV imagery, the user must scale the imagery down or scroll through the imagery via software. Using a wall of flat panel displays can increase the FOV, but introduces obvious seams between adjacent panels, again breaking the sense of presence.
One solution to these problems is to use light projectors, which can display large-scale imagery. Multiple light projectors can be overlapped or carefully abutted to produce WFOV displays. This idea has been used in a number of immersive display systems, such as the University of Illinois at Chicago's CAVE [#!CAVE!#] and TriDimensions theaters [#!triD!#]. Apart from distortions at the capture device, light projector devices also introduce their own distortions. The distortion from light projectors is often more pronounced than that of CRTs or LCD panels, because the display surface is no longer an integrated part of the display device itself. Typical problems include non-planar display surfaces, off-axis projection, and key-stoning. Existing systems, like the CAVE, try to avoid display distortion by carefully mounting the projectors and the display surface, which greatly increases the setup and maintenance cost. Such display environments are extremely expensive and difficult for untrained users to set up, requiring precise construction and high-end graphics workstations for image rendering. These constraints make these systems an undesirable solution for teleconferencing applications.
Recent techniques using light projector displays that are less rigidly constrained have emerged [#!japan_paper!#], and those algorithms allow the formation of seamless imagery even while displaying on arbitrary surfaces. In our work, we expand those techniques and unify them with the capture device.
We first discuss the straightforward case of one camera and one light projector. If the camera performs a pure perspective projection (i.e., modeled exactly as a pin-hole camera) and is positioned such that its optical axis is orthogonal to a planar display surface, the mapping between camera coordinates C(u,v) and projector coordinates P(x,y) can be described as a change in scale: C(u,v) = P(su,sv), where s is the scale factor. Knowing the scale factor makes it possible to capture and display images so that they appear geometrically correct.
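For this ideal case the mapping is trivial to express in code. A minimal sketch (Python used purely for illustration; the function name and example scale factor are our own):

```python
def camera_to_projector(u, v, s):
    """Ideal pinhole camera with its optical axis orthogonal to a
    planar display surface: the capture-to-display mapping reduces
    to a uniform change of scale, C(u, v) = P(s*u, s*v)."""
    return (s * u, s * v)

# With a scale factor of 0.5, camera pixel (640, 480)
# corresponds to projector pixel (320, 240).
```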
In practice, capture devices do not exhibit pure perspective projection; they introduce perturbations of the mapping such that C(u,v) no longer equals P(su,sv). It is often possible to model this perturbation with first- and second-order approximations to the distortion, which is generally radial.
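A standard low-order radial model illustrates what such an approximation looks like (the model form is the conventional one; the coefficients k1, k2 and function name here are our own illustrative choices):

```python
def radially_distort(u, v, k1, k2, cx=0.0, cy=0.0):
    """First- and second-order radial distortion: a point at radius r
    from the image center (cx, cy) is displaced along its radius by
    the factor (1 + k1*r^2 + k2*r^4)."""
    du, dv = u - cx, v - cy
    r2 = du * du + dv * dv
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return (cx + factor * du, cy + factor * dv)
```

With k1 > 0 points bow outward (pincushion); with k1 < 0 they pull inward (barrel), as with the fish-eye lens used later in our experiments.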
The situation becomes much more difficult to model with the introduction of the display's distortion, such as off-axis projection and a non-planar display surface. In order to produce correct imagery for a user in such an environment, several parameters must be known [#!OOTF!#]: the geometry of the display surface, the intrinsic and extrinsic parameters of the capture and display devices, and the viewer's location.
Solving for these parameters is difficult and impractical. Even when all the parameters were known, rendering the imagery would be a computationally intensive operation, requiring a high-end graphics workstation for real-time execution.
We present a solution that avoids the difficulties of solving for the above parameters. Figure 2 outlines the idea. In Fig. 2(a) we see 3-D points M1, M2, M3 illuminated on the display surface by the projector pixels p1, p2, p3. These points project to the user's image plane at u1, u2, u3. Due to distortion in the camera, these points are perturbed to nearby distorted locations u1', u2', u3'. In Fig. 2(b) we place a camera where the viewer's eye is to be located. For this given location we find a mapping between the pixels in the projector and their viewed locations in the camera, i.e., pi -> ci. Part 2(c) shows the camera moved to a different location. The camera imagery is warped (ci -> pi) before the image is displayed through the projector. When the ``pre-warped'' image is displayed onto the display surface, each pixel undergoes the complex ``forward warp'', which undoes the pre-warp and moves the displayed pixels back to a location that will appear correct to the viewer.
This solution is a one-step method for finding the mapping that compensates for all of the geometric distortions introduced in the capture/display system. With this approach, producing correct imagery requires a 2-D warp of the captured coordinates ci to projector coordinates pj. This mapping is obtained directly by turning on the projector's pixels pi one-by-one and viewing them with the desired capture device.
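The essence of the method can be sketched as a lookup table built by illuminating projector pixels one at a time. In this sketch the `observe` callback stands in for the physical project-then-capture step, and all names are our own:

```python
def build_mapping(projector_pixels, observe):
    """For each projector pixel p, record the capture-device pixel c
    where it was seen. This single table encodes every geometric
    distortion in the capture/display chain at once."""
    return {p: observe(p) for p in projector_pixels}

def prewarp(camera_frame, mapping, background=(0, 0, 0)):
    """Build the image to feed the projector: each projector pixel
    takes the color the camera captured at its mapped location.
    The physical 'forward warp' then undoes this pre-warp, so the
    result appears correct from the calibrated viewpoint."""
    return {p: camera_frame.get(c, background) for p, c in mapping.items()}
```

Note that no intermediate model (surface shape, lens parameters, projector pose) ever appears; the table subsumes them all.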
The imagery will appear correct to the viewer when their eyes are located approximately where the capture device's center of projection was positioned when the mapping was computed. This location, the sweet spot, is the place from which the viewer should watch the imagery. As the viewer moves away from the sweet spot, the imagery will begin to appear distorted.
The technique easily scales for multiple overlapped projectors. We find the mapping between each projector individually. Corresponding projector pixels in the overlapped regions are mapped to the same camera pixel resulting in geometrically seamless imagery between the projectors, without the need for image mosaicing or other image processing. For camera devices that introduce a known distortion, such as compound camera devices or parabolic-mirror cameras, the technique can also apply directly as long as the device maintains a common center of projection.
[Figure 2: (a) We would like to produce correct imagery for the user. (b) We find a mapping between display and camera at the desired user location. (c) Using this mapping, the camera's imagery is warped so that it looks correct when viewed by the user.]
We have implemented several capture/display configurations using our technique. We developed our applications using Visual C++ on a Windows NT platform, running on Dell 410 and 610 workstations with 400 MHz Pentium II processors and 128 to 256 MB of memory. We used analog video networks to transfer the video signals. The prototypes presented here emphasize geometrically correct and compelling imagery, not reliable network delivery of data. Each PC uses Matrox Meteor II video capture card(s) and an Intergraph 3400 series graphics card for hardware texture-mapping.
Our software is broken into two stages: finding the display/capture mapping (which we loosely call calibration), and the rendering application. Each runs separately.
The calibration process collects the capture/display correspondences between the capture device (a single camera or a multi-camera rig) and display device (a set of fixed light projectors). Ideally, for every pixel P(xi,yi) in the display device, we need to know its corresponding pixel C(ui,vi) in the camera. It is often difficult and unnecessary to find correspondences for each pixel in the projectors. Usually the resolution of the camera is less than that of the projector, and a one-to-one mapping does not exist. Instead, we sub-sample the projector pixels from a given sweet spot. The calibration software allows the user to specify the sample resolution. The sub-sampled projector pixels are used to form a 2-D mesh in the projector's image plane.
Since the resolution of the capture device is less than that of the projector, we use features in the projector image instead of individual pixels. Our features are pixel blocks. The mapping is computed between the centroids of the features in the projector and the detected feature centroid in the capture device.
To facilitate feature detection, we use binary-coded structured light [#!strl!#] to obtain robust correspondence between features in the projector and the camera. Every feature is assigned a unique id. That id is coded in binary using n bits. We then create n black and white patterns as follows: for every blob in the kth pattern, it is colored white if the kth bit of its id is 1, otherwise it is assigned the background color (black). Viewing these patterns with a camera, it is easy to compute which feature is being viewed by reconstructing its id from the viewed bit-pattern. Synchronization is maintained by controlling the projector and camera from the same PC.
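The encoding and decoding just described amount to reading feature ids off a binary counter. A minimal sketch (function names are our own):

```python
def make_patterns(num_features, n_bits):
    """Pattern k colors feature f white (1) iff bit k of f's id is 1;
    n_bits patterns suffice for up to 2**n_bits features."""
    return [[(f >> k) & 1 for f in range(num_features)]
            for k in range(n_bits)]

def decode_id(observed_bits):
    """Reconstruct a feature id from the white/black sequence the
    camera observed for one feature across the n_bits patterns."""
    return sum(bit << k for k, bit in enumerate(observed_bits))
```

Thus 256 features need only 8 projected patterns, rather than 256 one-at-a-time illuminations.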
In Fig. 3 we give an example of four features; in (b), we illustrate the principle of detecting features via binary-coded structured light.
When the camera is located in the desired position (the sweet spot), the structured-light calibration automatically generates the explicit display/capture mapping. This data is used by the rendering application.
The rendering application uses the capture-to-display data to perform the image warping. From the sampled projector pixels P(xi,yi) a 2-D triangulated mesh is created in the projector image plane using Delaunay triangulation. For each vertex P(xi,yi), its corresponding capture coordinate C(ui,vi) is used as a texture coordinate. The camera's frame is used as a texture. The texture is updated every time a new frame from the camera is acquired. The rendering application is written in OpenGL, which is portable to various platforms (e.g., SGI) and is supported in hardware on a number of PC video cards.
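Per pixel, the texture-mapping hardware interpolates the camera (texture) coordinates across each mesh triangle. A sketch of that interpolation using barycentric coordinates (pure Python, for illustration only; the real system lets OpenGL do this):

```python
def barycentric(p, tri):
    """Barycentric coordinates of point p within triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    d = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    a = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / d
    b = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / d
    return a, b, 1.0 - a - b

def texture_coord(p, tri, tex):
    """Interpolate the per-vertex texture coordinates tex across tri
    to find which camera pixel colors projector pixel p."""
    a, b, c = barycentric(p, tri)
    return (a * tex[0][0] + b * tex[1][0] + c * tex[2][0],
            a * tex[0][1] + b * tex[1][1] + c * tex[2][1])
```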
This section outlines several capture/display configurations we have implemented. Refer to Table 1 for performance timings. The following section provides further discussion of these experiments.
Our first experiment uses a single camera with a fish-eye lens and a planar display surface. Fig. 4 (a) shows the input from the camera displayed without any processing. Radial distortion is obvious (note that straight lines appear curved). Finding a mapping between this device and the display produces a 2-D mesh (b). Texturing the 2-D mesh with the appropriate camera coordinates warps the imagery so it looks correct for the viewer (c). The radial distortion has been corrected.
[Figure 4: (a) The direct output from the camera with a fish-eye lens. (b) The triangle mesh rendered in wire-frame mode (this mesh is from a different dataset, but illustrates the idea). (c) After applying the 2-D warp, the displayed imagery looks correct and the radial distortion has been removed.]
[Figure 5: (a) We add a curved bump to the display surface. (b) Though corrected for radial distortion, the image is still distorted due to the non-planar display surface geometry. (c) Corrected imagery based on a new mapping, which compensates for both camera distortion and the non-planar display surface. (d) The same image as (c) with the curved surface highlighted.]
In the second experiment, we use the same camera/display configuration as before, but introduce a curved bump on the display surface, shown in Fig. 5(a). Using the previous capture/display mapping directly as shown in Fig. 4(b) does not yield correct imagery (Fig. 5(b)), because the non-planar display surface causes additional image distortion. After re-running the calibration process, the new mapping produces the correct imagery across the curved surface (Fig. 5(c)).
[Figure 6: (a) Two projectors without intensity blending. (b) Two projectors with intensity blending.]
For the third experiment, we use the same camera with two projectors overlapped side by side to create a WFOV display environment. Moving the camera to the desired viewer location, we calibrate each projector individually. Each projector is given its own mesh representing the capture/display mapping. The camera's video input is fed to two PCs that drive the projectors. Fig. 6(a) shows the resulting imagery, which is correct and geometrically seamless over the overlapped region of the projectors. Although our method produces geometrically seamless imagery in the overlapped region, a photometric seam is visible there because of double illumination, which leads to a bright strip. We use known techniques [#!OOTF!#,#!japan_paper!#] to compensate for this artifact: the intensity of the overlapped region is attenuated in each projector, such that a point illuminated by two projectors has the correct intensity value. Fig. 6(b) shows the result with blending.
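A simple linear ramp is one way to give each projector a weight in the overlap so the combined intensity stays constant (our illustrative choice; the cited techniques use comparable attenuation maps):

```python
def blend_weight(x, overlap_start, overlap_end, from_left):
    """Intensity weight for one projector at horizontal position x:
    full intensity outside the overlap on its own side, ramping
    linearly to zero across the shared region. The left and right
    projectors' weights always sum to 1, so a doubly illuminated
    point receives the correct total intensity."""
    if x <= overlap_start:
        return 1.0 if from_left else 0.0
    if x >= overlap_end:
        return 0.0 if from_left else 1.0
    t = (x - overlap_start) / (overlap_end - overlap_start)
    return 1.0 - t if from_left else t
```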
The fourth experiment uses a multi-camera and multi-projector system. The capture device is a common center of projection multi-camera rig known as the Camera-Cluster [#!CC!#], shown in Fig. 7. This cluster has twelve cameras, all sharing a common (virtual) COP. Fig. 7 (right) shows the output from four of the cameras. Only a portion of each output image (the region inside the grey partitions) is of interest.
In the camera cluster experiment, we use ten cameras to cover the desired horizontal and vertical FOV. We use five light projectors to create a panoramic display environment; adjacent projectors overlap. Each projector is driven by a PC with multiple incoming video streams.
We positioned the Camera-Cluster at the desired sweet spot within the display setting and performed the calibration one projector at a time. There is a slight overlap between the cameras, so we increased our sample rate to every eighth pixel. The calibration process took roughly an hour.
From the design of the Camera-Cluster and its positioning in our environment, one projector may need input from up to four cameras. Because of the large amount of data passed from the video capture cards to the graphics card (up to four 32-bit video streams), the output imagery's frame rate takes a performance hit. Table 1 describes the different setups and their corresponding performance. The frame rate is the average performance based on PCs with Pentium II 400-450 MHz CPUs and an Intergraph 3410 OpenGL-accelerated graphics card. We used one capture card (Matrox Meteor or Meteor II) per input channel.
Fig. 8 shows a partial panoramic view of the captured conference room (two projectors at the far left and right are clipped). This picture does not look geometrically correct because it was taken away from the sweet spot. Fig. 9 shows the geometrically corrected imagery taken from the sweet spot. Compared to the panoramic view, note that at the sweet spot the edges of the doors in the middle become straight.
Our technique has several limitations, the first of which is that in order to produce correct imagery, we must use the exact capture device for calibration and then for subsequent capture/display. This implies that for use with teleconferencing, we must calibrate with a camera and physically send it to the remote-site. This is not a practical solution.
We have performed experiments where we calibrate with one camera and then capture/display using a second, similar camera (i.e., same brand, model, and lens). Using the capture/display mapping obtained from the first camera, viewers are unable to distinguish the difference when the system runs using the second ``similar'' camera. Further experiments are necessary to determine just how ``similar'' two devices need to be before they produce noticeable distortion. Our experiments indicate that it is very reasonable to simply have participants use the same model camera.
Second, our technique produces correct imagery for only one viewer location. This limits the technique to a system that supports one static viewer. However, many participants have viewed the setup outlined in our last experiment (Camera Cluster and multiple projectors). Many of those participants were unaware that they were not in the correct location until we informed them of the ``sweet spot''. Depending on the capture-to-display configuration during calibration, distortion as a result of moving away from the sweet spot varies greatly.
We have found that many panoramic sensing devices that use multiple sensors to act as one logical device do not in fact provide a single common center of projection. The imagery produced in these cases will not be truly perspectively correct. In future work we will use several commercially available WFOV capture devices to get user reactions to the imagery they produce. We would like to determine how far a multi-camera rig can be from a common center of projection before the effects become noticeable to the viewers.
We have presented a direct and efficient technique for producing geometrically correct imagery for a stationary viewer. Using an explicit capture-to-display device mapping, we compensate for the complex geometries and distortions of the capture/display model. Our technique allows for the construction of realistic and compelling teleconferencing environments. We present results from several prototype systems we have implemented.
The translation was initiated by Ruigang Yang on 1999-07-07