
ICCV’15 Workshop Proposal:

3D Reconstruction and Understanding with Video and Sound

Proposers

Dinesh Manocha, University of North Carolina at Chapel Hill, dm@cs.unc.edu
Marc Pollefeys, ETH Zurich, marc.pollefeys@inf.ethz.ch
Rif A. Saurous, Google, rif@google.com
Rahul Sukthankar, Google, sukthankar@google.com
Ruigang Yang, University of Kentucky, ryang@cs.uky.edu
 

Duration: One day


Abstract and justification

One of the driving factors for innovation in computer vision is the availability of new types of input sensors and algorithms to process the sensor data. With the proliferation of commodity RGBD cameras, it is increasingly common to see new techniques that take advantage of the additional depth channel for tasks such as reconstruction, segmentation, and understanding.

For the scope of this workshop, we aim to investigate (or reintroduce) another readily available but often ignored source of information: audio or acoustic sensors. Combining these with visual cameras yields audio-visual sensing and opens the door to a new generation of algorithms and applications.

There are two major thrusts for this workshop. The first is the multimodal analysis of videos with sound for enhanced recognition accuracy, including application areas such as audio-visual speech recognition, video categorization or classification, and event detection in videos, as well as technical areas such as early vs. late fusion and end-to-end training of models. The second is to explore the use of acoustic sensors to facilitate the reconstruction and understanding of 3D objects/models beyond the capability of current RGBD sensors. This could include robust handling of scenes with specular or transparent objects, reconstruction around corners (i.e., non-line-of-sight) and through obstacles, or the capture of other material characteristics (e.g., acoustic material properties for aural rendering). In this context, acoustic sensors refer to a broad frequency range of sound, from infrasound to ultrasound.
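As a concrete illustration of the fusion terminology above, the sketch below contrasts early fusion (concatenate modality features, then classify jointly) with late fusion (classify each modality separately, then combine the scores). This is a minimal, hypothetical NumPy example: the feature dimensions, weights, and variable names are illustrative assumptions rather than a reference implementation, and an end-to-end trained system would learn the feature extractors and the fusion jointly instead of using fixed random weights as here.

```python
# Minimal sketch contrasting early vs. late audio-visual fusion.
# All dimensions and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy per-clip features (e.g., pooled log-mel statistics and pooled
# CNN activations); in practice these come from learned extractors.
audio_feat = rng.standard_normal(64)
visual_feat = rng.standard_normal(128)
n_classes = 10

# Early fusion: concatenate the modalities and classify jointly, so the
# classifier can model cross-modal correlations directly.
W_early = 0.01 * rng.standard_normal((n_classes, 64 + 128))
p_early = softmax(W_early @ np.concatenate([audio_feat, visual_feat]))

# Late fusion: independent per-modality classifiers, combined at the
# score level (here, a simple average of the two distributions).
W_audio = 0.01 * rng.standard_normal((n_classes, 64))
W_visual = 0.01 * rng.standard_normal((n_classes, 128))
p_late = 0.5 * softmax(W_audio @ audio_feat) \
       + 0.5 * softmax(W_visual @ visual_feat)

print("early fusion prediction:", p_early.argmax())
print("late fusion prediction:", p_late.argmax())
```

The usual trade-off is that early fusion can exploit fine-grained correlations between the modalities, while late fusion degrades more gracefully when one modality is noisy or missing.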

We plan to invite speakers from a broad spectrum of research areas, from traditional multimodal visual analysis, to digital signal processing, to inverse acoustic simulation. It is our hope to seed the vision community with a variety of promising ideas from other disciplines and thereby spark a new class of algorithms.

In the spirit of advancing sensing capability, we also plan to organize an industry panel with leading companies in 3D sensors and mobile handsets, presenting an opportunity to exchange ideas between academic research and industrial development. We hope the discussion will lead to a roadmap for the next generation of 3D sensors.


Tentative invited speakers

Zhengyou Zhang [Confirmed], MSR, http://research.microsoft.com/en-us/um/people/zhang/, 3D computer vision, audio processing and rendering, speech processing, spatial audio, multichannel AEC (acoustic echo cancellation), audio-visual fusion, active object detection and tracking.

Ming Lin [Confirmed], UNC-Chapel Hill, http://www.cs.unc.edu/~lin/, acoustic simulation, sound synthesis and propagation, acoustic feature extraction.

Ramesh Raskar [Confirmed], MIT, http://www.media.mit.edu/people/raskar, multi-modal cameras, non-line-of-sight scene reconstruction.

Bhiksha Raj [Confirmed], CMU, http://mlsp.cs.cmu.edu/people/bhiksha/index.php, machine learning for audio.

Yu-Gang Jiang [Confirmed], Fudan University, http://www.yugangjiang.info/, multimedia content analysis.

Alex Hauptmann, CMU, http://www.cs.cmu.edu/~alex/, multimedia analysis, indexing and interfaces.

Radu Horaud, INRIA, https://team.inria.fr/perception/team-members/radu-patrice-horaud/, audio-visual fusion and recognition techniques in conjunction with human-robot interaction.

Martin Vetterli, EPFL, http://lcav.epfl.ch/people/martin.vetterli, audio-visual communications, mathematical signal processing, plenoptic imaging, digital acoustics.

Jie Yang, US NSF Program Director for vision-related programs (on leave from CMU), http://www.cs.cmu.edu/~yang/, audio-video tracking, multimedia.

Andrew Senior, Google Research, http://research.google.com/pubs/author37792.html, audio-visual speech recognition, deep learning.

Dan Ellis, Columbia, https://www.ee.columbia.edu/~dpwe/, auditory scene analysis, extracting information from sound in many domains and guises.


Potential Industry Panelists

Giora Yahav, MS Israel, founder of 3DV Systems

Johnny Lee, Google, Project Tango

Chang Yun, SenseTime, Hong Kong

Avner Sander (formerly TYZX), MS

Arrigo Benedetti (formerly Canesta), MS

Gerard Medioni, Amazon/USC

Bennett Wilburn, Huawei Research USA (formerly at Lytro)

Greg Leeming, Intel


Structure of the workshop

We solicit papers in all areas that can benefit from the combined use of audio and visual signals. Sample topics include, but are not limited to:

● Multimodal sensing with both visual and aural sensors

● Abnormality detection

● Audio-visual speech recognition

● Video categorization or classification (with an audio component)

● Audio-visual communications

We particularly encourage position and forward-thinking papers. All papers will be reviewed by the workshop organizers.

We also realize that the emphasis on combining audio and video is relatively new from the traditional computer vision perspective, and we therefore plan to have 6-7 invited speakers and one panel discussion. One of the main purposes of this workshop is to broaden the horizon of visual computing through presentations by experts from outside the vision community, exploring opportunities beyond using visual signals alone.


Intended audience and relation to recent ECCV/CVPR/ICCV workshops

We invite anyone who is interested in using multimodal sensors to enhance the accuracy and capability of traditional image-based reconstruction and understanding. The proposed workshop falls under the broad concept of sensor fusion. In this regard, the IEEE/ISPRS 2nd Joint Workshop on Multi-Sensor Fusion for Dynamic Scene Understanding will be held at CVPR 2015. However, that workshop focuses on fusing dynamic spatial information from multiple sensors: stereo sequences, visual and infrared sequences, video and lidar sequences, stereo and laser sequences, etc. Our proposed workshop, in contrast, focuses on combining acoustic sensors with cameras, as well as on the joint analysis of audio and visual cues in videos. Furthermore, we also draw on recent research results from acoustic simulation and inverse acoustic reconstruction, and on their applications.


Is this a continuation of a workshop series?

No


Estimated attendance

We estimate that there will be 50-100 people attending this workshop.