One of the driving factors for innovation in computer vision is the availability of new types of input sensors and algorithms to process the sensor data. With the proliferation of commodity RGBD cameras, it is increasingly common to see new techniques that take advantage of the additional depth channel for tasks such as reconstruction, segmentation, and understanding.
For the scope of this workshop, we aim to investigate (or reintroduce) another readily available but often ignored source of information: audio or acoustic sensors. Combined with visual cameras, they form audiovisual sensors and enable a new generation of algorithms and applications.
There are two major thrusts for this workshop. The first is the multimodal analysis of videos with sound for enhanced recognition accuracy, including application areas such as audiovisual speech recognition, video categorization or classification, and event detection in videos, as well as technical areas such as early vs. late fusion and end-to-end training of models. The second is to explore the use of acoustic sensors to facilitate the reconstruction and understanding of 3D objects/models beyond the capability of current 3D RGBD sensors. This could include robust handling of scenes with specular or transparent objects, reconstruction around corners (i.e., non-line-of-sight) and through obstacles, or capturing other material characteristics (e.g., acoustic material properties for aural rendering). In this context, acoustic sensors refer to a broad frequency range of sound, from subsonic to ultrasound.
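To make the fusion terminology above concrete, here is a minimal sketch of early vs. late audiovisual fusion, written in PyTorch. It is illustrative only, not taken from any particular system: the feature dimensions, layer sizes, and class count (AUDIO_DIM, VIDEO_DIM, NUM_CLASSES) are hypothetical placeholders.

    # A minimal early- vs. late-fusion sketch. All dimensions below
    # (AUDIO_DIM, VIDEO_DIM, NUM_CLASSES) are hypothetical placeholders.
    import torch
    import torch.nn as nn

    AUDIO_DIM, VIDEO_DIM, NUM_CLASSES = 128, 512, 10

    class EarlyFusion(nn.Module):
        """Concatenate audio and video features, then classify jointly."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(AUDIO_DIM + VIDEO_DIM, 256), nn.ReLU(),
                nn.Linear(256, NUM_CLASSES))

        def forward(self, audio, video):
            return self.net(torch.cat([audio, video], dim=-1))

    class LateFusion(nn.Module):
        """Classify each modality separately, then average the logits."""
        def __init__(self):
            super().__init__()
            self.audio_net = nn.Sequential(
                nn.Linear(AUDIO_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_CLASSES))
            self.video_net = nn.Sequential(
                nn.Linear(VIDEO_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_CLASSES))

        def forward(self, audio, video):
            return 0.5 * (self.audio_net(audio) + self.video_net(video))

    # Either model can be trained end-to-end with a standard cross-entropy loss.
    audio, video = torch.randn(4, AUDIO_DIM), torch.randn(4, VIDEO_DIM)
    print(EarlyFusion()(audio, video).shape)  # torch.Size([4, 10])
    print(LateFusion()(audio, video).shape)   # torch.Size([4, 10])

In the early-fusion model a single network sees both modalities from the start; in the late-fusion model each modality is scored independently and only the decisions are combined, which can be more robust when one modality is degraded.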
We plan to invite speakers from a broad spectrum of research areas, from traditional multimodal visual analysis, to digital signal processing, to inverse acoustic simulation. It is our hope to seed the vision community with a variety of promising ideas from other disciplines, resulting in a new class of algorithms.
In the spirit of advancing sensing capability, we also plan to organize an industrial panel with leading companies in 3D sensors and mobile handsets, presenting an opportunity to exchange ideas between academic research and industrial development. We hope the discussion will lead to a roadmap for the next generation of 3D sensors.
● Zhengyou Zhang [Confirmed], MSR, http://research.microsoft.com/en-us/um/people/zhang/, 3D computer vision, audio processing and rendering, speech processing, spatial audio, multichannel AEC, audiovisual fusion, active object detection and tracking.
● Ming Lin [Confirmed], UNC-Chapel Hill, http://www.cs.unc.edu/~lin/, acoustic simulation, sound synthesis and propagation, acoustic feature extraction.
● Ramesh Raskar [Confirmed], MIT, http://www.media.mit.edu/people/raskar, multimodal cameras, non-line-of-sight scene reconstruction.
● Bhiksha Raj [Confirmed], CMU, http://mlsp.cs.cmu.edu/people/bhiksha/index.php, machine learning for audio.
● Yu-Gang Jiang [Confirmed], Fudan University, http://www.yugangjiang.info/, multimedia content analysis.
● Alex Hauptmann, CMU, http://www.cs.cmu.edu/~alex/, multimedia analysis, indexing and interfaces.
● Radu Horaud, INRIA, https://team.inria.fr/perception/team-members/radu-patrice-horaud/, audiovisual fusion and recognition techniques in conjunction with human-robot interaction.
● Martin Vetterli, EPFL, http://lcav.epfl.ch/people/martin.vetterli, audiovisual communications, mathematical signal processing, plenoptic imaging, digital acoustics.
● Jie Yang, US NSF Program Director for vision-related programs (on leave from CMU), http://www.cs.cmu.edu/~yang/, audio-video tracking, multimedia.
● Andrew Senior, Google Research, http://research.google.com/pubs/author37792.html, audiovisual speech recognition, deep learning.
● Dan Ellis, Columbia, https://www.ee.columbia.edu/~dpwe/, auditory scene analysis, extracting information from sound in many domains and guises.
● Giora Yahav, MS Israel, founder of 3DV Systems
● Johnny Lee, Google, Project Tango
● Chang Yun, SenseTime, Hong Kong
● Avner Sander (formerly TYZX), MS
● Arrigo Benedetti (formerly Canesta), MS
● Gerard Medioni, Amazon/USC
● Bennett Wilburn, Huawei Research USA (formerly at Lytro)
● Greg Leeming, Intel
We solicit papers in all areas that can benefit from the combined use of audio and visual signals. Sample topics include, but are not limited to:
● Multimodal sensing with both visual and aural sensors
● Abnormality detection
● Audiovisual speech recognition
● Video categorization or classification (with an audio component)
● Audiovisual communications
We particularly encourage position and forward-thinking papers. All papers will be reviewed by the workshop organizers.
We also realize that the emphasis on combining audio and video is relatively new from the traditional computer vision perspective; we therefore plan to have 6-7 invited speakers and one panel discussion. One of the main purposes of this workshop is to broaden the horizon of visual computing through presentations by experts from outside the vision community, exploring opportunities beyond using visual signals alone.
We invite anyone who is interested in using multimodal sensors to enhance the accuracy and capability of traditional visual-image-based reconstruction and understanding. The proposed workshop falls under the broad concept of sensor fusion. In this regard, the IEEE/ISPRS 2nd Joint Workshop on Multi-Sensor Fusion for Dynamic Scene Understanding will be held at CVPR 2015. However, that workshop focuses on fusing dynamic spatial information from multiple sensors: stereo sequences, visual and infrared sequences, video and lidar sequences, stereo and laser sequences, etc. Our proposed workshop, on the other hand, focuses on issues in combining acoustic sensors with cameras, as well as the combined analysis of audio and visual cues in videos. Furthermore, we also borrow recent research results from acoustic simulation, inverse acoustic reconstruction, and their applications.
We estimate that 50-100 people will attend this workshop.