Tutorials

The role of the tutorials is to provide a platform for more intensive scientific exchange amongst researchers interested in a particular topic and to serve as a meeting point for the community. Tutorials complement the depth-oriented technical sessions by providing participants with broad overviews of emerging fields. A tutorial can be scheduled for 1.5 or 3 hours.

TUTORIALS LIST

Tutorial on Building the Next Generation of Virtual Personal Assistants with First Person (Egocentric) Vision: From Visual Intelligence to AI and Future Predictions (VISIGRAPP)
Instructor: Antonino Furnari

Tutorial on
Building the Next Generation of Virtual Personal Assistants with First Person (Egocentric) Vision: From Visual Intelligence to AI and Future Predictions

Instructor

Antonino Furnari, University of Catania, Italy

Brief Bio

Antonino Furnari is an Assistant Professor at the University of Catania. He received his PhD in Mathematics and Computer Science in 2017 from the University of Catania and has authored one patent and more than 50 papers in international book chapters, journals and conference proceedings. Antonino Furnari is involved in the organization of several international events, such as the Assistive Computer Vision and Robotics (ACVR) workshop series (since 2016), the International Computer Vision Summer School (ICVSS) (since 2017), the Egocentric Perception Interaction and Computing (EPIC) workshop series (since 2018), and the EGO4D workshop series (since 2022). Since 2018, he has been involved in the collection, release, and maintenance of the EPIC-KITCHENS dataset series, and in particular in the egocentric action anticipation and action detection challenges. Since 2021, he has been involved in the collection and benchmarking of the EGO4D dataset. He is co-founder of NEXT VISION s.r.l., an academic spin-off of the University of Catania since 2021. His research interests concern Computer Vision, Pattern Recognition, and Machine Learning, with a focus on First Person Vision. More information is available at http://www.antoninofurnari.it/.

Abstract

In recent years, the market has witnessed the appearance of several wearable devices equipped with sensing, processing, and display abilities, such as Microsoft HoloLens 2, Vuzix Blade, Google Glass and Magic Leap One. Due to their intrinsic mobility and the ability to mix the real and digital worlds through Augmented Reality, such devices are perfect platforms to develop virtual personal assistants capable of seeing the world from the user’s perspective and augmenting their abilities. In this context, sensing goes beyond the collection and analysis of RGB images, with modalities such as depth, IMU, and gaze usually available. Nevertheless, Computer Vision plays a fundamental role in the egocentric perception pipelines of such systems.
Unlike standard “third person vision”, in which the processed images and video are acquired from a static point of view neutral to the events, first person (egocentric) vision assumes that images and video are acquired from the non-static and rather “personal” point of view of the user by means of a wearable device. These unique properties make first person (egocentric) vision different from standard third person vision. Most notably, the visual information collected using wearable cameras always provides useful information about the users, revealing what they do, what they pay attention to and how they interact with the world.
In this tutorial, we will discuss the challenges and opportunities offered by first person (egocentric) vision. We will cover the historical background and seminal works in the field, present the main technological tools (including devices and algorithms) which can be used to analyze first person visual data, and discuss challenges and open problems. The last part of the tutorial will focus on an emerging line of work on the prediction of future events from first person vision, which is paramount for the development of virtual personal assistants.

Keywords

Wearable, First Person Vision, Egocentric Vision, Augmented Reality, Visual Localization, Action Recognition, Action Anticipation.

Aims and Learning Objectives

The participants will understand the main advantages of first person (egocentric) vision over third person vision for analyzing the user’s behavior, building personalized applications and predicting future events. Specifically, the participants will learn about: 1) the main differences between third person and first person (egocentric) vision, including the way in which the data is collected and processed; 2) the devices which can be used to collect data and provide services to the users; 3) the algorithms which can be used to manage first person visual data, for instance to perform localization, indexing, object detection, action recognition, and the prediction of future events.

Target Audience

First year PhD students, graduate students, researchers, practitioners.

Prerequisite Knowledge of Audience

Fundamentals of Computer Vision and Machine Learning (including Deep Learning).

Detailed Outline

The tutorial will cover the following topics:

• Outline of the tutorial;
• Definitions, motivations, history and research trends of First Person (egocentric) Vision;
• Differences between third person and first person vision;
• First Person Vision datasets;
• Wearable devices to acquire/process first person visual data;
• Fundamental tasks for first person vision systems:
o Localization;
o Hand/Object detection;
o Attention;
o Action/Activity recognition;
o Action anticipation;
• Technological tools (devices and algorithms) which can be used to build first person vision applications;
• Challenges and open problems;
• Conclusions and insights for research in the field.

Secretariat Contacts
e-mail: visigrapp.secretariat@insticc.org