
Research

Computer vision, video understanding, multimodal learning, collective intelligence, and biologically-inspired neural mechanisms.

Google Scholar

28 Publications · 850+ Citations · 2 Patents · 10 Top Venues

Video Understanding & Visual Planning

Procedural video understanding, action forecasting, egocentric perception, and visual planning for human assistance.

7 papers · ICCV, WACV, IROS

Multimodal & Cross-Modal Learning

Fusing information across vision, audio, language, and gaze modalities for richer representations and generation.

8 papers · CVPR, ICASSP, ICCV, IUI

Representation Learning & Neural Dynamics

Anytime prediction, few-shot and continual learning, attractor networks, and biologically-inspired temporal processing.

7 papers · NeurIPS, Patent

Collective Intelligence & Bio-Inspired Systems

Honeybee swarm communication, firefly classification, and collective behavior modeling through agent-based simulation.

6 papers · PNAS, Sci. Reports, ACM CI

Other Projects

Computer graphics, space habitat design, and other creative explorations.

2 projects

Video Understanding & Visual Planning

WACV 2026

Enhancing Visual Planning with Auxiliary Tasks and Multi-Token Prediction

Ce Zhang, Yale Song, Ruta Desai, Michael L. Iuzzolino, Joseph Tighe, Gedas Bertasius, Satwik Kottur
VideoPlan addresses two key challenges in visual planning: limited procedural training data and inefficient prediction. The approach introduces auxiliary-task augmentation and multi-token prediction, achieving improved performance on standard benchmarks such as COIN and CrossTask.
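The multi-token component lends itself to a compact illustration. Below is a minimal sketch, not the paper's implementation: K parallel heads decode the next K plan steps from one context embedding instead of one action per forward pass; the module name, sizes, and vocabulary are illustrative assumptions.

```python
# Hypothetical sketch of multi-token prediction for plan generation.
# K parallel heads predict the next K action tokens from one context
# embedding, so a plan needs fewer decoding passes. Names and shapes
# are illustrative, not VideoPlan's actual implementation.
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, dim, vocab_size, k=4):
        super().__init__()
        # One linear head per future position t+1 .. t+K.
        self.heads = nn.ModuleList(nn.Linear(dim, vocab_size) for _ in range(k))

    def forward(self, context):  # context: (B, dim) pooled video/goal features
        return torch.stack([h(context) for h in self.heads], dim=1)  # (B, K, V)

model = MultiTokenHead(dim=256, vocab_size=100, k=4)
logits = model(torch.randn(8, 256))                   # (8, 4, 100)
targets = torch.randint(0, 100, (8, 4))               # next 4 ground-truth actions
loss = nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
```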
ICCV 2025

EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos

Yuxuan Li, Vijay Veerabadran, Michael L. Iuzzolino, Bradley D. Roads, Asli Celikyilmaz, Karl Ridgeway
A benchmark for evaluating AI systems' abilities to understand human mental states through egocentric video analysis — reasoning about beliefs, intentions, and goals from first-person perspectives.
ICCV 2023

Pretrained Language Models as Visual Planners for Human Assistance

Dhruvesh Patel, Hamid Eghbalzadeh, Nitin Kamra, Michael L. Iuzzolino, Unnat Jain, Ruta Desai
VLaMP leverages pre-trained language models as sequence models for visual planning, generating action sequences to guide users toward complex multi-step goals using both video analysis and natural language input.
arXiv 2023

Multiscale Video Pretraining for Long-Term Activity Forecasting

Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani
MVP is a self-supervised pretraining method that helps models capture temporal patterns in video across different time scales. The approach learns to forecast future video representations and yields significant improvements on activity forecasting tasks, with accuracy gains exceeding 20% on certain benchmarks.
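A rough sketch of that objective, under stated assumptions (per-scale linear heads, mean-pooled targets, and a cosine loss are stand-ins; the actual MVP pretraining setup differs in its encoder and loss details):

```python
# Sketch: forecast aggregated future clip embeddings at several horizons,
# so the representation must capture short- and long-range temporal structure.
import torch
import torch.nn.functional as F

def multiscale_forecast_loss(cur, future, heads, scales=(1, 2, 4)):
    """cur: (B, D) current clip embedding; future: (B, T, D) embeddings of
    upcoming clips; heads[i]: forecasting head for scales[i]."""
    loss = 0.0
    for i, s in enumerate(scales):
        target = future[:, :s].mean(dim=1)        # aggregate the next s clips
        pred = heads[i](cur)                      # forecast at this time scale
        loss = loss + (1 - F.cosine_similarity(pred, target, dim=-1)).mean()
    return loss / len(scales)

heads = torch.nn.ModuleList(torch.nn.Linear(128, 128) for _ in range(3))
loss = multiscale_forecast_loss(torch.randn(8, 128), torch.randn(8, 4, 128), heads)
```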
arXiv 2023

Action Dynamics Task Graphs for Learning Plannable Representations of Procedural Tasks

Weichao Mao, Ruta Desai, Michael L. Iuzzolino, Nitin Kamra
A method for extracting task structure from video demonstrations paired with narrations of procedural activities. Creates structured representations enabling task understanding in unseen videos, action recommendations, and sequence planning.
arXiv 2023

EgoAdapt: A Multi-Stream Evaluation Study of Adaptation to Real-World Egocentric User Video

Matthias De Lange, Hamid Eghbalzadeh, Reuben Tan, Michael Iuzzolino, Franziska Meier, Karl Ridgeway
A multi-stream evaluation study examining how models adapt to real-world egocentric user video, benchmarking adaptation strategies across different data streams and user scenarios.
IROS 2018

Virtual-to-Real-World Transfer Learning for Robots on Wilderness Trails

Michael L. Iuzzolino, Michael E. Walker, Daniel Szafir
Explores virtual-to-real-world transfer learning using deep models trained to classify trail direction from images. Synthetic data from virtual environments bypasses the need to collect large amounts of real outdoor imagery, and the models reach classification accuracies upwards of 95% on synthetic data.

Multimodal & Cross-Modal Learning

arXiv 2026

Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals

Moritz Vandenhirtz, Kaveh Hassani, Sima Ghasemlou, Shawn Shao, Hamid Eghbalzadeh, Michael Iuzzolino, et al.
A multimodal generative approach to recommendation that fuses semantic understanding with collaborative filtering signals to improve recommendation quality.
ICCV 2025

Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths

Soham Mondal, Naveen Sendhilnathan, Ting Zhang, Yupei Liu, Michael Proulx, Michael L. Iuzzolino, et al.
Aligns gaze scanpath data with language representations for zero-shot prediction of what users are searching for visually, bridging eye-tracking and natural language understanding.
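The zero-shot part rests on the familiar contrastive-alignment recipe. A minimal CLIP-style sketch, assuming encoders that already map scanpaths and target words into a shared embedding space (the paper's actual architecture and loss may differ):

```python
# Symmetric InfoNCE over matched (scanpath, target-word) pairs; at test time,
# the predicted search target is the text embedding nearest the gaze embedding,
# which is what makes prediction zero-shot for unseen targets.
import torch
import torch.nn.functional as F

def gaze_text_alignment_loss(gaze_emb, text_emb, temperature=0.07):
    gaze = F.normalize(gaze_emb, dim=-1)          # (B, D) scanpath embeddings
    text = F.normalize(text_emb, dim=-1)          # (B, D) target-word embeddings
    logits = gaze @ text.T / temperature          # (B, B): pairs on the diagonal
    labels = torch.arange(len(gaze))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss = gaze_text_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```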
ICCV 2025

Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation

Thomas Lebailly, Vijay Veerabadran, Satwik Kottur, Karl Ridgeway, Michael L. Iuzzolino
Leverages synthetically generated captions for open-vocabulary zero-shot segmentation, enabling models to segment objects from categories not seen during training.
IUI 2025

Less or More: Towards Glanceable Explanations for LLM Recommendations Using Ultra-Small Devices

Xinru Wang, Mengjie Yu, Hannah Nguyen, Michael Iuzzolino, Tianyi Wang, Peiqi Tang, Natasha Lynova, Co Tran, Ting Zhang, Naveen Sendhilnathan, Hrvoje Benko, Haijun Xia, Tanya Jonker
Explores how to present AI-generated explanations effectively on devices with limited screen space, such as smartwatches, through structured text formatting and adaptive presentation strategies.
CVPR 2020 Microsoft Research

MMTM: Multimodal Transfer Module for CNN Fusion

Hamid Reza Vaezi Joze, Amirreza Shaban, Michael L. Iuzzolino, Kazuhito Koishida
A simple neural network module for leveraging knowledge from multiple modalities in convolutional neural networks. The Multimodal Transfer Module (MMTM) can be inserted at different levels of the feature hierarchy to enable slow modality fusion: using squeeze-and-excitation operations, it recalibrates the channel-wise features of each modality. Achieves state-of-the-art performance on gesture recognition, speech enhancement, and action recognition. 534+ citations.
[Figure: MMTM architecture]
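The fusion mechanism itself compresses to a few lines. A condensed sketch (the official implementation differs in details such as the exact gating nonlinearity; channel sizes here are arbitrary):

```python
# Squeeze both modalities' feature maps, mix them in a joint bottleneck,
# and emit per-channel gates that recalibrate each stream.
import torch
import torch.nn as nn

class MMTMSketch(nn.Module):
    def __init__(self, c_a, c_b, ratio=4):
        super().__init__()
        joint = (c_a + c_b) // ratio
        self.squeeze = nn.Linear(c_a + c_b, joint)   # shared bottleneck
        self.excite_a = nn.Linear(joint, c_a)        # per-modality excitation
        self.excite_b = nn.Linear(joint, c_b)

    def forward(self, a, b):  # a: (B, C_A, H, W), b: (B, C_B, H', W')
        z = torch.cat([a.mean(dim=(2, 3)), b.mean(dim=(2, 3))], dim=1)  # squeeze
        z = torch.relu(self.squeeze(z))
        g_a = torch.sigmoid(self.excite_a(z))[:, :, None, None]         # excite
        g_b = torch.sigmoid(self.excite_b(z))[:, :, None, None]
        return a * g_a, b * g_b                      # channel-wise recalibration

out_a, out_b = MMTMSketch(64, 32)(torch.randn(2, 64, 14, 14), torch.randn(2, 32, 7, 7))
```

Because each modality is squeezed to a channel vector before mixing, the two streams may have different spatial shapes, which is what lets the module drop in at arbitrary depths of each network.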
ICASSP 2020 Microsoft Research

AV(SE)²: Audio-Visual Squeeze-Excite Speech Enhancement

Michael L. Iuzzolino, Kazuhito Koishida
A new mechanism for audio-visual fusion that leverages a cross-modal squeeze-excitation block for speech enhancement. The fusion block is adaptable to any feature layer and significantly reduces model parameters compared to standard AV fusion methods without loss of performance.
[Figure: audio-visual speech enhancement architecture]
US Patent 11,244,696 · 2022

Audio-Visual Speech Enhancement

Kazuhito Koishida, Michael Iuzzolino
Patent for audio-visual speech enhancement methods using cross-modal squeeze-excitation mechanisms to improve speech intelligibility in noisy environments.
Microsoft Research

Real-Time 3D Object Retrieval and Pose Estimation from 2D RGB Images

Fast, simultaneous 3D model retrieval and pose estimation from single RGB images utilizing multi-view deep metric learning and fully convolutional networks. The method is category-agnostic, scalable, and allows for real-time 3D-model-to-real-object alignment.
[Figures: 3D bottle and car pose estimation]

Representation Learning & Neural Dynamics

NeurIPS 2021

Improving Anytime Prediction with Parallel Cascaded Networks and a Temporal-Difference Loss

Michael Iuzzolino, Michael C. Mozer, Samy Bengio
A biologically-inspired architecture where information propagates from neurons at all layers in parallel but transmission occurs gradually over time, leading to speed-accuracy trade-offs. The cascaded ResNet approach uses a novel temporal-difference training loss and enables predictions that improve as internal processing time increases.
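The temporal-difference idea can be sketched compactly, assuming one output per internal processing step (the mixing constant and exact target construction below are illustrative, not the paper's):

```python
# Each step's output is trained toward a blend of the (detached) next-step
# prediction and the true label, pulling early read-outs toward later,
# more refined ones.
import torch
import torch.nn.functional as F

def td_anytime_loss(logits_per_step, labels, lam=0.9):
    """logits_per_step: list of (B, C) outputs, one per processing step."""
    T = len(logits_per_step)
    onehot = F.one_hot(labels, logits_per_step[0].shape[1]).float()
    loss = 0.0
    for t in range(T):
        if t < T - 1:   # bootstrap from the next step's (frozen) prediction
            nxt = logits_per_step[t + 1].detach().softmax(dim=-1)
            target = lam * nxt + (1 - lam) * onehot
        else:           # final step: supervised by the label alone
            target = onehot
        loss = loss + F.cross_entropy(logits_per_step[t], target)
    return loss / T

steps = [torch.randn(8, 10, requires_grad=True) for _ in range(5)]
loss = td_anytime_loss(steps, torch.randint(0, 10, (8,)))
```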
US Patent App. 17/560,139 · 2022

Parallel Cascaded Neural Networks

Michael C. Mozer, Michael L. Iuzzolino, Samy Bengio
Patent application for the parallel cascaded neural network architecture enabling anytime prediction with speed-accuracy trade-offs.
arXiv 2021

Online Unsupervised Learning of Visual Representations and Categories

Mengye Ren, Tyler R. Scott, Michael L. Iuzzolino, Michael C. Mozer, Richard Zemel
A prototype-based memory network that concurrently learns visual representations and identifies new object categories from sequential, nonstationary data streams without class labels. Uses contrastive learning to group different views while forming categorical prototypes from minimal examples.
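A toy sketch of the prototype-memory loop, assuming unit-normalized embeddings arrive one at a time (the threshold, momentum update, and new-category rule are illustrative stand-ins for the paper's mechanism):

```python
import torch

class PrototypeMemory:
    def __init__(self, dim, new_class_threshold=0.5, momentum=0.9):
        self.protos = torch.empty(0, dim)            # one row per category
        self.thresh, self.m = new_class_threshold, momentum

    def observe(self, x):  # x: (dim,) unit-normalized embedding
        if len(self.protos) == 0:
            self.protos = x[None].clone()
            return 0
        sims = self.protos @ x                       # cosine similarities
        best = int(sims.argmax())
        if sims[best] < self.thresh:                 # too far: new category
            self.protos = torch.cat([self.protos, x[None]])
            return len(self.protos) - 1
        # Close enough: fold the example into the matched prototype.
        self.protos[best] = self.m * self.protos[best] + (1 - self.m) * x
        self.protos[best] /= self.protos[best].norm()
        return best

mem = PrototypeMemory(dim=64)
stream = torch.nn.functional.normalize(torch.randn(100, 64), dim=-1)
labels = [mem.observe(x) for x in stream]            # self-assigned category ids
```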
arXiv 2020

Wandering Within a World: Online Contextualized Few-Shot Learning

Mengye Ren, Michael L. Iuzzolino, Michael C. Mozer, Richard S. Zemel
Extends few-shot learning to an online, continual setting in which the underlying context changes over time. Introduces a new few-shot learning dataset based on large-scale indoor imagery that mimics the visual experience of an agent wandering within a world.
arXiv 2020

In Automation We Trust: Investigating the Role of Uncertainty in Active Learning Systems

Michael L. Iuzzolino, Takeo Umada, Nisar R. Ahmed, Daniel A. Szafir
An empirical study of how active learning query policies and uncertainty visualizations influence analyst trust in automated image classification, finding that the choice of query policy significantly affects user trust.
arXiv 2019

Convolutional Bipartite Attractor Networks

Michael Iuzzolino, Yehonatan Singer, Michael C. Mozer
Revisits attractor networks in light of modern deep learning methods and proposes a convolutional bipartite architecture with a novel training loss, activation function, and connectivity constraints. Demonstrates potential for image completion and super-resolution.
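The settling dynamics are simple to illustrate in a dense (non-convolutional) form; this is a caricature of bipartite attractor updates, not the paper's architecture:

```python
import torch

def settle(W, v, steps=20):
    """W: (hidden, visible) shared weights; v: (B, visible) initial state,
    e.g. a corrupted image flattened to a vector."""
    for _ in range(steps):
        h = torch.tanh(v @ W.T)   # hidden layer updated from visible
        v = torch.tanh(h @ W)     # visible updated back through the same weights
    return v                      # approximate attractor: the "completed" input

W = 0.1 * torch.randn(32, 64)
v_star = settle(W, torch.randn(4, 64))
```

Training shapes W so that the attractors coincide with clean inputs, which is why the same settling machinery can serve both image completion and super-resolution.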
AIAA 2019 · J. Aerospace Info. Systems 2021

Fully Bayesian Human-Machine Data Fusion for Robust Dynamic Target Surveillance

Jacob Muesing, Luke Burks, Michael Iuzzolino, Daniel Szafir, Nisar R. Ahmed
A hierarchical fully Bayesian probabilistic model that explicitly accounts for uncertainties in 'human sensor' quality and Markovian conditional dependencies. Performs online Bayesian inference via Gibbs sampling to simultaneously calibrate human sensor characteristics and estimate target states.
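The alternating inference can be conveyed with a much-simplified toy: a static target, one human sensor, and flat priors (the paper's model additionally handles dynamic targets and Markovian dependencies):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 50                                   # grid cells, human reports
true_cell, noise = 2, 0.3
obs = np.where(rng.random(T) < noise, rng.integers(0, N, T), true_cell)

cell = int(rng.integers(0, N))                 # initial guess of target cell
C = np.full((N, N), 1.0 / N)                   # per-cell report distributions
counts = np.bincount(obs, minlength=N)
for _ in range(200):                           # Gibbs sweeps
    # Sample each confusion row; only the current cell's row has data
    # attached, the rest are resampled from the Dirichlet prior.
    for s in range(N):
        C[s] = rng.dirichlet(np.ones(N) + (counts if s == cell else 0))
    # Sample the target cell given the reports and confusion rows.
    logp = np.array([np.log(C[s][obs] + 1e-12).sum() for s in range(N)])
    p = np.exp(logp - logp.max()); p /= p.sum()
    cell = int(rng.choice(N, p=p))
print("inferred cell:", cell)
```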

Collective Intelligence & Bio-Inspired Systems

Scientific Reports 2024

Embracing Firefly Flash Pattern Variability with Data-Driven Species Classification

Owen Martin, Chantal Nguyen, Raphaël Sarfati, Manaswi Chowdhury, Michael L. Iuzzolino, Dieu My T. Nguyen, et al.
Data-driven approaches for classifying firefly species based on their bioluminescent flash patterns, embracing natural variability rather than relying on idealized templates.
ACM Collective Intelligence 2023

Gone With the Wind: Honey Bee Collective Scenting in the Presence of External Wind

Dieu My T. Nguyen, Golnar Gharooni Fard, Michael L. Iuzzolino, Orit Peleg
Investigates how external wind conditions affect collective scenting behavior in honeybee swarms and the resilience of pheromone-based communication.
Artificial Life and Robotics 2023

Honey Bees Find the Shortest Path: A Collective Flow-Mediated Approach

Dieu My T. Nguyen, Golnar G. Fard, Autumn Atkins, Peter Bontempo, Michael L. Iuzzolino, Orit Peleg
Investigates how honeybee swarms collectively solve shortest-path problems through flow-mediated pheromone communication mechanisms.
Artificial Life and Robotics 2022

Robustness of Collective Scenting in the Presence of Physical Obstacles

Dieu My T. Nguyen, Golnar G. Fard, Michael L. Iuzzolino, Orit Peleg
Examines how physical obstacles affect the robustness of collective scenting behavior in honeybee swarms and the resilience of pheromone-based communication networks.
ALIFE 2022

Physical Obstacles Constrain Behavioral Parameter Space of Successful Localization in Honey Bee Swarms

Dieu My T. Nguyen, Michael L. Iuzzolino, Orit Peleg
Explores how physical obstacles constrain the behavioral parameter space available to honeybee swarms for successful scent-based localization.
PNAS 2021

Flow-Mediated Olfactory Communication in Honeybee Swarms

Dieu My T. Nguyen, Michael L. Iuzzolino, A. Mankel, K. Bozek, G.J. Stephens, Orit Peleg
Combines deep-learning-based bee detection and classification with agent-based modeling of honeybee swarm communication through flow-mediated pheromone signaling.
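The agent-based half of the approach is easy to caricature in one dimension; everything below (parameters, movement rule, advection scheme) is an arbitrary toy, not the published model:

```python
import numpy as np

rng = np.random.default_rng(1)
L, n_bees, steps = 100, 30, 300
field = np.zeros(L); field[0] = 5.0            # queen's pheromone source at cell 0
bees = rng.integers(10, L, n_bees)             # scattered worker bees

for _ in range(steps):
    scenting = field[bees] > 0.05              # bees sensing enough pheromone...
    np.add.at(field, bees[scenting], 0.5)      # ...amplify the signal locally
    # Crude advection away from the queen, plus decay (the "flow-mediated" part).
    field = 0.95 * (0.8 * field + 0.2 * np.roll(field, 1))
    field[0] = 5.0
    # Bees step up the local concentration gradient, with some noise.
    left = field[np.maximum(bees - 1, 0)]
    right = field[np.minimum(bees + 1, L - 1)]
    bees = np.clip(bees + np.sign(right - left).astype(int)
                   + rng.integers(-1, 2, n_bees), 0, L - 1)
print("mean distance to queen:", bees.mean())
```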

Other Projects

Space Habitat Design

[Figures: space habitat design renders]