Video Understanding & Visual Planning
Procedural video understanding, action forecasting, egocentric perception, and visual planning for human assistance.
7 papers
ICCV · WACV · IROS
Multimodal & Cross-Modal Learning
Fusing information across vision, audio, language, and gaze modalities for richer representations and generation.
8 papers
CVPR · ICASSP · ICCV · IUI
Representation Learning & Neural Dynamics
Anytime prediction, few-shot and continual learning, attractor networks, and biologically-inspired temporal processing.
7 papers
NeurIPS · Patent
Collective Intelligence & Bio-Inspired Systems
Honeybee swarm communication, firefly classification, and collective behavior modeling through agent-based simulation.
6 papers
PNAS · Sci. Reports · ACM CI
Other Projects
Computer graphics, space habitat design, and other creative explorations.
2 projects
Video Understanding & Visual Planning
WACV 2026
Enhancing Visual Planning with Auxiliary Tasks and Multi-Token Prediction
Ce Zhang, Yale Song, Ruta Desai, Michael L. Iuzzolino, Joseph Tighe, Gedas Bertasius, Satwik Kottur
VideoPlan addresses two key challenges in visual planning: limited procedural training data and inefficient prediction. The approach introduces auxiliary task augmentation and multi-token prediction, improving performance on the standard COIN and CrossTask benchmarks.
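The multi-token idea can be illustrated with a toy sketch (the function name, linear heads, and shapes here are illustrative assumptions, not the paper's architecture): instead of decoding one action token per forward pass, K lightweight heads read the same hidden state and emit the next K tokens at once.

```python
import numpy as np

def multi_token_predict(hidden, head_weights):
    """Toy multi-token prediction: each of K linear heads reads the same
    hidden state and predicts one of the next K action tokens in parallel,
    instead of decoding autoregressively one token at a time."""
    return [int(np.argmax(W @ hidden)) for W in head_weights]
```
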
EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos
Yuxuan Li, Vijay Veerabadran, Michael L. Iuzzolino, Bradley D. Roads, Asli Celikyilmaz, Karl Ridgeway
A benchmark for evaluating AI systems' abilities to understand human mental states through egocentric video analysis — reasoning about beliefs, intentions, and goals from first-person perspectives.
Pretrained Language Models as Visual Planners for Human Assistance
Dhruvesh Patel, Hamid Eghbalzadeh, Nitin Kamra, Michael L. Iuzzolino, Unnat Jain, Ruta Desai
VLaMP leverages pre-trained language models as sequence models for visual planning, generating action sequences to guide users toward complex multi-step goals using both video analysis and natural language input.
Multiscale Video Pretraining for Long-Term Activity Forecasting
Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani
MVP is a self-supervised pretraining method that trains models to capture temporal patterns in video across different time scales. The approach learns to forecast future video representations and yields significant improvements on activity prediction tasks, with gains exceeding 20% accuracy on certain benchmarks.
Action Dynamics Task Graphs for Learning Plannable Representations of Procedural Tasks
Weichao Mao, Ruta Desai, Michael L. Iuzzolino, Nitin Kamra
A method for extracting task structure from video demonstrations paired with narrations of procedural activities. Creates structured representations enabling task understanding in unseen videos, action recommendations, and sequence planning.
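A minimal sketch of the underlying idea, under the simplifying assumption that a task graph is just a weighted precedence graph aggregated from action bigrams (the actual method learns richer plannable representations from video and narration):

```python
from collections import defaultdict

def build_task_graph(demonstrations):
    """Aggregate consecutive action pairs from demonstrations into a
    weighted edge list: edge (a, b) counts how often action b directly
    followed action a across all demos."""
    edges = defaultdict(int)
    for actions in demonstrations:
        for a, b in zip(actions, actions[1:]):
            edges[(a, b)] += 1
    return dict(edges)
```
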
EgoAdapt: A Multi-Stream Evaluation Study of Adaptation to Real-World Egocentric User Video
Matthias De Lange, Hamid Eghbalzadeh, Reuben Tan, Michael Iuzzolino, Franziska Meier, Karl Ridgeway
A multi-stream evaluation study examining how models adapt to real-world egocentric user video, benchmarking adaptation strategies across different data streams and user scenarios.
Virtual-to-Real-World Transfer Learning for Robots on Wilderness Trails
Michael L. Iuzzolino, Michael E. Walker, Daniel Szafir
Explores virtual-to-real-world transfer learning with deep models trained to classify trail direction from images. Synthetic data from virtual environments bypasses the need to collect large numbers of real outdoor images, and the models reach classification accuracies upwards of 95% on synthetic data.
Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals
Moritz Vandenhirtz, Kaveh Hassani, Sima Ghasemlou, Shawn Shao, Hamid Eghbalzadeh, Michael Iuzzolino, et al.
A multimodal generative approach for recommendation systems that fuses semantic understanding with collaborative filtering signals for improved recommendations.
Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths
Soham Mondal, Naveen Sendhilnathan, Ting Zhang, Yupei Liu, Michael Proulx, Michael L. Iuzzolino, et al.
Aligns gaze scanpath data with language representations for zero-shot prediction of what users are searching for visually, bridging eye-tracking and natural language understanding.
ICCV 2025
Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation
Thomas Lebailly, Vijay Veerabadran, Satwik Kottur, Karl Ridgeway, Michael L. Iuzzolino
Leverages synthetically generated captions for open-vocabulary zero-shot segmentation, enabling models to segment objects from categories not seen during training.
IUI 2025
Less or More: Towards Glanceable Explanations for LLM Recommendations Using Ultra-Small Devices
Explores how to present AI-generated explanations effectively on devices with limited screen space, such as smartwatches, through structured text formatting and adaptive presentation strategies.
Hamid Reza Vaezi Joze, Amirreza Shaban, Michael L. Iuzzolino, Kazuhito Koishida
A simple neural network module for leveraging knowledge from multiple modalities in convolutional neural networks. The Multimodal Transfer Module (MMTM) can be inserted at different levels of the feature hierarchy, enabling slow modality fusion; using squeeze-and-excitation operations, it recalibrates the channel-wise features of each modality. Achieves state-of-the-art performance on gesture recognition, speech enhancement, and action recognition. 534+ citations.
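The squeeze-and-excitation recalibration can be sketched in a few lines of numpy. This is a loose illustration, not a reproduction: random weights stand in for learned parameters, and the bottleneck ratio and doubled-sigmoid gating are assumptions about the module's form.

```python
import numpy as np

def mmtm_sketch(feat_a, feat_b, bottleneck_ratio=4, rng=None):
    """Toy MMTM-style fusion of two modality feature maps of shape (C, H, W).
    Squeeze: global average pooling per channel; a joint bottleneck mixes
    both modalities; excitation gates then recalibrate each stream."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Squeeze: collapse spatial dimensions to a per-channel descriptor.
    z_a = feat_a.mean(axis=(1, 2))
    z_b = feat_b.mean(axis=(1, 2))
    z = np.concatenate([z_a, z_b])          # joint multimodal descriptor

    # Shared bottleneck projection across modalities (random placeholder weights).
    c_z = z.shape[0] // bottleneck_ratio
    W_z = rng.normal(0, 0.1, (c_z, z.shape[0]))
    h = np.maximum(W_z @ z, 0.0)            # ReLU

    # Excitation: per-modality channel gates in (0, 2) via 2 * sigmoid.
    W_a = rng.normal(0, 0.1, (z_a.shape[0], c_z))
    W_b = rng.normal(0, 0.1, (z_b.shape[0], c_z))
    gate_a = 2.0 / (1.0 + np.exp(-(W_a @ h)))
    gate_b = 2.0 / (1.0 + np.exp(-(W_b @ h)))

    # Channel-wise recalibration of each modality stream.
    return feat_a * gate_a[:, None, None], feat_b * gate_b[:, None, None]
```
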
A new mechanism for audio-visual fusion that leverages a cross-modal squeeze-excitation block for speech enhancement. The fusion block is adaptable to any feature layer and significantly reduces model parameters compared to standard AV fusion methods without loss of performance.
Patent for audio-visual speech enhancement methods using cross-modal squeeze-excitation mechanisms to improve speech intelligibility in noisy environments.
Microsoft Research
Real-Time 3D Object Retrieval and Pose Estimation from 2D RGB Images
Fast, simultaneous 3D model retrieval and pose estimation from single RGB images utilizing multi-view deep metric learning and fully convolutional networks. The method is category-agnostic, scalable, and allows for real-time 3D-model-to-real-object alignment.
Representation Learning & Neural Dynamics
NeurIPS 2021
Improving Anytime Prediction with Parallel Cascaded Networks and a Temporal-Difference Loss
Michael Iuzzolino, Michael C. Mozer, Samy Bengio
A biologically-inspired architecture where information propagates from neurons at all layers in parallel but transmission occurs gradually over time, leading to speed-accuracy trade-offs. The cascaded ResNet approach uses a novel temporal-difference training loss and enables predictions that improve as internal processing time increases.
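The temporal-difference flavor of the training signal can be sketched as a TD(λ)-style target recursion (an assumed simplification: each timestep's target blends the next step's prediction with the true label, and the final step targets the label directly):

```python
import numpy as np

def td_lambda_targets(probs_over_time, label_onehot, lam=0.9):
    """Toy TD(lambda)-style targets for anytime prediction.

    probs_over_time: (T, K) softmax outputs of the network at steps 1..T.
    Working backwards, each step bootstraps from the next step's output,
    blended toward the eventual label; the last step targets the label."""
    T, K = probs_over_time.shape
    targets = np.zeros((T, K))
    targets[-1] = label_onehot
    for t in range(T - 2, -1, -1):
        targets[t] = (1 - lam) * probs_over_time[t + 1] + lam * targets[t + 1]
    return targets
```
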
Michael C. Mozer, Michael L. Iuzzolino, Samy Bengio
Patent application for the parallel cascaded neural network architecture enabling anytime prediction with speed-accuracy trade-offs.
arXiv 2021
Online Unsupervised Learning of Visual Representations and Categories
Mengye Ren, Tyler R. Scott, Michael L. Iuzzolino, Michael C. Mozer, Richard Zemel
A prototype-based memory network that concurrently learns visual representations and identifies new object categories from sequential, nonstationary data streams without class labels. Uses contrastive learning to group different views while forming categorical prototypes from minimal examples.
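The prototype mechanism can be sketched as follows, assuming unit-norm embeddings and a fixed similarity threshold for spawning new categories (both are illustrative simplifications of the learned memory network):

```python
import numpy as np

def online_prototypes(stream, new_class_thresh=0.5):
    """Toy online category discovery: assign each embedding to its most
    similar prototype, or spawn a new prototype when nothing is similar
    enough, mimicking unsupervised discovery of new object categories."""
    prototypes, counts, assignments = [], [], []
    for x in stream:
        sims = [float(p @ x) for p in prototypes]
        if sims and max(sims) >= new_class_thresh:
            k = int(np.argmax(sims))
            counts[k] += 1
            # Running-mean update keeps the prototype centered on its cluster.
            prototypes[k] += (x - prototypes[k]) / counts[k]
            prototypes[k] /= np.linalg.norm(prototypes[k])
        else:
            prototypes.append(x.copy())
            counts.append(1)
            k = len(prototypes) - 1
        assignments.append(k)
    return prototypes, assignments
```
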
Wandering Within a World: Online Contextualized Few-Shot Learning
Mengye Ren, Michael L. Iuzzolino, Michael C. Mozer, Richard S. Zemel
Extends few-shot learning to an online, continual setting where an underlying context changes throughout time. Introduces a new few-shot learning dataset based on large-scale indoor imagery that mimics the visual experience of an agent wandering within a world.
In Automation We Trust: Investigating the Role of Uncertainty in Active Learning Systems
Michael L. Iuzzolino, Takeo Umada, Nisar R. Ahmed, Daniel A. Szafir
An empirical study evaluating how active learning query policies and uncertainty visualizations influence analyst trust in automated classification of image data. Found that query policy significantly influences trust in image classification systems.
Michael Iuzzolino, Yehonatan Singer, Michael C. Mozer
Revisits attractor networks in light of modern deep learning methods and proposes a convolutional bipartite architecture with a novel training loss, activation function, and connectivity constraints. Demonstrates potential for image completion and super-resolution.
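The settling dynamics behind image completion can be sketched in a toy bipartite loop (an assumed simplification: alternating tanh updates between visible and hidden units, with observed entries clamped each iteration; the paper's architecture, loss, and constraints are richer than this):

```python
import numpy as np

def bipartite_attractor_complete(x_partial, mask, W, steps=20):
    """Toy bipartite attractor settling: visible units v and hidden units h
    update alternately through shared weights W, while known entries of the
    input stay clamped, so missing entries are filled in by the dynamics."""
    v = np.where(mask, x_partial, 0.0)
    for _ in range(steps):
        h = np.tanh(W @ v)
        v_new = np.tanh(W.T @ h)
        v = np.where(mask, x_partial, v_new)   # clamp observed entries
    return v
```
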
Fully Bayesian Human-Machine Data Fusion for Robust Dynamic Target Surveillance
Jacob Muesing, Luke Burks, Michael Iuzzolino, Daniel Szafir, Nisar R. Ahmed
A hierarchical fully Bayesian probabilistic model that explicitly accounts for uncertainties in 'human sensor' quality and Markovian conditional dependencies. Performs online Bayesian inference via Gibbs sampling to simultaneously calibrate human sensor characteristics and estimate target states.
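One ingredient of such a model, calibrating the reliability of a 'human sensor', admits a compact conjugate sketch (an assumed Beta-Bernoulli fragment shown in isolation; the full model couples this with Markovian target dynamics via Gibbs sampling):

```python
def calibrate_human_sensor(reports, truths, a=1.0, b=1.0):
    """Beta-Bernoulli posterior over a human sensor's hit rate.

    reports/truths: binary detections and ground truth over time, under an
    assumed Beta(a, b) prior on the probability of a correct report."""
    hits = sum(r == t for r, t in zip(reports, truths))
    misses = len(reports) - hits
    a_post, b_post = a + hits, b + misses
    mean_reliability = a_post / (a_post + b_post)
    return a_post, b_post, mean_reliability
```
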
Embracing Firefly Flash Pattern Variability with Data-Driven Species Classification
Orit Martin, Canh Nguyen, Raphaël Sarfati, Manaswi Chowdhury, Michael L. Iuzzolino, Dieu My T. Nguyen, et al.
Data-driven approaches for classifying firefly species based on their bioluminescent flash patterns, embracing natural variability rather than relying on idealized templates.
ACM Collective Intelligence 2023
Gone With the Wind: Honey Bee Collective Scenting in the Presence of External Wind
Dieu My T. Nguyen, Golnar Gharooni Fard, Michael L. Iuzzolino, Orit Peleg
Investigates how external wind conditions affect collective scenting behavior in honeybee swarms and the resilience of pheromone-based communication.
Artificial Life and Robotics 2023
Honey Bees Find the Shortest Path: A Collective Flow-Mediated Approach
Dieu My T. Nguyen, Golnar G. Fard, Autumn Atkins, Peter Bontempo, Michael L. Iuzzolino, Orit Peleg
Investigates how honeybee swarms collectively solve shortest-path problems through flow-mediated pheromone communication mechanisms.
Artificial Life and Robotics 2022
Robustness of Collective Scenting in the Presence of Physical Obstacles
Dieu My T. Nguyen, Golnar G. Fard, Michael L. Iuzzolino, Orit Peleg
Examines how physical obstacles affect the robustness of collective scenting behavior in honeybee swarms and the resilience of pheromone-based communication networks.
ALIFE 2022
Physical Obstacles Constrain Behavioral Parameter Space of Successful Localization in Honey Bee Swarms
Dieu My T. Nguyen, Michael L. Iuzzolino, Orit Peleg
Explores how physical obstacles constrain the behavioral parameter space available to honeybee swarms for successful scent-based localization.
PNAS 2021
Flow-Mediated Olfactory Communication in Honeybee Swarms
Dieu My T. Nguyen, Michael L. Iuzzolino, A. Mankel, K. Bozek, G.J. Stephens, Orit Peleg
Bee detection and classification using deep learning, combined with agent-based modeling of honeybee swarm communication through flow-mediated pheromone signaling.
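The agent-based modeling side can be illustrated with a toy simulation (all dynamics here are assumptions for illustration: a queen emits pheromone, the field decays and diffuses, and bees greedily climb the local concentration gradient on a periodic grid):

```python
import numpy as np

def scenting_sim(steps=50, n_bees=30, grid=40, decay=0.9, seed=0):
    """Toy agent-based sketch of scent-driven aggregation: a queen at the
    grid center emits pheromone each step, the field decays and diffuses,
    and each bee moves one cell toward the highest nearby concentration."""
    rng = np.random.default_rng(seed)
    field = np.zeros((grid, grid))
    queen = (grid // 2, grid // 2)
    bees = rng.integers(0, grid, size=(n_bees, 2))
    for _ in range(steps):
        field *= decay
        field[queen] += 1.0
        # Crude diffusion: blend each cell with its 4-neighborhood (torus).
        field = 0.5 * field + 0.125 * (
            np.roll(field, 1, 0) + np.roll(field, -1, 0)
            + np.roll(field, 1, 1) + np.roll(field, -1, 1))
        for i, (r, c) in enumerate(bees):
            # Greedy hill-climb: stay put or step to the best neighbor.
            moves = [(r, c)] + [((r + dr) % grid, (c + dc) % grid)
                                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))]
            bees[i] = max(moves, key=lambda p: field[p])
    return bees, queen
```
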