Publications
Publications by category in reverse chronological order, generated by jekyll-scholar.
2025
- LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing. Federico Girella, Davide Talon, Ziyue Liu, Zanxi Ruan, Yiming Wang, and Marco Cristani. In Proceedings of International Conference on Computer Vision (ICCV), 2025.
Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel multistep-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple sketch-text pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
@inproceedings{girella2025lots, title = {LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing}, author = {Girella, Federico and Talon, Davide and Liu, Ziyue and Ruan, Zanxi and Wang, Yiming and Cristani, Marco}, booktitle = {Proceedings of International Conference on Computer Vision (ICCV)}, year = {2025}, }
- On large multimodal models as open-world image classifiers. Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, and Elisa Ricci. In Proceedings of International Conference on Computer Vision (ICCV), 2025.
Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them.
@inproceedings{conti2025large, title = {On large multimodal models as open-world image classifiers}, author = {Conti, Alessandro and Mancini, Massimiliano and Fini, Enrico and Wang, Yiming and Rota, Paolo and Ricci, Elisa}, booktitle = {Proceedings of International Conference on Computer Vision (ICCV)}, year = {2025}, }
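The evaluation protocol in the entry above measures how well a free-form LMM answer aligns with the ground-truth class. A minimal sketch of that idea, assuming a toy string-similarity scorer (difflib) as a stand-in for the paper's own semantic-alignment metrics:

```python
# Hedged sketch: a toy "alignment" score between free-form LMM predictions and
# ground-truth class names. The paper defines its own metrics; this stand-in
# only uses string similarity to illustrate per-sample and dataset-level scoring.
from difflib import SequenceMatcher

def alignment_score(predicted: str, ground_truth: str) -> float:
    """Return a [0, 1] similarity between a free-form prediction and a label."""
    a, b = predicted.strip().lower(), ground_truth.strip().lower()
    return SequenceMatcher(None, a, b).ratio()

preds = ["golden retriever", "a small dog", "sports car"]
labels = ["golden retriever", "chihuahua", "race car"]
scores = [alignment_score(p, g) for p, g in zip(preds, labels)]
print([round(s, 2) for s in scores])          # per-sample alignment
print(round(sum(scores) / len(scores), 2))    # dataset-level average
```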
- Training-Free Personalization via Retrieval and Reasoning on Fingerprints. Deepayan Das, Davide Talon, Yiming Wang, Massimiliano Mancini, and Elisa Ricci. In Proceedings of International Conference on Computer Vision (ICCV), 2025.
Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but heavily rely on training procedures that can be either costly or unpleasant for individual users. We depart from existing work, and for the first time explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification, highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.
@inproceedings{das2025training, title = {Training-Free Personalization via Retrieval and Reasoning on Fingerprints}, author = {Das, Deepayan and Talon, Davide and Wang, Yiming and Mancini, Massimiliano and Ricci, Elisa}, booktitle = {Proceedings of International Conference on Computer Vision (ICCV)}, year = {2025}, }
- Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues. Francesco Taioli, Edoardo Zorzi, Gianni Franchi, Alberto Castellini, Alessandro Farinelli, Marco Cristani, and Yiming Wang. In Proceedings of International Conference on Computer Vision (ICCV), 2025.
Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans. To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation in natural, template-free, open-ended dialogues with the human. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy and focuses on human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate observation description with a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue, or halt navigation, minimizing user input. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, while existing language-driven instance navigation methods struggle in complex multi-instance scenes.
@inproceedings{taioli2025collaborative, title = {Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues}, author = {Taioli, Francesco and Zorzi, Edoardo and Franchi, Gianni and Castellini, Alberto and Farinelli, Alessandro and Cristani, Marco and Wang, Yiming}, booktitle = {Proceedings of International Conference on Computer Vision (ICCV)}, year = {2025}, }
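The Interaction Trigger described in the entry above decides between asking, continuing, and halting based on uncertainty. A minimal sketch of that kind of decision rule, assuming hand-written attribute uncertainties and thresholds (not the AIUTA implementation):

```python
# Hedged sketch of an interaction-trigger rule: query the user only when the
# agent's own description is too uncertain. The thresholds and the uncertainty
# values (hand-written numbers) are assumptions for illustration.
def interaction_trigger(attribute_uncertainty: dict, ask_thr=0.6, stop_thr=0.2):
    worst_attr, worst_u = max(attribute_uncertainty.items(), key=lambda kv: kv[1])
    if worst_u > ask_thr:
        return f"ask: is the target's {worst_attr} correct? (uncertainty={worst_u:.2f})"
    if worst_u < stop_thr:
        return "halt: confident this is the target instance"
    return "continue: keep navigating and observing"

print(interaction_trigger({"color": 0.8, "material": 0.3}))
print(interaction_trigger({"color": 0.1, "material": 0.15}))
```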
- Free-form language-based robotic reasoning and grasping. Runyu Jiao, Alice Fasoli, Francesco Giuliari, Matteo Bortolon, Sergio Povoli, Guofeng Mei, Yiming Wang, and Fabio Poiesi. In Proceedings of International Conference on Intelligent Robots and Systems (IROS), 2025.
Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs’ world knowledge to reason about human instructions and object spatial arrangements. Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o’s zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or if other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset FreeGraspData by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses with both FreeGraspData and real-world validation with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution.
@inproceedings{jiao2025free, title = {Free-form language-based robotic reasoning and grasping}, author = {Jiao, Runyu and Fasoli, Alice and Giuliari, Francesco and Bortolon, Matteo and Povoli, Sergio and Mei, Guofeng and Wang, Yiming and Poiesi, Fabio}, booktitle = {Proceedings of International Conference on Intelligent Robots and Systems (IROS)}, year = {2025}, }
- Automatic benchmarking of Large Multimodal Models via iterative experiment programming. Alessandro Conti, Enrico Fini, Paolo Rota, Yiming Wang, Massimiliano Mancini, and Elisa Ricci. In Proceedings of International Conference on Image Analysis and Processing (ICIAP), 2025.
Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the user in natural language. Thanks to its modularity, our framework is flexible and extensible as new tools become available. Empirically, APEx reproduces the findings of existing studies while allowing for arbitrary analyses and hypothesis testing.
@inproceedings{conti2025automatic, title = {Automatic benchmarking of Large Multimodal Models via iterative experiment programming}, author = {Conti, Alessandro and Fini, Enrico and Rota, Paolo and Wang, Yiming and Mancini, Massimiliano and Ricci, Elisa}, booktitle = {Proceedings of International Conference on Image Analysis and Processing (ICIAP)}, year = {2025}, }
- Evaluating Attribute Confusion in Fashion Text-to-Image Generation. Ziyue Liu, Federico Girella, Yiming Wang, and Davide Talon. In Proceedings of International Conference on Image Analysis and Processing (ICIAP), 2025.
Despite the rapid advances in Text-to-Image (T2I) generation models, their evaluation remains challenging in domains like fashion, involving complex compositional generation. Recent automated T2I evaluation methods leverage pre-trained vision-language models to measure cross-modal alignment. However, our preliminary study reveals that they are still limited in assessing rich entity-attribute semantics, facing challenges in attribute confusion, i.e., when attributes are correctly depicted but associated with the wrong entities. To address this, we build on a Visual Question Answering (VQA) localization strategy targeting one single entity at a time across both visual and textual modalities. We propose a localized human evaluation protocol and introduce a novel automatic metric, Localized VQAScore (L-VQAScore), that combines visual localization with VQA probing both correct (reflection) and mis-localized (leakage) attribute generation. On a newly curated dataset featuring challenging compositional alignment scenarios, L-VQAScore outperforms state-of-the-art T2I evaluation methods in terms of correlation with human judgments, demonstrating its strength in capturing fine-grained entity-attribute associations. We believe L-VQAScore can be a reliable and scalable alternative to subjective evaluations.
@inproceedings{liu2025evaluating, title = {Evaluating Attribute Confusion in Fashion Text-to-Image Generation}, author = {Liu, Ziyue and Girella, Federico and Wang, Yiming and Talon, Davide}, booktitle = {Proceedings of International Conference on Image Analysis and Processing (ICIAP)}, year = {2025}, }
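The entry above distinguishes attributes reflected on the right entity from attributes leaking onto the wrong one. A hedged sketch of how such a reflection/leakage probe could be combined; the VQA scorer is a stub and the combination rule (reflection minus leakage, clipped) is an illustrative assumption, not the exact L-VQAScore definition:

```python
# Hedged sketch of a reflection/leakage-style score. `vqa_yes_prob` stands in
# for a real VQA/VLM model; the clipped difference below is an assumption.
def vqa_yes_prob(image, question: str) -> float:
    # Placeholder: a real implementation would query a VQA/VLM model.
    canned = {"Is the jacket red?": 0.9, "Is the skirt red?": 0.3}
    return canned.get(question, 0.5)

def localized_attribute_score(image, entity: str, wrong_entity: str, attribute: str) -> float:
    reflection = vqa_yes_prob(image, f"Is the {entity} {attribute}?")  # right entity
    leakage = vqa_yes_prob(image, f"Is the {wrong_entity} {attribute}?")  # wrong entity
    return max(0.0, min(1.0, reflection - leakage))

print(localized_attribute_score(None, "jacket", "skirt", "red"))  # about 0.6
```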
- Diversified in-domain synthesis with efficient fine-tuning for few-shot classification. Nicola Dall’Asen, Victor G. Turrisi da Costa, Nicu Sebe, Yiming Wang, and Elisa Ricci. In Proceedings of International Conference on Image Analysis and Processing (ICIAP), 2025.
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class. A recent research direction for improving few-shot classifiers involves augmenting the labelled samples with synthetic images created by state-of-the-art text-to-image generation models. Following this trend, we propose Diversified In-domain Synthesis with Efficient Fine-tuning (DISEF), a novel approach which addresses the generalization challenge in few-shot learning using synthetic data. DISEF consists of two main components. First, we propose a novel text-to-image augmentation pipeline that, by leveraging the real samples and their rich semantics coming from an advanced captioning model, promotes in-domain sample diversity for better generalization. Second, we emphasize the importance of effective model fine-tuning in few-shot recognition, proposing to use Low-Rank Adaptation (LoRA) for joint adaptation of the text and image encoders in a Vision Language Model. We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
@inproceedings{dall2025diversified, title = {Diversified in-domain synthesis with efficient fine-tuning for few-shot classification}, author = {Dall'Asen, Nicola and da Costa, Victor G Turrisi and Sebe, Nicu and Wang, Yiming and Ricci, Elisa}, booktitle = {Proceedings of International Conference on Image Analysis and Processing (ICIAP)}, year = {2025}, }
- PerLA: Perceptive 3D language assistant. Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Fabio Poiesi, and Yiming Wang. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
Enabling Large Language Models (LLMs) to understand the 3D physical world is an emerging yet challenging research direction. Current strategies for processing point clouds typically downsample the scene or divide it into smaller parts for separate analysis. However, both approaches risk losing key local details or global contextual information. In this paper, we introduce PerLA, a 3D language assistant designed to be more perceptive to both details and context, making visual representations more informative for the LLM. PerLA captures high-resolution (local) details in parallel from different point cloud areas and integrates them with (global) context obtained from a lower-resolution whole point cloud. We present a novel algorithm that preserves point cloud locality through the Hilbert curve and effectively aggregates local-to-global information via cross-attention and a graph neural network. Lastly, we introduce a novel loss for local representation consensus to promote training stability. PerLA outperforms state-of-the-art 3D language assistants, with gains of up to +1.34 CiDEr on ScanQA for question answering, and +4.22 on ScanRefer and +3.88 on Nr3D for dense captioning.
@inproceedings{mei2025perla, title = {PerLA: Perceptive 3D language assistant}, author = {Mei, Guofeng and Lin, Wei and Riz, Luigi and Wu, Yujiao and Poiesi, Fabio and Wang, Yiming}, booktitle = {Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2025}, }
- Collaborative Neural Painting. Nicola Dall’Asen, Willi Menapace, Elia Peruzzo, Enver Sangineto, Yiming Wang, and Elisa Ricci. Computer Vision and Image Understanding (CVIU), 2025.
The process of painting fosters creativity and rational planning. However, existing generative AI mostly focuses on producing visually pleasant artworks, without emphasizing the painting process. We introduce a novel task, Collaborative Neural Painting (CNP), to facilitate collaborative art painting generation between humans and machines. Given any number of user-input brushstrokes as the context, or just the desired object class, CNP should produce a sequence of strokes supporting the completion of a coherent painting. Importantly, the process can be gradual and iterative, thus allowing users’ modifications at any phase until completion. Moreover, we propose to solve this task using a painting representation based on a sequence of parametrized strokes, which makes both editing and composition operations easy. These parametrized strokes are processed by a Transformer-based architecture with a novel attention mechanism to model the relationship between the input strokes and the strokes to complete. We also propose a new masking scheme to reflect the interactive nature of CNP and adopt diffusion models as the basic learning process for their effectiveness and diversity in the generative field. Finally, to develop and validate methods on the novel task, we introduce a new dataset of painted objects and an evaluation protocol to benchmark CNP both quantitatively and qualitatively. We demonstrate the effectiveness of our approach and the potential of the CNP task as a promising avenue for future research.
@article{dall2025collaborative, title = {Collaborative Neural Painting}, author = {Dall'Asen, Nicola and Menapace, Willi and Peruzzo, Elia and Sangineto, Enver and Wang, Yiming and Ricci, Elisa}, journal = {Computer Vision and Image Understanding (CVIU)}, year = {2025}, }
- Can Text-to-Video Generation help Video-Language Alignment? Luca Zanella, Massimiliano Mancini, Willi Menapace, Sergey Tulyakov, Yiming Wang, and Elisa Ricci. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
Recent video-language alignment models are trained on sets of videos, each with an associated positive caption and a negative caption generated by large language models. A problem with this procedure is that negative captions may introduce linguistic biases, i.e., concepts are seen only as negatives and never associated with a video. While a solution would be to collect videos for the negative captions, existing databases lack the fine-grained variations needed to cover all possible negatives. In this work, we study whether synthetic videos can help to overcome this issue. Our preliminary analysis with multiple generators shows that, while promising on some tasks, synthetic videos harm the performance of the model on others. We hypothesize this issue is linked to noise (semantic and visual) in the generated videos and develop a method, SynViTA, that accounts for those. SynViTA dynamically weights the contribution of each synthetic video based on how similar its target caption is w.r.t. the real counterpart. Moreover, a semantic consistency loss makes the model focus on fine-grained differences across captions, rather than differences in video appearance. Experiments show that, on average, SynViTA improves over existing methods on VideoCon test sets and SSv2-Temporal, SSv2-Events, and ATP-Hard benchmarks, being a first promising step for using synthetic videos when learning video-language models.
@inproceedings{zanella2025synvita, title = {Can Text-to-Video Generation help Video-Language Alignment?}, author = {Zanella, Luca and Mancini, Massimiliano and Menapace, Willi and Tulyakov, Sergey and Wang, Yiming and Ricci, Elisa}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, year = {2025}, }
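The entry above describes weighting each synthetic video by how similar its target caption is to the real one. A minimal sketch of that weighting idea, assuming a bag-of-words cosine similarity as a stand-in for a learned text encoder and a simple weighted-loss average (not SynViTA's exact formulation):

```python
# Hedged sketch of caption-similarity weighting for synthetic samples.
# Bag-of-words cosine similarity replaces a learned text encoder; the weighted
# average of per-sample losses only illustrates the idea described above.
from collections import Counter
import math

def cosine_bow(a: str, b: str) -> float:
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def weighted_loss(per_sample_losses, target_captions, real_captions):
    weights = [cosine_bow(t, r) for t, r in zip(target_captions, real_captions)]
    total_w = sum(weights) or 1.0
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total_w

print(weighted_loss([1.2, 0.4],
                    ["a person opens the door", "a dog jumps over a fence"],
                    ["a person closes the door", "a cat sleeps on a sofa"]))
```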
- Seeing the Abstract: Translating the Abstract Language for Vision Language Models. Davide Talon, Federico Girella, Ziyue Liu, Marco Cristani, and Yiming Wang. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution.
@inproceedings{talon2025seeing, title = {Seeing the Abstract: Translating the Abstract Language for Vision Language Models}, author = {Talon, Davide and Girella, Federico and Liu, Ziyue and Cristani, Marco and Wang, Yiming}, booktitle = {Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2025}, }
- One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering. Deepayan Das, Davide Talon, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
Vision-Language Models (VLMs) have shown significant promise in Visual Question Answering (VQA) tasks by leveraging web-scale multimodal datasets. However, these models often struggle with continual learning due to catastrophic forgetting when adapting to new tasks. As an effective remedy to mitigate catastrophic forgetting, the rehearsal strategy reuses data from past tasks when learning a new task. However, such a strategy requires storing past data, which might not be feasible due to hardware constraints or privacy concerns. In this work, we propose the first data-free method that leverages the language generation capability of a VLM, instead of relying on external models, to produce pseudo-rehearsal data for addressing continual VQA. Our proposal, named GaB, generates pseudo-rehearsal data by posing previous task questions on new task data. Yet, despite being effective, the distribution of generated questions skews towards the most frequently posed questions due to the limited and task-specific training data. To mitigate this issue, we introduce a pseudo-rehearsal balancing module that aligns the generated data towards the ground-truth data distribution using either the question meta-statistics or an unsupervised clustering method. We evaluate our proposed method on two recent benchmarks, i.e., the VQACL-VQAv2 and CLOVE-function benchmarks. GaB outperforms all the data-free baselines with substantial improvement in maintaining VQA performance across evolving tasks, while being on par with methods that have access to the past data.
@inproceedings{das2025onevlm, title = {One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering}, author = {Das, Deepayan and Talon, Davide and Mancini, Massimiliano and Wang, Yiming and Ricci, Elisa}, booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, year = {2025}, }
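The balancing module in the entry above aligns generated questions to a reference distribution. A minimal sketch of one way to do that, assuming invented question types and counts and simple resampling (GaB's actual module uses question meta-statistics or unsupervised clustering):

```python
# Hedged sketch of pseudo-rehearsal balancing: resample generated QA pairs so
# their question-type histogram matches a reference distribution. The question
# types, counts, and resampling-with-replacement scheme are illustrative.
import random
from collections import defaultdict

def balance(generated, reference_dist, total, seed=0):
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for qa in generated:
        by_type[qa["type"]].append(qa)
    balanced = []
    for qtype, frac in reference_dist.items():
        pool = by_type.get(qtype, [])
        k = round(frac * total)
        if pool:
            balanced.extend(rng.choices(pool, k=k))  # resample with replacement
    return balanced

generated = [{"type": "count", "q": "How many cats?"}] * 8 + \
            [{"type": "color", "q": "What color is the car?"}] * 2
reference = {"count": 0.5, "color": 0.5}
print(len(balance(generated, reference, total=10)))  # 10 samples, evenly split by type
```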
- Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant. Guofeng Mei, Luigi Riz, Yiming Wang, and Fabio Poiesi. In Proceedings of International Conference on 3D Vision (3DV), 2025.
Most recent 3D instance segmentation methods are open vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, i.e., answering "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering, accounting for both mask coherence and semantic coherence that are estimated from the 2D object instance masks. We evaluate our method using ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings.
@inproceedings{mei2025vocabulary, title = {Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant}, author = {Mei, Guofeng and Riz, Luigi and Wang, Yiming and Poiesi, Fabio}, booktitle = {Proceedings of International Conference on 3D Vision (3DV)}, year = {2025}, }
- Towards a Decentralised Application-Centric Orchestration Framework in the Cloud-Edge Continuum. In Proceedings of IEEE International Conference on Fog and Edge Computing (ICFEC), 2025.
The efficient management of complex distributed applications in the Cloud-Edge continuum, including their deployment on heterogeneous computing resources and run-time operations, presents significant challenges. Resource management solutions – also called orchestrators – play a pivotal role by automating and managing tasks such as resource discovery, optimisation, application deployment, and lifecycle management, whilst ensuring the desired system performance. This paper introduces Swarmchestrate, a decentralised, application-centric orchestration framework inspired by the self-organising principles of Swarms. Swarmchestrate addresses the end-to-end management of distributed applications, from submission to optimal resource allocation across cloud and edge providers, as well as dynamic reconfiguration. Our initial findings include the implementation of the application deployment phase within a Cloud-Edge simulation environment, demonstrating the potential of Swarmchestrate. The results offer valuable insight into the coordination of resource offerings between various providers and optimised resource allocation, providing a foundation for designing scalable and efficient infrastructures.
@inproceedings{ullah2025towards, title = {Towards a Decentralised Application-Centric Orchestration Framework in the Cloud-Edge Continuum}, author = {}, booktitle = {Proceedings of IEEE International Conference on Fog and Edge Computing (ICFEC)}, year = {2025}, }
2024
- Multimodal Fusion SLAM with Fourier Attention. Youjie Zhou, Guofeng Mei, Yiming Wang, Yi Wan, and Fabio Poiesi. IEEE Robotics and Automation Letters (RA-L), 2024.
Visual SLAM is particularly challenging in environments affected by noise, varying lighting conditions, and darkness. Learning-based optical flow algorithms can leverage multiple modalities to address these challenges, but traditional optical flow-based visual SLAM approaches often require significant computational resources. To overcome this limitation, we propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes the fast Fourier transform (FFT) to enhance the algorithm's efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. We further enhance the interaction of multimodal features by incorporating multi-scale knowledge distillation across modalities. We also demonstrate the practical feasibility of FMF-SLAM in real-world scenarios with real-time performance by integrating it with a security robot, fusing it with a GNSS-RTK global positioning module and global bundle adjustment. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying lighting, and dark settings.
@article{zhou2024multimodal, title = {Multimodal Fusion SLAM with Fourier Attention}, author = {Zhou, Youjie and Mei, Guofeng and Wang, Yiming and Wan, Yi and Poiesi, Fabio}, journal = {IEEE Robotics and Automation Letters (RA-L)}, year = {2024}, }
- Positional Diffusion: Graph-based Diffusion Models for Set Ordering. Francesco Giuliari, Gianluca Scarpellini, Stefano Fiorini, Pietro Morerio, Stuart James, Yiming Wang, and Alessio Del Bue. Pattern Recognition Letters, 2024.
Positional reasoning is the process of ordering an unsorted set of parts into a consistent structure. To address this problem, we present Positional Diffusion, a plug-and-play graph formulation with Diffusion Probabilistic Models. Using a diffusion process, we add Gaussian noise to the set elements’ positions and map them to a random position in a continuous space. Positional Diffusion learns to reverse the noising process and recover the original positions through an Attention-based Graph Neural Network. To evaluate our method, we conduct extensive experiments on three different tasks and seven datasets, comparing our approach against the state-of-the-art methods for visual puzzle-solving, sentence ordering, and room arrangement, demonstrating that our method outperforms long-standing research on puzzle solving compared to the second-best deep learning method, and performs on par with the state-of-the-art methods on sentence ordering and room rearrangement. Our work highlights the suitability of diffusion models for ordering problems and proposes a novel formulation and method for solving various ordering tasks.
@article{giuliari2024positional, title = {Positional Diffusion: Graph-based Diffusion Models for Set Ordering}, author = {Giuliari, Francesco and Scarpellini, Gianluca and Fiorini, Stefano and Morerio, Pietro and James, Stuart and Wang, Yiming and Del Bue, Alessio}, journal = {Pattern Recognition Letters}, year = {2024}, }
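The forward (noising) process described in the entry above adds Gaussian noise to element positions; the standard DDPM closed form gives x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. A minimal numpy sketch with a linear beta schedule and toy 2D positions (both are illustrative choices, not the paper's configuration):

```python
# Hedged sketch of the forward noising of element positions using the standard
# DDPM closed form. Schedule and positions are toy choices for illustration.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def noise_positions(x0: np.ndarray, t: int, rng=np.random.default_rng(0)):
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # a network is trained to recover eps (or x0) from xt

x0 = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])  # positions of an ordered set
xt, _ = noise_positions(x0, t=500)
print(xt.round(3))
```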
- Retrieval-enriched zero-shot image classification in low-resource domains. Nicola Dall’Asen, Yiming Wang, Enrico Fini, and Elisa Ricci. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
Low-resource domains, characterized by scarce data and annotations, present significant challenges for language and visual understanding tasks, with the latter much under-explored in the literature. Recent advancements in Vision-Language Models (VLM) have shown promising results in high-resource domains but fall short in low-resource concepts that are under-represented (e.g. only a handful of images per category) in the pre-training set. We tackle the challenging task of zero-shot low-resource image classification from a novel perspective. By leveraging a retrieval-based strategy, we achieve this in a training-free fashion. Specifically, our method, named CoRE (Combination of Retrieval Enrichment), enriches the representation of both query images and class prototypes by retrieving relevant textual information from large web-crawled databases. This retrieval-based enrichment significantly boosts classification performance by incorporating the broader contextual information relevant to the specific class. We validate our method on a newly established benchmark covering diverse low-resource domains, including medical imaging, rare plants, and circuits. Our experiments demonstrate that CoRE outperforms existing state-of-the-art methods that rely on synthetic data generation and model fine-tuning.
@inproceedings{dall2024retrieval, title = {Retrieval-enriched zero-shot image classification in low-resource domains}, author = {Dall'Asen, Nicola and Wang, Yiming and Fini, Enrico and Ricci, Elisa}, booktitle = {Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2024}, }
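The entry above enriches both query-image and class-prototype representations with retrieved textual information before zero-shot matching. A hedged sketch of that enrichment idea, assuming random vectors in place of real VLM features and an illustrative mixing weight (not CoRE's tuned configuration):

```python
# Hedged sketch of retrieval-based enrichment: mix each embedding with the mean
# of its retrieved caption embeddings, then classify by cosine similarity.
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def enrich(embedding, retrieved_text_embeddings, alpha=0.7):
    mixed = alpha * embedding + (1 - alpha) * retrieved_text_embeddings.mean(axis=0)
    return l2norm(mixed)

rng = np.random.default_rng(0)
d, num_classes = 512, 5
image_emb = l2norm(rng.standard_normal(d))
class_protos = l2norm(rng.standard_normal((num_classes, d)))

# Enrich the query and each class prototype with (stand-in) retrieved captions.
image_emb = enrich(image_emb, l2norm(rng.standard_normal((16, d))))
class_protos = np.stack([enrich(c, l2norm(rng.standard_normal((16, d)))) for c in class_protos])

print("predicted class:", int(np.argmax(class_protos @ image_emb)))
```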
- Unsupervised active visual search with Monte Carlo planning under uncertain detections. Francesco Taioli, Francesco Giuliari, Yiming Wang, Riccardo Berra, Alberto Castellini, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, and Francesco Setti. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024.
We propose a solution for Active Visual Search of objects in an environment, whose 2D floor map is the only known information. Our solution has three key features that make it more plausible and robust to detector failures compared to state-of-the-art methods: (i) it is unsupervised as it does not need any training sessions. (ii) During the exploration, a probability distribution on the 2D floor map is updated according to an intuitive mechanism, while an improved belief update increases the effectiveness of the agent’s exploration. (iii) We incorporate the awareness that an object detector may fail into the aforementioned probability modelling by exploiting the success statistics of a specific detector. Our solution is dubbed POMP-BE-PD (Pomcp-based Online Motion Planning with Belief by Exploration and Probabilistic Detection). It uses the current pose of an agent and an RGB-D observation to learn an optimal search policy, exploiting a POMDP solved by a Monte-Carlo planning approach. On the Active Vision Database benchmark, we increase the average success rate over all the environments by a significant 35% while decreasing the average path length by 4% with respect to competing methods. Thus, our results are state-of-the-art, even without using any training procedure.
@article{taioli2024unsupervised, title = {Unsupervised active visual search with monte carlo planning under uncertain detections}, author = {Taioli, Francesco and Giuliari, Francesco and Wang, Yiming and Berra, Riccardo and Castellini, Alberto and Del Bue, Alessio and Farinelli, Alessandro and Cristani, Marco and Setti, Francesco}, journal = {Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, year = {2024}, }
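The entry above folds detector failure statistics into the probabilistic belief over the floor map. A hedged sketch of the kind of Bayesian update this implies: a negative observation down-weights viewed cells by the detector's miss rate instead of zeroing them. Grid size, viewed region, and miss rate are invented numbers, not the paper's POMDP model.

```python
# Hedged sketch of a detector-aware belief update on a 2D floor-map grid.
import numpy as np

def update_belief(belief: np.ndarray, viewed_mask: np.ndarray, detected: bool, miss_rate=0.3):
    likelihood = np.ones_like(belief)
    if detected:
        likelihood[~viewed_mask] = 0.0          # target must be inside the view
        likelihood[viewed_mask] = 1.0 - miss_rate
    else:
        likelihood[viewed_mask] = miss_rate     # could still be there; detector may have failed
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = np.full((4, 4), 1.0 / 16)              # uniform prior over the map
viewed = np.zeros((4, 4), dtype=bool)
viewed[:2, :2] = True                           # cells inside the current field of view
belief = update_belief(belief, viewed, detected=False)
print(belief.round(3))
```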
- I2EDL: Interactive Instruction Error Detection and Localization. Francesco Taioli, Stefano Rosa, Alberto Castellini, Lorenzo Natale, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, and Yiming Wang. In Proceedings of IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2024.
In the Vision-and-Language Navigation in Continuous Environments (VLN-CE) task, the human user guides an autonomous agent to reach a target goal via a series of low-level actions following a textual instruction in natural language. However, most existing methods do not address the likely case where users may make mistakes when providing such an instruction (e.g. "turn left" instead of "turn right"). In this work, we address a novel task of Interactive VLN in Continuous Environments (IVLN-CE), which allows the agent to interact with the user during the VLN-CE navigation to verify any doubts regarding the instruction errors. We propose an Interactive Instruction Error Detector and Localizer (I2EDL) that triggers the user-agent interaction upon the detection of instruction errors during the navigation. We leverage a pre-trained module to detect instruction errors and pinpoint them in the instruction by cross-referencing the textual input and past observations. In this way, the agent is able to query the user for a timely correction without imposing a heavy cognitive load, as we localize the probable errors to a precise part of the instruction. We evaluate the proposed I2EDL on a dataset of instructions containing errors, and further devise a novel metric, the Success weighted by Interaction Number (SIN), to reflect both the navigation performance and the interaction effectiveness. We show how the proposed method can ask focused requests for corrections to the user, which in turn increases the navigation success, while minimizing the interactions.
@inproceedings{taioli2024i2edl, title = {I2EDL: Interactive Instruction Error Detection and Localization}, author = {Taioli, Francesco and Rosa, Stefano and Castellini, Alberto and Natale, Lorenzo and Del Bue, Alessio and Farinelli, Alessandro and Cristani, Marco and Wang, Yiming}, booktitle = {Proceedings of IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)}, year = {2024}, }
- Mind the error! Detection and localization of instruction errors in vision-and-language navigation. Francesco Taioli, Stefano Rosa, Alberto Castellini, Lorenzo Natale, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, and Yiming Wang. In Proceedings of International Conference on Intelligent Robots and Systems (IROS), 2024.
@inproceedings{taioli2024mind, title = {Mind the error! detection and localization of instruction errors in vision-and-language navigation}, author = {Taioli, Francesco and Rosa, Stefano and Castellini, Alberto and Natale, Lorenzo and Del Bue, Alessio and Farinelli, Alessandro and Cristani, Marco and Wang, Yiming}, booktitle = {Proceedings of International Conference on Intelligent Robots and Systems (IROS)}, year = {2024}, }
- Harnessing Large Language Models for Training-free Video Anomaly Detection. Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to be domain-specific, thus being costly for practical deployment as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
@inproceedings{zanella2024harnessing, title = {Harnessing Large Language Models for Training-free Video Anomaly Detection}, author = {Zanella, Luca and Menapace, Willi and Mancini, Massimiliano and Wang, Yiming and Ricci, Elisa}, booktitle = {Proceedings of IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)}, year = {2024}, }
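The entry above turns per-frame LLM scores into a temporally aggregated anomaly signal. A minimal sketch of such an aggregation step, assuming hard-coded per-frame scores in place of real LLM outputs and a moving-average window as an illustrative choice (not LAVAD's aggregation scheme):

```python
# Hedged sketch of temporal aggregation of per-frame anomaly scores.
import numpy as np

def smooth_scores(frame_scores, window=5):
    kernel = np.ones(window) / window
    return np.convolve(frame_scores, kernel, mode="same")

raw = np.array([0.1, 0.1, 0.2, 0.9, 0.8, 0.9, 0.2, 0.1])  # per-frame scores (stub)
smoothed = smooth_scores(raw)
print(smoothed.round(2))
print("anomalous frames:", np.where(smoothed > 0.5)[0].tolist())
```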
- Test-Time Zero-Shot Temporal Action Localization. Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, and Elisa Ricci. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.
Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model’s generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM) at inference time on a sample basis. T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
@inproceedings{liberatori2024test, title = {Test-Time Zero-Shot Temporal Action Localization}, author = {Liberatori, Benedetta and Conti, Alessandro and Rota, Paolo and Wang, Yiming and Ricci, Elisa}, booktitle = {Proceedings of IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)}, year = {2024}, }
- Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding. Guofeng Mei, Luigi Riz, Yiming Wang, and Fabio Poiesi. In Proceedings of IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024.
Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language Models (VLMs). Existing strategies directly map VLM representations from 2D pixels of rendered or captured views to 3D points, overlooking the inherent and expressible point cloud geometric structure. Geometrically similar or close regions can be exploited for bolstering point cloud understanding as they are likely to share semantic information. To this end, we introduce the first training-free aggregation technique that leverages the point cloud’s 3D geometric structure to improve the quality of the transferred VLM representations. Our approach operates iteratively, performing local-to-global aggregation based on geometric and semantic point-level reasoning. We benchmark our approach on three downstream tasks, including classification, part segmentation, and semantic segmentation, with a variety of datasets representing both synthetic/real-world, and indoor/outdoor scenarios. Our approach achieves new state-of-the-art results in all benchmarks.
@inproceedings{mei2024geometrically, title = {Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding}, author = {Mei, Guofeng and Riz, Luigi and Wang, Yiming and Poiesi, Fabio}, booktitle = {Proceedings of IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR)}, year = {2024}, }
- Light-weight Retinal Layer Segmentation with Global Reasoning. Xiang He, Weiye Song, Yiming Wang, Fabio Poiesi, Ji Yi, Manishi Desai, Quanqing Xu, Kongzheng Yang, and Yi Wan. IEEE Transactions on Instrumentation and Measurement, 2024.
Automatic retinal layer segmentation with medical images, such as optical coherence tomography (OCT) images, serves as an important tool for diagnosing ophthalmic diseases. However, it is challenging to achieve accurate segmentation due to low contrast and blood flow noises presented in the images. In addition, the algorithm should be light-weight to be deployed for practical clinical applications. Therefore, it is desired to design a light-weight network with high performance for retinal layer segmentation. In this paper, we propose LightReSeg for retinal layer segmentation which can be applied to OCT images. Specifically, our approach follows an encoder-decoder structure, where the encoder part employs multi-scale feature extraction and a Transformer block for fully exploiting the semantic information of feature maps at all scales and making the features have better global reasoning capabilities, while the decoder part, we design a multi-scale asymmetric attention (MAA) module for preserving the semantic information at each encoder scale. The experiments show that our approach achieves a better segmentation performance compared to the current state-of-the-art method TransUnet with 105.7M parameters on both our collected dataset and two other public datasets, with only 3.3M parameters.
@article{he2024light, title = {Light-weight Retinal Layer Segmentation with Global Reasoning}, author = {He, Xiang and Song, Weiye and Wang, Yiming and Poiesi, Fabio and Yi, Ji and Desai, Manishi and Xu, Quanqing and Yang, Kongzheng and Wan, Yi}, journal = {IEEE Transactions on Instrumentation and Measurement}, year = {2024}, }
- Delving into CLIP latent space for video anomaly recognition. Luca Zanella, Benedetta Liberatori, Willi Menapace, Fabio Poiesi, Yiming Wang, and Elisa Ricci. Computer Vision and Image Understanding (CVIU), 2024.
We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method AnomalyCLIP, the first to combine Large Language and Vision (LLV) models, such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. We also introduce a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods considering three major anomaly detection benchmarks, i.e. ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies.
@article{zanella2024delving, title = {Delving into clip latent space for video anomaly recognition}, author = {Zanella, Luca and Liberatori, Benedetta and Menapace, Willi and Poiesi, Fabio and Wang, Yiming and Ricci, Elisa}, journal = {Computer Vision and Image Understanding (CVIU)}, year = {2024}, }
- MAVAD: Audio-Visual Dataset and Method for Anomaly Detection in Traffic Videos. Błażej Leporowski, Arian Bakhtiarnia, Nicole Bonnici, Adrian Muscat, Luca Zanella, Yiming Wang, and Alexandros Iosifidis. In Proceedings of IEEE International Conference on Image Processing (ICIP), 2024.
We introduce the first audio-visual dataset for traffic anomaly detection taken from real-world scenes, called MAVAD, with a diverse range of weather and illumination conditions. In addition, we propose a novel method named AVACA that combines visual and audio features extracted from video sequences by means of cross-attention to detect anomalies. We demonstrate that the addition of audio improves the performance of AVACA by up to 5.2%. We also evaluate the impact of image anonymization, showing only a minor decrease in performance averaging at 1.7%.
@inproceedings{leporowski2024mavad, title = {MAVAD: Audio-Visual Dataset and Method for Anomaly Detection in Traffic Videos}, author = {Leporowski, B{\l}a{\.z}ej and Bakhtiarnia, Arian and Bonnici, Nicole and Muscat, Adrian and Zanella, Luca and Wang, Yiming and Iosifidis, Alexandros}, booktitle = {Proceedings of IEEE International Conference on Image Processing (ICIP)}, year = {2024}, }
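The entry above fuses visual and audio features via cross-attention. A hedged sketch of that fusion pattern using PyTorch's built-in multi-head attention; the feature sizes, the single fusion block, and the mean pooling are assumptions for illustration, not the AVACA architecture.

```python
# Hedged sketch of audio-visual fusion: visual tokens attend to audio tokens.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, 1)  # per-clip anomaly logit

    def forward(self, visual_tokens, audio_tokens):
        fused, _ = self.attn(query=visual_tokens, key=audio_tokens, value=audio_tokens)
        fused = self.norm(visual_tokens + fused)       # residual connection
        return self.head(fused.mean(dim=1)).squeeze(-1)

model = CrossModalFusion()
visual = torch.randn(2, 16, 256)   # (batch, frames, dim)
audio = torch.randn(2, 32, 256)    # (batch, audio tokens, dim)
print(model(visual, audio).shape)  # torch.Size([2])
```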
2023
- Vocabulary-free Image Classification. Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, and Elisa Ricci. In Proceedings of Conference on Neural Information Processing Systems (NeurIPS), 2023.
Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed as Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with much fewer parameters, paving the way for future research in this direction.
@inproceedings{conti2023vocabulary, title = {Vocabulary-free Image Classification}, author = {Conti, Alessandro and Fini, Enrico and Mancini, Massimiliano and Rota, Paolo and Wang, Yiming and Ricci, Elisa}, booktitle = {Proceedings of Conference on Neural Information Processing Systems (NeurIPS)}, year = {2023}, }
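The entry above works in two steps: mine candidate category names from captions retrieved for the image, then pick the candidate that best matches the image. A hedged sketch of that pipeline, where word frequency and a stubbed image-text scorer stand in for the real VLM and external database used by CaSED:

```python
# Hedged sketch of vocabulary-free classification: (1) extract candidate words
# from retrieved captions, (2) score candidates against the image with a stub.
from collections import Counter

STOPWORDS = {"a", "an", "the", "on", "in", "of", "with", "and", "is"}

def candidate_categories(retrieved_captions, top_k=3):
    words = [w for c in retrieved_captions for w in c.lower().split() if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_k)]

def image_text_score(image, candidate: str) -> float:
    # Placeholder for a VLM similarity score between the image and the word.
    return {"retriever": 0.8, "dog": 0.7, "park": 0.2}.get(candidate, 0.0)

captions = ["a golden retriever in the park",
            "a dog on the grass",
            "retriever playing with a ball"]
cands = candidate_categories(captions)
print(max(cands, key=lambda c: image_text_score(None, c)))  # best-matching candidate
```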
- Survey on video anomaly detection in dynamic scenes with moving cameras. Runyu Jiao, Yi Wan, Fabio Poiesi, and Yiming Wang. Artificial Intelligence Review, 2023.
The increasing popularity of compact and inexpensive cameras, e.g. dash cameras, body cameras, and cameras equipped on robots, has sparked a growing interest in detecting anomalies within dynamic scenes recorded by moving cameras. However, existing reviews primarily concentrate on Video Anomaly Detection (VAD) methods assuming static cameras. The VAD literature with moving cameras remains fragmented, lacking comprehensive reviews to date. To address this gap, we endeavor to present the first comprehensive survey on Moving Camera Video Anomaly Detection (MC-VAD). We delve into the research papers related to MC-VAD, critically assessing their limitations and highlighting associated challenges. Our exploration encompasses three application domains: security, urban transportation, and marine environments, which in turn cover six specific tasks. We compile an extensive list of 25 publicly-available datasets spanning four distinct environments: underwater, water surface, ground, and aerial. We summarize the types of anomalies these datasets correspond to or contain, and present five main categories of approaches for detecting such anomalies. Lastly, we identify future research directions and discuss novel contributions that could advance the field of MC-VAD. With this survey, we aim to offer a valuable reference for researchers and practitioners striving to develop and advance state-of-the-art MC-VAD methods.
@article{jiao2023survey, title = {Survey on video anomaly detection in dynamic scenes with moving cameras}, author = {Jiao, Runyu and Wan, Yi and Poiesi, Fabio and Wang, Yiming}, journal = {Artificial Intelligence Review}, number = {https://doi.org/10.1007/s10462-023-10609}, year = {2023}, }
- Attentive Multimodal Fusion for Optical and Scene Flow. Youjie Zhou, Guofeng Mei, Yiming Wang, Fabio Poiesi, and Yi Wan. IEEE Robotics and Automation Letters (RA-L), 2023.
This paper presents an investigation into the estimation of optical and scene flow using RGBD information in scenarios where the RGB modality is affected by noise or captured in dark environments. Existing methods typically rely solely on RGB images or fuse the modalities at later stages, which can result in lower accuracy when the RGB information is unreliable. To address this issue, we propose a novel deep neural network approach named FusionRAFT, which enables early-stage information fusion between sensor modalities (RGB and depth). Our approach incorporates self- and cross-attention layers at different network levels to construct informative features that leverage the strengths of both modalities. Through comparative experiments, we demonstrate that our approach outperforms recent methods in terms of performance on the synthetic dataset Flyingthings3D, as well as the generalization on the real-world dataset KITTI. We illustrate that our approach exhibits improved robustness in the presence of noise and low-lighting conditions that affect the RGB images.
@article{zhou2023attentive, title = {Attentive Multimodal Fusion for Optical and Scene Flow}, author = {Zhou, Youjie and Mei, Guofeng and Wang, Yiming and Poiesi, Fabio and Wan, Yi}, journal = {IEEE Robotics and Automation Letters (RA-L)}, year = {2023}, }
- Exploiting multi-granularity visual features for retinal layer segmentation in human eyes. Xiang He, Yiming Wang, Fabio Poiesi, Weiye Song, Quanqing Xu, Zixuan Feng, and Yi Wan. Frontiers in Bioengineering and Biotechnology, 2023.
Accurate segmentation of retinal layer boundaries can facilitate the detection of patients with early ophthalmic disease. Typical segmentation algorithms operate at low resolutions without fully exploiting multi-granularity visual features. Moreover, several related studies do not release their datasets that are key for the research on deep learning-based solutions. We propose a novel end-to-end retinal layer segmentation network based on ConvNeXt, which can retain more feature map details by using a new depth-efficient attention module and multi-scale structures. In addition, we provide a semantic segmentation dataset containing 206 retinal images of healthy human eyes (named NR206 dataset), which is easy to use as it does not require any additional transcoding processing. We experimentally show that our segmentation approach outperforms state-of-the-art approaches on this new dataset, achieving, on average, a Dice score of 91.3% and mIoU of 84.4%. Moreover, our approach achieves state-of-the-art performance on a glaucoma dataset and a diabetic macular edema (DME) dataset, showing that our model is also suitable for other applications.
@article{he2023exploiting, title = {Exploiting multi-granularity visual features for retinal layer segmentation in human eyes}, author = {He, Xiang and Wang, Yiming and Poiesi, Fabio and Song, Weiye and Xu, Quanqing and Feng, Zixuan and Wan, Yi}, journal = {Frontiers in Bioengineering and Biotechnology}, year = {2023}, }
- 3DSGrasp: 3D Shape-Completion for Robotic GraspSaber Seyed Mohammadi, Nuno Ferreira Duarte, Dimitrios Dimou, Yiming Wang, Matteo Taiana, Pietro Morerio, Atabak Dehban, Plinio Moreno, Alexandre Bernardino, Alessio Del Bue, and othersIn Proceedings of IEEE International Conference on Robotics and Automation (ICRA), 2023
Real-world robotic grasping can be done robustly if complete 3D Point Cloud Data (PCD) of an object is available. However, in practice, PCDs are often incomplete when objects are viewed from few and sparse viewpoints before the grasping action, leading to the generation of wrong or inaccurate grasp poses. We propose a novel grasping strategy, named 3DSGrasp, that predicts the missing geometry from the partial PCD to produce reliable grasp poses. Our proposed PCD completion network is a Transformer-based encoder-decoder network with an Offset-Attention layer. Our network is inherently invariant to object pose and point permutation, generating PCDs that are geometrically consistent and properly completed. Experiments on a wide range of partial PCDs show that 3DSGrasp outperforms the best state-of-the-art method on PCD completion tasks and largely improves the grasping success rate in real-world scenarios.
@inproceedings{seyed20233dsgrasp, title = {3DSGrasp: 3D Shape-Completion for Robotic Grasp}, author = {Seyed Mohammadi, Saber and Ferreira Duarte, Nuno and Dimou, Dimitrios and Wang, Yiming and Taiana, Matteo and Morerio, Pietro and Dehban, Atabak and Moreno, Plinio and Bernardino, Alexandre and Del Bue, Alessio and others}, booktitle = {Proceedings of IEEE International Conference on Robotics and Automation (ICRA)}, year = {2023}, }
- Leveraging commonsense for object localisation in partial scenesFrancesco Giuliari, Geri Skenderi, Marco Cristani, Alessio Del Bue, and Yiming WangIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
We propose an end-to-end solution to address the problem of object localisation in partial scenes, where we aim to estimate the position of an object in an unknown area given only a partial 3D scan of the scene. We propose a novel scene representation to facilitate geometric reasoning, the Directed Spatial Commonsense Graph (D-SCG), a spatial scene graph enriched with additional concept nodes from a commonsense knowledge base. Specifically, the nodes of D-SCG represent the scene objects and the edges are their relative positions. Each object node is then connected via different commonsense relationships to a set of concept nodes. With the proposed graph-based scene representation, we estimate the unknown position of the target object using a Graph Neural Network that implements a novel attentional message passing mechanism. The network first predicts the relative positions between the target object and each visible object by learning a rich representation of the objects via aggregating both the object nodes and the concept nodes in D-SCG. These relative positions are then merged to obtain the final position. We evaluate our method using Partial ScanNet, improving the state of the art by 5.9% in localisation accuracy with an 8x faster training speed.
@article{giuliari2023leveraging, title = {Leveraging commonsense for object localisation in partial scenes}, author = {Giuliari, Francesco and Skenderi, Geri and Cristani, Marco and Del Bue, Alessio and Wang, Yiming}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, year = {2023}, }
- ConfMix: Unsupervised Domain Adaptation for Object Detection via Confidence-based MixingGiulio Mattolin, Luca Zanella, Elisa Ricci, and Yiming WangIn Winter Conference on Applications of Computer Vision (WACV), 2023
Unsupervised Domain Adaptation (UDA) for object detection aims to adapt a model trained on a source domain to detect instances from a new target domain for which annotations are not available. Different from traditional approaches, we propose ConfMix, the first method that introduces a sample mixing strategy based on region-level detection confidence for adaptive object detector learning. We mix the local region of the target sample that corresponds to the most confident pseudo detections with a source image, and apply an additional consistency loss term to gradually adapt towards the target data distribution. In order to robustly define a confidence score for a region, we exploit the confidence score per pseudo detection that accounts for both the detector-dependent confidence and the bounding box uncertainty. Moreover, we propose a novel pseudo-labelling scheme that progressively filters the pseudo target detections using a confidence metric that varies from loose to strict as training progresses. We perform extensive experiments with three datasets, achieving state-of-the-art performance in two of them and approaching the supervised target model performance in the other. (An illustrative sketch of the mixing step follows this entry.)
@inproceedings{mattolin2023confmix, title = {ConfMix: Unsupervised Domain Adaptation for Object Detection via Confidence-based Mixing}, author = {Mattolin, Giulio and Zanella, Luca and Ricci, Elisa and Wang, Yiming}, booktitle = {Winter Conference on Applications of Computer Vision (WACV)}, year = {2023}, }
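The following is a minimal sketch of the confidence-based mixing idea from the ConfMix entry above, not the authors' implementation: it scores the four quadrants of a target image by the mean confidence of the pseudo-detections they contain and pastes the highest-scoring quadrant onto a source image of the same resolution. The quadrant split, detection tuple format, and function name are assumptions made for illustration.

```python
# Illustrative sketch only: confidence-based region mixing for an unlabelled target image.
import numpy as np


def confidence_based_mix(source_img, target_img, target_dets):
    """source_img, target_img: HxWx3 arrays of equal size; target_dets: (x1, y1, x2, y2, conf) tuples."""
    h, w = target_img.shape[:2]
    # Score each quadrant by the mean confidence of pseudo-detections whose centre falls inside it.
    quadrants = [(0, 0), (0, w // 2), (h // 2, 0), (h // 2, w // 2)]
    scores = []
    for top, left in quadrants:
        confs = [c for (x1, y1, x2, y2, c) in target_dets
                 if top <= (y1 + y2) / 2 < top + h // 2 and left <= (x1 + x2) / 2 < left + w // 2]
        scores.append(np.mean(confs) if confs else 0.0)
    top, left = quadrants[int(np.argmax(scores))]
    # Paste the most confident target quadrant onto a copy of the source image.
    mixed = source_img.copy()
    mixed[top:top + h // 2, left:left + w // 2] = target_img[top:top + h // 2, left:left + w // 2]
    return mixed, float(np.max(scores))
```

In the full method the mixed sample would feed the consistency loss mentioned in the abstract; that part is omitted here.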
- PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image TranslationBin Ren, Hao Tang, Yiming Wang, Xia Li, Wei Wang, and Nicu SebeIn Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
For semantic-guided cross-view image translation, it is crucial to learn where to sample pixels from the source view image and where to reallocate them guided by the target view semantic map, especially when there is little overlap or drastic view difference between the source and target images. Hence, one not only needs to encode the long-range dependencies among pixels in both the source view image and target view semantic map but also needs to translate these learned dependencies. To this end, we propose a novel generative adversarial network, PI-Trans, which mainly consists of a novel Parallel-ConvMLP module and an Implicit Transformation module at multiple semantic levels. Extensive experimental results show that PI-Trans achieves the best qualitative and quantitative performance by a large margin compared to the state-of-the-art methods on two challenging datasets.
@inproceedings{ren2023pi, title = {PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image Translation}, author = {Ren, Bin and Tang, Hao and Wang, Yiming and Li, Xia and Wang, Wei and Sebe, Nicu}, booktitle = {Proceedings of International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}, year = {2023}, }
2022
- Responsible AI at the edge: towards privacy-preserving smart citiesLuca Zanella, Yiming Wang, Nicola Dall’Asen, Alberto Ancilotto, Francesco Paissan, Elisa Ricci, Elisabetta Farella, Alessio Brutti, Marco Pistore, and othersIn Ital-IA 2022, Secondo Convegno del Laboratorio nazionale CINI-AIIS, 2022
With the massive amount of data produced by ambient environmental sensors, many AI-based solutions are emerging to support new smart city applications. However, these data may contain sensitive personal information, calling for responsible AI solutions. FBK proposes a privacy-preserving subsystem with a set of technological components that enable responsible AI and prevent unauthorised usage of personal data during data storage and transmission in the context of Smart Cities. We demonstrate the proposed solution within the EU project MARVEL, where both video and audio anonymisation components are deployed at the edge, enabled by a model compression component for complexity reduction. We discuss each component’s technical challenges, current progress, and future directions.
@incollection{zanella2022responsible, title = {Responsible AI at the edge: towards privacy-preserving smart cities}, author = {Zanella, Luca and Wang, Yiming and Dall’Asen, Nicola and Ancilotto, Alberto and Paissan, Francesco and Ricci, Elisa and Farella, Elisabetta and Brutti, Alessio and Pistore, Marco and others}, booktitle = {Ital-IA 2022, Secondo Convegno del Laboratorio nazionale CINI-AIIS}, year = {2022}, }
- Fast re-OBJ: Real-time object re-identification in rigid scenesErtuğrul Bayraktar, Yiming Wang, and Alessio Del BueMachine Vision and Applications, 2022
@article{bayraktar2022fast, title = {Fast re-OBJ: Real-time object re-identification in rigid scenes}, author = {Bayraktar, Ertu{\u{g}}rul and Wang, Yiming and Del Bue, Alessio}, journal = {Machine Vision and Applications}, year = {2022}, }
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognitionAlessandro Conti, Paolo Rota, Yiming Wang, and Elisa RicciIn Proceedings of British Machine Vision Conference (BMVC), 2022
Automatically understanding emotions from visual data is a fundamental task for human behaviour understanding. While models devised for Facial Expression Recognition (FER) have demonstrated excellent performances on many datasets, they often suffer from severe performance degradation when trained and tested on different datasets due to domain shift. In addition, as face images are considered highly sensitive data, the accessibility to large-scale datasets for model training is often denied. In this work, we tackle the above-mentioned problems by proposing the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for FER. Our method exploits self-supervised pretraining to learn good feature representations from the target data and proposes a novel and robust cluster-level pseudo-labelling strategy that accounts for in-cluster statistics. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER, and is on par with methods addressing FER in the UDA setting. (An illustrative sketch of cluster-level pseudo-labelling follows this entry.)
@inproceedings{conti2022cluster, title = {Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition}, author = {Conti, Alessandro and Rota, Paolo and Wang, Yiming and Ricci, Elisa}, booktitle = {Proceedings of British Machine Vision Conference (BMVC)}, year = {2022}, }
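Below is a simplified sketch of cluster-level pseudo-labelling, not the paper's exact strategy: target features are clustered with k-means and every sample in a cluster inherits the label most frequently predicted by the model within that cluster, so labels follow in-cluster statistics rather than per-sample confidence. The use of scikit-learn's KMeans and the function signature are assumptions made for the example.

```python
# Illustrative sketch only: assign pseudo-labels at the cluster level via in-cluster majority vote.
import numpy as np
from sklearn.cluster import KMeans


def cluster_pseudo_labels(features: np.ndarray, model_preds: np.ndarray, n_clusters: int) -> np.ndarray:
    """features: (N, D) target embeddings; model_preds: (N,) classes predicted per sample."""
    assignments = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
    pseudo = np.empty_like(model_preds)
    for k in range(n_clusters):
        members = assignments == k
        # Majority vote over the model's predictions inside the cluster.
        values, counts = np.unique(model_preds[members], return_counts=True)
        pseudo[members] = values[np.argmax(counts)]
    return pseudo
```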
- Spatial Commonsense Graph for Object Localisation in Partial ScenesFrancesco Giuliari, Geri Skenderi, Marco Cristani, Yiming Wang, and Alessio Del BueIn Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2022
We solve object localisation in partial scenes, a new problem of estimating the unknown position of an object (e.g. where is the bag?) given a partial 3D scan of a scene. The proposed solution is based on a novel scene graph model, the Spatial Commonsense Graph (SCG), where objects are the nodes and edges define pairwise distances between them, enriched by concept nodes and relationships from a commonsense knowledge base. This allows the SCG to better generalise its spatial inference over unknown 3D scenes. The SCG is used to estimate the unknown position of the target object in two steps: first, we feed the SCG into a novel Proximity Prediction Network, a graph neural network that uses attention to perform distance prediction between the node representing the target object and the nodes representing the observed objects in the SCG; second, we propose a Localisation Module based on circular intersection to estimate the object position using all the predicted pairwise distances, in order to be independent of any reference system. We create a new dataset of partially reconstructed scenes to benchmark our method and baselines for object localisation in partial scenes, where our proposed method achieves the best localisation performance. (An illustrative sketch of localisation from pairwise distances follows this entry.)
@inproceedings{giuliari2022spatial, title = {Spatial Commonsense Graph for Object Localisation in Partial Scenes}, author = {Giuliari, Francesco and Skenderi, Geri and Cristani, Marco and Wang, Yiming and Del Bue, Alessio}, booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2022}, }
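As a worked illustration of recovering a position from pairwise distances, in the spirit of the circular-intersection Localisation Module (though not the paper's code), the sketch below solves the multilateration problem by least squares from known positions of observed objects and predicted target distances; the names and the 2D setting are choices made for the example.

```python
# Illustrative sketch only: least-squares multilateration from pairwise distances.
import numpy as np


def localise_from_distances(anchors: np.ndarray, distances: np.ndarray) -> np.ndarray:
    """anchors: (N, 2) observed object positions; distances: (N,) predicted target-to-object distances."""
    # Subtracting the circle equation of the last anchor from the others yields a linear system A p = b.
    x0, y0 = anchors[-1]
    d0 = distances[-1]
    A = 2 * (anchors[:-1] - anchors[-1])
    b = (np.sum(anchors[:-1] ** 2, axis=1) - x0 ** 2 - y0 ** 2) + (d0 ** 2 - distances[:-1] ** 2)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p


anchors = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
target = np.array([1.0, 1.0])
print(localise_from_distances(anchors, np.linalg.norm(anchors - target, axis=1)))  # ~[1. 1.]
```

With noisy predicted distances, the least-squares solution approximates the intersection of the distance circles without relying on any external reference system beyond the observed objects themselves.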
- SVP-Classifier: Single-View Point cloud data classifier with multi-view hallucinationSeyed Saber Mohammadi, Yiming Wang, Matteo Taiana, Pietro Morerio, and Alessio Del BueIn Proceedings of International Conference on Image Analysis and Processing (ICIAP), 2022
We address single-view 3D shape classification with partial Point Cloud Data (PCD) inputs. Conventional PCD classifiers achieve the best performance when trained and evaluated with complete 3D object scans. However, they all experience a performance drop when trained and evaluated on partial single-view PCD. We propose a Single-View PCD Classifier (SVP-Classifier), which first hallucinates the features of other viewpoints covering the unseen part of the object with a Conditional Variational Auto-Encoder (CVAE). It then aggregates the hallucinated multi-view features with a multi-level Graph Convolutional Network (GCN) to form a global shape representation that helps to improve the single-view PCD classification performance. With experiments on the single-view PCDs generated from ModelNet40 and ScanObjectNN, we prove that the proposed SVP-Classifier outperforms the best single-view PCD-based methods, after they have been retrained on single-view PCDs, thus reducing the gap between single-view methods and methods that employ complete PCDs.
@inproceedings{mohammadi2022svp, title = {SVP-Classifier: Single-View Point cloud data classifier with multi-view hallucination}, author = {Mohammadi, Seyed Saber and Wang, Yiming and Taiana, Matteo and Morerio, Pietro and Del Bue, Alessio}, booktitle = {Proceedings of International Conference on Image Analysis and Processing (ICIAP)}, year = {2022}, }
- Graph-based Generative Face Anonymisation with Pose PreservationNicola Dall’Asen, Yiming Wang, Hao Tang, Luca Zanella, and Elisa RicciIn Proceedings of International Conference on Image Analysis and Processing (ICIAP), 2022
We propose AnonyGAN, a GAN-based solution for face anonymisation which replaces the visual information corresponding to a source identity with a condition identity provided as any single image. With the goal of maintaining the geometric attributes of the source face, i.e., the facial pose and expression, and of promoting more natural face generation, we propose to exploit a Bipartite Graph to explicitly model the relations between the facial landmarks of the source identity and those of the condition identity through a deep model. We further propose a landmark attention model to relax the manual selection of facial landmarks, allowing the network to weight the landmarks for the best visual naturalness and pose preservation. Finally, to facilitate appearance learning, we propose a hybrid training strategy to address the challenge caused by the lack of direct pixel-level supervision. We evaluate our method and its variants on two public datasets, CelebA and LFW, in terms of visual naturalness, facial pose preservation, and their impact on face detection and re-identification. We show that AnonyGAN significantly outperforms the state-of-the-art methods in terms of visual naturalness, face detection and pose preservation.
@inproceedings{dall2022graph, title = {Graph-based Generative Face Anonymisation with Pose Preservation}, author = {Dall’Asen, Nicola and Wang, Yiming and Tang, Hao and Zanella, Luca and Ricci, Elisa}, booktitle = {Proceedings of International Conference on Image Analysis and Processing (ICIAP)}, year = {2022}, }
- Loop closure detection using local 3D deep descriptorsYoujie Zhou, Yiming Wang, Fabio Poiesi, Qi Qin, and Yi WanIEEE Robotics and Automation Letters (RA-L), 2022
We propose a simple yet effective method to address loop closure detection in simultaneous localisation and mapping using local 3D deep descriptors (L3Ds). L3Ds are emerging compact representations of patches extracted from point clouds that are learnt from data using a deep learning algorithm. We propose a novel overlap measure for loop detection by computing the metric error between points that correspond to mutually-nearest-neighbour descriptors after registering the loop candidate point cloud by its estimated relative pose. This novel approach enables us to accurately detect loops and estimate six degrees-of-freedom poses in the case of small overlaps. We compare our L3D-based loop closure approach with recent approaches on LiDAR data and achieve state-of-the-art loop closure detection accuracy. Additionally, we embed our loop closure approach in RESLAM, a recent edge-based SLAM system, and perform the evaluation on real-world RGBD-TUM and synthetic ICL datasets. Our approach enables RESLAM to achieve a better localisation accuracy compared to its original loop closure strategy. (An illustrative sketch of the overlap measure follows this entry.)
@article{zhou2022loop, title = {Loop closure detection using local 3D deep descriptors}, author = {Zhou, Youjie and Wang, Yiming and Poiesi, Fabio and Qin, Qi and Wan, Yi}, journal = {IEEE Robotics and Automation Letters (RA-L)}, year = {2022}, }
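The sketch below illustrates the kind of overlap measure described above, assuming per-point descriptors and an estimated relative pose are already available; it is a simplification, not the paper's implementation, and the threshold value and function names are placeholders.

```python
# Illustrative sketch only: overlap from mutually-nearest-neighbour descriptor matches.
import numpy as np


def overlap_score(points_q, descs_q, points_c, descs_c, T_cq, inlier_thr=0.1):
    """points_*: (N, 3) points; descs_*: (N, D) descriptors; T_cq: 4x4 pose mapping candidate to query frame."""
    # Pairwise descriptor distances and mutual nearest neighbours.
    d = np.linalg.norm(descs_q[:, None, :] - descs_c[None, :, :], axis=2)
    nn_q = d.argmin(axis=1)                                  # best candidate match per query point
    nn_c = d.argmin(axis=0)                                  # best query match per candidate point
    mutual = np.where(nn_c[nn_q] == np.arange(len(descs_q)))[0]
    if len(mutual) == 0:
        return 0.0
    # Transform matched candidate points into the query frame with the estimated pose.
    aligned = (T_cq[:3, :3] @ points_c[nn_q[mutual]].T).T + T_cq[:3, 3]
    # Fraction of mutual matches whose metric error after alignment is below the threshold.
    err = np.linalg.norm(aligned - points_q[mutual], axis=1)
    return float(np.mean(err < inlier_thr))


pts = np.random.default_rng(0).normal(size=(100, 3))
desc = np.random.default_rng(1).normal(size=(100, 8))
print(overlap_score(pts, desc, pts, desc, np.eye(4)))  # identical clouds, identity pose -> 1.0
```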
2021
- Marvel: Multimodal extreme scale data analytics for smart cities environmentsDragana Bajovic, Arian Bakhtiarnia, George Bravos, Alessio Brutti, Felix Burkhardt, Daniel Cauchi, Antony Chazapis, Claire Cianco, Nicola Dall’Asen, Vlado Delic, and othersIn Proceedings of International Balkan Conference on Communications and Networking (BalkanCom), 2021
A Smart City based on data acquisition, handling and intelligent analysis requires efficient design and implementation of the respective AI technologies and the underlying infrastructure for seamlessly analyzing the large amounts of data in real-time. The EU project MARVEL will research solutions that can improve the integration of multiple data sources in a Smart City environment for harnessing the advantages rooted in multimodal perception of the surrounding environment.
@inproceedings{bajovic2021marvel, title = {Marvel: Multimodal extreme scale data analytics for smart cities environments}, author = {Bajovic, Dragana and Bakhtiarnia, Arian and Bravos, George and Brutti, Alessio and Burkhardt, Felix and Cauchi, Daniel and Chazapis, Antony and Cianco, Claire and Dall’Asen, Nicola and Delic, Vlado and others}, booktitle = {Proceedings of International Balkan Conference on Communications and Networking (BalkanCom)}, pages = {143--147}, year = {2021}, }
- Pointview-gcn: 3d shape classification with multi-view point cloudsSeyed Saber Mohammadi, Yiming Wang, and Alessio Del BueIn Proceedings of IEEE International Conference on Image Processing (ICIP), 2021
We address 3D shape classification with partial point cloud inputs captured from multiple viewpoints around the object. Different from existing methods that perform classification on the complete point cloud by first registering the multi-view captures, we propose PointView-GCN with multi-level Graph Convolutional Networks (GCNs) to hierarchically aggregate the shape features of single-view point clouds, in order to encode both the geometrical cues of an object and their multi-view relations. With experiments on our novel single-view datasets, we show that PointView-GCN produces a more descriptive global shape feature which consistently improves the classification accuracy by ∼5% compared to classifiers with single-view point clouds, and outperforms the state-of-the-art methods that use complete point clouds on ModelNet40.
@inproceedings{mohammadi2021pointview, title = {Pointview-gcn: 3d shape classification with multi-view point clouds}, author = {Mohammadi, Seyed Saber and Wang, Yiming and Del Bue, Alessio}, booktitle = {Proceedings of IEEE International Conference on Image Processing (ICIP)}, pages = {3103--3107}, year = {2021}, }
- End-to-end pairwise human proxemics from uncalibrated single imagesPietro Morerio, Matteo Bustreo, Yiming Wang, and Alessio Del BueIn Proceedings of IEEE International Conference on Image Processing (ICIP), 2021
In this work, we address the ill-posed problem of estimating pairwise metric distances between people using only a single uncalibrated image. We propose an end-to-end model, DeepProx, that takes as input two skeletal joints as sets of 2D image coordinates and outputs the metric distance between them. We show that performance improves with a geometric loss based on simplified camera parameters provided at training time. Further, DeepProx generalises remarkably well to novel viewpoints through domain generalisation techniques. We validate our proposed method quantitatively and qualitatively against baselines on public datasets for which we provide ground truth on interpersonal distances.
@inproceedings{morerio2021end, title = {End-to-end pairwise human proxemics from uncalibrated single images}, author = {Morerio, Pietro and Bustreo, Matteo and Wang, Yiming and Del Bue, Alessio}, booktitle = {Proceedings of IEEE International Conference on Image Processing (ICIP)}, pages = {3058--3062}, year = {2021}, }
- POMP++: Pomcp-based Active Visual Search in unknown indoor environmentsFrancesco Giuliari, Alberto Castellini, Riccardo Berra, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, Francesco Setti, and Yiming WangIn Proceedings of International Conference on Intelligent Robots and Systems (IROS), 2021
In this paper we focus on the problem of learning an optimal policy online for Active Visual Search (AVS) of objects in unknown indoor environments. We propose POMP++, a planning strategy that introduces a novel formulation on top of the classic Partially Observable Monte Carlo Planning (POMCP) framework, to allow training-free online policy learning in unknown environments. We present a new belief reinvigoration strategy that allows POMCP to be used with a dynamically growing state space, addressing the online generation of the floor map. We evaluate our method on two public benchmark datasets, AVD, acquired by real robotic platforms, and Habitat ObjectNav, rendered from real 3D scene scans, achieving the best success rate with an improvement of more than 10% over state-of-the-art methods.
@inproceedings{giuliari2021pomp++, title = {POMP++: Pomcp-based Active Visual Search in unknown indoor environments}, author = {Giuliari, Francesco and Castellini, Alberto and Berra, Riccardo and Del Bue, Alessio and Farinelli, Alessandro and Cristani, Marco and Setti, Francesco and Wang, Yiming}, booktitle = {Proceedings of International Conference on Intelligent Robots and Systems (IROS)}, year = {2021}, }
- Single Image Human Proxemics Estimation for Visual Social DistancingMaya Aghaei, Matteo Bustreo, Yiming Wang, Pietro Morerio, and Alessio Del BueIn Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), 2021
In this work, we address the problem of estimating the so-called "Social Distancing" given a single uncalibrated image in unconstrained scenarios. We propose a semi-automatic solution to approximate the homography matrix between the scene ground plane and the image plane. With the estimated homography, we then leverage an off-the-shelf pose detector to detect body poses in the image and to reason about inter-personal distances using the length of body parts. Inter-personal distances are further locally inspected to detect possible violations of the social distancing rules. We validate our proposed method quantitatively and qualitatively against baselines on public domain datasets for which we provide ground truth on inter-personal distances. Moreover, we demonstrate our method deployed in a real testing scenario where statistics on inter-personal distances are currently used to improve safety in a critical environment. (An illustrative sketch of the homography-based distance measurement follows this entry.)
@inproceedings{aghaei2021single, title = {Single Image Human Proxemics Estimation for Visual Social Distancing}, author = {Aghaei, Maya and Bustreo, Matteo and Wang, Yiming and Morerio, Pietro and Del Bue, Alessio}, booktitle = {Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV)}, year = {2021}, }
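As a small illustration of the measurement step described above, assuming the ground-plane homography has already been estimated (the function name and the placeholder homography are choices made for the example), the sketch below projects two people's feet positions to the ground plane and returns their metric distance.

```python
# Illustrative sketch only: inter-personal distance via a ground-plane homography.
import numpy as np


def ground_distance(feet_a_px, feet_b_px, H):
    """feet_*_px: (x, y) pixel coordinates; H: 3x3 homography mapping image points to ground coordinates."""
    def to_ground(p):
        x, y, w = H @ np.array([p[0], p[1], 1.0])
        return np.array([x / w, y / w])
    return float(np.linalg.norm(to_ground(feet_a_px) - to_ground(feet_b_px)))


H = np.eye(3)  # placeholder homography; a real one would come from the semi-automatic estimation step
print(ground_distance((320, 700), (520, 710), H))
```

The resulting distances can then be checked against a safety threshold to flag possible violations, as described in the abstract.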
2020
- POMP: Pomcp-based Online Motion Planning for active visual search in indoor environmentsYiming Wang, Francesco Giuliari, Riccardo Berra, Alberto Castellini, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, and Francesco SettiIn Proceedings of British Machine Vision Conference (BMVC), 2020
In this paper we focus on the problem of learning an optimal policy for Active Visual Search (AVS) of objects in known indoor environments with an online setup. Our POMP method uses as input the current pose of an agent (e.g. a robot) and an RGB-D frame. The task is to plan the next move that brings the agent closer to the target object. We model this problem as a Partially Observable Markov Decision Process solved by a Monte-Carlo planning approach. This allows us to make decisions on the next moves by iterating over the known scenario at hand, exploring the environment and searching for the object at the same time. Differently from the current state of the art in Reinforcement Learning, POMP does not require extensive and expensive (in time and computation) labelled data, making it very agile in solving AVS in small and medium real scenarios. We only require the floor map of the environment, which is usually available or can be easily extracted from a single a priori exploration run. We validate our method on the publicly available AVD benchmark, achieving an average success rate of 0.76 with an average path length of 17.1, performing close to the state of the art but without any training needed. Additionally, we show experimentally the robustness of our method when the quality of the object detection degrades from ideal to faulty.
@inproceedings{wang2020pomp, title = {POMP: Pomcp-based Online Motion Planning for active visual search in indoor environments}, author = {Wang, Yiming and Giuliari, Francesco and Berra, Riccardo and Castellini, Alberto and Del Bue, Alessio and Farinelli, Alessandro and Cristani, Marco and Setti, Francesco}, booktitle = {Proceedings of British Machine Vision Conference (BMVC)}, year = {2020}, }
- Where to Explore Next? ExHistCNN for History-aware Autonomous 3D ExplorationYiming Wang and Alessio Del BueIn Proceedings of European Conference on Computer Vision (ECCV), 2020
In this work we address the problem of autonomous 3D exploration of an unknown indoor environment using a depth camera. We cast the problem as the estimation of the Next Best View (NBV) that maximises the coverage of the unknown area. We do this by re-formulating NBV estimation as a classification problem and we propose a novel learning-based metric that encodes both the current 3D observation (a depth frame) and the history of the ongoing reconstruction. One of the major contributions of this work is the introduction of a new representation of the 3D reconstruction history as an auxiliary utility map, which is efficiently coupled with the current depth observation. With both pieces of information, we train a lightweight CNN, named ExHistCNN, that estimates the NBV as a set of directions towards which the depth sensor finds the most unexplored area. We perform extensive evaluation on both synthetic and real room scans demonstrating that the proposed ExHistCNN is able to approach the exploration performance of an oracle using the complete knowledge of the 3D environment.
@inproceedings{wang2020explore, title = {Where to Explore Next? ExHistCNN for History-aware Autonomous 3D Exploration}, author = {Wang, Yiming and Del Bue, Alessio}, booktitle = {Proceedings of European Conference on Computer Vision (ECCV)}, year = {2020}, }
2019
- Active 3D Classification of Multiple Objects in Cluttered ScenesYiming Wang, Marco Carletti, Francesco Setti, Marco Cristani, and Alessio Del BueIn Proceedings of International Workshop on Assistive Computer Vision and Robotics, in Conjunction With ICCV (ICCVW), 2019
Autonomous agents that need to effectively move and interact in a realistic environment have to be endowed with robust perception skills. Among many, accurate object classification is an essential supporting element for assistive robotics. However, realistic scenarios often present scenes with severe clutter, which dramatically degrades the performance of current object classification methods. This paper presents an active vision approach that improves the accuracy of 3D object classification through a next-best-view (NBV) paradigm. The next camera motion is chosen with criteria that aim to avoid object self-occlusions while exploring as much as possible of the surrounding area. An online 3D reconstruction module is exploited in our system to obtain a better canonical 3D representation of the scene while moving the sensor. By reducing the impact of occlusions, we show with both synthetic and real-world data that in a few moves the approach can surpass a state-of-the-art method, PointNet, on single-view object classification from depth data. In addition, we demonstrate our system in a practical scenario where a depth sensor moves to search for and classify a set of objects in cluttered scenes.
@inproceedings{wang2019active, title = {Active 3D Classification of Multiple Objects in Cluttered Scenes}, author = {Wang, Yiming and Carletti, Marco and Setti, Francesco and Cristani, Marco and Del Bue, Alessio}, booktitle = {Proceedings of International Workshop on Assistive Computer Vision and Robotics, in Conjunction With ICCV (ICCVW)}, year = {2019}, }
- Autonomous 3D reconstruction, mapping and exploration of indoor environments with a robotic armYiming Wang, Stuart James, Elisavet Konstantina Stathopoulou, Carlos Beltrán-González, Yoshinori Konishi, and Alessio Del BueIEEE Robotics and Automation Letters (RA-L), 2019
We propose a novel information gain metric that combines hand-crafted and data-driven metrics to address the next best view problem for autonomous 3-D mapping of unknown indoor environments. For the hand-crafted metric, we propose an entropy-based information gain that accounts for the previous view points to prevent the camera from revisiting the same location and to promote motion toward unexplored or occluded areas. For the learnt metric, we adopt a convolutional neural network (CNN) architecture and formulate the problem as a classification problem. The CNN takes the current depth image as input and outputs the motion direction that suggests the largest unexplored surface. We train and test the CNN using a new synthetic dataset based on the SUNCG dataset. The learnt motion direction is then combined with the proposed hand-crafted metric to help handle situations where the hand-crafted metric alone tends to face ambiguities. We finally evaluate the autonomous paths over several real and synthetic indoor scenes, including complex industrial and domestic settings, and show that our combined metric is able to further improve the exploration coverage compared to using only the proposed hand-crafted metric. (An illustrative sketch of an entropy-based information gain follows this entry.)
@article{wang2019autonomous, title = {Autonomous 3D reconstruction, mapping and exploration of indoor environments with a robotic arm}, author = {Wang, Yiming and James, Stuart and Stathopoulou, Elisavet Konstantina and Beltrán-González, Carlos and Konishi, Yoshinori and Del Bue, Alessio}, journal = {IEEE Robotics and Automation Letters (RA-L)}, year = {2019}, }
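The sketch below gives a minimal, assumed form of an entropy-based information gain over a voxel occupancy grid: a candidate view is scored by the total binary entropy of the voxels it would observe, so views covering more uncertain (unexplored or occluded) space score higher. This is an illustration of the general idea only; the paper's hand-crafted metric additionally accounts for previously visited view points, which is omitted here.

```python
# Illustrative sketch only: entropy-based information gain for a candidate view.
import numpy as np


def view_information_gain(occupancy_probs: np.ndarray, visible_idx: np.ndarray) -> float:
    """occupancy_probs: (V,) per-voxel occupancy probabilities; visible_idx: indices of voxels seen by the view."""
    p = np.clip(occupancy_probs[visible_idx], 1e-6, 1 - 1e-6)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # binary entropy per voxel, in bits
    return float(entropy.sum())


probs = np.full(1000, 0.5)      # unknown voxels start at 0.5 occupancy probability
probs[:400] = 0.05              # already-explored free space has low uncertainty
view_a, view_b = np.arange(0, 300), np.arange(600, 900)
print(view_information_gain(probs, view_a) < view_information_gain(probs, view_b))  # True
```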
2018
- Concurrent target following with active directional sensorsYiming Wang and Andrea CavallaroIn Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
We propose a collision-avoidance tracker for agents with a directional sensor that aim to maintain a moving target in their field of view. The proposed tracker addresses the view maintenance issue within an Optimal Reciprocal Collision Avoidance (ORCA) framework. Our tracking agents adaptively share the responsibility of avoiding each other and minimise with a smooth actuation the deviation angle from their heading direction to their target. Experimental results with real people trajectories from public datasets show that the proposed method improves view maintenance.
@inproceedings{wang2018concurrent, title = {Concurrent target following with active directional sensors}, author = {Wang, Yiming and Cavallaro, Andrea}, booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages = {6603--6607}, year = {2018}, }
2017
- Active visual tracking in multi-agent scenariosYiming Wang and Andrea CavallaroIn Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017
We propose an active visual tracker with collision avoidance for camera-equipped robots in dense multi-agent scenarios. The objective of each tracking agent (robot) is to maintain visual fixation on its moving target while updating its velocity to avoid other agents. However, when multiple robots are present or targets frequently intersect each other, robots may have no accessible collision-avoiding paths. We address this problem with an adaptive mechanism that sets the pair-wise responsibilities to increase the total accessible collision-avoiding controls. The final collision-avoiding control accounts for motion smoothness and view performance, i.e. maintaining the target centered in the field of view and at a certain size. We validate the proposed approach under different target-intersecting scenarios and compare it with the Optimal Reciprocal Collision Avoidance and Reciprocal Velocity Obstacle methods.
@inproceedings{wang2017active, title = {Active visual tracking in multi-agent scenarios}, author = {Wang, Yiming and Cavallaro, Andrea}, booktitle = {Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)}, pages = {1--6}, year = {2017}, }
2016
- Prioritized target tracking with active collaborative camerasYiming Wang and Andrea CavallaroIn Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2016
Mobile cameras on robotic platforms can support fixed multi-camera installations to improve coverage and target localization accuracy. We propose a novel collaborative framework for prioritized target tracking that complements static cameras with mobile cameras, which track targets on demand. Upon receiving a request from static cameras, a mobile camera selects (or switches to) a target to track using a local selection criterion that accounts for target priority, view quality and energy consumption. Mobile cameras use a receding horizon scheme to minimize tracking uncertainty as well as energy consumption when planning their path. We validate the proposed framework in simulated realistic scenarios and show that it improves tracking accuracy and target observation time with reduced energy consumption compared to a framework with only static cameras and compared to a state-of-the-art motion strategy.
@inproceedings{wang2016prioritized, title = {Prioritized target tracking with active collaborative cameras}, author = {Wang, Yiming and Cavallaro, Andrea}, booktitle = {Proceedings of IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)}, pages = {131--137}, year = {2016}, }
2015
- Coalition formation for distributed tracking in wireless camera networksYiming Wang and Andrea CavallaroIn IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2015
We present a fully distributed framework for multi-target tracking with bandwidth-limited (wireless) camera networks. Cameras self-organize into coalitions to perform the task of distributed target tracking via local interactions. Each camera joins the coalitions based on considerations of marginal utility, which takes into account tracking confidence and communication performance in the neighborhood of the camera. The proposed framework achieves higher tracking accuracy and quicker convergence than decentralized tracking or distributed tracking without coalition formation. Moreover, the communication cost of the proposed framework is considerably reduced compared to distributed tracking without coalition formation and comparable to decentralized tracking as the number of targets increases.
@inproceedings{wang2015coalition, title = {Coalition formation for distributed tracking in wireless camera networks}, author = {Wang, Yiming and Cavallaro, Andrea}, booktitle = {IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)}, year = {2015}, }