| 1 | kh: Symmetry Understanding of 3D Shapes via Chirality Disentanglement | Weikang Wang, Tobias Weißberg, Nafie El Amrani, Florian Bernard | | 0 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Wang_kh_Symmetry_Understanding_of_3D_Shapes_via_Chirality_Disentanglement_ICCV_2025_paper.html) | | | | | | | Chirality information (i.e. information that allows distinguishing left from right) is ubiquitous across various data modalities in computer vision, including images, videos, point clouds, and meshes. While chirality has been extensively studied in the image domain, its exploration in shape analysis (such as point clouds and meshes) remains underdeveloped. Although many shape vertex descriptors have shown appealing properties (e.g. robustness to rigid-body transformations), they are often unable to disambiguate between left and right symmetric parts. Given the ubiquity of chirality information in shape analysis problems and the lack of chirality-aware features in current shape descriptors, developing a chirality-aware feature extractor becomes both necessary and pressing. Based on the recent Diff3F framework, we propose an unsupervised chirality feature extraction pipeline that endows shape vertices with chirality-aware information extracted from 2D foundation models. We evaluate the extracted chirality features through quantitative and qualitative experiments across diverse datasets. Results on downstream tasks including left-right disentanglement, shape matching, and part segmentation demonstrate their effectiveness and practical utility. Project page: https://wei-kang-wang.github.io/chirality/ | 0 |
| 2 | Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy | Yiting Yang, Hao Luo, Yuan Sun, Qingsen Yan, Haokui Zhang, Wei Dong, Guoqing Wang, Peng Wang, Yang Yang, Hengtao Shen | [2507.13260](https://huggingface.co/papers/2507.13260) | 0 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Yang_Efficient_Adaptation_of_Pre-trained_Vision_Transformer_underpinned_by_Approximately_Orthogonal_ICCV_2025_paper.html) | | | | | | | A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this study, we observe that any two row or column vectors within any weight matrix of the backbone are approximately orthogonal; however, this property is absent from the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying enhanced generalization capability. If the fine-tuned down/up-projection matrices exhibited the same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further improved? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices. Our code is available at link. (An illustrative sketch of generating near-orthogonal projection matrices from a single learnable vector follows the table.) | 1 |
| 3 | MM-IFEngine: Towards Multimodal Instruction Following | Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang | | 0 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Ding_MM-IFEngine_Towards_Multimodal_Instruction_Following_ICCV_2025_paper.html) | | | | | | | The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and follow it correctly. Existing multimodal instruction-following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline for generating high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data, MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and is extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both textual constraints for output responses and visual constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating rule-based assessment and LLM-as-a-Judge evaluation. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+11.8%), MIA (+7.7%), and IFEval (+10.5%). | 2 |
| 4 | Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads | Yingjie Zhou, Jiezhang Cao, Zicheng Zhang, Farong Wen, Yanwei Jiang, Jun Jia, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai | [2507.23343](https://huggingface.co/papers/2507.23343) | 0 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Zhou_Who_is_a_Better_Talker_Subjective_and_Objective_Quality_Assessment_ICCV_2025_paper.html) | | [link](https://github.com/zyj-2000/Talker) | | | | | Speech-driven methods for portraits are figuratively known as "Talkers" because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging form of digital human media. However, challenges persist regarding the quality of these talkers and the AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents THQA-10K, the largest AGTH quality assessment dataset to date, which uses 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs, providing rich material for AGTH quality assessment. Volunteers are then recruited to subjectively rate the AGTHs and label the corresponding distortion categories. In our analysis of the subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, an objective quality assessment method based on the first frame, the Y-T slice, and tone-lip consistency is proposed (see the Y-T slice sketch after the table). Experimental results show that this method achieves state-of-the-art (SOTA) performance in AGTH quality assessment. The work is released at https://github.com/zyj-2000/Talker. | 3 |
| 5 | LayerAnimate: Layer-level Control for Animation | Yuxue Yang, Lue Fan, Zuzeng Lin, Feng Wang, Zhaoxiang Zhang | [2501.08295](https://huggingface.co/papers/2501.08295) | 1 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Yang_LayerAnimate_Layer-level_Control_for_Animation_ICCV_2025_paper.html) | [link](https://layeranimate.github.io) | [link](https://github.com/IamCreateAI/LayerAnimate) | [IamCreateAI/LayerAnimate](https://huggingface.co/spaces/IamCreateAI/LayerAnimate) | [Yuppie1204/LayerAnimate-Mix](https://huggingface.co/Yuppie1204/LayerAnimate-Mix) | | 1/5 ✅ | Traditional animation production decomposes visual elements into discrete layers to enable independent processing for sketching, refining, coloring, and in-betweening. Existing anime video generation methods typically treat animation as a data domain distinct from real-world videos and lack fine-grained control at the layer level. To bridge this gap, we introduce LayerAnimate, a novel video diffusion framework with a layer-aware architecture that enables the manipulation of layers through layer-level controls. The development of a layer-aware framework faces a significant data scarcity challenge due to the commercial sensitivity of professional animation assets. To address this limitation, we propose a data curation pipeline featuring Automated Element Segmentation and Motion-based Hierarchical Merging. Through quantitative and qualitative comparisons and a user study, we demonstrate that LayerAnimate outperforms current methods in terms of animation quality, control precision, and usability, making it an effective tool for both professional animators and amateur enthusiasts. This framework opens up new possibilities for layer-level animation applications and creative flexibility. Our code is available at https://layeranimate.github.io. | 4 |
| 6 | Towards a Unified Copernicus Foundation Model for Earth Vision | Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taixé, Xiao Xiang Zhu | [2503.11849](https://huggingface.co/papers/2503.11849) | 4 | 3 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Wang_Towards_a_Unified_Copernicus_Foundation_Model_for_Earth_Vision_ICCV_2025_paper.html) | | [link](https://github.com/zhu-xlab/Copernicus-FM) | | [wangyi111/Copernicus-FM](https://huggingface.co/wangyi111/Copernicus-FM) | [wangyi111/Copernicus-Pretrain](https://huggingface.co/datasets/wangyi111/Copernicus-Pretrain) | 3/11 ✅ | Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth's surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research. Code at https://github.com/zhu-xlab/Copernicus-FM. (A minimal sketch of a wavelength-conditioned patch-embedding hypernetwork follows the table.) | 5 |
| 7 | ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones | Anurag Ghosh, Shen Zheng, Robert Tamburo, Khiem Vuong, Juan Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, Srinivasa G. Narasimhan | | 0 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Ghosh_ROADWork_A_Dataset_and_Benchmark_for_Learning_to_Recognize_Observe_ICCV_2025_paper.html) | | | | | | | Perceiving and autonomously navigating through work zones is a challenging and under-explored problem. Open datasets for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models fail when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With ROADWork, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8x) around the world. Open-vocabulary methods fail too, whereas fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques. Video label propagation provides additional gains (+2.6 AP) for instance segmentation. For reading work zone signs, composing a detector and a text spotter via crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We predict navigational goals and compute drivable paths from work zone videos. Incorporating road work semantics ensures that 53.6% of goals have angular error (AE) < 0.5 (+9.9%) and 75.3% of pathways have AE < 0.5 (+8.1%). | 6 |
| 8 | Gradient Decomposition and Alignment for Incremental Object Detection | Wenlong Luo, Shizhou Zhang, De Cheng, Yinghui Xing, Guoqiang Liang, Peng Wang, Yanning Zhang | | 0 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Luo_Gradient_Decomposition_and_Alignment_for_Incremental_Object_Detection_ICCV_2025_paper.html) | | | | | | | Incremental object detection (IOD) is crucial for enabling AI systems to continuously learn new object classes over time while retaining knowledge of previously learned categories, allowing the model to adapt to dynamic environments without forgetting prior information. Existing IOD methods primarily employ knowledge distillation to mitigate catastrophic forgetting, yet these approaches overlook class overlap issues, often resulting in suboptimal performance. In this paper, we propose a novel framework for IOD that leverages a decoupled gradient alignment technique on top of a specially designed pseudo-labeling strategy. Our method employs a Gaussian Mixture Model to accurately estimate pseudo-labels of previously learned objects in current training images, effectively functioning as a knowledge-replay mechanism. This strategy reinforces prior knowledge retention and prevents the misclassification of unannotated foreground objects from earlier classes as background. Furthermore, we introduce an adaptive gradient decomposition and alignment method to maintain model stability while facilitating positive knowledge transfer. By aligning gradients from both old and new classes, our approach preserves previously learned knowledge while enhancing plasticity for new tasks (a generic gradient-alignment sketch follows the table). Extensive experiments on two IOD benchmarks demonstrate the effectiveness of the proposed method, achieving superior performance compared to state-of-the-art methods. | 7 |
| 9 | One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution | Xinyu Mao, Xiaohan Xing, Fei Meng, Jianbang Liu, Fan Bai, Qiang Nie, Max Meng | [2507.16337](https://huggingface.co/papers/2507.16337) | 0 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Mao_One_Polyp_Identifies_All_One-Shot_Polyp_Segmentation_with_SAM_via_ICCV_2025_paper.html) | | | | | | | Polyp segmentation is vital for early colorectal cancer detection, yet traditional fully supervised methods struggle with morphological variability and domain shifts, requiring frequent retraining. Additionally, reliance on large-scale annotations is a major bottleneck due to the time-consuming and error-prone nature of polyp boundary labeling. Recently, vision foundation models like the Segment Anything Model (SAM) have demonstrated strong generalizability and fine-grained boundary detection with sparse prompts, effectively addressing key polyp segmentation challenges. However, SAM's prompt-dependent nature limits automation in medical applications, since manually inputting prompts for each image is labor-intensive and time-consuming. We propose OP-SAM, a One-shot Polyp segmentation framework based on SAM that automatically generates prompts from a single annotated image, ensuring accurate and generalizable segmentation without additional annotation burdens. Our method introduces Correlation-based Prior Generation (CPG) for semantic label transfer and Scale-cascaded Prior Fusion (SPF) to adapt to polyp size variations and filter out noisy transfers. Instead of supplying all prompts at once, we devise Euclidean Prompt Evolution (EPE) for iterative prompt refinement, progressively enhancing segmentation quality. Extensive evaluations across five datasets validate OP-SAM's effectiveness. Notably, on Kvasir, it achieves 76.93% IoU, surpassing the state-of-the-art by 11.44%. | 8 |
| 10 | Gradient Extrapolation for Debiased Representation Learning | Ihab Asaad, Maha Shadaydeh, Joachim Denzler | [2503.13236](https://huggingface.co/papers/2503.13236) | 0 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Asaad_Gradient_Extrapolation_for_Debiased_Representation_Learning_ICCV_2025_paper.html) | [link](https://gerne-debias.github.io/) | | | | | | Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations and defines the target gradient as a linear extrapolation of the gradients computed from each batch's loss (a minimal extrapolation sketch follows the table). Our analysis shows that when the extrapolated gradient points toward the batch gradient with fewer spurious correlations, it effectively guides training toward learning a debiased model. GERNE serves as a general framework for debiasing, encompassing ERM and Resampling methods as special cases. We derive the theoretical upper and lower bounds of the extrapolation factor employed by GERNE. By tuning this factor, GERNE can adapt to maximize either Group-Balanced Accuracy (GBA) or Worst-Group Accuracy (WGA). We validate GERNE on five vision benchmarks and one NLP benchmark, demonstrating competitive and often superior performance compared to state-of-the-art baselines. The project page is available at: https://gerne-debias.github.io/. | 9 |
| 11 | From Gaze to Movement: Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning | Yexin Huang, Yongbin Lin, Lishengsa Yue, Zhihong Yao, Jie Wang | | 0 | 0 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Huang_From_Gaze_to_Movement_Predicting_Visual_Attention_for_Autonomous_Driving_ICCV_2025_paper.html) | | | | | | | Human-machine interaction technology requires not only the distribution of human visual attention but also the prediction of the gaze-point trajectory. We introduce PILOT, a programmatic imitation learning approach that predicts a driver's eye movements based on a set of rule-based conditions. These conditions, derived from driving operations and traffic flow characteristics, define how gaze shifts occur. They are initially identified through incremental synthesis, a heuristic search method, and then refined via L-BFGS, a numerical optimization technique. These human-readable rules allow us to understand drivers' eye movement patterns and to make efficient and explainable predictions. We also propose DATAD, a dataset that covers 12 types of autonomous driving takeover scenarios, collected from 60 participants and comprising approximately 600,000 frames of gaze-point data. Compared to existing eye-tracking datasets, DATAD includes additional driving metrics and surrounding traffic flow characteristics, providing richer contextual information for modeling gaze behavior. Experimental evaluations of PILOT on DATAD demonstrate superior accuracy and faster prediction speeds compared to four baseline models. Specifically, PILOT reduces the MSE of predicted trajectories by 38.59% to 88.02% and improves the accuracy of gaze-object predictions by 6.90% to 55.06%. Moreover, PILOT achieves these gains with approximately 30% lower prediction time, offering both more accurate and more efficient eye movement prediction. | 10 |
| 12 | Less-to-More Generalization: Unlocking More Controllability by In-Context Generation | Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He | [2504.02160](https://huggingface.co/papers/2504.02160) | 37 | 3 | [link](https://openaccess.thecvf.com/content/ICCV2025/html/Wu_Less-to-More_Generalization_Unlocking_More_Controllability_by_In-Context_Generation_ICCV_2025_paper.html) | | [link](https://github.com/bytedance/UNO) | [bytedance-research/UNO-FLUX](https://huggingface.co/spaces/bytedance-research/UNO-FLUX) | [bytedance-research/UNO](https://huggingface.co/bytedance-research/UNO) | [bytedance-research/UNO-1M](https://huggingface.co/datasets/bytedance-research/UNO-1M) | 4/6 ✅ | Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still faces challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multi-subject ones and scaling them up is particularly difficult. For the second, most recent methods center on single-subject generation, making them hard to apply in multi-subject scenarios. In this study, we propose a highly consistent data synthesis pipeline to tackle these challenges. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates highly consistent multi-subject paired data. Additionally, we introduce UNO, a multi-subject-driven customization architecture based on a diffusion transformer. UNO incorporates a progressive cross-modal alignment training paradigm that progresses from simpler single-subject conditioning to more complex multi-subject conditioning. Along with this, a universal rotary position embedding (UnoPE) adjusts the position indices. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation. Code and model: https://github.com/bytedance/UNO. | 11 |
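
The AOFT abstract (row 2) describes building down/up-projection matrices from a single learnable vector so that their rows and columns stay approximately orthogonal, but it does not spell out the construction. The sketch below is one plausible way to satisfy that constraint, using a Householder reflection (exactly orthogonal, parameterized by one vector) and keeping its first r columns; the class name and sizes are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn


class HouseholderProjection(nn.Module):
    """Illustrative low-rank projection whose columns stay (near-)orthogonal.

    A single learnable vector v parameterizes the Householder reflection
    H = I - 2 v v^T / ||v||^2, which is exactly orthogonal; the first r
    columns of H give a d x r projection with orthonormal columns. This is
    only a plausible sketch of AOFT's idea, not the paper's construction.
    """

    def __init__(self, d: int, r: int):
        super().__init__()
        self.v = nn.Parameter(torch.randn(d))    # the single learnable vector
        self.r = r

    def weight(self) -> torch.Tensor:
        v = self.v / self.v.norm()
        H = torch.eye(self.v.numel(), device=v.device) - 2.0 * torch.outer(v, v)
        return H[:, : self.r]                    # d x r, orthonormal columns

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight()                 # down-projection to rank r


if __name__ == "__main__":
    proj = HouseholderProjection(d=768, r=16)
    W = proj.weight()
    # Columns are orthonormal, so W^T W is (numerically) the identity.
    print(torch.allclose(W.T @ W, torch.eye(16), atol=1e-5))
```

Any construction with this property keeps W^T W close to the identity, which is the orthogonality condition the abstract ties to a tighter generalization-error bound.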
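
Row 4's objective quality metric for talking heads uses, among other cues, a Y-T slice of the generated video. A minimal sketch of extracting such a slice is given below: fixing one horizontal coordinate and stacking that pixel column over time produces an image in which temporal flicker and jitter become visible. The central-column default and the (T, H, W, C) layout are assumptions, not taken from the paper.

```python
from typing import Optional

import numpy as np


def yt_slice(video: np.ndarray, x: Optional[int] = None) -> np.ndarray:
    """Extract a Y-T slice from a video array of shape (T, H, W, C)."""
    t, h, w, c = video.shape
    if x is None:
        x = w // 2                       # default: central column (assumption)
    # Take the column at x from every frame and stack it over time.
    return np.transpose(video[:, :, x, :], (1, 0, 2))  # (H, T, C)


if __name__ == "__main__":
    dummy = np.random.randint(0, 256, size=(120, 256, 256, 3), dtype=np.uint8)
    print(yt_slice(dummy).shape)         # (256, 120, 3)
```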
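
Row 6 states that Copernicus-FM handles arbitrary spectral sensors through extended dynamic hypernetworks. The sketch below illustrates only the general idea: a small MLP maps each band's central wavelength to that band's patch-embedding kernel, so any set of bands can be tokenized by one backbone. The sinusoidal encoding, layer sizes, and per-band summation are illustrative guesses rather than the released architecture.

```python
import torch
import torch.nn as nn


class WavelengthHypernet(nn.Module):
    """Illustrative dynamic patch-embedding hypernetwork (not the released design)."""

    def __init__(self, patch: int = 16, dim: int = 768, n_freq: int = 16):
        super().__init__()
        self.patch, self.dim = patch, dim
        self.register_buffer("freqs", torch.randn(n_freq))
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_freq, 256), nn.GELU(),
            nn.Linear(256, dim * patch * patch),
        )

    def forward(self, x: torch.Tensor, wavelengths_um: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); wavelengths_um: (C,) central wavelength per band
        enc = torch.cat([torch.sin(wavelengths_um[:, None] * self.freqs),
                         torch.cos(wavelengths_um[:, None] * self.freqs)], dim=-1)
        kernels = self.mlp(enc).view(-1, self.dim, self.patch, self.patch)  # (C, D, p, p)
        tokens = 0.0
        for c in range(x.shape[1]):      # per-band convolution, summed over bands
            tokens = tokens + nn.functional.conv2d(
                x[:, c : c + 1], kernels[c].unsqueeze(1), stride=self.patch)
        return tokens                    # (B, D, H/p, W/p)


if __name__ == "__main__":
    net = WavelengthHypernet()
    img = torch.randn(2, 4, 64, 64)                     # 4 arbitrary bands
    wl = torch.tensor([0.49, 0.56, 0.665, 0.842])       # e.g. Sentinel-2 B2/B3/B4/B8 (um)
    print(net(img, wl).shape)                           # torch.Size([2, 768, 4, 4])
```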
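
Row 8 aligns gradients from old and new classes to balance stability and plasticity. As a generic illustration of gradient decomposition and alignment (not the paper's adaptive scheme), the snippet below removes the component of the new-class gradient that conflicts with the old-class gradient before combining them, in the spirit of projection-based methods such as PCGrad.

```python
import torch


def align_gradients(g_old: torch.Tensor, g_new: torch.Tensor) -> torch.Tensor:
    """Generic gradient decomposition/alignment step (PCGrad-style sketch).

    g_new is decomposed into a component along g_old plus an orthogonal
    residual; when the along-component is negative (the objectives conflict),
    it is removed so the combined update cannot undo old-class knowledge.
    The paper's adaptive weighting is not reproduced here.
    """
    dot = torch.dot(g_new, g_old)
    if dot < 0:                                          # conflicting directions
        g_new = g_new - dot / g_old.pow(2).sum().clamp_min(1e-12) * g_old
    return g_old + g_new                                 # combined update direction


if __name__ == "__main__":
    g_old = torch.tensor([1.0, 0.0])                     # gradient for old classes
    g_new = torch.tensor([-0.5, 1.0])                    # partially conflicts with g_old
    print(align_gradients(g_old, g_new))                 # tensor([1., 1.])
```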
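
Row 10 defines GERNE's target gradient as a linear extrapolation of the gradients from two batches carrying different amounts of spurious correlation. The sketch below implements that statement literally for a plain SGD step; the variable names, the sign convention of the extrapolation factor c, and the two-forward-pass structure are assumptions based on the abstract, not the authors' code.

```python
import torch


def gerne_step(model, loss_fn, batch_more_bias, batch_less_bias, c: float, lr: float):
    """One GERNE-style SGD step using an extrapolated target gradient.

    grad_b is computed on a batch with more spurious correlation, grad_a on a
    batch with less (e.g. group-balanced sampling); the target gradient
    grad_a + c * (grad_a - grad_b) extrapolates past the less-biased batch
    for c > 0.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    xb, yb = batch_more_bias
    grad_b = torch.autograd.grad(loss_fn(model(xb), yb), params)

    xa, ya = batch_less_bias
    grad_a = torch.autograd.grad(loss_fn(model(xa), ya), params)

    with torch.no_grad():
        for p, ga, gb in zip(params, grad_a, grad_b):
            p -= lr * (ga + c * (ga - gb))   # plain SGD on the extrapolated gradient


if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Linear(8, 2)
    loss_fn = torch.nn.CrossEntropyLoss()
    make_batch = lambda: (torch.randn(16, 8), torch.randint(0, 2, (16,)))
    gerne_step(model, loss_fn, make_batch(), make_batch(), c=0.5, lr=1e-2)
```

Setting c = 0 recovers training on the less-biased batch alone; per the abstract, tuning the extrapolation factor lets the method target either Group-Balanced or Worst-Group Accuracy.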