Week Ending 3.9.2025
RESEARCH WATCH: 3.9.2025
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
TrajectoryCrafter introduces a groundbreaking approach for redirecting camera movements in monocular videos through diffusion models. By separating view transformations from content generation, this technique enables precise control over camera trajectories. The dual-stream diffusion model integrates point cloud renders with source videos to ensure accurate transformations and coherent content generation. Instead of relying on scarce multi-view videos, the researchers created a hybrid dataset using web-scale monocular videos combined with static multi-view data. This technology could revolutionize video editing, virtual reality experiences, and filmmaking by allowing creators to reimagine camera movements in existing footage without reshooting.
Authors: Mark YU, Wenbo Hu, Jinbo Xing, Ying Shan
Link: https://arxiv.org/abs/2503.05638v1
Date: 2025-03-07
Summary:
We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset combining web-scale monocular videos with static multi-view datasets via our innovative double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method.
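To make the dual-stream conditioning concrete, here is a minimal sketch (not the authors' code) of the pattern: a denoiser consumes the noisy latent together with two condition streams, point-cloud renders for the target trajectory and the source video, fused before noise prediction. All module names and sizes are illustrative assumptions.

import torch
import torch.nn as nn

class DualStreamConditioner(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.render_enc = nn.Conv3d(3, channels, kernel_size=3, padding=1)   # point-cloud render stream
        self.source_enc = nn.Conv3d(3, channels, kernel_size=3, padding=1)   # source-video stream
        self.denoise = nn.Conv3d(channels * 2 + 4, 4, kernel_size=3, padding=1)  # toy noise predictor

    def forward(self, noisy_latent, render_frames, source_frames):
        # Encode both condition streams and concatenate with the noisy latent,
        # so the predictor sees geometry (renders) and appearance (source) jointly.
        cond = torch.cat([self.render_enc(render_frames),
                          self.source_enc(source_frames)], dim=1)
        return self.denoise(torch.cat([noisy_latent, cond], dim=1))

# Toy usage: batch of 1, 4-channel latent, 8 frames at 32x32.
model = DualStreamConditioner()
eps = model(torch.randn(1, 4, 8, 32, 32),
            torch.randn(1, 3, 8, 32, 32),
            torch.randn(1, 3, 8, 32, 32))
print(eps.shape)  # torch.Size([1, 4, 8, 32, 32])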
--------------------------------------------------------------------------------------------------------
This research addresses a fundamental challenge in material science: creating constitutive models that balance expressivity with interpretability. The proposed Input-Convex Kolmogorov-Arnold Networks (ICKANs) learn polyconvex hyperelastic constitutive laws by leveraging trainable univariate spline-based activation functions. These networks ensure physically admissible models while remaining compact and interpretable, allowing extraction of analytical constitutive relationships through symbolic regression. Trained on strain data and limited force measurements, ICKANs accurately capture nonlinear stress-strain behavior across diverse strain states. Applications include improved material modeling for engineering design, medical implants, and advanced manufacturing, where accurate prediction of material behavior under complex loading conditions is crucial.
Authors: Prakash Thakolkaran, Yaqi Guo, Shivam Saini, Mathias Peirlinck, Benjamin Alheit, Siddhant Kumar
Link: https://arxiv.org/abs/2503.05617v1
Date: 2025-03-07
Summary:
Traditional constitutive models rely on hand-crafted parametric forms with limited expressivity and generalizability, while neural network-based models can capture complex material behavior but often lack interpretability. To balance these trade-offs, we present Input-Convex Kolmogorov-Arnold Networks (ICKANs) for learning polyconvex hyperelastic constitutive laws. ICKANs leverage the Kolmogorov-Arnold representation, decomposing the model into compositions of trainable univariate spline-based activation functions for rich expressivity. We introduce trainable input-convex splines within the KAN architecture, ensuring physically admissible polyconvex hyperelastic models. The resulting models are both compact and interpretable, enabling explicit extraction of analytical constitutive relationships through an input-convex symbolic regression technique. Through unsupervised training on full-field strain data and limited global force measurements, ICKANs accurately capture nonlinear stress-strain behavior across diverse strain states. Finite element simulations of unseen geometries with trained ICKAN hyperelastic constitutive models confirm the framework's robustness and generalization capability.
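A minimal numpy sketch of the idea behind input-convex univariate activations: each input dimension passes through a convex 1-D function (here a max of affine pieces, which is always convex), and a sum of convex functions of the inputs stays convex. This illustrates only the convexity constraint; the paper's spline parameterization, polyconvex invariants, and training are not reproduced.

import numpy as np

def convex_univariate(x, slopes, intercepts):
    """Convex piecewise-linear function: max over affine pieces a_i*x + b_i."""
    return np.max(np.outer(x, slopes) + intercepts, axis=1)

rng = np.random.default_rng(0)
slopes = rng.normal(size=5)        # random affine pieces; the max of affines is always convex
intercepts = rng.normal(size=5)

x = np.linspace(-2.0, 2.0, 9)
y = convex_univariate(x, slopes, intercepts)

# Numerical convexity check: second differences of a convex function are >= 0.
second_diff = np.diff(y, 2)
print(np.all(second_diff >= -1e-9))  # True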
--------------------------------------------------------------------------------------------------------
Ontology Generation using Large Language Models
This paper explores how Large Language Models (LLMs) can streamline the complex, time-consuming process of ontology engineering. The researchers introduce two novel prompting techniques—Memoryless CQbyCQ and Ontogenia—to automatically generate OWL ontologies from user stories and competency questions. Their evaluation framework emphasizes structural criteria alongside expert assessment to comprehensively evaluate ontology quality. Testing on a benchmark of ten ontologies with 100 competency questions, they found that OpenAI's o1-preview with Ontogenia produces high-quality ontologies that outperform novice engineers. This technology could dramatically accelerate knowledge representation in fields like healthcare, finance, and science by automating ontology creation while maintaining quality standards.
Authors: Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisärkkä, Sara Zuppiroli, Miguel Ceriani, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese
Link: https://arxiv.org/abs/2503.05388v1
Date: 2025-03-07
Summary:
The ontology engineering process is complex, time-consuming, and error-prone, even for experienced ontology engineers. In this work, we investigate the potential of Large Language Models (LLMs) to provide effective OWL ontology drafts directly from ontological requirements described using user stories and competency questions. Our main contribution is the presentation and evaluation of two new prompting techniques for automated ontology development: Memoryless CQbyCQ and Ontogenia. We also emphasize the importance of three structural criteria for ontology assessment, alongside expert qualitative evaluation, highlighting the need for a multi-dimensional evaluation in order to capture the quality and usability of the generated ontologies. Our experiments, conducted on a benchmark dataset of ten ontologies with 100 distinct CQs and 29 different user stories, compare the performance of three LLMs using the two prompting techniques. The results demonstrate improvements over the current state-of-the-art in LLM-supported ontology engineering. More specifically, the model OpenAI o1-preview with Ontogenia produces ontologies of sufficient quality to meet the requirements of ontology engineers, significantly outperforming novice ontology engineers in modelling ability. However, we still note some common mistakes and variability of result quality, which is important to take into account when using LLMs for ontology authoring support. We discuss these limitations and propose directions for future research.
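Below is a hedged sketch of a "memoryless, CQ-by-CQ" drafting loop in the spirit of Memoryless CQbyCQ: each competency question is sent in a fresh prompt (no chat history) together with the user story and the ontology drafted so far, and the model's additions are merged in. The prompt wording, merge logic, and call_llm placeholder are assumptions, not the paper's implementation.

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM call; replace with a real client."""
    return "# <OWL/Turtle axioms produced by the model>"

def draft_ontology(user_story: str, competency_questions: list[str]) -> str:
    ontology = ""
    for cq in competency_questions:
        prompt = (
            "You are an ontology engineer.\n"
            f"User story:\n{user_story}\n\n"
            f"Ontology drafted so far (Turtle):\n{ontology}\n\n"
            f"Extend the ontology so it can answer this competency question:\n{cq}\n"
            "Return only the new or revised axioms in Turtle."
        )
        ontology += "\n" + call_llm(prompt)   # memoryless: only the draft carries state
    return ontology

print(draft_ontology("A library tracks loans of books to members.",
                     ["Which member borrowed a given book?",
                      "When is a loan due?"]))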
--------------------------------------------------------------------------------------------------------
PALo: Learning Posture-Aware Locomotion for Quadruped Robots
PALo represents a significant advancement in quadruped robot control, focusing on balancing agility and robustness on complex terrains. This end-to-end deep reinforcement learning framework enables simultaneous velocity tracking and real-time body posture adjustments. By formulating locomotion control as a partially observable Markov decision process with an asymmetric actor-critic architecture, PALo overcomes simulation-to-reality challenges. The system achieves agile posture-aware locomotion in simulations and successfully transfers to real-world environments without fine-tuning. Applications include search and rescue operations in disaster zones, planetary exploration, industrial inspection in hazardous environments, and agricultural assistance where adaptable movement across challenging surfaces is essential.
Authors: Xiangyu Miao, Jun Sun, Hang Lai, Xinpeng Di, Jiahang Cao, Yong Yu, Weinan Zhang
Link: https://arxiv.org/abs/2503.04462v1
Date: 2025-03-06
Summary:
With the rapid development of embodied intelligence, locomotion control of quadruped robots on complex terrains has become a research hotspot. Unlike traditional locomotion control approaches focusing solely on velocity tracking, we pursue a balance between the agility and robustness of quadruped robots on diverse and complex terrains. To this end, we propose an end-to-end deep reinforcement learning framework for posture-aware locomotion named PALo, which manages to handle simultaneous linear and angular velocity tracking and real-time adjustments of body height, pitch, and roll angles. In PALo, the locomotion control problem is formulated as a partially observable Markov decision process, and an asymmetric actor-critic architecture is adopted to overcome the sim-to-real challenge. Further, by incorporating customized training curricula, PALo achieves agile posture-aware locomotion control in simulated environments and successfully transfers to real-world settings without fine-tuning, allowing real-time control of the quadruped robot's locomotion and body posture across challenging terrains. Through in-depth experimental analysis, we identify the key components of PALo that contribute to its performance, further validating the effectiveness of the proposed method. The results of this study provide new possibilities for the low-level locomotion control of quadruped robots in higher dimensional command spaces and lay the foundation for future research on upper-level modules for embodied intelligence.
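A minimal sketch of the asymmetric actor-critic pattern PALo relies on for sim-to-real: the actor consumes only observations available on the robot (proprioception plus velocity/posture commands), while the critic additionally receives privileged simulator state during training. Layer sizes and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    def __init__(self, obs_dim=48, cmd_dim=6, priv_dim=24, act_dim=12):
        super().__init__()
        # Actor: deployable policy, no privileged inputs.
        self.actor = nn.Sequential(
            nn.Linear(obs_dim + cmd_dim, 128), nn.ELU(), nn.Linear(128, act_dim))
        # Critic: training-only value head that also sees privileged simulator state.
        self.critic = nn.Sequential(
            nn.Linear(obs_dim + cmd_dim + priv_dim, 128), nn.ELU(), nn.Linear(128, 1))

    def forward(self, obs, cmd, priv):
        actor_in = torch.cat([obs, cmd], dim=-1)
        critic_in = torch.cat([obs, cmd, priv], dim=-1)
        return self.actor(actor_in), self.critic(critic_in)

# cmd_dim=6 mirrors the commanded quantities described above: linear/angular velocity
# plus body height, pitch, and roll.
net = AsymmetricActorCritic()
action, value = net(torch.randn(4, 48), torch.randn(4, 6), torch.randn(4, 24))
print(action.shape, value.shape)  # torch.Size([4, 12]) torch.Size([4, 1])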
--------------------------------------------------------------------------------------------------------
TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records
TIMER addresses a critical challenge in healthcare AI: enabling large language models to reason over longitudinal electronic health records (EHRs). While LLMs show promise for medical tasks, their ability to process temporal dependencies across multiple patient visits remains limited. This framework introduces time-aware instruction evaluation and tuning specifically designed for clinical records. TIMER-Bench evaluates temporal reasoning capabilities, while TIMER-Instruct fine-tunes LLMs to better reason across time. Models using this approach showed significant performance improvements of 7-9% on benchmarks. This technology could transform clinical decision support, predictive analytics, and personalized medicine by enhancing AI's ability to interpret patient histories over time.
Authors: Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, Nigam Shah
Link: https://arxiv.org/abs/2503.04176v1
Date: 2025-03-06
Summary:
Large language models (LLMs) have emerged as promising tools for assisting in medical tasks, yet processing Electronic Health Records (EHRs) presents unique challenges due to their longitudinal nature. While LLMs' capabilities to perform medical tasks continue to improve, their ability to reason over temporal dependencies across multiple patient visits and time frames remains unexplored. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a framework that incorporates instruction-response pairs grounded in different parts of a patient's record as a critical dimension in both instruction evaluation and tuning for longitudinal clinical records. We develop TIMER-Bench, the first time-aware benchmark that evaluates temporal reasoning capabilities over longitudinal EHRs, as well as TIMER-Instruct, an instruction-tuning methodology for LLMs to learn reasoning over time. We demonstrate that models fine-tuned with TIMER-Instruct improve performance by 7.3% on human-generated benchmarks and 9.2% on TIMER-Bench, indicating that temporal instruction-tuning improves model performance for reasoning over EHRs.
--------------------------------------------------------------------------------------------------------
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning
SafeVLA tackles the urgent safety challenges posed by vision-language-action models (VLAs) in robotics. These generalist robot policies risk physical harm to environments, robots, and humans during deployment. The researchers propose a novel algorithm that integrates safety constraints through large-scale constrained learning in simulated environments. Their approach outperforms state-of-the-art methods in both safety (83.58% improvement) and task performance (3.85% improvement), effectively eliminating high-risk behaviors and reducing unsafe behaviors to 1/35 of current levels. This technology is essential for deploying robotic systems in homes, healthcare, manufacturing, and public spaces where human safety is paramount.
Authors: Borong Zhang, Yuhao Zhang, Jiaming Ji, Yingshan Lei, Josef Dai, Yuanpei Chen, Yaodong Yang
Link: https://arxiv.org/abs/2503.03480v1
Date: 2025-03-05
Summary:
Vision-language-action models (VLAs) have shown great potential as generalist robot policies. However, these models pose urgent safety challenges during deployment, including the risk of physical harm to the environment, the robot itself, and humans. How can safety be explicitly incorporated into VLAs? In this work, we propose SafeVLA, a novel algorithm designed to integrate safety into VLAs, ensuring the protection of the environment, robot hardware and humans in real-world settings. SafeVLA effectively balances safety and task performance by employing large-scale constrained learning within simulated environments. We demonstrate that SafeVLA outperforms the current state-of-the-art method in both safety and task performance, achieving average improvements of 83.58% and 3.85%, respectively, in simulation. By prioritizing safety, our approach eliminates high-risk behaviors and reduces the upper bound of unsafe behaviors to 1/35 of that in the current state-of-the-art, thereby significantly mitigating long-tail risks. Furthermore, the learned safety constraints generalize to diverse, unseen scenarios, including multiple out-of-distribution perturbations and tasks. Our data, models and newly proposed benchmark environment are available at https://sites.google.com/view/pku-safevla.
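A small numpy sketch of the Lagrangian pattern commonly used for large-scale constrained learning: maximize task reward subject to an expected-cost budget by alternating a policy step on (reward - lambda * cost) with dual ascent on lambda. This illustrates the constrained-RL objective generically; it is not SafeVLA's actual training loop.

import numpy as np

def constrained_update(reward_grad, cost_grad, cost_estimate, lam,
                       cost_budget=0.1, lr_policy=1e-2, lr_dual=1e-1):
    """One step on the policy parameters and the Lagrange multiplier."""
    policy_step = lr_policy * (reward_grad - lam * cost_grad)        # ascend the Lagrangian in theta
    lam = max(0.0, lam + lr_dual * (cost_estimate - cost_budget))    # dual ascent, lambda >= 0
    return policy_step, lam

theta, lam = np.zeros(3), 0.0
for _ in range(5):
    # Placeholder gradients and cost estimate; in practice these come from rollouts.
    step, lam = constrained_update(reward_grad=np.ones(3),
                                   cost_grad=np.full(3, 0.5),
                                   cost_estimate=0.3, lam=lam)
    theta += step
print(theta, lam)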
--------------------------------------------------------------------------------------------------------
RTFusion: A depth estimation network based on multimodal fusion in challenging scenarios
RTFusion addresses the challenge of accurate depth estimation in complex real-world environments by intelligently combining RGB and thermal infrared (THR) imagery. The model leverages RGB data for texture and color information while using THR to ensure stability in adverse lighting conditions. At its core, the EGFusion mechanism employs Mutual Complementary Attention for cross-modal alignment and Edge Saliency Enhancement to preserve edge details. Testing on MS2 and ViViD++ datasets demonstrates superior performance across challenging scenarios including nighttime, rain, and high-glare conditions. This technology has significant applications in autonomous driving, robotics, security systems, and augmented reality where reliable depth perception is critical.
Authors: Zelin Meng, Takanori Fukao
Link: https://arxiv.org/abs/2503.04821v1
Date: 2025-03-05
Summary:
Depth estimation in complex real-world scenarios is a challenging task, especially when relying solely on a single modality such as visible light or thermal infrared (THR) imagery. This paper proposes a novel multimodal depth estimation model, RTFusion, which enhances depth estimation accuracy and robustness by integrating the complementary strengths of RGB and THR data. The RGB modality provides rich texture and color information, while the THR modality captures thermal patterns, ensuring stability under adverse lighting conditions such as extreme illumination. The model incorporates a unique fusion mechanism, EGFusion, consisting of the Mutual Complementary Attention (MCA) module for cross-modal feature alignment and the Edge Saliency Enhancement Module (ESEM) to improve edge detail preservation. Comprehensive experiments on the MS2 and ViViD++ datasets demonstrate that the proposed model consistently produces high-quality depth maps across various challenging environments, including nighttime, rainy, and high-glare conditions. The experimental results highlight the potential of the proposed method in applications requiring reliable depth estimation, such as autonomous driving, robotics, and augmented reality.
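A hedged sketch of mutual cross-modal attention for RGB/thermal fusion: each modality queries the other and the two refined streams are merged, so each stream is strengthened by the modality that stays reliable when the other degrades. This shows the general pattern only; the paper's MCA and ESEM modules and their dimensions are not reproduced.

import torch
import torch.nn as nn

class MutualCrossAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.rgb_from_thr = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.thr_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, rgb_tokens, thr_tokens):
        # RGB features query thermal features, and vice versa.
        rgb_ref, _ = self.rgb_from_thr(rgb_tokens, thr_tokens, thr_tokens)
        thr_ref, _ = self.thr_from_rgb(thr_tokens, rgb_tokens, rgb_tokens)
        return self.merge(torch.cat([rgb_ref, thr_ref], dim=-1))

fuse = MutualCrossAttention()
fused = fuse(torch.randn(2, 256, 64), torch.randn(2, 256, 64))  # (batch, tokens, dim)
print(fused.shape)  # torch.Size([2, 256, 64])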
--------------------------------------------------------------------------------------------------------
This research tackles the ambitious goal of creating a general-purpose event extraction system capable of handling events with thousands of different types. The researchers developed a collaborative annotation method using multiple Large Language Models to create EEMT, the largest event extraction dataset to date, with over 200,000 samples and thousands of event and role types. Their proposed LLM-based Partitioning Event Extraction method (LLM-PEE) overcomes context length limitations by recalling candidate event types and dividing them into manageable partitions. This technology could transform information extraction in news monitoring, intelligence analysis, scientific research, and business intelligence where automated understanding of diverse event types is valuable.
Authors: Wenxuan Liu, Zixuan Li, Long Bai, Yuxin Zuo, Daozhu Xu, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
Link: https://arxiv.org/abs/2503.02628v1
Date: 2025-03-04
Summary:
Developing a general-purpose extraction system that can extract events with massive types is a long-standing target in Event Extraction (EE). In doing so, the challenge comes from two aspects: 1) The absence of an efficient and effective annotation method. 2) The absence of a powerful extraction method that can handle massive types. For the first challenge, we propose a collaborative annotation method based on Large Language Models (LLMs). Through collaboration among multiple LLMs, it first refines annotations of trigger words from distant supervision and then carries out argument annotation. Next, a voting phase consolidates the annotation preferences across different LLMs. Finally, we create the EEMT dataset, the largest EE dataset to date, featuring over 200,000 samples, 3,465 event types, and 6,297 role types. For the second challenge, we propose an LLM-based Partitioning EE method called LLM-PEE. To overcome the limited context length of LLMs, LLM-PEE first recalls candidate event types and then splits them into multiple partitions for LLMs to extract events. The results in the supervised setting show that LLM-PEE outperforms the state-of-the-art methods by 5.4 in event detection and 6.1 in argument extraction. In the zero-shot setting, LLM-PEE achieves up to 12.9 improvement compared to mainstream LLMs, demonstrating its strong generalization capabilities.
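A hedged sketch of the partition-then-extract idea: recall a shortlist of candidate event types for a document, split the shortlist into chunks that fit the context window, and run extraction once per chunk, merging the results. The recall_types and call_llm functions are placeholders, not the paper's components.

def recall_types(document: str, all_types: list[str], k: int = 50) -> list[str]:
    """Stand-in retriever; in practice a recall model ranks candidate event types."""
    return all_types[:k]

def call_llm(prompt: str) -> list[dict]:
    """Stand-in LLM call returning extracted events; replace with a real client."""
    return []

def partitioned_extract(document: str, all_types: list[str], partition_size: int = 10):
    candidates = recall_types(document, all_types)
    events = []
    for i in range(0, len(candidates), partition_size):
        chunk = candidates[i:i + partition_size]          # one manageable partition
        prompt = (f"Document:\n{document}\n\n"
                  f"Extract all events of these types: {', '.join(chunk)}.\n"
                  "Return a JSON list of events with triggers and arguments.")
        events.extend(call_llm(prompt))
    return events

print(partitioned_extract("The company acquired a startup in March.",
                          [f"type_{i}" for i in range(3465)]))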
--------------------------------------------------------------------------------------------------------
Privacy Preservation Techniques (PPTs) in IoT Systems: A Scoping Review and Future Directions
This comprehensive scoping review examines privacy preservation techniques (PPTs) deployed in Internet of Things (IoT) systems between 2010 and early 2023. The researchers analyze how various technologies, including cryptography and artificial intelligence, create privacy-enhancing technologies (PETs) that address different privacy concerns in IoT. The study explores privacy goals, implementation technologies, integration into IoT architecture layers, application domains, and privacy types addressed. By identifying prominent privacy goals and research gaps, this work provides valuable direction for future privacy research in IoT. These insights are critical for developing secure smart homes, healthcare monitoring systems, industrial IoT, and smart cities where data privacy is essential.
Authors: Emmanuel Alalade, Ashraf Matrawy
Link: https://arxiv.org/abs/2503.02455v1
Date: 2025-03-04
Summary:
Privacy preservation in Internet of Things (IoT) systems requires the use of privacy-enhancing technologies (PETs) built from innovative technologies such as cryptography and artificial intelligence (AI) to create techniques called privacy preservation techniques (PPTs). These PPTs achieve various privacy goals and address different privacy concerns by mitigating potential privacy threats within IoT systems. This study carried out a scoping review of different types of PPTs used in previous research works on IoT systems between 2010 and early 2023 to further explore the advantages of privacy preservation in these systems. This scoping review looks at privacy goals, possible technologies used for building PET, the integration of PPTs into the computing layer of the IoT architecture, different IoT applications in which PPTs are deployed, and the different privacy types addressed by these techniques within IoT systems. Key findings, such as the prominent privacy goal and privacy type in IoT, are discussed in this survey, along with identified research gaps that could inform future endeavors in privacy research and benefit the privacy research community and other stakeholders in IoT systems.
--------------------------------------------------------------------------------------------------------
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
DivPrune addresses the computational challenges in Large Multimodal Models (LMMs) by reformulating token pruning as a Max-Min Diversity Problem. This approach reduces redundancy by selecting tokens that maximize diversity, enabling better representation of original visual information even at high pruning ratios. Unlike previous methods requiring extensive calibration or using suboptimal importance metrics, DivPrune achieves state-of-the-art accuracy across 16 image and video datasets without fine-tuning. Additionally, it reduces both end-to-end latency and GPU memory usage. This technology could accelerate multimodal AI deployment in resource-constrained environments like mobile devices, edge computing, and real-time applications where efficiency is crucial without sacrificing performance.
Authors: Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang
Link: https://arxiv.org/abs/2503.02175v1
Date: 2025-03-04
Summary:
Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for the LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, have been proposed. The existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics which result in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP) where the goal is to select a subset such that the diversity among the selected tokens is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available at https://github.com/vbdi/divprune.
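A small numpy sketch of Max-Min Diversity selection in the spirit of DivPrune: greedily keep the visual token whose minimum distance to the already-selected set is largest (farthest-point selection), so retained tokens are maximally spread out. The distance choice and initialization are illustrative assumptions, not the paper's exact solver.

import numpy as np

def maxmin_select(tokens: np.ndarray, keep: int) -> np.ndarray:
    """tokens: (N, D) visual token embeddings; returns indices of `keep` diverse tokens."""
    selected = [0]                                   # arbitrary seed token
    dist = np.linalg.norm(tokens - tokens[0], axis=1)
    while len(selected) < keep:
        nxt = int(np.argmax(dist))                   # token farthest from the selected set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(tokens - tokens[nxt], axis=1))
    return np.array(selected)

rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(576, 64))           # e.g. 576 image patch tokens
kept = maxmin_select(visual_tokens, keep=64)         # roughly 89% pruning ratio
print(kept.shape, len(set(kept.tolist())))           # (64,) 64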
--------------------------------------------------------------------------------------------------------
SAGE: Steering and Refining Dialog Generation with State-Action Augmentation
SAGE introduces a novel approach to creating emotionally intelligent chatbots capable of engaging in natural, strategic conversations. At its core, the State-Action Chain (SAC) incorporates latent variables that capture emotional states and conversational strategies between dialogue turns. This allows for coarse-grained control over conversation flow while maintaining natural interaction patterns. The self-improvement pipeline leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. The discrete nature of the latent variables facilitates search-based strategies and reinforcement learning at the state level rather than token level. Applications include customer service, mental health support, educational assistants, and entertainment where emotionally responsive conversation is valuable.
Authors: Yizhe Zhang, Navdeep Jaitly
Link: https://arxiv.org/abs/2503.03040v1
Date: 2025-03-04
Summary:
Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Action Chain (SAC), which augments standard language model fine-tuning by introducing latent variables that encapsulate emotional states and conversational strategies between dialogue turns. During inference, these variables are generated before each response, enabling coarse-grained control over dialogue progression while maintaining natural interaction patterns. We also introduce a self-improvement pipeline that leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Our experimental results show that models trained with this approach demonstrate improved performance in emotional intelligence metrics while maintaining strong capabilities on LLM benchmarks. The discrete nature of our latent variables facilitates search-based strategies and provides a foundation for future applications of reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level.
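A hedged sketch of the state-action-chain idea: before each assistant turn, a discrete latent annotation (emotional state plus strategy) is generated first, and the reply is then conditioned on it, so control and search can operate on states rather than tokens. The tag format and call_llm stub are illustrative assumptions, not the paper's implementation.

def call_llm(prompt: str, stop: str) -> str:
    """Stand-in for a fine-tuned LLM; replace with a real client."""
    return "<state: curious> <strategy: ask_follow_up>" if stop == "</plan>" \
        else "That sounds exciting -- what part are you most looking forward to?"

def respond(history: list[str]) -> tuple[str, str]:
    context = "\n".join(history)
    # Step 1: generate the latent state/strategy for the next turn.
    plan = call_llm(context + "\n<plan>", stop="</plan>")
    # Step 2: generate the reply conditioned on that latent plan.
    reply = call_llm(context + f"\n<plan>{plan}</plan>\nAssistant:", stop="\n")
    return plan, reply

plan, reply = respond(["User: I'm starting a new job next week."])
print(plan)
print(reply)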
--------------------------------------------------------------------------------------------------------
MedHEval introduces a systematic framework for evaluating hallucinations in Medical Large Vision-Language Models (Med-LVLMs). This benchmark uniquely categorizes hallucinations by their underlying causes: visual misinterpretation, knowledge deficiency, and context misalignment. The researchers constructed diverse medical VQA datasets with comprehensive metrics to assess these hallucination types across 11 popular models and 7 state-of-the-art mitigation techniques. Results reveal that existing Med-LVLMs struggle with all hallucination types, while current mitigation methods show limited effectiveness for knowledge and context-based errors. This research is crucial for developing reliable AI-assisted diagnostic tools, medical education platforms, and clinical decision support systems where accuracy and trustworthiness are essential.
Authors: Aofei Chang, Le Huang, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao
Link: https://arxiv.org/abs/2503.02157v1
Date: 2025-03-04
Summary:
Large Vision Language Models (LVLMs) are becoming increasingly important in the medical domain, yet Medical LVLMs (Med-LVLMs) frequently generate hallucinations due to limited expertise and the complexity of medical applications. Existing benchmarks fail to effectively evaluate hallucinations based on their underlying causes and lack assessments of mitigation strategies. To address this gap, we introduce MedHEval, a novel benchmark that systematically evaluates hallucinations and mitigation strategies in Med-LVLMs by categorizing them into three underlying causes: visual misinterpretation, knowledge deficiency, and context misalignment. We construct a diverse set of close- and open-ended medical VQA datasets with comprehensive evaluation metrics to assess these hallucination types. We conduct extensive experiments across 11 popular (Med)-LVLMs and evaluate 7 state-of-the-art hallucination mitigation techniques. Results reveal that Med-LVLMs struggle with hallucinations arising from different causes while existing mitigation methods show limited effectiveness, especially for knowledge- and context-based errors. These findings underscore the need for improved alignment training and specialized mitigation strategies to enhance Med-LVLMs' reliability. MedHEval establishes a standardized framework for evaluating and mitigating medical hallucinations, guiding the development of more trustworthy Med-LVLMs.
--------------------------------------------------------------------------------------------------------
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
This groundbreaking research introduces the first adversarial training paradigm specifically designed to defend Multimodal Large Language Models (MLLMs) against jailbreak attacks. The proposed Projection Layer Against Adversarial Training (ProEAT) framework efficiently handles large-scale parameters by focusing adversarial training on a lightweight projector layer rather than the entire model. It incorporates dynamic weight adjustment and joint optimization across visual and textual modalities to enhance defense performance. Testing on five major jailbreak methods across three mainstream MLLMs demonstrates state-of-the-art defense performance, outperforming existing baselines by 34% while sacrificing only 1% in clean accuracy. This technology is vital for securing AI assistants, content moderation systems, and AI-powered applications against malicious exploitation.
Authors: Liming Lu, Shuchao Pang, Siyuan Liang, Haotian Zhu, Xiyu Zeng, Aishan Liu, Yunhuai Liu, Yongbin Zhou
Link: https://arxiv.org/abs/2503.04833v1
Date: 2025-03-05
Summary:
Multimodal large language models (MLLMs) have made remarkable strides in cross-modal comprehension and generation tasks. However, they remain vulnerable to jailbreak attacks, where crafted perturbations bypass security guardrails and elicit harmful outputs. In this paper, we present the first adversarial training (AT) paradigm tailored to defend against jailbreak attacks during the MLLM training phase. Extending traditional AT to this domain poses two critical challenges: efficiently tuning massive parameters and ensuring robustness against attacks across multiple modalities. To address these challenges, we introduce Projection Layer Against Adversarial Training (ProEAT), an end-to-end AT framework. ProEAT incorporates a projector-based adversarial training architecture that efficiently handles large-scale parameters while maintaining computational feasibility by focusing adversarial training on a lightweight projector layer instead of the entire model; additionally, we design a dynamic weight adjustment mechanism that optimizes the loss function's weight allocation based on task demands, streamlining the tuning process. To enhance defense performance, we propose a joint optimization strategy across visual and textual modalities, ensuring robust resistance to jailbreak attacks originating from either modality. Extensive experiments conducted on five major jailbreak attack methods across three mainstream MLLMs demonstrate the effectiveness of our approach. ProEAT achieves state-of-the-art defense performance, outperforming existing baselines by an average margin of +34% across text and image modalities, while incurring only a 1% reduction in clean accuracy. Furthermore, evaluations on real-world embodied intelligent systems highlight the practical applicability of our framework, paving the way for the development of more secure and reliable multimodal systems.
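A minimal torch sketch of projector-focused adversarial training: the vision encoder and LLM stay frozen, an adversarial perturbation is crafted on the visual features, and only the lightweight projector is updated to remain robust. The loss, the single-step attack, and all tensors are illustrative assumptions, not ProEAT's exact procedure.

import torch
import torch.nn as nn

vision_feats = torch.randn(8, 256)            # frozen vision-encoder output (placeholder)
target = torch.randn(8, 512)                  # placeholder alignment target in LLM space
projector = nn.Linear(256, 512)               # the only trainable module
opt = torch.optim.Adam(projector.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
eps = 0.05

for step in range(3):
    # Craft a one-step (FGSM-style) perturbation on the visual features.
    delta = torch.zeros_like(vision_feats, requires_grad=True)
    adv_loss = loss_fn(projector(vision_feats + delta), target)
    adv_loss.backward()
    delta = eps * delta.grad.sign()

    # Train the projector on the perturbed features only.
    opt.zero_grad()
    loss = loss_fn(projector(vision_feats + delta.detach()), target)
    loss.backward()
    opt.step()
    print(step, float(loss))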
--------------------------------------------------------------------------------------------------------
One-Shot Clustering for Federated Learning
This research introduces One-Shot Clustered Federated Learning (OCFL), a clustering-agnostic algorithm that automatically identifies the optimal moment for clustering clients in federated learning environments. The approach computes cosine similarity between client gradients and uses a temperature measure to detect when the federated model begins converging. OCFL enables personalized models without requiring hyperparameter adjustments, as demonstrated through extensive testing across thirty different tasks on three benchmark datasets. This technology has significant applications in healthcare, where patient data privacy is paramount, financial services requiring personalized models, IoT networks with heterogeneous devices, and mobile applications where client data distributions vary significantly across user groups.
Authors: Maciej Krzysztof Zuziak, Roberto Pellungrini, Salvatore Rinzivillo
Link: https://arxiv.org/abs/2503.04231v1
Date: 2025-03-06
Summary:
Federated Learning (FL) is a widespread and well-adopted paradigm of decentralized learning that allows training one model from multiple sources without the need to directly transfer data between participating clients. Since its inception in 2015, it has been divided into numerous sub-fields that deal with application-specific issues, be it data heterogeneity or resource allocation. One such sub-field, Clustered Federated Learning (CFL), deals with the problem of clustering the population of clients into separate cohorts to deliver personalized models. Although a few remarkable works have been published in this domain, the problem is still largely unexplored, as its basic assumptions and settings differ slightly from those of standard FL. In this work, we present One-Shot Clustered Federated Learning (OCFL), a clustering-agnostic algorithm that can automatically detect the earliest suitable moment for clustering. Our algorithm is based on the computation of cosine similarity between gradients of the clients and a temperature measure that detects when the federated model starts to converge. We empirically evaluate our methodology by testing various one-shot clustering algorithms for over thirty different tasks on three benchmark datasets. Our experiments showcase the good performance of our approach when used to perform CFL in an automated manner without the need to adjust hyperparameters.
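A hedged numpy sketch of the one-shot clustering idea: once a convergence signal ("temperature", here simply the norm of the averaged update) drops below a threshold, compute cosine similarities between client gradients and group clients whose updates point in similar directions. The temperature measure and the threshold clustering are simplified stand-ins for the paper's components.

import numpy as np

def cosine_matrix(grads: np.ndarray) -> np.ndarray:
    normed = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    return normed @ normed.T

def one_shot_cluster(grads: np.ndarray, temp_threshold=0.5, sim_threshold=0.8):
    temperature = np.linalg.norm(grads.mean(axis=0))   # small once client updates start cancelling out
    if temperature > temp_threshold:
        return None                                    # not yet the moment to cluster
    sim = cosine_matrix(grads)
    clusters, assigned = [], set()
    for i in range(len(grads)):
        if i in assigned:
            continue
        members = [j for j in range(len(grads)) if j not in assigned and sim[i, j] >= sim_threshold]
        clusters.append(members)
        assigned.update(members)
    return clusters

rng = np.random.default_rng(0)
# Two synthetic cohorts with opposing gradient directions.
grads = np.vstack([rng.normal(1.0, 0.1, (5, 20)), rng.normal(-1.0, 0.1, (5, 20))])
print(one_shot_cluster(grads))  # e.g. [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]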
--------------------------------------------------------------------------------------------------------
PAIR: A Novel Large Language Model-Guided Selection Strategy for Evolutionary Algorithms
PAIR (Preference-Aligned Individual Reciprocity) introduces a revolutionary approach to selection in Evolutionary Algorithms by leveraging Large Language Models to mimic human-like mate selection. This novel method prompts LLMs to evaluate individuals based on genetic diversity, fitness level, and crossover compatibility, enabling more intelligent pairing decisions than traditional random or simplistic selection methods. Experimental results show PAIR significantly outperforms baseline methods on Traveling Salesman Problem instances, achieving lower optimality gaps and improved convergence, particularly when combined with flash thinking models. This innovation could transform optimization challenges in logistics, engineering design, resource allocation, and computational biology where evolutionary algorithms are commonly applied, leading to more efficient solutions.
Authors: Shady Ali, Mahmoud Ashraf, Seif Hegazy, Fatty Salem, Hoda Mokhtar, Mohamed Medhat Gaber, Mohamed Taher Alrefaie
Link: https://arxiv.org/abs/2503.03239v1
Date: 2025-03-05
Summary:
Evolutionary Algorithms (EAs) employ random or simplistic selection methods, limiting their exploration of solution spaces and convergence to optimal solutions. The randomness in performing crossover or mutations may limit the model's ability to evolve efficiently. This paper introduces Preference-Aligned Individual Reciprocity (PAIR), a novel selection approach leveraging Large Language Models to emulate human-like mate selection, thereby introducing intelligence to the pairing process in EAs. PAIR prompts an LLM to evaluate individuals within a population based on genetic diversity, fitness level, and crossover compatibility, guiding more informed pairing decisions. We evaluated PAIR against LLM-driven EA (LMEA), a recently published baseline method. Results indicate that PAIR significantly outperforms LMEA across various TSP instances, achieving lower optimality gaps and improved convergence. This performance is especially noticeable when combined with the flash thinking model, demonstrating increased population diversity to escape local optima. In general, PAIR provides a new strategy in the area of in-context learning for LLM-driven selection in EAs via sophisticated preference modelling, paving the way for improved solutions and further studies into LLM-guided optimization.
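A hedged sketch of LLM-guided parent selection: the population is described to the model, which returns preferred pairings to use for crossover. The prompt wording, response format, and call_llm stub are assumptions; the method's criteria (diversity, fitness, crossover compatibility) appear here only as prompt text.

import json

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM client; replace with a real API call."""
    return json.dumps([[0, 3], [1, 2]])

def select_pairs(population: list[list[int]], fitnesses: list[float]) -> list[list[int]]:
    lines = [f"{i}: tour={ind} length={fit:.1f}"
             for i, (ind, fit) in enumerate(zip(population, fitnesses))]
    prompt = ("You are selecting mating pairs for a genetic algorithm on the TSP.\n"
              "Prefer pairs that balance genetic diversity, fitness, and crossover "
              "compatibility.\nPopulation:\n" + "\n".join(lines) +
              "\nReturn a JSON list of index pairs.")
    return json.loads(call_llm(prompt))

population = [[0, 1, 2, 3], [3, 2, 1, 0], [1, 0, 3, 2], [2, 3, 0, 1]]
print(select_pairs(population, [10.0, 12.5, 11.2, 9.8]))  # [[0, 3], [1, 2]]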
--------------------------------------------------------------------------------------------------------
JPDS-NN addresses the Entrance Dependent Vehicle Routing Problem in agriculture by considering field geometry and entrance constraints often overlooked by traditional methods. This Joint Probability Distribution Sampling Neural Network employs an encoder-decoder architecture with graph transformers and attention mechanisms to model routing as a Markov Decision Process. Trained via reinforcement learning, it achieves impressive results: 48-65% reduction in travel distances, 14-18% lower fuel consumption, and computation speeds two orders of magnitude faster than baseline methods. The framework enables intelligent routing for large-scale farming under dynamic constraints, with applications in precision agriculture, autonomous farming equipment coordination, and sustainable agricultural practices requiring optimal resource utilization.
Authors: Yixuan Fan, Haotian Xu, Mengqiao Liu, Qing Zhuo, Tao Zhang
Link: https://arxiv.org/abs/2503.02369v1
Date: 2025-03-04
Summary:
The Entrance Dependent Vehicle Routing Problem (EDVRP) is a variant of the Vehicle Routing Problem (VRP) where the scale of cities influences routing outcomes, necessitating consideration of their entrances. This paper addresses EDVRP in agriculture, focusing on multi-parameter vehicle planning for irregularly shaped fields. To address the limitations of traditional methods, such as heuristic approaches, which often overlook field geometry and entrance constraints, we propose a Joint Probability Distribution Sampling Neural Network (JPDS-NN) to effectively solve the EDVRP. The network uses an encoder-decoder architecture with graph transformers and attention mechanisms to model routing as a Markov Decision Process, and is trained via reinforcement learning for efficient and rapid end-to-end planning. Experimental results indicate that JPDS-NN reduces travel distances by 48.4-65.4%, lowers fuel consumption by 14.0-17.6%, and computes two orders of magnitude faster than baseline methods, while demonstrating 15-25% superior performance in dynamic arrangement scenarios. Ablation studies validate the necessity of cross-attention and pre-training. The framework enables scalable, intelligent routing for large-scale farming under dynamic constraints.
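A hedged numpy sketch of decoding a route as a Markov Decision Process: at each step, attention-style scores over unvisited nodes are masked and turned into a probability distribution from which the next node is sampled. The graph-transformer encoder, entrance handling, and reinforcement-learning training of JPDS-NN are not reproduced; embeddings here are random.

import numpy as np

def decode_route(node_embs: np.ndarray, rng) -> list[int]:
    n, d = node_embs.shape
    visited = np.zeros(n, dtype=bool)
    current = 0
    visited[current] = True
    route = [current]
    for _ in range(n - 1):
        scores = node_embs @ node_embs[current] / np.sqrt(d)   # attention-like compatibility
        scores[visited] = -np.inf                              # mask already-visited nodes
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        current = rng.choice(n, p=probs)                       # sample the next action
        visited[current] = True
        route.append(current)
    return route

rng = np.random.default_rng(0)
print(decode_route(rng.normal(size=(8, 16)), rng))  # a permutation of 8 nodes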
--------------------------------------------------------------------------------------------------------
Multi Agent based Medical Assistant for Edge Devices
This innovative research introduces an on-device, multi-agent healthcare assistant that overcomes limitations of Large Action Models in healthcare settings. By utilizing smaller, task-specific agents, the system optimizes resources while ensuring privacy, reducing latency, and eliminating dependency on internet access. Powered by the Qwen Code Instruct 2.5 7B model, the system achieves impressive performance metrics while remaining lightweight enough for on-device deployment. Features include appointment booking, health monitoring, medication reminders, and daily health reporting. This technology could transform healthcare delivery in remote areas with limited connectivity, enable private health monitoring for sensitive conditions, and provide accessible healthcare assistance for elderly and vulnerable populations.
Authors: Sakharam Gawade, Shivam Akhouri, Chinmay Kulkarni, Jagdish Samant, Pragya Sahu, Aastik, Jai Pahal, Saswat Meher
Link: https://arxiv.org/abs/2503.05397v1
Date: 2025-03-07
Summary:
Large Action Models (LAMs) have revolutionized intelligent automation, but their application in healthcare faces challenges due to privacy concerns, latency, and dependency on internet access. This report introduces an on-device, multi-agent healthcare assistant that overcomes these limitations. The system utilizes smaller, task-specific agents to optimize resources and ensure scalability and high performance. Our proposed system acts as a one-stop solution for healthcare needs with features like appointment booking, health monitoring, medication reminders, and daily health reporting. Powered by the Qwen Code Instruct 2.5 7B model, the Planner and Caller Agents achieve an average RougeL score of 85.5 for planning and 96.5 for calling for our tasks while being lightweight for on-device deployment. This innovative approach combines the benefits of on-device systems with multi-agent architectures, paving the way for user-centric healthcare solutions.
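A hedged sketch of the planner/caller split described above: a planner agent turns the user request into a list of tool calls, and a caller agent executes each call against a small registry of task-specific tools. The tool names, plan format, and call_llm stub are illustrative assumptions, not the system's actual interfaces.

import json

def call_llm(prompt: str) -> str:
    """Stand-in for the on-device model; replace with a real client."""
    return json.dumps([{"tool": "book_appointment",
                        "args": {"doctor": "Dr. Rao", "time": "2025-03-10T09:00"}}])

TOOLS = {
    "book_appointment": lambda doctor, time: f"Booked {doctor} at {time}",
    "set_medication_reminder": lambda drug, time: f"Reminder set for {drug} at {time}",
}

def handle_request(user_request: str) -> list[str]:
    plan = json.loads(call_llm(f"Plan tool calls for: {user_request}"))  # planner agent
    results = []
    for step in plan:                                                    # caller agent
        tool = TOOLS[step["tool"]]
        results.append(tool(**step["args"]))
    return results

print(handle_request("Book me a check-up with Dr. Rao on Monday morning."))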
--------------------------------------------------------------------------------------------------------
This research introduces the Surveillance Video-Assisted Federated Digital Twin (SV-FDT) framework to enhance intelligent transportation systems by incorporating pedestrians and vehicles in-the-loop. The three-layer architecture collects multi-source traffic surveillance videos, performs semantic segmentation and interaction modeling at the edge layer, and integrates local digital twin systems into a global model in real-time. Testbed evaluations demonstrate its effectiveness in optimizing traffic management compared to traditional terminal-server frameworks. This technology could revolutionize urban traffic management, enable more realistic transportation simulations for infrastructure planning, enhance pedestrian safety systems, and optimize traffic signal timing in complex urban environments with diverse road users.
Authors: Xiaolong Li, Jianhao Wei, Haidong Wang, Li Dong, Ruoyang Chen, Changyan Yi, Jun Cai, Dusit Niyato, Xuemin, Shen
Link: https://arxiv.org/abs/2503.04170v1
Date: 2025-03-06
Summary:
In intelligent transportation systems (ITSs), incorporating pedestrians and vehicles in-the-loop is crucial for developing realistic and safe traffic management solutions. However, existing work falls short of simulating complex real-world ITS scenarios, primarily due to the lack of a digital twin implementation framework for characterizing interactions between pedestrians and vehicles at different locations in different traffic environments. In this article, we propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop. Specifically, SV-FDT builds comprehensive pedestrian-vehicle interaction models by leveraging multi-source traffic surveillance videos. Its architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time. We analyze key design requirements and challenges and present core guidelines for SV-FDT's system implementation. A testbed evaluation demonstrates its effectiveness in optimizing traffic management. Comparisons with traditional terminal-server frameworks highlight SV-FDT's advantages in mirroring delays, recognition accuracy, and subjective evaluation. Finally, we identify some open challenges and discuss future research directions.
--------------------------------------------------------------------------------------------------------
Position: Model Collapse Does Not Mean What You Think
This thought-provoking position paper challenges widespread concerns about model collapse—the supposed degradation in future generative models' performance when trained on synthetic data. The authors identify eight distinct definitions of model collapse in current literature and argue that inconsistent terminology has hindered comprehensive understanding of the phenomenon. Through rigorous assessment of research methodologies against realistic conditions, they conclude that many predicted collapse scenarios rely on assumptions that poorly match real-world conditions and are readily avoidable. This critical analysis invites researchers, policymakers, and industry leaders to reconsider overhyped threats and focus attention on more probable harms from generative AI that have received disproportionately less attention.
Authors: Rylan Schaeffer, Joshua Kazdan, Alvan Caleb Arulandu, Sanmi Koyejo
Link: https://arxiv.org/abs/2503.03150v1
Date: 2025-03-05
Summary:
The proliferation of AI-generated content online has fueled concerns over model collapse, a degradation in future generative models' performance when trained on synthetic data generated by earlier models. Industry leaders, premier research journals and popular science publications alike have prophesied catastrophic societal consequences stemming from model collapse. In this position piece, we contend this widespread narrative fundamentally misunderstands the scientific evidence. We highlight that research on model collapse actually encompasses eight distinct and at times conflicting definitions of model collapse, and argue that inconsistent terminology within and between papers has hindered building a comprehensive understanding of model collapse. To assess how significantly different interpretations of model collapse threaten future generative models, we posit what we believe are realistic conditions for studying model collapse and then conduct a rigorous assessment of the literature's methodologies through this lens. While we leave room for reasonable disagreement, our analysis of research studies, weighted by how faithfully each study matches real-world conditions, leads us to conclude that certain predicted claims of model collapse rely on assumptions and conditions that poorly match real-world conditions, and in fact several prominent collapse scenarios are readily avoidable. Altogether, this position paper argues that model collapse has been warped from a nuanced multifaceted consideration into an oversimplified threat, and that the evidence suggests specific harms more likely under society's current trajectory have received disproportionately less attention.
--------------------------------------------------------------------------------------------------------
This reflective analysis examines how large-scale generative AI techniques have transformed data storytelling tools since becoming publicly available. The researchers compare collaboration patterns in the latest tools with earlier ones using a dedicated framework for understanding human-AI collaboration in data storytelling. They identify persistent patterns, like human-creator + AI-assistant, and emerging ones, such as AI-creator + human-reviewer. The paper reveals benefits of these AI techniques and their implications for human-AI collaboration before proposing future directions to inspire innovation. This research provides valuable insights for journalists, data scientists, educators, and business intelligence professionals seeking to leverage generative AI for more compelling and accessible data narratives.
Authors: Haotian Li, Yun Wang, Huamin Qu
Link: https://arxiv.org/abs/2503.02631v1
Date: 2025-03-04
Summary:
Human-AI collaborative tools attract attention from the data storytelling community to lower the barrier of expertise and streamline the workflow. Recent advances in large-scale generative AI techniques, e.g., large language models (LLMs) and text-to-image models, have the potential to enhance data storytelling with their power in visual and narration generation. Two years after these techniques became publicly available, it is important to reflect on our progress in applying them and to look ahead to future opportunities. To achieve the goal, we compare the collaboration patterns of the latest tools with those of earlier ones using a dedicated framework for understanding human-AI collaboration in data storytelling. Through comparison, we identify persistent collaboration patterns, e.g., human-creator + AI-assistant, and emerging ones, e.g., AI-creator + human-reviewer. The benefits of these AI techniques and other implications for human-AI collaboration are also revealed. We further propose future directions that we hope will ignite innovation.
--------------------------------------------------------------------------------------------------------