Week Ending 3.30.2025
RESEARCH WATCH: 3.30.2025
Monopole current control in artificial spin ice via localized fields
This research explores controlling magnetic monopole currents in artificial spin ice systems through strategically placed vertical control elements. By using Monte Carlo simulations, the researchers demonstrate how localized magnetic fields can guide monopole flow across the lattice—sometimes even against the applied field direction—by reshaping the energy landscape and suppressing monopole nucleation along specific edges. This breakthrough in manipulating collective magnetic behaviors has significant implications for developing advanced magnetic memory devices, physical reservoir computing systems, reconfigurable magnetic logic gates, and spin-based information processing architectures that require directional magnetic charge transport.
Authors: Julia Frank, Johan van Lierop, Robert L. Stamps
Link: https://arxiv.org/abs/2503.20983v1
Date: 2025-03-26
Summary:
Artificial spin ice systems are metamaterials composed of interacting nanomagnets arranged on a lattice, exhibiting geometrical frustration and emergent phenomena such as monopole excitations. We explore magnetization dynamics and monopole current control in square artificial spin ice with added vertical control elements. Using Monte Carlo simulations, we examine how localized magnetic fields from these elements influence vertex configurations and domain propagation, enabling directional and polarity control of monopole currents. The control elements suppress monopole nucleation along one edge, steering monopole flow across the lattice, sometimes even against the applied field direction. These elements also reshape the system's energy landscape, producing tailored hysteresis and guided state transitions. Our results offer a strategy for manipulating collective behaviours in artificial spin ice using localized fields. This has implications for magnetic memory, physical reservoir computing, enabling reconfigurable magnetic logic and spin-based information processing, and device architectures requiring directional magnetic charge transport.
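For readers who want a feel for the mechanism, here is a minimal sketch, not the authors' simulation: a Metropolis Monte Carlo loop over Ising-like macrospins on a square lattice, driven by a uniform applied field plus a localized bias field along one edge column (the role the vertical control elements play in suppressing nucleation). Lattice size, coupling, temperature, and field strengths are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters (not taken from the paper).
    N, J, H_APPLIED, H_LOCAL, BETA, SWEEPS = 16, 1.0, 0.4, -1.5, 2.0, 200

    spins = rng.choice([-1, 1], size=(N, N))        # Ising-like macrospin states
    h = np.full((N, N), H_APPLIED)                  # uniform driving field
    h[:, 0] += H_LOCAL                              # localized control field along one edge

    def local_energy(s, i, j):
        """Energy of site (i, j): simplified nearest-neighbour coupling plus Zeeman term.

        The real square-ice vertex geometry and dipolar interactions are not modeled here."""
        nn = s[(i + 1) % N, j] + s[(i - 1) % N, j] + s[i, (j + 1) % N] + s[i, (j - 1) % N]
        return -J * s[i, j] * nn - h[i, j] * s[i, j]

    for sweep in range(SWEEPS):
        for _ in range(N * N):
            i, j = rng.integers(N), rng.integers(N)
            dE = -2.0 * local_energy(spins, i, j)   # energy change if spin (i, j) flips
            if dE <= 0 or rng.random() < np.exp(-BETA * dE):
                spins[i, j] *= -1

    print("edge magnetization :", spins[:, 0].mean())
    print("bulk magnetization :", spins[:, 1:].mean())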
--------------------------------------------------------------------------------------------------------
Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation
The researchers introduce pix2pix-zeroCon, an innovative zero-shot diffusion-based method that addresses two key challenges in image-to-image translation: prompt formulation and content preservation. Unlike previous approaches, this method eliminates the need for additional training by leveraging patch-wise contrastive loss. It automatically determines editing direction in the text embedding space and preserves content through cross-attention guiding loss and patch-wise contrastive loss between generated and original image embeddings. Operating directly on pre-trained text-to-image diffusion models without retraining, experiments show this approach achieves superior fidelity and controllability in image editing applications.
Authors: Qi Si, Bo Wang, Zhao Zhang
Link: https://arxiv.org/abs/2503.20484v1
Date: 2025-03-26
Summary:
The diffusion model has demonstrated superior performance in synthesizing diverse and high-quality images for text-guided image translation. However, there remains room for improvement in both the formulation of text prompts and the preservation of reference image content. First, variations in target text prompts can significantly influence the quality of the generated images, and it is often challenging for users to craft an optimal prompt that fully captures the content of the input image. Second, while existing models can introduce desired modifications to specific regions of the reference image, they frequently induce unintended alterations in areas that should remain unchanged. To address these challenges, we propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss. Specifically, we automatically determine the editing direction in the text embedding space based on the reference image and target prompts. Furthermore, to ensure precise content and structural preservation in the edited image, we introduce cross-attention guiding loss and patch-wise contrastive loss between the generated and original image embeddings within a pre-trained diffusion model. Notably, our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model. Extensive experiments demonstrate that our method surpasses existing models in image-to-image translation, achieving enhanced fidelity and controllability.
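As a rough illustration of the "editing direction in the text embedding space" step, the sketch below computes a direction as the difference of mean CLIP text embeddings over several source and target phrasings, in the spirit of pix2pix-zero. The model name and prompt lists are assumptions, and the cross-attention guiding loss and patch-wise contrastive loss of the full method are not shown.

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    # Model choice and sentence lists are illustrative assumptions.
    model_id = "openai/clip-vit-large-patch14"
    tokenizer = CLIPTokenizer.from_pretrained(model_id)
    text_encoder = CLIPTextModel.from_pretrained(model_id).eval()

    source_prompts = ["a photo of a cat", "a cute cat sitting", "a close-up of a cat"]
    target_prompts = ["a photo of a dog", "a cute dog sitting", "a close-up of a dog"]

    @torch.no_grad()
    def mean_embedding(prompts):
        """Pooled CLIP text embeddings, averaged over several phrasings."""
        tokens = tokenizer(prompts, padding=True, return_tensors="pt")
        return text_encoder(**tokens).pooler_output.mean(dim=0)

    # Edit direction in text-embedding space: mean(target) - mean(source).
    edit_direction = mean_embedding(target_prompts) - mean_embedding(source_prompts)
    edit_direction = edit_direction / edit_direction.norm()
    print(edit_direction.shape)   # e.g. torch.Size([768])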
--------------------------------------------------------------------------------------------------------
Analyzing Modern NVIDIA GPU cores
This paper reverse-engineers modern NVIDIA GPU core designs, revealing crucial microarchitectural details that have remained largely unknown to academic researchers. The study uncovers how modern GPUs leverage hardware-compiler techniques, detailing the issue logic, register file structure, memory pipeline, and instruction prefetching mechanisms. By modeling these newly discovered features, the researchers achieved 18.24% lower mean absolute percentage error in execution cycle predictions compared to previous simulators. Their findings demonstrate that NVIDIA's software-based dependence management outperforms hardware-based scoreboards in both performance and area efficiency, providing valuable insights for future GPU design and simulation.
Authors: Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, Antonio González
Link: https://arxiv.org/abs/2503.20481v1
Date: 2025-03-26
Summary:
GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old. This paper reverse engineers modern NVIDIA GPU cores, unveiling many key aspects of their design and explaining how GPUs leverage hardware-compiler techniques where the compiler guides hardware during execution. In particular, it reveals how the issue logic works, including the policy of the issue scheduler, the structure of the register file and its associated cache, and multiple features of the memory pipeline. Moreover, it analyses how a simple instruction prefetcher based on a stream buffer fits well with modern NVIDIA GPUs and is likely to be used. Furthermore, we investigate the impact of the register file cache and the number of register file read ports on both simulation accuracy and performance. By modeling all these newly discovered microarchitectural details, we achieve 18.24% lower mean absolute percentage error (MAPE) in execution cycles than previous state-of-the-art simulators, resulting in an average of 13.98% MAPE with respect to real hardware (NVIDIA RTX A6000). Also, we demonstrate that this new model holds for other NVIDIA architectures, such as Turing. Finally, we show that the software-based dependence management mechanism included in modern NVIDIA GPUs outperforms a hardware mechanism based on scoreboards in terms of performance and area.
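The headline accuracy numbers are mean absolute percentage errors over a benchmark suite; the snippet below shows how a figure like the 13.98% MAPE against an RTX A6000 is computed, using made-up cycle counts.

    import numpy as np

    def mape(real_cycles, simulated_cycles):
        """Mean absolute percentage error of simulated vs. measured execution cycles."""
        real = np.asarray(real_cycles, dtype=float)
        sim = np.asarray(simulated_cycles, dtype=float)
        return 100.0 * np.mean(np.abs(sim - real) / real)

    # Made-up cycle counts for a handful of kernels, purely to show the calculation.
    real = [120_000, 450_000, 98_000, 2_300_000]
    sim  = [131_000, 420_000, 110_000, 2_150_000]
    print(f"MAPE vs. real hardware: {mape(real, sim):.2f}%")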
--------------------------------------------------------------------------------------------------------
FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies
FastFT introduces an innovative framework for automating feature transformation in machine learning workflows through three advanced strategies. First, it decouples feature transformation evaluation from downstream tasks using a performance predictor, saving considerable time on large datasets. Second, it addresses reward sparsity by developing a method to evaluate the novelty of transformation sequences, accelerating the model's exploration of effective transformations. Third, it combines novelty and performance metrics to create a prioritized memory buffer that ensures essential experiences are effectively revisited during exploration. Experimental results validate the framework's superior performance, efficiency, and traceability in handling complex feature transformation tasks.
Authors: Tianqi He, Xiaohan Huang, Yi Du, Qingqing Long, Ziyue Qiao, Min Wu, Yanjie Fu, Yuanchun Zhou, Meng Xiao
Link: https://arxiv.org/abs/2503.20394v1
Date: 2025-03-26
Summary:
Feature Transformation is crucial for classic machine learning, aiming to generate feature combinations that enhance the performance of downstream tasks from a data-centric perspective. Current methodologies, such as manual expert-driven processes, iterative-feedback techniques, and exploration-generative tactics, have shown promise in automating such data engineering workflows by minimizing human involvement. However, three challenges remain in those frameworks: (1) They predominantly depend on downstream task performance metrics, making assessment time-consuming, especially for large datasets. (2) The diversity of feature combinations can hardly be guaranteed after random exploration ends. (3) Rare significant transformations lead to sparse valuable feedback that hinders the learning process or leads to less effective results. In response to these challenges, we introduce FastFT, an innovative framework that leverages a trio of advanced strategies. We first decouple the feature transformation evaluation from the outcomes of the generated datasets via the performance predictor. To address the issue of reward sparsity, we develop a method to evaluate the novelty of generated transformation sequences. Incorporating this novelty into the reward function accelerates the model's exploration of effective transformations, thereby improving search productivity. Additionally, we combine novelty and performance to create a prioritized memory buffer, ensuring that essential experiences are effectively revisited during exploration. Our extensive experimental evaluations validate the performance, efficiency, and traceability of our proposed framework, showcasing its superiority in handling complex feature transformation tasks.
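A hedged sketch of the two reinforcement-learning ideas described above, a reward that blends a performance-predictor score with a novelty term and a priority queue that replays the most valuable transformation sequences, appears below; the novelty measure, weighting, and feature names are illustrative assumptions.

    import heapq
    import random

    def novelty(sequence, seen_sequences):
        """Crude novelty score: 1 - max token-set overlap with previously seen sequences."""
        tokens = set(sequence)
        if not seen_sequences:
            return 1.0
        overlap = max(len(tokens & set(s)) / max(len(tokens | set(s)), 1) for s in seen_sequences)
        return 1.0 - overlap

    def reward(predicted_performance, sequence, seen_sequences, alpha=0.7):
        """Blend predictor score and novelty (alpha is an illustrative weight)."""
        return alpha * predicted_performance + (1 - alpha) * novelty(sequence, seen_sequences)

    # Prioritized memory buffer keyed by reward (max-heap via negated priority).
    buffer, seen = [], []
    for step in range(5):
        seq = random.sample(["log(x1)", "x1*x2", "x2**2", "x1+x3", "sqrt(x3)"], k=2)
        perf = random.random()                      # stand-in for the performance predictor
        r = reward(perf, seq, seen)
        heapq.heappush(buffer, (-r, step, seq))     # step breaks ties between equal rewards
        seen.append(seq)

    best = heapq.heappop(buffer)
    print("highest-priority experience:", best[2], "reward:", -best[0])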
--------------------------------------------------------------------------------------------------------
AI Identity, Empowerment, and Mindfulness in Mitigating Unethical AI Use
This study examines the complex relationship between AI identity, psychological empowerment, and unethical AI behavior among college students. The researchers found that while a strong AI identity enhances psychological empowerment and academic engagement, it can paradoxically lead to increased unethical AI practices. Crucially, the research identifies IT mindfulness as an effective ethical safeguard that promotes awareness of ethical concerns and reduces AI misuse. These findings have significant implications for educators developing AI policies, suggesting a balanced approach that encourages digital engagement while fostering responsibility. The research contributes to discussions of psychological agency and offers strategies for aligning technological advancement with ethical accountability in educational settings.
Authors: Mayssam Tarighi Shaayesteh, Sara Memarian Esfahani, Hossein Mohit
Link: https://arxiv.org/abs/2503.20099v1
Date: 2025-03-25
Summary:
This study examines how AI identity influences psychological empowerment and unethical AI behavior among college students, while also exploring the moderating role of IT mindfulness. Findings show that a strong AI identity enhances psychological empowerment and academic engagement but can also lead to increased unethical AI practices. Crucially, IT mindfulness acts as an ethical safeguard, promoting sensitivity to ethical concerns and reducing misuse of AI. These insights have implications for educators, policymakers, and AI developers, emphasizing the need for a balanced approach that encourages digital engagement without compromising student responsibility. The study also contributes to philosophical discussions of psychological agency, suggesting that empowerment through AI can yield both positive and negative outcomes. Mindfulness emerges as essential in guiding ethical AI interactions. Overall, the research informs ongoing debates on ethics in education and AI, offering strategies to align technological advancement with ethical accountability and responsible use.
--------------------------------------------------------------------------------------------------------
Can Multi-modal (reasoning) LLMs work as deepfake detectors?
This groundbreaking study evaluates the effectiveness of cutting-edge multi-modal large language models (LLMs) as deepfake image detectors. The researchers benchmarked 12 latest multi-modal LLMs—including OpenAI O1/4o, Gemini Flash 2, and Claude 3.5/3.7 Sonnet—against traditional deepfake detection methods across multiple datasets. Their findings reveal that top-performing multi-modal LLMs achieve competitive zero-shot performance, sometimes surpassing traditional methods on out-of-distribution datasets. Interestingly, newer model versions and reasoning capabilities didn't necessarily improve performance in deepfake detection, while model size sometimes helped. This research highlights the potential for integrating multi-modal reasoning into future deepfake detection frameworks and provides insights into model interpretability for real-world applications.
Authors: Simiao Ren, Yao Yao, Kidus Zewde, Zisheng Liang, Tsang, Ng, Ning-Yau Cheng, Xiaoou Zhan, Qinzhe Liu, Yifei Chen, Hengwei Xu
Link: https://arxiv.org/abs/2503.20084v1
Date: 2025-03-25
Summary:
Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state-of-the-art multi-modal (reasoning) large language models (LLMs), such as OpenAI O1/4o, Gemini thinking Flash 2, Deepseek Janus, Grok 3, Llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, and Claude 3.5/3.7 Sonnet, for deepfake image detection. We benchmark 12 of the latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that the best multi-modal LLMs achieve competitive zero-shot performance with promising generalization ability, even surpassing traditional deepfake detection pipelines on out-of-distribution datasets, while the rest of the LLM families perform very poorly, some worse than random guessing. Furthermore, we found that newer model versions and reasoning capabilities do not necessarily improve performance in such niche tasks as deepfake detection, while model size does help in some cases. This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.
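The zero-shot setup amounts to prompting each model once per image and parsing a real/fake verdict; a minimal evaluation harness is sketched below. The query_vlm function is a hypothetical stand-in for whichever vendor SDK (OpenAI, Gemini, Claude, etc.) is being benchmarked.

    from dataclasses import dataclass

    @dataclass
    class Sample:
        image_path: str
        is_fake: bool

    PROMPT = ("You are a forensic image analyst. Is this image a deepfake or AI-generated? "
              "Answer with a single word: REAL or FAKE.")

    def query_vlm(image_path: str, prompt: str) -> str:
        """Hypothetical stand-in for a multi-modal LLM call (OpenAI, Gemini, Claude, ...)."""
        raise NotImplementedError("wire this to the vendor SDK of your choice")

    def evaluate(samples):
        correct = 0
        for s in samples:
            verdict = query_vlm(s.image_path, PROMPT).strip().upper()
            predicted_fake = verdict.startswith("FAKE")
            correct += int(predicted_fake == s.is_fake)
        return correct / len(samples)

    # accuracy = evaluate([Sample("img_001.png", True), Sample("img_002.png", False)])
    # A useful sanity check is comparing the result against the 0.5 random-guess baseline.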
--------------------------------------------------------------------------------------------------------
BugCraft: End-to-End Crash Bug Reproduction Using LLM Agents in Minecraft
BugCraft introduces a novel framework for automating the reproduction of crash bugs in Minecraft using LLM agents. The system employs a two-stage approach: first, a Step Synthesizer transforms user bug reports into structured reproduction steps using LLMs and Minecraft Wiki knowledge; then, an Action Model powered by a vision-based LLM agent (GPT-4o) executes these steps in-game to trigger the reported crash. Tested on the newly created BugCraft-Bench dataset, the framework successfully reproduced 30.23% of crash bugs end-to-end, with the Step Synthesizer achieving 66.28% accuracy in generating correct reproduction plans. This breakthrough demonstrates the feasibility of automated game bug reproduction and opens new possibilities for game testing.
Authors: Eray Yapağcı, Yavuz Alp Sencer Öztürk, Eray Tüzün
Link: https://arxiv.org/abs/2503.20036v1
Date: 2025-03-25
Summary:
Reproducing game bugs, in our case crash bugs in continuously evolving games like Minecraft, is a notoriously manual, time-consuming, and challenging process to automate. Despite the success of LLM-driven bug reproduction in other software domains, games, with their complex interactive environments, remain largely unaddressed. This paper introduces BugCraft, a novel end-to-end framework designed to automate the reproduction of crash bugs in Minecraft directly from user-submitted bug reports, addressing the critical gap in automated game bug reproduction. BugCraft employs a two-stage approach: first, a Step Synthesizer leverages LLMs and Minecraft Wiki knowledge to transform bug reports into high-quality, structured steps to reproduce (S2R). Second, an Action Model, powered by a vision-based LLM agent (GPT-4o) and a custom macro API, executes these S2R steps within Minecraft to trigger the reported crash. To facilitate evaluation, we introduce BugCraft-Bench, a curated dataset of Minecraft crash bug reports. Evaluated on BugCraft-Bench, our framework successfully reproduced 30.23% of crash bugs end-to-end. The Step Synthesizer demonstrated a 66.28% accuracy in generating correct bug reproduction plans, highlighting its effectiveness in interpreting and structuring bug report information. BugCraft demonstrates the feasibility of automated reproduction of crash bugs in complex game environments using LLMs, opening promising avenues for game testing and development. The framework and the BugCraft-Bench dataset pave the way for future research in automated game bug analysis and hold potential for generalization to other interactive game platforms. Finally, we make our code open at https://bugcraft2025.github.io/
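The two-stage flow can be summarized in a few lines of orchestration code; in the hedged sketch below, synthesize_steps and execute_steps are hypothetical placeholders for the Step Synthesizer (LLM plus wiki retrieval) and the GPT-4o-driven Action Model with its macro API.

    from dataclasses import dataclass, field

    @dataclass
    class ReproductionResult:
        crashed: bool
        executed_steps: list = field(default_factory=list)

    def synthesize_steps(bug_report: str) -> list[str]:
        """Stage 1 (hypothetical): turn a raw bug report into structured steps-to-reproduce."""
        raise NotImplementedError("LLM + Minecraft Wiki retrieval goes here")

    def execute_steps(steps: list[str]) -> ReproductionResult:
        """Stage 2 (hypothetical): vision-based agent drives the game via a macro API."""
        raise NotImplementedError("GPT-4o agent + game automation goes here")

    def reproduce(bug_report: str) -> ReproductionResult:
        steps = synthesize_steps(bug_report)        # S2R plan
        result = execute_steps(steps)               # in-game execution
        if result.crashed:
            print("Crash reproduced after", len(result.executed_steps), "steps")
        return result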
--------------------------------------------------------------------------------------------------------
FedMM-X: Federated Multi-Modal Explainable Intelligence
FedMM-X represents a significant advancement in trustworthy artificial intelligence by unifying federated learning with explainable multi-modal reasoning. The framework addresses the challenges of decentralized, dynamic settings through cross-modal consistency checks, client-level interpretability mechanisms, and dynamic trust calibration. This approach effectively handles data heterogeneity, modality imbalance, and out-of-distribution generalization issues. Evaluations on federated vision-language tasks demonstrate improved accuracy and interpretability while reducing vulnerability to adversarial correlations. The framework's novel trust score aggregation method quantifies global model reliability under changing client participation, advancing the development of robust, interpretable, and socially responsible AI systems for real-world applications.
Authors: Sree Bhargavi Balija
Link: https://arxiv.org/abs/2503.19564v1
Date: 2025-03-25
Summary:
As artificial intelligence systems increasingly operate in real-world environments, the integration of multi-modal data sources such as vision, language, and audio presents both unprecedented opportunities and critical challenges for achieving trustworthy intelligence. In this paper, we propose a novel framework that unifies federated learning with explainable multi-modal reasoning to ensure trustworthiness in decentralized, dynamic settings. Our approach, called FedMM-X (Federated Multi-Modal Explainable Intelligence), leverages cross-modal consistency checks, client-level interpretability mechanisms, and dynamic trust calibration to address challenges posed by data heterogeneity, modality imbalance, and out-of-distribution generalization. Through rigorous evaluation across federated multi-modal benchmarks involving vision-language tasks, we demonstrate improved performance in both accuracy and interpretability while reducing vulnerabilities to adversarial and spurious correlations. Further, we introduce a novel trust score aggregation method to quantify global model reliability under dynamic client participation. Our findings pave the way toward developing robust, interpretable, and socially responsible AI systems in real-world environments.
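The paper's exact trust score formulation is not reproduced here, but the aggregation idea can be sketched as trust-weighted federated averaging, with per-client trust scores assumed to come from the calibration step.

    import torch

    def trust_weighted_average(client_states, trust_scores):
        """Aggregate client model states, weighting each client by a normalized trust score."""
        weights = torch.tensor(trust_scores, dtype=torch.float32)
        weights = weights / weights.sum()
        aggregated = {}
        for name in client_states[0]:
            stacked = torch.stack([state[name].float() for state in client_states])
            aggregated[name] = (weights.view(-1, *([1] * (stacked.dim() - 1))) * stacked).sum(dim=0)
        return aggregated

    # Toy example: three clients, one linear layer, trust scores from (hypothetical) calibration.
    clients = [{"fc.weight": torch.randn(4, 8), "fc.bias": torch.randn(4)} for _ in range(3)]
    global_state = trust_weighted_average(clients, trust_scores=[0.9, 0.6, 0.75])
    print({k: v.shape for k, v in global_state.items()})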
--------------------------------------------------------------------------------------------------------
Video-T1: Test-Time Scaling for Video Generation
Video-T1 explores the potential of Test-Time Scaling (TTS) for video generation, reinterpreting it as a search problem to find better trajectories from Gaussian noise to target video distributions. Instead of expensive training costs to scale up video models, this approach leverages additional inference-time computation to improve generation quality. The researchers developed two key methods: a linear search strategy that increases noise candidates, and a more efficient Tree-of-Frames (ToF) approach that adaptively expands and prunes video branches in an autoregressive manner. Experiments consistently show that increasing test-time computation significantly improves video quality, offering a promising direction for enhancing video generation without the costs of model retraining.
Authors: Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan
Link: https://arxiv.org/abs/2503.18942v1
Date: 2025-03-24
Summary:
With the scaling of training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have expanded scaling to test time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training costs, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a search problem to sample better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers to provide feedback and heuristic algorithms to guide the search process. Given a text prompt, we first explore an intuitive linear search strategy by increasing noise candidates at inference time. As full-step denoising of all frames simultaneously requires heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in the quality of videos. Project page: https://liuff19.github.io/Video-T1
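The linear search strategy is essentially best-of-N over initial noises, scored by a test-time verifier; the sketch below shows that loop with generate_video and verifier_score as hypothetical placeholders for the diffusion model and the verifier.

    import torch

    def generate_video(prompt: str, noise: torch.Tensor) -> torch.Tensor:
        """Hypothetical video diffusion call: denoise `noise` conditioned on `prompt`."""
        raise NotImplementedError

    def verifier_score(video: torch.Tensor, prompt: str) -> float:
        """Hypothetical test-time verifier (e.g. a video-text alignment model)."""
        raise NotImplementedError

    def linear_search(prompt: str, num_candidates: int = 8, noise_shape=(16, 4, 64, 64)):
        """Best-of-N over initial Gaussian noises: more inference compute, better clips."""
        best_video, best_score = None, float("-inf")
        for _ in range(num_candidates):
            noise = torch.randn(noise_shape)
            video = generate_video(prompt, noise)
            score = verifier_score(video, prompt)
            if score > best_score:
                best_video, best_score = video, score
        return best_video, best_score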
--------------------------------------------------------------------------------------------------------
EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation
EvAnimate introduces a groundbreaking approach to human animation by leveraging event camera data streams instead of traditional video-derived motion cues. Event cameras offer exceptional advantages—high temporal resolution, wide dynamic range, and resistance to motion blur and exposure issues—that overcome the limitations of conventional video data. The framework employs a specialized event representation that transforms asynchronous streams into diffusion model-compatible 3-channel slices with controllable parameters. A dual-branch architecture then generates high-quality videos by harnessing the inherent motion dynamics of event streams. With specialized data augmentation strategies and a new benchmarking system, EvAnimate demonstrates superior temporal fidelity and robust performance even in extreme scenarios where traditional approaches fail.
Authors: Qiang Qu, Ming Li, Xiaoming Chen, Tongliang Liu
Link: https://arxiv.org/abs/2503.18552v1
Date: 2025-03-24
Summary:
Conditional human animation transforms a static reference image into a dynamic sequence by applying motion cues such as poses. These motion cues are typically derived from video data but are susceptible to limitations including low temporal resolution, motion blur, overexposure, and inaccuracies under low-light conditions. In contrast, event cameras provide data streams with exceptionally high temporal resolution, a wide dynamic range, and inherent resistance to motion blur and exposure issues. In this work, we propose EvAnimate, a framework that leverages event streams as motion cues to animate static human images. Our approach employs a specialized event representation that transforms asynchronous event streams into 3-channel slices with controllable slicing rates and appropriate slice density, ensuring compatibility with diffusion models. Subsequently, a dual-branch architecture generates high-quality videos by harnessing the inherent motion dynamics of the event streams, thereby enhancing both video quality and temporal consistency. Specialized data augmentation strategies further enhance cross-person generalization. Finally, we establish a new benchmark, including simulated event data for training and validation and a real-world event dataset capturing human actions under normal and extreme scenarios. The experimental results demonstrate that EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
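One simple way to picture the event representation is binning the asynchronous (x, y, t, polarity) stream into fixed-duration 3-channel slices, e.g. ON counts, OFF counts, and a normalized timestamp channel; the layout below is an assumption for illustration, not the paper's exact encoding.

    import numpy as np

    def events_to_slices(events, height, width, slice_ms=10.0):
        """events: array of (x, y, t_us, polarity) rows -> stack of 3-channel slices."""
        events = np.asarray(events, dtype=float)
        t0, t1 = events[:, 2].min(), events[:, 2].max()
        n_slices = max(1, int(np.ceil((t1 - t0) / (slice_ms * 1000.0))))
        slices = np.zeros((n_slices, 3, height, width), dtype=np.float32)
        for x, y, t, p in events:
            k = min(n_slices - 1, int((t - t0) // (slice_ms * 1000.0)))
            chan = 0 if p > 0 else 1                       # channel 0: ON events, 1: OFF events
            slices[k, chan, int(y), int(x)] += 1.0
            slices[k, 2, int(y), int(x)] = (t - t0) / max(t1 - t0, 1.0)   # latest normalized time
        return slices

    # Toy stream: four events on a 4x4 sensor.
    toy = [(0, 0, 0, 1), (1, 1, 4000, -1), (2, 2, 9000, 1), (3, 3, 21000, 1)]
    print(events_to_slices(toy, height=4, width=4).shape)   # (3, 3, 4, 4)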
--------------------------------------------------------------------------------------------------------
Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance
This research presents a mask-guided video generation approach that enables precise motion control while requiring limited training data. The innovative model enhances existing architectures by incorporating foreground masks for accurate text-position matching and motion trajectory control. By guiding the generation process through mask motion sequences, the system maintains consistent foreground objects throughout video sequences. Additionally, a first-frame sharing strategy and autoregressive extension approach enable more stable and longer video generation. Experimental results demonstrate superior performance in various applications, including video editing and artistic video creation, with notable improvements in consistency and quality compared to previous methods.
Authors: Sicong Feng, Jielong Yang, Li Peng
Link: https://arxiv.org/abs/2503.18386v1
Date: 2025-03-24
Summary:
Recent advances in diffusion models bring new vitality to visual content creation. However, current text-to-video generation models still face significant challenges such as high training costs, substantial data requirements, and difficulties in maintaining consistency between given text and motion of the foreground object. To address these challenges, we propose mask-guided video generation, which can control video generation through mask motion sequences, while requiring limited training data. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. Through mask motion sequences, we guide the video generation process to maintain consistent foreground objects throughout the sequence. Additionally, through a first-frame sharing strategy and autoregressive extension approach, we achieve more stable and longer video generation. Extensive qualitative and quantitative experiments demonstrate that this approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality. Our generated results can be viewed in the supplementary materials.
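A hedged sketch of the first-frame sharing and autoregressive extension ideas: each new segment is conditioned on the shared first frame (for appearance consistency) and on the last frame of the previous segment, with generate_clip standing in hypothetically for the mask-guided generator.

    import torch

    def generate_clip(first_frame, prev_frame, masks, num_frames=16):
        """Hypothetical mask-guided generator: returns a (num_frames, C, H, W) tensor."""
        raise NotImplementedError

    def autoregressive_extend(first_frame, masks_per_segment, num_segments=4):
        """Chain segments; each one re-uses the shared first frame for appearance consistency."""
        segments, prev = [], first_frame
        for masks in masks_per_segment[:num_segments]:
            clip = generate_clip(first_frame, prev, masks)
            segments.append(clip)
            prev = clip[-1]                     # last frame seeds the next segment
        return torch.cat(segments, dim=0)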
--------------------------------------------------------------------------------------------------------
This innovative approach addresses ethical concerns in text-to-image (T2I) generation by enabling comprehensive control over generative content. Unlike existing methods that handle responsibility concepts individually, this plug-and-play technique simultaneously accounts for an extensive range of concepts for fair and safe content generation. The researchers distill target T2I pipelines with an external mechanism that learns an interpretable composite responsible space, utilizing knowledge distillation and concept whitening. Their approach operates at two plug-in points—the text embedding space and diffusion model latent space—creating modules for both that effectively modulate generative content without compromising model performance, advancing responsible AI in image generation.
Authors: Basim Azam, Naveed Akhtar
Link: https://arxiv.org/abs/2503.18324v1
Date: 2025-03-24
Summary:
Ethical issues around text-to-image (T2I) models demand a comprehensive control over the generative content. Existing techniques addressing these issues for responsible T2I models aim for the generated content to be fair and safe (non-violent/explicit). However, these methods remain limited to handling the facets of responsibility concepts individually, while also lacking in interpretability. Moreover, they often require alteration to the original model, which compromises the model performance. In this work, we propose a unique technique to enable responsible T2I generation by simultaneously accounting for an extensive range of concepts for fair and safe content generation in a scalable manner. The key idea is to distill the target T2I pipeline with an external plug-and-play mechanism that learns an interpretable composite responsible space for the desired concepts, conditioned on the target T2I pipeline. We use knowledge distillation and concept whitening to enable this. At inference, the learned space is utilized to modulate the generative content. A typical T2I pipeline presents two plug-in points for our approach, namely, the text embedding space and the diffusion model latent space. We develop modules for both points and show the effectiveness of our approach with a range of strong results.
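To make the "plug-in at the text embedding space" idea concrete, the sketch below shows a small module that projects prompt embeddings away from a set of blocked concept directions; this is a loose inference-time analogy, not the paper's distillation-and-concept-whitening training recipe.

    import torch
    import torch.nn as nn

    class ResponsibleTextPlugin(nn.Module):
        """Illustrative plug-in applied to prompt embeddings before the diffusion U-Net.

        Projects out a learned set of 'blocked' concept directions. The real method learns
        an interpretable composite space via knowledge distillation and concept whitening;
        this only sketches the inference-time modulation."""

        def __init__(self, embed_dim: int, num_blocked_concepts: int):
            super().__init__()
            self.concepts = nn.Parameter(torch.randn(num_blocked_concepts, embed_dim))

        def forward(self, text_embeddings: torch.Tensor) -> torch.Tensor:
            # Orthonormalize concept directions, then remove their components.
            q, _ = torch.linalg.qr(self.concepts.t())          # (embed_dim, k) orthonormal basis
            projection = text_embeddings @ q @ q.t()            # component inside blocked subspace
            return text_embeddings - projection

    plugin = ResponsibleTextPlugin(embed_dim=768, num_blocked_concepts=4)
    prompt_embeddings = torch.randn(2, 77, 768)                 # e.g. CLIP tokens for 2 prompts
    print(plugin(prompt_embeddings).shape)                      # torch.Size([2, 77, 768])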
--------------------------------------------------------------------------------------------------------
Manipulation and the AI Act: Large Language Model Chatbots and the Danger of Mirrors
This paper examines the potential dangers of increasingly humanized AI chatbots that adopt human faces, names, voices, personalities, and quirks. While personification may increase user trust, it could enable manipulation through the illusion of intimate relationships with artificial entities. The author analyzes the European Commission's AI Act, which bans manipulative AI systems that cause significant harm to users, but notes its limitations in preventing cumulative harms from prolonged chatbot interactions. Specifically, chatbots could reinforce negative emotional states over extended periods through feedback loops, prolonged conversations, or harmful recommendations, potentially contributing to deteriorating mental health—a risk not adequately addressed by current regulations.
Authors: Joshua Krook
Link: https://arxiv.org/abs/2503.18387v1
Date: 2025-03-24
Summary:
Large Language Model chatbots are increasingly taking the form and visage of human beings, adopting human faces, names, voices, personalities, and quirks, including those of celebrities and well-known political figures. Personifying AI chatbots could foreseeably increase users' trust in them. However, it could also make them more capable of manipulation, by creating the illusion of a close and intimate relationship with an artificial entity. The European Commission has finalized the AI Act, with the EU Parliament making amendments banning manipulative and deceptive AI systems that cause significant harm to users. Although the AI Act covers harms that accumulate over time, it is unlikely to prevent harms associated with prolonged discussions with AI chatbots. Specifically, a chatbot could reinforce a person's negative emotional state over weeks, months, or years through negative feedback loops, prolonged conversations, or harmful recommendations, contributing to a user's deteriorating mental health.
--------------------------------------------------------------------------------------------------------
Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
Mist introduces a groundbreaking approach to distributed Large Language Model training by comprehensively co-optimizing memory footprint reduction techniques alongside parallelism strategies. This memory, overlap, and imbalance-aware system implements three key innovations: fine-grained overlap-centric scheduling that orchestrates optimizations in an overlapped manner; symbolic-based performance analysis that predicts runtime and memory usage for fast tuning; and imbalance-aware hierarchical tuning that decouples optimization into interconnected problems. Evaluation results demonstrate that Mist achieves average speedups of 1.28× (up to 1.73×) compared to Megatron-LM and 1.27× (up to 2.04×) compared to Aceso, representing significant advancements in efficient LLM training technology.
Authors: Zhanda Zhu, Christina Giannoula, Muralidhar Andoorveedu, Qidong Su, Karttikeya Mangalam, Bojian Zheng, Gennady Pekhimenko
Link: https://arxiv.org/abs/2503.19050v1
Date: 2025-03-24
Summary:
Various parallelism strategies, such as data, tensor, and pipeline parallelism, along with memory optimizations like activation checkpointing, redundancy elimination, and offloading, have been proposed to accelerate distributed training for Large Language Models. To find the best combination of these techniques, automatic distributed training systems have been proposed. However, existing systems only tune a subset of optimizations, due to the lack of overlap awareness, inability to navigate the vast search space, and ignoring the inter-microbatch imbalance, leading to sub-optimal performance. To address these shortcomings, we propose Mist, a memory, overlap, and imbalance-aware automatic distributed training system that comprehensively co-optimizes all memory footprint reduction techniques alongside parallelism. Mist is based on three key ideas: (1) fine-grained overlap-centric scheduling, orchestrating optimizations in an overlapped manner, (2) symbolic-based performance analysis that predicts runtime and memory usage using symbolic expressions for fast tuning, and (3) imbalance-aware hierarchical tuning, decoupling the process into an inter-stage imbalance and overlap aware Mixed Integer Linear Programming problem and an intra-stage Dual-Objective Constrained Optimization problem, and connecting them through Pareto frontier sampling. Our evaluation results show that Mist achieves an average of 1.28× (up to 1.73×) and 1.27× (up to 2.04×) speedup compared to the state-of-the-art manual system Megatron-LM and the state-of-the-art automatic system Aceso, respectively.
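The symbolic performance analysis idea can be illustrated with a toy memory model: express the footprint as a symbolic function of the tuning knobs, compile it once, and evaluate it across many configurations instead of profiling each one. The formula and constants below are made up.

    import sympy as sp

    # Symbolic tuning knobs (illustrative): microbatch size, tensor-parallel degree,
    # and whether activation checkpointing is on (0 or 1).
    b, tp, ckpt = sp.symbols("b tp ckpt", positive=True)

    params_gb = 40 / tp                                    # sharded parameter memory (made-up model)
    acts_gb = b * (12 / tp) * (1 - 0.7 * ckpt)             # activations, reduced by checkpointing
    mem_gb = params_gb + acts_gb

    estimate = sp.lambdify((b, tp, ckpt), mem_gb, "math")  # compile once, evaluate thousands of times

    for cfg in [(1, 2, 0), (4, 2, 1), (8, 4, 1)]:
        print(cfg, f"~{estimate(*cfg):.1f} GB")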
--------------------------------------------------------------------------------------------------------
Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories
This research addresses the limitations of traditional value alignment evaluations for large language models (LLMs) that rely on single-sentence adversarial prompts. As models become increasingly adept at circumventing straightforward ethical tests, the researchers propose an enhanced benchmark incorporating multi-turn dialogues and narrative-based scenarios. This approach increases the stealth and adversarial nature of evaluations, making them more robust against superficial safeguards. Their dataset, featuring conversational traps and ethically ambiguous storytelling, effectively exposes latent biases that remain undetected in traditional single-shot evaluations. The findings emphasize the necessity of contextual and dynamic testing for more sophisticated and realistic assessments of AI ethics and safety.
Authors: Yazhou Zhang, Qimeng Liu, Qiuchi Li, Peng Zhang, Jing Qin
Link: https://arxiv.org/abs/2503.22115v1
Date: 2025-03-28
Summary:
Evaluating the value alignment of large language models (LLMs) has traditionally relied on single-sentence adversarial prompts, which directly probe models with ethically sensitive or controversial questions. However, with the rapid advancements in AI safety techniques, models have become increasingly adept at circumventing these straightforward tests, limiting their effectiveness in revealing underlying biases and ethical stances. To address this limitation, we propose an upgraded value alignment benchmark that moves beyond single-sentence prompts by incorporating multi-turn dialogues and narrative-based scenarios. This approach enhances the stealth and adversarial nature of the evaluation, making it more robust against superficial safeguards implemented in modern LLMs. We design and implement a dataset that includes conversational traps and ethically ambiguous storytelling, systematically assessing LLMs' responses in more nuanced and context-rich settings. Experimental results demonstrate that this enhanced methodology can effectively expose latent biases that remain undetected in traditional single-shot evaluations. Our findings highlight the necessity of contextual and dynamic testing for value alignment in LLMs, paving the way for more sophisticated and realistic assessments of AI ethics and safety.
--------------------------------------------------------------------------------------------------------
Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation
This study evaluates multiple deep learning approaches for automated news video segmentation, addressing the challenge of organizing unstructured news video content. The researchers developed and tested various architectures—including ResNet, ViViT, AST, and multimodal models—to classify five distinct segment types in news broadcasts. Using a custom-annotated dataset of 41 news videos, the study found that simpler image-based classifiers outperformed more complex temporal models, with ResNet achieving 84.34% accuracy while requiring fewer computational resources. Binary classification models demonstrated particularly high accuracy for transitions and advertisements. These findings provide practical insights for implementing automated content organization systems in media archiving, personalized content delivery, and intelligent video search applications.
Authors: Jonathan Attard, Dylan Seychell
Link: https://arxiv.org/abs/2503.21848v1
Date: 2025-03-27
Summary:
News videos require efficient content organisation and retrieval systems, but their unstructured nature poses significant challenges for automated processing. This paper presents a comprehensive comparative analysis of image, video, and audio classifiers for automated news video segmentation. This work presents the development and evaluation of multiple deep learning approaches, including ResNet, ViViT, AST, and multimodal architectures, to classify five distinct segment types: advertisements, stories, studio scenes, transitions, and visualisations. Using a custom-annotated dataset of 41 news videos comprising 1,832 scene clips, our experiments demonstrate that image-based classifiers achieve superior performance (84.34% accuracy) compared to more complex temporal models. Notably, the ResNet architecture outperformed state-of-the-art video classifiers while requiring significantly fewer computational resources. Binary classification models achieved high accuracy for transitions (94.23%) and advertisements (92.74%). These findings advance the understanding of effective architectures for news video segmentation and provide practical insights for implementing automated content organisation systems in media applications. These include media archiving, personalised content delivery, and intelligent video search.
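The winning configuration is conceptually simple: a standard image backbone fine-tuned over the five segment classes. The sketch below shows such a setup with an illustrative ResNet-50 and hyperparameters that are assumptions, not the paper's exact training recipe.

    import torch
    import torch.nn as nn
    from torchvision import models

    CLASSES = ["advertisement", "story", "studio", "transition", "visualisation"]

    # Illustrative setup: a ResNet backbone with its final layer replaced for 5 classes.
    model = models.resnet50(weights=None)
    model.fc = nn.Linear(model.fc.in_features, len(CLASSES))

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train_step(frames: torch.Tensor, labels: torch.Tensor) -> float:
        """One optimisation step on a batch of keyframes (B, 3, 224, 224) and class indices (B,)."""
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
        return loss.item()

    print(train_step(torch.randn(4, 3, 224, 224), torch.randint(0, len(CLASSES), (4,))))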
--------------------------------------------------------------------------------------------------------
GenFusion: Closing the Loop between Reconstruction and Generation via Videos
This research addresses the conditioning gap between 3D reconstruction and generation, where reconstruction requires densely captured views while generation typically relies on minimal input. The researchers propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Their innovative cyclical fusion pipeline iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion beyond viewpoint limitations of previous approaches. Evaluations on view synthesis from sparse and masked inputs validate the effectiveness of their method, potentially bridging the divide between these two traditionally separate domains and expanding applications for both fields.
Authors: Sibo Wu, Congrong Xu, Binbin Huang, Andreas Geiger, Anpei Chen
Link: https://arxiv.org/abs/2503.21219v1
Date: 2025-03-27
Summary:
Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields, e.g., scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single or no input view, which significantly limits their applications. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse views and masked inputs, validates the effectiveness of our approach.
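The cyclical fusion pipeline reduces to a short loop: render artifact-prone RGB-D views from the current reconstruction, restore them with the video diffusion model, and fold the restored frames back into the training set. All functions in the skeleton below are hypothetical placeholders.

    def render_rgbd_views(reconstruction, novel_poses):
        """Hypothetical: render (possibly artifact-prone) RGB-D frames from the current 3D model."""
        raise NotImplementedError

    def restore_frames(rgbd_frames):
        """Hypothetical: reconstruction-conditioned video diffusion model cleans up the renders."""
        raise NotImplementedError

    def fit_reconstruction(training_views):
        """Hypothetical: re-optimize the 3D reconstruction on the enlarged training set."""
        raise NotImplementedError

    def cyclical_fusion(initial_views, novel_pose_schedule, rounds=3):
        training_views = list(initial_views)
        reconstruction = fit_reconstruction(training_views)
        for poses in novel_pose_schedule[:rounds]:
            renders = render_rgbd_views(reconstruction, poses)     # expand beyond captured viewpoints
            training_views.extend(restore_frames(renders))         # generative restoration
            reconstruction = fit_reconstruction(training_views)    # close the loop
        return reconstruction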
--------------------------------------------------------------------------------------------------------
VideoGEM: Training-free Action Grounding in Videos
VideoGEM introduces the first training-free spatial action grounding method for videos, adapting the self-self attention formulation of GEM to the challenge of localizing actions that typically lack clear physical outlines. The researchers make three key innovations: implementing layer weighting in the self-attention path to prioritize higher layers where high-level semantic concepts emerge; developing dynamic weighting to automatically tune layer weights based on prompt relevance; and introducing prompt decomposition to process action, verb, and object prompts separately for better spatial localization. Evaluations across three image- and video-language backbones and four video grounding datasets demonstrate that this training-free approach outperforms current trained state-of-the-art methods for spatial video grounding.
Authors: Felix Vogel, Walid Bousselham, Anna Kukleva, Nina Shvetsova, Hilde Kuehne
Link: https://arxiv.org/abs/2503.20348v1
Date: 2025-03-26
Summary:
Vision-language foundation models have shown impressive capabilities across various zero-shot tasks, including training-free localization and grounding, primarily focusing on localizing objects in images. However, leveraging those capabilities to localize actions and events in videos is challenging, as actions have less physical outline and are usually described by higher-level concepts. In this work, we propose VideoGEM, the first training-free spatial action grounding method based on pretrained image- and video-language backbones. Namely, we adapt the self-self attention formulation of GEM to spatial activity grounding. We observe that high-level semantic concepts, such as actions, usually emerge in the higher layers of the image- and video-language models. We, therefore, propose a layer weighting in the self-attention path to prioritize higher layers. Additionally, we introduce a dynamic weighting method to automatically tune layer weights to capture each layer's relevance to a specific prompt. Finally, we introduce a prompt decomposition, processing action, verb, and object prompts separately, resulting in a better spatial localization of actions. We evaluate the proposed approach on three image- and video-language backbones, CLIP, OpenCLIP, and ViCLIP, and on four video grounding datasets, V-HICO, DALY, YouCook-Interactions, and GroundingYouTube, showing that the proposed training-free approach is able to outperform current trained state-of-the-art approaches for spatial video grounding.
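A loose numpy toy of the weighting ideas: per-layer localization maps are combined with weights that favor deeper layers and are modulated by prompt-dependent relevance scores, and maps from the decomposed prompts are then averaged. This simplifies away GEM's actual self-self attention computation.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def combine_layers(layer_maps, layer_scores):
        """layer_maps: (L, H, W) per-layer heatmaps; layer_scores: per-layer relevance to the prompt.
        Deeper layers get a higher static prior, modulated by the prompt-dependent scores."""
        L = layer_maps.shape[0]
        static_prior = np.linspace(0.5, 1.5, L)            # favor higher (later) layers
        weights = softmax(static_prior * np.asarray(layer_scores))
        return np.tensordot(weights, layer_maps, axes=1)   # (H, W)

    def grounded_heatmap(maps_by_prompt, scores_by_prompt):
        """Average the combined maps of decomposed prompts (e.g. action / verb / object)."""
        combined = [combine_layers(m, s) for m, s in zip(maps_by_prompt, scores_by_prompt)]
        return np.mean(combined, axis=0)

    # Toy example: 4 layers, 8x8 maps, three decomposed prompts.
    rng = np.random.default_rng(0)
    maps = [rng.random((4, 8, 8)) for _ in range(3)]
    scores = [[0.2, 0.4, 0.9, 1.0], [0.1, 0.3, 0.8, 0.9], [0.3, 0.5, 0.7, 1.1]]
    print(grounded_heatmap(maps, scores).shape)            # (8, 8)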
--------------------------------------------------------------------------------------------------------
FACE: Few-shot Adapter with Cross-view Fusion for Cross-subject EEG Emotion Recognition
FACE introduces an innovative solution to the challenging problem of cross-subject EEG emotion recognition, addressing both inter-subject variability and entangled intra-subject variability. The method employs two key components: a cross-view fusion module that dynamically integrates global brain connectivity with localized patterns through subject-specific fusion weights, and a few-shot adapter module that enables rapid adaptation for unseen subjects while reducing overfitting through meta-learning enhanced adapter structures. Experiments on three public EEG emotion recognition benchmarks demonstrate FACE's superior performance over state-of-the-art methods, providing a practical solution for scenarios with limited labeled data and advancing the field of emotion recognition technology.
Authors: Haiqi Liu, C. L. Philip Chen, Tong Zhang
Link: https://arxiv.org/abs/2503.18998v1
Date: 2025-03-24
Summary:
Cross-subject EEG emotion recognition is challenged by significant inter-subject variability and intricately entangled intra-subject variability. Existing works have primarily addressed these challenges through domain adaptation or generalization strategies. However, they typically require extensive target subject data or demonstrate limited generalization performance to unseen subjects. Recent few-shot learning paradigms attempt to address these limitations but often encounter catastrophic overfitting during subject-specific adaptation with limited samples. This article introduces the few-shot adapter with a cross-view fusion method called FACE for cross-subject EEG emotion recognition, which leverages dynamic multi-view fusion and effective subject-specific adaptation. Specifically, FACE incorporates a cross-view fusion module that dynamically integrates global brain connectivity with localized patterns via subject-specific fusion weights to provide complementary emotional information. Moreover, the few-shot adapter module is proposed to enable rapid adaptation for unseen subjects while reducing overfitting by enhancing adapter structures with meta-learning. Experimental results on three public EEG emotion recognition benchmarks demonstrate FACE's superior generalization performance over state-of-the-art methods. FACE provides a practical solution for cross-subject scenarios with limited labeled data.
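The cross-view fusion module can be pictured as a subject-conditioned gate between the global-connectivity view and the localized view; the PyTorch sketch below uses assumed dimensions and a sigmoid gate, which may differ from the paper's exact design.

    import torch
    import torch.nn as nn

    class CrossViewFusion(nn.Module):
        """Blend a global-connectivity view and a localized view of EEG features with
        subject-specific fusion weights predicted from a subject embedding (illustrative)."""

        def __init__(self, feat_dim: int, subject_dim: int):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(subject_dim, feat_dim), nn.Sigmoid())

        def forward(self, global_feat, local_feat, subject_embedding):
            w = self.gate(subject_embedding)                 # per-dimension fusion weight in (0, 1)
            return w * global_feat + (1.0 - w) * local_feat  # complementary combination

    fusion = CrossViewFusion(feat_dim=128, subject_dim=16)
    g, l, s = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 16)
    print(fusion(g, l, s).shape)                             # torch.Size([8, 128])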
--------------------------------------------------------------------------------------------------------
From Trust to Truth: Actionable policies for the use of AI in fact-checking in Germany and Ukraine
This policy paper examines the implications of artificial intelligence for journalism and fact-checking, using Germany and Ukraine as case studies. While AI offers powerful tools to combat disinformation and enhance journalistic practices, its implementation currently lacks cohesive regulation, leading to risks of bias, transparency issues, and over-reliance on automated systems. The research proposes establishing an independent media regulatory framework in Ukraine while suggesting Germany can lead EU-wide collaborations and standards development. The paper provides actionable recommendations for responsible AI integration into media ecosystems, emphasizing the need for clear policies and collaborative efforts to ensure AI supports accurate information and public trust in journalism.
Authors: Veronika Solopova
Link: https://arxiv.org/abs/2503.18724v1
Date: 2025-03-24
Summary:
The rise of Artificial Intelligence (AI) presents unprecedented opportunities and challenges for journalism, fact-checking and media regulation. While AI offers tools to combat disinformation and enhance media practices, its unregulated use and associated risks necessitate clear policies and collaborative efforts. This policy paper explores the implications of artificial intelligence (AI) for journalism and fact-checking, with a focus on addressing disinformation and fostering responsible AI integration. Using Germany and Ukraine as key case studies, it identifies the challenges posed by disinformation, proposes regulatory and funding strategies, and outlines technical standards to enhance AI adoption in media. The paper offers actionable recommendations to ensure AI's responsible and effective integration into media ecosystems. AI presents significant opportunities to combat disinformation and enhance journalistic practices. However, its implementation lacks cohesive regulation, leading to risks such as bias, transparency issues, and over-reliance on automated systems. In Ukraine, establishing an independent media regulatory framework adapted to its governance is crucial, while Germany can act as a leader in advancing EU-wide collaborations and standards. Together, these efforts can shape a robust AI-driven media ecosystem that promotes accuracy and trust.
--------------------------------------------------------------------------------------------------------