Week Ending 4.13.2025
RESEARCH WATCH: 4.13.2025
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
This groundbreaking research introduces Seaweed-7B, a cost-efficient video generation foundation model that achieves competitive performance with just 7 billion parameters. Despite using modest computational resources (665,000 H100 GPU hours), it matches or exceeds larger models, demonstrating the importance of strategic design choices in resource-constrained environments. The model's strong generalization capabilities make it adaptable across various downstream applications through lightweight fine-tuning or continued training. This work represents a significant advancement in democratizing video generation technology, potentially enabling broader access to high-quality video synthesis tools for researchers and developers with limited computational resources.
Authors: Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yanghua Peng, Yuping Wang, Yuxuan Wang, Zhongkai Zhao, Zhuo Chen, Zuquan Song, Zhenheng Yang, Jiashi Feng, Jianchao Yang, Lu Jiang
Link: https://arxiv.org/abs/2504.08685v1
Date: 2025-04-11
Summary:
This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpasses, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continued training. See the project page at https://seaweed.video/
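To put the 665,000 H100 GPU hours in perspective, the budget can be converted into wall-clock time for a given cluster size. The sketch below does only this arithmetic. The cluster size of 1,000 GPUs is an illustrative assumption rather than a figure from the summary, and the calculation assumes every GPU runs in parallel at full utilization.

```python
GPU_HOURS = 665_000  # training budget quoted in the summary


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Days of continuous training, assuming all GPUs run in parallel at full utilization."""
    return gpu_hours / num_gpus / 24


# Hypothetical cluster size, chosen only for illustration.
days_on_1000 = wall_clock_days(GPU_HOURS, 1_000)
print(f"{days_on_1000:.1f} days on 1,000 GPUs")
```

Real training runs include restarts, evaluation, and idle time, so actual calendar time would be longer than this idealized estimate.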
--------------------------------------------------------------------------------------------------------
Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs
Pangu Ultra is a dense large language model with 135 billion parameters trained on Ascend Neural Processing Units (NPUs). The researchers introduce depth-scaled sandwich normalization to stabilize training and pre-train the model on 13.2 trillion diverse tokens. Across multiple benchmarks, Pangu Ultra advances the state of the art among dense models such as Llama 405B and Mistral Large 2, and achieves competitive results with DeepSeek-R1, a sparse model with many more parameters. This research demonstrates that Ascend NPUs can efficiently train dense models exceeding 100 billion parameters, potentially expanding access to powerful language models for commercial applications while challenging the assumption that sparse architectures are necessary for top performance.
Authors: Yichun Yin, Wenyong Huang, Kaikai Song, Yehui Tang, Xueyu Wu, Wei Guo, Peng Guo, Yaoyuan Wang, Xiaojun Meng, Yasheng Wang, Dong Li, Can Chen, Dandan Tu, Yin Li, Fisher Yu, Ruiming Tang, Yunhe Wang, Baojun Wang, Bin Wang, Bo Wang, Boxiao Liu, Changzheng Zhang, Duyu Tang, Fei Mi, Hui Jin, Jiansheng Wei, Jiarui Qin, Jinpeng Li, Jun Zhao, Liqun Deng, Lin Li, Minghui Xu, Naifu Zhang, Nianzu Zheng, Qiang Li, Rongju Ruan, Shengjun Cheng, Tianyu Guo, Wei He, Wei Li, Weiwen Liu, Wulong Liu, Xinyi Dai, Yonghan Dong, Yu Pan, Yue Li, Yufei Wang, Yujun Li, Yunsheng Ni, Zhe Liu, Zhenhe Zhang, Zhicheng Liu
Link: https://arxiv.org/abs/2504.07866v2
Date: 2025-04-11
Summary:
We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLMs has been witnessing unprecedented advances in pushing the scale and capability of LLMs in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training process of deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1, whose sparse model structure contains many more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available for our commercial customers.
--------------------------------------------------------------------------------------------------------
Exploring a Patch-Wise Approach for Privacy-Preserving Fake ID Detection
This study addresses fake ID detection under privacy constraints, a major obstacle given that most research relies on proprietary databases that cannot be shared. The researchers propose a patch-wise approach that explores two anonymization levels (fully and pseudo-anonymized) and various patch size configurations to balance privacy protection with detection performance. Using state-of-the-art methods including Vision Transformers and Foundation Models, their solution achieves Equal Error Rates (EERs) of 13.91% at patch level and 0% at ID document level on the unseen DLC-2021 database. The team also releases the first publicly available database containing 48,400 patches from real and fake ID documents, along with the experimental framework and models, potentially accelerating progress in this security-critical field.
Authors: Javier Muñoz-Haro, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez
Link: https://arxiv.org/abs/2504.07761v1
Date: 2025-04-10
Summary:
In an increasingly digitalized world, verifying the authenticity of ID documents has become a critical challenge for real-life applications such as digital banking, crypto-exchanges, renting, etc. This study focuses on the topic of fake ID detection, covering several limitations in the field. In particular, no publicly available data from real ID documents exists, and most studies rely on proprietary in-house databases that are not available due to privacy reasons. In order to shed some light on this critical challenge that makes it difficult to advance in the field, we explore a trade-off between privacy (i.e., amount of sensitive data available) and performance, proposing a novel patch-wise approach for privacy-preserving fake ID detection. Our proposed approach explores how privacy can be enhanced through: i) two levels of anonymization for an ID document (i.e., fully- and pseudo-anonymized), and ii) different patch size configurations, varying the amount of sensitive data visible in the patch image. Also, state-of-the-art methods such as Vision Transformers and Foundation Models are considered in the analysis. The experimental framework shows that, on an unseen database (DLC-2021), our proposal achieves 13.91% and 0% EERs at patch and ID document level, showing a good generalization to other databases. In addition to this exploration, another key contribution of our study is the release of the first publicly available database that contains 48,400 patches from both real and fake ID documents, along with the experimental framework and models, which will be available in our GitHub.
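The results above are reported as Equal Error Rates (EERs), the operating point at which the rate of fake samples accepted as real equals the rate of real samples flagged as fake. The following is a minimal sketch of how an EER can be estimated from detector scores. The function name, the score convention (higher means more likely fake), and the toy scores are all assumptions for illustration and are not taken from the paper's framework.

```python
def equal_error_rate(genuine_scores, fake_scores):
    """Estimate the EER for a detector where a higher score means 'more likely fake'.

    Returns (eer, threshold) at the candidate threshold where the false acceptance
    and false rejection rates are closest to each other.
    """
    best = None
    for t in sorted(set(genuine_scores) | set(fake_scores)):
        # False acceptance: a fake sample scores below the threshold and passes as real.
        far = sum(s < t for s in fake_scores) / len(fake_scores)
        # False rejection: a real sample scores at or above the threshold and is flagged.
        frr = sum(s >= t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical): one real and one fake sample overlap.
overlap_eer, overlap_thr = equal_error_rate([0.1, 0.2, 0.3, 0.6], [0.4, 0.7, 0.8, 0.9])

# Toy scores (hypothetical): classes are perfectly separable, so the EER is 0.
separable_eer, _ = equal_error_rate([0.1, 0.2], [0.8, 0.9])
```

An EER of 0 at document level, as in the second toy case, means some threshold separates every real document from every fake one on the evaluated data.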
--------------------------------------------------------------------------------------------------------
GraspClutter6D introduces a comprehensive real-world dataset for robotic grasping in challenging cluttered environments, addressing limitations in existing benchmarks that focus on simplistic scenes. The dataset features 1,000 highly cluttered scenes with dense object arrangements (14.1 objects/scene with 62.6% occlusion) across 200 objects in 75 environment configurations. With rich annotations including 736K 6D object poses and 9.3B feasible robotic grasps for 52K RGB-D images, this resource enables significant improvements in segmentation, pose estimation, and grasping detection. Benchmarking results demonstrate that models trained on GraspClutter6D substantially outperform those trained on existing datasets, accelerating progress toward robust real-world robotic manipulation.
Authors: Seunghyeok Back, Joosoon Lee, Kangmin Kim, Heeseon Rho, Geonhyup Lee, Raeyoung Kang, Sangbeom Lee, Sangjun Noh, Youngjin Lee, Taeyeop Lee, Kyoobin Lee
Link: https://arxiv.org/abs/2504.06866v1
Date: 2025-04-09
Summary:
Robust grasping in cluttered environments remains an open challenge in robotics. While benchmark datasets have significantly advanced deep learning methods, they mainly focus on simplistic scenes with light occlusion and insufficient diversity, limiting their applicability to practical scenarios. We present GraspClutter6D, a large-scale real-world grasping dataset featuring: (1) 1,000 highly cluttered scenes with dense arrangements (14.1 objects/scene, 62.6% occlusion), (2) comprehensive coverage across 200 objects in 75 environment configurations (bins, shelves, and tables) captured using four RGB-D cameras from multiple viewpoints, and (3) rich annotations including 736K 6D object poses and 9.3B feasible robotic grasps for 52K RGB-D images. We benchmark state-of-the-art segmentation, object pose estimation, and grasping detection methods to provide key insights into challenges in cluttered environments. Additionally, we validate the dataset's effectiveness as a training resource, demonstrating that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments. The dataset, toolkit, and annotation tools are publicly available on our project website: https://sites.google.com/view/graspclutter6d.
--------------------------------------------------------------------------------------------------------
This innovative study tackles the crucial challenge of understanding value mechanisms in large language models (LLMs) beyond surface-level evaluations. The researchers introduce ValueExploration, a framework that investigates behavior-driven mechanisms of social values at the neuron level, focusing on Chinese Social Values as a case study. By constructing C-voice, a large-scale bilingual benchmark, they identify specific neurons responsible for encoding values and analyze how deactivating these neurons affects model behavior. This approach provides unprecedented insights into how values influence LLM decision-making, potentially enabling more transparent and value-aligned AI systems through neuron-level interpretability rather than just output evaluation.
Authors: Ling Hu, Yuemei Xu, Xiaoyang Gu, Letao Han
Link: https://arxiv.org/abs/2504.04994v1
Date: 2025-04-07
Summary:
Despite the impressive performance of large language models (LLMs), they can present unintended biases and harmful behaviors driven by encoded values, emphasizing the urgent need to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety, lacking interpretability and failing to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which aims to explore the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. By leveraging C-voice, we then identify and locate the neurons responsible for encoding these values according to activation difference. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be available.
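The pipeline described above (locate neurons by activation difference, then deactivate them and observe behavior) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes activations have already been collected as plain lists of floats, scores each neuron by the absolute difference in mean activation between two prompt sets, and deactivates a neuron by setting its activation to zero. All names and numbers are hypothetical.

```python
def top_value_neurons(acts_value, acts_neutral, k):
    """Rank neurons by |mean activation on value prompts - mean on neutral prompts|.

    acts_value / acts_neutral: lists of per-prompt activation vectors of equal width.
    Returns the indices of the k neurons with the largest difference.
    """
    width = len(acts_value[0])
    diffs = []
    for j in range(width):
        mean_value = sum(a[j] for a in acts_value) / len(acts_value)
        mean_neutral = sum(a[j] for a in acts_neutral) / len(acts_neutral)
        diffs.append(abs(mean_value - mean_neutral))
    return sorted(range(width), key=lambda j: diffs[j], reverse=True)[:k]


def deactivate(activation, neuron_ids):
    """Return a copy of one activation vector with the selected neurons zeroed."""
    ids = set(neuron_ids)
    return [0.0 if j in ids else a for j, a in enumerate(activation)]


# Hypothetical activations for three neurons over two prompts per set.
value_prompts = [[1.0, 0.2, 3.0], [1.2, 0.1, 2.8]]
neutral_prompts = [[1.1, 0.2, 0.1], [0.9, 0.3, 0.3]]

selected = top_value_neurons(value_prompts, neutral_prompts, k=1)
ablated = deactivate(value_prompts[0], selected)
```

In a real model the zeroing would be applied inside the forward pass, and the shift in generated answers before and after ablation would be measured on the benchmark.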
--------------------------------------------------------------------------------------------------------
Video-MSG introduces a groundbreaking training-free guidance method for text-to-video generation that overcomes limitations in current approaches requiring fine-tuning or memory-intensive attention map manipulation. The three-step process creates a Video Sketch—a fine-grained spatio-temporal plan specifying backgrounds, foregrounds, and object trajectories—which then guides diffusion models through noise inversion and denoising. This model-agnostic approach works with various text-to-video backbones without additional memory requirements during inference, making it compatible with large models. Superior performance with VideoCrafter2 and CogVideoX-5B demonstrates Video-MSG's effectiveness in enhancing text alignment while maintaining visual quality, advancing capabilities in controllable video generation.
Authors: Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal
Link: https://arxiv.org/abs/2504.08641v1
Date: 2025-04-11
Summary:
Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
--------------------------------------------------------------------------------------------------------
F³Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos
F³Set addresses the significant challenge of detecting fast, frequent, and fine-grained (F³) events in video analytics—a problem where current methods struggle due to motion blur and subtle visual differences. This benchmark provides extensive datasets for precise F³ event detection, featuring over 1,000 event types with precise timestamps and multi-level granularity analysis. Currently focused on sports applications but expandable to other domains, F³Set reveals substantial challenges for existing temporal action understanding methods. The researchers also introduce F³ED, a new detection method with superior performance. This benchmark and accompanying resources provide valuable tools for advancing video understanding capabilities in applications requiring high temporal precision and detailed event recognition.
Authors: Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, Jin Song Dong
Link: https://arxiv.org/abs/2504.08222v1
Date: 2025-04-11
Summary:
Analyzing Fast, Frequent, and Fine-grained (F³) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F³ criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F³Set, a benchmark that consists of video datasets for precise F³ event detection. Datasets in F³Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently, F³Set contains several sports datasets, and this framework may be extended to other applications as well. We evaluated popular temporal action understanding methods on F³Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F³ED, for F³ event detections, achieving superior performance. The dataset, model, and benchmark code are available at https://github.com/F3Set/F3Set.
--------------------------------------------------------------------------------------------------------
Leveraging Machine Learning Techniques in Intrusion Detection Systems for Internet of Things
This comprehensive exploration examines how machine learning and deep learning can enhance intrusion detection systems (IDS) for Internet of Things networks, where traditional security approaches struggle with scale and dynamics. The paper evaluates various ML techniques—from Support Vector Machines to Random Forests—alongside advanced DL models like LSTM and CNN, addressing challenges including false positives, data imbalance, and resource constraints. The research highlights emerging applications of Generative AI and Large Language Models in threat detection, automated responses, and security policy generation. By providing a framework for intelligent, adaptive security solutions, this work guides the development of next-generation protection systems for increasingly vulnerable IoT environments.
Authors: Saeid Jamshidi, Amin Nikanjam, Nafi Kawser Wazed, Foutse Khomh
Link: https://arxiv.org/abs/2504.07220v1
Date: 2025-04-09
Summary:
As the Internet of Things (IoT) continues to expand, ensuring the security of connected devices has become increasingly critical. Traditional Intrusion Detection Systems (IDS) often fall short in managing the dynamic and large-scale nature of IoT networks. This paper explores how Machine Learning (ML) and Deep Learning (DL) techniques can significantly enhance IDS performance in IoT environments. We provide a thorough overview of various IDS deployment strategies and categorize the types of intrusions common in IoT systems. A range of ML methods -- including Support Vector Machines, Naive Bayes, K-Nearest Neighbors, Decision Trees, and Random Forests -- are examined alongside advanced DL models such as LSTM, CNN, Autoencoders, RNNs, and Deep Belief Networks. Each technique is evaluated based on its accuracy, efficiency, and suitability for real-world IoT applications. We also address major challenges such as high false positive rates, data imbalance, encrypted traffic analysis, and the resource constraints of IoT devices. In addition, we highlight the emerging role of Generative AI and Large Language Models (LLMs) in improving threat detection, automating responses, and generating intelligent security policies. Finally, we discuss ethical and privacy concerns, underscoring the need for responsible and transparent implementation. This paper aims to provide a comprehensive framework for developing adaptive, intelligent, and secure IDS solutions tailored for the evolving landscape of IoT.
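Two of the challenges named above, data imbalance and high false positive rates, are easy to see with a small calculation. The sketch below uses made-up traffic counts (an assumption for illustration only) to show why accuracy alone is misleading when attacks are rare, and why recall and false positive rate are reported alongside it.

```python
def detection_rates(y_true, y_pred):
    """Compute accuracy, recall, and false positive rate for binary labels (1 = attack)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical imbalanced traffic: 990 benign flows, 10 attack flows.
labels = [0] * 990 + [1] * 10

# A useless detector that never raises an alert still looks accurate.
always_benign = detection_rates(labels, [0] * 1000)

# A detector that catches 8 of 10 attacks but also flags 99 benign flows.
noisy_preds = [1] * 99 + [0] * 891 + [1] * 8 + [0] * 2
noisy = detection_rates(labels, noisy_preds)
```

The first detector reaches 99% accuracy while missing every attack. The second catches most attacks, yet the large majority of its alerts are false alarms because benign traffic dominates.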
--------------------------------------------------------------------------------------------------------
Constitution or Collapse? Exploring Constitutional AI with Llama 3-8B
This study investigates the applicability of Constitutional AI—a technique using AI-generated feedback to reduce human labeling—to smaller language models like LLaMA 3-8B. While the original approach was designed for models with approximately 52 billion parameters, this research replicates the workflow with the much smaller LLaMA 3-8B. Results show Constitutional AI effectively increases harmlessness, reducing the Attack Success Rate by 40.8%, but at the cost of a 9.8% drop in helpfulness. The research reveals signs of model collapse in the final DPO-CAI model, suggesting that self-improvement may be an emergent property requiring larger model sizes—similar to reasoning and math abilities.
Authors: Xue Zhang
Link: https://arxiv.org/abs/2504.04918v1
Date: 2025-04-07
Summary:
As language models continue to grow larger, the cost of acquiring high-quality training data has increased significantly. Collecting human feedback is both expensive and time-consuming, and manual labels can be noisy, leading to an imbalance between helpfulness and harmfulness. Constitutional AI, introduced by Anthropic in December 2022, uses AI to provide feedback to another AI, greatly reducing the need for human labeling. However, the original implementation was designed for a model with around 52 billion parameters, and there is limited information on how well Constitutional AI performs with smaller models, such as LLaMA 3-8B. In this paper, we replicated the Constitutional AI workflow using the smaller LLaMA 3-8B model. Our results show that Constitutional AI can effectively increase the harmlessness of the model, reducing the Attack Success Rate in MT-Bench by 40.8%. However, similar to the original study, increasing harmlessness comes at the cost of helpfulness. The helpfulness metrics, which are an average of the Turn 1 and Turn 2 scores, dropped by 9.8% compared to the baseline. Additionally, we observed clear signs of model collapse in the final DPO-CAI model, indicating that smaller models may struggle with self-improvement due to insufficient output quality, making effective fine-tuning more challenging. Our study suggests that, like reasoning and math ability, self-improvement is an emergent property.
--------------------------------------------------------------------------------------------------------
EIDT-V introduces a novel, model-agnostic approach to zero-shot, training-free text-to-video generation that works with existing image diffusion models without requiring architectural modifications. Using intersections in diffusion trajectories alongside a grid-based method, the system leverages LLMs to generate coherent frame-wise prompts and identify differences between frames. A CLIP-based attention mask controls prompt switching timing for each grid cell, balancing coherence and variance. This flexible approach demonstrates state-of-the-art performance with diverse image generation models like Stable Diffusion, achieving superior temporal consistency and visual fidelity as confirmed by quantitative metrics and user studies. EIDT-V represents a significant advancement in accessible, high-quality video synthesis without specialized training.
Authors: Diljeet Jagpal, Xi Chen, Vinay P. Namboodiri
Link: https://arxiv.org/abs/2504.06861v1
Date: 2025-04-09
Summary:
Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image generation models, which limit their adaptability and scalability. In contrast to such methods, we provide a model-agnostic approach. We use intersections in diffusion trajectories, working only with the latent values. We could not obtain localized frame-wise coherence and diversity using only the intersection of trajectories. Thus, we instead use a grid-based approach. An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames. Based on these, we obtain a CLIP-based attention mask that controls the timing of switching the prompts for each grid cell. Earlier switching results in higher variance, while later switching results in more coherence. Therefore, our approach can ensure appropriate control between coherence and variance for the frames. Our approach results in state-of-the-art performance while being more flexible when working with diverse image-generation models. The empirical analysis using quantitative metrics and user studies confirms our model's superior temporal consistency, visual fidelity and user satisfaction, thus providing a novel way to obtain training-free, image-based text-to-video generation.
--------------------------------------------------------------------------------------------------------
Rate Analysis and Optimization of LoS Beyond Diagonal RIS-assisted MIMO Systems
This technical letter provides valuable insights into beyond-diagonal reconfigurable intelligent surface (BD-RIS) assisted multiple-input multiple-output (MIMO) communication systems. The researchers derive expressions for achievable rates when channels to and from the BD-RIS have line-of-sight properties while the direct link is non-line-of-sight. This mathematical analysis yields a closed-form solution for the optimal unitary and symmetric scattering BD-RIS matrix. Simulation results demonstrate that the proposed solution remains competitive even under more common Ricean channel fading models with weak direct links. These findings advance understanding of RIS technology for enhancing wireless communications, particularly in challenging propagation environments where conventional approaches may underperform.
Authors: Ignacio Santamaria, Jesus Gutierrez, Mohammad Soleymani, Eduard Jorswieck
Link: https://arxiv.org/abs/2504.07647v1
Date: 2025-04-10
Summary:
In this letter, we derive an expression for the achievable rate in a multiple-input multiple-output (MIMO) system assisted by a beyond-diagonal reconfigurable intelligent surface (BD-RIS) when the channels to and from the BD-RIS are line-of-sight (LoS) while the direct link is non-line-of-sight (NLoS). The rate expression allows us to derive the optimal unitary and symmetric scattering BD-RIS matrix in closed form. Our simulation results show that the proposed solution is competitive even under the more usual Ricean channel fading model when the direct link is weak.
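For readers unfamiliar with the setup, the optimization problem can be written in a generic form. The notation below is a common textbook-style formulation and is an assumption for illustration. It is not copied from the letter, whose symbols and closed-form solution may differ.

```latex
% Illustrative notation (assumed, not taken from the letter):
%   H_d      : NLoS direct channel from transmitter to receiver
%   G        : channel from the transmitter to the BD-RIS (M x N_t)
%   F        : channel from the BD-RIS to the receiver (N_r x M)
%   \Theta   : M x M BD-RIS scattering matrix
%   Q        : transmit covariance matrix, \sigma^2 : noise power
\begin{align}
  \mathbf{H}_{\mathrm{eq}} &= \mathbf{H}_d + \mathbf{F}\,\boldsymbol{\Theta}\,\mathbf{G}, \\
  R &= \log_2 \det\!\left(\mathbf{I} + \frac{1}{\sigma^2}\,
       \mathbf{H}_{\mathrm{eq}}\,\mathbf{Q}\,\mathbf{H}_{\mathrm{eq}}^{H}\right), \\
  \text{s.t.}\quad & \boldsymbol{\Theta}^{H}\boldsymbol{\Theta} = \mathbf{I}_M
       \ \text{(unitary)}, \qquad
       \boldsymbol{\Theta} = \boldsymbol{\Theta}^{T} \ \text{(symmetric)}.
\end{align}
```

In a far-field LoS setting, the channels to and from the surface are commonly modeled as rank-one matrices, which is the kind of structure that can make a closed-form choice of the scattering matrix tractable.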
--------------------------------------------------------------------------------------------------------
This comprehensive tutorial introduces artificial spin ice (ASI)—arrays of interacting nanomagnets—as complex magnetic systems with emergent properties, rich microstate spaces, intrinsic memory, and GHz-range dynamics. The article provides foundational knowledge on micromagnetics theory, design principles, fabrication methods, and measurement techniques for these nanomagnetic arrays. Starting with historical context and physical phenomena, it explores experimental techniques for preparing microstates and characterizing magnetization dynamics in both ASI and broader ferromagnetic materials. The tutorial concludes with an introduction to neuromorphic computing applications of ASI systems, offering researchers an entry point into this interdisciplinary field combining nanotechnology, magnetism, and computational paradigms with applications in data processing and unconventional computing.
Authors: Rawnak Sultana, Amrit Kumar Mondal, Vinayak Shantaram Bhat, Kilian Stenning, Yue Li, Daan M. Arroo, Aastha Vasdev, Margaret R. McCarter, Lance E. De Long, J. Todd Hastings, Jack C. Gartside, M. Benjamin Jungfleisch
Link: https://arxiv.org/abs/2504.06548v1
Date: 2025-04-09
Summary:
Artificial spin ice, arrays of strongly interacting nanomagnets, are complex magnetic systems with many emergent properties, rich microstate spaces, intrinsic physical memory, high-frequency dynamics in the GHz range and compatibility with a broad range of measurement approaches. This tutorial article aims to provide the foundational knowledge needed to understand, design, develop, and improve the dynamic properties of artificial spin ice (ASI). Special emphasis is placed on introducing the theory of micromagnetics, which describes the complex dynamics within these systems, along with their design, fabrication methods, and standard measurement and control techniques. The article begins with a review of the historical background, introducing the underlying physical phenomena and interactions that govern artificial spin ice. We then explore standard experimental techniques used to prepare the microstate space of the nanomagnetic array and to characterize magnetization dynamics, both in artificial spin ice and more broadly in ferromagnetic materials. Finally, we introduce the basics of neuromorphic computing applied to the case of artificial spin ice systems with the goal of helping researchers new to the field grasp these exciting new developments.
--------------------------------------------------------------------------------------------------------
This innovative study addresses the challenge of transparency in AI-generated content by developing a technique to embed watermarks directly into large language model weights. The researchers propose finetuning a pair of low-rank adapters—one generating text and another detecting watermarks—to simultaneously optimize watermark embedding and detection. This end-to-end learned approach represents a significant advancement over API-based watermarking methods, though it presents optimization challenges in balancing watermark robustness, text naturalness, and task performance. The paper discusses strategies for optimizing this min-max objective and presents results demonstrating the effects on instruction finetuning. This technique could enhance accountability in AI text generation while maintaining model utility.
Authors: Fay Elhassan, Niccolò Ajroldi, Antonio Orvieto, Jonas Geiping
Link: https://arxiv.org/abs/2504.06446v1
Date: 2025-04-08
Summary:
The indistinguishability of AI-generated content from human text raises challenges in transparency and accountability. While several methods exist to watermark models behind APIs, embedding watermark strategies directly into model weights that are later reflected in the outputs of the model is challenging. In this study we propose a strategy to finetune a pair of low-rank adapters of a model, one serving as the text-generating model, and the other as the detector, so that a subtle watermark is embedded into the text generated by the first model and simultaneously optimized for detectability by the second. In this way, the watermarking strategy is fully learned end-to-end. This process imposes an optimization challenge, as balancing watermark robustness, naturalness, and task performance requires trade-offs. We discuss strategies on how to optimize this min-max objective and present results showing the effect of this modification to instruction finetuning.
--------------------------------------------------------------------------------------------------------
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
VCR-Bench introduces a comprehensive evaluation framework for video chain-of-thought reasoning capabilities in large vision-language models. Unlike existing benchmarks, this novel framework specifically assesses the reasoning process and distinguishes between perception and reasoning failures through 859 diverse videos and 1,034 manually annotated question-answer pairs with stepwise CoT rationales. The seven task dimensions and proposed CoT score enable thorough evaluation of the entire reasoning process. Extensive experiments reveal substantial limitations in current models—even the top performer achieves only 62.8% CoT score—with most scoring below 40%. Results indicate perception of temporal-spatial information as the primary bottleneck. The strong correlation between CoT scores and accuracy validates this evaluation approach for advancing complex video reasoning capabilities.
Authors: Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao
Link: https://arxiv.org/abs/2504.07956v1
Date: 2025-04-10
Summary:
The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with the perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning tasks.
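To make the idea of step-level tagging concrete, here is a minimal sketch of how tagged rationale steps could separate perception failures from reasoning failures. The summary above does not give the benchmark's actual CoT score formula, so this assumes a plain per-tag recall over reference steps, with step matching done elsewhere (for example by a judge model). The function name, dictionary keys, and sample data are all hypothetical.

```python
from collections import defaultdict

def step_recall_by_tag(reference_steps, matched_ids):
    """Fraction of reference steps recovered by the model, split by tag.

    Assumption: each reference step carries a "perception" or "reasoning"
    tag, and matched_ids holds the ids of steps judged to be covered.
    """
    total = defaultdict(int)
    hit = defaultdict(int)
    for step in reference_steps:
        total[step["tag"]] += 1
        if step["id"] in matched_ids:
            hit[step["tag"]] += 1
    return {tag: hit[tag] / total[tag] for tag in total}

# Hypothetical annotated rationale with four tagged steps
reference = [
    {"id": 1, "tag": "perception"},
    {"id": 2, "tag": "perception"},
    {"id": 3, "tag": "reasoning"},
    {"id": 4, "tag": "reasoning"},
]
matched = {1, 3, 4}  # steps the model's answer was judged to cover
scores = step_recall_by_tag(reference, matched)
print(scores)
```

In this toy case the model misses one perception step and no reasoning steps, which is the kind of pattern the abstract reports as the common bottleneck.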
--------------------------------------------------------------------------------------------------------
This research explores the alignment between AI influencers' views and public opinions through an interactive platform collecting data from 330 participants representative of the U.S. population and 100 AI influencers identified by Time magazine. The findings reveal significant disparities: the public primarily fears AI getting out of control, while influencers emphasize regulation—potentially to deflect attention from AI monetization priorities. Notably, influencers from underrepresented groups often hold views that differ from their demographic counterparts in the general public. This work provides valuable insights into the disconnect between those shaping AI development and those who will be affected by it, highlighting the importance of inclusive governance and broader stakeholder engagement in AI policy discussions.
Authors: Gustavo Moreira, Edyta Paulina Bogucka, Marios Constantinides, Daniele Quercia
Link: https://arxiv.org/abs/2504.06016v1
Date: 2025-04-08
Summary:
AI development is shaped by academics and industry leaders - let us call them "influencers" - but it is unclear how their views align with those of the public. To address this gap, we developed an interactive platform that served as a data collection tool for exploring public views on AI, including their fears, hopes, and overall sense of hopefulness. We made the platform available to 330 participants representative of the U.S. population in terms of age, sex, ethnicity, and political leaning, and compared their views with those of 100 AI influencers identified by Time magazine. The public fears AI getting out of control, while influencers emphasize regulation, seemingly to deflect attention from their alleged focus on monetizing AI's potential. Interestingly, the views of AI influencers from underrepresented groups such as women and people of color often differ from the views of underrepresented groups in the public.
--------------------------------------------------------------------------------------------------------
Dual Engines of Thoughts: A Depth-Breadth Integration Framework for Open-Ended Analysis
The Dual Engines of Thoughts framework addresses limitations in traditional reasoning systems by focusing on comprehensive open-ended reasoning rather than single-answer problems. Its three-component architecture—Base Prompter, Solver Agent, and Dual-Engine System—balances breadth and depth of analysis through parallel exploration of diverse factors and deep investigation of specific aspects. This customizable framework allows users to adjust analytical parameters and tool configurations based on specific requirements. Experimental results demonstrate its effectiveness in complex, multi-faceted questions, achieving 77-86% win rates compared to existing reasoning models. DEoT has potential applications in strategic planning, policy analysis, scientific research, and other domains requiring nuanced, multi-dimensional analysis of problems without predefined answers.
Authors: Fei-Hsuan Yu, Yun-Cheng Chou, Teng-Ruei Chen
Link: https://arxiv.org/abs/2504.07872v1
Date: 2025-04-10
Summary:
We propose the Dual Engines of Thoughts (DEoT), an analytical framework for comprehensive open-ended reasoning. While traditional reasoning frameworks primarily focus on finding "the best answer" or "the correct answer" for single-answer problems, DEoT is specifically designed for "open-ended questions," enabling both broader and deeper analytical exploration. The framework centers on three key components: a Base Prompter for refining user queries, a Solver Agent that orchestrates task decomposition, execution, and validation, and a Dual-Engine System consisting of a Breadth Engine (to explore diverse impact factors) and a Depth Engine (to perform deep investigations). This integrated design allows DEoT to balance wide-ranging coverage with in-depth analysis, and it is highly customizable, enabling users to adjust analytical parameters and tool configurations based on specific requirements. Experimental results show that DEoT excels in addressing complex, multi-faceted questions, achieving a total win rate of 77-86% compared to existing reasoning models, thus highlighting its effectiveness in real-world applications.
--------------------------------------------------------------------------------------------------------
This research explores alternative AI interaction models for complex decision-making support, comparing a novel approach called ExtendAI, which builds upon users' own rationales, with traditional recommendation-based AI. Through a mixed-methods study involving investment decisions, the researchers found that ExtendAI integrated better with users' thinking processes and led to slightly better outcomes, while RecommendAI provided more novel insights with less cognitive effort. The study reveals three fundamental tensions in AI-assisted decision-making and suggests that different assistance approaches serve distinct purposes. These findings have implications for designing AI systems that enhance rather than replace human reasoning, particularly in domains requiring careful consideration of multiple factors and personal values.
Authors: Leon Reicherts, Zelun Tony Zhang, Elisabeth von Oswald, Yuanting Liu, Yvonne Rogers, Mariam Hassib
Link: https://arxiv.org/abs/2504.06771v1
Date: 2025-04-09
Summary:
How can we design AI tools that effectively support human decision-making by complementing and enhancing users' reasoning processes? Common recommendation-centric approaches face challenges such as inappropriate reliance or a lack of integration with users' decision-making processes. Here, we explore an alternative interaction model in which the AI outputs build upon users' own decision-making rationales. We compare this approach, which we call ExtendAI, with a recommendation-based AI. Participants in our mixed-methods user study interacted with both AIs as part of an investment decision-making task. We found that the AIs had different impacts, with ExtendAI integrating better into the decision-making process and people's own thinking and leading to slightly better outcomes. RecommendAI was able to provide more novel insights while requiring less cognitive effort. We discuss the implications of these and other findings along with three tensions of AI-assisted decision-making which our study revealed.
--------------------------------------------------------------------------------------------------------
PLM-eXplain: Divide and Conquer the Protein Embedding Space
PLM-eXplain (PLM-X) addresses the interpretability gap in protein language models by developing an explainable adapter layer that decomposes embeddings into interpretable biochemical features and a residual subspace preserving predictive power. Applied to ESM2 embeddings, this approach incorporates established properties like secondary structure and hydropathy while maintaining high performance across three protein classification tasks: extracellular vesicle association, transmembrane helix identification, and aggregation propensity prediction. By enabling biological interpretation without sacrificing accuracy, PLM-X provides a generalizable solution for enhancing protein language model interpretability. This breakthrough connects powerful deep learning models with actionable biological insights, potentially accelerating applications in drug discovery, protein engineering, and disease mechanism understanding.
Authors: Jan van Eck, Dea Gogishvili, Wilson Silva, Sanne Abeln
Link: https://arxiv.org/abs/2504.07156v1
Date: 2025-04-09
Summary:
Protein language models (PLMs) have revolutionised computational biology through their ability to generate powerful sequence representations for diverse prediction tasks. However, their black-box nature limits biological interpretation and translation to actionable insights. We present an explainable adapter layer, PLM-eXplain (PLM-X), that bridges this gap by factoring PLM embeddings into two components: an interpretable subspace based on established biochemical features, and a residual subspace that preserves the model's predictive power. Using embeddings from ESM2, our adapter incorporates well-established properties, including secondary structure and hydropathy, while maintaining high performance. We demonstrate the effectiveness of our approach across three protein-level classification tasks: prediction of extracellular vesicle association, identification of transmembrane helices, and prediction of aggregation propensity. PLM-X enables biological interpretation of model decisions without sacrificing accuracy, offering a generalisable solution for enhancing PLM interpretability across various downstream applications. This work addresses a critical need in computational biology by providing a bridge between powerful deep learning models and actionable biological insights.
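One simple way to picture factoring an embedding into an interpretable part and a residual is an orthogonal projection, sketched below. This is not the learned adapter the authors describe: it swaps in a fixed linear projection purely for illustration, and the feature directions, their labels, and the 4-dimensional toy embedding are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def decompose(embedding, directions):
    """Split an embedding into (coefficients, residual).

    Assumption: `directions` are orthonormal, so each coefficient is a
    plain inner product and the residual is orthogonal to all of them.
    """
    coeffs = [dot(embedding, d) for d in directions]
    explained = [
        sum(c * d[i] for c, d in zip(coeffs, directions))
        for i in range(len(embedding))
    ]
    residual = [e - x for e, x in zip(embedding, explained)]
    return coeffs, residual

# Hypothetical orthonormal feature directions in a toy 4-d embedding space
directions = [
    [1.0, 0.0, 0.0, 0.0],  # stand-in for a "hydropathy" axis
    [0.0, 1.0, 0.0, 0.0],  # stand-in for a "secondary structure" axis
]
embedding = [0.5, -1.0, 2.0, 0.25]
coeffs, residual = decompose(embedding, directions)
print(coeffs, residual)
```

The coefficients play the role of the readable features, while the residual keeps whatever information those features do not capture, so a downstream classifier can still use both parts.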
--------------------------------------------------------------------------------------------------------
This research evaluates artificial intelligence accuracy in diagnosing diabetic retinopathy (DR), a major complication in diabetic patients that can lead to blindness if not detected early. Using the Synthetic Minority Over-sampling Technique (SMOTE) algorithm and Convolutional Neural Networks (CNN) on the public "APTOS 2019 Blindness Detection" dataset, the study achieved remarkable results: 99.55% accuracy for binary classification (normal vs. DR) and 95.26% accuracy for multi-class classification across severity stages. The confusion matrix evaluation further confirmed high performance (99.68% for binary, 96.65% for multiclass). These findings demonstrate significant potential for enhancing DR diagnosis compared to traditional methods, potentially enabling earlier intervention and preventing vision loss in diabetic patients.
Authors: Sidhiq Mardianta, Affandy, Catur Supriyanto, Adi Wijaya
Link: https://arxiv.org/abs/2504.05696v1
Date: 2025-04-08
Summary:
Diabetic retinopathy (DR) is one of the major complications in diabetic patients' eyes, potentially leading to permanent blindness if not detected in time. This study aims to evaluate the accuracy of artificial intelligence (AI) in diagnosing DR. The method employed is the Synthetic Minority Over-sampling Technique (SMOTE) algorithm, applied to identify DR and its severity stages from fundus images using the public dataset "APTOS 2019 Blindness Detection." Literature was reviewed via ScienceDirect, ResearchGate, Google Scholar, and IEEE Xplore. Classification results using Convolutional Neural Network (CNN) showed the best performance for the binary classes normal (0) and DR (1) with an accuracy of 99.55%, precision of 99.54%, recall of 99.54%, and F1-score of 99.54%. For the multiclass classification No_DR (0), Mild (1), Moderate (2), Severe (3), Proliferate_DR (4), the accuracy was 95.26%, precision 95.26%, recall 95.17%, and F1-score 95.23%. Evaluation using the confusion matrix yielded results of 99.68% for binary classification and 96.65% for multiclass. This study highlights the significant potential in enhancing the accuracy of DR diagnosis compared to traditional human analysis.
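For readers unfamiliar with SMOTE, the core idea is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea on toy 2-d points using only the Python standard library. How the study applies SMOTE to fundus images (for example on pixels or on extracted features) is not stated in the summary, so the function, its parameters, and the sample points are assumptions for illustration only.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy minority class: four 2-d feature vectors (hypothetical values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies, which is what lets SMOTE rebalance classes without simply duplicating rows.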
--------------------------------------------------------------------------------------------------------
TISER introduces a novel framework enhancing temporal reasoning in large language models through timeline construction and iterative self-reflection. By extending reasoning traces through test-time scaling, the approach effectively captures complex temporal dependencies while improving traceability of the inference process. This addresses critical limitations in LLMs' ability to process time-related information such as event sequencing, durations, and inter-temporal relationships—capabilities essential for applications including question answering, scheduling, and historical analysis. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including out-of-distribution test sets, while enabling smaller open-source models to surpass larger closed-weight models on challenging temporal reasoning tasks. This advancement could significantly improve AI systems' reasoning about time-dependent information.
Authors: Adrián Bazaga, Rexhina Blloshmi, Bill Byrne, Adrià de Gispert
Link: https://arxiv.org/abs/2504.05258v1
Date: 2025-04-07
Summary:
Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. These capabilities are critical for applications including question answering, scheduling, and historical analysis. In this paper, we introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection. Our approach leverages test-time scaling to extend the length of reasoning traces, enabling models to capture complex temporal dependencies more effectively. This strategy not only boosts reasoning accuracy but also improves the traceability of the inference process. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including out-of-distribution test sets, and reveal that TISER enables smaller open-source models to surpass larger closed-weight models on challenging temporal reasoning tasks.
--------------------------------------------------------------------------------------------------------