Week Ending 3.2.2025

 

RESEARCH WATCH: 3.2.2025

 

Enabling AutoML for Zero-Touch Network Security: Use-Case Driven Analysis

Zero-Touch Networks represent the future of network management, promising fully automated systems capable of handling the complexity of 6G networks. This paper explores how Automated Machine Learning (AutoML) can address two critical security challenges: the need for human expertise in AI/ML security development and the vulnerability to adversarial attacks. By examining case studies of autonomous intrusion detection systems and defenses against Adversarial ML attacks, the authors demonstrate practical approaches to securing these networks. This research has significant implications for telecommunications providers and cybersecurity professionals seeking to develop robust, self-healing network infrastructure with minimal human intervention.

Authors:  Li Yang, Mirna El Rajab, Abdallah Shami, Sami Muhaidat

Link:  https://arxiv.org/abs/2502.21286v1

Date: 2025-02-28

Summary:

Zero-Touch Networks (ZTNs) represent a state-of-the-art paradigm shift towards fully automated and intelligent network management, enabling the automation and intelligence required to manage the complexity, scale, and dynamic nature of next-generation (6G) networks. ZTNs leverage Artificial Intelligence (AI) and Machine Learning (ML) to enhance operational efficiency, support intelligent decision-making, and ensure effective resource allocation. However, the implementation of ZTNs is subject to security challenges that need to be resolved to achieve their full potential. In particular, two critical challenges arise: the need for human expertise in developing AI/ML-based security mechanisms, and the threat of adversarial attacks targeting AI/ML models. In this survey paper, we provide a comprehensive review of current security issues in ZTNs, emphasizing the need for advanced AI/ML-based security mechanisms that require minimal human intervention and protect AI/ML models themselves. Furthermore, we explore the potential of Automated ML (AutoML) technologies in developing robust security solutions for ZTNs. Through case studies, we illustrate practical approaches to securing ZTNs against both conventional and AI/ML-specific threats, including the development of autonomous intrusion detection systems and strategies to combat Adversarial ML (AML) attacks. The paper concludes with a discussion of the future research directions for the development of ZTN security approaches.

--------------------------------------------------------------------------------------------------------

The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition

Wildlife conservation efforts increasingly rely on camera trap footage to monitor animal behavior and population health. The PanAf-FGBG dataset introduces a groundbreaking resource featuring 20 hours of wild chimpanzee behaviors paired with corresponding background videos from the same camera locations. This unique pairing enables researchers to evaluate how background information affects behavior recognition models and quantify performance in both in-distribution and out-of-distribution conditions. By establishing baselines and introducing effective latent-space normalization techniques, this research advances computer vision applications for wildlife monitoring, offering conservation scientists more reliable tools to detect early indicators of ecosystem changes.

Authors:  Otto Brookes, Maksim Kukushkin, Majid Mirmehdi, Colleen Stephens, Paula Dieguez, Thurston C. Hicks, Sorrel Jones, Kevin Lee, Maureen S. McCarthy, Amelia Meier, Emmanuelle Normand, Erin G. Wessling, Roman M. Wittig, Kevin Langergraber, Klaus Zuberbühler, Lukas Boesch, Thomas Schmid, Mimi Arandjelovic, Hjalmar Kühl, Tilo Burghardt

Link:  https://arxiv.org/abs/2502.21201v1

Date: 2025-02-28

Summary:

Computer vision analysis of camera trap video footage is essential for wildlife conservation, as captured behaviours offer some of the earliest indicators of changes in population health. Recently, several high-impact animal behaviour datasets and methods have been introduced to encourage their use; however, the role of behaviour-correlated background information and its significant effect on out-of-distribution generalisation remain unexplored. In response, we present the PanAf-FGBG dataset, featuring 20 hours of wild chimpanzee behaviours, recorded at over 350 individual camera locations. Uniquely, it pairs every video with a chimpanzee (referred to as a foreground video) with a corresponding background video (with no chimpanzee) from the same camera location. We present two views of the dataset: one with overlapping camera locations and one with disjoint locations. This setup enables, for the first time, direct evaluation of in-distribution and out-of-distribution conditions, and for the impact of backgrounds on behaviour recognition models to be quantified. All clips come with rich behavioural annotations and metadata including unique camera IDs and detailed textual scene descriptions. Additionally, we establish several baselines and present a highly effective latent-space normalisation technique that boosts out-of-distribution performance by +5.42% mAP for convolutional and +3.75% mAP for transformer-based models. Finally, we provide an in-depth analysis on the role of backgrounds in out-of-distribution behaviour recognition, including the so far unexplored impact of background durations (i.e., the count of background frames within foreground videos).
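
The latent-space normalisation technique itself is not detailed in this summary; one common way to realise the idea is to subtract the paired background embedding from the foreground embedding before classification. A minimal sketch of that pattern, with random vectors standing in for a real video encoder's outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for video-encoder embeddings; the encoder and the paper's exact
# normalisation procedure are assumptions not reproduced here.
fg = rng.normal(size=(8, 512))   # foreground clips (chimpanzee present)
bg = rng.normal(size=(8, 512))   # paired background-only clips

# Remove the background component from each foreground embedding, then
# re-normalise so a downstream classifier sees unit-length vectors.
normalized = fg - bg
normalized /= np.linalg.norm(normalized, axis=1, keepdims=True)

print(normalized.shape)  # (8, 512)
```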

--------------------------------------------------------------------------------------------------------

Multimodal Learning for Just-In-Time Software Defect Prediction in Autonomous Driving Systems

As autonomous driving technologies become more prevalent, ensuring software reliability is paramount for public safety. This paper introduces an innovative multimodal learning approach for predicting software defects in autonomous driving systems before they manifest. By leveraging pre-trained transformers to process multiple data types—code features, change metrics, and contextual information—the model can identify potential defects with greater accuracy than existing methods. Experiments on three open-source projects (Apollo, Carla, and Donkeycar) demonstrate significant performance improvements over current techniques. This research could dramatically enhance autonomous vehicle safety through proactive software quality assurance.

Authors:  Faisal Mohammad, Duksan Ryu

Link:  https://arxiv.org/abs/2502.20806v1

Date: 2025-02-28

Summary:

In recent years, the rise of autonomous driving technologies has highlighted the critical importance of reliable software for ensuring safety and performance. This paper proposes a novel approach for just-in-time software defect prediction (JIT-SDP) in autonomous driving software systems using multimodal learning. The proposed model leverages multimodal transformers, in which pre-trained transformers and a combining module handle the multiple data modalities of the software system datasets, such as code features, change metrics, and contextual information. The key to adapting multimodal learning is to utilize the attention mechanism between the different data modalities: text, numerical, and categorical. In the combining module, the output of a transformer model on text data and the tabular features containing categorical and numerical data are combined to produce predictions using fully connected layers. Experiments conducted on three open-source autonomous driving system software projects collected from GitHub repositories (Apollo, Carla, and Donkeycar) demonstrate that the proposed approach significantly outperforms state-of-the-art deep learning and machine learning models on the evaluation metrics. Our findings highlight the potential of multimodal learning to enhance the reliability and safety of autonomous driving software through improved defect prediction.
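
The combining module described above, where a text transformer's output is fused with categorical and numerical tabular features through fully connected layers, can be sketched as follows. Dimensions, depth, and layer sizes are illustrative assumptions, not the paper's reported configuration:

```python
import torch
import torch.nn as nn

class CombiningModule(nn.Module):
    """Fuses a transformer text embedding with tabular change metrics.

    Illustrative sketch only: hidden sizes and depth are assumptions.
    """
    def __init__(self, text_dim=768, tab_dim=14, hidden=256):
        super().__init__()
        self.tab_proj = nn.Sequential(nn.Linear(tab_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(text_dim + hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # defect / no-defect logit
        )

    def forward(self, text_emb, tabular):
        fused = torch.cat([text_emb, self.tab_proj(tabular)], dim=-1)
        return self.head(fused)

# Random stand-ins for a commit's code-diff embedding (e.g. a pre-trained
# transformer's pooled output) and its numerical/categorical change metrics.
model = CombiningModule()
probs = torch.sigmoid(model(torch.randn(4, 768), torch.randn(4, 14)))
print(probs.shape)  # (4, 1) defect probabilities
```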

--------------------------------------------------------------------------------------------------------

JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation

While large language models can generate impressive text, they typically function as black boxes with limited interpretability. JAM (Just A Move) introduces a novel framework that combines cause-effect analysis with latent vector manipulation to create more controllable and responsible text generation. By uncovering the inherent causality in LLM generation and manipulating fundamental components in their architecture, JAM achieves up to 22% improvement over previous Controllable Text Generation methods while demonstrating greater computational efficiency. This approach offers significant potential for developing more interpretable AI systems with reduced toxicity and improved alignment with human values.

Authors:  Yingbing Huang, Deming Chen, Abhishek K. Umrawal

Link:  https://arxiv.org/abs/2502.20684v1

Date: 2025-02-28

Summary:

While large language models (LLMs) have made significant strides in generating coherent and contextually relevant text, they often function as opaque black boxes, trained on vast unlabeled datasets with statistical objectives, lacking an interpretable framework for responsible control. In this paper, we introduce JAM (Just A Move), a novel framework that interprets and controls text generation by integrating cause-effect analysis within the latent space of LLMs. Based on our observations, we uncover the inherent causality in LLM generation, which is critical for producing responsible and realistic outputs. Moreover, we explore latent vectors as fundamental components in LLM architectures, aiming to understand and manipulate them for more effective and efficient controllable text generation. We evaluate our framework using a range of tools, including the HHH criteria, toxicity reduction benchmarks, and GPT-4 alignment measures. Our results show that JAM achieves up to a 22% improvement over previous Controllable Text Generation (CTG) methods across multiple quantitative metrics and human-centric evaluations. Furthermore, JAM demonstrates greater computational efficiency compared to other CTG methods. These results highlight the effectiveness and efficiency of JAM for responsible and realistic text generation, paving the way for more interpretable and controllable models.
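
JAM's precise manipulation is not spelled out in this summary, but the general activation-steering pattern it builds on, nudging a hidden state along a chosen direction during generation, looks roughly like this. The layer, vector, and strength below are placeholders, not the authors' method:

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer sub-layer; in practice the hook would be
# registered on a block of a real LLM (the layer choice is an assumption).
block = nn.Linear(64, 64)
steering_vector = torch.randn(64)  # e.g. a direction away from toxic outputs
alpha = 0.5                        # intervention strength

def steer(module, inputs, output):
    # Shift the hidden state along the chosen direction ("just a move").
    return output + alpha * steering_vector

handle = block.register_forward_hook(steer)
steered = block(torch.randn(1, 64))  # output now includes the latent shift
handle.remove()
print(steered.shape)
```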

--------------------------------------------------------------------------------------------------------

Neuromorphic Circuits with Spiking Astrocytes for Increased Energy Efficiency, Fault Tolerance, and Memory Capacitance

This paper advances neuromorphic computing by incorporating biologically-inspired Leaky Integrate-and-Fire Astrocyte (LIFA) models into spiking neural networks. The researchers designed a core architecture where astrocytes support multiple neurons, creating a clustered model that significantly improves fault tolerance and operational efficiency. Their implementation achieves an impressive 81.10% fault tolerance rate, substantially outperforming other approaches. This breakthrough has profound implications for developing more resilient and adaptable neuromorphic systems that could power next-generation AI hardware with greater energy efficiency and robustness, particularly in adverse operating conditions.

Authors:  Aybars Yunusoglu, Dexter Le, Murat Isik, I. Can Dikmen, Teoman Karadag

Link:  https://arxiv.org/abs/2502.20492v1

Date: 2025-02-27

Summary:

In the rapidly advancing field of neuromorphic computing, integrating biologically-inspired models like the Leaky Integrate-and-Fire Astrocyte (LIFA) into spiking neural networks (SNNs) enhances system robustness and performance. This paper introduces the LIFA model in SNNs, addressing energy efficiency, memory management, routing mechanisms, and fault tolerance. Our core architecture consists of neurons, synapses, and astrocyte circuits, with each astrocyte supporting multiple neurons for self-repair. This clustered model improves fault tolerance and operational efficiency, especially under adverse conditions. We developed a routing methodology to map the LIFA model onto a fault-tolerant, many-core design, optimizing network functionality and efficiency. Our model features a fault tolerance rate of 81.10% and a resilience improvement rate of 18.90%, significantly surpassing other implementations. The results validate our approach in memory management, highlighting its potential as a robust solution for advanced neuromorphic computing applications. The integration of astrocytes represents a significant advancement, setting the stage for more resilient and adaptable neuromorphic systems.
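
For readers unfamiliar with the underlying neuron model, a leaky integrate-and-fire unit can be simulated in a few lines; the astrocyte circuits, self-repair clustering, and parameter values from the paper are not reproduced in this sketch:

```python
import numpy as np

# Minimal leaky integrate-and-fire (LIF) neuron; all constants illustrative.
dt, tau, v_th, v_reset = 1.0, 20.0, 1.0, 0.0
v, spikes = 0.0, []
rng = np.random.default_rng(1)

for t in range(200):
    current = rng.uniform(0.0, 0.12)    # random input current
    v += dt / tau * (-v) + current      # leak toward rest, integrate input
    if v >= v_th:                       # threshold crossing produces a spike
        spikes.append(t)
        v = v_reset                     # reset membrane potential

print(f"{len(spikes)} spikes, first few at steps {spikes[:5]}")
```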

--------------------------------------------------------------------------------------------------------

EgoNormia: Benchmarking Physical Social Norm Understanding

Human interactions are governed by social norms, yet vision-language models (VLMs) often lack explicit training on norm understanding in physical contexts. EgoNormia introduces a benchmark of 1,853 ego-centric videos of human interactions with questions evaluating prediction and justification of normative actions across seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. The research reveals that state-of-the-art VLMs score only 45% on this benchmark (compared to humans' 92%), highlighting significant risks when deploying AI agents in real-world settings. This benchmark offers a valuable tool for enhancing norm understanding in embodied AI systems.

Authors:  MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, Diyi Yang

Link:  https://arxiv.org/abs/2502.20490v1

Date: 2025-02-27

Summary:

Human activity is moderated by norms. When performing actions in the real world, humans not only follow norms, but also consider the trade-off between different norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia, consisting of 1,853 ego-centric videos of human interactions, each of which has two related questions evaluating both the prediction and justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human benchmark of 92%). Our analysis of performance in each dimension highlights the significant risks to safety and privacy, and the lack of collaboration and communication capability, when these models are applied to real-world agents. We additionally show that through a retrieval-based generation method, it is possible to use EgoNormia to enhance normative reasoning in VLMs.

--------------------------------------------------------------------------------------------------------

Adaptive H&E-IHC information fusion staining framework based on feature extractors

Immunohistochemistry (IHC) staining is crucial for evaluating diseases like breast cancer, but traditional methods can be costly and time-consuming. This paper presents an adaptive framework for transforming standard H&E stains into IHC images using advanced generative models. By employing a VMFE module for multi-scale feature extraction and wavelet transform convolution, the system effectively captures and preserves color information that might otherwise be lost. The approach overcomes previous limitations in digital staining, offering pathologists a cost-effective alternative that could accelerate diagnosis while maintaining accuracy across various datasets.

Authors:  Yifan Jia, Xingda Yu, Zhengyang Ji, Songning Lai, Yutao Yue

Link:  https://arxiv.org/abs/2502.20156v1

Date: 2025-02-27

Summary:

Immunohistochemistry (IHC) staining plays a significant role in the evaluation of diseases such as breast cancer. The H&E-to-IHC transformation based on generative models provides a simple and cost-effective method for obtaining IHC images. Although previous models can perform digital coloring well, they still suffer from (i) relying only on pixel features that are not prominent in H&E, which easily causes information loss during the coloring process; and (ii) the lack of pixel-perfect H&E-IHC ground-truth pairs, which poses a challenge to the classical L1 loss. To address these challenges, we propose an adaptive information-enhanced coloring framework based on feature extractors. We first propose the VMFE module, which effectively extracts color-information features using multi-scale feature extraction and wavelet transform convolution, combined with a shared decoder for feature fusion. The high-performance dual feature extractor for H&E-IHC is trained by contrastive learning, which can effectively align H&E-IHC features in a high-dimensional latent space. At the same time, the trained feature encoder is used to enhance the features and adaptively adjust the loss during the H&E-section staining process, addressing problems related to unclear and asymmetric information. We have tested on different datasets and achieved excellent performance. Our code is available at https://github.com/babyinsunshine/CEFF

--------------------------------------------------------------------------------------------------------

Connecting the Persian-speaking World through Transliteration

Despite speaking mutually intelligible varieties of Persian, Tajik speakers (using Cyrillic script) cannot read Iranian and Afghan texts (in Perso-Arabic script), cutting them off from most Persian-language internet content. This paper presents a transformer-based approach to transliteration between these scripts, achieving chrF++ scores of 58.70 (Farsi to Tajik) and 74.20 (Tajik to Farsi) on novel digraphic datasets. Rather than pursuing full machine translation—which would be impractical given the scarcity of parallel data—this transliteration approach offers a more feasible solution. By bridging this script divide, the research could connect millions of Persian speakers, enhancing cultural exchange and information access across Central Asia and the Middle East.

Authors:  Rayyan Merchant, Akhilesh Kakolu Ramarao, Kevin Tang

Link:  https://arxiv.org/abs/2502.20047v1

Date: 2025-02-27

Summary:

Despite speaking mutually intelligible varieties of the same language, speakers of Tajik Persian, written in a modified Cyrillic alphabet, cannot read Iranian and Afghan texts written in the Perso-Arabic script. As the vast majority of Persian text on the Internet is written in Perso-Arabic, monolingual Tajik speakers are unable to interface with the Internet in any meaningful way. Due to the overwhelming similarity between the formal registers of these dialects and the scarcity of Tajik-Farsi parallel data, machine transliteration has been proposed as a more practical and appropriate solution than machine translation. This paper presents a transformer-based G2P approach to Tajik-Farsi transliteration, achieving chrF++ scores of 58.70 (Farsi to Tajik) and 74.20 (Tajik to Farsi) on novel digraphic datasets, setting a comparable baseline metric for future work. Our results also demonstrate the non-trivial difficulty of this task in both directions. We also provide an overview of the differences between the two scripts and the challenges they present, so as to aid future efforts in Tajik-Farsi transliteration.
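
chrF++, the metric reported above, is implemented in the sacrebleu package; scoring a transliteration system's output against references takes a few lines. The strings below are toy stand-ins, not the paper's Tajik-Farsi test data:

```python
from sacrebleu.metrics import CHRF

# chrF++ is chrF extended with word n-grams (word_order=2).
chrfpp = CHRF(word_order=2)

hypotheses = ["salom dunjo", "kitobi nav"]    # system output (toy)
references = [["salom dunyo", "kitobi nav"]]  # one reference stream

score = chrfpp.corpus_score(hypotheses, references)
print(score)  # prints the corpus-level chrF2++ score
```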

--------------------------------------------------------------------------------------------------------

Multibeam SETI Observations toward Nearby M dwarfs with FAST

This paper reports on targeted Search for Extraterrestrial Intelligence (SETI) observations using China's Five-hundred-meter Aperture Spherical radio Telescope (FAST) to examine three nearby M dwarf stars with exoplanet candidates. Employing multibeam coincidence matching to search for narrowband drifting signals across 1.05-1.45 GHz frequencies, the researchers could detect signals with minimum equivalent isotropic radiated power as low as 6.19×10^8 W—well within human technological capabilities. While an unusual signal at 1312.50 MHz initially appeared promising, careful analysis ruled out extraterrestrial origins. This work demonstrates the continuing advancement of radio astronomy in humanity's search for cosmic neighbors.

Authors:  Xiao-Hang Luan, Bo-Lun Huang, Zhen-Zhao Tao, Yan Cui, Tong-Jie Zhang, Pei Wang

Link:  https://arxiv.org/abs/2502.20419v1

Date: 2025-02-27

Summary:

Targeted searches for extraterrestrial intelligence (SETI) aim to observe specific areas and objects to find possible technosignatures. Much SETI research has focused on nearby stars and their planets in recent years. In this paper, we report targeted SETI observations using the most sensitive L-band radio telescope, the Five-hundred-meter Aperture Spherical radio Telescope (FAST), toward three nearby M dwarfs, all of which host exoplanet candidates. The minimum equivalent isotropic radiated power we can detect from the three sources is 6.19 × 10^8 W, which is well within the reach of current human technology. Applying the multibeam coincidence matching (MBCM) blind search mode, we search for narrowband drifting signals across 1.05-1.45 GHz in each of the two orthogonal linear polarization directions. An unusual signal at 1312.50 MHz detected in the observation toward AD Leo originally piqued our interest. However, we ultimately eliminated the possibility of an extraterrestrial origin based on multiple lines of evidence, such as the signal's polarization, frequency, and beam coverage characteristics.
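
The quoted EIRP limit follows from the standard radiometer-equation treatment: a narrowband signal is detectable above a flux of roughly SNR x SEFD x sqrt(B / (n_pol x t)), and EIRP_min = 4 pi d^2 times that flux. A back-of-the-envelope version with assumed FAST-like numbers, not the paper's exact parameters:

```python
import math

# All values below are assumptions for illustration only.
snr_min = 10.0      # detection threshold
sefd_jy = 2.0       # FAST-like system equivalent flux density [Jy]
chan_bw = 7.5       # channel bandwidth [Hz]
n_pol = 2           # summed polarizations
t_obs = 600.0       # integration time [s]
dist_pc = 4.9       # roughly AD Leo's distance [pc]

jy = 1e-26          # W m^-2 Hz^-1
pc = 3.0857e16      # m

# Minimum detectable flux of a narrowband signal [W m^-2], then EIRP.
f_min = snr_min * sefd_jy * jy * math.sqrt(chan_bw / (n_pol * t_obs))
eirp_min = 4 * math.pi * (dist_pc * pc) ** 2 * f_min

print(f"EIRP_min ~ {eirp_min:.2e} W")  # a few x 10^9 W with these inputs
```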

--------------------------------------------------------------------------------------------------------

NeoBERT: A Next-Generation BERT

While autoregressive language models like LLaMA have seen remarkable advances, bidirectional encoders like BERT have lagged behind despite their importance in many NLP applications. NeoBERT bridges this gap by integrating state-of-the-art architectural improvements, modern data, and optimized pre-training methodologies into a next-generation encoder. Despite its compact 250M parameter size, it outperforms much larger models on the MTEB benchmark. With an extended context length of 4,096 tokens and design optimizations, NeoBERT serves as a drop-in replacement for existing base models, promising significant performance improvements for a wide range of natural language processing tasks.

Authors:  Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar

Link:  https://arxiv.org/abs/2502.19587v1

Date: 2025-02-26

Summary:

Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.
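
Because NeoBERT is positioned as a plug-and-play replacement, loading it should look like any Hugging Face encoder. The checkpoint id below is an assumption based on the authors' release; a custom architecture typically needs trust_remote_code, and output attribute names may differ:

```python
from transformers import AutoModel, AutoTokenizer

name = "chandar-lab/NeoBERT"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

inputs = tokenizer("NeoBERT is a drop-in BERT replacement.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_dim)
```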

--------------------------------------------------------------------------------------------------------

TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding

Understanding complex theorems often requires visual explanations alongside text-based reasoning. TheoremExplainAgent introduces an innovative approach for generating long-form theorem explanation videos using Manim animations. The researchers also created TheoremExplainBench, a comprehensive benchmark covering 240 theorems across multiple STEM disciplines with five automated evaluation metrics. While their system achieves a 93.8% success rate in generating detailed videos, the results highlight how multimodal explanations can reveal reasoning flaws that remain hidden in text-only formats. This work opens new pathways for enhancing STEM education through AI-generated visual explanations of complex mathematical and scientific concepts.

Authors:  Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen

Link:  https://arxiv.org/abs/2502.19400v1

Date: 2025-02-26

Summary:

Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.

--------------------------------------------------------------------------------------------------------

FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users

Personalizing large language models (LLMs) is crucial for applications like virtual assistants and content curation. This paper introduces Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem where an LLM learns to quickly adapt to individual users through just a few labeled preferences. To overcome the scarcity of real-world preference data, the researchers generate over 1M synthetic personalized preferences, finding that both diversity and coherent structure are essential for transferring to real users. The approach achieves impressive results across three domains—movie reviews, educational content, and question answering—demonstrating effective personalization for both synthetic and human users.

Authors:  Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn

Link:  https://arxiv.org/abs/2502.19312v1

Date: 2025-02-26

Summary:

Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.

--------------------------------------------------------------------------------------------------------

Cross-Modality Investigation on WESAD Stress Classification

With the growing importance of mental health monitoring, this research explores transformer models for stress detection using physiological signals from the WESAD dataset. By training on six different modalities—electrocardiograms, electrodermal activity, electromyography, respiration rate, temperature, and accelerometer signals—the researchers achieve state-of-the-art performance with accuracy, precision, and recall values between 99.73% and 99.95%. Their novel exploration of cross-modal performance, supported by 2D visualization of embedding spaces and quantitative analysis of data variance, provides valuable insights for developing robust, generalizable stress monitoring systems that could be integrated into wearable healthcare technologies.

Authors:  Eric Oliver, Sagnik Dakshit

Link:  https://arxiv.org/abs/2502.18733v1

Date: 2025-02-26

Summary:

Deep learning's growing prevalence has driven its widespread use in healthcare, where AI and sensor advancements enhance diagnosis, treatment, and monitoring. In mobile health, AI-powered tools enable early diagnosis and continuous monitoring of conditions like stress. Wearable technologies and multimodal physiological data have made stress detection increasingly viable, but model efficacy depends on data quality, quantity, and modality. This study develops transformer models for stress detection using the WESAD dataset, training on electrocardiogram (ECG), electrodermal activity (EDA), electromyography (EMG), respiration rate (RESP), temperature (TEMP), and 3-axis accelerometer (ACC) signals. The results demonstrate the effectiveness of single-modality transformers in analyzing physiological signals, achieving state-of-the-art performance with accuracy, precision, and recall values in the range of 99.73% to 99.95% for stress detection. Furthermore, this study explores cross-modal performance and explains it using 2D visualization of the learned embedding space and quantitative analysis based on data variance. Despite the large body of work on stress detection and monitoring, the robustness and generalization of these models across different modalities has not been explored. This research represents one of the initial efforts to interpret embedding spaces for stress detection, providing valuable information on cross-modal performance.
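
A single-modality transformer over windows of one physiological signal, as described above, can be sketched with PyTorch built-ins. Window length, model width, depth, and head count here are illustrative assumptions, not the study's settings:

```python
import torch
import torch.nn as nn

class SignalTransformer(nn.Module):
    """Transformer encoder over windows of a 1-D physiological signal.

    Sketch only: window size, width, depth, and heads are assumptions.
    """
    def __init__(self, window=128, d_model=64, n_classes=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)            # per-sample embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)     # stress vs. baseline

    def forward(self, x):                 # x: (batch, window)
        z = self.embed(x.unsqueeze(-1))   # (batch, window, d_model)
        z = self.encoder(z).mean(dim=1)   # average-pool over time
        return self.head(z)

model = SignalTransformer()
logits = model(torch.randn(8, 128))       # 8 windows of e.g. EDA samples
print(logits.shape)                       # (8, 2)
```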

--------------------------------------------------------------------------------------------------------

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

This groundbreaking paper introduces SWE-RL, the first approach to scale reinforcement learning for enhancing LLM reasoning in real-world software engineering. By learning from extensive open-source software evolution data—including code snapshots, changes, and events—and using a lightweight rule-based reward system, the resulting Llama3-SWE-RL-70B model achieves an impressive 41.0% solve rate on verified GitHub issues. Remarkably, despite training solely on software evolution data, the model demonstrates improved performance on five out-of-domain tasks, including mathematics and general language understanding. This research opens new possibilities for enhancing LLM reasoning capabilities through reinforcement learning on domain-specific data.

Authors:  Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang

Link:  https://arxiv.org/abs/2502.18449v1

Date: 2025-02-25

Summary:

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even developed generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
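
The lightweight rule-based reward, a similarity score between the model's patch and the ground-truth patch, can be approximated with Python's standard difflib. This is a sketch of the idea only; SWE-RL's exact reward rules may differ:

```python
import difflib

def patch_reward(predicted: str, ground_truth: str) -> float:
    """Similarity-based reward in [0, 1]; sketch of the idea only."""
    return difflib.SequenceMatcher(None, predicted, ground_truth).ratio()

gt = "def add(a, b):\n    return a + b\n"
pred = "def add(a, b):\n    return b + a\n"
print(f"reward = {patch_reward(pred, gt):.3f}")  # high, but below 1.0
```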

--------------------------------------------------------------------------------------------------------

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

Attention mechanisms are essential but computationally expensive components in large models due to their quadratic time complexity. SpargeAttn introduces a universal sparse and quantized attention method applicable to any model type. Using a two-stage online filter—first predicting the attention map to skip unnecessary matrix multiplications, then applying a softmax-aware filter without additional overhead—this approach significantly accelerates diverse language, image, and video generation models without sacrificing quality. By exploiting the natural sparsity in attention maps (where many values are near zero), SpargeAttn offers a practical solution for making large model inference more efficient across various applications.

Authors:  Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen

Link:  https://arxiv.org/abs/2502.18137v1

Date: 2025-02-25

Summary:

An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling some matrix multiplications in attention to be skipped. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The code is available at https://github.com/thu-ml/SpargeAttn.
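
The core idea, predicting which attention blocks matter and skipping the matrix multiplications for the rest, can be illustrated with a naive block-sparse attention in NumPy. SpargeAttn's actual two-stage online filter and quantization are considerably more refined:

```python
import numpy as np

def block_sparse_attention(q, k, v, block=16, keep=0.5):
    """Naive block-skipping attention (illustration of the idea only).

    A cheap proxy (mean-pooled q/k per block) predicts block importance;
    only the top `keep` fraction of key blocks is computed per query block.
    """
    n, d = q.shape
    nb = n // block
    qb = q.reshape(nb, block, d).mean(1)          # pooled query blocks
    kb = k.reshape(nb, block, d).mean(1)          # pooled key blocks
    importance = qb @ kb.T                        # (nb, nb) proxy scores
    out = np.zeros_like(q)
    k_keep = max(1, int(keep * nb))
    for i in range(nb):
        sel = np.argsort(importance[i])[-k_keep:]           # kept key blocks
        idx = np.concatenate([np.arange(j*block, (j+1)*block) for j in sel])
        s = q[i*block:(i+1)*block] @ k[idx].T / np.sqrt(d)  # partial scores
        w = np.exp(s - s.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)                   # local softmax
        out[i*block:(i+1)*block] = w @ v[idx]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 32)) for _ in range(3))
print(block_sparse_attention(q, k, v).shape)  # (64, 32)
```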

--------------------------------------------------------------------------------------------------------

A First Look at the Performance Enhancement Potential of Fluid Reconfigurable Intelligent Surface

This paper introduces an innovative concept in wireless communications: the Fluid Reconfigurable Intelligent Surface (FRIS), which combines shape flexibility with position reconfigurability for each reflecting element. By dividing a meta-surface into non-overlapping subareas that can dynamically adjust both position and phase shift, FRIS creates programmable wireless channels with enhanced performance. The researchers optimize these fluid elements using particle swarm optimization and extend their analysis to multi-user scenarios. Numerical results demonstrate that FRIS significantly outperforms conventional RIS configurations in achievable rate performance, suggesting promising applications for next-generation wireless networks requiring high data rates and adaptability.

Authors:  Abdelhamid Salem, Kai-Kit Wong, George Alexandropoulos, Chan-Byoung Chae, Ross Murch

Link:  https://arxiv.org/abs/2502.17116v1

Date: 2025-02-24

Summary:

The fluid antenna concept represents shape-flexible and position-flexible antenna technologies designed to enhance wireless communication applications. In this paper, we apply this concept to reconfigurable intelligent surfaces (RISs), introducing the fluid RIS (FRIS), in which each tunably reflecting element becomes a fluid element with additional position reconfigurability. We investigate an FRIS-programmable wireless channel, where the fluid meta-surface is divided into non-overlapping subareas, each acting as a fluid element that can dynamically adjust both its position and the phase shift of the reflected signal. We first analyze the single-user, single-input single-output (SU-SISO) channel, in which a single-antenna transmitter communicates with a single-antenna receiver via an FRIS. The achievable rate is maximized by optimizing the fluid elements using a particle swarm optimization (PSO)-based approach. Next, we extend our analysis to the multi-user, multiple-input single-output (MU-MISO) case, where a multi-antenna base station (BS) transmits individual data streams to multiple single-antenna users via an FRIS. In this case, we study the joint optimization of the positions and phase shifts of the FRIS elements, as well as the BS precoding, to maximize the sum-rate. To solve the problem, a combination of techniques including PSO, semi-definite relaxation (SDR), and minimum mean square error (MMSE) is proposed. Numerical results demonstrate that the proposed FRIS approach significantly outperforms conventional RIS configurations in terms of achievable rate performance.
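
For the SU-SISO case, the objective is essentially the rate log2(1 + SNR * |h_eff|^2), where h_eff sums the cascaded paths weighted by each element's phase shift. A toy PSO over phase shifts alone, with element positions held fixed and random channels, conveys the flavor of the optimization:

```python
import numpy as np

rng = np.random.default_rng(0)
N, snr = 32, 10.0                                  # elements, linear SNR
g = rng.normal(size=N) + 1j * rng.normal(size=N)   # Tx -> element channels
h = rng.normal(size=N) + 1j * rng.normal(size=N)   # element -> Rx channels

def rate(phases):
    h_eff = np.sum(g * np.exp(1j * phases) * h)    # cascaded channel
    return np.log2(1 + snr * abs(h_eff) ** 2)

# Minimal particle swarm over the N phase shifts (positions fixed here;
# the paper jointly optimizes element positions as well).
P, iters, w, c1, c2 = 30, 200, 0.7, 1.5, 1.5
x = rng.uniform(0, 2*np.pi, (P, N))
vel = np.zeros((P, N))
pbest, pbest_f = x.copy(), np.array([rate(p) for p in x])
gbest = pbest[pbest_f.argmax()].copy()

for _ in range(iters):
    r1, r2 = rng.uniform(size=(2, P, N))
    vel = w*vel + c1*r1*(pbest - x) + c2*r2*(gbest - x)
    x = (x + vel) % (2*np.pi)
    f = np.array([rate(p) for p in x])
    improved = f > pbest_f
    pbest[improved], pbest_f[improved] = x[improved], f[improved]
    gbest = pbest[pbest_f.argmax()].copy()

print(f"optimized rate: {rate(gbest):.2f} bits/s/Hz")
```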

--------------------------------------------------------------------------------------------------------

WildFrame: Comparing Framing in Humans and LLMs on Naturally Occurring Texts

How information is presented—its framing—significantly influences human perception and decision-making. This research introduces WildFrame, a dataset evaluating how large language models respond to positive and negative framing in naturally-occurring sentences, with direct comparison to human responses. By analyzing eight state-of-the-art LLMs on 1,000 reframed real-world statements, the researchers find that all models exhibit framing effects similar to humans, with both groups being more influenced by positive than negative reframing. These findings provide valuable insights for model developers seeking either to harness framing effects or mitigate them, depending on the specific application requirements.

Authors:  Gili Lior, Liron Nacchace, Gabriel Stanovsky

Link:  https://arxiv.org/abs/2502.17091v1

Date: 2025-02-24

Summary:

Humans are influenced by how information is presented, a phenomenon known as the framing effect. Previous work has shown that LLMs may also be susceptible to framing, but it did so on synthetic data and did not compare to human behavior. We introduce WildFrame, a dataset for evaluating LLM responses to positive and negative framing in naturally-occurring sentences, and compare humans and models on the same data. WildFrame consists of 1,000 texts, built by first selecting real-world statements with clear sentiment, then reframing them in either a positive or negative light, and lastly collecting human sentiment annotations. By evaluating eight state-of-the-art LLMs on WildFrame, we find that all models exhibit framing effects similar to humans (r ≥ 0.57), with both humans and models being more influenced by positive rather than negative reframing. Our findings benefit model developers, who can either harness framing or mitigate its effects, depending on the downstream application.
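
The agreement statistic above (r ≥ 0.57) is a Pearson correlation between per-item sentiment shifts under reframing; once the shifts are computed, the statistic is a single call. The numbers below are toy values, not the dataset's:

```python
from scipy.stats import pearsonr

# Per-item sentiment shift (reframed minus original); toy values only.
human_shift = [0.8, 0.5, -0.3, 0.6, -0.1, 0.4]
model_shift = [0.7, 0.4, -0.2, 0.5, 0.1, 0.3]

r, p = pearsonr(human_shift, model_shift)
print(f"r = {r:.2f} (p = {p:.3f})")
```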

--------------------------------------------------------------------------------------------------------

Muon is Scalable for LLM Training

The Muon optimizer, based on matrix orthogonalization, has shown promise for small-scale language models, but its scalability remained unproven until now. This research identifies two crucial techniques for scaling Muon to larger models: adding weight decay and carefully adjusting per-parameter update scales. These modifications enable Muon to work out-of-the-box on large-scale training without extensive hyperparameter tuning, achieving approximately twice the computational efficiency of AdamW. The researchers demonstrate this by introducing Moonlight, a 3B/16B-parameter Mixture-of-Expert model trained with 5.7T tokens using Muon, which improves the current Pareto frontier with better performance using significantly fewer training FLOPs.

Authors:  Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, Zhilin Yang

Link:  https://arxiv.org/abs/2502.16982v1

Date: 2025-02-24

Summary:

Recently, the Muon optimizer, based on matrix orthogonalization, has demonstrated strong results in training small-scale language models, but its scalability to larger models had not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need for hyper-parameter tuning. Scaling law experiments indicate that Muon achieves ~2x computational efficiency compared to AdamW with compute-optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with much fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation that is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
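
Muon's core step replaces the raw momentum matrix with an approximately orthogonalized version, typically via a few Newton-Schulz iterations. The quintic coefficients below follow the public Muon implementation; the weight decay and per-parameter scaling shown are simplified assumptions rather than Moonlight's exact recipe:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5):
    """Approximate nearest orthogonal matrix via quintic Newton-Schulz
    iterations (coefficients from the public Muon implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# One Muon-style update for a 2-D weight; lr, wd, and the update scale
# are illustrative simplifications.
W = torch.randn(128, 64)
momentum = torch.randn(128, 64)        # stand-in momentum buffer
lr, wd = 0.02, 0.01
update = newton_schulz_orthogonalize(momentum)
W = W * (1 - lr * wd) - lr * update * max(1.0, W.shape[0] / W.shape[1]) ** 0.5
```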

--------------------------------------------------------------------------------------------------------

HybridLinker: Topology-Guided Posterior Sampling for Enhanced Diversity and Validity in 3D Molecular Linker Generation

Linker generation plays a crucial role in drug discovery, particularly in lead optimization and PROTAC design. This paper introduces HybridLinker, a framework that enhances PC-Aware (point cloud aware) inference by incorporating diverse bonding topologies from pretrained PC-Free models. At its core, LinkerDPS represents the first diffusion posterior sampling method operating across both PC-Free and PC-Aware spaces, connecting molecular topology with 3D point clouds through an energy-inspired function. By transferring diverse sampling distributions between these spaces, HybridLinker significantly improves both validity and diversity in molecular design tasks, establishing a new framework for more effective drug candidate generation.

Authors:  Minyeong Hwang, Ziseok Lee, Gwangsoo Kim, Kyungsu Kim, Eunho Yang

Link:  https://arxiv.org/abs/2502.17349v1

Date: 2025-02-24

Summary:

Linker generation is critical in drug discovery applications such as lead optimization and PROTAC design, where molecular fragments are assembled into diverse drug candidates. Existing methods fall into PC-Free and PC-Aware categories based on their use of 3D point clouds (PC). PC-Free models prioritize diversity but suffer from lower validity due to overlooking PC constraints, while PC-Aware models ensure higher validity but restrict diversity by enforcing strict PC constraints. To overcome these trade-offs without additional training, we propose HybridLinker, a framework that enhances PC-Aware inference by providing diverse bonding topologies from a pretrained PC-Free model as guidance. At its core, we propose LinkerDPS, the first diffusion posterior sampling (DPS) method operating across PC-Free and PC-Aware spaces, bridging molecular topology with 3D point clouds via an energy-inspired function. By transferring the diverse sampling distribution of PC-Free models into the PC-Aware distribution, HybridLinker significantly and consistently surpasses baselines, improving both validity and diversity in foundational molecular design and applied property optimization tasks, establishing a new DPS framework in the molecular and graph domains beyond imaging.

--------------------------------------------------------------------------------------------------------

Contextualizing biological perturbation experiments through language

High-content perturbation experiments provide unprecedented insights into biomolecular systems, but their widespread adoption is hindered by experimental and analysis costs. This paper introduces PerturbQA, a benchmark for structured reasoning over perturbation experiments that addresses open problems in modeling: predicting differential expression and change of direction for unseen perturbations, and gene set enrichment. After evaluating various machine learning approaches and finding current methods perform poorly, the researchers introduce Summer (SUMMarize, retrievE, and answeR), a domain-informed LLM framework that matches or exceeds the state-of-the-art. This work demonstrates how large language models can represent complex biological relationships and rationalize experimental outcomes in ways that align with downstream biological analyses.

Authors:  Menghua Wu, Russell Littman, Jacob Levine, Lin Qiu, Tommaso Biancalani, David Richmond, Jan-Christian Huetter

Link:  https://arxiv.org/abs/2502.21290v1

Date: 2025-02-28

Summary:

High-content perturbation experiments allow scientists to probe biomolecular systems at unprecedented resolution, but experimental and analysis costs pose significant barriers to widespread adoption. Machine learning has the potential to guide efficient exploration of the perturbation space and extract novel insights from these data. However, current approaches neglect the semantic richness of the relevant biology, and their objectives are misaligned with downstream biological analyses. In this paper, we hypothesize that large language models (LLMs) present a natural medium for representing complex biological relationships and rationalizing experimental outcomes. We propose PerturbQA, a benchmark for structured reasoning over perturbation experiments. Unlike current benchmarks that primarily interrogate existing knowledge, PerturbQA is inspired by open problems in perturbation modeling: prediction of differential expression and change of direction for unseen perturbations, and gene set enrichment. We evaluate state-of-the-art machine learning and statistical approaches for modeling perturbations, as well as standard LLM reasoning strategies, and we find that current methods perform poorly on PerturbQA. As a proof of feasibility, we introduce Summer (SUMMarize, retrievE, and answeR), a simple, domain-informed LLM framework that matches or exceeds the current state-of-the-art. Our code and data are publicly available at https://github.com/genentech/PerturbQA.

--------------------------------------------------------------------------------------------------------

 


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.