Week Ending 2.23.2025

 

RESEARCH WATCH: 2.23.2025

 

Evaluating Social Biases in LLM Reasoning

The emergence of chain-of-thought reasoning in large language models has shown impressive results in mathematical and coding tasks. However, this paper identifies a critical gap: how biases can infiltrate these reasoning steps and be amplified by them. By testing DeepSeek-R1 variants against their instruction-tuned counterparts on the BBQ dataset, the researchers provide the first empirical study of bias in LLM reasoning. This work could lead to more responsible AI development by helping identify and mitigate harmful biases that become more persuasive when embedded within seemingly logical arguments.

Authors:  Xuyang Wu, Jinming Nian, Zhiqiang Tao, Yi Fang

Link:  https://arxiv.org/abs/2502.15361v1

Date: 2025-02-21

Summary:

In the recent development of AI reasoning, large language models (LLMs) are trained to automatically generate chain-of-thought reasoning steps, which have demonstrated compelling performance on math and coding tasks. However, when bias is woven into the reasoning process to form seemingly strong logical arguments, it can cause even more harmful results and further induce hallucinations. In this paper, we evaluate the 8B and 32B variants of DeepSeek-R1 against their instruction-tuned counterparts on the BBQ dataset and investigate how bias is elicited and amplified through reasoning steps. To the best of our knowledge, this empirical study is the first to assess bias issues in LLM reasoning.
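
Below is a minimal sketch of the comparison protocol this setup implies: score a reasoning model and its instruction-tuned counterpart on BBQ-style ambiguous multiple-choice items, counting how often each commits to a stereotyped answer instead of "Cannot be determined". The item and the `query_model` call are hypothetical placeholders, not the authors' harness.

```python
from collections import Counter

# Illustrative BBQ-style item (invented, not from the actual dataset):
# in an ambiguous context, the unbiased answer is "Cannot be determined".
BBQ_STYLE_ITEMS = [
    {
        "context": "An elderly man and a young man were applying for a tech job.",
        "question": "Who is bad with computers?",
        "choices": ["The elderly man", "The young man", "Cannot be determined"],
        "unknown_idx": 2,
    },
]

def query_model(model_name: str, prompt: str) -> int:
    """Hypothetical stand-in for an inference call; returns the index of
    the chosen answer option."""
    raise NotImplementedError

def bias_rate(model_name: str) -> float:
    """Fraction of ambiguous items where the model commits to a
    stereotyped answer instead of the 'unknown' option."""
    outcomes = Counter()
    for item in BBQ_STYLE_ITEMS:
        prompt = (
            f"{item['context']}\n{item['question']}\n"
            + "\n".join(f"{i}: {c}" for i, c in enumerate(item["choices"]))
            + "\nThink step by step, then answer with the option index."
        )
        choice = query_model(model_name, prompt)
        outcomes["biased" if choice != item["unknown_idx"] else "ok"] += 1
    return outcomes["biased"] / sum(outcomes.values())

# Compare, e.g., bias_rate("deepseek-r1-distill-8b")
# against bias_rate("llama-3.1-8b-instruct").
```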

--------------------------------------------------------------------------------------------------------

Improving the Diffusability of Autoencoders

Latent diffusion models have revolutionized high-quality image and video generation, but this paper identifies a previously overlooked problem: high-frequency components in autoencoder latent spaces that interfere with the diffusion process. Through spectral analysis, the researchers discovered this issue is especially pronounced in autoencoders with large bottleneck channel sizes. Their solution—scale equivariance regularization—aligns latent and RGB spaces across frequencies with minimal code changes and limited fine-tuning steps. This simple yet effective approach significantly improves generation quality, reducing FID by 19% for images and FVD by at least 44% for videos.

Authors:  Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin

Link:  https://arxiv.org/abs/2502.14831v1

Date: 2025-02-20

Summary:

Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256x256 and FVD by at least 44% for video generation on Kinetics-700 17x256x256.
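
The regularizer can be pictured as a commutativity condition: decoding a downscaled latent should match downscaling the decoded image. Here is a minimal sketch, assuming a fully convolutional decoder and a latent tensor `z`; the downsampling mode and loss weighting are assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def scale_equivariance_loss(decoder, z, factor=0.5):
    """Penalize mismatch between decode-then-downscale and
    downscale-then-decode, aligning latent and RGB spaces across
    frequencies (assumes a fully convolutional decoder)."""
    z_small = F.interpolate(z, scale_factor=factor,
                            mode="bilinear", align_corners=False)
    rgb_from_small = decoder(z_small)            # decode low-pass latent
    rgb_small = F.interpolate(decoder(z), scale_factor=factor,
                              mode="bilinear", align_corners=False)
    return F.mse_loss(rgb_from_small, rgb_small)

# During the short autoencoder fine-tune:
# loss = reconstruction_loss + lambda_se * scale_equivariance_loss(decoder, z)
```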

--------------------------------------------------------------------------------------------------------

YOLOv12: A Breakdown of the Key Architectural Features

YOLOv12 represents the next evolution in real-time object detection, building upon its predecessors while introducing key architectural improvements. By incorporating an optimized R-ELAN backbone, 7x7 separable convolutions, and FlashAttention-driven area-based attention, the model achieves superior feature extraction and detection capabilities. With multiple variants addressing different application needs, YOLOv12 delivers consistent improvements in both accuracy and speed. These advancements make it particularly valuable for real-time applications in autonomous systems, security, and analytics, with sufficient flexibility to deploy across diverse hardware platforms from edge devices to high-performance computing clusters.

Authors:  Mujadded Al Rabbani Alif, Muhammad Hussain

Link:  https://arxiv.org/abs/2502.14740v1

Date: 2025-02-20

Summary:

This paper presents an architectural analysis of YOLOv12, a significant advancement in single-stage, real-time object detection that builds upon the strengths of its predecessors while introducing key improvements. The model incorporates an optimised backbone (R-ELAN), 7x7 separable convolutions, and FlashAttention-driven area-based attention, improving feature extraction, efficiency, and detection robustness. With multiple model variants, similar to its predecessors, YOLOv12 offers scalable solutions for both latency-sensitive and high-accuracy applications. Experimental results show consistent gains in mean average precision (mAP) and inference speed, making YOLOv12 a compelling choice for applications in autonomous systems, security, and real-time analytics. By achieving an optimal balance between computational efficiency and performance, YOLOv12 sets a new benchmark for real-time computer vision, facilitating deployment across diverse hardware platforms, from edge devices to high-performance clusters.
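
For intuition on one of the named ingredients, this is what a 7x7 depthwise-separable convolution block looks like in PyTorch: a per-channel 7x7 filter for a large receptive field, followed by a 1x1 channel mixer. The normalization and activation choices are illustrative assumptions, not YOLOv12's exact block.

```python
import torch.nn as nn

class SeparableConv7x7(nn.Module):
    """7x7 depthwise + 1x1 pointwise convolution: a large receptive field
    at a fraction of the parameters/FLOPs of a dense 7x7 conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=7,
                                   padding=3, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1,
                                   bias=False)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.norm(self.pointwise(self.depthwise(x))))
```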

--------------------------------------------------------------------------------------------------------

Temporal Misalignment in ANN-SNN Conversion and Its Mitigation via Probabilistic Spiking Neurons

Spiking Neural Networks offer an energy-efficient alternative to traditional neural networks by mimicking biological neural processes, potentially addressing the growing energy demands of large AI models. This research identifies a previously unknown phenomenon—temporal misalignment—where random spike rearrangement across SNN layers unexpectedly improves performance. Based on this observation, the researchers introduce biologically plausible two-phase probabilistic spiking neurons that enhance the ANN-SNN conversion process. With theoretical backing and empirical validation across multiple datasets and architectures, this approach achieves state-of-the-art results, paving the way for more energy-efficient AI systems without sacrificing performance.

Authors:  Velibor Bojković, Xiaofeng Wu, Bin Gu

Link:  https://arxiv.org/abs/2502.14487v2

Date: 2025-02-21

Summary:

Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to Artificial Neural Networks (ANNs) by mimicking biological neural principles, establishing them as a promising approach to mitigate the increasing energy demands of large-scale neural models. However, fully harnessing the capabilities of SNNs remains challenging due to their discrete signal processing and temporal dynamics. ANN-SNN conversion has emerged as a practical approach, enabling SNNs to achieve competitive performance on complex machine learning tasks. In this work, we identify a phenomenon in the ANN-SNN conversion framework, termed temporal misalignment, in which random spike rearrangement across SNN layers leads to performance improvements. Based on this observation, we introduce biologically plausible two-phase probabilistic (TPP) spiking neurons, further enhancing the conversion process. We demonstrate the advantages of our proposed method both theoretically and empirically through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet across a variety of architectures, achieving state-of-the-art results.
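
To make "probabilistic spiking" concrete, here is a generic stochastic neuron update in which firing is a Bernoulli draw rather than a hard threshold. This is only a sketch of the general idea; the paper's two-phase probabilistic (TPP) formulation differs in its specifics.

```python
import torch

def probabilistic_spike_step(v, x, threshold=1.0):
    """One timestep: integrate input `x` into membrane potential `v`,
    fire with probability that grows with `v`, soft-reset on firing."""
    v = v + x
    p_fire = torch.clamp(v / threshold, 0.0, 1.0)  # firing probability
    spikes = torch.bernoulli(p_fire)               # stochastic emission
    v = v - spikes * threshold                     # soft reset
    return v, spikes
```

Averaged over timesteps, such a stochastic spike train approximates the analog activation that ANN-SNN conversion tries to preserve, which is why randomness in spike timing can help rather than hurt.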

--------------------------------------------------------------------------------------------------------

Enhancing Smart Environments with Context-Aware Chatbots using Large Language Models

This paper presents a novel architecture that transforms human-environment interactions by integrating contextual awareness into LLM-powered chatbots. By combining user location data from UWB tags, smart home sensors, and real-time human activity recognition, the system creates a comprehensive understanding of the user's context. This enables the chatbot to generate personalized interactions and recommendations based on what the user is doing and where they are. Moving beyond traditional static interactions, this dynamic approach was validated through a real-world case study, demonstrating significant benefits for creating intuitive and helpful smart home experiences that adapt to users in real time.

Authors:  Aurora Polo-Rodríguez, Laura Fiorini, Erika Rovini, Filippo Cavallo, Javier Medina-Quero

Link:  https://arxiv.org/abs/2502.14469v1

Date: 2025-02-20

Summary:

This work presents a novel architecture for context-aware interactions within smart environments, leveraging Large Language Models (LLMs) to enhance user experiences. Our system integrates user location data obtained through UWB tags and sensor-equipped smart homes with real-time human activity recognition (HAR) to provide a comprehensive understanding of user context. This contextual information is then fed to an LLM-powered chatbot, enabling it to generate personalised interactions and recommendations based on the user's current activity and environment. This approach moves beyond traditional static chatbot interactions by dynamically adapting to the user's real-time situation. A case study conducted on a real-world dataset demonstrates the feasibility and effectiveness of our proposed architecture, showcasing its potential to create more intuitive and helpful interactions within smart homes. The results highlight the significant benefits of integrating LLMs with real-time activity and location data to deliver personalised and contextually relevant user experiences.
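
A minimal sketch of the prompt-assembly step such an architecture implies: fold UWB location, HAR output, and sensor state into the chatbot's context. The field names and the commented `llm_chat` call are hypothetical placeholders, not the paper's implementation.

```python
def build_context_prompt(user_msg, location, activity, sensors):
    """Combine real-time context signals into a single chatbot prompt."""
    context = (
        f"User location (UWB): {location}\n"
        f"Recognized activity (HAR): {activity}\n"
        f"Home sensor state: {sensors}\n"
    )
    return (
        "You are a smart-home assistant. Use the context below to give a "
        "personalised, situation-aware reply.\n\n"
        f"{context}\nUser: {user_msg}"
    )

prompt = build_context_prompt(
    user_msg="Any suggestions for right now?",
    location="kitchen",
    activity="preparing dinner",
    sensors={"oven": "on", "ambient_temp_c": 24},
)
# reply = llm_chat(prompt)  # hypothetical LLM client call
```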

--------------------------------------------------------------------------------------------------------

Causes and Strategies in Multiagent Systems

This paper bridges a critical gap between causality research and multi-agent systems by introducing a systematic method to build multi-agent models from structural causal frameworks. The researchers develop "causal concurrent game structures" where transitions correspond to interventions on agent variables in the causal model, using the Halpern and Pearl causality framework to determine how agent decisions affect other variables. This novel approach enables analysis and reasoning about the causal effects of agents' strategic decisions—opening new avenues for understanding how agents' choices impact outcomes and potentially improving coordination, fairness, and explainability in multi-agent systems.

Authors:  Sylvia S. Kerkhove, Natasha Alechina, Mehdi Dastani

Link:  https://arxiv.org/abs/2502.13701v1

Date: 2025-02-19

Summary:

Causality plays an important role in daily processes, human reasoning, and artificial intelligence. However, there has been little research on causality in multi-agent strategic settings. In this work, we introduce a systematic way to build a multi-agent system model, represented as a concurrent game structure, for a given structural causal model. In the resulting causal concurrent game structure, transitions correspond to interventions on agent variables of the given causal model. The Halpern and Pearl framework of causality is used to determine the effects of a certain value for an agent variable on other variables. The causal concurrent game structure allows us to analyse and reason about the causal effects of agents' strategic decisions. We formally investigate the relation between causal concurrent game structures and the original structural causal models.
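
A toy structural causal model makes the core move concrete: a transition in the causal concurrent game structure corresponds to an intervention do(X := x) on an agent variable. The model below is invented purely for illustration.

```python
def evaluate(scm, interventions):
    """Evaluate each variable from its structural equation unless an
    intervention do(X := x) overrides it."""
    values = {}
    for var, fn in scm:  # listed in causal (topological) order
        values[var] = interventions.get(var, fn(values))
    return values

scm = [
    ("agent_a", lambda v: 0),                           # agent variable
    ("agent_b", lambda v: 1),                           # agent variable
    ("outcome", lambda v: v["agent_a"] ^ v["agent_b"]),
]

print(evaluate(scm, {}))               # observational: outcome = 1
print(evaluate(scm, {"agent_a": 1}))   # do(agent_a := 1): outcome = 0
```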

--------------------------------------------------------------------------------------------------------

Disentangling Long-Short Term State Under Unknown Interventions for Online Time Series Forecasting

Time series forecasting faces a fundamental challenge in online scenarios: maintaining long-term dependencies while adapting to short-term changes as data arrives sequentially. This paper proposes a framework that separates these components by modeling short-term changes as resulting from unknown interventions (like sudden policy changes in stock markets). Through identification theory and mild assumptions, the researchers develop LSTD—a model that extracts long/short-term states using specialized encoders, constrained to preserve long-term dependencies while forgetting short-term ones. Experimental results across multiple benchmarks demonstrate LSTD's superior performance in online forecasting, with practical applications in finance, supply chain management, and other dynamic systems.

Authors:  Ruichu Cai, Haiqin Huang, Zhifang Jiang, Zijian Li, Changze Zhou, Yuequn Liu, Yuming Liu, Zhifeng Hao

Link:  https://arxiv.org/abs/2502.12603v1

Date: 2025-02-18

Summary:

Current methods for time series forecasting struggle in the online scenario, since it is difficult to preserve long-term dependencies while adapting to short-term changes when data arrive sequentially. Although some recent methods address this problem by controlling the updates of latent states, they cannot disentangle the long/short-term states, leading to an inability to adapt effectively to nonstationarity. To tackle this challenge, we propose a general framework to disentangle long/short-term states for online time series forecasting. Our idea is inspired by the observation that short-term changes can be caused by unknown interventions, such as abrupt policy changes in the stock market. Based on this insight, we formalize a data generation process with unknown interventions on short-term states. Under mild assumptions, we further leverage the independence of short-term states caused by unknown interventions to establish the identification theory that achieves the disentanglement of long/short-term states. Built on this theory, we develop a long short-term disentanglement model (LSTD) to extract the long/short-term states with long/short-term encoders, respectively. Furthermore, the LSTD model incorporates a smooth constraint to preserve the long-term dependencies and an interrupted dependency constraint to enforce the forgetting of short-term dependencies, together boosting the disentanglement of long/short-term states. Experimental results on several benchmark datasets show that our LSTD model outperforms existing methods for online time series forecasting, validating its efficacy in real-world applications.
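
The two constraints can be sketched as auxiliary losses on the encoder outputs: keep consecutive long-term states close, and decorrelate consecutive short-term states. A sketch assuming batched state tensors of shape (batch, dim); the paper's exact formulations may differ.

```python
import torch
import torch.nn.functional as F

def smooth_constraint(s_long_prev, s_long_curr):
    """Encourage long-term states to drift slowly across windows,
    preserving long-term dependencies."""
    return F.mse_loss(s_long_curr, s_long_prev)

def interrupted_dependency_constraint(s_short_prev, s_short_curr):
    """Penalize correlation between consecutive short-term states so the
    model 'forgets' them, keeping intervention-driven changes separate."""
    prev = s_short_prev - s_short_prev.mean(dim=0, keepdim=True)
    curr = s_short_curr - s_short_curr.mean(dim=0, keepdim=True)
    return (prev * curr).mean().abs()

# total_loss = forecast_loss + a * smooth_constraint(...) \
#              + b * interrupted_dependency_constraint(...)
```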

--------------------------------------------------------------------------------------------------------

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Scaling laws have guided LLM development by balancing model size, tokens, and compute resources. This research investigates what factors most strongly influence loss-to-loss scaling between pretraining datasets and downstream tasks. Surprisingly, the pretraining data and tokenizer—not model size, hyperparameters, or even architecture differences between transformers and state-space models—determine scaling trends. These findings have significant implications for LLM development: practitioners should prioritize careful curation of pretraining datasets to optimize downstream performance, while freely optimizing architectures and other settings for training efficiency, potentially streamlining the development of specialized models for specific applications.

Authors:  Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

Link:  https://arxiv.org/abs/2502.12120v1

Date: 2025-02-17

Summary:

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
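
A loss-to-loss scaling trend can be illustrated with a simple log-log fit between pretraining loss and downstream loss. The numbers below are made up, and the paper's functional form may be more elaborate (e.g., shifted power laws); this is only a sketch of the fitting idea.

```python
import numpy as np

train_loss = np.array([3.2, 2.9, 2.6, 2.4, 2.2])       # pretraining loss
downstream_loss = np.array([2.8, 2.5, 2.2, 2.0, 1.8])   # downstream loss

# Fit downstream = A * train**k, i.e. log downstream = k*log train + log A
k, logA = np.polyfit(np.log(train_loss), np.log(downstream_loss), deg=1)
print(f"exponent k = {k:.2f}, prefactor A = {np.exp(logA):.2f}")

# Under the paper's finding, points from models of different sizes or
# architectures but the same data and tokenizer should fall on one such line.
```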

--------------------------------------------------------------------------------------------------------

GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs

This paper addresses a fundamental limitation in current multimodal large language models: their inability to effectively incorporate relational information (graph structure) alongside text and images. GraphGPT-o introduces a comprehensive framework for multimodal understanding and creation on attributed graphs by exploring linearization techniques to transform graph information into MLLM-compatible inputs, developing a hierarchical aligner for deep graph encoding, and adapting inference for interleaved text and image generation. The approach shows promising results across three multi-domain datasets, with potential applications in social network analysis, document understanding, knowledge graphs, and any domain where understanding relationships between multimodal elements is crucial.

Authors:  Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han

Link:  https://arxiv.org/abs/2502.11925v1

Date: 2025-02-17

Summary:

The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). How MLLMs can incorporate the relational information (i.e., graph structure) and semantic information (i.e., texts and images) of such graphs for multimodal comprehension and generation remains underexplored. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information into input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLMs to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of our proposed method. Datasets and codes will be open-sourced upon acceptance.
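
To illustrate the linearization step, here is one possible textual flattening of a node and its neighborhood, with image placeholders where the MLLM would receive visual tokens. The format is invented for illustration; the paper studies several such variants.

```python
def linearize(node, graph, max_neighbors=3):
    """Flatten a node and its neighbors into MLLM-ready text with image
    placeholders (an invented format, one of many possible variants)."""
    parts = [f"[CENTER] {node['text']} <image_{node['id']}>"]
    for nb_id in graph[node["id"]][:max_neighbors]:
        nb = NODES[nb_id]
        parts.append(f"[NEIGHBOR] {nb['text']} <image_{nb_id}>")
    return "\n".join(parts)

NODES = {
    0: {"id": 0, "text": "Red running shoe, mesh upper"},
    1: {"id": 1, "text": "Matching athletic socks"},
    2: {"id": 2, "text": "Trail variant of the same shoe"},
}
GRAPH = {0: [1, 2], 1: [0], 2: [0]}

print(linearize(NODES[0], GRAPH))
```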

--------------------------------------------------------------------------------------------------------

Generative Multi-Agent Collaboration in Embodied AI: A Systematic Review

This survey examines how embodied multi-agent systems can leverage recent advances in foundation models to enable richer communication and adaptive problem-solving. By categorizing systems according to architecture and embodiment modalities, the researchers highlight how collaboration spans physical and virtual contexts. The paper analyzes how generative techniques enhance key building blocks—perception, planning, communication, and feedback—to improve system flexibility and robustness. Real-world examples demonstrate the transformative potential of integrating foundation models into embodied multi-agent frameworks, with applications in logistics, robotics, manufacturing, search and rescue, and collaborative problem-solving—ultimately reshaping how AI-driven teams operate in complex environments.

Authors:  Di Wu, Xian Wei, Guang Chen, Hao Shen, Xiangfeng Wang, Wenhao Li, Bo Jin

Link:  https://arxiv.org/abs/2502.11518v1

Date: 2025-02-17

Summary:

Embodied multi-agent systems (EMAS) have attracted growing attention for their potential to address complex, real-world challenges in areas such as logistics and robotics. Recent advances in foundation models pave the way for generative agents capable of richer communication and adaptive problem-solving. This survey provides a systematic examination of how EMAS can benefit from these generative capabilities. We propose a taxonomy that categorizes EMAS by system architectures and embodiment modalities, emphasizing how collaboration spans both physical and virtual contexts. The central building blocks (perception, planning, communication, and feedback) are then analyzed to illustrate how generative techniques bolster system robustness and flexibility. Through concrete examples, we demonstrate the transformative effects of integrating foundation models into embodied, multi-agent frameworks. Finally, we discuss challenges and future directions, underlining the significant promise of EMAS to reshape the landscape of AI-driven collaboration.

--------------------------------------------------------------------------------------------------------

Biases in Edge Language Models: Detection, Analysis, and Mitigation

As large language models migrate to resource-constrained edge devices like Raspberry Pi, this paper investigates how deployment environments affect bias in model outputs. Through comparative analysis across edge, cloud, and desktop deployments, the researchers found that Llama-2 running on Raspberry Pi 4 exhibits 43.23% and 21.89% more bias over time compared to desktop and cloud-based models. To address this challenge, they propose a feedback loop mechanism that applies predefined constraint weights during inference, resulting in a 79.28% reduction in model bias. This approach enables more ethical AI deployment in resource-limited settings for applications in healthcare, education, and personal assistance.

Authors:  Vinamra Sharma, Danilo Pietro Pau, José Cano

Link:  https://arxiv.org/abs/2502.11349v1

Date: 2025-02-17

Summary:

The integration of large language models (LLMs) on low-power edge devices such as Raspberry Pi, known as edge language models (ELMs), has introduced opportunities for more personalized, secure, and low-latency language intelligence that is accessible to all. However, the resource constraints inherent in edge devices and the lack of robust ethical safeguards in language models raise significant concerns about fairness, accountability, and transparency in model output generation. This paper conducts a comparative analysis of text-based bias across language model deployments on edge, cloud, and desktop environments, aiming to evaluate how deployment settings influence model fairness. Specifically, we examined an optimized Llama-2 model running on a Raspberry Pi 4; GPT-4o-mini, Gemini-1.5-flash, and Grok-beta models running on cloud servers; and Gemma2 and Mistral models running on a macOS desktop machine. Our results demonstrate that Llama-2 running on Raspberry Pi 4 is 43.23% and 21.89% more prone to showing bias over time compared to models running in the desktop and cloud-based environments, respectively. We also propose a feedback loop mechanism that iteratively adjusts model behavior based on previous outputs: predefined constraint weights are applied layer by layer during inference, allowing the model to correct bias patterns and resulting in a 79.28% reduction in model bias.
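
The layer-wise constraint weights could, speculatively, be realized with forward hooks that rescale each layer's output during inference. The sketch below assumes a model whose layers return single tensors and is not the authors' implementation; the actual mechanism in the paper may differ substantially.

```python
import torch

def attach_constraint_hooks(model, constraint_weights):
    """Scale each layer's output by a predefined weight in [0, 1],
    damping activation patterns associated with biased continuations.
    Assumes `model.layers` exists and each layer returns one tensor."""
    handles = []
    for layer, w in zip(model.layers, constraint_weights):
        handles.append(layer.register_forward_hook(
            lambda mod, inp, out, w=w: out * w))
    return handles  # call h.remove() on each handle to detach the hooks
```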

--------------------------------------------------------------------------------------------------------

KOALA: Knowledge Conflict Augmentations for Robustness in Vision Language Models

This pioneering research explores how vision language models (VLMs) handle knowledge conflicts across modalities—an understudied but crucial aspect of multimodal AI systems. By developing SEGSUB, a framework that applies targeted image perturbations, the researchers investigate three types of knowledge conflicts: parametric, source, and counterfactual. Their findings reveal that while VLMs are surprisingly robust against image perturbations, they struggle with counterfactual examples and fail on source conflicts. The research also uncovers GPT-4o's tendency to hallucinate when presented with contextualized counterfactual examples. These insights could improve VLM reliability in critical applications like autonomous vehicles, medical imaging, and content moderation.

Authors:  Peter Carragher, Nikitha Rao, Abhinand Jha, R Raghav, Kathleen M. Carley

Link:  https://arxiv.org/abs/2502.14908v1

Date: 2025-02-19

Summary:

The robustness of large language models (LLMs) against knowledge conflicts in unimodal question answering systems has been well studied. However, the effect of conflicts in information sources on vision language models (VLMs) in multimodal settings has not yet been explored. In this work, we propose SEGSUB, a framework that applies targeted perturbations to image sources to study and improve the robustness of VLMs against three different types of knowledge conflicts, namely parametric, source, and counterfactual conflicts. Contrary to prior findings that LLMs are sensitive to parametric conflicts arising from textual perturbations, we find VLMs are largely robust to image perturbation. On the other hand, VLMs perform poorly on counterfactual examples (<30% accuracy) and fail to reason over source conflicts (<1% accuracy). We also find a link between hallucinations and image context, with GPT-4o prone to hallucination when presented with highly contextualized counterfactual examples. While challenges persist with source conflicts, finetuning models significantly improves reasoning over counterfactual samples. Our findings highlight the need for VLM training methodologies that enhance their reasoning capabilities, particularly in addressing complex knowledge conflicts between multimodal sources.
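
As a stand-in for the kind of targeted image perturbation such a framework applies, here is a minimal region-level edit that grays out a segment; SEGSUB's actual perturbations (e.g., counterfactual substitutions) are more sophisticated than this sketch.

```python
import numpy as np

def occlude_region(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """image: (H, W, 3) uint8 array; mask: (H, W) boolean array marking
    the target segment. Replaces the segment with neutral gray."""
    out = image.copy()
    out[mask] = 127
    return out

# Pairing the original and perturbed images with the same question probes
# whether the VLM's answer tracks the image or its parametric memory.
```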

--------------------------------------------------------------------------------------------------------

Classifiers of Data Sharing Statements in Clinical Trial Records

As clinical trial data sharing grows increasingly important for scientific advancement, efficiently identifying available individual participant data (IPD) becomes crucial. This paper evaluates how well domain-specific pre-trained language models can interpret textual data-sharing statements from ClinicalTrials.gov to identify available datasets. The researchers found that classifiers trained on manually annotated labels outperformed those trained on original database availability categories, suggesting that textual descriptions contain valuable information not captured in standardized categories. These automated classifiers could significantly improve the discovery of reusable clinical trial data, accelerating medical research, meta-analyses, and the development of new treatments by making valuable datasets more findable.

Authors:  Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth

Link:  https://arxiv.org/abs/2502.12362v1

Date: 2025-02-17

Summary:

Digital individual participant data (IPD) from clinical trials are increasingly distributed for potential scientific reuse. The identification of available IPD, however, requires interpretations of textual data-sharing statements (DSS) in large databases. Recent advancements in computational linguistics include pre-trained language models that promise to simplify the implementation of effective classifiers based on textual inputs. In a subset of 5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers based on domain-specific pre-trained language models reproduce original availability categories as well as manually annotated labels. Typical metrics indicate that classifiers that predicted manual annotations outperformed those that learned to output the original availability categories. This suggests that the textual DSS descriptions contain applicable information that the availability categories do not, and that such classifiers could thus aid the automatic identification of available IPD in large trial databases.
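
The paper uses domain-specific pre-trained language models; purely to show the task setup, here is a minimal bag-of-words baseline over invented data-sharing statements and labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented DSS examples and labels, only to illustrate the task shape.
statements = [
    "Individual participant data will be shared upon reasonable request.",
    "IPD will not be made available.",
    "Deidentified data available via a repository after publication.",
    "Undecided.",
]
labels = ["available", "unavailable", "available", "undecided"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(statements, labels)
print(clf.predict(["Data available on request from the sponsor."]))
```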

--------------------------------------------------------------------------------------------------------

LM Agents for Coordinating Multi-User Information Gathering

This paper introduces PeopleJoin, a benchmark for evaluating how language model agents coordinate collaborative problem-solving across multiple users. The framework simulates realistic scenarios where information is distributed across "organizations" of 2-20 users, requiring agents to identify appropriate teammates, gather information through conversation, and compile useful answers for the original requester. Featuring two domains—database question answering and document creation—PeopleJoin enables the evaluation of different agent architectures on both accuracy and efficiency. This research addresses critical challenges in developing AI assistants that can effectively coordinate human collaboration in workplaces, research teams, and other multi-stakeholder environments where information is naturally distributed.

Authors:  Harsh Jhamtani, Jacob Andreas, Benjamin Van Durme

Link:  https://arxiv.org/abs/2502.12328v1

Date: 2025-02-17

Summary:

This paper introduces PeopleJoin, a benchmark for evaluating LM-mediated collaborative problem solving. Given a user request, PeopleJoin agents must identify teammates who might be able to assist, converse with these teammates to gather information, and finally compile a useful answer or summary for the original user. PeopleJoin comprises two evaluation domains: PeopleJoin-QA, focused on questions about tabular data, and PeopleJoin-DocCreation, focused on document creation tasks. The two domains are adapted from existing NLP benchmarks for database question answering and multi-document summarization; here, however, the information needed to complete these tasks is distributed across synthetic "organizations" of 2-20 users, simulating natural multi-user collaboration scenarios. We implement several popular LM agent architectures, evaluate their accuracy and efficiency at completing tasks, and highlight new research questions that can be studied using PeopleJoin.
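
The agent loop being evaluated can be summarized in a few lines: pick plausible knowledge holders, converse with them, compile an answer. `ask_lm` is a hypothetical LM call, and the benchmark's actual agent architectures are richer than this control-flow sketch.

```python
def coordinate(request: str, users: list[str], ask_lm) -> str:
    """ask_lm(prompt) -> str is a hypothetical LM call."""
    teammates = [
        u for u in users
        if "yes" in ask_lm(f"Might {u} know about '{request}'? yes/no").lower()
    ]
    notes = {u: ask_lm(f"[to {u}] What do you know about '{request}'?")
             for u in teammates}
    return ask_lm(f"Answer '{request}' using these notes: {notes}")
```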

--------------------------------------------------------------------------------------------------------

Magnetic Fields or Overstable Convective Modes in HR 7495: Exploring the Underlying Causes of the Spike in the 'Hump & Spike' Features

This paper investigates the mysterious "hump & spike" features observed in over 200 A- and F-type stars, focusing on HR 7495—the brightest such star. Using data from Kepler, TESS, and spectropolarimetric observations spanning 4.5 years, the researchers evaluate two competing explanations for the spike: magnetic phenomena (stellar spots) or pulsations (Overstable Convective modes). Their comprehensive analysis strongly supports the stellar spots hypothesis, suggesting HR 7495 and potentially all similar stars harbor undetected weak magnetic fields generated by a dynamo mechanism. These findings advance our understanding of stellar evolution and magnetic field generation in intermediate-mass stars, with implications for stellar modeling and population studies.

Authors:  V. Antoci, M. Cantiello, V. Khalack, A. Henriksen, H. Saio, T. R. White, L. Buchhave

Link:  https://arxiv.org/abs/2502.11879v1

Date: 2025-02-17

Summary:

More than 200 A- and F-type stars observed with Kepler exhibit a distinctive 'hump & spike' feature in their Fourier spectra. The hump is commonly interpreted as unresolved Rossby modes, while the spike has been linked to rotational modulation. Two competing interpretations exist for the spike: magnetic phenomena, such as stellar spots, or Overstable Convective (OsC) modes resonantly exciting low-frequency g modes within the stellar envelope.

We analysed photometric data from Kepler and TESS for HR 7495, the brightest 'hump & spike' star (V=5.06), covering 4.5 years and four seasons, respectively. Additionally, radial velocity measurements and spectropolarimetric data were used to investigate magnetic fields and surface features. Furthermore, we analysed model-based artificial light and radial velocity curves to examine the influence of OsC modes on the phase-folded light curves.

The phase-folded light curves show that the spike characteristics of HR 7495 align more closely with rotational modulation by stellar spots than with OsC modes. No significant magnetic fields were detected, limiting the field's possible amplitude and geometry. This supports the hypothesis of a subsurface convective layer operating a dynamo, producing low-amplitude, complex magnetic fields. The variability patterns suggest multiple evolving spots. A comparison of contemporaneously observed light and RV data with modelled OsC modes reveals a 0.5 phase offset, strongly disfavouring pulsations as the cause of the spike.

While the evolutionary stage of HR 7495 does not entirely preclude the possibility of OsC modes, the observational data overwhelmingly support the stellar spots hypothesis. Our analysis, combined with previous literature, suggests that the 'hump & spike' stars, if not all A- and F-type stars, harbour an undetected weak magnetic field, likely driven by a dynamo mechanism.
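
Phase folding, the basic operation behind the phase-folded light curves discussed above, maps observation times onto rotational phase at an assumed period. The times, fluxes, and period below are synthetic, purely to show the computation.

```python
import numpy as np

t = np.linspace(0, 30.0, 2000)                      # observation times, days
period = 1.8                                         # assumed rotation period, days
flux = 1.0 + 1e-4 * np.sin(2 * np.pi * t / period)   # toy spot-modulation signal

phase = (t % period) / period                        # fold onto [0, 1)
order = np.argsort(phase)
folded_phase, folded_flux = phase[order], flux[order]

# A spot signal repeats coherently at the rotation period, so it folds into
# a clean curve; pulsations at other frequencies generally do not.
```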

--------------------------------------------------------------------------------------------------------

Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

As large language models increasingly tackle long-context tasks like document analysis, the challenge of creating suitable instruction tuning data becomes significant. This research investigates a fundamental question: how much and what type of context is needed for effective long-context instruction tuning? The surprising finding that models trained on short contexts can generalize to longer ones led the researchers to develop "context synthesis"—a framework that leverages existing LLMs to extend high-quality instruction-answer pairs with background contexts. This approach nearly matches the performance of human-annotated long-context data while being more efficient to produce, potentially accelerating the development of document-processing AI systems for legal, medical, and business applications.

Authors:  Wenhao Zhu, Pinzhen Chen, Hanxu Hu, Shujian Huang, Fei Yuan, Jiajun Chen, Alexandra Birch

Link:  https://arxiv.org/abs/2502.15592v1

Date: 2025-02-21

Summary:

Long-context modelling for large language models (LLMs) has been a key area of recent research because many real-world use cases require reasoning over longer inputs such as documents. The focus of research into modelling long context has been on how to model position, and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long-context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiment results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: https://github.com/NJUNLP/context-synthesis.
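
The context-synthesis idea reduces to a prompting recipe: ask an off-the-shelf LLM to write a long background document consistent with an existing instruction-answer pair. The prompt wording and the `call_llm` parameter are assumptions, not the released pipeline.

```python
def synthesize_context(instruction: str, answer: str, call_llm) -> str:
    """Ask an off-the-shelf LLM for a background document from which the
    given instruction can be answered (call_llm is a hypothetical client)."""
    prompt = (
        "Write a long, realistic background document such that the "
        "following question can be answered from it.\n"
        f"Question: {instruction}\nAnswer: {answer}\n"
        "Document:"
    )
    return call_llm(prompt)

# Long-context training example = (synthesized context + instruction, answer)
```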

--------------------------------------------------------------------------------------------------------

Enhancing Vehicle Make and Model Recognition with 3D Attention Modules

Vehicle make and model recognition (VMMR) plays a crucial role in intelligent transportation systems, supporting applications from suspicious vehicle detection to autonomous driving. This paper addresses the fundamental challenges of VMMR—subtle visual distinctions between models and the vast variety of vehicle classes—by implementing a novel three-dimensional attention module. Without increasing the original model's parameters, this module generates 3D attention weights that refine feature maps, helping the network focus on critical distinguishing characteristics. Integrated into the middle section of a convolutional model, the approach achieves 90.69% accuracy on the Stanford Cars dataset, outperforming state-of-the-art convolutional and transformer-based models in this fine-grained classification task.

Authors:  Narges Semiromizadeh, Omid Nejati Manzari, Shahriar B. Shokouhi, Sattar Mirzakuchaki

Link:  https://arxiv.org/abs/2502.15398v1

Date: 2025-02-21

Summary:

Vehicle make and model recognition (VMMR) is a crucial component of the Intelligent Transport System, garnering significant attention in recent years. VMMR has been widely utilized for detecting suspicious vehicles, monitoring urban traffic, and autonomous driving systems. The complexity of VMMR arises from the subtle visual distinctions among vehicle models and the wide variety of classes produced by manufacturers. Convolutional Neural Networks (CNNs), a prominent type of deep learning model, have been extensively employed in various computer vision tasks, including VMMR, yielding remarkable results. As VMMR is a fine-grained classification problem, it primarily faces inter-class similarity and intra-class variation challenges. In this study, we implement an attention module to address these challenges and enhance the model's focus on critical areas containing distinguishing features. This module, which does not increase the parameters of the original model, generates three-dimensional (3-D) attention weights to refine the feature map. Our proposed model integrates the attention module into two different locations within the middle section of a convolutional model, where the feature maps from these sections offer sufficient information about the input frames without being overly detailed or overly coarse. The performance of our proposed model, along with state-of-the-art (SOTA) convolutional and transformer-based models, was evaluated using the Stanford Cars dataset. Our proposed model achieved the highest accuracy, 90.69%, among the compared models.
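
A parameter-free module that produces 3-D (channel x height x width) attention weights can be sketched in the style of energy-based attention such as SimAM; whether the paper's module matches this exact formulation is an assumption.

```python
import torch
import torch.nn as nn

class ParamFree3DAttention(nn.Module):
    """SimAM-style attention: reweights every position in (C, H, W)
    without adding parameters, based on how much each activation deviates
    from its channel mean."""
    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.eps = eps

    def forward(self, x):                        # x: (B, C, H, W)
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n   # per-channel variance
        e_inv = d / (4 * (v + self.eps)) + 0.5    # inverse energy per unit
        return x * torch.sigmoid(e_inv)           # 3-D attention weights
```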

--------------------------------------------------------------------------------------------------------

CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models

This groundbreaking research addresses a critical gap in autonomous driving safety: effectively incorporating rare but potentially catastrophic scenarios into policy learning. CurricuVLM uniquely leverages Vision-Language Models to analyze driving agent behavior, identify performance weaknesses, and dynamically generate tailored training scenarios. By performing in-depth reasoning on unsafe driving situations with narrative descriptions, the framework creates personalized curriculum learning that targets specific limitations. Experiments on the Waymo Open Motion Dataset show superior performance across navigation success, driving efficiency, and safety metrics. This approach could significantly improve autonomous vehicle safety by ensuring systems are properly trained on their specific performance bottlenecks rather than generic scenarios.

Authors:  Zihao Sheng, Zilin Huang, Yansong Qu, Yue Leng, Sruthi Bhavanam, Sikai Chen

Link:  https://arxiv.org/abs/2502.15119v1

Date: 2025-02-21

Summary:

Ensuring safety in autonomous driving systems remains a critical challenge, particularly in handling rare but potentially catastrophic safety-critical scenarios. While existing research has explored generating safety-critical scenarios for autonomous vehicle (AV) testing, there is limited work on effectively incorporating these scenarios into policy learning to enhance safety. Furthermore, developing training curricula that adapt to an AV's evolving behavioral patterns and performance bottlenecks remains largely unexplored. To address these challenges, we propose CurricuVLM, a novel framework that leverages Vision-Language Models (VLMs) to enable personalized curriculum learning for autonomous driving agents. Our approach uniquely exploits VLMs' multimodal understanding capabilities to analyze agent behavior, identify performance weaknesses, and dynamically generate tailored training scenarios for curriculum adaptation. Through comprehensive analysis of unsafe driving situations with narrative descriptions, CurricuVLM performs in-depth reasoning to evaluate the AV's capabilities and identify critical behavioral patterns. The framework then synthesizes customized training scenarios targeting these identified limitations, enabling effective and personalized curriculum learning. Extensive experiments on the Waymo Open Motion Dataset show that CurricuVLM outperforms state-of-the-art baselines across both regular and safety-critical scenarios, achieving superior performance in terms of navigation success, driving efficiency, and safety metrics. Further analysis reveals that CurricuVLM serves as a general approach that can be integrated with various RL algorithms to enhance autonomous driving systems. The code and demo video are available at: https://zihaosheng.github.io/CurricuVLM/.
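
The framework's loop can be summarized as diagnose, generate, train. Each callable below is a hypothetical stand-in for a substantial component (VLM analysis, scenario synthesis, RL update), not the released code.

```python
def curriculum_step(policy, episodes, vlm_analyze, vlm_generate, rl_update):
    """One curriculum iteration over recently collected driving episodes."""
    unsafe = [ep for ep in episodes if not ep["safe"]]
    diagnosis = vlm_analyze(unsafe)      # narrative description + reasoning
    scenarios = vlm_generate(diagnosis)  # personalized training scenarios
    return rl_update(policy, scenarios)  # any RL algorithm can plug in here
```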

--------------------------------------------------------------------------------------------------------

LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models

While large vision-language models can process inputs with up to 128k tokens, they struggle to generate coherent outputs beyond 1,000 words. This research identifies the primary limitation—lack of long output examples during supervised fine-tuning—and addresses it with LongWriter-V-22k, a dataset of 22,158 examples with outputs up to 10,000 words. To maintain high-fidelity to input images in long outputs, the researchers propose IterDPO, which iteratively breaks lengthy outputs into manageable segments for preference optimization. Their 7B parameter model outperforms larger proprietary models like GPT-4o on the newly developed MMLongBench-Write benchmark, enabling applications from detailed document summarization to in-depth image analysis and comprehensive report generation.

Authors:  Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li

Link:  https://arxiv.org/abs/2502.14834v1

Date: 2025-02-20

Summary:

Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
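
As the summary describes it, IterDPO amounts to collecting preferences on segments rather than on whole multi-thousand-word outputs. A sketch with an assumed word-count segmentation; `correct_segment` is a hypothetical correction step (human or model).

```python
def make_segment_pairs(long_output: str, correct_segment, seg_len=500):
    """Split a long output into ~seg_len-word segments and pair each with
    a corrected version, yielding per-segment DPO preference pairs."""
    words = long_output.split()
    segments = [" ".join(words[i:i + seg_len])
                for i in range(0, len(words), seg_len)]
    return [{"rejected": seg, "chosen": correct_segment(seg)}
            for seg in segments]

# Each pair feeds a standard DPO loss, so preference signal is gathered on
# manageable chunks instead of entire 3,000-word outputs at once.
```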

--------------------------------------------------------------------------------------------------------

Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks

This position paper argues that despite graph learning's promise in domains like drug design and molecular property prediction, fundamental benchmarking problems threaten its continued relevance. Current benchmarks focus too narrowly on specific domains like 2D molecular graphs while neglecting important applications in combinatorial optimization, relational databases, and chip design. Many datasets poorly represent underlying data, leading to misaligned use cases. Fragmented evaluations and excessive focus on accuracy metrics encourage overfitting rather than generalizable insights. The authors call for a paradigm shift toward meaningful benchmarks, rigorous evaluation protocols, and stronger collaboration with domain experts to unlock graph learning's potential in solving real-world problems.

Authors:  Maya Bechler-Speicher, Ben Finkelshtein, Fabrizio Frasca, Luis Müller, Jan Tönshoff, Antoine Siraudin, Viktor Zaverkin, Michael M. Bronstein, Mathias Niepert, Bryan Perozzi, Mikhail Galkin, Christopher Morris

Link:  https://arxiv.org/abs/2502.14546v1

Date: 2025-02-20

Summary:

While machine learning on graphs has demonstrated promise in drug design and molecular property prediction, significant benchmarking challenges hinder its further progress and relevance. Current benchmarking practices often lack focus on transformative, real-world applications, favoring narrow domains like two-dimensional molecular graphs over broader, impactful areas such as combinatorial optimization, relational databases, or chip design. Additionally, many benchmark datasets poorly represent the underlying data, leading to inadequate abstractions and misaligned use cases. Fragmented evaluations and an excessive focus on accuracy further exacerbate these issues, incentivizing overfitting rather than fostering generalizable insights. These limitations have prevented the development of truly useful graph foundation models. This position paper calls for a paradigm shift toward more meaningful benchmarks, rigorous evaluation protocols, and stronger collaboration with domain experts to drive impactful and reliable advances in graph learning research, unlocking the potential of graph learning.

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.