Week Ending 3.16.2025

 

RESEARCH WATCH: 3.16.2025

 

AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation

AIstorian addresses the unique challenges of AI-generated historical biographies, which require adherence to historical writing conventions, factual accuracy, and coherent integration of fragmented information. Developed by Huawei, this system combines knowledge graph-powered retrieval with multi-agent hallucination detection and error correction. AIstorian's two-step training approach includes data augmentation and stylistic preference optimization. When tested on historical Jinshi data, it achieved 3.8x better factual accuracy and 47.6% reduction in hallucinations compared to baselines. This technology could revolutionize historical research by automating accurate biography generation while maintaining period-appropriate writing styles.

Authors:  Fengyu Li, Yilin Li, Junhao Zhu, Lu Chen, Yanfei Zhang, Jia Zhou, Hui Zu, Jingwen Zhao, Yunjun Gao

Link:  https://arxiv.org/abs/2503.11346v1

Date: 2025-03-14

Summary:

Huawei has always been committed to exploring AI applications in historical research. Biography generation, as a specialized form of abstractive summarization, plays a crucial role in historical research but faces unique challenges that existing large language models (LLMs) struggle to address. These challenges include maintaining stylistic adherence to historical writing conventions, ensuring factual fidelity, and handling fragmented information across multiple documents. We present AIstorian, a novel end-to-end agentic system featuring knowledge graph (KG)-powered retrieval-augmented generation (RAG) and anti-hallucination multi-agents. Specifically, AIstorian introduces an in-context learning based chunking strategy and a KG-based index for accurate and efficient reference retrieval. Meanwhile, AIstorian orchestrates multi-agents to conduct on-the-fly hallucination detection and error-type-aware correction. Additionally, to teach LLMs a specific language style, we fine-tune LLMs via a two-step training approach that combines data-augmentation-enhanced supervised fine-tuning with stylistic preference optimization. Extensive experiments on a real-life historical Jinshi dataset demonstrate that AIstorian achieves a 3.8x improvement in factual accuracy and a 47.6% reduction in hallucination rate compared to existing baselines. The data and code are available at: https://github.com/ZJU-DAILY/AIstorian.
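
To make the KG-powered retrieval idea concrete, here is a minimal sketch of an entity-centric index in Python. Everything in it (the class name, the toy triple, the one-hop expansion) is illustrative rather than AIstorian's actual implementation, which adds in-context-learning-based chunking and multi-agent verification on top.

```python
from collections import defaultdict

class KGIndex:
    # Illustrative KG-backed index for biography RAG: entities extracted from
    # source chunks point back to the chunks that mention them, so retrieval
    # for a person can follow graph edges instead of similarity search alone.
    def __init__(self):
        self.edges = defaultdict(set)       # entity -> related entities
        self.mentions = defaultdict(set)    # entity -> chunk ids

    def add(self, subj, rel, obj, chunk_id):
        self.edges[subj].add(obj)
        self.edges[obj].add(subj)
        self.mentions[subj].add(chunk_id)
        self.mentions[obj].add(chunk_id)

    def retrieve(self, person, hops=1):
        frontier, seen = {person}, {person}
        chunks = set(self.mentions[person])
        for _ in range(hops):               # also pull chunks for graph neighbours
            frontier = {n for e in frontier for n in self.edges[e]} - seen
            seen |= frontier
            chunks |= {c for e in frontier for c in self.mentions[e]}
        return chunks

kg = KGIndex()
kg.add("Zhang San", "passed_exam", "Jinshi 1589", chunk_id=12)  # hypothetical triple
print(kg.retrieve("Zhang San"))
```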

--------------------------------------------------------------------------------------------------------

Fourier Neural Operator based surrogates for CO2 storage in realistic geologies

This research develops accelerated simulation models for carbon capture and storage (CCS) decision-making processes. Traditional subsurface CO2 storage site selection requires computationally expensive simulations of CO2 flow. The proposed Fourier Neural Operator (FNO) model enables real-time, high-resolution simulation of CO2 plume migration, offering 100,000x computational acceleration with minimal accuracy loss. The researchers also explore super-resolution techniques to improve training efficiency and enhance prediction reliability. Built on NVIDIA's Modulus library, this framework will enable rapid screening of potential CCS sites and could extend to other energy applications like geothermal reservoir modeling and hydrogen storage, advancing sustainable energy solutions.

Authors:  Anirban Chandra, Marius Koch, Suraj Pawar, Aniruddha Panda, Kamyar Azizzadenesheli, Jeroen Snippe, Faruk O. Alpak, Farah Hariri, Clement Etienam, Pandu Devarakota, Anima Anandkumar, Detlef Hohl

Link:  https://arxiv.org/abs/2503.11031v1

Date: 2025-03-14

Summary:

This study aims to develop surrogate models for accelerating decision-making processes associated with carbon capture and storage (CCS) technologies. Selection of sub-surface CO2 storage sites often necessitates expensive and involved simulations of CO2 flow fields. Here, we develop a Fourier Neural Operator (FNO) based model for real-time, high-resolution simulation of CO2 plume migration. The model is trained on a comprehensive dataset generated from realistic subsurface parameters and offers O(10^5) computational acceleration with minimal sacrifice in prediction accuracy. We also explore super-resolution experiments to reduce the computational cost of training the FNO-based models. Additionally, we present various strategies for improving the reliability of predictions from the model, which is crucial when assessing actual geological sites. This novel framework, based on NVIDIA's Modulus library, will allow rapid screening of sites for CCS. The discussed workflows and strategies can be applied to other energy solutions like geothermal reservoir modeling and hydrogen storage. Our work scales scientific machine learning models to realistic 3D systems that are more consistent with real-life subsurface aquifers/reservoirs, paving the way for next-generation digital twins for subsurface CCS applications.
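
The core of an FNO is a spectral convolution: transform the field to Fourier space, keep and linearly mix only the lowest-frequency modes, and transform back. The sketch below is a generic 2D PyTorch version for illustration only; the paper's surrogate is a 3D model built on NVIDIA's Modulus library, so the lifting networks, layer sizes, and training details differ.

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    # Core FNO block: learned linear mixing of the lowest Fourier modes.
    # (Real FNOs also keep the top negative-frequency modes; omitted for brevity.)
    def __init__(self, in_ch, out_ch, modes1, modes2):
        super().__init__()
        self.modes1, self.modes2 = modes1, modes2
        scale = 1.0 / (in_ch * out_ch)
        self.w = nn.Parameter(scale * torch.randn(in_ch, out_ch, modes1, modes2,
                                                  dtype=torch.cfloat))

    def forward(self, x):                          # x: (batch, in_ch, H, W)
        x_ft = torch.fft.rfft2(x)                  # to Fourier space
        out_ft = torch.zeros(x.size(0), self.w.size(1), x.size(2),
                             x.size(3) // 2 + 1, dtype=torch.cfloat,
                             device=x.device)
        out_ft[:, :, :self.modes1, :self.modes2] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :self.modes1, :self.modes2], self.w)
        return torch.fft.irfft2(out_ft, s=x.shape[-2:])  # back to physical space

x = torch.randn(2, 3, 32, 32)                      # batch of 2D input fields
layer = SpectralConv2d(3, 8, modes1=12, modes2=12)
print(layer(x).shape)                              # torch.Size([2, 8, 32, 32])
```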

--------------------------------------------------------------------------------------------------------

Observation-Graph Interaction and Key-Detail Guidance for Vision and Language Navigation

OIKG addresses limitations in Vision and Language Navigation (VLN), where agents navigate environments following natural language instructions. The framework introduces two key innovations: an observation-graph interaction module that decouples angular and visual information while enhancing navigation space representations, and a key-detail guidance module that dynamically extracts and utilizes location and object information from instructions. Through these techniques, OIKG achieves more precise cross-modal alignment and dynamic instruction interpretation. Experiments on R2R and RxR datasets demonstrate state-of-the-art performance across multiple metrics, showing significant improvements in navigation precision. This technology could enhance autonomous navigation systems in complex, instruction-guided environments.

Authors:  Yifan Xie, Binkai Ou, Fei Ma, Yaohua Liu

Link:  https://arxiv.org/abs/2503.11006v1

Date: 2025-03-14

Summary:

Vision and Language Navigation (VLN) requires an agent to navigate through environments following natural language instructions. However, existing methods often struggle with effectively integrating visual observations and instruction details during navigation, leading to suboptimal path planning and limited success rates. In this paper, we propose OIKG (Observation-graph Interaction and Key-detail Guidance), a novel framework that addresses these limitations through two key components: (1) an observation-graph interaction module that decouples angular and visual information while strengthening edge representations in the navigation space, and (2) a key-detail guidance module that dynamically extracts and utilizes fine-grained location and object information from instructions. By enabling more precise cross-modal alignment and dynamic instruction interpretation, our approach significantly improves the agent's ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR datasets demonstrate that OIKG achieves state-of-the-art performance across multiple evaluation metrics, validating the effectiveness of our method in enhancing navigation precision through better observation-instruction alignment.

--------------------------------------------------------------------------------------------------------

(ε, δ) Considered Harmful: Best Practices for Reporting Differential Privacy Guarantees

This paper addresses inconsistencies in how differential privacy (DP) guarantees are reported in machine learning. The authors advocate for Gaussian differential privacy (GDP) as the primary reporting method, with full privacy profiles as a secondary option. Unlike other approaches, GDP has a single parameter, making guarantees easier to compare across different settings. The researchers demonstrate GDP's accuracy by analyzing privacy profiles of state-of-the-art DP large-scale image classification and the U.S. Census TopDown algorithm. These findings could standardize privacy reporting in sensitive applications like healthcare data analysis and census operations, improving transparency and comparability of privacy guarantees.

Authors:  Juan Felipe Gomez, Bogdan Kulynych, Georgios Kaissis, Jamie Hayes, Borja Balle, Antti Honkela

Link:  https://arxiv.org/abs/2503.10945v1

Date: 2025-03-13

Summary:

Current practices for reporting the level of differential privacy (DP) guarantees for machine learning (ML) algorithms provide an incomplete and potentially misleading picture of the guarantees and make it difficult to compare privacy levels across different settings. We argue for using Gaussian differential privacy (GDP) as the primary means of communicating DP guarantees in ML, with the full privacy profile as a secondary option in case GDP is too inaccurate. Unlike other widely used alternatives, GDP has only one parameter, which ensures easy comparability of guarantees, and it can accurately capture the full privacy profile of many important ML applications. To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification, and the TopDown algorithm for the U.S. Decennial Census, observing that GDP fits the profiles remarkably well in all three cases. Although GDP is ideal for reporting the final guarantees, other formalisms (e.g., privacy loss random variables) are needed for accurate privacy accounting. We show that such intermediate representations can be efficiently converted to GDP with minimal loss in tightness.
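
For context, μ-GDP summarizes an entire (ε, δ) curve with one number: a mechanism is μ-GDP if distinguishing neighboring datasets is no easier than distinguishing N(0,1) from N(μ,1). The conversion to a full privacy profile is closed-form (Dong, Roth & Su), which is part of what makes GDP cheap to report. A small sketch:

```python
import numpy as np
from scipy.stats import norm

def gdp_delta(eps, mu):
    # delta(eps) implied by mu-GDP: the whole privacy profile follows from
    # the single parameter mu (Dong, Roth & Su's closed-form conversion).
    return norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2)

for eps in [0.5, 1.0, 2.0, 4.0]:
    print(f"mu=1.0, eps={eps}: delta={gdp_delta(eps, 1.0):.3e}")
```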

--------------------------------------------------------------------------------------------------------

Evaluating a Novel Neuroevolution and Neural Architecture Search System

Neuvo NAS+ introduces an innovative approach to neural network optimization that customizes both network architecture and training parameters for specific tasks. While industry trends favor large transformer models, specialized binary classifiers remain crucial for applications requiring computational efficiency and low latency. The system selects network features and training hyperparameters tailored to each dataset, outperforming traditional machine learning approaches in binary classification tasks. Experiments reveal substantial diversity in evolved architectures across datasets, confirming the value of task-specific optimization. For real-world applications requiring efficient binary classification, Neuvo NAS+ offers accuracy comparable to complex models while using significantly fewer computational resources.

Authors:  Benjamin David Winter, William John Teahan

Link:  https://arxiv.org/abs/2503.10869v1

Date: 2025-03-13

Summary:

The choice of neural network features can have a large impact on both the accuracy and speed of the network. Despite the current industry shift towards large transformer models, specialized binary classifiers remain critical for numerous practical applications where computational efficiency and low latency are essential. Neural network features tend to be developed homogeneously, resulting in slower or less accurate networks when testing against multiple datasets. In this paper, we show the effectiveness of Neuvo NAS+, a novel Python implementation of an extended Neural Architecture Search (NAS+) which allows the user to optimise the training parameters of a network as well as the network's architecture. We provide an in-depth analysis of the importance of tailoring a network's architecture to each dataset. We also describe the design of the Neuvo NAS+ system that selects network features on a task-specific basis, including network training hyper-parameters such as the number of epochs and batch size. Results show that the Neuvo NAS+ task-specific approach significantly outperforms several machine learning approaches such as Naive Bayes, C4.5, Support Vector Machine and a standard Artificial Neural Network in terms of accuracy for solving a range of binary classification problems. Our experiments demonstrate substantial diversity in evolved network architectures across different datasets, confirming the value of task-specific optimization. Additionally, Neuvo NAS+ outperforms other evolutionary algorithm optimisers in terms of both accuracy and computational efficiency, showing that properly optimized binary classifiers can match or exceed the performance of more complex models while requiring significantly fewer computational resources.
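
As a rough illustration of evolving architecture and training hyperparameters together, here is a toy genetic loop in Python. The genome fields, mutation scheme, and use of scikit-learn's MLPClassifier as the fitness oracle are all stand-ins; Neuvo NAS+'s actual encoding and evolutionary operators are more elaborate.

```python
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def random_genome():
    # Architecture and training hyperparameters evolve together in one genome.
    return {"hidden": tuple(random.choice([8, 16, 32, 64])
                            for _ in range(random.randint(1, 3))),
            "epochs": random.choice([50, 100, 200]),
            "batch": random.choice([16, 32, 64])}

def mutate(genome):
    child, gene = dict(genome), random.choice(["hidden", "epochs", "batch"])
    child[gene] = random_genome()[gene]       # resample a single gene
    return child

def fitness(g, Xtr, Xte, ytr, yte):
    clf = MLPClassifier(hidden_layer_sizes=g["hidden"], max_iter=g["epochs"],
                        batch_size=g["batch"], random_state=0)
    return clf.fit(Xtr, ytr).score(Xte, yte)  # validation accuracy as fitness

X, y = make_classification(n_samples=400, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
population = [random_genome() for _ in range(8)]
for _ in range(5):                            # tiny elitist evolutionary loop
    population.sort(key=lambda g: fitness(g, Xtr, Xte, ytr, yte), reverse=True)
    population = population[:4] + [mutate(g) for g in population[:4]]
print("best genome:", population[0])
```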

--------------------------------------------------------------------------------------------------------

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Block Diffusion introduces a novel class of language models that bridges the gap between discrete denoising diffusion and autoregressive models. While diffusion models offer benefits like parallelized generation and controllability, they traditionally lag in likelihood modeling and are limited to fixed-length generation. This hybrid approach overcomes these limitations by supporting flexible-length generation and improving inference efficiency through KV caching and parallel token sampling. The researchers provide an efficient training algorithm, gradient variance estimators, and data-driven noise schedules. Setting new state-of-the-art performance among diffusion language models, Block Diffusion enables arbitrary-length sequence generation, potentially revolutionizing applications requiring flexible text generation.

Authors:  Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

Link:  https://arxiv.org/abs/2503.09573v1

Date: 2025-03-12

Summary:

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms/
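
The generation loop, roughly: blocks are produced left to right like an autoregressive model, while the tokens inside each block are denoised in parallel like a diffusion model, with the finished prefix reused via KV caching. The sketch below uses a random-logit stand-in for the denoiser and a naive unmasking schedule; it shows the control flow only, not BD3-LM's trained sampler.

```python
import torch

MASK = -1  # sentinel for a not-yet-generated position (illustrative)

def dummy_denoiser(context, block, t, vocab_size=100):
    # Stand-in for the transformer; a real model would condition on the cached
    # prefix (KV cache) and the diffusion step t.
    return torch.randn(block.shape[0], vocab_size)

def sample(num_blocks=4, block_len=8, steps=4, vocab_size=100):
    seq = torch.empty(0, dtype=torch.long)
    for _ in range(num_blocks):                    # autoregressive over blocks
        block = torch.full((block_len,), MASK)
        for t in range(steps, 0, -1):              # parallel denoising within a block
            masked = block == MASK
            if not masked.any():
                break
            logits = dummy_denoiser(seq, block, t, vocab_size)
            sampled = torch.distributions.Categorical(logits=logits).sample()
            conf = logits.max(dim=-1).values.masked_fill(~masked, float("-inf"))
            n_unmask = max(1, int(masked.sum().item() / t))  # finish by t == 1
            idx = conf.topk(n_unmask).indices
            block[idx] = sampled[idx]
        seq = torch.cat([seq, block])              # flexible length: stop any time
    return seq

print(sample())
```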

--------------------------------------------------------------------------------------------------------

RetSTA: An LLM-Based Approach for Standardizing Clinical Fundus Image Reports

RetSTA addresses the challenge of standardizing clinical fundus diagnostic reports, which currently lack unified formats, terminology, and style. This inconsistency hinders large language models from effectively processing medical data. The researchers created a bilingual standard terminology of fundus clinical terms and developed two models: RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero demonstrates strong standardization capabilities but covers a limited range of diseases. The enhanced RetSTA-7B incorporates standardized data from its predecessor along with English translations, achieving report-level standardization across diverse clinical scenarios. This technology could significantly improve healthcare quality by standardizing ophthalmology reporting and facilitating more effective medical data integration.

Authors:  Jiushen Cai, Weihang Zhang, Hanruo Liu, Ningli Wang, Huiqi Li

Link:  https://arxiv.org/abs/2503.09358v1

Date: 2025-03-12

Summary:

Standardization of clinical reports is crucial for improving the quality of healthcare and facilitating data integration. The lack of unified standards, including format, terminology, and style, is a great challenge in clinical fundus diagnostic reports, which increases the difficulty for large language models (LLMs) to understand the data. To address this, we construct a bilingual standard terminology, containing fundus clinical terms and commonly used descriptions in clinical diagnosis. Then, we establish two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero, fine-tuned on an augmented dataset simulating clinical scenarios, demonstrates powerful standardization behaviors, but its coverage of diseases is limited. To further enhance standardization performance, we build RetSTA-7B, which integrates a substantial amount of standardized data generated by RetSTA-7B-Zero along with corresponding English data, covering diverse complex clinical scenarios and achieving report-level standardization for the first time. Experimental results demonstrate that RetSTA-7B outperforms the compared LLMs in the bilingual standardization task, which validates its superior performance and generalizability. The checkpoints are available at https://github.com/AB-Story/RetSTA-7B.

--------------------------------------------------------------------------------------------------------

HeGMN: Heterogeneous Graph Matching Network for Learning Graph Similarity

HeGMN addresses the limitations of existing graph similarity learning methods when applied to heterogeneous graphs. The framework introduces a two-tier matching mechanism: a heterogeneous graph isomorphism network that perceives different semantic relationships during aggregation, and graph-level and node-level matching modules that employ type-aligned matching principles. The researchers also propose a heterogeneous graph resampling method to construct graph pairs with corresponding edit distances, filling a gap in available datasets. Experiments demonstrate HeGMN's superior performance in graph similarity prediction across all tested datasets. This technology could enhance applications in drug discovery, social network analysis, and computer vision where graph matching is essential.

Authors:  Shilong Sang, Ke-Jia Chen, Zheng Liu

Link:  https://arxiv.org/abs/2503.08739v1

Date: 2025-03-11

Summary:

Graph similarity learning (GSL), also referred to as graph matching in many scenarios, is a fundamental problem in computer vision, pattern recognition, and graph learning. However, previous GSL methods assume that graphs are homogeneous and struggle to maintain their performance on heterogeneous graphs. To address this problem, this paper proposes a Heterogeneous Graph Matching Network (HeGMN), an end-to-end graph similarity learning framework composed of a two-tier matching mechanism. Firstly, a heterogeneous graph isomorphism network is proposed as the encoder, which reinvents the graph isomorphism network for heterogeneous graphs by perceiving different semantic relationships during aggregation. Secondly, graph-level and node-level matching modules are designed, both employing type-aligned matching principles. The former conducts graph-level matching by node type alignment, while the latter computes interactions between cross-graph nodes of the same type, thus reducing noise interference and computational overhead. Finally, the graph-level and node-level matching features are combined and fed into fully connected layers for predicting graph similarity scores. In experiments, we propose a heterogeneous graph resampling method to construct heterogeneous graph pairs and define the corresponding heterogeneous graph edit distance, filling a gap in available datasets. Extensive experiments demonstrate that HeGMN consistently achieves advanced performance on graph similarity prediction across all datasets.
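
The encoder idea, reduced to a sketch: a GIN-style update in which each relation type gets its own transform, so messages from different semantic relationships are not blended together. The names and the per-relation-MLP placement below are illustrative simplifications of HeGMN's heterogeneous graph isomorphism network, not its exact formulation.

```python
import torch
import torch.nn as nn

class HeteroGINLayer(nn.Module):
    # Sketch: GIN-like aggregation with one MLP per relation type, so the
    # update "perceives" different semantic relationships during aggregation.
    def __init__(self, dim, relations):
        super().__init__()
        self.mlps = nn.ModuleDict({r: nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                                    nn.Linear(dim, dim))
                                   for r in relations})
        self.eps = nn.Parameter(torch.zeros(1))

    def forward(self, h, edges_by_rel):
        # edges_by_rel: {relation: (src_idx, dst_idx)} as LongTensors
        agg = torch.zeros_like(h)
        for r, (src, dst) in edges_by_rel.items():
            msg = self.mlps[r](h[src])           # relation-specific transform
            agg.index_add_(0, dst, msg)          # sum incoming messages per node
        return (1 + self.eps) * h + agg

h = torch.randn(5, 16)
layer = HeteroGINLayer(16, ["cites", "authored_by"])
edges = {"cites": (torch.tensor([0, 1]), torch.tensor([2, 3])),
         "authored_by": (torch.tensor([4]), torch.tensor([0]))}
print(layer(h, edges).shape)                     # torch.Size([5, 16])
```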

--------------------------------------------------------------------------------------------------------

RePO: ReLU-based Preference Optimization

RePO introduces a streamlined approach to aligning large language models (LLMs) with human preferences, addressing computational and stability challenges of existing methods. While Direct Preference Optimization (DPO) established an offline paradigm with a single hyperparameter β, subsequent methods like SimPO reintroduced complexity through dual parameters. RePO eliminates β through gradient analysis and adopts a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically characterized as SimPO's limiting case, RePO forms a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms alternatives across multiple base models while requiring only one hyperparameter to tune.

Authors:  Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang

Link:  https://arxiv.org/abs/2503.07426v1

Date: 2025-03-10

Summary:

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter β, subsequent methods like SimPO reintroduce complexity through dual parameters (β, γ). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates β via two advances: (1) retaining SimPO's reference-free margins but removing β through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case (β→∞), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.
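
The loss itself is small enough to write down. A hedged sketch, assuming SimPO-style length-normalized, reference-free log probabilities as inputs: DPO/SimPO's logistic becomes a hinge, so pairs already separated by the margin γ contribute zero gradient.

```python
import torch

def repo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected, gamma=1.0):
    # Length-normalized, reference-free margin as in SimPO.
    margin = logp_chosen / len_chosen - logp_rejected / len_rejected
    # Hinge instead of logistic: pairs with margin >= gamma get zero loss and
    # zero gradient, which is how trivial pairs are filtered out.
    return torch.relu(gamma - margin).mean()

# toy batch: summed log-probs and lengths for chosen/rejected responses
lp_c, lp_r = torch.tensor([-20.0, -35.0]), torch.tensor([-30.0, -36.0])
n_c, n_r = torch.tensor([10.0, 14.0]), torch.tensor([12.0, 12.0])
print(repo_loss(lp_c, lp_r, n_c, n_r))
```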

--------------------------------------------------------------------------------------------------------

Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model

RxnIM addresses a critical gap in chemical research: the lack of machine-readable data from published chemical reactions. Despite AI's potential in advancing organic chemistry, its effectiveness depends on quality data that remains largely inaccessible in structured form. RxnIM is the first multimodal large language model specifically designed to parse chemical reaction images into machine-readable formats, extracting chemical components and interpreting textual descriptions of reaction conditions. With an average F1 score of 88% on benchmarks, outperforming previous methods by 5%, this technology represents a significant step toward automating the construction of comprehensive chemical reaction databases from literature images, potentially accelerating AI-driven chemical research and discovery.

Authors:  Yufan Chen, Ching Ting Leung, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao

Link:  https://arxiv.org/abs/2503.08156v1

Date: 2025-03-11

Summary:

Artificial intelligence (AI) has demonstrated significant promise in advancing organic chemistry research; however, its effectiveness depends on the availability of high-quality chemical reaction data. Currently, most published chemical reactions are not available in machine-readable form, limiting the broader application of AI in this field. The extraction of published chemical reactions into structured databases still relies heavily on manual curation, and robust automatic parsing of chemical reaction images into machine-readable data remains a significant challenge. To address this, we introduce the Reaction Image Multimodal large language model (RxnIM), the first multimodal large language model specifically designed to parse chemical reaction images into machine-readable reaction data. RxnIM not only extracts key chemical components from reaction images but also interprets the textual content that describes reaction conditions. Together with a specially designed large-scale dataset generation method to support model training, our approach achieves excellent performance, with an average F1 score of 88% on various benchmarks, surpassing literature methods by 5%. This represents a crucial step toward the automatic construction of large databases of machine-readable reaction data parsed from images in the chemistry literature, providing essential data resources for AI research in chemistry. The source code, model checkpoints, and datasets developed in this work are released under permissive licenses. An instance of the RxnIM web application can be accessed at https://huggingface.co/spaces/CYF200127/RxnIM.
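
To make "machine-readable reaction data" concrete, here is a hypothetical target record for one parsed reaction. The field names are illustrative, not RxnIM's actual output schema.

```python
# Hypothetical structured record for one reaction parsed from a scheme image
# (ethanol + acetic acid -> ethyl acetate); field names are made up for
# illustration and do not reflect RxnIM's real schema.
reaction = {
    "reactants": ["CCO", "CC(=O)O"],          # SMILES strings
    "products": ["CC(=O)OCC"],
    "conditions": {"reagent": "H2SO4", "temperature": "80 C", "time": "2 h"},
    "source": {"figure": "Scheme 1", "doi": "10.0000/example"},
}
print(reaction["products"])
```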

--------------------------------------------------------------------------------------------------------

A Theoretical Framework for Preventing Class Collapse in Supervised Contrastive Learning

This paper addresses the challenge of class collapse in supervised contrastive learning (SupCL), where discrimination among individual embeddings within the same class is reduced. The researchers introduce the Simplex-to-Simplex Embedding Model (SSEM), a theoretical framework that models various embedding structures, including those that minimize supervised contrastive loss. Through SSEM, they analyze how hyperparameters affect learned representations and provide practical guidelines for hyperparameter selection to mitigate class collapse risk. These findings, supported by empirical results across synthetic and real-world datasets, could significantly improve representation learning in applications ranging from computer vision to natural language processing.

Authors:  Chungpa Lee, Jeongheon Oh, Kibok Lee, Jy-yong Sohn

Link:  https://arxiv.org/abs/2503.08203v1

Date: 2025-03-11

Summary:

Supervised contrastive learning (SupCL) has emerged as a prominent approach in representation learning, leveraging both supervised and self-supervised losses. However, achieving an optimal balance between these losses is challenging; failing to do so can lead to class collapse, reducing discrimination among individual embeddings in the same class. In this paper, we present theoretically grounded guidelines for SupCL to prevent class collapse in learned representations. Specifically, we introduce the Simplex-to-Simplex Embedding Model (SSEM), a theoretical framework that models various embedding structures, including all embeddings that minimize the supervised contrastive loss. Through SSEM, we analyze how hyperparameters affect learned representations, offering practical guidelines for hyperparameter selection to mitigate the risk of class collapse. Our theoretical findings are supported by empirical results across synthetic and real-world datasets.
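
The loss under analysis is the standard supervised contrastive loss (Khosla et al., 2020). A compact implementation shows where one key hyperparameter, the temperature τ, enters; the paper's guidelines concern such choices and their effect on collapse.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    # Standard supervised contrastive loss; the paper analyzes which embedding
    # geometries minimize it and when same-class embeddings collapse together.
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))      # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)  # log-softmax per anchor
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

z = torch.randn(8, 16)                                   # toy embeddings
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supcon_loss(z, labels))
```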

--------------------------------------------------------------------------------------------------------

A primer on optimal transport for causal inference with observational data

This primer explores the profound yet underrecognized connections between optimal transport theory and causal inference with observational data. Optimal transport, which analyzes probabilities by comparing underlying state spaces, naturally aligns with causal inference's focus on understanding counterfactual states. The author reveals that many foundational causal inference models have implicitly relied on optimal transport principles without acknowledging the connection. By unifying language and notation across statistics, mathematics, and econometrics, this review illuminates existing relationships and identifies novel research directions. The insights could transform approaches to causal effect identification in fields ranging from economics to healthcare, where understanding causal relationships from observational data is crucial.

Authors:  Florian F Gunsilius

Link:  https://arxiv.org/abs/2503.07811v2

Date: 2025-03-12

Summary:

The theory of optimal transportation has developed into a powerful and elegant framework for comparing probability distributions, with wide-ranging applications in all areas of science. The fundamental idea of analyzing probabilities by comparing their underlying state space naturally aligns with the core idea of causal inference, where understanding and quantifying counterfactual states is paramount. Despite this intuitive connection, explicit research at the intersection of optimal transport and causal inference is only beginning to develop. Yet, many foundational models in causal inference have implicitly relied on optimal transport principles for decades, without recognizing the underlying connection. Therefore, the goal of this review is to offer an introduction to the surprisingly deep existing connections between optimal transport and the identification of causal effects with observational data -- where optimal transport is not just a set of potential tools, but actually builds the foundation of model assumptions. As a result, this review is intended to unify the language and notation between different areas of statistics, mathematics, and econometrics, by pointing out these existing connections, and to explore novel problems and directions for future work in both areas derived from this realization.
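
For readers new to the area, the object at the center of the review is the Kantorovich problem, stated minimally below.

```latex
% Kantorovich formulation: \Pi(P,Q) is the set of couplings (joint laws)
% whose marginals are P and Q; c(x,y) is the transport cost.
W_c(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int c(x, y) \, \mathrm{d}\pi(x, y)
```

Loosely, in the causal reading the primer develops, a coupling π plays the role of a joint law over factual and counterfactual states, so identification assumptions in causal models amount to singling out particular transport plans.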

--------------------------------------------------------------------------------------------------------

Who Are You Behind the Screen? Implicit MBTI and Gender Detection Using Artificial Intelligence

This research investigates the detection of personality traits and gender directly from linguistic patterns in digital conversations. Unlike conventional methods that rely on self-reported data, the researchers refined a RoBERTa language model to capture complex linguistic cues from Telegram messages. Using confidence levels, they achieved 86.16% accuracy in identifying MBTI personality types and 74.4% accuracy in gender classification. The study reveals that individuals with introverted and intuitive preferences are particularly active in text-based interactions. These findings have implications for personalized technology and psychological research, potentially enhancing personalized recommendation systems, mental health applications, and targeted marketing, while raising important considerations about privacy and ethical use.

Authors:  Kourosh Shahnazari, Seyed Moein Ayyoubzadeh

Link:  https://arxiv.org/abs/2503.09853v1

Date: 2025-03-12

Summary:

In personalized technology and psychological research, precisely detecting demographic features and personality traits from digital interactions becomes ever more important. While conventional personality prediction techniques mostly depend on explicitly self-reported labels, this work investigates implicit categorization, inferring personality and gender variables directly from linguistic patterns in Telegram conversation data. We refine a Transformer-based language model (RoBERTa) to capture complex linguistic cues indicative of personality traits and gender differences, using a dataset comprising 138,866 messages from 1,602 users annotated with MBTI types and 195,016 messages from 2,598 users annotated with gender. Applying confidence thresholds raises model accuracy to 86.16%, demonstrating RoBERTa's capacity to consistently identify implicit personality types from conversational text data. For gender classification, the model obtained an accuracy of 74.4%, capturing gender-specific language patterns. Personality dimension analysis showed that people with introverted and intuitive preferences are especially active in text-based interactions. These results highlight the effectiveness of Transformer architectures for implicit personality and gender classification and underscore the practical trade-off between accuracy and coverage in realistic conversational environments.
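
The accuracy/coverage trade-off the authors describe comes from abstaining on low-confidence predictions. A generic sketch follows; the threshold value and interface are illustrative, not the paper's exact procedure.

```python
import torch

def predict_with_confidence(logits, threshold=0.9):
    # Keep only predictions whose softmax confidence clears the threshold;
    # the rest are abstentions. Raising the threshold trades coverage for accuracy.
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    keep = conf >= threshold
    return pred[keep], keep, keep.float().mean().item()

logits = torch.randn(10, 16)                     # e.g., scores over 16 MBTI types
pred, keep, coverage = predict_with_confidence(logits, threshold=0.2)
print(pred, f"coverage={coverage:.0%}")
```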

--------------------------------------------------------------------------------------------------------

Junior Software Developers' Perspectives on Adopting LLMs for Software Engineering: a Systematic Literature Review

This systematic literature review examines junior software developers' perspectives on adopting Large Language Model-based tools for software engineering. Analyzing 56 primary studies, the researchers found that only 8.9% provide clear definitions of junior developers, highlighting a lack of uniformity. Information searching emerged as the most common task using LLM tools, with ChatGPT being the most frequently studied. The majority of studies (83.9%) report both positive and negative perceptions about LLM adoption. Interestingly, developers are using LLMs not just for code generation but also to improve their development skills, while remaining aware of limitations like incorrect suggestions and AI hallucinations. These findings offer valuable insights for researchers, educators, and the software industry.

Authors:  Samuel Ferino, Rashina Hoda, John Grundy, Christoph Treude

Link:  https://arxiv.org/abs/2503.07556v1

Date: 2025-03-10

Summary:

Many studies exploring the adoption of Large Language Model-based tools for software development by junior developers have emerged in recent years. These studies have sought to understand developers' perspectives about using those tools, a fundamental pillar for successfully adopting LLM-based tools in Software Engineering. The aim of this paper is to provide an overview of junior software developers' perspectives and use of LLM-based tools for software engineering (LLM4SE). We conducted a systematic literature review (SLR) following guidelines by Kitchenham et al. on 56 primary studies, defining junior software developers as those with five or fewer years of experience, including Computer Science/Software Engineering students. We found that the majority of the studies focused on comprehending the different aspects of integrating AI tools in SE. Only 8.9% of the studies provide a clear definition for junior software developers, and there is no uniformity. Searching for relevant information is the most common task using LLM tools. ChatGPT was the most common LLM tool present in the studies (and experiments). A majority of the studies (83.9%) report both positive and negative perceptions about the impact of adopting LLM tools. We also found and categorised advantages, challenges, and recommendations regarding LLM adoption. Our results indicate that developers are using LLMs not just for code generation, but also to improve their development skills. Critically, they are not just experiencing the benefits of adopting LLM tools; they are also aware of at least a few LLM limitations, such as the generation of incorrect suggestions, potential data leakage, and AI hallucination. Our findings offer implications for software engineering researchers, educators, and developers.

--------------------------------------------------------------------------------------------------------

A Multimodal Fusion Model Leveraging MLP Mixer and Handcrafted Features-based Deep Learning Networks for Facial Palsy Detection

This research presents an innovative multimodal approach to facial palsy detection, combining an MLP mixer-based model for unstructured data (images or facial line segments) with a feed-forward neural network for structured data (facial landmarks, expression features, or handcrafted features). Traditional assessment methods rely on subjective clinical evaluations, but this algorithmic approach offers potential improvements in objectivity and efficiency. Using videos from 20 facial palsy patients and 20 healthy subjects, the multimodal fusion model achieved an impressive 96.00 F1 score, significantly outperforming single-modality approaches. This technology could enhance diagnostic accuracy and monitoring of facial palsy, potentially improving patient outcomes through earlier detection and more precise treatment planning.

Authors:  Heng Yim Nicole Oo, Min Hun Lee, Jeong Hoon Lim

Link:  https://arxiv.org/abs/2503.10371v1

Date: 2025-03-13

Summary:

Algorithmic detection of facial palsy offers the potential to improve current practices, which usually involve labor-intensive and subjective assessments by clinicians. In this paper, we present a multimodal fusion-based deep learning model that utilizes an MLP mixer-based model to process unstructured data (i.e., RGB images or images with facial line segments) and a feed-forward neural network to process structured data (i.e., facial landmark coordinates, features of facial expressions, or handcrafted features) for detecting facial palsy. We then contribute a study analyzing the effect of different data modalities and the benefits of a multimodal fusion-based approach using videos of 20 facial palsy patients and 20 healthy subjects. Our multimodal fusion model achieved 96.00 F1, which is significantly higher than the feed-forward neural network trained on handcrafted features alone (82.80 F1) and an MLP mixer-based model trained on raw RGB images (89.00 F1).
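
Architecturally, the fusion is straightforward to sketch: a mixer branch embeds the image, a feed-forward branch embeds the structured features, and the two are concatenated before the classifier. The dimensions and single mixer block below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    # Minimal MLP-Mixer block: token-mixing MLP, then channel-mixing MLP.
    def __init__(self, tokens, dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(tokens, tokens), nn.GELU(),
                                       nn.Linear(tokens, tokens))
        self.chan_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                      nn.Linear(dim, dim))

    def forward(self, x):                       # x: (batch, tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))

class FusionModel(nn.Module):
    # Illustrative late fusion: mixer branch for image patches, feed-forward
    # branch for structured features, concatenated for the classifier.
    def __init__(self, tokens=49, dim=64, feat_dim=30):
        super().__init__()
        self.mixer = MixerBlock(tokens, dim)
        self.ffn = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 64))
        self.head = nn.Linear(dim + 64, 2)      # palsy vs. healthy

    def forward(self, patches, feats):
        img = self.mixer(patches).mean(dim=1)   # global average over tokens
        return self.head(torch.cat([img, self.ffn(feats)], dim=-1))

model = FusionModel()
patches = torch.randn(4, 49, 64)   # e.g., 7x7 patch embeddings per face image
feats = torch.randn(4, 30)         # e.g., landmark-derived handcrafted features
print(model(patches, feats).shape) # torch.Size([4, 2])
```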

--------------------------------------------------------------------------------------------------------

AI-Driven Decision Support in Oncology: Evaluating Data Readiness for Skin Cancer Treatment

This research evaluates data readiness for developing AI-based Clinical Decision Support Systems (CDSS) in skin cancer treatment. Conducted at the University Hospital Münster's Skin Tumor Center, the study examines data quality, availability, and extractability—crucial factors for effective AI implementation in oncology. Through literature review, data readiness assessment, and expert workshops, the researchers identified essential data points for skin cancer treatment decisions and evaluated their presence across information systems. The findings highlight challenges in extracting information from unstructured data and emphasize the importance of high-quality, accessible data for successful AI-driven decision support in complex medical settings. This work could improve treatment outcomes through more informed clinical decision-making.

Authors:  Joscha Grüger, Tobias Geyer, Tobias Brix, Michael Storck, Sonja Leson, Laura Bley, Carsten Weishaupt, Ralph Bergmann, Stephan A. Braun

Link:  https://arxiv.org/abs/2503.09164v1

Date: 2025-03-12

Summary:

This research focuses on evaluating and enhancing data readiness for the development of an Artificial Intelligence (AI)-based Clinical Decision Support System (CDSS) in the context of skin cancer treatment. The study, conducted at the Skin Tumor Center of the University Hospital Münster, delves into the essential role of data quality, availability, and extractability in implementing effective AI applications in oncology. By employing a multifaceted methodology, including literature review, data readiness assessment, and expert workshops, the study addresses the challenges of integrating AI into clinical decision-making. The research identifies crucial data points for skin cancer treatment decisions, evaluates their presence and quality in various information systems, and highlights the difficulties in extracting information from unstructured data. The findings underline the significance of high-quality, accessible data for the success of AI-driven CDSS in medical settings, particularly in the complex field of oncology.

--------------------------------------------------------------------------------------------------------

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

This study reveals that Chain-of-Thought (CoT) reasoning in frontier AI models—including Sonnet 3.7, DeepSeek R1, and ChatGPT-4o—can be unfaithful even in realistic contexts without artificial bias. The researchers found non-negligible rates of unfaithful reasoning: 16.3% for Sonnet 3.7, 5.3% for DeepSeek R1, and 7.0% for ChatGPT-4o. Models sometimes rationalize implicit biases in binary questions, producing superficially coherent but logically contradictory arguments. They also identified restoration errors, where models silently correct reasoning mistakes, and unfaithful shortcuts in solving complex problems. These findings pose significant challenges for AI safety work that relies on monitoring CoT reasoning to detect undesired behavior, highlighting the need for more robust evaluation methods.

Authors:  Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy

Link:  https://arxiv.org/abs/2503.08679v2

Date: 2025-03-13

Summary:

Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal non-negligible rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%) and ChatGPT-4o (7.0%) all answer a notable proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions ("implicit post-hoc rationalization"). For example, when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving problems in Putnam questions (a hard benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.

--------------------------------------------------------------------------------------------------------

Learning to Inference Adaptively for Multimodal Large Language Models

AdaLLaVA addresses the computational challenges of deploying Multimodal Large Language Models (MLLMs) in resource-constrained environments. While MLLMs demonstrate impressive reasoning capabilities, their substantial computational requirements limit practical applications. Unlike previous efficiency solutions, AdaLLaVA dynamically reconfigures MLLM operations during inference based on input data and latency budgets, adapting to changing resource availability. Extensive experiments across question-answering, reasoning, and hallucination benchmarks demonstrate AdaLLaVA's effectiveness in meeting input latency budgets while achieving various accuracy-latency tradeoffs. The framework adapts to both input latency and content requirements, integrates with token selection for enhanced efficiency, and generalizes across different MLLMs, potentially enabling broader deployment of these powerful models in real-world applications.

Authors:  Zhuoyan Xu, Khoi Duc Nguyen, Preeti Mukherjee, Saurabh Bagchi, Somali Chaterji, Yingyu Liang, Yin Li

Link:  https://arxiv.org/abs/2503.10905v1

Date: 2025-03-13

Summary:

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained settings. Despite recent efforts on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to the input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both input latency and content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs. Our project webpage with code release is at https://zhuoyan-xu.github.io/ada-llava/.
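
One way to picture a latency-aware scheduler: a small controller looks at the input and the budget, scores optional compute blocks, and only the affordable top-scoring ones run. This is a loose, hypothetical rendering of the idea; AdaLLaVA's learned scheduler and the operations it gates inside an MLLM are substantially more involved.

```python
import torch
import torch.nn as nn

class AdaptiveStack(nn.Module):
    # Hypothetical sketch: a controller emits scores for optional blocks given
    # an input summary and a latency budget; at inference, only the top-k
    # blocks that fit the budget are executed.
    def __init__(self, dim=64, n_blocks=8, cost_per_block=1.0):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.GELU())
                                    for _ in range(n_blocks))
        self.scheduler = nn.Linear(dim + 1, n_blocks)  # input summary + budget
        self.cost = cost_per_block

    def forward(self, x, budget):                      # x: (batch, dim)
        summary = torch.cat([x.mean(0, keepdim=True),
                             torch.tensor([[budget]])], dim=-1)
        scores = self.scheduler(summary).squeeze(0)
        k = min(len(self.blocks), int(budget // self.cost))
        keep = set(scores.topk(k).indices.tolist())    # affordable top-k blocks
        for i, blk in enumerate(self.blocks):
            if i in keep:
                x = x + blk(x)                         # residual optional block
        return x

net = AdaptiveStack()
x = torch.randn(5, 64)
print(net(x, budget=3.0).shape)    # runs at most 3 of the 8 blocks
```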

--------------------------------------------------------------------------------------------------------

Advancing Education through Tutoring Systems: A Systematic Literature Review

This systematic review examines how Tutoring Systems—including Intelligent Tutoring Systems (ITS) and Robot Tutoring Systems (RTS)—address global educational challenges through personalized instruction. As students worldwide struggle with core academic proficiency, these systems offer promising solutions. ITS leverages AI models like Bayesian Knowledge Tracing and Large Language Models for cognitive support, while RTS enhances emotional engagement through human-like interactions. Analyzing 86 studies, the researchers identified three distinct categories: computer-based ITS, robot-based RTS, and multimodal integrated systems. While significant advancements enhance adaptability and learning outcomes, challenges persist in ethics, scalability, and cognitive adaptability. The findings suggest that integrated hybrid solutions could maximize educational benefits, potentially transforming educational practices worldwide.

Authors:  Vincent Liu, Ehsan Latif, Xiaoming Zhai

Link:  https://arxiv.org/abs/2503.09748v1

Date: 2025-03-12

Summary:

This study systematically reviews the transformative role of Tutoring Systems, encompassing Intelligent Tutoring Systems (ITS) and Robot Tutoring Systems (RTS), in addressing global educational challenges through advanced technologies. As many students struggle with proficiency in core academic areas, Tutoring Systems emerge as promising solutions to bridge learning gaps by delivering personalized and adaptive instruction. ITS leverages artificial intelligence (AI) models, such as Bayesian Knowledge Tracing and Large Language Models, to provide precise cognitive support, while RTS enhances social and emotional engagement through human-like interactions. This systematic review, adhering to the PRISMA framework, analyzed 86 representative studies. We evaluated the pedagogical and technological advancements, engagement strategies, and ethical considerations surrounding these systems. Based on these parameters, Latent Class Analysis was conducted and identified three distinct categories: computer-based ITS, robot-based RTS, and multimodal systems integrating various interaction modes. The findings reveal significant advancements in AI techniques that enhance adaptability, engagement, and learning outcomes. However, challenges such as ethical concerns, scalability issues, and gaps in cognitive adaptability persist. The study highlights the complementary strengths of ITS and RTS, proposing integrated hybrid solutions to maximize educational benefits. Future research should focus on bridging gaps in scalability, addressing ethical considerations comprehensively, and advancing AI models to support diverse educational needs.

--------------------------------------------------------------------------------------------------------

Combining Local Symmetry Exploitation and Reinforcement Learning for Optimised Probabilistic Inference -- A Work In Progress

This research aims to optimize probabilistic inference in graphical models by finding efficient variable elimination orders. While optimal elimination order identification is a challenging combinatorial optimization problem, the researchers adapt a reinforcement learning approach originally developed for tensor networks. They incorporate structure exploitation into the optimization process, focusing on local symmetries within a model's factors. By introducing compact encodings of intermediate results based on these symmetries, the agent can explore more efficient contraction orders. This approach could significantly improve computational efficiency in probabilistic reasoning across various domains, including medical diagnosis, fault detection, and risk assessment systems that rely on graphical models.

Authors:  Sagad Hamid, Tanya Braun

Link:  https://arxiv.org/abs/2503.08786v1

Date: 2025-03-11

Summary:

Efficient probabilistic inference by variable elimination in graphical models requires an optimal elimination order. However, finding an optimal order is a challenging combinatorial optimisation problem for models with a large number of random variables. Most recently, a reinforcement learning approach has been proposed to find efficient contraction orders in tensor networks. Due to the duality between graphical models and tensor networks, we adapt this approach to probabilistic inference in graphical models. Furthermore, we incorporate structure exploitation into the process of finding an optimal order. Currently, the agent's cost function is formulated in terms of intermediate result sizes which are exponential in the number of indices (i.e., random variables). We show that leveraging specific structures during inference allows for introducing compact encodings of intermediate results which can be significantly smaller. By considering the compact encoding sizes for the cost function instead, we enable the agent to explore more efficient contraction orders. The structure we consider in this work is the presence of local symmetries (i.e., symmetries within a model's factors).
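
For orientation, the combinatorial problem being optimized looks like the greedy baseline below: at each step, eliminate the variable whose intermediate factor would be smallest. The paper's approach replaces the raw table size in this cost with symmetry-aware compact encoding sizes and lets an RL agent, rather than a greedy rule, pick the order; the sketch is only the classical baseline.

```python
import math

def greedy_elimination_order(domains, factors):
    # Classical baseline: repeatedly eliminate the variable whose intermediate
    # factor (product of neighbouring domain sizes) would be smallest.
    order, remaining = [], set(domains)
    scopes = [set(s) for s in factors]
    while remaining:
        def table_size(v):
            merged = set().union(*[s for s in scopes if v in s]) - {v}
            return math.prod(domains[u] for u in merged) if merged else 1
        v = min(remaining, key=table_size)
        merged = set().union(*[s for s in scopes if v in s]) - {v}
        scopes = [s for s in scopes if v not in s] + ([merged] if merged else [])
        remaining.remove(v)
        order.append(v)
    return order

# toy chain A - B - C with binary variables
print(greedy_elimination_order({"A": 2, "B": 2, "C": 2},
                               [{"A", "B"}, {"B", "C"}]))
```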

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.