Week Ending 2.16.2025

 

RESEARCH WATCH: 2.16.2025

 

STAR: Spectral Truncation and Rescale for Model Merging

Model merging has become a crucial technique for combining multiple AI models' capabilities without expensive retraining. However, performance typically degrades as more models are merged. STAR introduces a novel approach using spectral truncation and rescaling to reduce conflicts between merged models, maintaining better performance even when combining many models. This could be particularly valuable for organizations looking to create versatile AI systems that can handle multiple tasks efficiently without the computational cost of training new models from scratch.

Authors:  Yu-Ang Lee, Ching-Yun Ko, Tejaswini Pedapati, I-Hsin Chung, Mi-Yen Yeh, Pin-Yu Chen

Link:  https://arxiv.org/abs/2502.10339v1

Date: 2025-02-14

Summary:

Model merging is an efficient way of obtaining a multi-task model from several pretrained models without further fine-tuning, and it has gained attention in various domains, including natural language processing (NLP). Despite the efficiency, a key challenge in model merging is the seemingly inevitable decrease in task performance as the number of models increases. In this paper, we propose Spectral Truncation And Rescale (STAR) that aims at mitigating "merging conflicts" by truncating small components in the respective spectral spaces, which is followed by an automatic parameter rescaling scheme to retain the nuclear norm of the original matrix. STAR requires no additional inference on original training data and is robust to hyperparameter choice. We demonstrate the effectiveness of STAR through extensive model merging cases on diverse NLP tasks. Specifically, STAR works robustly across varying model sizes, and can outperform baselines by 4.2% when merging 12 models on Flan-T5. Our code is publicly available at https://github.com/IBM/STAR.
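
To make the core operation concrete, here is a minimal NumPy sketch of the spectral truncation and rescaling step described above; the keep_ratio knob and the simple averaging merge at the end are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def star_truncate_rescale(delta, keep_ratio=0.1):
    # Truncate small spectral components of a task-vector matrix, then rescale the
    # surviving singular values so the nuclear norm of the original matrix is retained.
    # Simplified sketch of the STAR idea; keep_ratio is an illustrative knob.
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    k = max(1, int(np.ceil(keep_ratio * len(s))))   # number of components to keep
    s_kept = np.zeros_like(s)
    s_kept[:k] = s[:k]                              # drop small components
    s_kept *= s.sum() / s_kept.sum()                # retain the original nuclear norm
    return (U * s_kept) @ Vt

# Example: merge two task deltas by simple averaging after STAR-style processing.
rng = np.random.default_rng(0)
d1, d2 = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
merged = 0.5 * (star_truncate_rescale(d1) + star_truncate_rescale(d2))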

--------------------------------------------------------------------------------------------------------

Technical Risks of (Lethal) Autonomous Weapons Systems

As autonomous weapons systems become increasingly sophisticated, understanding their technical limitations and risks becomes crucial for international security. This research examines the fundamental challenges of implementing reliable AI-based weapon systems, highlighting issues like unpredictable behavior and lack of transparency. The findings have important implications for military planning, international policy-making, and the development of safety protocols for autonomous systems deployment in critical operations.

Authors:  Heramb Podar, Alycia Colijn

Link:  https://arxiv.org/abs/2502.10174v1

Date: 2025-02-14

Summary:

The autonomy and adaptability of (Lethal) Autonomous Weapons Systems, (L)AWS in short, promise unprecedented operational capabilities, but they also introduce profound risks that challenge the principles of control, accountability, and stability in international security. This report outlines the key technological risks associated with (L)AWS deployment, emphasizing their unpredictability, lack of transparency, and operational unreliability, which can lead to severe unintended consequences.

Key Takeaways:

1. Proposed advantages of (L)AWS can only be achieved through objectification and classification, but a range of systematic risks limit the reliability and predictability of classifying algorithms.

2. These systematic risks include the black-box nature of AI decision-making, susceptibility to reward hacking, goal misgeneralization and potential for emergent behaviors that escape human control.

3. (L)AWS could act in ways that are not just unexpected but also uncontrollable, undermining mission objectives and potentially escalating conflicts.

4. Even rigorously tested systems may behave unpredictably and harmfully in real-world conditions, jeopardizing both strategic stability and humanitarian principles.

--------------------------------------------------------------------------------------------------------

Automation Bias in the AI Act: On the Legal Implications of Attempting to De-Bias Human Oversight of AI

The European Union's AI Act addresses automation bias - the human tendency to over-rely on AI systems. This paper analyzes how the legislation approaches this psychological phenomenon and its practical implications for AI providers and users. The research is particularly relevant for organizations developing high-risk AI systems in Europe, as it highlights the challenges of implementing legal requirements around human oversight and automation bias awareness.

Authors:  Johann Laux, Hannah Ruschemeier

Link:  https://arxiv.org/abs/2502.10036v1

Date: 2025-02-14

Summary:

This paper examines the legal implications of the explicit mentioning of automation bias (AB) in the Artificial Intelligence Act (AIA). The AIA mandates human oversight for high-risk AI systems and requires providers to enable awareness of AB, i.e., the tendency to over-rely on AI outputs. The paper analyses how this extra-juridical concept is embedded in the AIA, the division of responsibility between AI providers and deployers, and the challenges of legally enforcing this novel awareness requirement. The analysis shows that the AIA's focus on providers does not adequately address design and context as causes of AB, and questions whether the AIA should directly regulate the risk of AB rather than just mandating awareness. As the AIA's approach requires a balance between legal mandates and behavioural science, the paper proposes that harmonised standards should reference the state of research on AB and human-AI interaction. Ultimately, further empirical research will be essential for effective safeguards.

--------------------------------------------------------------------------------------------------------

Dream to Drive: Model-Based Vehicle Control Using Analytic World Models

This research introduces a novel approach to training autonomous vehicle controllers using differentiable simulators. By training world models rather than just policies, the system offers improved performance for vehicle control and planning. The approach could significantly advance autonomous driving technology by enabling more efficient and reliable training of vehicle control systems, particularly in complex real-world scenarios.

Authors:  Asen Nachkov, Danda Pani Paudel, Jan-Nico Zaech, Davide Scaramuzza, Luc Van Gool

Link:  https://arxiv.org/abs/2502.10012v1

Date: 2025-02-14

Summary:

Differentiable simulators have recently shown great promise for training autonomous vehicle controllers. Because gradients can be backpropagated through them, they can be placed into an end-to-end training loop where their known dynamics turn into useful priors for the policy to learn, removing the typical black-box assumption of the environment. So far, these systems have only been used to train policies. However, this is not the end of the story in terms of what they can offer. Here, for the first time, we use them to train world models. Specifically, we present three new task setups that allow us to learn next state predictors, optimal planners, and optimal inverse states. Unlike analytic policy gradients (APG), which requires the gradient of the next simulator state with respect to the current actions, our proposed setups rely on the gradient of the next state with respect to the current state. We call this approach Analytic World Models (AWMs) and showcase its applications, including how to use it for planning in the Waymax simulator. Apart from pushing the limits of what is possible with such simulators, we offer an improved training recipe that increases performance on the large-scale Waymo Open Motion dataset by up to 12% compared to baselines at essentially no additional cost.
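
As a rough illustration of the idea, the toy PyTorch sketch below uses a hand-written differentiable simulator step, shows that the gradient of the next state with respect to the current state is available via autograd, and fits a simple next-state predictor. The dynamics and the network are placeholders, not Waymax or the authors' models.

import torch

def sim_step(state, action, dt=0.1):
    # Toy differentiable simulator step: state = [x, y, heading, speed].
    x, y, th, v = state.unbind(-1)
    steer, accel = action.unbind(-1)
    return torch.stack([x + v * torch.cos(th) * dt,
                        y + v * torch.sin(th) * dt,
                        th + steer * dt,
                        v + accel * dt], dim=-1)

# The gradient of the next state w.r.t. the current state is available analytically
# through autograd, which is the quantity the AWM setups build on.
s0, a0 = torch.rand(4), torch.rand(2)
dnext_dstate = torch.autograd.functional.jacobian(lambda s: sim_step(s, a0), s0)

# Train a simple next-state predictor (world model) against simulator rollouts.
world_model = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.Tanh(), torch.nn.Linear(64, 4))
opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)
for _ in range(200):
    state, action = torch.rand(32, 4), torch.rand(32, 2)
    target = sim_step(state, action)
    pred = world_model(torch.cat([state, action], dim=-1))
    loss = torch.mean((pred - target) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()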

--------------------------------------------------------------------------------------------------------

X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability

Addressing the challenge of protecting Large Language Models from multi-turn jailbreaks while maintaining usability, X-Boundary introduces a novel approach to establishing clear safety boundaries. The method successfully prevents harmful interactions while reducing false positives that limit legitimate use. This research could be particularly valuable for organizations deploying LLMs in public-facing applications where security and usability must be carefully balanced.

Authors:  Xiaoya Lu, Dongrui Liu, Yi Yu, Luxin Xu, Jing Shao

Link:  https://arxiv.org/abs/2502.09990v1

Date: 2025-02-14

Summary:

Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: https://github.com/AI45Lab/X-Boundary.
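
A minimal sketch of the kind of representation-level objective the summary describes: push harmful hidden states away from boundary-safe ones while anchoring safe states to their original positions. This is an illustrative hinge-plus-anchor loss, not the paper's exact X-Boundary objective.

import torch
import torch.nn.functional as F

def boundary_separation_loss(harmful_h, safe_h, safe_h_ref, margin=5.0, alpha=1.0):
    # Enforce a minimum distance between harmful and boundary-safe hidden states,
    # while keeping safe states close to their reference (pre-training) positions.
    dists = torch.cdist(harmful_h, safe_h)        # pairwise distances
    push = F.relu(margin - dists).mean()          # hinge: enforce separation
    keep = F.mse_loss(safe_h, safe_h_ref)         # preserve safe representations
    return push + alpha * keep

# Usage with random stand-ins for hidden states of shape (batch, hidden_dim):
h_harm, h_safe = torch.randn(8, 768), torch.randn(8, 768)
loss = boundary_separation_loss(h_harm, h_safe, h_safe.detach().clone())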

--------------------------------------------------------------------------------------------------------

MDCrow: Automating Molecular Dynamics Workflows with Large Language Models

Molecular dynamics simulations are essential for understanding complex biological systems but are notoriously difficult to automate. MDCrow leverages large language models to automate these workflows, using expert-designed tools for file handling, simulation setup, and analysis. This innovation could significantly accelerate drug discovery, materials science research, and other fields requiring molecular dynamics simulations.

Authors:  Quintina Campbell, Sam Cox, Jorge Medina, Brittany Watterson, Andrew D. White

Link:  https://arxiv.org/abs/2502.09565v1

Date: 2025-02-13

Summary:

Molecular dynamics (MD) simulations are essential for understanding biomolecular systems but remain challenging to automate. Recent advances in large language models (LLMs) have demonstrated success in automating complex scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an agentic LLM assistant capable of automating MD workflows. MDCrow uses chain-of-thought over 40 expert-designed tools for handling and processing files, setting up simulations, analyzing the simulation outputs, and retrieving relevant information from literature and databases. We assess MDCrow's performance across 25 tasks with varying numbers of required subtasks and difficulty, and we evaluate the agent's robustness to both difficulty and prompt style. gpt-4o is able to complete complex tasks with low variance, followed closely by llama3-405b, a compelling open-source model. While prompt style does not influence the best models' performance, it has significant effects on smaller models.
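
A hypothetical sketch of an agent loop over expert-designed tools, in the spirit of MDCrow's design; the tool names and the call_llm helper are assumptions for illustration, not the project's actual API.

from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {
    "clean_pdb": lambda arg: f"cleaned structure saved for {arg}",
    "setup_simulation": lambda arg: f"simulation input files written for {arg}",
    "analyze_rmsd": lambda arg: f"RMSD computed for trajectory {arg}",
}

def call_llm(prompt: str) -> str:
    # Placeholder for an LLM call that returns 'tool_name: argument' or 'FINISH'.
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10) -> str:
    scratchpad = f"Task: {task}\nAvailable tools: {', '.join(TOOLS)}\n"
    for _ in range(max_steps):
        decision = call_llm(scratchpad)              # chain-of-thought + tool choice
        if decision.strip() == "FINISH":
            break
        name, _, arg = decision.partition(":")
        observation = TOOLS.get(name.strip(), lambda a: "unknown tool")(arg.strip())
        scratchpad += f"\n{decision}\nObservation: {observation}"
    return scratchpad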

--------------------------------------------------------------------------------------------------------

Language Agents as Digital Representatives in Collective Decision-Making

This research explores using language models as proxies for human participants in collective decision-making processes. By training these models to represent individual preferences accurately, the system could enable more efficient large-scale consultation and decision-making processes. Applications could include public policy development, corporate governance, and community planning where gathering direct input from all stakeholders is impractical.

Authors:  Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti

Link:  https://arxiv.org/abs/2502.09369v1

Date: 2025-02-13

Summary:

Consider the process of collective decision-making, in which a group of individuals interactively select a preferred outcome from among a universe of alternatives. In this context, "representation" is the activity of making an individual's preferences present in the process via participation by a proxy agent, i.e., their "representative". To this end, learned models of human behavior have the potential to fill this role, with practical implications for multi-agent scenario studies and mechanism design. In this work, we investigate the possibility of training language agents to behave in the capacity of representatives of human agents, appropriately expressing the preferences of those individuals whom they stand for. First, we formalize the setting of collective decision-making as the episodic process of interaction between a group of agents and a decision mechanism. On this basis, we then formalize the problem of digital representation as the simulation of an agent's behavior to yield equivalent outcomes from the mechanism. Finally, we conduct an empirical case study in the setting of consensus-finding among diverse humans, and demonstrate the feasibility of fine-tuning large language models to act as digital representatives.

--------------------------------------------------------------------------------------------------------

When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models

This study compares how humans and large language models process challenging "garden-path" sentences - those that initially mislead readers about their structure. The research provides insights into both human language processing and AI language understanding capabilities. These findings could help improve language model design and evaluation, particularly for applications requiring precise language understanding like translation or content generation.

Authors:  Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant

Link:  https://arxiv.org/abs/2502.09307v1

Date: 2025-02-13

Summary:

Modern Large Language Models (LLMs) have shown human-like abilities in many language tasks, sparking interest in comparing LLMs' and humans' language processing. In this paper, we conduct a detailed comparison of the two on a sentence comprehension task using garden-path constructions, which are notoriously challenging for humans. Based on psycholinguistic research, we formulate hypotheses on why garden-path sentences are hard, and test these hypotheses on human participants and a large suite of LLMs using comprehension questions. Our findings reveal that both LLMs and humans struggle with specific syntactic complexities, with some models showing high correlation with human comprehension. To complement our findings, we test LLM comprehension of garden-path constructions with paraphrasing and text-to-image generation tasks, and find that the results mirror the sentence comprehension question results, further validating our findings on LLM understanding of these constructions.

--------------------------------------------------------------------------------------------------------

A Novel Approach for Multimodal Emotion Recognition: Multimodal Semantic Information Fusion

This paper presents a new approach to multimodal emotion recognition, combining visual and linguistic information more effectively through contrastive learning and visual sequence compression. The system shows improved accuracy in recognizing emotions across different modalities. Potential applications include human-computer interaction, mental health monitoring, and customer service analysis where understanding emotional context is crucial.

Authors:  Wei Dai, Dequan Zheng, Feng Yu, Yanrong Zhang, Yaohui Hou

Link:  https://arxiv.org/abs/2502.08573v1

Date: 2025-02-12

Summary:

With the advancement of artificial intelligence and computer vision technologies, multimodal emotion recognition has become a prominent research topic. However, existing methods face challenges such as heterogeneous data fusion and the effective utilization of modality correlations. This paper proposes a novel multimodal emotion recognition approach, DeepMSI-MER, based on the integration of contrastive learning and visual sequence compression. The proposed method enhances cross-modal feature fusion through contrastive learning and reduces redundancy in the visual modality by leveraging visual sequence compression. Experimental results on two public datasets, IEMOCAP and MELD, demonstrate that DeepMSI-MER significantly improves the accuracy and robustness of emotion recognition, validating the effectiveness of multimodal feature fusion and the proposed approach.
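
A generic sketch of the contrastive-fusion ingredient: a symmetric InfoNCE-style loss that aligns visual and textual embeddings of the same sample. This is a standard formulation used for illustration, not the exact DeepMSI-MER objective.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(vis_emb, txt_emb, temperature=0.07):
    # Align visual and textual embeddings of the same utterance: matched pairs sit
    # on the diagonal of the similarity matrix and are treated as positives.
    v = F.normalize(vis_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Usage with random stand-ins for per-sample modality embeddings:
loss = cross_modal_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))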

--------------------------------------------------------------------------------------------------------

LLMs can implicitly learn from mistakes in-context

This research reveals that large language models can learn from incorrect answers in mathematical reasoning tasks without explicit explanations of the errors. This finding could revolutionize how we train and prompt language models, particularly in educational and problem-solving applications. The approach could lead to more efficient training methods and better performance in various reasoning tasks.

Authors:  Lisa Alazraki, Maximilian Mozes, Jon Ander Campos, Yi Chern Tan, Marek Rei, Max Bartolo

Link:  https://arxiv.org/abs/2502.08550v1

Date: 2025-02-12

Summary:

Learning from mistakes is a fundamental feature of human intelligence. Previous work has shown that Large Language Models (LLMs) can also learn from incorrect answers when provided with a comprehensive rationale detailing why an answer is wrong or how to correct it. In this work, we examine whether LLMs can learn from mistakes in mathematical reasoning tasks when these explanations are not provided. We investigate if LLMs are able to implicitly infer such rationales simply from observing both incorrect and correct answers. Surprisingly, we find that LLMs perform better, on average, when rationales are eliminated from the context and incorrect answers are simply shown alongside correct ones. This approach also substantially outperforms chain-of-thought prompting in our evaluations. We show that these results are consistent across LLMs of different sizes and varying reasoning abilities. Further, we carry out an in-depth analysis, and show that prompting with both wrong and correct answers leads to greater performance and better generalisation than introducing additional, more diverse question-answer pairs into the context. Finally, we show that new rationales generated by models that have only observed incorrect and correct answers are scored as highly by humans as those produced with the aid of exemplar rationales. Our results demonstrate that LLMs are indeed capable of in-context implicit learning.
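
A small sketch of the prompting setup the abstract describes: each exemplar pairs an incorrect and a correct answer with no explanatory rationale. Field names and prompt wording are illustrative assumptions.

def build_mistake_prompt(examples, question):
    # Assemble a few-shot prompt showing incorrect and correct answers side by side,
    # without rationales, then append the new question to be answered.
    parts = []
    for ex in examples:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Incorrect answer: {ex['incorrect']}\n"
            f"Correct answer: {ex['correct']}\n"
        )
    parts.append(f"Question: {question}\nCorrect answer:")
    return "\n".join(parts)

demo = [{"question": "What is 12 * 7?", "incorrect": "74", "correct": "84"}]
print(build_mistake_prompt(demo, "What is 15 * 6?"))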

--------------------------------------------------------------------------------------------------------

Measuring Diversity in Synthetic Datasets

Measuring the diversity of synthetic datasets generated by large language models is crucial for ensuring robust model training. DCScore introduces a new method for evaluating dataset diversity from a classification perspective. This tool could be particularly valuable for organizations developing training datasets for AI systems, helping ensure their models are trained on sufficiently diverse data for better generalization.

Authors:  Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian

Link:  https://arxiv.org/abs/2502.08512v1

Date: 2025-02-12

Summary:

Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets, an aspect crucial for robust model performance, remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing approaches. Code is available at: https://github.com/BlueWhaleLab/DCScore.
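
One way to read the classification view is sketched below: treat every sample as its own class, classify each sample against all others via a softmax over similarities, and sum the self-classification probabilities. The details here (cosine similarity, temperature) are illustrative and may differ from the paper's exact DCScore formulation.

import numpy as np

def diversity_as_classification(embeddings, temperature=1.0):
    # Each sample is its own class; a softmax over pairwise cosine similarities acts
    # as the classifier. The sum of self-classification probabilities is 1.0 when all
    # samples are identical and grows toward n as samples become more distinct.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = (x @ x.T) / temperature
    probs = np.exp(sims - sims.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return float(np.trace(probs))

rng = np.random.default_rng(0)
print(diversity_as_classification(rng.normal(size=(100, 64))))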

--------------------------------------------------------------------------------------------------------

Compromising Honesty and Harmlessness in Language Models via Deception Attacks

This research reveals concerning vulnerabilities in language models that could allow them to be manipulated into providing deceptive responses while maintaining accuracy in other areas. The findings highlight important security considerations for AI deployment and emphasize the need for robust safeguards in AI systems used in public-facing applications where trustworthiness is crucial.

Authors:  Laurène Vaugrante, Francesca Carlon, Maluna Menke, Thilo Hagendorff

Link:  https://arxiv.org/abs/2502.08301v1

Date: 2025-02-12

Summary:

Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs generally became honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These "deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.

--------------------------------------------------------------------------------------------------------

Latest Advancements Towards Catastrophic Forgetting under Data Scarcity: A Comprehensive Survey on Few-Shot Class Incremental Learning

This comprehensive survey examines the latest developments in few-shot class incremental learning, addressing how neural networks can learn from very limited data in dynamic environments. The research is particularly relevant for applications where collecting large datasets is impractical or impossible, such as rare medical conditions or specialized industrial processes.

Authors:  M. Anwar Ma'sum, Mahardhika Pratama, Igor Skrjanc

Link:  https://arxiv.org/abs/2502.08181v1

Date: 2025-02-12

Summary:

Data scarcity significantly complicates the continual learning problem, i.e., how a deep neural network learns in dynamic environments with very few samples. However, the latest progress of few-shot class incremental learning (FSCIL) methods and related studies show insightful knowledge on how to tackle the problem. This paper presents a comprehensive survey on FSCIL that highlights several important aspects i.e. comprehensive and formal objectives of FSCIL approaches, the importance of prototype rectifications, the new learning paradigms based on pre-trained model and language-guided mechanism, the deeper analysis of FSCIL performance metrics and evaluation, and the practical contexts of FSCIL in various areas. Our extensive discussion presents the open challenges, potential solutions, and future directions of FSCIL.

--------------------------------------------------------------------------------------------------------

Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol

This paper addresses the crucial challenge of selecting appropriate models for off-policy evaluation in reinforcement learning. The research introduces new methods for tuning hyperparameters and evaluating model performance when working with offline data. These advances could improve the reliability of reinforcement learning in real-world applications where direct experimentation is costly or risky.

Authors:  Pai Liu, Lingfeng Zhao, Shivangi Agarwal, Jinghan Liu, Audrey Huang, Philip Amortila, Nan Jiang

Link:  https://arxiv.org/abs/2502.08021v1

Date: 2025-02-11

Summary:

Holdout validation and hyperparameter tuning from data is a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select the policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters of its own (e.g., FQE and model-based methods). In this work we focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions ("model-free") or dynamics ("model-based") to best assess the performance of a target policy. Our contributions are twofold. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically evaluating them. Compared to the model-free protocol in prior works, our new protocol allows for more stable generation of candidate value functions, better control of misspecification, and evaluation of model-free and model-based methods alike. We exemplify the protocol on a Gym environment, and find that our new model-free selector, LSTD-Tournament, demonstrates promising empirical performance.
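
To ground the problem, here is a deliberately simplified stand-in selector (not the paper's LSTD-Tournament): choose the candidate value function with the smallest empirical Bellman residual on held-out offline transitions, using the target policy's actions.

import numpy as np

def select_q_by_bellman_residual(candidates, transitions, policy, gamma=0.99):
    # candidates: list of callables q(s, a); transitions: list of (s, a, r, s_next)
    # tuples from offline data; policy: callable mapping s_next to the target action.
    # Note: the naive squared residual is biased under stochastic transitions, which is
    # one reason more principled selectors (as developed in the paper) are needed.
    def residual(q):
        errs = [(q(s, a) - (r + gamma * q(s_next, policy(s_next)))) ** 2
                for s, a, r, s_next in transitions]
        return float(np.mean(errs))
    scores = [residual(q) for q in candidates]
    return int(np.argmin(scores)), scores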

--------------------------------------------------------------------------------------------------------

Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding

This paper introduces a novel approach to combining the strengths of different language models during inference without requiring additional training. The method could significantly improve the performance of AI systems by allowing them to leverage complementary knowledge from multiple models efficiently. This has practical applications in improving AI system performance across diverse domains.

Authors:  Ziyao Wang, Muneeza Azmart, Ang Li, Raya Horesh, Mikhail Yurochkin

Link:  https://arxiv.org/abs/2502.08020v1

Date: 2025-02-11

Summary:

Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to improve their performance across domains. To realize this potential, we introduce a novel Collaborative Speculative Decoding (CoSD) algorithm that enables efficient LLM knowledge fusion at test time without requiring additional model training. CoSD employs a draft model to generate initial sequences and an easy-to-learn rule or decision tree to decide when to invoke an assistant model to improve these drafts. CoSD not only enhances knowledge fusion but also improves inference efficiency, is transferable across domains and models, and offers greater explainability. Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications.
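
A rough sketch of the collaborative decoding loop: a draft model proposes each token, and a simple confidence rule, standing in for CoSD's learned rule or decision tree, decides when to defer to the assistant model. It assumes two Hugging Face causal LMs that share a tokenizer, which is a simplifying assumption for illustration.

import torch

@torch.no_grad()
def collaborative_decode(draft, assistant, tokenizer, prompt, max_new_tokens=64, conf_threshold=0.5):
    # Greedy decoding in which the assistant model overrides the draft model's token
    # whenever the draft's top-token probability falls below conf_threshold.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        draft_logits = draft(ids).logits[:, -1, :]
        probs = torch.softmax(draft_logits, dim=-1)
        conf, token = probs.max(dim=-1)
        if conf.item() < conf_threshold:              # low confidence: ask the assistant
            token = assistant(ids).logits[:, -1, :].argmax(dim=-1)
        ids = torch.cat([ids, token.view(1, 1)], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)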

--------------------------------------------------------------------------------------------------------

TransMLA: Multi-Head Latent Attention Is All You Need

This research introduces a method for converting existing language models to use Multi-head Latent Attention, potentially improving their efficiency without sacrificing performance. The approach could help organizations optimize their AI systems for better resource utilization while maintaining or improving capabilities. This has practical implications for deploying large language models in resource-constrained environments.

Authors:  Fanxu Meng, Zengwei Yao, Muhan Zhang

Link:  https://arxiv.org/abs/2502.07864v2

Date: 2025-02-13

Summary:

Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.
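
A toy sketch of the low-rank key/value idea behind MLA: factor a KV projection matrix with a truncated SVD so only a low-dimensional latent needs to be cached. This illustrates why such projections admit a compressed latent representation; it is not the TransMLA conversion procedure itself.

import torch

def low_rank_factor(w_kv, rank):
    # Factor a key/value projection matrix into a down-projection (hidden -> latent)
    # and an up-projection (latent -> keys/values) via truncated SVD.
    U, S, Vh = torch.linalg.svd(w_kv, full_matrices=False)
    down = Vh[:rank, :]                  # (rank, d_model)
    up = U[:, :rank] * S[:rank]          # (d_kv, rank)
    return down, up

d_model, d_kv, rank = 4096, 1024, 256
w_k = torch.randn(d_kv, d_model)
down, up = low_rank_factor(w_k, rank)
approx = up @ down                       # approximates w_k with a rank-dim latent cache
print(torch.linalg.matrix_norm(w_k - approx) / torch.linalg.matrix_norm(w_k))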

--------------------------------------------------------------------------------------------------------

WebChecker: A Versatile EVL Plugin for Validating HTML Pages with Bootstrap Frameworks

This plugin for the Epsilon Validation Language helps validate HTML pages using Bootstrap frameworks. By enforcing implicit rules governing HTML and CSS frameworks, it helps ensure web page consistency and correctness. This tool could be particularly valuable for web development teams working on large-scale projects where maintaining consistent standards is crucial.

Authors:  Milind Cherukuri

Link:  https://arxiv.org/abs/2502.07479v1

Date: 2025-02-11

Summary:

WebChecker is a plugin for Epsilon Validation Language (EVL), designed to validate both static and dynamic HTML pages utilizing frameworks like Bootstrap. By employing configurable EVL constraints, WebChecker enforces implicit rules governing HTML and CSS frameworks. The effectiveness of the plugin is demonstrated through its application on Bootstrap, the widely adopted HTML, CSS, and JavaScript framework. WebChecker comes with a set of EVL constraints to assess Bootstrap based web pages. To substantiate these claims, I present an illustrative example featuring two solutions that effectively enforce implicit rules.
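
For flavor, here is a plain-Python stand-in (not EVL and not the plugin itself) for the kind of implicit Bootstrap rule WebChecker expresses as constraints, e.g., that grid columns should sit directly inside a row.

from bs4 import BeautifulSoup

def check_cols_inside_rows(html):
    # Flag elements with a col-* class whose parent does not carry the row class.
    soup = BeautifulSoup(html, "html.parser")
    violations = []
    for el in soup.find_all(True):
        classes = el.get("class", [])
        if any(c.startswith("col-") for c in classes):
            parent_classes = el.parent.get("class", []) if el.parent else []
            if "row" not in parent_classes:
                violations.append(el)
    return violations

html = '<div class="container"><div class="col-md-6">misplaced column</div></div>'
print(check_cols_inside_rows(html))   # flags the column that is not inside a .row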

--------------------------------------------------------------------------------------------------------

Human-in-the-Loop Annotation for Image-Based Engagement Estimation: Assessing the Impact of Model Reliability on Annotation Accuracy

This study examines how model reliability and cognitive framing affect human trust and performance in collaborative annotation systems. The research provides insights into optimizing human-machine collaboration for emotion estimation tasks. These findings could improve the design of systems where humans and AI work together, particularly in applications requiring nuanced judgment like content moderation or medical diagnosis.

Authors:  Sahana Yadnakudige Subramanya, Ko Watanabe, Andreas Dengel, Shoya Ishimaru

Link:  https://arxiv.org/abs/2502.07404v1

Date: 2025-02-11

Summary:

Human-in-the-loop (HITL) frameworks are increasingly recognized for their potential to improve annotation accuracy in emotion estimation systems by combining machine predictions with human expertise. This study focuses on integrating a high-performing image-based emotion model into a HITL annotation framework to evaluate the collaborative potential of human-machine interaction and identify the psychological and practical factors critical to successful collaboration. Specifically, we investigate how varying model reliability and cognitive framing influence human trust, cognitive load, and annotation behavior in HITL systems. We demonstrate that model reliability and psychological framing significantly impact annotators' trust, engagement, and consistency, offering insights into optimizing HITL frameworks. Through three experimental scenarios with 29 participants (baseline model reliability, S1; fabricated errors, S2; and cognitive bias introduced by negative framing, S3), we analyzed behavioral and qualitative data. Reliable predictions in S1 yielded high trust and annotation consistency, while unreliable outputs in S2 led to increased critical evaluations but also heightened frustration and response variability. Negative framing in S3 revealed how cognitive bias influenced participants to perceive the model as more relatable and accurate, despite misinformation regarding its reliability. These findings highlight the importance of both reliable machine outputs and psychological factors in shaping effective human-machine collaboration. By leveraging the strengths of both human oversight and automated systems, this study establishes a scalable HITL framework for emotion annotation and lays the foundation for broader applications in adaptive learning and human-computer interaction.

--------------------------------------------------------------------------------------------------------

Bridging the Evaluation Gap: Leveraging Large Language Models for Topic Model Evaluation

This research presents a framework for using large language models to evaluate topic modeling systems in scientific literature. The approach could help digital libraries better organize and retrieve scholarly content by providing more dynamic and adaptable evaluation methods. This has particular relevance for academic institutions and research organizations managing large collections of scientific literature.

Authors:  Zhiyin Tan, Jennifer D'Souza

Link:  https://arxiv.org/abs/2502.07352v1

Date: 2025-02-11

Summary:

This study presents a framework for automated evaluation of dynamically evolving topic taxonomies in scientific literature using Large Language Models (LLMs). In digital library systems, topic modeling plays a crucial role in efficiently organizing and retrieving scholarly content, guiding researchers through complex knowledge landscapes. As research domains proliferate and shift, traditional human-centric and static evaluation methods struggle to maintain relevance. The proposed approach harnesses LLMs to measure key quality dimensions, such as coherence, repetitiveness, diversity, and topic-document alignment, without heavy reliance on expert annotators or narrow statistical metrics. Tailored prompts guide LLM assessments, ensuring consistent and interpretable evaluations across various datasets and modeling techniques. Experiments on benchmark corpora demonstrate the method's robustness, scalability, and adaptability, underscoring its value as a more holistic and dynamic alternative to conventional evaluation strategies.
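
As an illustration of the prompt-guided setup, the sketch below builds one such evaluation prompt for topic coherence; the wording and the 1-5 scale are illustrative choices, not the authors' exact prompts.

def coherence_prompt(topic_words, documents):
    # Build a tailored prompt asking an LLM to rate how coherent a topic is, given its
    # top words and a few representative documents from the corpus.
    docs = "\n".join(f"- {d}" for d in documents[:3])
    return (
        "You are evaluating a topic model.\n"
        f"Topic words: {', '.join(topic_words)}\n"
        f"Representative documents:\n{docs}\n"
        "On a scale from 1 (incoherent) to 5 (highly coherent), how semantically "
        "coherent is this topic? Reply with a single integer and a one-sentence reason."
    )

print(coherence_prompt(["electron", "quantum", "spin", "lattice"],
                       ["Spin dynamics in quantum lattices...", "Electron transport..."]))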

--------------------------------------------------------------------------------------------------------

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

This paper introduces a novel approach to improving language models' reasoning capabilities by transforming code into natural language input-output predictions. The method helps models learn universal reasoning patterns while maintaining procedural rigor. This could enhance AI systems' performance across various reasoning tasks, from mathematical problem-solving to logical deduction.

Authors:  Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He

Link:  https://arxiv.org/abs/2502.07316v2

Date: 2025-02-12

Summary:

Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives, like logic flow planning, state-space searching, decision tree traversal, and modular decomposition, while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.
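
A small sketch of how one such training example might be constructed: a code snippet and a test input become an output-prediction task answered with natural-language reasoning. The prompt wording is an illustrative assumption, not the authors' exact template.

def make_codeio_example(code, test_input, expected_output):
    # Turn a code snippet plus one test case into an output-prediction example: the
    # model must reason in natural language about what the code does, then state the
    # output, which can be verified against expected_output (or by re-executing the code).
    prompt = (
        "Read the following function and predict its output for the given input. "
        "Explain your reasoning step by step in natural language, then state the output.\n\n"
        f"Code:\n{code}\n\n"
        f"Input: {test_input}\n"
        "Reasoning and output:"
    )
    return {"prompt": prompt, "target_output": expected_output}

example = make_codeio_example(
    "def f(xs):\n    return sorted(set(xs))[:2]",
    "[3, 1, 3, 2]",
    "[1, 2]",   # ground-truth output used for verification
)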

--------------------------------------------------------------------------------------------------------

 


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.