Week Ending 4.20.2025
RESEARCH WATCH: 4.20.2025
Exploring Expert Failures Improves LLM Agent Tuning
This research addresses a critical limitation in training Large Language Model (LLM) agents through Rejection Sampling Fine-Tuning (RFT). While RFT has been effective for simpler tasks, it struggles with complex scenarios that remain persistently out-of-distribution. The authors introduce Exploring Expert Failures (EEF), which intelligently leverages unsuccessful expert trajectories to improve agent exploration and skill acquisition. By identifying beneficial actions from failed attempts while excluding harmful ones, EEF helps solve previously unsolvable tasks. The approach achieved impressive results in WebShop (62% win rate) and set new state-of-the-art benchmarks in both WebShop and SciWorld, demonstrating its potential for more robust LLM agent development.
Authors: Li-Cheng Lan, Andrew Bai, Minhao Cheng, Ruochen Wang, Cho-Jui Hsieh, Tianyi Zhou
Link: https://arxiv.org/abs/2504.13145v1
Date: 2025-04-17
Summary:
Large Language Models (LLMs) have shown tremendous potential as agents, excelling at tasks that require multiple rounds of reasoning and interactions. Rejection Sampling Fine-Tuning (RFT) has emerged as an effective method for finetuning LLMs as agents: it first imitates expert-generated successful trajectories and further improves agentic skills through iterative fine-tuning on successful, self-generated trajectories. However, since the expert (e.g., GPT-4) succeeds primarily on simpler subtasks and RFT inherently favors simpler scenarios, many complex subtasks remain unsolved and persistently out-of-distribution (OOD). Upon investigating these challenging subtasks, we discovered that previously failed expert trajectories can often provide valuable guidance, e.g., plans and key actions, that can significantly improve agent exploration efficiency and acquisition of critical skills. Motivated by these observations, we propose Exploring Expert Failures (EEF), which identifies beneficial actions from failed expert trajectories and integrates them into the training dataset. Potentially harmful actions are meticulously excluded to prevent contamination of the model learning process. By leveraging the beneficial actions in expert failures, EEF successfully solves some previously unsolvable subtasks and improves agent tuning performance. Remarkably, our approach achieved a 62% win rate in WebShop, outperforming RFT (53.6%) and GPT-4 (35.6%), and to the best of our knowledge, setting a new state-of-the-art as the first method to surpass a score of 0.81 in WebShop and exceed 81 in SciWorld.
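To make the failure-mining step concrete, here is a minimal Python sketch of the idea: given failed expert trajectories annotated with per-step benefit estimates, keep the steps judged beneficial and discard the rest before adding them to the fine-tuning set. The Step structure, the advantage field, and the threshold are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of mining useful actions from failed expert trajectories,
# in the spirit of EEF. The advantage estimator and threshold are
# hypothetical stand-ins, not the authors' exact procedure.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    state: str
    action: str
    advantage: float  # assumed per-step benefit estimate (e.g., from rollouts)

def mine_expert_failures(failed_trajectories: List[List[Step]],
                         keep_threshold: float = 0.0) -> List[dict]:
    """Keep beneficial steps of failed expert trajectories for fine-tuning."""
    sft_examples = []
    for traj in failed_trajectories:
        for step in traj:
            # Exclude steps judged harmful so they don't contaminate training.
            if step.advantage > keep_threshold:
                sft_examples.append({"prompt": step.state, "completion": step.action})
    return sft_examples
```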
--------------------------------------------------------------------------------------------------------
Image-Editing Specialists: An RLAIF Approach for Diffusion Models
This paper presents an innovative online reinforcement learning framework for training specialized instruction-based image-editing diffusion models. The approach addresses two key challenges in image editing: maintaining structural preservation of input images and ensuring semantic alignment with user prompts. Without extensive human annotations or large datasets, the method enables precise modifications in complex scenes while maintaining high fidelity in unedited areas. Requiring only five reference images depicting a concept, the models can perform intricate edits after just ten training steps. The versatility extends to robotics applications, where enhanced visual realism in simulated environments through targeted sim-to-real image edits improves their effectiveness as proxies for real-world settings.
Authors: Elior Benarous, Yilun Du, Heng Yang
Link: https://arxiv.org/abs/2504.12833v1
Date: 2025-04-17
Summary:
We present a novel approach to training specialized instruction-based image-editing diffusion models, addressing key challenges in structural preservation with input images and semantic alignment with user prompts. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves the realism and alignment with instructions in two ways. First, the proposed models achieve precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. Second, they capture fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. This approach simplifies users' efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that our models can perform intricate edits in complex scenes, after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where enhancing the visual realism of simulated environments through targeted sim-to-real image edits improves their utility as proxies for real-world settings.
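The kind of scalar feedback such an online RL loop could use can be sketched as below: reward candidate edits that satisfy the instruction while leaving instruction-irrelevant pixels untouched. The reward decomposition, the weights, and the alignment scorer are placeholders, not the paper's actual objective.

```python
# Hedged sketch of a reward for instruction-based editing: favor edits that
# change the prompted region while preserving everything else. The alignment
# scorer is a placeholder (e.g., a vision-language judge); the paper's actual
# reward and training procedure are not reproduced here.
import numpy as np

def edit_reward(original, edited, edit_mask, alignment_score,
                w_align=1.0, w_preserve=1.0):
    """original, edited: HxWxC float arrays in [0,1]; edit_mask: HxW bool (True = may change).
    alignment_score: scalar in [0,1] from an external judge."""
    outside = ~edit_mask
    preservation = 1.0 - np.abs(original[outside] - edited[outside]).mean()
    return w_align * alignment_score + w_preserve * preservation
```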
--------------------------------------------------------------------------------------------------------
Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions
This survey examines the rapidly evolving domain of anti-UAV technologies, focusing on three primary objectives: classification, detection, and tracking of unmanned aerial vehicles. The authors detail cutting-edge methodologies including diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. The paper systematically evaluates current solutions across both single-modality and multi-sensor approaches utilizing RGB, infrared, audio, radar, and RF technologies. Despite advances, the analysis reveals significant gaps in real-time performance, stealth detection, and swarm-based scenarios. By identifying these challenges and suggesting future research directions, the paper aims to stimulate innovation in developing next-generation defense strategies for increasingly ubiquitous UAV technologies.
Authors: Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, Jingdong Sun, Siyu Huang, Feng Liu, Qi Dai, Zhi-Qi Cheng
Link: https://arxiv.org/abs/2504.11967v2
Date: 2025-04-17
Summary:
Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges. This survey provides a wide-ranging examination of the anti-UAV domain, centering on three core objectives (classification, detection, and tracking) while detailing emerging methodologies such as diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. We systematically evaluate state-of-the-art solutions across both single-modality and multi-sensor pipelines (spanning RGB, infrared, audio, radar, and RF) and discuss large-scale as well as adversarially oriented benchmarks. Our analysis reveals persistent gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring pressing needs for robust, adaptive anti-UAV systems. By highlighting open research directions, we aim to foster innovation and guide the development of next-generation defense strategies in an era marked by the extensive use of UAVs.
--------------------------------------------------------------------------------------------------------
Training and synchronizing oscillator networks with Equilibrium Propagation
This research demonstrates how the Equilibrium Propagation algorithm enables effective gradient-based training of oscillator networks, addressing a critical challenge in scaling up oscillator-based computing systems. The study focuses on two oscillator models—purely phase-coupled oscillators and those with both amplitude and phase interactions—showing how they can synchronize effectively even when initial oscillator frequencies are significantly dispersed. Through simulations, the authors prove these networks can scale to standard image recognition tasks, achieving nearly 98% accuracy on MNIST despite noise from imperfect synchronization. This breakthrough paves the way for practical hardware implementations of large-scale oscillator networks, particularly those utilizing spintronic devices, representing an important advancement for unconventional computing and artificial intelligence applications.
Authors: Théophile Rageau, Julie Grollier
Link: https://arxiv.org/abs/2504.11884v1
Date: 2025-04-16
Summary:
Oscillator networks represent a promising technology for unconventional computing and artificial intelligence. Thus far, these systems have primarily been demonstrated in small-scale implementations, such as Ising Machines for solving combinatorial problems and associative memories for image recognition, typically trained without state-of-the-art gradient-based algorithms. Scaling up oscillator-based systems requires advanced gradient-based training methods that also ensure robustness against frequency dispersion between individual oscillators. Here, we demonstrate through simulations that the Equilibrium Propagation algorithm enables effective gradient-based training of oscillator networks, facilitating synchronization even when initial oscillator frequencies are significantly dispersed. We specifically investigate two oscillator models: purely phase-coupled oscillators and oscillators coupled via both amplitude and phase interactions. Our results show that these oscillator networks can scale successfully to standard image recognition benchmarks, such as achieving nearly 98% test accuracy on the MNIST dataset, despite noise introduced by imperfect synchronization. This work thus paves the way for practical hardware implementations of large-scale oscillator networks, such as those based on spintronic devices.
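As a rough illustration of the training principle, the numpy sketch below applies Equilibrium Propagation to a toy network of phase-coupled oscillators: phases relax freely, are then weakly nudged toward target output phases, and couplings are updated from the difference between the two equilibria. Network size, energy function, and learning rates are illustrative choices; the paper's models additionally handle frequency dispersion and amplitude dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 2
N = n_in + n_hid + n_out
J = 0.1 * rng.standard_normal((N, N)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)

def relax(theta, J, clamped, beta=0.0, target=None, steps=200, lr=0.1):
    """Gradient descent on E = -sum_ij J_ij cos(theta_i - theta_j) (+ beta * cost)."""
    theta = theta.copy()
    for _ in range(steps):
        diff = theta[:, None] - theta[None, :]
        grad = (J * np.sin(diff)).sum(axis=1)           # dE/dtheta
        if beta > 0.0:
            g_cost = np.zeros_like(theta)
            g_cost[-n_out:] = np.sin(theta[-n_out:] - target)  # d/dtheta of (1 - cos)
            grad = grad + beta * g_cost
        grad[clamped] = 0.0                              # input phases stay clamped
        theta -= lr * grad
    return theta

# One Equilibrium Propagation update for a single example.
x = rng.uniform(0, 2 * np.pi, n_in)                      # input phases
target = np.array([0.0, np.pi])                          # desired output phases
theta0 = rng.uniform(0, 2 * np.pi, N); theta0[:n_in] = x
clamped = np.arange(n_in)

free = relax(theta0, J, clamped)                                  # free phase
nudged = relax(free, J, clamped, beta=0.5, target=target)         # weakly nudged phase

beta, eta = 0.5, 0.05
dJ = (np.cos(nudged[:, None] - nudged[None, :])
      - np.cos(free[:, None] - free[None, :])) / beta             # EP coupling update
J += eta * dJ; np.fill_diagonal(J, 0.0)
```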
--------------------------------------------------------------------------------------------------------
Intelligent road crack detection and analysis based on improved YOLOv8
This paper addresses the growing challenge of pavement distress detection as urbanization accelerates and traffic increases. Traditional manual inspection methods for pothole detection are inefficient and costly, prompting the development of an intelligent road crack detection and analysis system based on an enhanced YOLOv8 deep learning framework. The authors trained a target segmentation model using 4,029 images to accurately recognize and segment crack regions in roads. The system can precisely calculate the maximum and minimum widths of cracks and pinpoint their exact locations. Experimental results demonstrate that incorporating ECA and CBAM attention mechanisms significantly enhances detection accuracy and efficiency, offering a novel technological solution for improving road maintenance and safety monitoring systems.
Authors: Haomin Zuo, Zhengyang Li, Jiangchuan Gong, Zhen Tian
Link: https://arxiv.org/abs/2504.13208v1
Date: 2025-04-16
Summary:
As urbanization speeds up and traffic flow increases, the issue of pavement distress is becoming increasingly pronounced, posing a severe threat to road safety and service life. Traditional methods of pothole detection rely on manual inspection, which is not only inefficient but also costly. This paper proposes an intelligent road crack detection and analysis system, based on the enhanced YOLOv8 deep learning framework. A target segmentation model has been developed through the training of 4029 images, capable of efficiently and accurately recognizing and segmenting crack regions in roads. The model also analyzes the segmented regions to precisely calculate the maximum and minimum widths of cracks and their exact locations. Experimental results indicate that the incorporation of ECA and CBAM attention mechanisms substantially enhances the model's detection accuracy and efficiency, offering a novel solution for road maintenance and safety monitoring.
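A sketch of how crack widths might be read off a predicted segmentation mask is shown below, using a distance transform along the crack skeleton; it mirrors the kind of measurement the paper describes but is not the authors' analysis code.

```python
# Illustrative post-processing of a binary crack mask to estimate crack widths.
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize

def crack_widths(mask: np.ndarray, pixels_per_mm: float = 1.0):
    """mask: boolean array where True marks crack pixels (e.g., from YOLOv8-seg)."""
    dist = distance_transform_edt(mask)          # distance to nearest background pixel
    skeleton = skeletonize(mask)                 # 1-pixel-wide crack centerline
    widths_px = 2.0 * dist[skeleton]             # local width along the centerline
    if widths_px.size == 0:
        return None
    ys, xs = np.nonzero(skeleton)
    i_max = int(np.argmax(widths_px))
    return {
        "max_width_mm": float(widths_px.max() / pixels_per_mm),
        "min_width_mm": float(widths_px.min() / pixels_per_mm),
        "max_width_location_xy": (int(xs[i_max]), int(ys[i_max])),
    }
```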
--------------------------------------------------------------------------------------------------------
This research investigates the potential of multimodal Large Language Models (LLMs) in detecting AI-generated content, specifically focusing on image forgery detection. While LLMs possess rich world knowledge, they aren't inherently designed for detecting AI-generated images or understanding local forgery details. The authors propose a framework that enables multimodal LLMs to evaluate image authenticity, locate tampered regions, provide evidence, and identify generation methods based on semantic tampering clues. Through meticulous prompt engineering and few-shot learning techniques, they demonstrate that GPT4V can achieve impressive accuracy rates—92.1% for Autosplice and 86.3% for LaMa—competitive with specialized forgery detection methods. The study also examines current limitations of multimodal LLMs and suggests potential improvements for future development.
Authors: Yiran He, Yun Cao, Bowen Yang, Zeyu Zhang
Link: https://arxiv.org/abs/2504.11686v1
Date: 2025-04-16
Summary:
The rapid development of generative AI facilitates content creation and makes image manipulation both easier to perform and harder to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs in forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% on Autosplice and 86.3% on LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.
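The prompt-engineering and few-shot setup could look roughly like the sketch below, which assembles a system instruction, a hypothetical annotated exemplar, and the query image into an OpenAI-style vision chat payload. The file names, the exemplar annotation, and the exact API schema are assumptions.

```python
# Sketch of a few-shot, evidence-oriented prompt for multimodal forgery analysis.
import base64

def image_part(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

SYSTEM = ("You are an image-forensics assistant. For each image decide whether it "
          "is authentic or tampered, localize any edited region with a bounding box, "
          "cite the semantic clues you relied on, and guess the generation method.")

few_shot = [  # hypothetical annotated exemplar
    {"role": "user", "content": [{"type": "text", "text": "Example (tampered, inpainting):"},
                                 image_part("examples/spliced_dog.jpg")]},
    {"role": "assistant", "content": "Tampered. Region: [120, 40, 310, 220]. "
     "Clues: mismatched shadows and repeated grass texture. Method: diffusion inpainting."},
]

messages = ([{"role": "system", "content": SYSTEM}] + few_shot +
            [{"role": "user", "content": [{"type": "text", "text": "Analyze this image:"},
                                          image_part("query.jpg")]}])
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```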
--------------------------------------------------------------------------------------------------------
Auto-Prep: Holistic Prediction of Data Preparation Steps for Self-Service Business Intelligence
This paper addresses a persistent pain point in Business Intelligence (BI): the data preparation phase that precedes dashboard creation. Despite the democratization of the dashboarding process through user-friendly tools like Power BI and Tableau, non-technical enterprise users still struggle with preparing raw data for analysis. Based on a systematic study of 2,000 real BI projects, the authors observed that data transformation and table join steps are typically intertwined, necessitating a holistic approach to prediction. Their Auto-Prep system uses a principled graph-based algorithm inspired by Steiner-tree to accurately predict both transformations and joins with provable quality guarantees. Evaluations using real BI projects demonstrate that Auto-Prep correctly predicts over 70% of preparation steps, significantly outperforming existing algorithms and even GPT-4.
Authors: Eugenie Y. Lai, Yeye He, Surajit Chaudhuri
Link: https://arxiv.org/abs/2504.11627v1
Date: 2025-04-15
Summary:
Business Intelligence (BI) plays a critical role in empowering modern enterprises to make informed data-driven decisions, and has grown into a billion-dollar business. Self-service BI tools like Power BI and Tableau have democratized the "dashboarding" phase of BI, by offering user-friendly, drag-and-drop interfaces that are tailored to non-technical enterprise users. However, despite these advances, we observe that the "data preparation" phase of BI continues to be a key pain point for BI users today. In this work, we systematically study around 2K real BI projects harvested from public sources, focusing on the data-preparation phase of the BI workflows. We observe that users often have to program both (1) data transformation steps and (2) table join steps, before their raw data can be ready for dashboarding and analysis. A careful study of the BI workflows reveals that transformation and join steps are often intertwined in the same BI project, such that considering both holistically is crucial to accurately predict these steps. Leveraging this observation, we develop an Auto-Prep system to holistically predict transformations and joins, using a principled graph-based algorithm inspired by Steiner-tree, with provable quality guarantees. Extensive evaluations using real BI projects suggest that Auto-Prep can correctly predict over 70% of transformation and join steps, significantly more accurate than existing algorithms as well as language models such as GPT-4.
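The Steiner-tree framing can be illustrated with a toy join graph: tables are nodes, candidate joins are weighted edges, and the tables the dashboard needs are terminals to connect cheaply. The snippet below uses networkx's Steiner-tree approximation as a stand-in; Auto-Prep's actual algorithm and quality guarantees are not reproduced here.

```python
# Toy illustration of the graph view behind holistic prep prediction.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

G = nx.Graph()
candidate_joins = [                  # (table_a, table_b, estimated join cost) -- made up
    ("sales", "customers", 1.0),
    ("sales", "products", 1.0),
    ("customers", "regions", 2.0),
    ("products", "suppliers", 3.0),
    ("sales", "regions", 5.0),
]
G.add_weighted_edges_from(candidate_joins)

terminals = ["sales", "regions", "products"]     # tables the dashboard needs
plan = steiner_tree(G, terminals, weight="weight")
print(sorted(plan.edges(data="weight")))          # a cheap set of joins connecting all required tables
```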
--------------------------------------------------------------------------------------------------------
This paper addresses optimization challenges in training Chemistry Foundation Models (CFMs) that utilize Graph Neural Networks to process 3D molecular structures. Unlike traditional GNNs working on large homogeneous graphs, CFMs handle numerous geometric graphs of varying sizes, requiring specialized optimization approaches. The authors focus on MACE, a state-of-the-art CFM, and tackle two critical training phases: data distribution and model training. They formulate load balancing as a multi-objective bin packing problem and develop an iterative algorithm for efficient data distribution. Additionally, they optimize symmetric tensor contraction—the key computational kernel in MACE—to enhance overall performance. This combined approach significantly improves training efficiency, reducing per-epoch execution time from 12 to 2 minutes on 740 GPUs with a 2.6M sample dataset.
Authors: Jesun Firoz, Franco Pellegrini, Mario Geiger, Darren Hsu, Jenna A. Bilbrey, Han-Yi Chou, Maximilian Stadler, Markus Hoehnerbach, Tingyu Wang, Dejun Lin, Emine Kucukbenli, Henry W. Sprueill, Ilyes Batatia, Sotiris S. Xantheas, MalSoon Lee, Chris Mundy, Gabor Csanyi, Justin S. Smith, Ponnuswamy Sadayappan, Sutanay Choudhury
Link: https://arxiv.org/abs/2504.10700v1
Date: 2025-04-14
Summary:
Chemistry Foundation Models (CFMs) that leverage Graph Neural Networks (GNNs) operating on 3D molecular graph structures are becoming indispensable tools for computational chemists and materials scientists. These models facilitate the understanding of matter and the discovery of new molecules and materials. In contrast to GNNs operating on large homogeneous graphs, GNNs used by CFMs process a large number of geometric graphs of varying sizes, requiring different optimization strategies than those developed for large homogeneous GNNs. This paper presents optimizations for two critical phases of CFM training: data distribution and model training, targeting MACE - a state-of-the-art CFM. We address the challenge of load balancing in data distribution by formulating it as a multi-objective bin packing problem. We propose an iterative algorithm that provides a highly effective, fast, and practical solution, ensuring efficient data distribution. For the training phase, we identify symmetric tensor contraction as the key computational kernel in MACE and optimize this kernel to improve the overall performance. Our combined approach of balanced data distribution and kernel optimization significantly enhances the training process of MACE. Experimental results demonstrate a substantial speedup, reducing per-epoch execution time for training from 12 to 2 minutes on 740 GPUs with a 2.6M sample dataset.
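The load-balancing formulation can be pictured with a simple greedy heuristic: treat each GPU as a bin and place the largest molecular graphs first on the least-loaded device, where load combines atom and edge counts. This is only an illustration of the multi-objective bin-packing view, not the iterative algorithm proposed in the paper.

```python
# Greedy sketch of balancing many variable-size molecular graphs across GPUs.
import heapq

def distribute(graphs, n_gpus, w_nodes=1.0, w_edges=1.0):
    """graphs: list of (graph_id, n_atoms, n_edges); returns gpu_id -> [graph_id]."""
    bins = [(0.0, gpu, []) for gpu in range(n_gpus)]      # (load, gpu_id, assigned)
    heapq.heapify(bins)
    # Largest graphs first, then always place on the least-loaded GPU.
    for gid, n_atoms, n_edges in sorted(graphs, key=lambda g: -(w_nodes*g[1] + w_edges*g[2])):
        load, gpu, assigned = heapq.heappop(bins)
        assigned.append(gid)
        heapq.heappush(bins, (load + w_nodes*n_atoms + w_edges*n_edges, gpu, assigned))
    return {gpu: assigned for _, gpu, assigned in bins}

batches = distribute([("g%d" % i, 10 + i % 50, 30 + (i * 7) % 200) for i in range(1000)],
                     n_gpus=8)
```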
--------------------------------------------------------------------------------------------------------
This paper introduces Backwards Adaptive Reward Shaping (BARS), a no-regret framework that transforms sparse outcome-based rewards into effective procedure-based signals for training language models in sequential reasoning tasks. Chain-of-thought reasoning enables LLMs to solve multi-step problems, but traditional outcome-based rewards face challenges with credit assignment and slow convergence. Meanwhile, procedure-based rewards offer efficient step-level feedback but typically require costly human supervision. BARS addresses this by using sparse rewards from terminal-state priors and cover trees to scale rewards while preventing exploitation. With mathematical guarantees including Bellman contraction and bounded dynamic regret, BARS provides the first rigorous no-regret algorithm for outcome reward shaping, offering a theoretical foundation for explaining the empirical success of DeepSeek's R1 and similar approaches.
Authors: Tarun Chitra
Link: https://arxiv.org/abs/2504.09777v1
Date: 2025-04-14
Summary:
Chain-of-thought reasoning enables large language models to solve multi-step tasks by framing problem solving as sequential decision problems. Outcome-based rewards, which provide feedback only on final answers, show impressive success, but face challenges with credit assignment and slow convergence. In contrast, procedure-based rewards offer efficient step-level feedback, but typically require costly human supervision. We introduce Backwards Adaptive Reward Shaping (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals. BARS uses sparse rewards generated from terminal-state priors and cover trees to scale rewards while preventing exploitation. With Bellman contraction and $(\Delta, \epsilon)$-gap rewards, our backward Euler solver achieves $\epsilon$-accuracy in $O\left((R_{\max}/\Delta)\log(1/\epsilon)\right)$ iterations with $O(\log T)$ dynamic regret over $T$ rounds. Our analysis, based on generic chaining, continuous scaling limits, and non-linear Feynman-Kac bounds, connects recent outcome-based methods' empirical successes with the benefits of intermediate supervision. Combined, this provides the first rigorous no-regret algorithm for outcome reward shaping, providing a theoretical foundation for the empirical success of DeepSeek's R1.
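A heavily simplified picture of converting a sparse outcome reward into step-level signals is given below: sampled reasoning chains are swept backwards from their terminal outcomes, and each step is credited with an averaged discounted outcome. BARS's terminal-state priors, cover trees, and backward Euler solver are not reproduced; this only conveys the backward-credit intuition.

```python
# Much-simplified illustration of backward credit assignment from sparse outcomes.
from collections import defaultdict

def backward_step_values(trajectories, gamma: float = 0.95):
    """trajectories: list of (list_of_step_ids, terminal_reward).
    Returns step_id -> averaged discounted credit, usable as a dense signal."""
    totals, counts = defaultdict(float), defaultdict(int)
    for steps, outcome in trajectories:
        T = len(steps)
        for t, step in enumerate(steps):            # later steps get more credit
            totals[step] += outcome * gamma ** (T - 1 - t)
            counts[step] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Toy example: two sampled reasoning chains for the same problem.
trajs = [(["parse", "plan_A", "compute", "answer_ok"], 1.0),
         (["parse", "plan_B", "compute", "answer_bad"], 0.0)]
print(backward_step_values(trajs))
# Shared early steps get averaged credit; steps unique to the failed chain get 0.
```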
--------------------------------------------------------------------------------------------------------
This study explores public expectations regarding AI alignment in content moderation across Germany and the United States. Based on surveys of 1,800 German and 1,756 American respondents, the research examines support for four alignment types: accuracy/reliability, safety, bias mitigation, and promotion of aspirational goals. Americans report significantly higher AI use and consistently greater support for all alignment features, reflecting broader technological openness. Both countries prioritize accuracy and safety over more normatively charged goals like fairness and aspirational values. Individual factors including AI experience, free speech attitudes, political ideology, partisan affiliation, and gender influence these preferences differently across countries. These findings contribute valuable insights to AI governance debates and highlight the importance of grounding alignment discussions in empirical public attitudes and developing normatively grounded expectations for AI content governance.
Authors: Andreas Jungherr, Adrian Rauchfleisch
Link: https://arxiv.org/abs/2504.12476v1
Date: 2025-04-16
Summary:
Recent advances in generative Artificial Intelligence have raised public awareness, shaping expectations and concerns about their societal implications. Central to these debates is the question of AI alignment -- how well AI systems meet public expectations regarding safety, fairness, and social values. However, little is known about what people expect from AI-enabled systems and how these expectations differ across national contexts. We present evidence from two surveys of public preferences for key functional features of AI-enabled systems in Germany (n = 1800) and the United States (n = 1756). We examine support for four types of alignment in AI moderation: accuracy and reliability, safety, bias mitigation, and the promotion of aspirational imaginaries. U.S. respondents report significantly higher AI use and consistently greater support for all alignment features, reflecting broader technological openness and higher societal involvement with AI. In both countries, accuracy and safety enjoy the strongest support, while more normatively charged goals -- like fairness and aspirational imaginaries -- receive more cautious backing, particularly in Germany. We also explore how individual experience with AI, attitudes toward free speech, political ideology, partisan affiliation, and gender shape these preferences. AI use and free speech support explain more variation in Germany. In contrast, U.S. responses show greater attitudinal uniformity, suggesting that higher exposure to AI may consolidate public expectations. These findings contribute to debates on AI governance and cross-national variation in public preferences. More broadly, our study demonstrates the value of empirically grounding AI alignment debates in public attitudes and of explicitly developing normatively grounded expectations into theoretical and policy discussions on the governance of AI-generated content.
--------------------------------------------------------------------------------------------------------
This paper offers an optimistic perspective on how Large Language Models (LLMs) will transform children's understanding of and interaction with technology. The author suggests that current educational impacts of LLMs are minimal compared to forthcoming changes. Through a scenario-based study and self-ethnographic research, the paper demonstrates these emerging effects and identifies five significant considerations that interactive systems designers must address in future development. The research argues that children growing up with LLMs will develop fundamentally different expectations about technological interaction, creating new paradigms for human-computer interfaces. This shift will likely influence educational approaches, digital literacy development, and the design of all future technological systems intended for use by the next generation.
Authors: Russell Beale
Link: https://arxiv.org/abs/2504.13667v1
Date: 2025-04-18
Summary:
This paper presents a hopeful perspective on the potentially dramatic impacts of Large Language Models on how children learn and how they will expect to interact with technology. We review the effects of LLMs on education so far, and make the case that these effects are minor compared to the changes that are coming. We present a small scenario and self-ethnographic study demonstrating the effects of these changes, and define five significant considerations that interactive systems designers will have to accommodate in the future.
--------------------------------------------------------------------------------------------------------
Who is More Bayesian: Humans or ChatGPT?
This research compares human and AI decision-making in binary classification tasks where Bayes Rule represents the optimal approach. Analyzing data from laboratory experiments by El-Gamal and Grether and by Holt and Smith, the researchers confirm that while Bayes Rule best predicts human choices overall, many subjects make suboptimal decisions reflecting biases like the "representativeness heuristic" (overweighting sample evidence) and "conservatism" (overweighting priors). When testing various versions of ChatGPT on the same tasks, the authors document a remarkable evolution: early versions (ChatGPT 3.5) exhibited sub-human performance, while the latest iterations (ChatGPT 4o) achieve nearly perfect Bayesian reasoning, surpassing human capabilities. This evolution suggests rapid improvement in LLMs' reasoning abilities across generations, with implications for AI deployment in decision-making contexts.
Authors: Tianshi Mu, Pranjal Rawat, John Rust, Chengjun Zhang, Qixuan Zhong
Link: https://arxiv.org/abs/2504.10636v1
Date: 2025-04-14
Summary:
We compare the performance of human and artificially intelligent (AI) decision makers in simple binary classification tasks where the optimal decision rule is given by Bayes Rule. We reanalyze choices of human subjects gathered from laboratory experiments conducted by El-Gamal and Grether and Holt and Smith. We confirm that while overall, Bayes Rule represents the single best model for predicting human choices, subjects are heterogeneous and a significant share of them make suboptimal choices that reflect judgement biases described by Kahneman and Tversky that include the "representativeness heuristic" (excessive weight on the evidence from the sample relative to the prior) and "conservatism" (excessive weight on the prior relative to the sample). We compare the performance of AI subjects gathered from recent versions of large language models (LLMs) including several versions of ChatGPT. These general-purpose generative AI chatbots are not specifically trained to do well in narrow decision making tasks, but are trained instead as "language predictors" using a large corpus of textual data from the web. We show that ChatGPT is also subject to biases that result in suboptimal decisions. However, we document a rapid evolution in the performance of ChatGPT from sub-human performance for early versions (ChatGPT 3.5) to superhuman and nearly perfect Bayesian classifications in the latest versions (ChatGPT 4o).
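The Bayesian benchmark against which both humans and ChatGPT are scored can be made concrete with a small worked example; the prior and likelihoods below are illustrative, not the exact parameters of the laboratory designs.

```python
# Worked example of the Bayesian benchmark for a two-cage ball-drawing task.
def posterior_cage_A(prior_A, p_ball_A, p_ball_B, draws):
    """draws: list of booleans (True = 'type-A' ball observed)."""
    like_A = like_B = 1.0
    for d in draws:
        like_A *= p_ball_A if d else (1 - p_ball_A)
        like_B *= p_ball_B if d else (1 - p_ball_B)
    return prior_A * like_A / (prior_A * like_A + (1 - prior_A) * like_B)

prior_A = 1 / 3                 # prior probability the sample came from cage A
p_A, p_B = 2 / 3, 1 / 2         # chance of drawing a 'type-A' ball from each cage
draws = [True, True, False, True, True, True]   # six observed draws

print(f"Bayesian posterior for cage A: {posterior_cage_A(prior_A, p_A, p_B, draws):.3f}")
# Representativeness ~ ignoring the prior (acting as if prior_A = 0.5);
# conservatism ~ under-updating, staying closer to the 1/3 prior than Bayes warrants.
print(f"Ignoring the prior: {posterior_cage_A(0.5, p_A, p_B, draws):.3f}")
```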
--------------------------------------------------------------------------------------------------------
This study explores how competition can enhance LLM-based multi-agent systems for news-driven time series forecasting. The challenge lies in measuring the influence of different news events on time series fluctuations, requiring innovative thinking and the ability to identify misleading logic. The authors embed a competition mechanism within multi-agent discussions to stimulate innovative thought and incorporate a fine-tuned small-scale LLM within the reflective stage to better identify misleading information. Experimental results confirm that competition significantly boosts agents' innovative thinking capacity and improves time series prediction performance. Interestingly, the intensity of competition influences agent performance, echoing findings from social science and providing new perspectives for studying LLM-based multi-agent systems in forecasting applications.
Authors: Yuxuan Zhang, Yangyang Feng, Daifeng Li, Kexin Zhang, Junlan Chen, Bowen Deng
Link: https://arxiv.org/abs/2504.10210v1
Date: 2025-04-14
Summary:
Multi-agent-based news-driven time series forecasting is considered a potential paradigm shift in the era of large language models (LLMs). The challenge of this task lies in measuring the influence of different news events on the fluctuations of time series. This requires agents to possess stronger abilities in innovative thinking and in identifying misleading logic. However, the existing multi-agent discussion framework offers limited improvement in time series prediction with respect to these two capabilities. Inspired by the role of competition in fostering innovation, this study embeds a competition mechanism within the multi-agent discussion to enhance agents' capability of generating innovative thoughts. Furthermore, to bolster the model's proficiency in identifying misleading information, we incorporate a fine-tuned small-scale LLM within the reflective stage, offering auxiliary decision-making support. Experimental results confirm that the competition can boost agents' capacity for innovative thinking, which can significantly improve the performance of time series prediction. Similar to findings in the social sciences, the intensity of competition within this framework can influence the performance of agents, providing a new perspective for studying LLM-based multi-agent systems.
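One way such a competition round could be organized is sketched below: agents propose forecasts, are ranked on a held-out point, and the weakest revise using the leaders' rationales. The agent interface and the dummy agents are stand-ins for the paper's LLM-based agents and fine-tuned reflection model.

```python
# Schematic competition round inside a multi-agent forecasting discussion.
def competition_round(agents, news, history, actual, elim_frac=0.25):
    """agents: objects with .forecast(news, history) and .revise(best_rationales)."""
    scored = []
    for agent in agents:
        pred, rationale = agent.forecast(news, history)
        scored.append((abs(pred - actual), agent, rationale))   # lower error is better
    scored.sort(key=lambda s: s[0])
    winners = scored[: max(1, int(len(scored) * (1 - elim_frac)))]
    best_rationales = [r for _, _, r in winners[:3]]
    for _, agent, _ in scored[len(winners):]:                   # losers learn from winners
        agent.revise(best_rationales)
    return winners[0][1]                                        # current best agent

class DummyAgent:
    def __init__(self, bias): self.bias = bias
    def forecast(self, news, history): return history[-1] + self.bias, f"momentum + {self.bias}"
    def revise(self, rationales): self.bias *= 0.5               # crude stand-in for reflection

best = competition_round([DummyAgent(b) for b in (-2.0, 0.5, 3.0)],
                         news="Rate cut announced", history=[100, 102], actual=103.0)
```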
--------------------------------------------------------------------------------------------------------
A Review of Traffic Wave Suppression Strategies: Variable Speed Limit vs. Jam-Absorption Driving
This comprehensive review addresses the persistent challenge of stop-and-go traffic waves—moving jams that propagate upstream indefinitely, causing reduced efficiency, increased accident risk, and higher emissions. The paper examines two major suppression strategies developed over the past two decades: variable speed limit (VSL) and jam-absorption driving (JAD). Despite sharing similar motivations, objectives, and theoretical foundations, these approaches have developed largely in isolation from each other. The authors bridge these disconnected fields by synthesizing their achievements and identifying research opportunities across multiple perspectives: fundamental diagrams, traffic dynamics modeling, state estimation/prediction, stochasticity, strategy validation scenarios, and practical deployment. By enabling each field to leverage the strengths of the other, this work promotes the overall goal of eliminating freeway congestion through more effective wave suppression techniques.
Authors: Zhengbing He, Jorge Laval, Yu Han, Ryosuke Nishi, Cathy Wu
Link: https://arxiv.org/abs/2504.11372v1
Date: 2025-04-15
Summary:
The main form of freeway traffic congestion is the familiar stop-and-go wave, characterized by wide moving jams that propagate indefinitely upstream provided there is enough traffic demand. They cause severe, long-lasting adverse effects, such as reduced traffic efficiency, increased driving risks, and higher vehicle emissions. This underscores the crucial importance of artificial intervention in the propagation of stop-and-go waves. Over the past two decades, two prominent strategies for stop-and-go wave suppression have emerged: variable speed limit (VSL) and jam-absorption driving (JAD). Although they share similar research motivations, objectives, and theoretical foundations, the development of these strategies has remained relatively disconnected. To synthesize fragmented advances and drive the field forward, this paper first provides a comprehensive review of the achievements in stop-and-go wave suppression-oriented VSL and JAD, respectively. It then focuses on bridging the two areas and identifying research opportunities from the following perspectives: fundamental diagrams, traffic dynamics modeling, traffic state estimation and prediction, stochasticity, scenarios for strategy validation, and field tests and practical deployment. We expect that through this review, one area can effectively address its limitations by identifying and leveraging the strengths of the other, thus promoting the overall research goal of freeway stop-and-go wave suppression.
--------------------------------------------------------------------------------------------------------
Performance of Large Language Models in Supporting Medical Diagnosis and Treatment
This study evaluates how effectively Large Language Models can support medical diagnosis and treatment planning by analyzing their performance on the 2024 Portuguese National Exam for medical specialty access. The researchers tested various open-source and closed-source LLMs, finding considerable variation in accuracy and cost-effectiveness, with several models outperforming human medical student benchmarks on this specific assessment. The analysis identifies top-performing models based on combined accuracy and cost metrics while discussing the impact of reasoning methodologies like Chain-of-Thought. These findings demonstrate LLMs' potential as valuable complementary tools for healthcare professionals making complex clinical decisions, suggesting they could enhance diagnostic accuracy and treatment recommendations when properly integrated into medical workflows.
Authors: Diogo Sousa, Guilherme Barbosa, Catarina Rocha, Dulce Oliveira
Link: https://arxiv.org/abs/2504.10405v1
Date: 2025-04-14
Summary:
The integration of Large Language Models (LLMs) into healthcare holds significant potential to enhance diagnostic accuracy and support medical treatment planning. These AI-driven systems can analyze vast datasets, assisting clinicians in identifying diseases, recommending treatments, and predicting patient outcomes. This study evaluates the performance of a range of contemporary LLMs, including both open-source and closed-source models, on the 2024 Portuguese National Exam for medical specialty access (PNA), a standardized medical knowledge assessment. Our results highlight considerable variation in accuracy and cost-effectiveness, with several models demonstrating performance exceeding human benchmarks for medical students on this specific task. We identify leading models based on a combined score of accuracy and cost, discuss the implications of reasoning methodologies like Chain-of-Thought, and underscore the potential for LLMs to function as valuable complementary tools aiding medical professionals in complex clinical decision-making.
--------------------------------------------------------------------------------------------------------
From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
This paper addresses the cost-performance dilemma in Large Language Model deployment by introducing a three-stage optimization pipeline. While LLMs have revolutionized NLP by replacing traditional pipelines with "one-stage" systems, this approach incurs substantial costs and latency. The authors' framework begins with prototyping, creating an optimal performance system that generates high-quality training data as a teacher model. The second stage transfers knowledge to a smaller 0.5B student model through rejection fine-tuning, reinforcement learning, and knowledge distillation. The final stage applies quantization and pruning to further compress the model to 0.4B, achieving ultra-low latency and cost. This modular approach delivers effective performance while dramatically reducing resource requirements, with potential applications across various NLP domains.
Authors: Jiliang Ni, Jiachen Pu, Zhongyi Yang, Kun Zhou, Hui Wang, Xiaoliang Xiao, Dakui Wang, Xin Li, Jingfeng Luo, Conggang Hu
Link: https://arxiv.org/abs/2504.13471v1
Date: 2025-04-18
Summary:
In recent years, Large Language Models (LLMs) have significantly advanced artificial intelligence by optimizing traditional Natural Language Processing (NLP) pipelines, improving performance and generalization. This has spurred their integration into various systems. Many NLP systems, including ours, employ a "one-stage" pipeline directly incorporating LLMs. While effective, this approach incurs substantial costs and latency due to the need for large model parameters to achieve satisfactory outcomes. This paper introduces a three-stage cost-efficient end-to-end LLM deployment pipeline, comprising prototyping, knowledge transfer, and model compression, to tackle the cost-performance dilemma in LLM-based frameworks. Our approach yields a super tiny model optimized for cost and performance in online systems, simplifying the system architecture. Initially, by transforming complex tasks into a function call-based LLM-driven pipeline, an optimal performance prototype system is constructed to produce high-quality data as a teacher model. The second stage combines techniques such as rejection fine-tuning, reinforcement learning, and knowledge distillation to transfer knowledge to a smaller 0.5B student model, delivering effective performance at minimal cost. The final stage applies quantization and pruning to further compress the model to 0.4B, achieving ultra-low latency and cost. The framework's modular design and cross-domain capabilities suggest potential applicability in other NLP areas.
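The knowledge-transfer stage can be illustrated with a standard distillation objective: the student matches the teacher's temperature-softened distribution while also fitting ground-truth labels. The weighting and temperature below are placeholder hyperparameters, and the paper combines this with rejection fine-tuning and reinforcement learning.

```python
# Minimal knowledge-distillation loss of the kind used in teacher-to-student transfer.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the (frozen) teacher, temperature-scaled.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    # Hard targets: usual token-level cross-entropy on ground-truth labels.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    return alpha * kd + (1 - alpha) * ce
```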
--------------------------------------------------------------------------------------------------------
Evolutionary Reinforcement Learning for Interpretable Decision-Making in Supply Chain Management
This research tackles a key challenge in Industry 4.0: stakeholder reluctance to adopt AI-based optimization for Supply Chain Management due to the "black-box" nature of most solutions. The authors combine evolutionary computation with Reinforcement Learning to generate interpretable decision-making policies as decision trees, embedding this approach within a simulation-based optimization framework designed for supply chain uncertainties. Testing on both fictional and real-world supply chain problems, the interpretable approach delivers competitive and sometimes superior performance compared to traditional optimization and RL algorithms. This challenges the prevailing notion that interpretability necessarily compromises efficiency. The framework shows strong potential for industrial applications, offering seamless integration with Python-based algorithms while providing the transparency necessary for stakeholder confidence.
Authors: Stefano Genetti, Alberto Longobardi, Giovanni Iacca
Link: https://arxiv.org/abs/2504.12023v1
Date: 2025-04-16
Summary:
In the context of Industry 4.0, Supply Chain Management (SCM) faces challenges in adopting advanced optimization techniques due to the "black-box" nature of most AI-based solutions, which causes reluctance among company stakeholders. To overcome this issue, in this work, we employ an Interpretable Artificial Intelligence (IAI) approach that combines evolutionary computation with Reinforcement Learning (RL) to generate interpretable decision-making policies in the form of decision trees. This IAI solution is embedded within a simulation-based optimization framework specifically designed to handle the inherent uncertainties and stochastic behaviors of modern supply chains. To our knowledge, this marks the first attempt to combine IAI with simulation-based optimization for decision-making in SCM. The methodology is tested on two supply chain optimization problems, one fictional and one from the real world, and its performance is compared against widely used optimization and RL algorithms. The results reveal that the interpretable approach delivers competitive, and sometimes better, performance, challenging the prevailing notion that there must be a trade-off between interpretability and optimization efficiency. Additionally, the developed framework demonstrates strong potential for industrial applications, offering seamless integration with various Python-based algorithms.
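A toy version of the evolutionary idea is sketched below: an interpretable reorder rule (here a single threshold rather than a full decision tree) is evaluated in a small stochastic inventory simulation and improved by a (1+1) evolution strategy. The cost model and mutation scheme are made up for illustration.

```python
# Evolving an interpretable reorder policy in a toy stochastic supply chain.
import random

def simulate(policy_threshold, order_qty=20, horizon=100, seed=0):
    rng = random.Random(seed)
    stock, profit = 30, 0.0
    for _ in range(horizon):
        demand = rng.randint(0, 10)
        sold = min(stock, demand)
        profit += 5.0 * sold - 0.1 * stock          # revenue minus holding cost
        stock -= sold
        if stock < policy_threshold:                # the interpretable rule
            stock += order_qty
            profit -= 1.0 * order_qty               # purchasing cost
    return profit

best = 10
for gen in range(50):                               # (1+1) evolution strategy
    child = max(0, best + random.choice([-3, -1, 1, 3]))
    if simulate(child) > simulate(best):
        best = child
print("evolved reorder point:", best)
```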
--------------------------------------------------------------------------------------------------------
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
This paper introduces REAL, a groundbreaking benchmark and framework for evaluating multi-turn agent performance on deterministic simulations of real-world websites. Comprising high-fidelity replicas of 11 widely-used websites across diverse domains, REAL includes 112 practical tasks that mirror everyday complex user interactions. The fully controlled setting eliminates safety risks while enabling robust, reproducible evaluation of agent capabilities. The novel evaluation framework combines programmatic checks for action-based tasks with LLM-based judgments for information retrieval. Supporting both open-source and proprietary agent systems through a flexible harness, REAL accommodates black-box commands within browser environments. Empirical results reveal that even frontier language models achieve at most a 41% success rate, highlighting critical gaps in autonomous web navigation. The framework supports easy integration of new tasks and scalable post-training data generation, advancing agent evaluation capabilities.
Authors: Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani
Link: https://arxiv.org/abs/2504.11543v2
Date: 2025-04-17
Summary:
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.
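The two-pronged grading described above could be organized roughly as follows, with deterministic state checks for action tasks and a rubric-guided judge call for retrieval tasks; the field names and the llm_judge interface are illustrative assumptions rather than REAL's actual harness.

```python
# Sketch of combining programmatic state checks with an LLM rubric judge.
def evaluate_episode(task, final_state, agent_answer, llm_judge):
    if task["type"] == "action":
        # Deterministic check, e.g., the right item ended up in the cart.
        return all(final_state.get(k) == v for k, v in task["expected_state"].items())
    elif task["type"] == "retrieval":
        verdict = llm_judge(rubric=task["rubric"],
                            question=task["instruction"],
                            answer=agent_answer)
        return verdict.strip().lower().startswith("pass")
    raise ValueError(f"unknown task type: {task['type']}")

# Example action task on a simulated shop (illustrative field names).
task = {"type": "action", "instruction": "Add a blue mug to the cart",
        "expected_state": {"cart.items[0].name": "blue mug"}}
print(evaluate_episode(task, {"cart.items[0].name": "blue mug"}, "", llm_judge=None))
```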
--------------------------------------------------------------------------------------------------------
Memristive chaotic circuit for information processing through time
This research presents an innovative approach to information processing inspired by the human brain's ability to process sensory information in real-time with remarkable efficiency. The authors developed a memristor-based compact chaotic circuit that, unlike conventional computing systems, processes data streams "through time," utilizing time as an internal variable. They created a hardware memristive version of the simplest formal chaotic circuit that leverages the nonlinearity of nonvolatile memristor devices to evolve with complex dynamics in response to driving signals. When implemented in a single-node reservoir computing scheme, the circuit successfully performed nonlinear classification tasks and processed temporal information streams. These results demonstrate the potential of simple memristor-based chaotic circuits to function as nonlinear dynamics-based computing systems for temporal information processing.
Authors: Manuel Escudero, Sabina Spiga, Stefano Brivio
Link: https://arxiv.org/abs/2504.13600v1
Date: 2025-04-18
Summary:
The human brain processes sensory information in real-time with extraordinary efficiency compared to current artificial computing systems. It operates as a complex nonlinear system, composed of interacting dynamic units - neurons and synapses - that processes data streams as time goes by, i.e. through time, using time as an internal self-standing variable. Here we report on a memristor-based compact chaotic circuit included in a computing architecture that can process information through time. We realized a hardware memristive version of the simplest formal chaotic circuit that, thanks to the nonlinearity of the nonvolatile memristor device, evolves with complex dynamics in response to a driving signal. The circuit is used in a single-node reservoir computing scheme to demonstrate nonlinear classification tasks and the processing of data streams through time. These results demonstrate that a simple memristor-based chaotic circuit has the potential to operate as a nonlinear dynamics-based computing system and to process temporal information through time.
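A software analogue of the single-node reservoir scheme is sketched below: a time-multiplexed nonlinear node (a plain tanh stand-in for the memristive chaotic circuit) with delayed feedback generates virtual states, and a ridge-regression readout is trained on a small nonlinear temporal task. The task, mask, and gains are illustrative choices.

```python
# Minimal single-node ("time-multiplexed") reservoir with a ridge-regression readout.
import numpy as np

rng = np.random.default_rng(0)
T, n_virtual = 2000, 20
u = rng.uniform(-1, 1, T)                      # input stream
target = np.roll(u, 1) * np.roll(u, 2)         # nonlinear function of past inputs

mask = rng.choice([-1.0, 1.0], n_virtual)      # input mask over virtual nodes
states = np.zeros((T, n_virtual))
x = np.zeros(n_virtual)
for t in range(T):
    # Each virtual node mixes the masked input with its delayed (previous) state.
    x = np.tanh(0.9 * mask * u[t] + 0.5 * np.roll(x, 1))
    states[t] = x

# Ridge-regression readout trained on the first half, tested on the second.
split = T // 2
X_tr, y_tr = states[3:split], target[3:split]
w = np.linalg.solve(X_tr.T @ X_tr + 1e-3 * np.eye(n_virtual), X_tr.T @ y_tr)
pred = states[split:] @ w
nmse = np.mean((pred - target[split:]) ** 2) / np.var(target[split:])
print(f"test NMSE: {nmse:.3f}")
```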
--------------------------------------------------------------------------------------------------------
In between myth and reality: AI for math -- a case study in category theory
This paper explores the capabilities of AI systems in mathematical research through an experiment with two leading contemporary AI platforms in the domain of category theory. The author conducts this investigation with dual objectives: understanding how AI can assist mathematical research and providing feedback to AI developers for system improvement. The experiment offers insights into current AI strengths and limitations when tackling advanced mathematical concepts, with particular attention to category theory—a field known for its abstract and foundational nature in mathematics. Through systematic testing, the research bridges the gap between expectations and reality regarding AI's mathematical capabilities, providing valuable insights for both the mathematical community and AI developers working toward enhancing these systems for scientific research applications.
Authors: Răzvan Diaconescu
Link: https://arxiv.org/abs/2504.13360v1
Date: 2025-04-17
Summary:
Recently, there has been increasing interest in understanding the performance of AI systems in solving math problems. A multitude of tests have been performed, with mixed conclusions. In this paper we discuss an experiment we carried out in the direction of mathematical research, with two of the most prominent contemporary AI systems. One objective of this experiment is to get an understanding of how AI systems can assist mathematical research. Another objective is to support AI system developers by formulating suggestions for directions of improvement.
------------------------------------------------------------------------------------------------------