Week Ending 3.4.2024

RESEARCH WATCH: 3.4.2024

Retrieval-Augmented Generation for AI-Generated Content: A Survey

Artificial intelligence is revolutionizing content generation, but challenges remain in maintaining up-to-date knowledge, preventing data leaks, and reducing training costs. Retrieval-Augmented Generation (RAG) enhances AI-generated content by retrieving relevant information, increasing accuracy and robustness. This survey comprehensively reviews integrating RAG into AI content generation scenarios, offering a unified perspective and highlighting advancements to guide future progress. Potential applications span improving AI writing assistants, chatbots, and multimodal content creation tools.

Authors: Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Bin Cui

Link: https://arxiv.org/abs/2402.19473v1

Date: 2024-02-29

Summary:

The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by advancements in model algorithms, scalable foundation model architectures, and the availability of ample high-quality datasets. While AIGC has achieved remarkable performance, it still faces challenges, such as the difficulty of maintaining up-to-date and long-tail knowledge, the risk of data leakage, and the high costs associated with training and inference. Retrieval-Augmented Generation (RAG) has recently emerged as a paradigm to address such challenges. In particular, RAG introduces the information retrieval process, which enhances AIGC results by retrieving relevant objects from available data stores, leading to greater accuracy and robustness. In this paper, we comprehensively review existing efforts that integrate the RAG technique into AIGC scenarios. We first classify RAG foundations according to how the retriever augments the generator. We distill the fundamental abstractions of the augmentation methodologies for various retrievers and generators. This unified perspective encompasses all RAG scenarios, illuminating advancements and pivotal technologies that help with potential future progress. We also summarize additional enhancement methods for RAG, facilitating effective engineering and implementation of RAG systems. Then, from another view, we survey practical applications of RAG across different modalities and tasks, offering valuable references for researchers and practitioners. Furthermore, we introduce the benchmarks for RAG, discuss the limitations of current RAG systems, and suggest potential directions for future research. Project: https://github.com/hymie122/RAG-Survey
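
To make the retrieve-then-generate idea concrete, here is a minimal Python sketch of the query-side step: score each item in a small data store against the query, keep the top matches, and prepend them to the prompt handed to a generator. The bag-of-words cosine retriever, the toy STORE, and the function names are illustrative assumptions, not anything taken from the survey, and the generator call itself is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, k=2):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(store, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query, store, k=2):
    """Assemble a prompt that puts retrieved context ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, store, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical toy data store, for illustration only.
STORE = [
    "RAG retrieves relevant objects from a data store before generation.",
    "Battleship is a two-player guessing game.",
    "Retrieval can reduce the need to retrain a generator on new knowledge.",
]

if __name__ == "__main__":
    print(build_prompt("What does RAG retrieve from the data store?", STORE))
```

In a real system the word-count retriever would typically be replaced by a dense or hybrid retriever, and the prompt would be passed to a language model; the survey's taxonomy also covers ways of augmenting the generator other than via the prompt.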

--------------------------------------------------------------------------------------------------------

Loose LIPS Sink Ships: Asking Questions in Battleship with Language-Informed Program Sampling

Asking informative questions under uncertainty is a remarkable human capability. This work studies question-asking through the game Battleship, proposing a language-informed program sampling model that uses large language models to generate natural questions, translate them into programs, and evaluate their expected information gain. The approach mirrors human performance and could enhance AI question-asking abilities for applications like tutoring systems and conversational agents.

Authors: Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum

Link: https://arxiv.org/abs/2402.19471v1

Date: 2024-02-29

Summary:

Questions combine our mastery of language with our remarkable facility for reasoning about uncertainty. How do people navigate vast hypothesis spaces to pose informative questions given limited cognitive resources? We study these tradeoffs in a classic grounded question-asking task based on the board game Battleship. Our language-informed program sampling (LIPS) model uses large language models (LLMs) to generate natural language questions, translate them into symbolic programs, and evaluate their expected information gain. We find that with a surprisingly modest resource budget, this simple Monte Carlo optimization strategy yields informative questions that mirror human performance across varied Battleship board scenarios. In contrast, LLM-only baselines struggle to ground questions in the board state; notably, GPT-4V provides no improvement over non-visual baselines. Our results illustrate how Bayesian models of question-asking can leverage the statistics of language to capture human priors, while highlighting some shortcomings of pure LLMs as grounded reasoners.
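
The scoring step in the abstract (evaluate a question's expected information gain after translating it into a program) can be sketched in a few lines of Python. This sketch assumes a uniform belief over a set of sampled boards and noiseless answers, in which case the expected information gain equals the entropy of the answer distribution. The toy boards, the hand-written lambda "programs", and all names are hypothetical; in LIPS the questions and their program translations come from an LLM.

```python
import math
from collections import Counter


def expected_information_gain(question, boards):
    """EIG in bits, assuming a uniform belief over `boards` and noiseless answers.

    Under those assumptions the EIG equals the entropy of the answers
    the question program returns across the sampled boards.
    """
    counts = Counter(question(board) for board in boards)
    n = len(boards)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical hypothesis space: each board is the set of cells a ship occupies.
BOARDS = [
    frozenset({(0, 0), (0, 1)}),
    frozenset({(0, 1), (0, 2)}),
    frozenset({(1, 0), (1, 1)}),
    frozenset({(1, 1), (1, 2)}),
]

# Hypothetical question programs, keyed by their natural-language form.
QUESTIONS = {
    "Is there a ship at (0, 1)?": lambda b: (0, 1) in b,
    "Is there a ship at (0, 0)?": lambda b: (0, 0) in b,
    "Is any ship in row 5?": lambda b: any(row == 5 for row, _ in b),
}


def best_question(questions, boards):
    """Pick the candidate question with the highest expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(questions[q], boards))


if __name__ == "__main__":
    for text, program in QUESTIONS.items():
        print(f"{expected_information_gain(program, BOARDS):.3f} bits  {text}")
    print("Best:", best_question(QUESTIONS, BOARDS))
```

The first question splits the four boards evenly and so gains exactly one bit, while the row-5 question gets the same answer on every board and gains nothing.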

--------------------------------------------------------------------------------------------------------

Curiosity-driven Red-teaming for Large Language Models

While powerful, large language models risk generating incorrect or toxic content. This paper proposes curiosity-driven red-teaming to systematically probe for prompts that trigger undesirable outputs, achieving greater coverage than prior methods. The technique could improve LLM safety and robustness for real-world deployment across applications by exposing potential failure modes.

Authors: Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal

Link: https://arxiv.org/abs/2402.19464v1

Date: 2024-02-29

Summary:

Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a red team of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods are only able to generate a small number of effective test cases, resulting in a low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration that optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods. Our method, CRT, successfully provokes toxic responses from the LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at https://github.com/Improbable-AI/curiosity_redteam
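
One way to picture "optimizing for novelty" is a reward that adds a novelty bonus to the usual effectiveness score, so the red-team policy is paid less for repeating test cases it has already produced. The Python sketch below is an illustration under assumptions, not the CRT objective: the word-level Jaccard similarity, the bonus weight, and all function names are hypothetical, and the effectiveness score would in practice come from a classifier applied to the target model's response.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two prompts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def novelty(prompt, history):
    """1 minus the similarity to the closest previously generated prompt."""
    if not history:
        return 1.0
    return 1.0 - max(jaccard(prompt, past) for past in history)


def shaped_reward(effectiveness, prompt, history, bonus_weight=0.5):
    """Effectiveness score plus a weighted novelty bonus (assumed additive form)."""
    return effectiveness + bonus_weight * novelty(prompt, history)


if __name__ == "__main__":
    history = ["tell me a rude joke"]
    # A repeated prompt earns no bonus; a new one earns the full bonus.
    print(shaped_reward(0.8, "tell me a rude joke", history))
    print(shaped_reward(0.8, "write an insulting poem", history))
```

With such shaping, two equally effective prompts are no longer equally rewarded: the one that differs from earlier test cases scores higher, which pushes generation toward broader coverage.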

--------------------------------------------------------------------------------------------------------

Leveraging AI Predicted and Expert Revised Annotations in Interactive Segmentation: Continual Tuning or Full Training?

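Curating large annotated datasets for medical image segmentation is challenging. This paper proposes Continual Tuning, which lets a segmentation model learn from expert-revised annotations without retraining from scratch: a shared network is frozen for previously learned classes, only the class-specific networks for revised classes are updated, and a small, importance-ranked fraction of previously annotated data is reused. The approach is reported to be 16x faster than repeated full training without compromising performance, and could accelerate curating high-quality datasets for healthcare applications requiring precise image segmentation.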

Authors: Tiezheng Zhang, Xiaoxi Chen, Chongyu Qu, Alan Yuille, Zongwei Zhou

Link: https://arxiv.org/abs/2402.19423v1

Date: 2024-02-29

Summary:

Interactive segmentation, an integration of AI algorithms and human expertise, promises to improve the accuracy and efficiency of curating large-scale, detailed-annotated datasets in healthcare. Human experts revise the annotations predicted by AI, and in turn, AI improves its predictions by learning from these revised annotations. This interactive process continues to enhance the quality of annotations until no major revision is needed from experts. The key challenge is how to leverage AI predicted and expert revised annotations to iteratively improve the AI. Two problems arise: (1) The risk of catastrophic forgetting--the AI tends to forget the previously learned classes if it is only retrained using the expert revised classes. (2) Computational inefficiency when retraining the AI using both AI predicted and expert revised annotations; moreover, given the dominant AI predicted annotations in the dataset, the contribution of newly revised annotations--which often account for a very small fraction--to the AI training remains marginal. This paper proposes Continual Tuning to address the problems from two perspectives: network design and data reuse. Firstly, we design a shared network for all classes followed by class-specific networks dedicated to individual classes. To mitigate forgetting, we freeze the shared network for previously learned classes and only update the class-specific network for revised classes. Secondly, we reuse a small fraction of data with previous annotations to avoid over-computing. The selection of such data relies on the importance estimate of each data sample. The importance score is computed by combining the uncertainty and consistency of AI predictions. Our experiments demonstrate that Continual Tuning achieves a speed 16x greater than repeatedly training AI from scratch without compromising the performance.

--------------------------------------------------------------------------------------------------------

Crafting Knowledge: Exploring the Creative Mechanisms of Chat-Based Search Engines

Chat-based search engines powered by large language models demonstrate remarkable abilities in understanding and creatively presenting web information. This study dissects how such systems select sources for responses, revealing unique text preferences emerging from the underlying language models. The findings could guide improving information retrieval and presentation in conversational AI assistants.

Authors: Lijia Ma, Xingchen Xu, Yong Tan

Link: https://arxiv.org/abs/2402.19421v1

Date: 2024-02-29

Summary:

In the domain of digital information dissemination, search engines act as pivotal conduits linking information seekers with providers. The advent of chat-based search engines utilizing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG), exemplified by Bing Chat, marks an evolutionary leap in the search ecosystem. They demonstrate metacognitive abilities in interpreting web information and crafting responses with human-like understanding and creativity. Nonetheless, the intricate nature of LLMs renders their "cognitive" processes opaque, challenging even their designers' understanding. This research aims to dissect the mechanisms through which an LLM-powered chat-based search engine, specifically Bing Chat, selects information sources for its responses. To this end, an extensive dataset has been compiled through engagements with New Bing, documenting the websites it cites alongside those listed by the conventional search engine. Employing natural language processing (NLP) techniques, the research reveals that Bing Chat exhibits a preference for content that is not only readable and formally structured, but also demonstrates lower perplexity levels, indicating a unique inclination towards text that is predictable by the underlying LLM. Further enriching our analysis, we procure an additional dataset through interactions with the GPT-4 based knowledge retrieval API, unveiling a congruent text preference between the RAG API and Bing Chat. This consensus suggests that these text preferences intrinsically emerge from the underlying language models, rather than being explicitly crafted by Bing Chat's developers. Moreover, our investigation documents a greater similarity among websites cited by RAG technologies compared to those ranked highest by conventional search engines.
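
For readers unfamiliar with the metric, perplexity is the exponential of the negative mean log-probability a language model assigns to each token, so text the model finds predictable scores low. The Python sketch below shows only that calculation; the per-token log-probabilities are made-up numbers, and how the study obtained them is not described here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    Defined as exp(-mean(log p)); lower values mean the model found
    the text more predictable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # Hypothetical log-probabilities for two short passages.
    predictable = [math.log(0.5), math.log(0.6), math.log(0.4)]
    surprising = [math.log(0.05), math.log(0.1), math.log(0.02)]
    print(f"predictable text: {perplexity(predictable):.2f}")
    print(f"surprising text:  {perplexity(surprising):.2f}")
```

A passage in which every token has probability 0.5 has a perplexity of exactly 2, which is a handy sanity check for any implementation.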

--------------------------------------------------------------------------------------------------------

OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models

Most medical language models rely on extensive fine-tuning using specialized data, limiting accessibility. This work shows that prompt engineering with open-source large language models can achieve state-of-the-art performance on medical question-answering benchmarks, outperforming prior fine-tuned models. The OpenMedLM platform could democratize access to accurate medical information retrieval for applications in healthcare.

Authors: Jenish Maharjan, Anurag Garikipati, Navan Preet Singh, Leo Cyrus, Mayank Sharma, Madalina Ciobanu, Gina Barnes, Rahul Thapa, Qingqing Mao, Ritankar Das

Link: https://arxiv.org/abs/2402.19371v1

Date: 2024-02-29

Summary:

LLMs have become increasingly capable at accomplishing a range of specialized tasks and can be utilized to expand equitable access to medical knowledge. Most medical LLMs have involved extensive fine-tuning, leveraging specialized medical data and significant, thus costly, amounts of computational power. Many of the top performing LLMs are proprietary and their access is limited to very few research groups. However, open-source (OS) models represent a key area of growth for medical LLMs due to significant improvements in performance and an inherent ability to provide the transparency and compliance required in healthcare. We present OpenMedLM, a prompting platform which delivers state-of-the-art (SOTA) performance for OS LLMs on medical benchmarks. We evaluated a range of OS foundation LLMs (7B-70B) on four medical benchmarks (MedQA, MedMCQA, PubMedQA, MMLU medical-subset). We employed a series of prompting strategies, including zero-shot, few-shot, chain-of-thought (random selection and kNN selection), and ensemble/self-consistency voting. We found that OpenMedLM delivers OS SOTA results on three common medical LLM benchmarks, surpassing the previous best performing OS models that leveraged computationally costly extensive fine-tuning. The model delivers a 72.6% accuracy on the MedQA benchmark, outperforming the previous SOTA by 2.4%, and achieves 81.7% accuracy on the MMLU medical-subset, establishing itself as the first OS LLM to surpass 80% accuracy on this benchmark. Our results highlight medical-specific emergent properties in OS LLMs which have not yet been documented elsewhere, and showcase the benefits of further leveraging prompt engineering to improve the performance of accessible LLMs for medical applications.
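
Of the prompting strategies listed, self-consistency voting is the simplest to illustrate: sample several chain-of-thought completions for the same multiple-choice question, extract each final answer, and return the most common one. The Python sketch below covers only the voting step; the sampled answers are hypothetical stand-ins for model outputs, and the tie-breaking rule (first answer seen wins) is an assumption rather than something the abstract specifies.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


if __name__ == "__main__":
    # Hypothetical final answers extracted from five sampled completions.
    sampled = ["B", "A", "B", "C", "B"]
    print(majority_vote(sampled))
```

The design trades extra inference calls for robustness: a single sampled reasoning chain may go wrong, but the error has to recur across samples to win the vote.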

--------------------------------------------------------------------------------------------------------

One model to use them all: Training a segmentation model with complementary datasets

Training deep learning models for detailed surgical scene segmentation requires large, fully annotated datasets which are costly to create. This paper proposes combining multiple partially annotated datasets with complementary labels into one model, reducing annotation burden. The approach could accelerate developing robust computer-vision systems to understand the surgical field for computer-assisted interventions.

Authors: Alexander C. Jenke, Sebastian Bodenstedt, Fiona R. Kolbinger, Marius Distler, Jürgen Weitz, Stefanie Speidel

Link: https://arxiv.org/abs/2402.19340v1

Date: 2024-02-29

Summary:

Understanding a surgical scene is crucial for computer-assisted surgery systems to provide any intelligent assistance functionality. One way of achieving this scene understanding is via scene segmentation, where every pixel of a frame is classified and therefore identifies the visible structures and tissues. Progress on fully segmenting surgical scenes has been made using machine learning. However, such models require large amounts of annotated training data, containing examples of all relevant object classes. Such fully annotated datasets are hard to create, as every pixel in a frame needs to be annotated by medical experts and, therefore, are rarely available. In this work, we propose a method to combine multiple partially annotated datasets, which provide complementary annotations, into one model, enabling better scene segmentation and the use of multiple readily available datasets. Our method aims to combine available data with complementary labels by leveraging mutually exclusive properties to maximize information. Specifically, we propose to use positive annotations of other classes as negative samples and to exclude background pixels of binary annotations, as we cannot tell if they contain a class not annotated but predicted by the model. We evaluate our method by training a DeepLabV3 on the publicly available Dresden Surgical Anatomy Dataset, which provides multiple subsets of binary segmented anatomical structures. Our approach successfully combines 6 classes into one model, increasing the overall Dice Score by 4.4% compared to an ensemble of models trained on the classes individually. By including information on multiple classes, we were able to reduce confusion between stomach and colon by 24%. Our results demonstrate the feasibility of training a model on multiple datasets. This paves the way for future work further alleviating the need for one large, fully segmented dataset.
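
The labeling rule in the abstract can be read as building, for each binary-annotated frame, a per-class target plus a validity mask that says which pixels may contribute to the loss. The Python sketch below encodes one such reading: the annotated class is supervised on every pixel, a positive pixel of the annotated class is a known negative for every other class, and background pixels are excluded for the other classes. The function name, the nested-list representation, and the exact handling of the annotated class are assumptions for illustration, not the authors' implementation.

```python
def targets_and_mask(binary_mask, annotated_class, num_classes):
    """Build per-class targets and loss-validity masks from one binary annotation.

    binary_mask: 2D list of 0/1 values for the single annotated class.
    Returns (target, valid), each indexed as [class][row][col].
    """
    target, valid = [], []
    for cls in range(num_classes):
        if cls == annotated_class:
            # The annotated class is known on every pixel.
            target.append([[1 if px else 0 for px in row] for row in binary_mask])
            valid.append([[True for _ in row] for row in binary_mask])
        else:
            # Positive pixels of the annotated class are negatives here;
            # background pixels are unknown for this class and are excluded.
            target.append([[0 for _ in row] for row in binary_mask])
            valid.append([[bool(px) for px in row] for row in binary_mask])
    return target, valid


if __name__ == "__main__":
    mask = [[1, 0],
            [0, 0]]
    target, valid = targets_and_mask(mask, annotated_class=1, num_classes=3)
    print("target:", target)
    print("valid: ", valid)
```

A training loop would then compute a per-class, per-pixel loss and average it only over entries where the validity mask is true, so unknown pixels neither reward nor penalize the model.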

--------------------------------------------------------------------------------------------------------

RL-GPT: Integrating Reinforcement Learning and Code-as-policy

Integrating large language models' coding capabilities with reinforcement learning's precise control could unlock powerful AI systems. This work introduces RL-GPT, a hierarchical framework coupling a coding-based agent with a reinforcement learned agent, demonstrating superior efficiency in challenging embodied tasks like Minecraft. The approach could drive advances in robotics, game AI, and other domains requiring combining high-level planning with low-level skilled control.

Authors: Shaoteng Liu, Haoqi Yuan, Minda Hu, Yanwei Li, Yukang Chen, Shu Liu, Zongqing Lu, Jiaya Jia

Link: https://arxiv.org/abs/2402.19299v1

Date: 2024-02-29

Summary:

Large Language Models (LLMs) have demonstrated proficiency in utilizing various tools by coding, yet they face limitations in handling intricate logic and precise control. In embodied tasks, high-level planning is amenable to direct coding, while low-level actions often necessitate task-specific refinement, such as Reinforcement Learning (RL). To seamlessly integrate both modalities, we introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent. The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks. This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline. Our approach outperforms traditional RL methods and existing GPT agents, demonstrating superior efficiency. In the Minecraft game, it rapidly obtains diamonds within a single day on an RTX3090. Additionally, it achieves SOTA performance across all designated MineDojo tasks.

--------------------------------------------------------------------------------------------------------

Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach

Planning under uncertainty with partially observable Markov decision processes is computationally demanding for complex domains. This paper proposes learning high-quality, interpretable heuristics from execution traces to guide the action selection process, evaluated on challenging problems. The method could enhance scalability and transparency of automated decision-making under uncertainty for applications like robotics and resource management.

Authors: Daniele Meli, Alberto Castellini, Alessandro Farinelli

Link: https://arxiv.org/abs/2402.19265v1

Date: 2024-02-29

Summary:

Partially Observable Markov Decision Processes (POMDPs) are a powerful framework for planning under uncertainty. They allow state uncertainty to be modeled as a belief probability distribution. Approximate solvers based on Monte Carlo sampling have shown great success in relaxing the computational demand and performing online planning. However, scaling to complex realistic domains with many actions and long planning horizons is still a major challenge, and a key to achieving good performance is guiding the action-selection process with domain-dependent policy heuristics which are tailored for the specific application domain. We propose to learn high-quality heuristics from POMDP traces of executions generated by any solver. We convert the belief-action pairs to a logical semantics, and exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications, which are then used as online heuristics. We thoroughly evaluate our methodology on two notoriously challenging POMDP problems, involving large action spaces and long planning horizons, namely, rocksample and pocman. Considering different state-of-the-art online POMDP solvers, including POMCP, DESPOT and AdaOPS, we show that learned heuristics expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics within lower computational time. Moreover, they generalize well to more challenging scenarios not experienced in the training phase (e.g., increasing rocks and grid size in rocksample, incrementing the size of the map and the aggressivity of ghosts in pocman).

--------------------------------------------------------------------------------------------------------

Whispers that Shake Foundations: Analyzing and Mitigating False Premise Hallucinations in Large Language Models

This paper analyzes the false premise hallucination issue in large language models, where they generate incorrect responses to prompts with false premises. It proposes an effective mitigation method constraining a small subset of attention heads responsible for this failure mode, significantly improving model performance. The technique could bolster LLM reliability and safety across applications handling complex queries.

Authors: Hongbang Yuan, Pengfei Cao, Zhuoran Jin, Yubo Chen, Daojian Zeng, Kang Liu, Jun Zhao

Link: https://arxiv.org/abs/2402.19103v1

Date: 2024-02-29

Summary:

Large Language Models (LLMs) have shown impressive capabilities but still suffer from the issue of hallucinations. A significant type of this issue is the false premise hallucination, which we define as the phenomenon when LLMs generate hallucinated text when confronted with false premise questions. In this paper, we perform a comprehensive analysis of the false premise hallucination and elucidate its internal working mechanism: a small subset of attention heads (which we designate as false premise heads) disturb the knowledge extraction process, leading to the occurrence of false premise hallucination. Based on our analysis, we propose FAITH (False premise Attention head constraIning for miTigating Hallucinations), a novel and effective method to mitigate false premise hallucinations. It constrains the false premise attention heads during the model inference process. Impressively, extensive experiments demonstrate that constraining only approximately 1% of the attention heads in the model yields a notable increase of nearly 20% of model performance.
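
As a rough picture of what constraining a handful of attention heads at inference time could look like, the Python sketch below down-weights the contribution of selected heads before the per-head contributions are summed into a layer output. This is an illustration under assumptions, not the FAITH procedure: the abstract does not say how heads are identified or how strongly they are constrained, so the head indices, the scaling factor, and the function name are all hypothetical.

```python
def combine_heads(head_outputs, constrained_heads, factor=0.0):
    """Sum per-head contributions, scaling the constrained heads by `factor`.

    head_outputs: list of equal-length vectors, one per attention head,
                  each taken to be that head's contribution to the layer output.
    constrained_heads: set of head indices to down-weight (hypothetical).
    factor: 0.0 removes a constrained head; 1.0 leaves it untouched.
    """
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for index, output in enumerate(head_outputs):
        weight = factor if index in constrained_heads else 1.0
        for j, value in enumerate(output):
            combined[j] += weight * value
    return combined


if __name__ == "__main__":
    heads = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
    print(combine_heads(heads, constrained_heads=set()))
    print(combine_heads(heads, constrained_heads={2}))
```

Because only the listed heads are touched, the rest of the model's computation is unchanged, which matches the abstract's point that intervening on roughly 1% of heads is enough to matter.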

--------------------------------------------------------------------------------------------------------

Negative-Binomial Randomized Gamma Markov Processes for Heterogeneous Overdispersed Count Time Series

Analyzing count data time series is crucial across physical and social domains, but existing methods struggle with overdispersed, heterogeneous sequences. This paper proposes a negative-binomial randomized gamma Markov process to better capture these characteristics, improving predictive performance and inference algorithm convergence. The approach enables learning more explainable latent structures compared to previous techniques. Potential applications include accurate forecasting and missing data imputation for phenomena ranging from network traffic to disease incidence.

Authors: Rui Huang, Sikun Yang, Heinz Koeppl

Link: https://arxiv.org/abs/2402.18995v1

Date: 2024-02-29

Summary:

Modeling count-valued time series has been receiving increasing attention since count time series naturally arise in physical and social domains. Poisson gamma dynamical systems (PGDSs) are newly-developed methods, which can well capture the expressive latent transition structure and bursty dynamics behind count sequences. In particular, PGDSs demonstrate superior performance in terms of data imputation and prediction, compared with canonical linear dynamical system (LDS) based methods. Despite these advantages, PGDS cannot capture the heterogeneous overdispersed behaviours of the underlying dynamic processes. To mitigate this defect, we propose a negative-binomial-randomized gamma Markov process, which not only significantly improves the predictive performance of the proposed dynamical system, but also facilitates the fast convergence of the inference algorithm. Moreover, we develop methods to estimate both factor-structured and graph-structured transition dynamics, which enable us to infer more explainable latent structure, compared with PGDSs. Finally, we demonstrate the explainable latent structure learned by the proposed method, and show its superior performance in imputing missing data and forecasting future observations, compared with the related models.

--------------------------------------------------------------------------------------------------------

SemEval 2024 -- Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF)

Understanding emotions and what triggers emotion shifts in conversations is essential for applications like mental health support and opinion mining. This shared task evaluates systems on three subtasks: emotion recognition in code-mixed Hindi-English dialogues, emotion flip reasoning in code-mixed dialogues, and emotion flip reasoning in English dialogues. The best systems achieved F1 scores of 0.70, 0.79, and 0.76 on the respective subtasks, highlighting the value of developing computational models for interpreting the emotional dynamics of conversations.

Authors: Shivani Kumar, Md Shad Akhtar, Erik Cambria, Tanmoy Chakraborty

Link: https://arxiv.org/abs/2402.18944v1

Date: 2024-02-29

Summary:

We present SemEval-2024 Task 10, a shared task centred on identifying emotions and finding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for code-mixed dialogues, emotion flip reasoning for code-mixed dialogues, and emotion flip reasoning for English dialogues. Participating systems were tasked to automatically execute one or more of these subtasks. The datasets for these tasks comprise manually annotated conversations focusing on emotions and triggers for emotion shifts (The task data is available at https://github.com/LCS2-IIITD/EDiReF-SemEval2024.git). A total of 84 participants engaged in this task, with the most adept systems attaining F1-scores of 0.70, 0.79, and 0.76 for the respective subtasks. This paper summarises the results and findings from 24 teams alongside their system descriptions.

--------------------------------------------------------------------------------------------------------

AdaMergeX: Cross-Lingual Transfer with Large Language Models via Adaptive Adapter Merging

While large language models demonstrate impressive capabilities across languages, adapting them to new language-task combinations remains challenging due to data scarcity. This paper proposes AdaMergeX, a cross-lingual transfer approach that decouples task and language abilities via adapter merging, outperforming prior methods. The technique could accelerate customizing large language models for applications requiring multilingual support without extensive per-language data collection.

Authors: Yiran Zhao, Wenxuan Zhang, Huiming Wang, Kenji Kawaguchi, Lidong Bing

Link: https://arxiv.org/abs/2402.18913v1

Date: 2024-02-29

Summary:

As an effective alternative to the direct fine-tuning on target tasks in specific languages, cross-lingual transfer addresses the challenges of limited training data by decoupling "task ability" and "language ability" by fine-tuning on the target task in the source language and another selected task in the target language, respectively. However, they fail to fully separate the task ability from the source language or the language ability from the chosen task. In this paper, we acknowledge the mutual reliance between task ability and language ability and direct our attention toward the gap between the target language and the source language on tasks. As the gap removes the impact of tasks, we assume that it remains consistent across tasks. Based on this assumption, we propose a new cross-lingual transfer method called AdaMergeX that utilizes adaptive adapter merging. By introducing a reference task, we can determine that the divergence of adapters fine-tuned on the reference task in both languages follows the same distribution as the divergence of adapters fine-tuned on the target task in both languages. Hence, we can obtain target adapters by combining the other three adapters. Furthermore, we propose a structure-adaptive adapter merging method. Our empirical results demonstrate that our approach yields new and effective cross-lingual transfer, outperforming existing methods across all settings.
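
The "combine the other three adapters" step can be sketched as simple arithmetic on adapter weights. The Python sketch below assumes an element-wise additive merge: take the adapter for the target task in the source language and add the language gap measured on the reference task (target-language adapter minus source-language adapter). The additive form, the scale parameter, the parameter name lora_A, and the function name are illustrative assumptions; the abstract says the actual merging is structure-adaptive, so the right operation may depend on the adapter type.

```python
def merge_adapters(task_src, ref_tgt, ref_src, scale=1.0):
    """Estimate the target-task, target-language adapter from three others.

    task_src: adapter for the target task in the source language
    ref_tgt:  adapter for the reference task in the target language
    ref_src:  adapter for the reference task in the source language
    Each adapter is a dict mapping parameter names to flat lists of floats.
    Assumes an element-wise additive language gap (illustrative only).
    """
    merged = {}
    for name, weights in task_src.items():
        gap = [rt - rs for rt, rs in zip(ref_tgt[name], ref_src[name])]
        merged[name] = [w + scale * g for w, g in zip(weights, gap)]
    return merged


if __name__ == "__main__":
    # Hypothetical two-element adapters, for illustration only.
    task_src = {"lora_A": [1.0, 2.0]}
    ref_tgt = {"lora_A": [0.5, 0.5]}
    ref_src = {"lora_A": [0.25, 1.0]}
    print(merge_adapters(task_src, ref_tgt, ref_src))
```

The appeal of this construction is that no labeled data for the target task in the target language is needed; only the reference task has to be fine-tuned in both languages.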

--------------------------------------------------------------------------------------------------------

How do Large Language Models Handle Multilingualism?

As large language models excel at multilingual tasks, understanding how they process different languages is valuable for interpretation and performance improvement. This work introduces a framework depicting LLMs' multilingual processing flow and investigates language-specific neuron activation patterns. The insights could guide enhancing multilingual abilities for applications like machine translation, cross-lingual information retrieval, and education.

Authors: Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, Lidong Bing

Link: https://arxiv.org/abs/2402.18815v1

Date: 2024-02-29

Summary:

Large language models (LLMs) demonstrate remarkable performance across a spectrum of languages. In this work, we delve into the question: How do LLMs handle multilingualism? We introduce a framework that depicts LLMs' processing of multilingual inputs: In the first several layers, LLMs understand the question, converting multilingual inputs into English to facilitate the task-solving phase. In the intermediate layers, LLMs engage in problem-solving by thinking in English and incorporating multilingual knowledge to obtain factual content, leveraging the self-attention and feed-forward structures, respectively. In the last several layers, LLMs generate responses that align with the original language of the query. In addition, we investigate the existence of language-specific neurons when processing a certain language. To detect neurons activated by the input language, even without labels, we innovatively design a Parallel Language specific Neuron Detection (PLND) method that effectively measures the significance of neurons when handling multilingual inputs. By comprehensive ablation analysis through deactivating neurons of different layers and structures, we verify the framework that we propose. Additionally, we demonstrate that we can utilize such a framework to effectively enhance the multilingual ability with much less training effort.

--------------------------------------------------------------------------------------------------------

On the Decision-Making Abilities in Role-Playing using Large Language Models

While role-playing enables large language models to embody different personas, evaluating their acquired decision-making patterns is crucial for real-world applications. This paper quantifies LLM decision abilities across adaptability, exploration-exploitation, reasoning, and safety when role-playing personality types from MBTI. The analysis reveals correlations between decision profiles and roles, underscoring LLMs' capability to internalize sociological characteristics - insights valuable for virtual assistants, training simulations, and human-AI interaction design.

Authors: Chenglei Shen, Guofu Xie, Xiao Zhang, Jun Xu

Link: https://arxiv.org/abs/2402.18807v1

Date: 2024-02-29

Summary:

Large language models (LLMs) are now increasingly utilized for role-playing tasks, especially in impersonating domain-specific experts, primarily through role-playing prompts. When interacting in real-world scenarios, the decision-making abilities of a role significantly shape its behavioral patterns. In this paper, we concentrate on evaluating the decision-making abilities of LLMs post role-playing, thereby validating the efficacy of role-playing. Our goal is to provide metrics and guidance for enhancing the decision-making abilities of LLMs in role-playing tasks. Specifically, we first use LLMs to generate virtual role descriptions corresponding to the 16 personality types of Myers-Briggs Type Indicator (abbreviated as MBTI) representing a segmentation of the population. Then we design specific quantitative operations to evaluate the decision-making abilities of LLMs post role-playing from four aspects: adaptability, exploration-exploitation trade-off ability, reasoning ability, and safety. Finally, we analyze the association between the performance of decision-making and the corresponding MBTI types through GPT-4. Extensive experiments demonstrate stable differences in the four aspects of decision-making abilities across distinct roles, signifying a robust correlation between decision-making abilities and the roles emulated by LLMs. These results underscore that LLMs can effectively impersonate varied roles while embodying their genuine sociological characteristics.

--------------------------------------------------------------------------------------------------------

A revision on Multi-Criteria Decision Making methods for Multi-UAV Mission Planning Support

Planning complex multi-UAV missions requires optimizing multiple variables like makespan and risk simultaneously. This work designs a decision support system that ranks and filters optimal mission plans using various multi-criteria decision making methods. Results indicate fuzzy methods perform best overall, while all methods improve when user preferences prioritize specific variables over balance. The approach could aid mission operators in diverse UAV applications by reducing workload in selecting plans.

Authors: Cristian Ramirez-Atencia, Victor Rodriguez-Fernandez, David Camacho

Link: https://arxiv.org/abs/2402.18743v1

Date: 2024-02-28

Summary:

Over the last decade, Unmanned Aerial Vehicles (UAVs) have been extensively used in many commercial applications due to their manageability and risk avoidance. One of the main problems considered is the Mission Planning for multiple UAVs, where a solution plan must be found satisfying the different constraints of the problem. This problem has multiple variables that must be optimized simultaneously, such as the makespan, the cost of the mission or the risk. Therefore, the problem has a lot of possible optimal solutions, and the operator must select the final solution to be executed among them. In order to reduce the workload of the operator in this decision process, a Decision Support System (DSS) becomes necessary. In this work, a DSS consisting of ranking and filtering systems, which order and reduce the optimal solutions, has been designed. With regard to the ranking system, a wide range of Multi-Criteria Decision Making (MCDM) methods, including some fuzzy MCDM, are compared on a multi-UAV mission planning scenario, in order to study which method could fit better in a multi-UAV decision support system. Expert operators have evaluated the solutions returned, and the results show, on the one hand, that fuzzy methods generally achieve better average scores, and on the other, that all of the tested methods perform better when the preferences of the operators are biased towards a specific variable, and worse when their preferences are balanced. For the filtering system, a similarity function based on the proximity of the solutions has been designed, and on top of that, a threshold is tuned empirically to decide how to filter solutions without losing much of the hypervolume of the space of solutions.
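To make the ranking step concrete, here is a minimal sketch of TOPSIS, one widely used MCDM method. The summary does not list which methods were compared, so treat the choice of TOPSIS, the function name topsis_rank, the three criteria, and all numbers as illustrative assumptions rather than the authors' setup.

```python
import math

def topsis_rank(plans, weights, benefit):
    """Rank alternatives with TOPSIS; returns (indices best-first, closeness scores).

    plans:   one row per candidate mission plan, one column per criterion
    weights: importance of each criterion (operator preferences)
    benefit: True where larger is better, False where smaller is better
    """
    n_crit = len(weights)
    # Vector-normalise each column so criteria with different units are comparable.
    norms = [math.sqrt(sum(row[j] ** 2 for row in plans)) or 1.0 for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in plans]
    # Ideal and anti-ideal points, criterion by criterion.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, ideal)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        # Closeness to the ideal: 1 is best, 0 is worst.
        scores.append(d_worst / total if total else 0.0)
    order = sorted(range(len(plans)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Hypothetical plans: columns are makespan, cost, risk (all to be minimised).
plans = [[10, 5, 0.2], [8, 7, 0.3], [12, 4, 0.1]]
minimise_all = [False, False, False]

# Operator preferences biased towards makespan, then towards risk.
print(topsis_rank(plans, [0.8, 0.1, 0.1], minimise_all)[0])
print(topsis_rank(plans, [0.1, 0.1, 0.8], minimise_all)[0])
```

Changing the weight vector is how operator preferences enter the ranking: a vector concentrated on one criterion pulls the plan that is best on that criterion to the top.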

--------------------------------------------------------------------------------------------------------

GAIA: Categorical Foundations of Generative AI

This paper proposes GAIA, a hierarchical generative AI architecture formulated using category theory concepts like simplicial complexes and lifting diagrams. The framework models aspects of deep learning like backpropagation in a novel categorical manner. While theoretical, GAIA illustrates how category theory could provide new perspectives on understanding and developing complex AI systems.

Authors: Sridhar Mahadevan

Link: https://arxiv.org/abs/2402.18732v1

Date: 2024-02-28

Summary:

In this paper, we propose GAIA, a generative AI architecture based on category theory. GAIA is based on a hierarchical model where modules are organized as a simplicial complex. Each simplicial complex updates its internal parameters based on information it receives from its superior simplices and in turn relays updates to its subordinate sub-simplices. Parameter updates are formulated in terms of lifting diagrams over simplicial sets, where inner and outer horn extensions correspond to different types of learning problems. Backpropagation is modeled as an endofunctor over the category of parameters, leading to a coalgebraic formulation of deep learning.

--------------------------------------------------------------------------------------------------------

Learning Associative Memories with Gradient Descent

By studying how gradient descent trains associative memory modules storing token embeddings, this work derives insights into training dynamics. It uncovers phenomena like logarithmic margin growth in overparameterization and oscillatory regimes from imbalanced data. The analysis elucidates trade-offs between learning rate, convergence speed and suboptimal memorization. These findings could guide more efficient, robust associative memory training crucial for language models and other systems.

Authors: Vivien Cabannes, Berfin Simsek, Alberto Bietti

Link: https://arxiv.org/abs/2402.18724v1

Date: 2024-02-28

Summary:

This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic growth of the "classification margins." Yet, we show that imbalance in token frequencies and memory interferences due to correlated embeddings lead to oscillatory transitory regimes. The oscillations are more pronounced with large step sizes, which can create benign loss spikes, although these learning rates speed up the dynamics and accelerate the asymptotic convergence. In underparameterized regimes, we illustrate how the cross-entropy loss can lead to suboptimal memorization schemes. Finally, we assess the validity of our findings on small Transformer models.
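The setting can be illustrated with a toy sketch: a matrix W scores output token k for input token x as u_k^T W e_x, and each gradient step on the cross-entropy loss adds a weighted sum of outer products u_k e_x^T. The one-hot embeddings, the three-token mapping, the step counts, and every function name below are assumptions made for illustration, not the paper's experimental setup.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logits(W, e_x, out_emb):
    # Score of output token k is u_k^T W e_x.
    We = [dot(row, e_x) for row in W]
    return [dot(u, We) for u in out_emb]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def train_memory(pairs, in_emb, out_emb, steps, lr=1.0):
    """Full-batch gradient descent on the cross-entropy loss, starting from W = 0."""
    d_out, d_in = len(out_emb[0]), len(in_emb[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(steps):
        grad = [[0.0] * d_in for _ in range(d_out)]
        for x, y in pairs:
            probs = softmax(logits(W, in_emb[x], out_emb))
            for k, u in enumerate(out_emb):
                coeff = (probs[k] - (1.0 if k == y else 0.0)) / len(pairs)
                # Each term is the outer product u_k e_x^T scaled by the softmax error.
                for i in range(d_out):
                    for j in range(d_in):
                        grad[i][j] += coeff * u[i] * in_emb[x][j]
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * grad[i][j]
    return W

def margin(W, x, y, in_emb, out_emb):
    # Gap between the correct token's score and the best wrong token's score.
    z = logits(W, in_emb[x], out_emb)
    return z[y] - max(v for k, v in enumerate(z) if k != y)

# Orthogonal (one-hot) embeddings with d = N, i.e. an overparameterized toy case.
one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pairs = [(0, 1), (1, 2), (2, 0)]  # hypothetical input -> output token mapping
for steps in (50, 500):
    W = train_memory(pairs, one_hot, one_hot, steps)
    print(steps, round(margin(W, 0, 1, one_hot, one_hot), 3))
```

In this toy case the margin keeps growing but ever more slowly as the step count increases tenfold, which is consistent with the logarithmic growth the summary describes. Correlated embeddings or unequal token frequencies, the sources of oscillation named above, are deliberately left out.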

--------------------------------------------------------------------------------------------------------

Approaching Human-Level Forecasting with Language Models

Accurate forecasting of future events is crucial for informed policy and decision making across domains. This work investigates whether large language models can match the forecasting abilities of expert human forecasters. By developing a retrieval-augmented language model system to automatically gather information, generate forecasts, and aggregate predictions, the authors evaluate performance against aggregated human forecasts on a new test set. Remarkably, their system approaches and sometimes exceeds the accuracy of competitive human forecasters. These findings suggest language models could provide scalable, high-quality forecasting capabilities to guide institutional decision-making processes.

Authors: Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt

Link: https://arxiv.org/abs/2402.18563v1

Date: 2024-02-28

Summary:

Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.
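As a rough illustration of the aggregate-then-evaluate step, the sketch below averages several sampled probabilities per question and scores the result with the Brier score, a standard metric for probabilistic forecasts of binary events. The summary names neither the aggregation rule nor the metric, so the plain mean, the Brier score, the function names, and all numbers here are assumptions.

```python
from statistics import mean

def aggregate(probabilities):
    # Simple mean of the individual forecasts (an assumed aggregation rule).
    return mean(probabilities)

def brier_score(forecasts, outcomes):
    # Mean squared error between forecast probability and the 0/1 outcome.
    # Lower is better; always answering 0.5 scores 0.25.
    return mean((p - o) ** 2 for p, o in zip(forecasts, outcomes))

# Hypothetical data: three sampled forecasts per question, then resolved outcomes.
system_samples = [[0.7, 0.8, 0.9], [0.2, 0.1, 0.3]]
crowd_forecasts = [0.75, 0.30]
outcomes = [1, 0]

system_forecasts = [aggregate(s) for s in system_samples]
print("system:", brier_score(system_forecasts, outcomes))
print("crowd: ", brier_score(crowd_forecasts, outcomes))
```

Comparing the two scores on questions that resolved after the model's knowledge cut-off is the kind of end-to-end check the summary describes.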

--------------------------------------------------------------------------------------------------------

A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist

Financial trading is a complex endeavor requiring synthesizing diverse data streams like news, market data, and charts. While AI techniques are widely applied, key challenges remain in adequately handling multimodal inputs and generalizing across trading tasks. This paper introduces FinAgent, a pioneering multimodal foundation agent tailored for financial trading. Uniquely integrating market intelligence, dual-level reflection, diversified memory, and human-inspired strategies, FinAgent demonstrates substantial performance gains over prior methods on multiple financial metrics across stock and crypto trading datasets. FinAgent represents a significant milestone towards developing trusted, generalist AI assistants that can robustly navigate the dynamic, multimodal realm of financial markets.

Authors: Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, Longtao Zheng, Xinrun Wang, Bo An

Link: https://arxiv.org/abs/2402.18485v2

Date: 2024-02-29

Summary:

Financial trading is a crucial component of the markets, informed by a multimodal information landscape encompassing news, prices, and Kline charts, and encompasses diverse tasks such as quantitative trading and high-frequency trading with various assets. While advanced AI techniques like deep learning and reinforcement learning are extensively utilized in finance, their application in financial trading tasks often faces challenges due to inadequate handling of multimodal data and limited generalizability across various tasks. To address these challenges, we present FinAgent, a multimodal foundational agent with tool augmentation for financial trading. FinAgent's market intelligence module processes a diverse range of data (numerical, textual, and visual) to accurately analyze the financial market. Its unique dual-level reflection module not only enables rapid adaptation to market dynamics but also incorporates a diversified memory retrieval system, enhancing the agent's ability to learn from historical data and improve decision-making processes. The agent's emphasis on reasoning for actions fosters trust in its financial decisions. Moreover, FinAgent integrates established trading strategies and expert insights, ensuring that its trading approaches are both data-driven and rooted in sound financial principles. With comprehensive experiments on 6 financial datasets, including stocks and Crypto, FinAgent significantly outperforms 9 state-of-the-art baselines in terms of 6 financial metrics with over 36% average improvement on profit. Specifically, a 92.27% return (an 84.39% relative improvement) is achieved on one dataset. Notably, FinAgent is the first advanced multimodal foundation agent designed for financial trading tasks.

--------------------------------------------------------------------------------------------------------


Artificial Intelligence, Research Watch / Craig Smith / March 4, 2024