Week Ending 8.25.2024

 

RESEARCH WATCH: 8.25.2024

 

Avatar Visual Similarity for Social HCI: Increasing Self-Awareness

This research explores how adjusting the visual similarity of a user's avatar to their real appearance can be used to enhance self-awareness during VR training exercises. This could be beneficial for applications like public speaking practice or social anxiety treatment.

Authors:  Bernhard Hilpert, Claudio Alves da Silva, Leon Christidis, Chirag Bhuvaneshwara, Patrick Gebhard, Fabrizio Nunnari, Dimitra Tsovaltzi

Link:  https://arxiv.org/abs/2408.13084v1

Date: 2024-08-23

Summary:

Self-awareness is a critical factor in social human-human interaction and, hence, in social human-computer interaction (HCI). Increasing self-awareness through mirrors or video recordings is common in face-to-face trainings, since it influences antecedents of self-awareness like explicit identification and implicit affective identification (affinity). However, increasing self-awareness has been scarcely examined in virtual trainings with virtual avatars, which allow for adjusting the similarity, e.g., to avoid negative effects of self-consciousness. Automatic visual similarity in avatars is an open issue related to high costs. It is important to understand which features need to be manipulated and which degree of similarity is necessary for self-awareness in order to leverage the added value of avatars for self-awareness. This article examines the relationship between avatar visual similarity and increasing self-awareness in virtual training environments. We define visual similarity based on perceptually important facial features for human-human identification and develop a theory-based methodology to systematically manipulate the visual similarity of virtual avatars and support self-awareness. Three personalized versions of virtual avatars with varying degrees of visual similarity to participants were created (weak, medium and strong facial feature manipulation). In a within-subject study (N=33), we tested the effects of degree of similarity on perceived similarity, explicit identification and implicit affective identification (affinity). Results show significant differences between the weak similarity manipulation and both the strong manipulation and the random avatar for all three antecedents of self-awareness. An increasing degree of avatar visual similarity influences antecedents of self-awareness in virtual environments.

--------------------------------------------------------------------------------------------------------

IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities

This paper proposes a new architecture that gives frozen LLMs (LLMs whose weights are kept fixed during multimodal training) multimodal capabilities. This could be used to develop AI systems that can understand and respond to complex user queries that involve both text and visual information.

Authors:  Bin Wang, Chunyu Xie, Dawei Leng, Yuhui Yin

Link:  https://arxiv.org/abs/2408.12902v1

Date: 2024-08-23

Summary:

In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the fine-tuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models are available at https://github.com/360CVGroup/Inner-Adaptor-Architecture.
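
To make the idea concrete, here is a minimal PyTorch sketch of inner adaptors inserted at selected depths of a frozen transformer. The block design, adaptor depths, and cross-attention formulation are illustrative assumptions, not the authors' implementation (see their repository for the real code).

```python
import torch.nn as nn

class MultimodalAdaptor(nn.Module):
    """Trainable block letting visual tokens interact with frozen LLM hidden states."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))

    def forward(self, hidden, visual_tokens):
        attended, _ = self.cross_attn(hidden, visual_tokens, visual_tokens)
        return hidden + self.proj(attended)      # residual: the frozen text path is preserved

class FrozenLLMWithInnerAdaptors(nn.Module):
    def __init__(self, llm_layers: nn.ModuleList, d_model: int, adaptor_depths=(8, 16, 24)):
        super().__init__()
        self.layers = llm_layers
        for p in self.layers.parameters():       # the language model stays frozen
            p.requires_grad = False
        self.adaptors = nn.ModuleDict({str(d): MultimodalAdaptor(d_model)
                                       for d in adaptor_depths})

    def forward(self, hidden, visual_tokens):
        for i, layer in enumerate(self.layers):
            hidden = layer(hidden)
            if str(i) in self.adaptors:          # inner adaptors only at selected depths
                hidden = self.adaptors[str(i)](hidden, visual_tokens)
        return hidden
```

Only the adaptor parameters receive gradients during multimodal training, which is what lets the frozen LLM keep its original NLP behavior.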

--------------------------------------------------------------------------------------------------------

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

This paper introduces a new text-to-video synthesis model that can generate high-quality videos while being more computationally efficient than previous models. This could be used for a variety of applications, such as creating explainer videos from text summaries or generating video content for social media platforms.

Authors:  Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong

Link:  https://arxiv.org/abs/2408.12590v1

Date: 2024-08-22

Summary:

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.
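
As a rough illustration of the divide-and-merge idea, the sketch below processes a long frame sequence in overlapping segments and cross-fades the overlaps; the segment length, overlap, and linear blend are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def divide_and_merge(frames, process, seg_len=16, overlap=4):
    """Process a long frame sequence in overlapping segments, blending the overlaps."""
    T = frames.shape[0]
    out = np.zeros(frames.shape, dtype=float)
    weight = np.zeros(T)
    bshape = (-1,) + (1,) * (frames.ndim - 1)        # broadcast weights over frame dims
    start = 0
    while start < T:
        end = min(start + seg_len, T)
        seg = process(frames[start:end].astype(float))   # per-segment encoding/generation
        w = np.ones(end - start)
        ramp = min(overlap, end - start)
        if start > 0:
            w[:ramp] = np.linspace(0.0, 1.0, ramp)       # fade into the previous segment
        out[start:end] += seg * w.reshape(bshape)
        weight[start:end] += w
        if end == T:
            break
        start = end - overlap                            # overlap keeps segments consistent
    return out / weight.reshape(bshape)

# Example: identity "processing" of a random 40-frame clip in 16-frame segments
video = np.random.rand(40, 8, 8, 3)
merged = divide_and_merge(video, process=lambda x: x)
```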

--------------------------------------------------------------------------------------------------------

Graph Retrieval Augmented Trustworthiness Reasoning

This paper proposes a new framework for trustworthiness reasoning in multi-agent systems. This could be used to develop more sophisticated AI systems that can collaborate effectively with humans or other AI systems.

Authors:  Ying Zhu, Shengchang Li, Ziqian Kong, Peilan Xu

Link:  https://arxiv.org/abs/2408.12333v1

Date: 2024-08-22

Summary:

Trustworthiness reasoning is crucial in multiplayer games with incomplete information, enabling agents to identify potential allies and adversaries, thereby enhancing reasoning and decision-making processes. Traditional approaches relying on pre-trained models necessitate extensive domain-specific data and considerable reward feedback, with their lack of real-time adaptability hindering their effectiveness in dynamic environments. In this paper, we introduce the Graph Retrieval Augmented Reasoning (GRATR) framework, leveraging the Retrieval-Augmented Generation (RAG) technique to bolster trustworthiness reasoning in agents. GRATR constructs a dynamic trustworthiness graph, updating it in real-time with evidential information, and retrieves relevant trust data to augment the reasoning capabilities of Large Language Models (LLMs). We validate our approach through experiments on the multiplayer game "Werewolf," comparing GRATR against a baseline LLM and an LLM enhanced with Native RAG and Rerank RAG. Our results demonstrate that GRATR surpasses the baseline methods by over 30% in winning rate, with superior reasoning performance. Moreover, GRATR effectively mitigates LLM hallucinations, such as identity and objective amnesia, and crucially, it renders the reasoning process more transparent and traceable through the use of the trustworthiness graph.
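
A toy sketch of the core loop, kept deliberately simple: a trust graph is updated with evidence as the game unfolds, and the strongest trust records about an agent are retrieved as extra context for the LLM. The data structure and scoring below are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

class TrustGraph:
    def __init__(self):
        # edges[(observer, target)] -> list of (evidence, signed weight)
        self.edges = defaultdict(list)

    def add_evidence(self, observer, target, evidence, weight):
        """Update the graph in real time as new evidence is observed."""
        self.edges[(observer, target)].append((evidence, weight))

    def trust(self, observer, target):
        recs = self.edges[(observer, target)]
        return sum(w for _, w in recs) / max(len(recs), 1)

    def retrieve(self, agent, top_k=3):
        """Return the strongest (positive or negative) trust records about `agent`."""
        scored = [(abs(self.trust(o, t)), o, t) for (o, t) in self.edges if t == agent]
        return sorted(scored, reverse=True)[:top_k]

g = TrustGraph()
g.add_evidence("Alice", "Bob", "Bob accused a confirmed villager", -0.8)
g.add_evidence("Carol", "Bob", "Bob defended Alice's claim", 0.4)
context = g.retrieve("Bob")   # prepend to the LLM prompt before reasoning about Bob
```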

--------------------------------------------------------------------------------------------------------

Do Neural Scaling Laws Exist on Graph Self-Supervised Learning?

This paper investigates whether existing self-supervised learning (SSL) techniques for graphs follow neural scaling laws, i.e., whether downstream performance improves as models and datasets grow. This research could inform the development of new AI systems that can learn from and reason about complex relationships between entities.

Authors:  Qian Ma, Haitao Mao, Jingzhe Liu, Zhehua Zhang, Chunlin Feng, Yu Song, Yihan Shao, Tianfan Fu, Yao Ma

Link:  https://arxiv.org/abs/2408.11243v1

Date: 2024-08-20

Summary:

Self-supervised learning (SSL) is essential for obtaining foundation models in the NLP and CV domains by effectively leveraging knowledge in large-scale unlabeled data. The reason for its success is that a suitable SSL design can help the model follow the neural scaling law, i.e., performance consistently improves with increasing model and dataset sizes. However, it remains a mystery whether existing SSL in the graph domain can follow this scaling behavior toward building Graph Foundation Models (GFMs) with large-scale pre-training. In this study, we examine whether existing graph SSL techniques can follow the neural scaling behavior and thus have the potential to serve as an essential component for GFMs. Our benchmark includes comprehensive SSL technique implementations, with analysis conducted in both the conventional SSL setting and many new settings adopted from other domains. Surprisingly, despite the SSL loss continuously decreasing, no existing graph SSL technique follows the neural scaling behavior in downstream performance: model performance merely fluctuates across different data and model scales. Instead of scale, the key factors influencing performance are the choices of model architecture and pretext task design. This paper examines the feasibility of existing graph SSL techniques for developing GFMs and opens a new direction for graph SSL design with the new evaluation prototype. Our code implementation is available online to ease reproducibility at https://github.com/GraphSSLScaling/GraphSSLScaling.

--------------------------------------------------------------------------------------------------------

Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification

This paper proposes attention head masking (AHM), a new method for detecting out-of-distribution data in multimodal document classification systems. This research could improve the reliability and safety of deployed AI systems that process documents combining text and visual information.

Authors:  Christos Constantinou, Georgios Ioannides, Aman Chadha, Aaron Elkins, Edwin Simpson

Link:  https://arxiv.org/abs/2408.11237v1

Date: 2024-08-20

Summary:

Detecting out-of-distribution (OOD) data is crucial in machine learning applications to mitigate the risk of model overconfidence, thereby enhancing the reliability and safety of deployed systems. The majority of existing OOD detection methods predominantly address uni-modal inputs, such as images or texts. In the context of multi-modal documents, there is a notable lack of extensive research on the performance of these methods, which have primarily been developed with a focus on computer vision tasks. We propose a novel methodology termed attention head masking (AHM) for multi-modal OOD tasks in document classification systems. Our empirical results demonstrate that the proposed AHM method outperforms all state-of-the-art approaches and significantly decreases the false positive rate (FPR) by up to 7.5% compared to existing solutions. This methodology generalizes well to multi-modal data, such as documents, where visual and textual information are modeled under the same Transformer architecture. To address the scarcity of high-quality publicly available document datasets and encourage further research on OOD detection for documents, we introduce FinanceDocs, a new document AI dataset. Our code and dataset are publicly available.
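
A hedged sketch of the general recipe: extract features with selected attention heads masked, then flag documents that lie far from the in-distribution class centroids. The masking and distance-based scoring below are illustrative stand-ins, not the authors' exact formulation.

```python
import numpy as np

def masked_features(attn_outputs, head_mask):
    """attn_outputs: (n_heads, d) per-head features; head_mask: (n_heads,) of 0/1."""
    return (attn_outputs * head_mask[:, None]).reshape(-1)

def ood_score(feature, class_means):
    """Smaller distance -> more in-distribution; larger -> more likely OOD."""
    return min(np.linalg.norm(feature - mu) for mu in class_means)

# Usage: fit class_means on in-distribution training features extracted with the
# chosen head mask, then flag test documents whose ood_score exceeds a threshold
# selected on a validation set.
```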

--------------------------------------------------------------------------------------------------------

Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation

This paper proposes a new method for identifying potential safety vulnerabilities in LLMs. This research could be used to develop techniques for mitigating the risk of LLMs generating harmful content.

Authors:  Haoyu Wang, Bingzhe Wu, Yatao Bian, Yongzhe Chang, Xueqian Wang, Peilin Zhao

Link:  https://arxiv.org/abs/2408.10668v2

Date: 2024-08-21

Summary:

Large Language Models (LLMs) are implicit troublemakers. While they provide valuable insights and assist in problem-solving, they can also potentially serve as a resource for malicious activities. Implementing safety alignment could mitigate the risk of LLMs generating harmful responses. We argue that even when an LLM appears to successfully block harmful queries, there may still be hidden vulnerabilities that could act as ticking time bombs. To identify these underlying weaknesses, we propose to use a cost value model as both a detector and an attacker. Trained on external or self-generated harmful datasets, the cost value model can successfully influence the original safe LLM to output toxic content during decoding. For instance, LLaMA-2-chat 7B outputs 39.18% concrete toxic content, along with only 22.16% refusals, without any harmful suffixes. These potential weaknesses can then be exploited via prompt optimization, such as soft prompts on images. We name this decoding strategy Jailbreak Value Decoding (JVD), emphasizing that seemingly secure LLMs may not be as safe as we initially believe. They could be used to gather harmful data or launch covert attacks.
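
The attack can be pictured as value-guided decoding. In the sketch below, `lm_topk` and `cost_value` are hypothetical stand-ins for the frozen LLM's top-k token proposals and the trained cost value model; the re-ranking rule is an assumption for illustration, not the paper's exact procedure.

```python
def value_guided_decode(prompt, lm_topk, cost_value, steps=50, k=10, alpha=1.0):
    """Greedy decoding where candidate tokens are re-ranked by an external value model."""
    text = prompt
    for _ in range(steps):
        candidates = lm_topk(text, k)          # [(token, log_prob), ...] from the safe LLM
        # the attacker's value model shifts probability mass toward its own objective
        scored = [(lp + alpha * cost_value(text + tok), tok) for tok, lp in candidates]
        text += max(scored)[1]
    return text
```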

--------------------------------------------------------------------------------------------------------

Data Augmentation Integrating Dialogue Flow and Style to Adapt Spoken Dialogue Systems to Low-Resource User Groups

This paper proposes a new method for improving the performance of spoken dialogue systems (SDSs) by using data augmentation techniques. This could be used to develop more inclusive AI systems that can effectively interact with a wider range of users.

Authors:  Zhiyang Qi, Michimasa Inaba

Link:  https://arxiv.org/abs/2408.10516v1

Date: 2024-08-20

Summary:

This study addresses the interaction challenges encountered by spoken dialogue systems (SDSs) when engaging with users who exhibit distinct conversational behaviors, particularly minors, in scenarios where data are scarce. We propose a novel data augmentation framework to enhance SDS performance for user groups with limited resources. Our approach leverages a large language model (LLM) to extract speaker styles and a pre-trained language model (PLM) to simulate dialogue act history. This method generates enriched and personalized dialogue data, facilitating improved interactions with unique user demographics. Extensive experiments validate the efficacy of our methodology, highlighting its potential to foster the development of more adaptive and inclusive dialogue systems.

--------------------------------------------------------------------------------------------------------

D-RMGPT: Robot-assisted collaborative tasks driven by large multimodal models

This paper proposes a new system that uses large multimodal models (LMMs) to enable robots to assist humans with complex collaborative tasks. This could revolutionize the way humans and robots work together in a variety of settings.

Authors:  M. Forlini, M. Babcinschi, G. Palmieri, P. Neto

Link:  https://arxiv.org/abs/2408.11761v1

Date: 2024-08-21

Summary:

Collaborative robots are increasingly popular for assisting humans at work and daily tasks. However, designing and setting up interfaces for human-robot collaboration is challenging, requiring the integration of multiple components, from perception and robot task control to the hardware itself. Frequently, this leads to highly customized solutions that rely on large amounts of costly training data, diverging from the ideal of flexible and general interfaces that empower robots to perceive and adapt to unstructured environments where they can naturally collaborate with humans. To overcome these challenges, this paper presents the Detection-Robot Management GPT (D-RMGPT), a robot-assisted assembly planner based on Large Multimodal Models (LMM). This system can assist inexperienced operators in assembly tasks without requiring any markers or previous training. D-RMGPT is composed of DetGPT-V and R-ManGPT. DetGPT-V, based on GPT-4V(vision), perceives the surrounding environment through one-shot analysis of prompted images of the current assembly stage and the list of components to be assembled. It identifies which components have already been assembled by analysing their features and assembly requirements. R-ManGPT, based on GPT-4, plans the next component to be assembled and generates the robot's discrete actions to deliver it to the human co-worker. Experimental tests on assembling a toy aircraft demonstrated that D-RMGPT is flexible and intuitive to use, achieving an assembly success rate of 83% while reducing the assembly time for inexperienced operators by 33% compared to the manual process. http://robotics-and-ai.github.io/LMMmodels/
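
At a high level, each assembly step follows the loop sketched below, with the two LMM calls left as stubs (`detect_assembled` and `plan_next` are hypothetical names standing in for the DetGPT-V and R-ManGPT prompts); the robot action interface is an assumption for illustration.

```python
def assembly_step(image, component_list, assembled_so_far, detect_assembled, plan_next, robot):
    """One perception-plan-deliver cycle of a D-RMGPT-style assembly assistant (sketch)."""
    detected = detect_assembled(image, component_list)        # one-shot perception of the scene
    assembled_so_far |= set(detected)
    remaining = [c for c in component_list if c not in assembled_so_far]
    if not remaining:
        return "done"
    next_component = plan_next(assembled_so_far, remaining)   # planner picks the next part
    robot.deliver(next_component)                             # discrete pick-and-deliver action
    return next_component
```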

--------------------------------------------------------------------------------------------------------

IDEA: Enhancing the rule learning ability of language agent through Induction, Deduction, and Abduction

Large language models (LLMs) struggle with learning rules through abduction (hypothesis generation) and refinement based on feedback. This paper introduces a new benchmark (RULEARN) and proposes the IDEA agent, which uses a structured approach (induction, deduction, abduction) to dynamically establish and apply rules, mimicking human-like reasoning.

Authors:  Kaiyu He, Zhiyu Chen

Link:  https://arxiv.org/abs/2408.10455v1

Date: 2024-08-19

Summary:

While large language models (LLMs) have been thoroughly evaluated for deductive and inductive reasoning, their proficiency in abductive reasoning and holistic rule learning in interactive environments remains less explored. This work introduces RULEARN, a novel benchmark specifically designed to assess the rule-learning ability of LLMs in interactive settings. In RULEARN, agents interact with the environment to gather observations and discern patterns, using these insights to solve problems. To further enhance the rule-learning capabilities of LLM agents within this benchmark, we propose IDEA agent, which integrates Induction, Deduction, and Abduction processes. IDEA agent refines this approach by leveraging a structured reasoning sequence: generating hypotheses through abduction, testing them via deduction, and refining them based on induction feedback. This sequence enables agents to dynamically establish and apply rules, mimicking human-like reasoning processes. Our evaluation of five representative LLMs indicates that while these models can generate plausible initial hypotheses, they often struggle with strategic interaction within the environment, effective incorporation of feedback, and adaptive refinement of their hypotheses. IDEA agent demonstrates significantly improved performance on the RULEARN benchmark, offering valuable insights for the development of agents capable of human-like rule-learning in real-world scenarios. We will release our code and data.
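
The induction-deduction-abduction cycle can be sketched as a simple agent loop; `propose_rule`, `predict`, and `revise_rule` stand in for LLM calls and are assumptions for illustration, not the authors' implementation.

```python
def idea_loop(observations, environment, propose_rule, predict, revise_rule, max_iters=5):
    """Abduction -> deduction -> induction loop over an interactive environment (sketch)."""
    rule = propose_rule(observations)                  # abduction: hypothesize a rule
    for _ in range(max_iters):
        action, expected = predict(rule, environment)  # deduction: derive a testable prediction
        outcome = environment.step(action)             # interact to gather feedback
        if outcome == expected:
            return rule                                # the rule survives the test
        observations.append((action, outcome))
        rule = revise_rule(rule, observations)         # induction: refine from feedback
    return rule
```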

--------------------------------------------------------------------------------------------------------

cc-DRL: a Convex Combined Deep Reinforcement Learning Flight Control Design for a Morphing Quadrotor

Morphing quadrotors offer better flight performance compared to traditional ones, but their complex dynamics make it difficult to design control systems. This paper proposes a novel control algorithm (cc-DRL) that combines model-free deep reinforcement learning with convex optimization to achieve effective position and attitude control for a class of morphing quadrotors.

Authors:  Tao Yang, Huai-Ning Wu, Jun-Wei Wang

Link:  https://arxiv.org/abs/2408.13054v1

Date: 2024-08-23

Summary:

Compared to common quadrotors, the shape change of morphing quadrotors endows them with better flight performance but also results in more complex flight dynamics. It is generally difficult or even impossible to establish an accurate mathematical model describing the complex flight dynamics of morphing quadrotors. To address the flight control design problem for morphing quadrotors, this paper combines model-free control techniques (e.g., deep reinforcement learning, DRL) with the convex combination (CC) technique and proposes a convex-combined-DRL (cc-DRL) flight control algorithm for the position and attitude of a class of morphing quadrotors whose shape change is realized by the length variation of four arm rods. In the proposed cc-DRL flight control algorithm, proximal policy optimization, a model-free DRL algorithm, is used to train offline the optimal flight control laws for selected representative arm-length modes, and a cc-DRL flight control scheme is then constructed from these laws via the convex combination technique. Finally, simulation results are presented to show the effectiveness and merit of the proposed flight control algorithm.
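
The convex-combination step itself is easy to illustrate: per-mode policies trained offline are blended with nonnegative weights that sum to one. The inverse-distance weighting and the scalar arm length below are simplifying assumptions (the paper's quadrotor morphs via four arm rods), so this is a sketch of the idea rather than the authors' scheme.

```python
import numpy as np

def convex_weights(arm_length, mode_lengths):
    """Convex (nonnegative, sum-to-one) weights from inverse distance to each mode."""
    d = np.abs(np.asarray(mode_lengths, dtype=float) - arm_length)
    if np.any(d == 0):
        return (d == 0).astype(float) / np.count_nonzero(d == 0)
    w = 1.0 / d
    return w / w.sum()

def cc_control(state, arm_length, mode_lengths, mode_policies):
    """Blend the control commands of the per-mode policies with convex weights."""
    w = convex_weights(arm_length, mode_lengths)
    return sum(wi * policy(state) for wi, policy in zip(w, mode_policies))

# Example: three representative arm-length modes with toy linear "policies"
modes = [0.10, 0.15, 0.20]
policies = [lambda s, k=k: k * np.asarray(s) for k in (1.0, 1.5, 2.0)]
u = cc_control([0.3, -0.1], arm_length=0.17, mode_lengths=modes, mode_policies=policies)
```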

--------------------------------------------------------------------------------------------------------

Data-Centric Approach to Constrained Machine Learning: A Case Study on Conway's Game of Life

This paper explores a data-centric approach to machine learning using Conway's Game of Life as a case study. It demonstrates how a strategically designed training dataset can significantly improve the performance of a network in learning the game's transition rules, highlighting the importance of domain expertise in creating effective AI applications for real-world scenarios with limited parameters.

Authors:  Anton Bibin, Anton Dereventsov

Link:  https://arxiv.org/abs/2408.12778v1

Date: 2024-08-23

Summary:

This paper focuses on a data-centric approach to machine learning applications in the context of Conway's Game of Life. Specifically, we consider the task of training a minimal architecture network to learn the transition rules of Game of Life for a given number of steps ahead, which is known to be challenging due to restrictions on the allowed number of trainable parameters. An extensive quantitative analysis showcases the benefits of utilizing a strategically designed training dataset, with its advantages persisting regardless of other parameters of the learning configuration, such as network initialization weights or optimization algorithm. Importantly, our findings highlight the integral role of domain expert insights in creating effective machine learning applications for constrained real-world scenarios.
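
For reference, the transition rule the network must learn is only a few lines of NumPy. Pairing random boards with their one-step successors gives the naive baseline dataset; the paper's contribution is choosing those training boards strategically instead, which is not reproduced here.

```python
import numpy as np

def life_step(board):
    """One step of Conway's Game of Life on a toroidal grid."""
    neighbors = sum(np.roll(np.roll(board, i, axis=0), j, axis=1)
                    for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0))
    # a cell is alive next step if it has 3 neighbors, or is alive with 2 neighbors
    return ((neighbors == 3) | ((board == 1) & (neighbors == 2))).astype(np.uint8)

def random_boards(n, size=8, p=0.5, seed=0):
    rng = np.random.default_rng(seed)
    return rng.binomial(1, p, size=(n, size, size)).astype(np.uint8)

# Supervised pairs (input board, board after one step) for a naive random dataset
X = random_boards(1000)
Y = np.stack([life_step(b) for b in X])
```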

--------------------------------------------------------------------------------------------------------

A Percolation Model of Emergence: Analyzing Transformers Trained on a Formal Language

Neural networks can exhibit sudden learning leaps ("emergence") as data, size, or computation increase. This paper proposes a definition for emergence in neural networks and explores it through Transformers trained on a formal language. The study suggests that emergence occurs when the model learns the underlying data structure, leading to better performance on specific tasks. It further proposes a model to predict this phenomenon.

Authors:  Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert P. Dick, Hidenori Tanaka

Link:  https://arxiv.org/abs/2408.12578v1

Date: 2024-08-22

Summary:

Increase in data, size, or compute can lead to sudden learning of specific capabilities by a neural network -- a phenomenon often called "emergence". Beyond scientific understanding, establishing the causal factors underlying such emergent capabilities is crucial to enable risk regulation frameworks for AI. In this work, we seek inspiration from study of emergent properties in other fields and propose a phenomenological definition for the concept in the context of neural networks. Our definition implicates the acquisition of specific structures underlying the data-generating process as a cause of sudden performance growth for specific, narrower tasks. We empirically investigate this definition by proposing an experimental system grounded in a context-sensitive formal language and find that Transformers trained to perform tasks on top of strings from this language indeed exhibit emergent capabilities. Specifically, we show that once the language's underlying grammar and context-sensitivity inducing structures are learned by the model, performance on narrower tasks suddenly begins to improve. We then analogize our network's learning dynamics with the process of percolation on a bipartite graph, establishing a formal phase transition model that predicts the shift in the point of emergence observed in experiment when changing the data structure. Overall, our experimental and theoretical frameworks yield a step towards better defining, characterizing, and predicting emergence in neural networks.

--------------------------------------------------------------------------------------------------------

Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers

Large neural networks are computationally expensive due to their vast number of parameters. This paper builds upon previous work that uses explanation methods to identify and prune unnecessary network components. The authors further optimize attribution methods for pruning and demonstrate its effectiveness on large transformer-based networks and convolutional neural networks while maintaining high performance.

Authors:  Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

Link:  https://arxiv.org/abs/2408.12568v1

Date: 2024-08-22

Summary:

To solve ever more complex problems, Deep Neural Networks are scaled to billions of parameters, leading to huge computational costs. An effective approach to reduce computational requirements and increase efficiency is to prune unnecessary components of these often over-parameterized networks. Previous work has shown that attribution methods from the field of eXplainable AI serve as effective means to extract and prune the least relevant network components in a few-shot fashion. We extend the current state by proposing to explicitly optimize hyperparameters of attribution methods for the task of pruning, and further include transformer-based networks in our analysis. Our approach yields higher model compression rates of large transformer- and convolutional architectures (VGG, ResNet, ViT) compared to previous works, while still attaining high performance on ImageNet classification tasks. Here, our experiments indicate that transformers have a higher degree of over-parameterization compared to convolutional neural networks. Code is available at https://github.com/erfanhatefi/Pruning-by-eXplaining-in-PyTorch.
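
A hedged sketch of attribution-guided structured pruning: accumulate a per-channel relevance score on a few reference batches, then drop the least relevant channels. The |activation x gradient| proxy below stands in for LRP-style attribution and is not the authors' exact method.

```python
import torch

def channel_relevance(activations, gradients):
    """activations, gradients: (batch, channels, H, W), e.g. captured via forward/backward hooks."""
    return (activations * gradients).abs().sum(dim=(0, 2, 3))

def prune_least_relevant(conv: torch.nn.Conv2d, relevance: torch.Tensor, ratio=0.3):
    """Zero out the output filters with the lowest accumulated relevance."""
    k = int(ratio * relevance.numel())
    drop = torch.topk(relevance, k, largest=False).indices
    with torch.no_grad():
        conv.weight[drop] = 0            # zeroing emulates removal of these filters
        if conv.bias is not None:
            conv.bias[drop] = 0
    return drop
```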

--------------------------------------------------------------------------------------------------------

Transformers As Approximations of Solomonoff Induction

Solomonoff Induction is an optimal algorithm for sequence prediction. This paper explores the hypothesis that Transformers, the foundation of large language models, approximate Solomonoff Induction better than any other current method. The authors present evidence for and against this claim, suggesting further research into modeling AI systems in this way.

Authors:  Nathan Young, Michael Witbrock

Link:  https://arxiv.org/abs/2408.12065v1

Date: 2024-08-22

Summary:

Solomonoff Induction is an optimal-in-the-limit unbounded algorithm for sequence prediction, representing a Bayesian mixture of every computable probability distribution and performing close to optimally in predicting any computable sequence. Being an optimal form of computational sequence prediction, it seems plausible that it may be used as a model against which other methods of sequence prediction might be compared. We put forth and explore the hypothesis that Transformer models - the basis of Large Language Models - approximate Solomonoff Induction better than any other extant sequence prediction method. We explore evidence for and against this hypothesis, give alternate hypotheses that take this evidence into account, and outline next steps for modelling Transformers and other kinds of AI in this way.

--------------------------------------------------------------------------------------------------------

Understanding Epistemic Language with a Bayesian Theory of Mind

Humans can understand and evaluate claims about others' beliefs even though these beliefs cannot be directly observed. This paper introduces a cognitive model (LaBToM) based on Bayesian theory of mind to explain how people interpret language about beliefs. LaBToM translates natural language into a language of thought and evaluates it against a model of rational action and perception. The model performs well in explaining human judgments about beliefs compared to large language models and simpler models.

Authors:  Lance Ying, Tan Zhi-Xuan, Lionel Wong, Vikash Mansinghka, Joshua B. Tenenbaum

Link:  https://arxiv.org/abs/2408.12022v1

Date: 2024-08-21

Summary:

How do people understand and evaluate claims about others' beliefs, even though these beliefs cannot be directly observed? In this paper, we introduce a cognitive model of epistemic language interpretation, grounded in Bayesian inferences about other agents' goals, beliefs, and intentions: a language-augmented Bayesian theory-of-mind (LaBToM). By translating natural language into an epistemic "language-of-thought", then evaluating these translations against the inferences produced by inverting a probabilistic generative model of rational action and perception, LaBToM captures graded plausibility judgments about epistemic claims. We validate our model in an experiment where participants watch an agent navigate a maze to find keys hidden in boxes needed to reach their goal, then rate sentences about the agent's beliefs. In contrast with multimodal LLMs (GPT-4o, Gemini Pro) and ablated models, our model correlates highly with human judgments for a wide range of expressions, including modal language, uncertainty expressions, knowledge claims, likelihood comparisons, and attributions of false belief.

--------------------------------------------------------------------------------------------------------

AnyGraph: Graph Foundation Model in the Wild

Graph data is becoming increasingly common, but current graph learning models often struggle to generalize to unseen data. This paper introduces AnyGraph, a unified graph foundation model designed to address key challenges like structural and feature heterogeneity, fast adaptation to new domains, and scaling law emergence (better performance with more data).

Authors:  Lianghao Xia, Chao Huang

Link:  https://arxiv.org/abs/2408.10700v1

Date: 2024-08-20

Summary:

The growing ubiquity of relational data structured as graphs has underscored the need for graph learning models with exceptional generalization capabilities. However, current approaches often struggle to effectively extract generalizable insights, frequently requiring extensive fine-tuning and limiting their versatility. Graph foundation models offer a transformative solution, with the potential to learn robust, generalizable representations from graph data. This enables more effective and adaptable applications across a wide spectrum of tasks and domains. In this work, we investigate a unified graph model, AnyGraph, designed to handle key challenges: i) Structure Heterogeneity. Addressing distribution shift in graph structural information; ii) Feature Heterogeneity. Handling diverse feature representation spaces across graph datasets; iii) Fast Adaptation. Efficiently adapting the model to new graph domains; iv) Scaling Law Emergence. Enabling the model to exhibit scaling law behavior, where its performance scales favorably with the amount of data and parameter sizes. To tackle these critical challenges, we build AnyGraph upon a Graph Mixture-of-Experts (MoE) architecture. This approach empowers the model to effectively manage both in-domain and cross-domain distribution shift concerning structure-level and feature-level heterogeneity. Furthermore, a lightweight graph expert routing mechanism is proposed to facilitate AnyGraph's fast adaptability to new data and domains. Our extensive experiments on 38 diverse graph datasets have demonstrated the strong zero-shot learning performance of AnyGraph across diverse graph domains with significant distribution shift. Furthermore, we have validated the model's fast adaptation ability and scaling law emergence, showcasing its versatility.
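
As a toy illustration of lightweight expert routing, one can summarize a graph with cheap dataset-level statistics and send it to the nearest expert; the signature and routing rule below are assumptions for illustration, not AnyGraph's actual mechanism.

```python
import numpy as np

def graph_signature(adj, feats):
    """Cheap dataset-level summary: density, degree statistics, feature moments."""
    deg = adj.sum(axis=1)
    return np.concatenate([[adj.mean(), deg.mean(), deg.std()],
                           feats.mean(axis=0), feats.std(axis=0)])

def route_to_expert(adj, feats, expert_centroids):
    """Pick the expert whose centroid is closest to this graph's signature."""
    sig = graph_signature(adj, feats)
    return int(np.argmin(np.linalg.norm(expert_centroids - sig, axis=1)))
```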

--------------------------------------------------------------------------------------------------------

A Monte Carlo Tree Search approach to QAOA: finding a needle in the haystack

The Quantum Approximate Optimization Algorithm (QAOA) is promising for tackling classical optimization problems, but efficiently optimizing its parameters is crucial. This paper proposes a novel approach that adapts Monte Carlo Tree Search (MCTS), a common AI technique, for parameter optimization in QAOA. This method leverages regular parameter patterns and allows for flexible and noise-resilient optimization, potentially improving the performance of variational quantum algorithms (VQAs) on near-term quantum devices.

Authors:  Andoni Agirre, Evert Van Nieuwenburg, Matteo M. Wauters

Link:  https://arxiv.org/abs/2408.12648v1

Date: 2024-08-22

Summary:

The search for quantum algorithms to tackle classical combinatorial optimization problems has long been one of the most attractive yet challenging research topics in quantum computing. In this context, variational quantum algorithms (VQA) are a promising family of hybrid quantum-classical methods tailored to cope with the limited capability of near-term quantum hardware. However, their effectiveness is hampered by the complexity of the classical parameter optimization which is prone to getting stuck either in local minima or in flat regions of the cost-function landscape. The clever design of efficient optimization methods is therefore of fundamental importance for fully leveraging the potential of VQAs. In this work, we approach parameter optimization as a sequential decision-making problem and tackle it with an adaptation of Monte Carlo Tree Search (MCTS), a common artificial intelligence technique designed for efficiently exploring complex decision graphs. We show that leveraging regular parameter patterns deeply affects the decision-tree structure and allows for a flexible and noise-resilient optimization strategy suitable for near-term quantum devices. Our results shed further light on the interplay between artificial intelligence and quantum information and provide a valuable addition to the toolkit of variational quantum circuits.
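
A compact UCT sketch for the sequential-decision view of parameter optimization: each tree level fixes one angle from a discrete grid, and leaves are scored by an energy evaluation (a toy stand-in below). The grid, depth, and random rollout policy are illustrative assumptions, not the paper's configuration.

```python
import math
import random

def mcts_optimize(evaluate, grid, depth, iters=500, c=1.4, seed=0):
    """UCT search over sequential angle choices; returns the best parameters seen."""
    random.seed(seed)
    stats = {(): [0, 0.0]}                      # node (tuple of angles) -> [visits, total reward]
    best_params, best_energy = None, float("inf")

    def uct(parent, child):
        n_p, _ = stats[parent]
        n_c, v_c = stats[child]
        return v_c / n_c + c * math.sqrt(math.log(n_p) / n_c)

    for _ in range(iters):
        node = ()
        while len(node) < depth:                # selection, expanding one unvisited child if any
            children = [node + (a,) for a in grid]
            unvisited = [ch for ch in children if ch not in stats]
            if unvisited:
                node = random.choice(unvisited)
                stats[node] = [0, 0.0]
                break
            node = max(children, key=lambda ch: uct(node, ch))
        # rollout: complete the remaining angles at random
        params = list(node) + [random.choice(grid) for _ in range(depth - len(node))]
        energy = evaluate(params)
        if energy < best_energy:
            best_params, best_energy = params, energy
        reward = -energy                        # lower energy = higher reward
        for k in range(len(node) + 1):          # backpropagate along the selected path
            stats[node[:k]][0] += 1
            stats[node[:k]][1] += reward
    return best_params, best_energy

# Toy stand-in for a QAOA energy estimate (separable quadratic cost):
params, energy = mcts_optimize(lambda p: sum((x - 0.8) ** 2 for x in p),
                               grid=[i * 0.2 for i in range(8)], depth=4)
```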

--------------------------------------------------------------------------------------------------------

Are big language models more successful than humans in the medical specialization exam (TUS)?

This study evaluates the performance of three large language models (LLMs) in answering medical questions from the 2021 Turkish Medical Specialization Examination. The results show that all three models achieved high accuracy, with one model (ChatGPT-4o) even surpassing the scores of the highest-performing human candidates. This highlights the potential of LLMs in medical education and assessment.

Authors:  Yesim Aygul, Muge Olucoglu, Adil Alpkocak

Link:  https://arxiv.org/abs/2408.12305v1

Date: 2024-08-22

Summary:

Recent developments in natural language processing have made the potential of artificial intelligence in medical education and assessment evident: medical questions can now be answered successfully by AI algorithms, which can assist medical practitioners. This study evaluates the performance of three artificial intelligence models in answering Turkish medical questions from the 2021 1st Term Medical Specialization Examination (MSE). The MSE consists of 240 questions in total across clinical (CMST) and basic (BMST) medical sciences. In CMST, Gemini correctly answered 82 questions, ChatGPT-4 answered 105, and ChatGPT-4o answered 117. In BMST, Gemini and ChatGPT-4 each answered 93 questions correctly and ChatGPT-4o answered 107, according to the answer key. ChatGPT-4o outperformed the highest-scoring candidate, whose scores were 113 (CMST) and 106 (BMST). These results demonstrate that advanced models can achieve high accuracy and contextual understanding, underscoring their potential role in medical education and evaluation.

--------------------------------------------------------------------------------------------------------

Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learning with LLMs

The spread of misinformation online is a major concern. Manually fact-checking everything is impossible. This paper proposes a system to automate fact-checking on social media. It uses large language models (LLMs) and a knowledge base to assess the truthfulness of claims and provide evidence to support its verdict. This system could be used to improve the accuracy of information online and combat the spread of fake news. The authors achieved a significant improvement in accuracy compared to existing methods, and their code is publicly available for further development.

Authors:  Ronit Singhal, Pransh Patwa, Parth Patwa, Aman Chadha, Amitava Das

Link:  https://arxiv.org/abs/2408.12060v1

Date: 2024-08-22

Summary:

Given the widespread dissemination of misinformation on social media, implementing fact-checking mechanisms for online claims is essential. Manually verifying every claim is highly challenging, underscoring the need for an automated fact-checking system. This paper presents our system designed to address this issue. We utilize the Averitec dataset to assess the veracity of claims. In addition to veracity prediction, our system provides supporting evidence, which is extracted from the dataset. We develop a Retrieve and Generate (RAG) pipeline to extract relevant evidence sentences from a knowledge base, which are then input along with the claim into a large language model (LLM) for classification. We also evaluate the few-shot In-Context Learning (ICL) capabilities of multiple LLMs. Our system achieves an 'Averitec' score of 0.33, which is a 22% absolute improvement over the baseline. All code will be made available on https://github.com/ronit-singhal/evidence-backed-fact-checking-using-rag-and-few-shot-in-context-learning-with-llms.
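
A minimal sketch of the retrieve-then-classify pipeline, with TF-IDF retrieval standing in for the paper's retriever and the LLM call left out; the prompt wording and label set are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_evidence(claim, knowledge_base, top_k=5):
    """Rank knowledge-base sentences by TF-IDF cosine similarity to the claim."""
    vec = TfidfVectorizer().fit(knowledge_base + [claim])
    kb_mat, claim_vec = vec.transform(knowledge_base), vec.transform([claim])
    scores = cosine_similarity(claim_vec, kb_mat)[0]
    return [knowledge_base[i] for i in scores.argsort()[::-1][:top_k]]

def build_prompt(claim, evidence, few_shot_examples=""):
    """Assemble a few-shot prompt: examples, claim, retrieved evidence, verdict request."""
    lines = "\n".join(f"- {e}" for e in evidence)
    return (few_shot_examples +
            f"Claim: {claim}\nEvidence:\n{lines}\n"
            "Verdict (Supported / Refuted / Not Enough Evidence / Conflicting):")

# The resulting prompt is then sent to an LLM for few-shot in-context classification.
```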

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.