Eye On AI

Week Ending 10.13.2024

RESEARCH WATCH: 10.13.2024

UniGlyph: A Seven-Segment Script for Universal Language Representation

UniGlyph introduces a novel constructed language designed to create a universal transliteration system using seven-segment characters. This innovative approach aims to facilitate cross-language communication by offering a flexible and consistent script capable of representing a wide range of phonetic sounds. UniGlyph addresses limitations in existing phonetic representation systems like the International Phonetic Alphabet, providing a compact yet versatile method for capturing linguistic diversity. Potential applications include enhancing natural language processing, multilingual speech recognition, and improving communication across different languages. The system's expandability to include animal phonetic sounds suggests its potential use in broader inter-species communication studies and AI-driven language applications.

Authors:  G. V. Bency Sherin, A. Abijesh Euphrine, A. Lenora Moreen, L. Arun Jose

Link:  https://arxiv.org/abs/2410.08974v1

Date: 2024-10-11

Summary:

UniGlyph is a constructed language (conlang) designed to create a universal transliteration system using a script derived from seven-segment characters. The goal of UniGlyph is to facilitate cross-language communication by offering a flexible and consistent script that can represent a wide range of phonetic sounds. This paper explores the design of UniGlyph, detailing its script structure, phonetic mapping, and transliteration rules. The system addresses imperfections in the International Phonetic Alphabet (IPA) and traditional character sets by providing a compact, versatile method to represent phonetic diversity across languages. With pitch and length markers, UniGlyph ensures accurate phonetic representation while maintaining a small character set. Applications of UniGlyph include artificial intelligence integrations, such as natural language processing and multilingual speech recognition, enhancing communication across different languages. Future expansions are discussed, including the addition of animal phonetic sounds, where unique scripts are assigned to different species, broadening the scope of UniGlyph beyond human communication. This study presents the challenges and solutions in developing such a universal script, demonstrating the potential of UniGlyph to bridge linguistic gaps in cross-language communication, educational phonetics, and AI-driven applications.
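
The mechanics are easiest to see in miniature. Below is a purely illustrative sketch, with invented phoneme-to-segment assignments rather than the paper's actual glyph inventory: each glyph is modeled as a 7-bit mask, one bit per segment (a-g) of a seven-segment display, and a phoneme sequence is transliterated into glyph codes.

```python
# Illustrative sketch only: the real UniGlyph inventory and phonetic mapping
# are defined in the paper. Each glyph is a 7-bit mask, one bit per segment.

# Hypothetical phoneme-to-segment assignments for a handful of sounds.
PHONEME_TO_SEGMENTS = {
    "p": 0b1110000,  # invented encoding
    "b": 0b1110001,  # invented encoding
    "a": 0b1111110,  # invented encoding
    "i": 0b0110000,  # invented encoding
}

def transliterate(phonemes):
    """Map a phoneme sequence to a sequence of seven-segment bitmasks."""
    return [PHONEME_TO_SEGMENTS[p] for p in phonemes]

print([f"{g:07b}" for g in transliterate(["b", "a", "i"])])
```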

--------------------------------------------------------------------------------------------------------

Hybrid LLM-DDQN based Joint Optimization of V2I Communication and Autonomous Driving

This research explores the integration of Large Language Models (LLMs) with vehicular networks to jointly optimize vehicle-to-infrastructure (V2I) communications and autonomous driving (AD) policies. The study combines LLMs for AD decision-making with a double deep Q-learning algorithm (DDQN) for V2I optimization. This hybrid approach aims to maximize traffic flow, enhance road safety, and optimize data transmission rates while reducing frequent handovers in vehicular networks. By iteratively optimizing AD and V2I decisions, the research reveals the potential of using LLMs for network optimization and management. The proposed hybrid LLM-DDQN approach demonstrates faster convergence and higher average rewards compared to conventional DDQN algorithms, suggesting promising applications in smart transportation systems and connected vehicle technologies.

Authors:  Zijiang Yan, Hao Zhou, Hina Tabassum, Xue Liu

Link:  https://arxiv.org/abs/2410.08854v1

Date: 2024-10-11

Summary:

Large language models (LLMs) have received considerable interest recently due to their outstanding reasoning and comprehension capabilities. This work explores applying LLMs to vehicular networks, aiming to jointly optimize vehicle-to-infrastructure (V2I) communications and autonomous driving (AD) policies. We deploy LLMs for AD decision-making to maximize traffic flow and avoid collisions for road safety, and a double deep Q-learning algorithm (DDQN) is used for V2I optimization to maximize the received data rate and reduce frequent handovers. In particular, for LLM-enabled AD, we employ the Euclidean distance to identify previously explored AD experiences, and then LLMs can learn from past good and bad decisions for further improvement. Then, LLM-based AD decisions will become part of states in V2I problems, and DDQN will optimize the V2I decisions accordingly. After that, the AD and V2I decisions are iteratively optimized until convergence. Such an iterative optimization approach can better explore the interactions between LLMs and conventional reinforcement learning techniques, revealing the potential of using LLMs for network optimization and management. Finally, the simulations demonstrate that our proposed hybrid LLM-DDQN approach outperforms the conventional DDQN algorithm, showing faster convergence and higher average rewards.
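
As a rough sketch of the alternating scheme described above (assuming placeholder implementations of the environment, the LLM driving policy, and the DDQN agent, none of which are the authors' code), the loop below retrieves Euclidean-nearest past experiences for the LLM, folds the resulting AD decision into the V2I state, and lets the DDQN agent act on it:

```python
# Minimal sketch of the hybrid LLM-DDQN loop. env, llm_drive_policy, and
# ddqn_agent are placeholders supplied by the caller, not the paper's code.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_similar_experiences(state, memory, k=3):
    """Find the k past (state, decision, reward) tuples closest to this state."""
    return sorted(memory, key=lambda e: euclidean(e[0], state))[:k]

def joint_optimize(env, llm_drive_policy, ddqn_agent, episodes=100):
    memory = []  # past AD experiences the LLM can learn from
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            examples = retrieve_similar_experiences(state, memory)
            ad_action = llm_drive_policy(state, examples)  # LLM picks lane/speed
            v2i_state = (*state, ad_action)                # AD decision enters V2I state
            v2i_action = ddqn_agent.act(v2i_state)         # DDQN picks base station
            next_state, reward, done = env.step(ad_action, v2i_action)
            memory.append((state, ad_action, reward))      # experience for LLM reuse
            ddqn_agent.update(v2i_state, v2i_action, reward)
            state = next_state
```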

--------------------------------------------------------------------------------------------------------

PILLAR: an AI-Powered Privacy Threat Modeling Tool

PILLAR introduces an innovative tool that combines Large Language Models (LLMs) with the LINDDUN framework to streamline and enhance privacy threat modeling. As modern applications increasingly handle sensitive user data, effective privacy protection has become crucial. PILLAR addresses the limitations of current privacy threat modeling methods by automating key parts of the LINDDUN process, such as generating data flow diagrams, classifying threats, and prioritizing risks. By leveraging LLMs, PILLAR can transform natural language descriptions of systems into comprehensive threat models with minimal user input. This tool has the potential to significantly reduce the workload on developers and privacy experts while improving the efficiency and accuracy of privacy risk assessment in system development.

Authors:  Majid Mollaeefar, Andrea Bissoli, Silvio Ranise

Link:  https://arxiv.org/abs/2410.08755v1

Date: 2024-10-11

Summary:

The rapid evolution of Large Language Models (LLMs) has unlocked new possibilities for applying artificial intelligence across a wide range of fields, including privacy engineering. As modern applications increasingly handle sensitive user data, safeguarding privacy has become more critical than ever. To protect privacy effectively, potential threats need to be identified and addressed early in the system development process. Frameworks like LINDDUN offer structured approaches for uncovering these risks, but despite their value, they often demand substantial manual effort, expert input, and detailed system knowledge. This makes the process time-consuming and prone to errors. Current privacy threat modeling methods, such as LINDDUN, typically rely on creating and analyzing complex data flow diagrams (DFDs) and system descriptions to pinpoint potential privacy issues. While these approaches are thorough, they can be cumbersome, relying heavily on the precision of the data provided by users. Moreover, they often generate a long list of threats without clear guidance on how to prioritize them, leaving developers unsure of where to focus their efforts. In response to these challenges, we introduce PILLAR (Privacy risk Identification with LINDDUN and LLM Analysis Report), a new tool that integrates LLMs with the LINDDUN framework to streamline and enhance privacy threat modeling. PILLAR automates key parts of the LINDDUN process, such as generating DFDs, classifying threats, and prioritizing risks. By leveraging the capabilities of LLMs, PILLAR can take natural language descriptions of systems and transform them into comprehensive threat models with minimal input from users, reducing the workload on developers and privacy experts while improving the efficiency and accuracy of the process.
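
A hedged sketch of the kind of pipeline PILLAR automates might look as follows; the prompts and the ask_llm client are illustrative stand-ins, not PILLAR's actual interface or the LINDDUN tooling.

```python
# Sketch of an LLM-driven LINDDUN-style pipeline: description -> DFD ->
# per-category threats -> prioritized ranking. ask_llm is a placeholder.

LINDDUN_CATEGORIES = [
    "Linking", "Identifying", "Non-repudiation", "Detecting",
    "Data disclosure", "Unawareness", "Non-compliance",
]

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def threat_model(system_description: str) -> dict:
    # 1. Derive a data flow diagram from the natural-language description.
    dfd = ask_llm("Extract a data flow diagram (entities, processes, "
                  f"data stores, flows) from:\n{system_description}")
    # 2. Classify candidate threats against each LINDDUN category.
    threats = {cat: ask_llm(f"List {cat} threats for this DFD:\n{dfd}")
               for cat in LINDDUN_CATEGORIES}
    # 3. Prioritize, so developers know where to focus first.
    ranking = ask_llm(f"Rank these threats by risk:\n{threats}")
    return {"dfd": dfd, "threats": threats, "ranking": ranking}
```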

--------------------------------------------------------------------------------------------------------

Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation

This paper introduces a novel framework called CroBIM (Cross-Modal Bidirectional Interaction Model) for referring remote sensing image segmentation (RRSIS). RRSIS aims to generate pixel-level masks of target objects identified by natural language expressions in remote sensing images. CroBIM addresses the challenges posed by complex geospatial relationships and varying object scales in remote sensing scenarios. The framework incorporates a context-aware prompt modulation module, a language-guided feature aggregation module, and a mutual-interaction decoder to enhance cross-modal feature alignment. Additionally, the researchers introduce RISBench, a large-scale benchmark dataset for RRSIS. This work has potential applications in various fields utilizing remote sensing data, such as urban planning, environmental monitoring, and disaster response.

Authors:  Zhe Dong, Yuzhe Sun, Yanfeng Gu, Tianzhu Liu

Link:  https://arxiv.org/abs/2410.08613v1

Date: 2024-10-11

Summary:

Given a natural language expression and a remote sensing image, the goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. In contrast to natural scenarios, expressions in RRSIS often involve complex geospatial relationships, with target objects of interest that vary significantly in scale and lack visual saliency, thereby increasing the difficulty of achieving precise segmentation. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). Specifically, a context-aware prompt modulation (CAPM) module is designed to integrate spatial positional relationships and task-specific knowledge into the linguistic features, thereby enhancing the ability to capture the target object. Additionally, a language-guided feature aggregation (LGFA) module is introduced to integrate linguistic information into multi-scale visual features, incorporating an attention deficit compensation mechanism to enhance feature aggregation. Finally, a mutual-interaction decoder (MID) is designed to enhance cross-modal feature alignment through cascaded bidirectional cross-attention, thereby enabling precise segmentation mask prediction. To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets. Extensive benchmarking on RISBench and two other prevalent datasets demonstrates the superior performance of the proposed CroBIM over existing state-of-the-art (SOTA) methods. The source code for CroBIM and the RISBench dataset will be publicly available at https://github.com/HIT-SIRS/CroBIM.

--------------------------------------------------------------------------------------------------------

Quality Prediction of AI Generated Images and Videos: Emerging Trends and Opportunities

This paper explores the critical issue of quality assessment for AI-generated and enhanced images and videos. As AI-based methods for content generation and enhancement gain prominence, ensuring high visual quality becomes crucial for widespread integration and user acceptance. The authors discuss the limitations of existing Image Quality Assessment (IQA) and Video Quality Assessment (VQA) models in evaluating "generative" artifacts. They highlight the need for new metrics and models specifically designed for AI-generated content, as well as more representative datasets and performance measures. This research has implications for various industries employing AI-generated visual content, including entertainment, advertising, and social media, where maintaining high visual quality is essential for user experience and engagement.

Authors:  Abhijay Ghildyal, Yuanhan Chen, Saman Zadtootaghaj, Nabajeet Barman, Alan C. Bovik

Link:  https://arxiv.org/abs/2410.08534v1

Date: 2024-10-11

Summary:

The advent of AI has influenced many aspects of human life, from self-driving cars and intelligent chatbots to text-based image and video generation models capable of creating realistic images and videos based on user prompts (text-to-image, image-to-image, and image-to-video). AI-based methods for image and video super resolution, video frame interpolation, denoising, and compression have already gathered significant attention and interest in the industry, and some solutions are already being implemented in real-world products and services. However, to achieve widespread integration and acceptance, AI-generated and enhanced content must be visually accurate, adhere to intended use, and maintain high visual quality to avoid degrading the end user's quality of experience (QoE). One way to monitor and control the visual "quality" of AI-generated and -enhanced content is by deploying Image Quality Assessment (IQA) and Video Quality Assessment (VQA) models. However, most existing IQA and VQA models measure visual fidelity in terms of "reconstruction" quality against a pristine reference content and were not designed to assess the quality of "generative" artifacts. To address this, newer metrics and models have recently been proposed, but their performance evaluation and overall efficacy have been limited by datasets that were too small or otherwise lacked representative content and/or distortion capacity, and by the lack of performance measures that can accurately report the success of an IQA/VQA model for "GenAI". This paper examines the current shortcomings and possibilities presented by AI-generated and enhanced image and video content, with a particular focus on end-user perceived quality. Finally, we discuss open questions and make recommendations for future work on "GenAI" quality assessment problems, toward further progress in this interesting and relevant field of research.

--------------------------------------------------------------------------------------------------------

VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking

VOVTrack addresses the challenge of open-vocabulary multi-object tracking (OVMOT) in videos, which involves detecting and tracking diverse object categories, including both seen and unseen classes. Unlike existing approaches that often combine open-vocabulary object detection (OVD) and multi-object tracking (MOT) as separate modules, VOVTrack integrates object states relevant to MOT and employs video-centric training. The method introduces a prompt-guided attention mechanism for accurate localization and classification of time-varying objects, and leverages raw video data for self-supervised object similarity learning. This approach has potential applications in advanced video surveillance, autonomous driving systems, and video content analysis, where tracking diverse and previously unseen objects is crucial.

Authors:  Zekun Qian, Ruize Han, Junhui Hou, Linqi Song, Wei Feng

Link:  https://arxiv.org/abs/2410.08529v1

Date: 2024-10-11

Summary:

Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens. In this paper, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification (detection) of the time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object association (tracking). Experimental results underscore that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.

--------------------------------------------------------------------------------------------------------

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

This research explores the emergence of grounding abilities in large multimodal models (LMMs) without explicit grounding supervision. The authors introduce an "attend-and-segment" method that leverages attention maps from standard LMMs to perform pixel-level segmentation. They also propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, trained with weak supervision. This approach demonstrates competitive performance on both grounding-specific and general visual question answering benchmarks. The findings have significant implications for developing more generalizable and scalable multimodal AI systems, with potential applications in image understanding, visual reasoning tasks, and human-AI interaction where precise object localization and language-vision alignment are crucial.

Authors:  Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang

Link:  https://arxiv.org/abs/2410.08209v1

Date: 2024-10-10

Summary:

Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://groundLMM.github.io.
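
A minimal numpy sketch of the attend-and-segment idea, under assumed tensor shapes and an assumed min-max normalization: average the attention that a phrase's tokens pay to image patches across layers and heads, then threshold the result into a patch-level mask.

```python
# Sketch only: shapes, normalization, and thresholding are assumptions, not
# the paper's recipe. attn holds attention from phrase tokens to image patches.
import numpy as np

def attend_and_segment(attn, grid=(24, 24), threshold=0.5):
    """attn: (layers, heads, num_patches) attention weights -> boolean mask."""
    patch_scores = attn.mean(axis=(0, 1))              # average layers and heads
    lo, hi = patch_scores.min(), patch_scores.max()
    patch_scores = (patch_scores - lo) / (hi - lo + 1e-8)  # normalize to [0, 1]
    return (patch_scores >= threshold).reshape(grid)   # binarize per patch

attn = np.random.rand(32, 16, 24 * 24)  # fake attention for demonstration
print(attend_and_segment(attn).sum(), "patches selected")
```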

--------------------------------------------------------------------------------------------------------

SAKA: An Intelligent Platform for Semi-automated Knowledge Graph Construction and Application

SAKA introduces an intelligent and user-friendly platform for semi-automated knowledge graph (KG) construction and application. The platform addresses the challenges of manual KG construction by enabling users to semi-automatically build KGs from structured data across various domains. SAKA also incorporates an Audio-based KG Information Extraction (AGIE) method to construct KGs from audio data, expanding the potential sources of knowledge. Additionally, the platform creates a semantic parsing-based knowledge base question answering (KBQA) system based on user-created KGs. This innovative approach has potential applications in information management, data integration, and AI-powered question-answering systems across multiple industries, making KG technology more accessible to non-expert users.

Authors:  Hanrong Zhang, Xinyue Wang, Jiabao Pan, Hongwei Wang

Link:  https://arxiv.org/abs/2410.08094v1

Date: 2024-10-10

Summary:

Knowledge graph (KG) technology is extensively utilized in many areas, and many companies offer applications based on KGs. Nonetheless, the majority of KG platforms require expertise and tremendous time and effort from users to construct KG records manually, which makes them difficult for ordinary people to use. Additionally, audio data is abundant and holds valuable information, but it is challenging to transform it into a KG. Moreover, such platforms usually do not leverage the full potential of the KGs constructed by users. In this paper, we propose an intelligent and user-friendly platform for Semi-automated KG Construction and Application (SAKA) to address the aforementioned problems. Primarily, users can semi-automatically construct KGs from structured data of numerous areas by interacting with the platform, based on which multiple versions of a KG can be stored, viewed, managed, and updated. Moreover, we propose an Audio-based KG Information Extraction (AGIE) method to establish KGs from audio data. Lastly, the platform creates a semantic parsing-based knowledge base question answering (KBQA) system based on the user-created KGs. We demonstrate the feasibility of the semi-automatic KG construction method on the SAKA platform.
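
The core construction step can be sketched as a mapping from structured records to triples, once a user confirms which column names the subject entity and which columns carry relations. The schema below is invented for illustration and is not SAKA's actual data model.

```python
# Sketch of semi-automated KG construction from tabular records: the user
# designates a subject column and relation columns; the rest is mechanical.

rows = [
    {"company": "Acme", "ceo": "J. Doe", "founded": "1999"},
    {"company": "Globex", "ceo": "A. Smith", "founded": "2005"},
]

def rows_to_triples(rows, subject_col, relation_cols):
    """Emit (subject, predicate, object) triples from tabular records."""
    triples = []
    for row in rows:
        subject = row[subject_col]
        for col in relation_cols:
            triples.append((subject, col, row[col]))
    return triples

for t in rows_to_triples(rows, "company", ["ceo", "founded"]):
    print(t)
```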

--------------------------------------------------------------------------------------------------------

Benchmarking Agentic Workflow Generation

This paper introduces WorFBench, a unified workflow generation benchmark, and WorFEval, a systemic evaluation protocol for assessing Large Language Model (LLM) agents' workflow generation capabilities. The research addresses limitations in existing workflow evaluation frameworks by providing multi-faceted scenarios and intricate graph workflow structures. The study reveals gaps between sequence planning and graph planning capabilities of LLM agents, even in advanced models like GPT-4. The findings have implications for improving AI agents' ability to decompose complex problems into executable workflows, with potential applications in task planning, project management, and automated reasoning systems across various domains.

Authors:  Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

Link:  https://arxiv.org/abs/2410.07869v1

Date: 2024-10-10

Summary:

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, in which decomposing complex problems into executable workflows is a crucial step. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset will be available at https://github.com/zjunlp/WorFBench.
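
The sequence half of a WorFEval-style score can be sketched as a longest-common-subsequence match between predicted and gold workflows; the subgraph-matching half and WorFEval's exact normalization are omitted here, so treat this only as an illustration of the idea.

```python
# Sketch of subsequence-based workflow scoring: longest common subsequence
# between predicted and gold step sequences, normalized by the gold length.

def lcs_len(pred, gold):
    """Classic O(n*m) longest-common-subsequence length via dynamic programming."""
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, g in enumerate(gold, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if p == g else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def sequence_score(pred, gold):
    return lcs_len(pred, gold) / len(gold)

gold = ["search flights", "compare prices", "book ticket", "email receipt"]
pred = ["search flights", "book ticket", "email receipt"]
print(f"{sequence_score(pred, gold):.2f}")  # 0.75: one gold step missed
```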

--------------------------------------------------------------------------------------------------------

Can Transformers Reason Logically? A Study in SAT Solving

This research investigates the logical reasoning capabilities of Large Language Models (LLMs) in the context of Boolean satisfiability (SAT) problem-solving. The authors construct a decoder-only Transformer that can solve SAT using backtracking and deduction via Chain-of-Thought (CoT), proving its correctness through trace equivalence to the DPLL SAT-solving algorithm. They also introduce PARAT, a compiler that translates procedural specifications into transformer models. The study empirically evaluates whether transformers can be trained to reason by learning from algorithmic traces of the DPLL algorithm. This work has implications for enhancing AI systems' logical reasoning abilities, with potential applications in automated theorem proving, formal verification, and complex problem-solving across various domains.

Authors:  Leyan Pan, Vijay Ganesh, Jacob Abernethy, Chris Esposo, Wenke Lee

Link:  https://arxiv.org/abs/2410.07432v1

Date: 2024-10-09

Summary:

We theoretically and empirically study the logical reasoning capabilities of LLMs in the context of the Boolean satisfiability (SAT) problem. First, we construct a decoder-only Transformer that can solve SAT using backtracking and deduction via Chain-of-Thought (CoT). We prove its correctness by showing trace equivalence to the well-known DPLL SAT-solving algorithm. Second, to support the implementation of this abstract construction, we design a compiler PARAT that takes as input a procedural specification and outputs a transformer model implementing this specification. Third, rather than programming a transformer to reason, we evaluate empirically whether it can be trained to do so by learning directly from algorithmic traces ("reasoning paths") of the DPLL algorithm.
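
For reference, here is a compact version of the DPLL algorithm whose traces anchor the paper's construction: unit propagation to a fixpoint, then branching with backtracking. This is textbook DPLL, not the paper's transformer encoding.

```python
# Compact DPLL solver. CNF is a list of clauses; each clause is a list of
# nonzero ints (negative = negated variable). Returns a satisfying partial
# assignment as a tuple of literals, or None if unsatisfiable.

def dpll(clauses, assignment=()):
    clauses = [list(c) for c in clauses]
    while True:  # unit propagation: repeatedly assign forced literals
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            break
        lit = units[0]
        assignment += (lit,)
        new = []
        for c in clauses:
            if lit in c:
                continue                      # clause satisfied, drop it
            reduced = [l for l in c if l != -lit]
            if not reduced:
                return None                   # conflict -> backtrack
            new.append(reduced)
        clauses = new
    if not clauses:
        return assignment                     # all clauses satisfied
    lit = clauses[0][0]                       # branch on an unassigned literal
    return (dpll(clauses + [[lit]], assignment)
            or dpll(clauses + [[-lit]], assignment))

print(dpll([[1, 2], [-1, 3], [-2, -3]]))      # e.g. (1, 3, -2)
```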

--------------------------------------------------------------------------------------------------------

InstructG2I: Synthesizing Images from Multimodal Attributed Graphs

InstructG2I addresses the challenge of generating images from multimodal attributed graphs (MMAGs), a task that involves complex dependencies among graph entities and requires controllability in graph conditions. The proposed approach uses a graph context-conditioned diffusion model that exploits graph structure and multimodal information for informative neighbor sampling. It also introduces graph classifier-free guidance for controllable generation. This research has potential applications in various domains where visual content needs to be generated from structured data, such as automated design, data visualization, and content creation tools that can translate complex relational information into visual representations.

Authors:  Bowen Jin, Ziqi Pang, Bingjun Guo, Yu-Xiong Wang, Jiaxuan You, Jiawei Han

Link:  https://arxiv.org/abs/2410.07157v1

Date: 2024-10-09

Summary:

In this paper, we approach an overlooked yet critical task Graph2Image: generating images from multimodal attributed graphs (MMAGs). This task poses significant challenges due to the explosion in graph size, dependencies among graph entities, and the need for controllability in graph conditions. To address these challenges, we propose a graph context-conditioned diffusion model called InstructG2I. InstructG2I first exploits the graph structure and multimodal information to conduct informative neighbor sampling by combining personalized page rank and re-ranking based on vision-language features. Then, a Graph-QFormer encoder adaptively encodes the graph nodes into an auxiliary set of graph prompts to guide the denoising process of diffusion. Finally, we propose graph classifier-free guidance, enabling controllable generation by varying the strength of graph guidance and multiple connected edges to a node. Extensive experiments conducted on three datasets from different domains demonstrate the effectiveness and controllability of our approach. The code is available at https://github.com/PeterGriffinJin/InstructG2I.
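
A simplified sketch of the neighbor-sampling step: personalized PageRank surfaces structurally important neighbors, then a similarity re-rank (with plain embeddings standing in for the paper's vision-language features) keeps the most relevant ones. Shapes and hyperparameters are assumptions.

```python
# Sketch of PPR-based neighbor sampling with embedding re-ranking; not the
# paper's implementation. adj must be row-stochastic.
import numpy as np

def personalized_pagerank(adj, seed, alpha=0.85, iters=50):
    """Power iteration for PPR scores personalized to one seed node."""
    n = adj.shape[0]
    personal = np.zeros(n); personal[seed] = 1.0
    rank = personal.copy()
    for _ in range(iters):
        rank = alpha * adj.T @ rank + (1 - alpha) * personal
    return rank

def sample_neighbors(adj, seed, features, k=2):
    ppr = personalized_pagerank(adj, seed)
    order = [i for i in np.argsort(ppr)[::-1] if i != seed]
    candidates = np.array(order[:2 * k])              # top-2k by structure
    sims = features[candidates] @ features[seed]      # re-rank by similarity
    return candidates[np.argsort(sims)[::-1][:k]]

adj = np.array([[0, .5, .5, 0], [1, 0, 0, 0], [.5, 0, 0, .5], [0, 0, 1, 0]])
feats = np.random.randn(4, 8); feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(sample_neighbors(adj, seed=0, features=feats))
```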

--------------------------------------------------------------------------------------------------------

M^3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes

M^3Bench introduces a new benchmark for whole-body motion generation in mobile manipulation tasks within 3D environments. The benchmark requires an embodied agent to understand its configuration, environmental constraints, and task objectives to generate coordinated whole-body motion trajectories for object rearrangement tasks. M^3Bench features 30,000 tasks across 119 diverse scenes and includes M^3BenchMaker, an automatic data generation tool. This research aims to facilitate advancements in robotics towards more adaptive and capable mobile manipulation in diverse, real-world environments, with potential applications in household robotics, industrial automation, and assistive technologies where robots need to navigate and manipulate objects in complex 3D spaces.

Authors:  Zeyu Zhang, Sixu Yan, Muzhi Han, Zaijin Wang, Xinggang Wang, Song-Chun Zhu, Hangxin Liu

Link:  https://arxiv.org/abs/2410.06678v1

Date: 2024-10-09

Summary:

We propose M^3Bench, a new benchmark for whole-body motion generation for mobile manipulation tasks. Given a 3D scene context, M^3Bench requires an embodied agent to understand its configuration, environmental constraints and task objectives, then generate coordinated whole-body motion trajectories for object rearrangement tasks. M^3Bench features 30k object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M^3BenchMaker. This automatic data generation tool produces coordinated whole-body motion trajectories from high-level task instructions, requiring only basic scene and robot information. Our benchmark incorporates various task splits to assess generalization across different dimensions and leverages realistic physics simulation for trajectory evaluation. Through extensive experimental analyses, we reveal that state-of-the-art models still struggle with coordinated base-arm motion while adhering to environment-context and task-specific constraints, highlighting the need to develop new models that address this gap. Through M^3Bench, we aim to facilitate future robotics research towards more adaptive and capable mobile manipulation in diverse, real-world environments.

--------------------------------------------------------------------------------------------------------

Boolean Nearest Neighbor Language in the Knowledge Compilation Map

This paper examines the Boolean Nearest Neighbor (BNN) representation of Boolean functions within the context of the Knowledge Compilation Map (KCM). The research aims to determine the position of the BNN language in the KCM by comparing its succinctness to several standard languages and determining the complexity status of standard queries and transformations for BNN inputs. This work contributes to the field of knowledge representation and reasoning, with potential applications in artificial intelligence, automated reasoning systems, and formal logic, where efficient representation and manipulation of Boolean functions are crucial.

Authors:  Ondřej Čepek, Jelena Glišić

Link:  https://arxiv.org/abs/2410.06332v1

Date: 2024-10-08

Summary:

The Boolean Nearest Neighbor (BNN) representation of Boolean functions was recently introduced by Hajnal, Liu and Turan. A BNN representation of $f$ is a pair $(P,N)$ of sets of Boolean vectors (called positive and negative prototypes) where $f(x)=1$ for every positive prototype $x \in P$, $f(x)=0$ for every negative prototype $x \in N$, and the value $f(x)$ for $x \not\in P \cup N$ is determined by the type of the closest prototype. The main aim of this paper is to determine the position of the BNN language in the Knowledge Compilation Map (KCM). To this end, we derive results which compare the succinctness of the BNN language to several standard languages from KCM, and determine the complexity status of most standard queries and transformations for BNN inputs.
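
Evaluating a BNN representation is direct to sketch: a point outside P and N takes the label of its nearest prototype, here under Hamming distance. Tie-breaking is simplified relative to the formal definition.

```python
# Sketch of BNN evaluation: nearest-prototype lookup under Hamming distance.
# Tie-breaking (d_pos == d_neg) is simplified here; the paper's definition governs.

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def bnn_eval(x, positives, negatives):
    if x in positives: return 1
    if x in negatives: return 0
    d_pos = min(hamming(x, p) for p in positives)
    d_neg = min(hamming(x, n) for n in negatives)
    return 1 if d_pos < d_neg else 0

P = [(1, 1, 1)]                   # positive prototypes
N = [(0, 0, 0)]                   # negative prototypes
print(bnn_eval((1, 1, 0), P, N))  # closer to (1,1,1) -> 1
```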

--------------------------------------------------------------------------------------------------------

Accelerated Preference Optimization for Large Language Model Alignment

This research introduces Accelerated Preference Optimization (APO), a framework that employs Nesterov's momentum technique to speed up the alignment of Large Language Models (LLMs) with human preferences. APO unifies many existing preference optimization algorithms and demonstrates faster convergence rates than standard iterative preference optimization methods. The framework shows superior performance over Direct Preference Optimization (DPO) and other baselines on the AlpacaEval 2.0 benchmark. This work has significant implications for improving the efficiency and effectiveness of LLM alignment processes, potentially leading to faster development of AI systems that better align with human values and preferences across various applications.

Authors:  Jiafan He, Huizhuo Yuan, Quanquan Gu

Link:  https://arxiv.org/abs/2410.06293v1

Date: 2024-10-08

Summary:

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences. Direct Preference Optimization (DPO), one of the most popular approaches, formulates RLHF as a policy optimization problem without explicitly estimating the reward function. It overcomes the stability and efficiency issues of two-step approaches, which typically involve first estimating the reward function and then optimizing the policy via proximal policy optimization (PPO). Since RLHF is essentially an optimization problem, and it is well-known that momentum techniques can accelerate optimization both theoretically and empirically, a natural question arises: Can RLHF be accelerated by momentum? This paper answers this question in the affirmative. In detail, we first show that the iterative preference optimization method can be viewed as a proximal point method. Based on this observation, we propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms and employs Nesterov's momentum technique to speed up the alignment of LLMs. Theoretically, we demonstrate that APO can achieve a faster convergence rate than the standard iterative preference optimization methods, including DPO and Self-Play Preference Optimization (SPPO). Empirically, we show the superiority of APO over DPO, iterative DPO, and other strong baselines for RLHF on the AlpacaEval 2.0 benchmark.
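
In parameter space, the acceleration idea reduces to an extrapolation step between optimization rounds. The schematic below substitutes a toy gradient step on a quadratic for an actual DPO round; the precise APO update rule is given in the paper.

```python
# Schematic of Nesterov-style acceleration over iterative preference
# optimization: extrapolate past the current iterate, then run the next round
# from the lookahead point. po_step stands in for one DPO-like round.

def accelerated_po(theta0, po_step, beta=0.3, rounds=10):
    """theta0: initial params (list of floats); po_step: one optimization round."""
    prev = list(theta0)
    curr = po_step(prev)
    for _ in range(rounds - 1):
        lookahead = [c + beta * (c - p) for c, p in zip(curr, prev)]
        prev, curr = curr, po_step(lookahead)
    return curr

# Toy stand-in for a preference-optimization round: gradient step on (t - 5)^2.
step = lambda th: [t - 0.2 * (t - 5.0) for t in th]
print(accelerated_po([0.0], step, rounds=50))  # converges toward 5.0
```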

--------------------------------------------------------------------------------------------------------

Extracting Finite State Machines from Transformers

This study investigates the ability of transformers to learn regular languages from a mechanistic interpretability perspective. The researchers use an extension of the L* algorithm to extract Moore machines from transformers trained on regular languages. The findings provide tighter lower bounds on the trainability of transformers when a finite number of symbols determine the state, and characterize the regular languages a one-layer transformer can learn with good length generalization. This research contributes to our understanding of the formal language learning capabilities of transformer models, with potential applications in improving natural language processing systems, formal verification of neural networks, and developing more interpretable AI models.

Authors:  Rik Adriaensen, Jaron Maene

Link:  https://arxiv.org/abs/2410.06045v1

Date: 2024-10-08

Summary:

Fueled by the popularity of the transformer architecture in deep learning, several works have investigated what formal languages a transformer can learn. Nonetheless, existing results remain hard to compare and a fine-grained understanding of the trainability of transformers on regular languages is still lacking. We investigate transformers trained on regular languages from a mechanistic interpretability perspective. Using an extension of the $L^*$ algorithm, we extract Moore machines from transformers. We empirically find tighter lower bounds on the trainability of transformers, when a finite number of symbols determine the state. Additionally, our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation. However, we also identify failure cases where the determining symbols get misrecognised due to saturation of the attention mechanism.

--------------------------------------------------------------------------------------------------------

Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning

This research addresses the limitations of Vision-Language Models (VLMs) in mathematical reasoning tasks. While VLMs have excelled in image retrieval and Visual Question Answering, they struggle with geometric reasoning, algebraic problem-solving, and counting. The study challenges the effectiveness of captioning pipelines for enhancing VLM performance in math-related tasks, particularly for larger models trained on downstream QnA tasks. Instead, the authors propose task-based prompting, which provides task-specific guidance within the prompt. This approach shows promise in improving VLM performance on math-heavy problems, potentially enhancing AI systems' capabilities in fields requiring visual mathematical reasoning, such as automated tutoring or scientific image analysis.

Authors:  Ayush Singh, Mansi Gupta, Shivank Garg, Abhinav Kumar, Vansh Agrawal

Link:  https://arxiv.org/abs/2410.05928v1

Date: 2024-10-08

Summary:

Vision-Language Models (VLMs) have transformed tasks requiring visual and reasoning abilities, such as image retrieval and Visual Question Answering (VQA). Despite their success, VLMs face significant challenges with tasks involving geometric reasoning, algebraic problem-solving, and counting. These limitations stem from difficulties in effectively integrating multiple modalities and accurately interpreting geometry-related tasks. Various works claim that introducing a captioning pipeline before VQA tasks enhances performance. We incorporated this pipeline for tasks involving geometry, algebra, and counting. We found that captioning results do not generalize; in particular, larger VLMs trained primarily on downstream QnA tasks show random performance on math-related challenges. However, we present a promising alternative: task-based prompting, which enriches the prompt with task-specific guidance. This approach shows promise, proving more effective than direct captioning methods for math-heavy problems.
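
The contrast is easy to show in prompt form. The guidance strings below are invented examples of task-specific prompting, not the authors' exact prompts.

```python
# Illustrative contrast between a caption-first pipeline and task-based
# prompting; all prompt wording here is hypothetical.

def captioning_prompt(question: str) -> list[str]:
    return ["Describe the image in detail.",                # step 1: caption
            f"Using that description, answer: {question}"]  # step 2: answer

def task_based_prompt(question: str, task: str) -> str:
    guidance = {
        "geometry": "Identify shapes, angles, and labeled lengths before reasoning.",
        "counting": "Enumerate each instance of the target object one by one.",
        "algebra": "Extract all given quantities and unknowns, then solve step by step.",
    }
    return f"{guidance[task]}\n{question}"

print(task_based_prompt("How many triangles are in the figure?", "counting"))
```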

--------------------------------------------------------------------------------------------------------

Core Tokensets for Data-efficient Sequential Training of Transformers

This paper introduces the concept of core tokensets to improve the efficiency of sequential training in transformer models. Traditional approaches use coresets, which retain entire samples like images or sentences. However, recognizing that not all tokens in transformer architectures are equally informative, the authors propose a deeper-level data summary at the token level. Core tokensets select the most informative data points and store only their most relevant features. This method demonstrates significant performance retention in tasks like incremental image classification, visual question answering, and continual image captioning while drastically reducing memory requirements. The approach could revolutionize data-efficient learning in AI, enabling more compact and effective model updates across various applications.

Authors:  Subarnaduti Paul, Manuel Brack, Patrick Schramowski, Kristian Kersting, Martin Mundt

Link:  https://arxiv.org/abs/2410.05800v1

Date: 2024-10-08

Summary:

Deep networks are frequently tuned to novel tasks and continue learning from ongoing data streams. Such sequential training requires consolidation of new and past information, a challenge predominantly addressed by retaining the most important data points - formally known as coresets. Traditionally, these coresets consist of entire samples, such as images or sentences. However, recent transformer architectures operate on tokens, leading to the famous assertion that an image is worth 16x16 words. Intuitively, not all of these tokens are equally informative or memorable. Going beyond coresets, we thus propose to construct a deeper-level data summary on the level of tokens. The resulting core tokensets both select the most informative data points and leverage feature attribution to store only their most relevant features. We demonstrate that core tokensets yield significant performance retention in incremental image classification, open-ended visual question answering, and continual image captioning with significantly reduced memory. In fact, we empirically find that a core tokenset of 1% of the data performs comparably to coresets that are at least twice and up to 10 times as large.
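
A sketch of the selection logic, with random scores standing in for a real feature-attribution method: keep the most informative samples and, within each, only the highest-attribution tokens. Fractions and shapes are illustrative.

```python
# Sketch of core tokenset construction; attribution and informativeness scores
# would come from a real method, random values stand in for them here.
import numpy as np

def core_tokenset(tokens, attributions, sample_scores,
                  sample_frac=0.5, token_frac=0.25):
    """tokens: (n_samples, n_tokens, d); attributions: (n_samples, n_tokens)."""
    n_keep = max(1, int(sample_frac * len(tokens)))
    keep = np.argsort(sample_scores)[::-1][:n_keep]      # most informative samples
    core = []
    for i in keep:
        k = max(1, int(token_frac * tokens.shape[1]))
        top = np.argsort(attributions[i])[::-1][:k]      # most relevant tokens
        core.append((i, top, tokens[i, top]))            # store only those features
    return core

toks = np.random.randn(8, 196, 64)          # e.g., 196 patch tokens per image
attr = np.random.rand(8, 196)
scores = np.random.rand(8)
core = core_tokenset(toks, attr, scores)
print(len(core), core[0][2].shape)          # 4 samples, (49, 64) tokens each
```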

--------------------------------------------------------------------------------------------------------

Towards Robust Spacecraft Trajectory Optimization via Transformers

This research extends the capabilities of the Autonomous Rendezvous Transformer (ART) for spacecraft trajectory optimization. It addresses the challenge of solving non-convex optimal control problems in real-time for multi-spacecraft missions. The study applies ART to challenging rendezvous scenarios in Low Earth Orbit, focusing on fault-tolerant behavior under uncertainty. The proposed warm-starting strategy consistently produces high-quality reference trajectories, achieving significant improvements in cost and feasibility compared to conventional methods. Additionally, a post hoc evaluation framework is introduced to assess trajectory quality and mitigate runtime failures. This work represents a crucial step towards deploying AI-driven solutions in safety-critical autonomous systems for space exploration and satellite operations.

Authors:  Yuji Takubo, Tommaso Guffanti, Daniele Gammelli, Marco Pavone, Simone D'Amico

Link:  https://arxiv.org/abs/2410.05585v1

Date: 2024-10-08

Summary:

Future multi-spacecraft missions require robust autonomous trajectory optimization capabilities to ensure safe and efficient rendezvous operations. This capability hinges on solving non-convex optimal control problems in real time, although traditional iterative methods such as sequential convex programming impose significant computational challenges. To mitigate this burden, the Autonomous Rendezvous Transformer introduced a generative model trained to provide near-optimal initial guesses. This approach provides convergence to better local optima (e.g., fuel optimality), improves feasibility rates, and results in faster convergence speed of optimization algorithms through warm-starting. This work extends the capabilities of ART to address robust chance-constrained optimal control problems. Specifically, ART is applied to challenging rendezvous scenarios in Low Earth Orbit (LEO), ensuring fault-tolerant behavior under uncertainty. Through extensive experimentation, the proposed warm-starting strategy is shown to consistently produce high-quality reference trajectories, achieving up to 30% cost improvement and 50% reduction in infeasible cases compared to conventional methods, demonstrating robust performance across multiple state representations. Additionally, a post hoc evaluation framework is proposed to assess the quality of generated trajectories and mitigate runtime failures, marking an initial step toward the reliable deployment of AI-driven solutions in safety-critical autonomous systems such as spacecraft.

--------------------------------------------------------------------------------------------------------

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

This study investigates the mathematical reasoning capabilities of Large Language Models (LLMs) using an improved benchmark called GSM-Symbolic. While LLMs have shown improved performance on the GSM8K benchmark for grade-school-level math questions, the authors question whether this truly reflects advanced reasoning capabilities. GSM-Symbolic uses symbolic templates to generate diverse questions, enabling more controlled evaluations. The research reveals that LLMs exhibit variance in performance when faced with different instantiations of the same question and struggle with increased question complexity. The findings suggest that current LLMs may not perform genuine logical reasoning but rather replicate training data patterns. This work provides crucial insights for developing more robust AI systems capable of true mathematical reasoning.

Authors:  Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar

Link:  https://arxiv.org/abs/2410.05229v1

Date: 2024-10-07

Summary:

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
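
The symbolic-template mechanism is straightforward to sketch: names and numbers are resampled per instance while the ground-truth answer is recomputed from the same expression. The template below is invented, not drawn from the benchmark.

```python
# Sketch of symbolic-template instantiation: each seed yields a different
# surface form of the same problem, with the answer recomputed symbolically.
import random

TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "{name} gives away {z} apples. How many apples remain?")

def instantiate(seed):
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Liam", "Noah", "Mia"])
    x, y = rng.randint(5, 40), rng.randint(5, 40)
    z = rng.randint(1, x + y)                    # keep the answer non-negative
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y - z                           # recomputed per instance
    return question, answer

for s in range(2):
    print(instantiate(s))
```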

--------------------------------------------------------------------------------------------------------

Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild

Model-GLUE addresses the challenge of scaling Large Language Models (LLMs) by proposing a holistic guideline for aggregating pre-trained models. As LLMs continue to excel in various domains, combining disparate models often leads to decreased performance. This research benchmarks existing LLM scaling techniques and formulates a strategy for selecting and aggregating models from a heterogeneous model zoo. The approach involves clustering mergeable models, selecting optimal merging strategies, and integrating clusters through model mixture. Experiments on a Llama-2-based model zoo demonstrate an average performance enhancement of 5.61% without additional training. Model-GLUE could significantly impact the development of more powerful and efficient AI systems by enabling effective combination of specialized models.

Authors:  Xinyu Zhao, Guoheng Sun, Ruisi Cai, Yukun Zhou, Pingzhi Li, Peihao Wang, Bowen Tan, Yexiao He, Li Chen, Yi Liang, Beidi Chen, Binhang Yuan, Hongyi Wang, Ang Li, Zhangyang Wang, Tianlong Chen

Link:  https://arxiv.org/abs/2410.05357v1

Date: 2024-10-07

Summary:

As Large Language Models (LLMs) excel across tasks and specialized domains, scaling LLMs based on existing models has garnered significant attention, which faces the challenge of decreasing performance when combining disparate models. Various techniques have been proposed for the aggregation of pre-trained LLMs, including model merging, Mixture-of-Experts, and stacking. Despite their merits, a comprehensive comparison and synergistic application of them to a diverse model zoo is yet to be adequately addressed. In light of this research gap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First, our work starts with a benchmarking of existing LLM scaling techniques, especially selective merging, and variants of mixture. Utilizing the insights from the benchmark results, we formulate a strategy for the selection and aggregation of a heterogeneous model zoo characterized by different architectures and initializations. Our methodology involves the clustering of mergeable models and optimal merging strategy selection, and the integration of clusters through a model mixture. Finally, evidenced by our experiments on a diverse Llama-2-based model zoo, Model-GLUE shows an average performance enhancement of 5.61%, achieved without additional training. Code is available at: https://github.com/Model-GLUE/Model-GLUE.
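
The simplest aggregation primitive in this space, weight-space averaging of architecture-compatible checkpoints within a cluster, can be sketched as follows. Clustering and the model-mixture step are omitted, and uniform merge weights are a simplifying assumption.

```python
# Sketch of weight-space merging for checkpoints with identical keys/shapes;
# uniform weights are a simplifying assumption, not Model-GLUE's strategy.
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Average parameter tensors across checkpoints key by key."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Toy demonstration with two tiny "models".
a = {"linear.weight": torch.ones(2, 2)}
b = {"linear.weight": torch.full((2, 2), 3.0)}
print(merge_state_dicts([a, b])["linear.weight"])  # all entries 2.0
```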

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.