Week Ending 12.8.2024

 

RESEARCH WATCH: 12.8.2024

 

Transformers Struggle to Learn to Search

Despite transformers' remarkable capabilities in various domains, their ability to perform fundamental search tasks remains limited. This research investigates whether this limitation stems from data scarcity, model size, or inherent architectural constraints by using graph connectivity problems as a testing ground. By generating extensive training data, researchers discovered transformers can learn search capabilities, with each layer progressively expanding reachable vertex sets. However, performance degrades with larger input graphs, suggesting scale alone won't resolve search challenges. This study provides critical insights into transformers' computational limitations and could inform future neural network architecture design.

Authors:  Abulhair Saparov, Srushti Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, He He

Link:  https://arxiv.org/abs/2412.04703v1

Date: 2024-12-06

Summary:

Search is an ability foundational in many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use the foundational graph connectivity problem as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, when given the right training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that for each vertex in the input graph, transformers compute the set of vertices reachable from that vertex. Each layer then progressively expands these sets, allowing the model to search over a number of vertices exponential in the number of layers. However, we find that as the input graph size increases, the transformer has greater difficulty in learning the task. This difficulty is not resolved even as the number of parameters is increased, suggesting that increasing model scale will not lead to robust search abilities. We also find that performing search in-context (i.e., chain-of-thought) does not resolve this inability to learn to search on larger graphs.
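
As a rough illustration of the layer-wise mechanism the authors describe, here is a plain-Python sketch of reachability doubling (not the transformer's learned circuit): each round lets every vertex absorb the reachable sets of the vertices it already reaches, so the covered path length grows exponentially in the number of rounds.

# Illustrative sketch (not the paper's model): layer-wise "reachable set
# expansion" emulated with plain set operations. After L rounds of path
# doubling, vertices up to 2**L hops away become reachable.

def expand_reachability(edges, num_layers):
    """edges: iterable of (u, v) directed pairs."""
    # Layer 0: each vertex reaches itself and its direct successors.
    reach = {}
    for u, v in edges:
        reach.setdefault(u, {u}).add(v)
        reach.setdefault(v, {v})
    for _ in range(num_layers):
        # One "layer": every vertex absorbs the reachable sets of the
        # vertices it can already reach (doubles the covered path length).
        new_reach = {v: set(s) for v, s in reach.items()}
        for v, s in reach.items():
            for u in s:
                new_reach[v] |= reach[u]
        reach = new_reach
    return reach

if __name__ == "__main__":
    # A simple directed path 0 -> 1 -> 2 -> ... -> 7
    path_edges = [(i, i + 1) for i in range(7)]
    for layers in range(4):
        r = expand_reachability(path_edges, layers)
        print(f"{layers} layers: vertex 0 reaches {sorted(r[0])}")
    # 0 layers: {0, 1}; 1 layer: 2 hops; 2 layers: 4 hops; 3 layers: all 8 vertices.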

--------------------------------------------------------------------------------------------------------

Multiclass Post-Earthquake Building Assessment Integrating Optical and SAR Satellite Imagery, Ground Motion, and Soil Data with Transformers

Earthquake damage assessment is crucial for effective disaster response, but traditional manual inspections are time-consuming and dangerous. This research introduces an innovative transformer-based framework that combines satellite imagery with building-specific metadata like seismic intensity indicators and soil properties. By integrating multiple data sources, the model achieves superior multiclass building damage identification, demonstrated through analysis of the Turkey-Syria earthquake. The approach enables faster, more accurate damage evaluations, potentially revolutionizing disaster response strategies by providing precise, building-level assessments that can accelerate recovery efforts and improve community resilience.

Authors:  Deepank Singh, Vedhus Hoskere, Pietro Milillo

Link:  https://arxiv.org/abs/2412.04664v1

Date: 2024-12-05

Summary:

Timely and accurate assessments of building damage are crucial for effective response and recovery in the aftermath of earthquakes. Conventional preliminary damage assessments (PDA) often rely on manual door-to-door inspections, which are not only time-consuming but also pose significant safety risks. To safely expedite the PDA process, researchers have studied the applicability of satellite imagery processed with heuristic and machine learning approaches. These approaches output binary or, more recently, multiclass damage states at the scale of a block or a single building. However, the current performance of such approaches limits practical applicability. To address this limitation, we introduce a metadata-enriched, transformer-based framework that combines high-resolution post-earthquake satellite imagery with building-specific metadata relevant to the seismic performance of the structure. Our model achieves state-of-the-art performance in multiclass post-earthquake damage identification for buildings from the Turkey-Syria earthquake on February 6, 2023. Specifically, we demonstrate that incorporating metadata, such as seismic intensity indicators, soil properties, and SAR damage proxy maps, not only enhances the model's accuracy and ability to distinguish between damage classes, but also improves its generalizability across various regions. Furthermore, we conducted a detailed, class-wise analysis of feature importance to understand the model's decision-making across different levels of building damage. This analysis reveals how individual metadata features uniquely contribute to predictions for each damage class. By leveraging both satellite imagery and metadata, our proposed framework enables faster and more accurate damage assessments for precise, multiclass, building-level evaluations that can improve disaster response and accelerate recovery efforts for affected communities.
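
A minimal sketch of the general fusion pattern described above, using a small PyTorch encoder with hypothetical feature dimensions rather than the authors' actual architecture: image patch features and building metadata are projected into a shared token space and classified jointly.

# Minimal sketch (assumptions, not the authors' architecture): fuse satellite-image
# patch tokens with building-level metadata (seismic intensity, soil properties,
# SAR damage-proxy values) as extra tokens in a small transformer classifier.
import torch
import torch.nn as nn

class MetadataEnrichedClassifier(nn.Module):
    def __init__(self, img_dim=256, meta_dim=8, d_model=128, num_classes=4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)    # project image patch features
        self.meta_proj = nn.Linear(meta_dim, d_model)  # project per-building metadata
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)    # multiclass damage states

    def forward(self, patch_feats, metadata):
        # patch_feats: (B, N_patches, img_dim), metadata: (B, meta_dim)
        tokens = torch.cat(
            [self.cls.expand(patch_feats.size(0), -1, -1),
             self.img_proj(patch_feats),
             self.meta_proj(metadata).unsqueeze(1)], dim=1)
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])  # classify from the [CLS] token

model = MetadataEnrichedClassifier()
logits = model(torch.randn(2, 16, 256), torch.randn(2, 8))
print(logits.shape)  # torch.Size([2, 4])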

--------------------------------------------------------------------------------------------------------

Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs

As contextual AI becomes integral to wearable technology like smart glasses, computational constraints of micro-controller units (MCUs) pose significant challenges. This research proposes a groundbreaking methodology for deploying transformer models on low-power devices by distributing inference across multiple MCUs. By partitioning computational tasks and maintaining stationary on-chip weights, the approach achieves remarkable energy efficiency and performance improvements. Successfully demonstrated on models like TinyLlama and MobileBERT, this technique could enable more sophisticated on-device intelligence for interactive wearable technologies, potentially transforming how we interact with sensors and embedded AI systems.

Authors:  Severin Bochem, Victor J. B. Jung, Arpan Prasad, Francesco Conti, Luca Benini

Link:  https://arxiv.org/abs/2412.04372v1

Date: 2024-12-05

Summary:

Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, a super-linear speedup of 26.1 x, and an Energy Delay Product (EDP) improvement of 27.2 x, compared to a single-chip system. On MobileBERT, the distributed system's runtime is 38.8 ms, with a super-linear 4.7 x speedup when using 4 MCUs compared to a single-chip system.
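
The weight-stationary partitioning idea can be illustrated with a toy example; the mapping below is an assumption made for illustration, not the paper's actual schedule. Each simulated MCU keeps a slice of a layer's weights on-chip, so only small activation vectors cross the interconnect.

# Toy sketch: partition a transformer linear layer's weights across N MCUs so
# each chip keeps its weight slice stationary in on-chip memory and only small
# activations move between chips.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_mcus = 64, 256, 8

W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)

# Each MCU holds a d_out/n_mcus slice of the rows of W (weights never leave the chip).
slices = np.array_split(W, n_mcus, axis=0)

# Off-chip traffic: broadcast x (d_in values) in, gather d_out/n_mcus values out per chip.
partial_outputs = [Wi @ x for Wi in slices]   # computed "on" each MCU
y_distributed = np.concatenate(partial_outputs)

assert np.allclose(y_distributed, W @ x)
print(f"weights resident per chip: {slices[0].size} vs full layer: {W.size}")
print(f"activations crossing the interconnect per chip: {d_in + d_out // n_mcus} values")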

--------------------------------------------------------------------------------------------------------

Monet: Mixture of Monosemantic Experts for Transformers

Interpreting large language models' internal computations remains challenging due to neurons' polysemantic nature. Monet introduces an innovative architecture that directly incorporates sparse dictionary learning into mixture-of-experts pretraining, enabling unprecedented insights into model behavior. By decomposing experts and scaling their count to 262,144 per layer, researchers achieved mutual exclusivity of knowledge across experts. This approach not only enhances mechanistic interpretability but also allows knowledge manipulation across domains and languages and can help mitigate toxic content generation. The research represents a significant step towards creating more transparent and controllable AI systems.

Authors:  Jungwoo Park, Young Jin Ahn, Kee-Eung Kim, Jaewoo Kang

Link:  https://arxiv.org/abs/2412.04139v1

Date: 2024-12-05

Summary:

Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Monet.
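
The square-root parameter scaling can be seen with a back-of-the-envelope calculation. The sketch below composes experts from two small banks in the spirit of product-key decomposition; the dimensions are made up and the construction is meant only to illustrate the scaling argument, not Monet's exact decomposition.

# Hedged sketch: one way parameters can scale ~ sqrt(num_experts). Compose each
# "expert" from a pair (i, j) drawn from two small banks of sub-layers; this
# shows the counting argument only.
import math

d_model, d_expert = 1024, 16
num_experts = 262_144            # experts per layer reported for Monet
m = int(math.isqrt(num_experts)) # 512 "down" banks x 512 "up" banks

# Naive MoE: every expert stores its own down- and up-projection.
naive_params = num_experts * (d_model * d_expert + d_expert * d_model)

# Decomposed: store m down-projections and m up-projections; expert (i, j)
# is the composition up_j(act(down_i(x))), giving m*m addressable experts.
decomposed_params = m * d_model * d_expert + m * d_expert * d_model

print(f"naive:      {naive_params / 1e9:.2f} B parameters")
print(f"decomposed: {decomposed_params / 1e6:.2f} M parameters "
      f"(~sqrt({num_experts}) = {m} banks each way)")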

--------------------------------------------------------------------------------------------------------

Navigation World Models

Navigation is a fundamental skill for intelligent agents, and this research introduces a novel Navigation World Model (NWM) using a Conditional Diffusion Transformer. Trained on diverse egocentric videos from human and robotic agents, the billion-parameter model can predict visual observations and plan navigation trajectories dynamically. Unlike traditional supervised policies, NWM can incorporate constraints during planning and even imagine trajectories in unfamiliar environments from a single image. This breakthrough could revolutionize robotics, autonomous vehicles, and AI systems requiring sophisticated spatial reasoning and adaptive navigation capabilities.

Authors:  Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun

Link:  https://arxiv.org/abs/2412.03572v1

Date: 2024-12-04

Summary:

Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.
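
Planning by simulating and ranking trajectories reduces to a simple loop, sketched below with a toy stand-in for the world model (the function names and dynamics are hypothetical, not the NWM interface): sample candidate action sequences, roll each forward through the model, and keep the one whose predicted final observation best matches the goal.

# Minimal planning-by-simulation sketch with hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def world_model(obs, action):
    # Placeholder for NWM's conditional diffusion transformer: here, a toy
    # 2-D dynamics model standing in for "predict the next observation".
    return obs + 0.1 * action

def goal_score(obs, goal):
    return -np.linalg.norm(obs - goal)  # higher is better

def plan(obs0, goal, horizon=20, num_candidates=256):
    best_actions, best_score = None, -np.inf
    for _ in range(num_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, 2))  # candidate navigation actions
        obs = obs0
        for a in actions:                                # simulate the trajectory
            obs = world_model(obs, a)
        score = goal_score(obs, goal)                    # evaluate goal achievement
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions, best_score

actions, score = plan(np.zeros(2), goal=np.array([1.0, 0.5]))
print(f"best terminal distance to goal: {-score:.3f}")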

--------------------------------------------------------------------------------------------------------

Negative refraction of light in an atomic medium

Manipulating light propagation has long fascinated scientists seeking to overcome natural material limitations. This research demonstrates negative refraction of light in atomic media without artificial metamaterials, potentially opening new avenues for advanced optical technologies. By achieving high-transmission negative refraction in atomic arrays and providing an intuitive description based on collective excitation bands, the study shows this phenomenon's robustness to lattice imperfections. The findings could have significant implications for developing next-generation optical devices, potentially surpassing current diffraction limitations.

Authors:  L. Ruks, K. E. Ballantine, J. Ruostekoski

Link:  https://arxiv.org/abs/2412.03622v1

Date: 2024-12-04

Summary:

The quest to manipulate light propagation in ways not possible with natural media has driven the development of artificially structured metamaterials. One of the most striking effects is negative refraction, where the light beam deflects away from the boundary normal. However, due to material characteristics, the applications of this phenomenon, such as lensing that surpasses the diffraction limit, have been constrained. Here, we demonstrate negative refraction of light in an atomic medium without the use of artificial metamaterials, employing essentially exact simulations of light propagation. High transmission negative refraction is achieved in atomic arrays for different level structures and lattice constants, within the scope of currently realised experimental systems. We introduce an intuitive description of negative refraction based on collective excitation bands, whose transverse group velocities are antiparallel to the excitation quasi-momenta. We also illustrate how this phenomenon is robust to lattice imperfections and can be significantly enhanced through subradiance.
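
The band-structure intuition in the abstract can be stated compactly; the following is a hedged sketch of the standard argument, not the paper's derivation. With the in-plane quasi-momentum $\mathbf{k}_{\parallel}$ conserved across the boundary and the collective-excitation band dispersion $\omega(\mathbf{k})$,

$$\mathbf{v}_g = \nabla_{\mathbf{k}}\,\omega(\mathbf{k}), \qquad \mathbf{v}_{g,\parallel}\cdot\mathbf{k}_{\parallel} < 0 \;\Longrightarrow\; \text{negative refraction},$$

i.e., when the in-plane group velocity of the excited band runs against the conserved quasi-momentum, the energy flow emerges on the same side of the surface normal as the incident beam.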

--------------------------------------------------------------------------------------------------------

Deep-Learning Based Docking Methods: Fair Comparisons to Conventional Docking Workflows

Molecular docking is critical in drug discovery, and this study critically evaluates recent deep learning approaches against conventional methods. By comparing DiffDock with established docking workflows, researchers revealed significant limitations in the deep learning approach. The study exposed that DiffDock's performance heavily relies on near-identical training cases, essentially performing a sophisticated table lookup rather than genuinely learning generalized docking strategies. This research underscores the importance of rigorous evaluation in computational drug discovery and highlights potential pitfalls in machine learning methodology.

Authors:  Ajay N. Jain, Ann E. Cleves, W. Patrick Walters

Link:  https://arxiv.org/abs/2412.02889v1

Date: 2024-12-03

Summary:

The diffusion learning method, DiffDock, for docking small-molecule ligands into protein binding sites was recently introduced. Results included comparisons to more conventional docking approaches, with DiffDock showing superior performance. Here, we employ a fully automatic workflow using the Surflex-Dock methods to generate a fair baseline for conventional docking approaches. Results were generated for the common and expected situation where a binding site location is known and also for the condition of an unknown binding site. For the known binding site condition, Surflex-Dock success rates at 2.0 Angstroms RMSD far exceeded those for DiffDock (Top-1/Top-5 success rates, respectively, were 68/81% compared with 45/51%). Glide performed with similar success rates (67/73%) to Surflex-Dock for the known binding site condition, and results for AutoDock Vina and Gnina followed this pattern. For the unknown binding site condition, using an automated method to identify multiple binding pockets, Surflex-Dock success rates again exceeded those of DiffDock, but by a somewhat lesser margin. DiffDock made use of roughly 17,000 co-crystal structures for learning (98% of PDBBind version 2020, pre-2019 structures) for a training set in order to predict on 363 test cases (2% of PDBBind 2020) from 2019 forward. DiffDock's performance was inextricably linked with the presence of near-neighbor cases of close to identical protein-ligand complexes in the training set for over half of the test set cases. DiffDock exhibited a 40 percentage point difference on near-neighbor cases (two-thirds of all test cases) compared with cases with no near-neighbor training case. DiffDock has apparently encoded a type of table-lookup during its learning process, rendering meaningful applications beyond its reach. Further, it does not perform even close to competitively with a competently run modern docking workflow.
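
The evaluation logic behind these numbers is straightforward to express; the sketch below uses synthetic RMSD values and a synthetic near-neighbor flag purely to show the bookkeeping (success at 2.0 Angstroms, Top-1/Top-5, and the near-neighbor split), not the paper's data.

# Sketch of the evaluation described above, on fake data.
import numpy as np

rng = np.random.default_rng(1)
n_cases, n_poses = 363, 5
rmsd = rng.gamma(shape=2.0, scale=1.5, size=(n_cases, n_poses))  # fake RMSDs per ranked pose
has_near_neighbor = rng.random(n_cases) < 2 / 3                  # ~two-thirds, as reported

top1 = rmsd[:, 0] <= 2.0
top5 = (rmsd <= 2.0).any(axis=1)
print(f"Top-1 / Top-5 success: {top1.mean():.0%} / {top5.mean():.0%}")

for label, mask in [("near-neighbor in training", has_near_neighbor),
                    ("no near neighbor", ~has_near_neighbor)]:
    print(f"{label:>26}: Top-1 {top1[mask].mean():.0%} on {mask.sum()} cases")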

--------------------------------------------------------------------------------------------------------

The Asymptotic Behavior of Attention in Transformers

Understanding transformer mechanics is crucial for advancing AI technologies. This mathematical analysis explores the asymptotic properties of attention mechanisms, revealing a fascinating tendency for all tokens to converge to each other. By providing rigorous theoretical insights and comparing results with empirical studies using GPT-2, the research offers profound understanding of how information propagates through transformer architectures. These findings could inform future model design, potentially improving information processing and representation learning in neural networks.

Authors:  Álvaro Rodríguez Abella, João Pedro Silvestre, Paulo Tabuada

Link:  https://arxiv.org/abs/2412.02682v1

Date: 2024-12-03

Summary:

A key component of transformers is the attention mechanism orchestrating how each token influences the propagation of every other token through a transformer. In this paper we provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion: all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.
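
The convergence phenomenon is easy to observe in a toy simulation; the sketch below iterates a bare softmax self-attention update (no MLP, projection weights omitted, which simplifies the paper's setting) and tracks how the maximum pairwise distance between token representations shrinks.

# Toy simulation of the clustering effect described above (a sketch, not the
# paper's formal setting).
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 8, 16
X = rng.standard_normal((n_tokens, d))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def diameter(X):
    diffs = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()

for step in range(0, 31, 10):
    print(f"step {step:2d}: max pairwise distance = {diameter(X):.4f}")
    for _ in range(10):
        A = softmax(X @ X.T / np.sqrt(d))  # row-stochastic attention matrix
        X = A @ X                          # each token becomes a convex mix of all tokens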

--------------------------------------------------------------------------------------------------------

Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

Text-to-video generation faces significant challenges in creating realistic object interactions and movements. This research introduces an innovative approach using AI feedback to improve model performance. By leveraging vision-language models and developing a unified probabilistic objective, researchers demonstrated substantial improvements in video quality, particularly for complex multi-object interactions. The study's approach of using binary AI feedback represents a promising direction for enhancing generative AI's understanding of physical dynamics, with potential applications in animation, simulation, and creative content generation.

Authors:  Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, Sherry Yang

Link:  https://arxiv.org/abs/2412.02617v1

Date: 2024-12-03

Summary:

Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
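
The reward-weighted, KL-regularized finetuning objective has a familiar generic form; the sketch below shows that form with made-up numbers and binary feedback, and is not the paper's exact derivation: samples from the reference model are reweighted by exp(r / beta), the closed-form tilt toward high reward under a KL penalty of strength beta.

# Hedged sketch of a KL-regularized, reward-weighted finetuning objective.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5
rewards = rng.integers(0, 2, size=8).astype(float)   # binary VLM feedback per video
nll = rng.uniform(1.0, 3.0, size=8)                  # per-sample negative log-likelihood
                                                     # under the model being finetuned

weights = np.exp(rewards / beta)
weights /= weights.sum()                             # self-normalized importance weights

loss = float(np.sum(weights * nll))                  # reward-weighted regression loss
print(f"binary feedback: {rewards.astype(int)}")
print(f"weighted finetuning loss: {loss:.3f}")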

--------------------------------------------------------------------------------------------------------

Does your model understand genes? A benchmark of gene properties for biological and text models

Biological research increasingly relies on deep learning models trained on various data types. This study introduces an architecture-agnostic benchmarking approach to evaluate models' understanding of gene properties across different categories like genomic characteristics, regulatory functions, and protein properties. By creating hundreds of tasks and comparing text-based, protein language, and expression-based models, the research provides crucial insights into model capabilities. This framework could significantly advance AI's application in biological research, potentially accelerating therapeutic discovery and genetic understanding.

Authors:  Yoav Kan-Tor, Michael Morris Danziger, Eden Zohar, Matan Ninio, Yishai Shimoni

Link:  https://arxiv.org/abs/2412.04075v1

Date: 2024-12-05

Summary:

The application of deep learning methods, particularly foundation models, in biological research has surged in recent years. These models can be text-based or trained on underlying biological data, especially omics data of various types. However, comparing the performance of these models consistently has proven to be a challenge due to differences in training data and downstream tasks. To tackle this problem, we developed an architecture-agnostic benchmarking approach that, instead of evaluating the models directly, leverages entity representation vectors from each model and trains simple predictive models for each benchmarking task. This ensures that all types of models are evaluated using the same input and output types. Here we focus on gene properties collected from professionally curated bioinformatics databases. These gene properties are categorized into five major groups: genomic properties, regulatory functions, localization, biological processes, and protein properties. Overall, we define hundreds of tasks based on these databases, which include binary, multi-label, and multi-class classification tasks. We apply these benchmark tasks to evaluate expression-based models, large language models, protein language models, DNA-based models, and traditional baselines. Our findings suggest that text-based models and protein language models generally outperform expression-based models in genomic properties and regulatory functions tasks, whereas expression-based models demonstrate superior performance in localization tasks. These results should aid in the development of more informed artificial intelligence strategies for biological understanding and therapeutic discovery. To ensure the reproducibility and transparency of our findings, we have made the source code and benchmark data publicly accessible for further investigation and expansion at github.com/BiomedSciAI/gene-benchmark.
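
The benchmarking protocol itself is simple to sketch: freeze each model, take its per-gene embeddings, and train the same lightweight probe for every task. The snippet below uses random embeddings and one synthetic binary task as stand-ins; the real tasks and data live at github.com/BiomedSciAI/gene-benchmark.

# Sketch of the architecture-agnostic protocol on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_genes, n_models = 500, 3

# Stand-ins for embeddings from, e.g., a text LLM, a protein LM, an expression model.
embeddings = {f"model_{i}": rng.standard_normal((n_genes, 64)) for i in range(n_models)}
labels = rng.integers(0, 2, size=n_genes)  # one binary gene-property task

for name, X in embeddings.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # same probe for every model
    print(f"{name}: task accuracy = {clf.score(X_te, y_te):.2f}")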

--------------------------------------------------------------------------------------------------------

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

Leveraging abundant video data, this research proposes a novel approach to robotic learning by emphasizing motion-related knowledge. The Moto framework converts video content into latent motion token sequences, learning a "language" of motion in an unsupervised manner. By pre-training a motion-focused model and implementing a co-fine-tuning strategy, the researchers demonstrated superior robustness in robot manipulation tasks. This approach could revolutionize robotic learning by enabling more efficient knowledge transfer from video data, potentially making robot training more adaptable and cost-effective.

Authors:  Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu

Link:  https://arxiv.org/abs/2412.04445v1

Date: 2024-12-05

Summary:

Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.
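
A heavily simplified sketch of the motion-token idea (not Moto's Latent Motion Tokenizer): quantize frame-to-frame changes in a latent space against a small codebook, yielding a discrete sequence that an autoregressive model could be pretrained on. The codebook size and latent dimension below are arbitrary.

# Toy motion tokenizer: nearest-codebook quantization of latent frame deltas.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((32, 8))          # 32 motion codes in an 8-d latent space

def tokenize_motion(frame_latents):
    """frame_latents: (T, 8) per-frame latents -> (T-1,) motion token ids."""
    deltas = np.diff(frame_latents, axis=0)      # motion = change between frames
    dists = np.linalg.norm(deltas[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)                  # nearest codebook entry per step

video_latents = rng.standard_normal((16, 8))     # stand-in for encoded video frames
motion_tokens = tokenize_motion(video_latents)
print("motion token sequence:", motion_tokens.tolist())
# A GPT-style model pretrained to predict motion_tokens[t+1] from motion_tokens[:t+1]
# captures the motion prior that is later co-finetuned with real robot actions.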

--------------------------------------------------------------------------------------------------------

HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting

Generating high-quality novel view renderings remains challenging, especially with scenes containing transient objects. This research introduces HybridGS, a novel representation using 2D Gaussians for transient objects and 3D Gaussians for static scenes. By decomposing scenes based on viewpoint consistency and presenting a multi-view regulated supervision method, the approach achieves state-of-the-art performance in view synthesis. This technique could significantly improve rendering technologies in fields like computer graphics, virtual reality, and simulation.

Authors:  Jingyu Lin, Jiaqi Gu, Lubin Fan, Bojian Wu, Yujing Lou, Renjie Chen, Ligang Liu, Jieping Ye

Link:  https://arxiv.org/abs/2412.03844v1

Date: 2024-12-05

Summary:

Generating high-quality novel view renderings of 3D Gaussian Splatting (3DGS) in scenes featuring transient objects is challenging. We propose a novel hybrid representation, termed HybridGS, using 2D Gaussians for transient objects per image and maintaining traditional 3D Gaussians for the whole static scene. Note that 3DGS itself is better suited for modeling static scenes that assume multi-view consistency, but transient objects appear occasionally and do not adhere to this assumption; thus we model them as planar objects from a single view, represented with 2D Gaussians. Our novel representation decomposes the scene from the perspective of fundamental viewpoint consistency, making it more reasonable. Additionally, we present a novel multi-view regulated supervision method for 3DGS that leverages information from co-visible regions, further enhancing the distinctions between the transients and statics. Then, we propose a straightforward yet effective multi-stage training strategy to ensure robust training and high-quality view synthesis across various settings. Experiments on benchmark datasets show state-of-the-art performance in novel view synthesis in both indoor and outdoor scenes, even in the presence of distracting elements.
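
The decomposition can be pictured as a per-view composite; the sketch below uses random arrays as stand-ins for the two renderers (it is not 3DGS code), showing how a shared static render and an image-specific transient layer combine under a reconstruction loss.

# Rough sketch of the decomposition with stand-in renderers.
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4
static_render = rng.random((H, W, 3))        # from shared 3D Gaussians (all views)
transient_rgb = rng.random((H, W, 3))        # from this image's 2D Gaussians only
transient_alpha = rng.random((H, W, 1))      # transient opacity for this view
observed = rng.random((H, W, 3))

# Transients are composited in front of the static scene for this view only.
composite = transient_alpha * transient_rgb + (1 - transient_alpha) * static_render

l1_loss = np.abs(composite - observed).mean()
print(f"per-view reconstruction loss: {l1_loss:.4f}")
# Gradients w.r.t. static_render are shared across views (multi-view consistency),
# while each image's 2D Gaussians absorb the view-specific transient content.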

--------------------------------------------------------------------------------------------------------

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Text-to-video generation struggles with creating complex dynamic scenes involving multiple objects and interactions. This research proposes GenMAC, an innovative multi-agent framework enabling compositional video generation through collaborative, iterative workflows. By decomposing tasks across specialized agents and implementing a self-routing mechanism, the approach addresses challenges in attribute binding, temporal dynamics, and object interactions. The method represents a significant advancement in generative AI, potentially transforming content creation in entertainment, education, and design.

Authors:  Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu

Link:  https://arxiv.org/abs/2412.04440v1

Date: 2024-12-05

Summary:

Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination of a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: verification agent, suggestion agent, correction agent, and output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state-of-the-art performance in compositional text-to-video generation.
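
The collaborative workflow reduces to an iterative loop, sketched schematically below with placeholder functions in place of the MLLM agents and the video generator; the agent names follow the abstract, everything else is illustrative.

# Schematic sketch of the Design -> Generation -> Redesign loop.
def design_agent(prompt):
    return {"prompt": prompt, "layout": "initial frame-wise layout", "guidance": 7.5}

def generate_video(plan):
    return f"video({plan['prompt']}, {plan['layout']}, cfg={plan['guidance']})"

def verification_agent(video, prompt):
    # Would call an MLLM to check attribute binding, dynamics, interactions.
    return {"ok": False, "issues": ["object B never touches object A"]}

def redesign(plan, review):
    # Suggestion -> correction -> output-structuring agents, collapsed here into
    # one step that revises the prompt, layout, and guidance scale.
    return dict(plan, layout="revised layout addressing: " + "; ".join(review["issues"]))

plan = design_agent("a red cube pushes a blue ball off the table")
for iteration in range(3):                         # Generation <-> Redesign loop
    video = generate_video(plan)
    review = verification_agent(video, plan["prompt"])
    print(f"iteration {iteration}: issues = {review['issues']}")
    if review["ok"]:
        break
    plan = redesign(plan, review)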

--------------------------------------------------------------------------------------------------------

BodyMetric: Evaluating the Realism of Human Bodies in Text-to-Image Generation

Generating realistic human body images remains a significant challenge for text-to-image models. BodyMetric introduces a learnable metric predicting body realism by leveraging 3D body representations and textual descriptions. By developing an annotation pipeline and creating the BodyRealism dataset, researchers built a tool for systematically evaluating human body generation. This approach could dramatically improve text-to-image model evaluation, enabling more precise benchmarking and driving advancements in generative AI for fields like fashion, design, and digital content creation.

Authors:  Nefeli Andreou, Varsha Vivek, Ying Wang, Alex Vorobiov, Tiffany Deng, Raja Bala, Larry Davis, Betty Mohler Tesch

Link:  https://arxiv.org/abs/2412.04086v2

Date: 2024-12-06

Summary:

Accurately generating images of human bodies from text remains a challenging problem for state of the art text-to-image models. Commonly observed body-related artifacts include extra or missing limbs, unrealistic poses, blurred body parts, etc. Currently, evaluation of such artifacts relies heavily on time-consuming human judgments, limiting the ability to benchmark models at scale. We address this by proposing BodyMetric, a learnable metric that predicts body realism in images. BodyMetric is trained on realism labels and multi-modal signals including 3D body representations inferred from the input image, and textual descriptions. In order to facilitate this approach, we design an annotation pipeline to collect expert ratings on human body realism leading to a new dataset for this task, namely, BodyRealism. Ablation studies support our architectural choices for BodyMetric and the importance of leveraging a 3D human body prior in capturing body-related artifacts in 2D images. In comparison to concurrent metrics which evaluate general user preference in images, BodyMetric specifically reflects body-related artifacts. We demonstrate the utility of BodyMetric through applications that were previously infeasible at scale. In particular, we use BodyMetric to benchmark the generation ability of text-to-image models to produce realistic human bodies. We also demonstrate the effectiveness of BodyMetric in ranking generated images based on the predicted realism scores.
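
A hedged sketch of what a learnable realism metric of this kind can look like, with hypothetical feature dimensions and a generic fusion head rather than BodyMetric's actual architecture: image, text, and inferred 3D-body features are concatenated and regressed to a scalar realism score.

# Generic multimodal realism regressor (illustrative only).
import torch
import torch.nn as nn

class RealismMetric(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, body_dim=72):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + txt_dim + body_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))                      # scalar body-realism score

    def forward(self, img_feat, txt_feat, body_params):
        return self.net(torch.cat([img_feat, txt_feat, body_params], dim=-1)).squeeze(-1)

metric = RealismMetric()
scores = metric(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 72))
print(scores.shape)   # torch.Size([4]) -- usable for ranking generated images
# Training would fit these scores to the expert realism labels in BodyRealism.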

--------------------------------------------------------------------------------------------------------

MISR: Measuring Instrumental Self-Reasoning in Frontier Models

As AI systems become increasingly sophisticated, understanding their self-reasoning capabilities becomes crucial. This research develops a comprehensive evaluation suite to assess instrumental self-reasoning across various scenarios, including self-modification and knowledge seeking. By testing state-of-the-art language models, researchers found that such abilities emerge only in the most capable systems and remain highly context-dependent. This work provides a critical framework for measuring AI's evolving reasoning capabilities, potentially informing ethical AI development and understanding emerging intelligent system behaviors.

Authors:  Kai Fronsdal, David Lindner

Link:  https://arxiv.org/abs/2412.03904v1

Date: 2024-12-05

Summary:

We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, hence our evaluation can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at https://github.com/kaifronsdal/Self-Reasoning-Evals.

--------------------------------------------------------------------------------------------------------

Experience-driven discovery of planning strategies

Cognitive efficiency has long puzzled researchers: how do humans navigate complex decision-making with limited mental resources? This study explores the fundamental mechanism of strategy discovery through metacognitive reinforcement learning. By investigating how individuals develop new planning approaches, the research offers unprecedented insights into human learning processes. The novel experimental design challenges existing understanding of strategy formation, demonstrating that humans can dynamically adapt their planning methods. While the proposed metacognitive reinforcement learning models show promise in explaining strategy discovery, they currently lag behind human learning speeds, presenting an exciting avenue for future computational cognitive science research.

Authors:  Ruiqi He, Falk Lieder

Link:  https://arxiv.org/abs/2412.03111v1

Date: 2024-12-04

Summary:

One explanation for how people can plan efficiently despite limited cognitive resources is that we possess a set of adaptive planning strategies and know when and how to use them. But how are these strategies acquired? While previous research has studied how individuals learn to choose among existing strategies, little is known about the process of forming new planning strategies. In this work, we propose that new planning strategies are discovered through metacognitive reinforcement learning. To test this, we designed a novel experiment to investigate the discovery of new planning strategies. We then present metacognitive reinforcement learning models and demonstrate their capability for strategy discovery as well as show that they provide a better explanation of human strategy discovery than alternative learning mechanisms. However, when fitted to human data, these models exhibit a slower discovery rate than humans, leaving room for improvement.

--------------------------------------------------------------------------------------------------------

Reinforcement Learning: An Overview

Reinforcement learning represents a critical frontier in artificial intelligence and machine learning, bridging computational strategies with adaptive decision-making. This comprehensive manuscript provides a crucial landscape view of the field, synthesizing contemporary approaches across value-based methods, policy-gradient techniques, and model-based strategies. By offering a big-picture perspective, the work serves as an essential reference for researchers and practitioners navigating the complex terrain of sequential decision-making. The inclusion of a brief discussion on reinforcement learning's intersection with large language models underscores the rapidly evolving nature of this domain, potentially guiding future interdisciplinary research and technological innovations.

Authors:  Kevin Murphy

Link:  https://arxiv.org/abs/2412.05265v1

Date: 2024-12-06

Summary:

This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based RL, policy-gradient methods, model-based methods, and various other topics (including a very brief discussion of RL+LLMs).

--------------------------------------------------------------------------------------------------------

Action Mapping for Reinforcement Learning in Continuous Environments with Constraints

Deep reinforcement learning has shown remarkable potential across numerous domains, yet applying it to constrained environments remains challenging. This research introduces a novel training strategy called action mapping, which addresses critical limitations in sample efficiency and convergence. By leveraging feasibility models and decoupling action learning from policy optimization, the approach enables agents to focus on selecting optimal actions within a reduced, feasible set. The method demonstrates significant performance improvements in continuous action spaces, particularly with imperfect feasibility models. This breakthrough could revolutionize robotic control, autonomous systems, and complex decision-making scenarios with intricate environmental constraints.

Authors:  Mirco Theile, Lukas Dirnberger, Raphael Trumpp, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli

Link:  https://arxiv.org/abs/2412.04327v1

Date: 2024-12-05

Summary:

Deep reinforcement learning (DRL) has had success across various domains, but applying it to environments with constraints remains challenging due to poor sample efficiency and slow convergence. Recent literature explored incorporating model knowledge to mitigate these problems, particularly through the use of models that assess the feasibility of proposed actions. However, integrating feasibility models efficiently into DRL pipelines in environments with continuous action spaces is non-trivial. We propose a novel DRL training strategy utilizing action mapping that leverages feasibility models to streamline the learning process. By decoupling the learning of feasible actions from policy optimization, action mapping allows DRL agents to focus on selecting the optimal action from a reduced feasible action set. We demonstrate through experiments that action mapping significantly improves training performance in constrained environments with continuous action spaces, especially with imperfect feasibility models.
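
One generic way to realize action mapping is sketched below (an illustrative construction, not the paper's exact method): the policy outputs a latent value, and a feasibility model is used to map that latent onto the feasible subset of a continuous action space, so learning only ever selects among feasible actions.

# Illustrative action-mapping sketch.
import numpy as np

rng = np.random.default_rng(0)

def feasibility_model(state, action):
    # Stand-in constraint: actions must stay within a state-dependent disk.
    return np.linalg.norm(action - state[:2]) <= 0.5

def map_action(state, latent, num_candidates=128):
    candidates = rng.uniform(-1, 1, size=(num_candidates, 2))   # raw action proposals
    feasible = [a for a in candidates if feasibility_model(state, a)]
    if not feasible:
        return np.zeros(2)                                      # fallback action
    index = int(latent * len(feasible)) % len(feasible)         # latent picks within
    return feasible[index]                                      # the feasible set

state = np.array([0.2, -0.1, 0.0])
latent = 0.73                                   # what the DRL policy would output
action = map_action(state, latent)
print("mapped feasible action:", action, "feasible:", feasibility_model(state, action))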

--------------------------------------------------------------------------------------------------------

Demonstration Selection for In-Context Learning via Reinforcement Learning

The effectiveness of large language models in few-shot learning critically depends on strategically selected demonstrations. This innovative research introduces the Relevance-Diversity Enhanced Selection (RDES) approach, which uses reinforcement learning to optimize demonstration selection for text classification tasks. By employing a Q-learning framework and calculating diversity scores based on label distribution, RDES ensures balanced representation and improved classification accuracy. The study's comprehensive experiments across multiple datasets and diverse language models highlight the potential of adaptive demonstration selection. This approach could significantly enhance machine learning's ability to generalize, offering more robust and flexible learning strategies.

Authors:  Xubin Wang, Jianfei Wu, Yichen Yuan, Mingzhe Li, Deyu Cai, Weijia Jia

Link:  https://arxiv.org/abs/2412.03966v1

Date: 2024-12-05

Summary:

Diversity in demonstration selection is crucial for enhancing model generalization, as it enables a broader coverage of structures and concepts. However, constructing an appropriate set of demonstrations has remained a focal point of research. This paper presents the Relevance-Diversity Enhanced Selection (RDES), an innovative approach that leverages reinforcement learning to optimize the selection of diverse reference demonstrations for text classification tasks using Large Language Models (LLMs), especially in few-shot prompting scenarios. RDES employs a Q-learning framework to dynamically identify demonstrations that maximize both diversity and relevance to the classification objective by calculating a diversity score based on label distribution among selected demonstrations. This method ensures a balanced representation of reference data, leading to improved classification accuracy. Through extensive experiments on four benchmark datasets and involving 12 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances classification accuracy compared to ten established baselines. Furthermore, we investigate the incorporation of Chain-of-Thought (CoT) reasoning in the reasoning process, which further enhances the model's predictive performance. The results underscore the potential of reinforcement learning to facilitate adaptive demonstration selection and deepen the understanding of classification challenges.
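
A toy version of the selection loop is sketched below, with made-up relevance scores and a simplified one-step Q-update rather than the RDES implementation: the agent picks k demonstrations and is rewarded for relevance plus the entropy gain of the selected label distribution.

# Simplified Q-learning sketch for diversity-aware demonstration selection.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=20)            # candidate demos with 3 class labels
relevance = rng.random(20)                      # stand-in relevance to the query
k, epsilon, alpha = 4, 0.2, 0.5
Q = {}                                          # Q[(state, action)] value table

def diversity(counts):
    p = counts / counts.sum() if counts.sum() else counts
    return -np.sum(p[p > 0] * np.log(p[p > 0]))  # label-distribution entropy

for episode in range(500):
    chosen, counts = [], np.zeros(3)
    for _ in range(k):
        state = tuple(counts.astype(int))
        pool = [a for a in range(20) if a not in chosen]
        if rng.random() < epsilon:
            a = int(rng.choice(pool))
        else:
            a = max(pool, key=lambda x: Q.get((state, x), 0.0))
        new_counts = counts.copy()
        new_counts[labels[a]] += 1
        reward = relevance[a] + diversity(new_counts) - diversity(counts)
        Q[(state, a)] = Q.get((state, a), 0.0) + alpha * (reward - Q.get((state, a), 0.0))
        chosen.append(a)
        counts = new_counts

print("selected demos:", chosen, "label mix:", counts.astype(int))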

--------------------------------------------------------------------------------------------------------

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

Human speech is characterized by nuanced prosodic variations that convey rich emotional and contextual information. DiffStyleTTS addresses the complex challenge of mapping text to diverse speech prosodies through an innovative acoustic model. Utilizing a conditional diffusion module and improved classifier-free guidance, the approach hierarchically models speech features and enables precise prosody control. By successfully outperforming baseline methods in naturalness and synthesis speed, this research opens new possibilities for more expressive and adaptable text-to-speech systems. The ability to adjust guidance intensity promises more natural, context-aware synthetic speech across various applications like assistive technologies and personalized digital interactions.

Authors:  Jiaxuan Liu, Zhaoci Liu, Yajun Hu, Yingying Gao, Shilei Zhang, Zhenhua Ling

Link:  https://arxiv.org/abs/2412.03388v1

Date: 2024-12-04

Summary:

Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance, which hierarchically models speech prosodic features, and controls different prosodic styles to guide prosody prediction. Experiments show that our method outperforms all baselines in naturalness and achieves superior synthesis speed compared to three diffusion-based baselines. Additionally, by adjusting the guiding scale, DiffStyleTTS effectively controls the guidance intensity of the synthetic prosody.
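
The guidance mechanism follows the standard classifier-free guidance recipe, sketched below with a toy denoiser (the formulation is generic, not DiffStyleTTS's exact module): the guiding scale s interpolates between unconditional and style-conditioned noise predictions at each diffusion step.

# Minimal classifier-free guidance sketch with a stand-in denoiser.
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x_t, style=None):
    # Stand-in for the conditional diffusion module's noise prediction.
    bias = 0.0 if style is None else 0.3 * style
    return 0.1 * x_t + bias

x_t = rng.standard_normal(8)          # noisy prosody features at step t
style = rng.standard_normal(8)        # prosodic style condition

for s in (0.0, 1.0, 3.0):             # guiding scale controls guidance intensity
    eps_uncond = denoiser(x_t)
    eps_cond = denoiser(x_t, style)
    eps = eps_uncond + s * (eps_cond - eps_uncond)
    print(f"scale {s:.1f}: mean |guided prediction| = {np.abs(eps).mean():.3f}")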

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.