Week Ending 9.15.2024

 

RESEARCH WATCH: 9.15.2024

 

Yes, Prime Minister, question order does matter -- and it's certainly not classical! But is it quantum?

This paper explores the intriguing intersection of quantum probability theory and cognitive behavior in the context of political polling. The authors investigate how leading questions can manipulate poll responses, a phenomenon that classical probability theory struggles to explain. By analyzing data from a poll inspired by the British political satire "Yes, Prime Minister," they demonstrate that quantum probability theory offers a potential framework for understanding these effects. This research has implications for survey design, political strategy, and our understanding of human decision-making processes. It highlights the complex nature of public opinion and the potential for quantum-like models to shed light on cognitive phenomena.

Authors:  Dorje C. Brody

Link:  https://arxiv.org/abs/2409.08930v1

Date: 2024-09-13

Summary:

Response to a poll can be manipulated by means of a series of leading questions. We show that such phenomena cannot be explained by use of classical probability theory, whereas quantum probability theory admits a possibility of offering an explanation. Admissible transformation rules in quantum probability, however, do impose some constraints on the modelling of cognitive behaviour, which are highlighted here. Focusing on a recent poll conducted by Ipsos on a set of questions posed by Sir Humphrey Appleby in an episode of the British political satire "Yes, Prime Minister," we show that the resulting data cannot be explained quite so simply using quantum rules, although it seems not impossible.
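
To make the order effect concrete, here is a minimal numpy sketch (not from the paper) of sequential "yes/no" questions modeled as non-commuting projectors: the probability of answering yes to both questions depends on the order in which they are asked, something a single classical joint distribution cannot reproduce.

```python
import numpy as np

# Two "yes/no" poll questions modeled as non-commuting projectors on a qubit.
ket0 = np.array([1.0, 0.0])
theta = np.pi / 5
ketb = np.array([np.cos(theta), np.sin(theta)])

P_A = np.outer(ket0, ket0)   # projector for "yes" to question A
P_B = np.outer(ketb, ketb)   # projector for "yes" to question B

# Initial opinion state (unit vector); the angle is arbitrary.
psi = np.array([np.cos(0.9), np.sin(0.9)])

# Probability of "yes" to A then "yes" to B under sequential (Lüders) measurement.
p_A_then_B = np.linalg.norm(P_B @ P_A @ psi) ** 2
# Probability of "yes" to B then "yes" to A.
p_B_then_A = np.linalg.norm(P_A @ P_B @ psi) ** 2

print(p_A_then_B, p_B_then_A)  # differ unless P_A and P_B commute
```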

--------------------------------------------------------------------------------------------------------

Synthetic Human Memories: AI-Edited Images and Videos Can Implant False Memories and Distort Recollection

As AI-powered image and video editing tools become increasingly sophisticated and accessible, this study examines their potential impact on human memory. The researchers conducted a pre-registered study with 200 participants to investigate how AI-altered visuals can create false memories or distort existing ones. Their findings reveal that AI-edited images and videos significantly increase false recollections, with AI-generated videos of AI-edited images having the strongest effect. This research has profound implications for fields such as psychology, law enforcement, and media studies. It also raises important ethical questions about the use of AI in content creation and its potential to manipulate human perception and memory.

Authors:  Pat Pataranutaporn, Chayapatr Archiwaranguprok, Samantha W. T. Chan, Elizabeth Loftus, Pattie Maes

Link:  https://arxiv.org/abs/2409.08895v1

Date: 2024-09-13

Summary:

AI is increasingly used to enhance images and videos, both intentionally and unintentionally. As AI editing tools become more integrated into smartphones, users can modify or animate photos into realistic videos. This study examines the impact of AI-altered visuals on false memories--recollections of events that didn't occur or deviate from reality. In a pre-registered study, 200 participants were divided into four conditions of 50 each. Participants viewed original images, completed a filler task, then saw stimuli corresponding to their assigned condition: unedited images, AI-edited images, AI-generated videos, or AI-generated videos of AI-edited images. AI-edited visuals significantly increased false recollections, with AI-generated videos of AI-edited images having the strongest effect (2.05x compared to control). Confidence in false memories was also highest for this condition (1.19x compared to control). We discuss potential applications in HCI, such as therapeutic memory reframing, and challenges in ethical, legal, political, and societal domains.

--------------------------------------------------------------------------------------------------------

Text-To-Speech Synthesis In The Wild

This paper introduces a novel approach to text-to-speech (TTS) synthesis using data collected "in the wild" rather than traditional studio-quality recordings. The researchers present the TTS In the Wild (TITW) dataset, derived from the VoxCeleb1 dataset, and propose two training sets: TITW-Hard and TITW-Easy. By demonstrating that recent TTS models can be trained successfully using this data, the study opens up new possibilities for more natural and diverse speech synthesis. This approach could lead to more realistic and adaptable voice assistants, improved accessibility tools, and enhanced speech recognition systems capable of handling a wide range of real-world audio conditions.

Authors:  Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe

Link:  https://arxiv.org/abs/2409.08711v1

Date: 2024-09-13

Summary:

Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms. The recent literature nonetheless shows efforts to train TTS systems using data collected in the wild. While this approach allows for the use of massive quantities of natural speech, until now there have been no common datasets. We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline applied, in this case, to the VoxCeleb1 dataset commonly used for speaker recognition. We further propose two training sets. TITW-Hard is derived from the transcription, segmentation, and selection of VoxCeleb1 source data. TITW-Easy is derived from the additional application of enhancement and additional data selection based on DNSMOS. We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard. Both the dataset and protocols are publicly available and support the benchmarking of TTS systems trained using TITW data.
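
As a rough illustration of the quality-based data-selection step (a sketch with assumed field names and an assumed threshold, not the actual TITW pipeline), segments of found speech can be kept or discarded according to a precomputed DNSMOS-style score:

```python
# Hypothetical segment records; in practice the scores would come from a
# DNSMOS-style non-intrusive quality predictor run over each segment.
segments = [
    {"utt_id": "spk1-0001", "text": "hello there",  "quality": 3.8},
    {"utt_id": "spk1-0002", "text": "noisy clip",   "quality": 2.1},
    {"utt_id": "spk2-0001", "text": "another take", "quality": 3.2},
]

QUALITY_THRESHOLD = 3.0  # assumed cutoff; the actual TITW-Easy criterion may differ

def select_easy_subset(segments, threshold=QUALITY_THRESHOLD):
    """Keep only segments whose predicted quality meets the threshold."""
    return [s for s in segments if s["quality"] >= threshold]

if __name__ == "__main__":
    easy = select_easy_subset(segments)
    print(f"kept {len(easy)} of {len(segments)} segments")
```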

--------------------------------------------------------------------------------------------------------

Hand-Object Interaction Pretraining from Videos

This research presents an innovative approach to learning robot manipulation skills from videos of human hand-object interactions. By using in-the-wild videos to generate 3D trajectories and retargeting human motions to robot actions, the researchers create a task-agnostic base policy for robotic manipulation. This method shows promise in improving sample efficiency, robustness, and generalizability in downstream robotic tasks. The approach could revolutionize robot learning by enabling robots to acquire a wide range of manipulation skills from readily available human demonstration videos, potentially accelerating the development of more capable and adaptable robotic systems for various industries and applications.

Authors:  Himanshu Gaurav Singh, Antonio Loquercio, Carmelo Sferrazza, Jane Wu, Haozhi Qi, Pieter Abbeel, Jitendra Malik

Link:  https://arxiv.org/abs/2409.08273v1

Date: 2024-09-12

Summary:

We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework to use in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object in a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy. This policy captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches. Qualitative experiments are available at: https://hgaurav2k.github.io/hop/

--------------------------------------------------------------------------------------------------------

What Makes a Maze Look Like a Maze?

This paper tackles the challenge of visual abstraction understanding in AI systems. The researchers introduce Deep Schema Grounding (DSG), a framework that uses structured representations of visual abstractions for grounding and reasoning. By leveraging large language models to extract schemas and vision-language models to ground these schemas onto images, DSG significantly improves abstract visual reasoning performance. This research has potential applications in advanced image understanding systems, visual reasoning AI assistants, and improved human-AI interaction. It could lead to more sophisticated computer vision systems capable of interpreting complex visual concepts across various domains.

Authors:  Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu

Link:  https://arxiv.org/abs/2409.08202v1

Date: 2024-09-12

Summary:

A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds the schema's components, from concrete to abstract, onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
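
To illustrate the schema idea, here is a hypothetical sketch (the names, structure, and grounding call are my own, not DSG's actual schemas or prompts) of an abstract concept decomposed into a dependency graph of components that are grounded in order, concrete components first:

```python
from dataclasses import dataclass, field

@dataclass
class SchemaNode:
    """One component of an abstract concept, e.g. the 'walls' or 'paths' of a maze."""
    name: str
    description: str
    depends_on: list = field(default_factory=list)  # names of prerequisite components

# A toy schema for "maze", in the spirit of a dependency-graph decomposition.
maze_schema = [
    SchemaNode("walls", "elongated objects forming barriers"),
    SchemaNode("paths", "open corridors between walls", depends_on=["walls"]),
    SchemaNode("maze", "walls and paths arranged as a navigable puzzle",
               depends_on=["walls", "paths"]),
]

def ground_schema(schema, ground_fn):
    """Ground components in dependency order, passing earlier groundings as context.

    `ground_fn(node, context)` stands in for a vision-language model query; here it
    is a placeholder supplied by the caller.
    """
    groundings = {}
    for node in schema:  # assumed to be topologically sorted
        context = {d: groundings[d] for d in node.depends_on}
        groundings[node.name] = ground_fn(node, context)
    return groundings

if __name__ == "__main__":
    # Dummy grounding function; a real system would query a VLM about the image.
    dummy = lambda node, ctx: f"region-for-{node.name} (given {list(ctx)})"
    print(ground_schema(maze_schema, dummy))
```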

--------------------------------------------------------------------------------------------------------

Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings

This study presents a novel approach to generating music that aligns with the emotions conveyed in visual art. The researchers developed a model that uses emotion labeling, image captioning, and language models to transform visual inputs into musical compositions. To address the lack of aligned art and music data, they created the Emotion Painting Music Dataset. This research has potential applications in enhancing accessibility for the visually impaired, creating immersive multimedia experiences, and developing new tools for art therapy and education. It represents a step towards more sophisticated AI-driven cross-modal creativity and emotional expression.

Authors:  Tanisha Hisariya, Huan Zhang, Jinhua Liang

Link:  https://arxiv.org/abs/2409.07827v1

Date: 2024-09-12

Summary:

Rapid advancements in artificial intelligence have significantly enhanced generative tasks involving music and images, employing both unimodal and multimodal approaches. This research develops a model capable of generating music that resonates with the emotions depicted in visual arts, integrating emotion labeling, image captioning, and language models to transform visual inputs into musical compositions. Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music Dataset, pairing paintings with corresponding music for effective training and evaluation. Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data. Performance is evaluated using metrics such as Fréchet Audio Distance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL divergence, with audio-emotion text similarity confirmed by the pre-trained CLAP model to demonstrate high alignment between generated music and text. This synthesis tool bridges visual art and music, enhancing accessibility for the visually impaired and opening avenues in educational and therapeutic applications by providing enriched multi-sensory experiences.

--------------------------------------------------------------------------------------------------------

Harnessing TI Feeds for Exploitation Detection

This paper addresses the challenge of automatically extracting actionable information from Threat Intelligence (TI) feeds to detect vulnerability exploitation. The researchers present a machine learning pipeline that uses embedding techniques and supervised learning to identify exploitation events across multiple TI feeds. This approach shows promise in accurately detecting exploitation events, even on previously unseen data sources. The research has significant implications for cybersecurity, potentially improving vulnerability risk assessment, threat detection, and incident response capabilities. It could lead to more efficient and effective security operations, particularly for organizations dealing with large volumes of threat intelligence data.

Authors:  Kajal Patel, Zubair Shafiq, Mateus Nogueira, Daniel Sadoc Menasché, Enrico Lovat, Taimur Kashif, Ashton Woiwood, Matheus Martins

Link:  https://arxiv.org/abs/2409.07709v1

Date: 2024-09-12

Summary:

Many organizations rely on Threat Intelligence (TI) feeds to assess the risk associated with security threats. Due to the volume and heterogeneity of data, it is prohibitive to manually analyze the threat information available in different loosely structured TI feeds. Thus, there is a need to develop automated methods to vet and extract actionable information from TI feeds. To this end, we present a machine learning pipeline to automatically detect vulnerability exploitation from TI feeds. We first model threat vocabulary in loosely structured TI feeds using state-of-the-art embedding techniques (Doc2Vec and BERT) and then use it to train a supervised machine learning classifier to detect exploitation of security vulnerabilities. We use our approach to identify exploitation events in 191 different TI feeds. Our longitudinal evaluation shows that it is able to accurately identify exploitation events from TI feeds only using past data for training and even on TI feeds withheld from training. Our proposed approach is useful for a variety of downstream tasks such as data-driven vulnerability risk assessment.
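
A minimal sketch of this kind of pipeline (illustrative only; the feed entries, labels, and hyperparameters are invented, not the authors' configuration) pairs Doc2Vec embeddings of feed text with a supervised classifier:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy TI feed entries with binary labels: 1 = describes exploitation, 0 = does not.
entries = [
    ("attackers actively exploiting cve-2024-1234 in the wild", 1),
    ("proof of concept released for remote code execution flaw", 1),
    ("vendor publishes advisory and patch for authentication bypass", 0),
    ("new phishing campaign observed targeting finance sector", 0),
]

# Train a small Doc2Vec model to embed the threat vocabulary.
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, (text, _) in enumerate(entries)]
d2v = Doc2Vec(tagged, vector_size=32, min_count=1, epochs=50, seed=0)

# Embed each entry and fit a supervised classifier on the embeddings.
X = [d2v.infer_vector(text.split()) for text, _ in entries]
y = [label for _, label in entries]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Score a previously unseen feed entry.
new_entry = "mass exploitation of unpatched vpn appliances reported"
print(clf.predict_proba([d2v.infer_vector(new_entry.split())]))
```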

--------------------------------------------------------------------------------------------------------

Violence detection in videos using deep recurrent and convolutional neural networks

This study proposes a deep learning architecture for violence detection in videos, combining recurrent neural networks (RNNs) and 2D convolutional neural networks (CNNs). By incorporating optical flow data alongside video frames, the system can effectively capture both spatial and temporal characteristics of violent scenes. The approach shows promising results across multiple databases, matching or surpassing state-of-the-art techniques. This research has potential applications in automated video surveillance, content moderation for social media platforms, and public safety monitoring systems. It could contribute to more efficient and accurate detection of violent incidents in various settings.

Authors:  Abdarahmane Traoré, Moulay A. Akhloufi

Link:  https://arxiv.org/abs/2409.07581v1

Date: 2024-09-11

Summary:

Violence and abnormal behavior detection research has attracted growing interest in recent years, due mainly to a rise in crime in large cities worldwide. In this work, we propose a deep learning architecture for violence detection which combines both recurrent neural networks (RNNs) and 2-dimensional convolutional neural networks (2D CNN). In addition to video frames, we use optical flow computed from the captured sequences. The CNN extracts spatial characteristics in each frame, while the RNN extracts temporal characteristics. The use of optical flow allows the movements in the scenes to be encoded. The proposed approaches reach the same level as state-of-the-art techniques and sometimes surpass them. The method was validated on three databases, achieving good results.
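
A compact PyTorch sketch of the general frame-wise CNN plus RNN pattern (my illustration; the paper's backbones, optical-flow stream, and hyperparameters differ):

```python
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    """Small 2D CNN that maps one RGB frame to a feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):                        # x: (B, 3, H, W)
        return self.fc(self.conv(x).flatten(1))

class ViolenceDetector(nn.Module):
    """Per-frame CNN features aggregated over time by an LSTM, then classified."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # violent / non-violent

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.rnn(feats)
        return self.head(h[-1])

if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 64, 64)           # 2 clips of 8 frames each
    print(ViolenceDetector()(clip).shape)         # torch.Size([2, 2])
```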

--------------------------------------------------------------------------------------------------------

Watts and Bots: The Energy Implications of AI Adoption

This paper examines the potential energy and environmental impacts of widespread AI adoption across industries. By combining economic activity data with estimates of AI adoption across occupations and industries, the researchers project increases in energy use and carbon dioxide emissions at both industry and aggregate levels for the US economy. While the estimated increases are relatively small (0.03% for energy use and 0.02% for CO2 emissions annually), this research provides valuable insights into the environmental implications of AI proliferation. It could inform policy decisions, guide sustainable AI development practices, and contribute to more comprehensive assessments of AI's societal impacts.

Authors:  Anthony Harding, Juan Moreno-Cruz

Link:  https://arxiv.org/abs/2409.06626v1

Date: 2024-09-10

Summary:

With the rapid expansion of Artificial Intelligence, there are expectations for a proportional expansion of economic activity due to increased productivity, and with it energy consumption and its associated environmental consequences like carbon dioxide emissions. Here, we combine data on economic activity, with early estimates of likely adoption of AI across occupations and industries, to estimate the increase in energy use and carbon dioxide emissions at the industry level and in aggregate for the US economy. At the industry level, energy use can increase between 0 and 12 PJ per year, while emissions increase between 47 tCO2 and 272 ktCO2. Aggregating across industries in the US economy, this totals an increase in energy consumption of 28 PJ per year, or around 0.03% of energy use per year in the US. We find this translates to an increase in carbon dioxide emissions of 896 ktCO2 per year, or around 0.02% of the CO2 emissions per year in the US.
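
As a quick consistency check using only the figures quoted above, the reported shares imply national baselines of roughly the expected magnitude:

```python
# Figures quoted in the abstract.
energy_increase_pj = 28        # PJ per year
energy_share = 0.0003          # ~0.03% of annual US energy use
co2_increase_kt = 896          # ktCO2 per year
co2_share = 0.0002             # ~0.02% of annual US CO2 emissions

# Implied national baselines (order-of-magnitude check, not data from the paper).
implied_us_energy_pj = energy_increase_pj / energy_share    # ~93,000 PJ per year
implied_us_co2_mt = (co2_increase_kt / 1000) / co2_share    # ~4,500 MtCO2 per year

print(f"implied US energy use: ~{implied_us_energy_pj:,.0f} PJ/year")
print(f"implied US CO2 emissions: ~{implied_us_co2_mt:,.0f} MtCO2/year")
```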

--------------------------------------------------------------------------------------------------------

Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models

This paper addresses the limitations in temporal reasoning capabilities of Large Audio Language Models (LALMs) for Audio Question Answering tasks. The researchers introduce a data augmentation technique for generating reliable audio temporal questions and answers, along with a continued finetuning curriculum learning strategy to improve temporal reasoning. They also develop an LLM-assisted metric to evaluate model responses. This research has potential applications in improving voice assistants, audio content analysis tools, and accessibility features for audio-based information retrieval. It could lead to more sophisticated audio AI systems capable of nuanced temporal understanding and reasoning across various domains.

Authors:  Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

Link:  https://arxiv.org/abs/2409.06223v2

Date: 2024-09-13

Summary:

The Audio Question Answering task includes audio event classification, audio captioning, and open-ended reasoning. Recently, Audio Question Answering has garnered attention due to the advent of Large Audio Language Models. Current literature focuses on constructing LALMs by integrating audio encoders with text-only Large Language Models through a projection module. While Large Audio Language Models excel in general audio understanding, they are limited in temporal reasoning, which may hinder their commercial applications and on-device deployment. This paper addresses these challenges and limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we propose a continued finetuning curriculum learning strategy to specialize in temporal reasoning without compromising performance on finetuned tasks. Finally, we develop a reliable and transparent automated metric, assisted by an LLM, to measure the correlation between Large Audio Language Model responses and ground truth data intelligently. We demonstrate the effectiveness of our proposed techniques using SOTA LALMs on public audio benchmark datasets.

--------------------------------------------------------------------------------------------------------

Towards Generalizable Scene Change Detection

This study introduces a Generalizable Scene Change Detection Framework (GeSCF) to address the limitations of current scene change detection methods, particularly their bias towards temporal order in training datasets and limited performance on unseen domains. The proposed framework leverages pre-trained foundation models and adaptive thresholding techniques to achieve better generalization across diverse environments. This research has potential applications in visual surveillance, mobile robotics, and autonomous systems that require robust scene change detection capabilities. It could lead to more reliable and adaptable computer vision systems for real-world applications in dynamic environments.

Authors:  Jaewoo Kim, Uehwan Kim

Link:  https://arxiv.org/abs/2409.06214v1

Date: 2024-09-10

Summary:

Scene Change Detection (SCD) is vital for applications such as visual surveillance and mobile robotics. However, current SCD methods exhibit a bias to the temporal order of training datasets and limited performance on unseen domains; conventional SCD benchmarks are not able to evaluate generalization or temporal consistency. To tackle these limitations, we introduce a Generalizable Scene Change Detection Framework (GeSCF) in this work. The proposed GeSCF leverages localized semantics of a foundation model without any re-training or fine-tuning -- for generalization over unseen domains. Specifically, we design an adaptive thresholding of the similarity distribution derived from facets of the pre-trained foundation model to generate an initial pseudo-change mask. We further utilize the Segment Anything Model's (SAM) class-agnostic masks to refine the pseudo-masks. Moreover, our proposed framework maintains commutative operations in all settings to ensure complete temporal consistency. Finally, we define new metrics, an evaluation dataset, and an evaluation protocol for Generalizable Scene Change Detection (GeSCD). Extensive experiments demonstrate that GeSCF excels across diverse and challenging environments -- establishing a new benchmark for SCD performance.
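
A rough numpy sketch of the adaptive-thresholding idea (my illustration; GeSCF derives its features from a pre-trained foundation model and refines the masks with SAM): compare per-pixel features from the two time steps, threshold the similarity adaptively, and keep the operation symmetric so the mask does not depend on temporal order.

```python
import numpy as np

def change_mask(feat_a, feat_b, k=1.0):
    """Pseudo-change mask from two (H, W, C) feature maps.

    Pixels whose cosine similarity falls more than k standard deviations below
    the mean similarity are flagged as changed. Cosine similarity is symmetric
    in its arguments, so change_mask(a, b) == change_mask(b, a).
    """
    a = feat_a / (np.linalg.norm(feat_a, axis=-1, keepdims=True) + 1e-8)
    b = feat_b / (np.linalg.norm(feat_b, axis=-1, keepdims=True) + 1e-8)
    sim = (a * b).sum(-1)                      # (H, W) cosine similarity
    threshold = sim.mean() - k * sim.std()     # adaptive, image-pair specific
    return sim < threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_t0 = rng.normal(size=(32, 32, 8))
    f_t1 = f_t0.copy()
    f_t1[10:20, 10:20] += rng.normal(scale=3.0, size=(10, 10, 8))  # simulated change
    mask = change_mask(f_t0, f_t1)
    print(mask.shape, int(mask.sum()), "changed pixels")
    assert np.array_equal(change_mask(f_t0, f_t1), change_mask(f_t1, f_t0))
```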

--------------------------------------------------------------------------------------------------------

What is the Role of Small Models in the LLM Era: A Survey

This survey paper examines the ongoing relevance and potential of small models in the era of Large Language Models (LLMs). By systematically analyzing the relationship between LLMs and small models from collaboration and competition perspectives, the authors provide insights into the practical applications and advantages of smaller, more resource-efficient models. This research is particularly relevant for academic researchers and businesses with limited computational resources. It could inform more efficient AI development strategies, promote sustainable AI practices, and encourage innovative approaches to combining the strengths of both large and small models in various applications.

Authors:  Lihu Chen, Gaël Varoquaux

Link:  https://arxiv.org/abs/2409.06857v2

Date: 2024-09-12

Summary:

Large Language Models (LLMs) have made significant progress in advancing artificial general intelligence (AGI), leading to the development of increasingly large models such as GPT-4 and LLaMA-405B. However, scaling up model sizes results in exponentially higher computational costs and energy consumption, making these models impractical for academic researchers and businesses with limited resources. At the same time, Small Models (SMs) are frequently used in practical settings, although their significance is currently underestimated. This raises important questions about the role of small models in the era of LLMs, a topic that has received limited attention in prior research. In this work, we systematically examine the relationship between LLMs and SMs from two key perspectives: Collaboration and Competition. We hope this survey provides valuable insights for practitioners, fostering a deeper understanding of the contribution of small models and promoting more efficient use of computational resources. The code is available at https://github.com/tigerchen52/role_of_small_models

--------------------------------------------------------------------------------------------------------

Seeing Through the Mask: Rethinking Adversarial Examples for CAPTCHAs

This paper challenges the effectiveness of current CAPTCHA systems in the face of advanced image recognition models. The researchers demonstrate that by applying masks of various intensities to images, they can significantly reduce the accuracy of state-of-the-art image classifiers, including supposedly robust models like vision transformers. This research has implications for cybersecurity, particularly in the development of more effective human verification systems. It could lead to the creation of more robust CAPTCHAs and contribute to ongoing efforts to distinguish between human and machine interactions in online environments.

Authors:  Yahya Jabary, Andreas Plesner, Turlan Kuzhagaliyev, Roger Wattenhofer

Link:  https://arxiv.org/abs/2409.05558v1

Date: 2024-09-09

Summary:

Modern CAPTCHAs rely heavily on vision tasks that are supposedly hard for computers but easy for humans. However, advances in image recognition models pose a significant threat to such CAPTCHAs. These models can easily be fooled by generating some well-hidden "random" noise and adding it to the image, or hiding objects in the image. However, these methods are model-specific and thus cannot aid CAPTCHAs in fooling all models. We show in this work that by allowing for more significant changes to the images while preserving the semantic information and keeping it solvable by humans, we can fool many state-of-the-art models. Specifically, we demonstrate that by adding masks of various intensities the Accuracy @ 1 (Acc@1) drops by more than 50%-points for all models, and supposedly robust models such as vision transformers see an Acc@1 drop of 80%-points. These masks can therefore effectively fool modern image classifiers, thus showing that machines have not caught up with humans -- yet.
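
As an illustration of the kind of probe involved (an off-the-shelf classifier and a crude occlusion mask, not the authors' masks or models), one can occlude part of an image and compare the classifier's top-1 prediction before and after:

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights
from torchvision.transforms.functional import normalize

# Pretrained ImageNet classifier as a stand-in for the models attacked in the paper
# (downloading the weights requires network access).
model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def top1(img):
    """Top-1 class index and confidence for a (3, 224, 224) image in [0, 1]."""
    x = normalize(img, IMAGENET_MEAN, IMAGENET_STD).unsqueeze(0)
    with torch.no_grad():
        probs = model(x).softmax(-1)
    conf, idx = probs.max(-1)
    return idx.item(), conf.item()

# Random image as a placeholder; in practice this would be a CAPTCHA photo.
img = torch.rand(3, 224, 224)

# Occlude a block of pixels -- a crude stand-in for the paper's masks.
masked = img.clone()
masked[:, 60:160, 60:160] = 0.0

print("clean :", top1(img))
print("masked:", top1(masked))
```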

--------------------------------------------------------------------------------------------------------

Input-to-State Stable Coupled Oscillator Networks for Closed-form Model-based Control in Latent Space

This paper introduces a novel Coupled Oscillator Network (CON) model for latent-space control of physical systems. The CON model addresses key limitations of existing latent-space models by preserving the mathematical structure of physical systems, maintaining stability properties, and providing an invertible mapping between input and latent-space forcing. This research has potential applications in robotics, particularly in the control of complex nonlinear systems and soft robots. It could lead to more efficient and effective control strategies for a wide range of mechanical systems, improving performance in areas such as manufacturing, healthcare robotics, and autonomous systems.

Authors:  Maximilian Stölzle, Cosimo Della Santina

Link:  https://arxiv.org/abs/2409.08439v1

Date: 2024-09-13

Summary:

Even though a variety of methods (e.g., RL, MPC, LQR) have been proposed in the literature, efficient and effective latent-space control of physical systems remains an open challenge. A promising avenue would be to leverage powerful and well-understood closed-form strategies from control theory literature in combination with learned dynamics, such as potential-energy shaping. We identify three fundamental shortcomings in existing latent-space models that have so far prevented this powerful combination: (i) they lack the mathematical structure of a physical system, (ii) they do not inherently conserve the stability properties of the real systems. Furthermore, (iii) these methods do not have an invertible mapping between input and latent-space forcing. This work proposes a novel Coupled Oscillator Network (CON) model that simultaneously tackles all these issues. More specifically, (i) we show analytically that CON is a Lagrangian system - i.e., it possesses well-defined potential and kinetic energy terms. Then, (ii) we provide formal proof of global Input-to-State stability using Lyapunov arguments. Moving to the experimental side, (iii) we demonstrate that CON reaches SoA performance when learning complex nonlinear dynamics of mechanical systems directly from images. An additional methodological innovation contributing to achieving this third goal is an approximated closed-form solution for efficient integration of network dynamics, which eases efficient training. We tackle (iv) by approximating the forcing-to-input mapping with a decoder that is trained to reconstruct the input based on the encoded latent space force. Finally, we leverage these four properties and show that they enable latent-space control. We use an integral-saturated PID with potential force compensation and demonstrate high-quality performance on a soft robot using raw pixels as the only feedback information.
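
To give a feel for the model class, here is a generic damped, coupled-oscillator simulation (a sketch only, not the paper's CON parameterization or its approximate closed-form integrator):

```python
import numpy as np

def simulate_coupled_oscillators(x0, v0, K, D, u, dt=1e-3, steps=20000):
    """Integrate  x'' + D x' + K x = u  (unit masses) with semi-implicit Euler.

    K couples the oscillators through off-diagonal stiffness terms; with K and D
    positive definite the unforced system settles toward equilibrium.
    """
    x, v = x0.copy(), v0.copy()
    traj = [x.copy()]
    for _ in range(steps):
        a = u - K @ x - D @ v          # acceleration from forcing, stiffness, damping
        v = v + dt * a
        x = x + dt * v
        traj.append(x.copy())
    return np.array(traj)

if __name__ == "__main__":
    K = np.array([[2.0, -0.5],
                  [-0.5, 2.0]])        # symmetric positive-definite coupling
    D = 0.3 * np.eye(2)
    x0 = np.array([1.0, -1.0])
    v0 = np.zeros(2)
    u = np.zeros(2)
    traj = simulate_coupled_oscillators(x0, v0, K, D, u)
    print("final state:", traj[-1])    # decays toward the origin
```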

--------------------------------------------------------------------------------------------------------

Module-wise Adaptive Adversarial Training for End-to-end Autonomous Driving

This study presents a novel approach to enhancing the robustness of end-to-end autonomous driving models against adversarial attacks. The proposed Module-wise Adaptive Adversarial Training (MA2T) method introduces targeted noise injection and dynamic weight adaptation to improve model resilience across various modules. This research has significant implications for the safety and reliability of autonomous vehicles, potentially leading to more robust and secure self-driving systems. It could contribute to the development of adversarial defense strategies for complex AI systems in safety-critical applications beyond autonomous driving.

Authors:  Tianyuan Zhang, Lu Wang, Jiaqi Kang, Xinwei Zhang, Siyuan Liang, Yuwei Chen, Aishan Liu, Xianglong Liu

Link:  https://arxiv.org/abs/2409.07321v1

Date: 2024-09-11

Summary:

Recent advances in deep learning have markedly improved autonomous driving (AD) models, particularly end-to-end systems that integrate perception, prediction, and planning stages, achieving state-of-the-art performance. However, these models remain vulnerable to adversarial attacks, where human-imperceptible perturbations can disrupt decision-making processes. While adversarial training is an effective method for enhancing model robustness against such attacks, no prior studies have focused on its application to end-to-end AD models. In this paper, we take the first step in adversarial training for end-to-end AD models and present a novel Module-wise Adaptive Adversarial Training (MA2T). However, extending conventional adversarial training to this context is highly non-trivial, as different stages within the model have distinct objectives and are strongly interconnected. To address these challenges, MA2T first introduces Module-wise Noise Injection, which injects noise before the input of different modules, targeting training models with the guidance of overall objectives rather than each independent module loss. Additionally, we introduce Dynamic Weight Accumulation Adaptation, which incorporates accumulated weight changes to adaptively learn and adjust the loss weights of each module based on their contributions (accumulated reduction rates) for better balance and robust training. To demonstrate the efficacy of our defense, we conduct extensive experiments on the widely-used nuScenes dataset across several end-to-end AD models under both white-box and black-box attacks, where our method outperforms other baselines by large margins (+5-10%). Moreover, we validate the robustness of our defense through closed-loop evaluation in the CARLA simulation environment, showing improved resilience even against natural corruption.
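
The module-wise injection mechanism can be sketched with PyTorch forward pre-hooks (a simplified illustration; MA2T optimizes the injected noise adversarially against the overall objective and adds dynamic loss-weight adaptation, both omitted here):

```python
import torch
import torch.nn as nn

# Stand-in for a staged driving model: "perception" -> "prediction" -> "planning".
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),   # perception
    nn.Linear(32, 32), nn.ReLU(),   # prediction
    nn.Linear(32, 4),               # planning
)

def make_noise_hook(std):
    """Return a forward pre-hook that perturbs a module's input during training."""
    def hook(module, inputs):
        (x,) = inputs
        return (x + std * torch.randn_like(x),)
    return hook

# Inject noise before the inputs of selected modules (indices are illustrative).
handles = [
    model[0].register_forward_pre_hook(make_noise_hook(0.05)),
    model[2].register_forward_pre_hook(make_noise_hook(0.05)),
    model[4].register_forward_pre_hook(make_noise_hook(0.05)),
]

x = torch.randn(8, 16)
print(model(x).shape)                # torch.Size([8, 4])

for h in handles:                    # remove hooks, e.g. for clean evaluation
    h.remove()
```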

--------------------------------------------------------------------------------------------------------

DeepFM-Crispr: Prediction of CRISPR On-Target Effects via Deep Learning

This paper introduces DeepFM-Crispr, a novel deep learning model for predicting the on-target efficiency and evaluating off-target effects of CRISPR-Cas13d, an RNA-targeting gene editing system. By leveraging large language models and transformer-based architectures, the model aims to enhance predictions of RNA secondary structures and overall sgRNA efficacy. This research has potential applications in gene therapy, drug development, and biotechnology. It could lead to more precise and efficient gene editing techniques, accelerating research in areas such as disease treatment and crop improvement.

Authors:  Condy Bao, Fuxiao Liu

Link:  https://arxiv.org/abs/2409.05938v1

Date: 2024-09-09

Summary:

Since the advent of CRISPR-Cas9, a groundbreaking gene-editing technology that enables precise genomic modifications via a short RNA guide sequence, there has been a marked increase in the accessibility and application of this technology across various fields. The success of CRISPR-Cas9 has spurred further investment and led to the discovery of additional CRISPR systems, including CRISPR-Cas13. Distinct from Cas9, which targets DNA, Cas13 targets RNA, offering unique advantages for gene modulation. We focus on Cas13d, a variant known for its collateral activity where it non-specifically cleaves adjacent RNA molecules upon activation, a feature critical to its function. We introduce DeepFM-Crispr, a novel deep learning model developed to predict the on-target efficiency and evaluate the off-target effects of Cas13d. This model harnesses a large language model to generate comprehensive representations rich in evolutionary and structural data, thereby enhancing predictions of RNA secondary structures and overall sgRNA efficacy. A transformer-based architecture processes these inputs to produce a predictive efficacy score. Comparative experiments show that DeepFM-Crispr not only surpasses traditional models but also outperforms recent state-of-the-art deep learning methods in terms of prediction accuracy and reliability.
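
A generic sketch of the prediction stage (a plain transformer encoder over sgRNA tokens producing an efficacy score; DeepFM-Crispr additionally uses representations from an RNA language model and secondary-structure information, which are omitted here):

```python
import torch
import torch.nn as nn

class EfficacyPredictor(nn.Module):
    """Transformer encoder over sgRNA token embeddings -> scalar efficacy score."""
    def __init__(self, vocab_size=5, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)        # A, C, G, U, pad
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, tokens):                               # tokens: (B, L) ints
        h = self.encoder(self.emb(tokens))
        return self.head(h.mean(dim=1)).squeeze(-1)          # pooled -> (B,)

if __name__ == "__main__":
    batch = torch.randint(0, 4, (8, 22))                     # 8 guides of length 22
    print(EfficacyPredictor()(batch).shape)                  # torch.Size([8])
```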

--------------------------------------------------------------------------------------------------------

A Bayesian framework for active object recognition, pose estimation and shape transfer learning through touch

This study presents a unified Bayesian framework combining particle filters and Gaussian process implicit surfaces for tactile object recognition, pose estimation, and shape reconstruction. The framework can differentiate between known and novel objects, transfer knowledge about known shapes to learn new ones, and guide active data acquisition. This research has potential applications in robotics, particularly in unstructured environments where robots need to interact with both familiar and unfamiliar objects. It could lead to more adaptable and efficient robotic systems for tasks such as manufacturing, household assistance, and exploration in unknown environments.

Authors:  Haodong Zheng, Andrei Jalba, Raymond H. Cuijpers, Wijnand IJsselsteijn, Sanne Schoenmakers

Link:  https://arxiv.org/abs/2409.06912v2

Date: 2024-09-13

Summary:

As humans can explore and understand the world through the sense of touch, tactile sensing is also an important aspect of robotic perception. In unstructured environments, robots can encounter both known and novel objects, which calls for a method that can address both. In this study, we combine a particle filter (PF) and Gaussian process implicit surface (GPIS) in a unified Bayesian framework. The framework can differentiate between known and novel objects, perform object recognition, estimate pose for known objects, and reconstruct shapes for unknown objects, in an active learning fashion. By grounding the selection of the GPIS prior with the maximum-likelihood-estimation (MLE) shape from the PF, knowledge about known objects' shapes can be transferred to learn novel shapes. An exploration procedure with global shape estimation is proposed to guide active data acquisition and conclude the exploration when sufficient information is obtained. The performance of the proposed Bayesian framework is evaluated through simulations on known and novel objects, initialized with random poses. The results show that the proposed exploration procedure, utilizing global shape estimation, achieves faster exploration than a local exploration procedure based on rapidly exploring random trees (RRT). Overall, our results indicate that the proposed framework is effective and efficient in object recognition, pose estimation and shape reconstruction. Moreover, we show that a learned shape can be included as a new prior and used effectively for future object recognition and pose estimation.
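
A minimal particle-filter sketch for the pose-estimation part (a toy 2D example with a known circular shape; the paper's framework couples the filter with a GPIS shape model and active exploration, neither of which appears here):

```python
import numpy as np

def particle_filter_pose(contacts, n_particles=500, radius=1.0, noise=0.05, seed=0):
    """Estimate a circular object's 2D centre from tactile contact points.

    Particles are candidate centres; each is weighted by how well the contact
    points lie on a circle of known radius around it, then resampled.
    """
    rng = np.random.default_rng(seed)
    particles = rng.uniform(-2, 2, size=(n_particles, 2))    # candidate centres
    for p in contacts:
        dists = np.linalg.norm(particles - p, axis=1)         # centre-to-contact distance
        weights = np.exp(-0.5 * ((dists - radius) / noise) ** 2)
        weights /= weights.sum()
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx] + rng.normal(scale=0.01, size=(n_particles, 2))
    return particles.mean(axis=0)

if __name__ == "__main__":
    true_centre = np.array([0.4, -0.3])
    angles = np.linspace(0, 2 * np.pi, 12, endpoint=False)
    contacts = true_centre + np.stack([np.cos(angles), np.sin(angles)], axis=1)
    print("estimated centre:", particle_filter_pose(contacts))
```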

--------------------------------------------------------------------------------------------------------

Autonomous Vehicle Controllers From End-to-End Differentiable Simulation

This paper introduces an analytic policy gradients (APG) approach to training autonomous vehicle controllers using a differentiable simulator. By incorporating the simulator into an end-to-end training loop, the method allows for more efficient and grounded policy learning. The research demonstrates improved performance and robustness compared to behavioral cloning methods. This approach has significant implications for the development of autonomous driving systems, potentially leading to more reliable and adaptable self-driving vehicles. It could accelerate the deployment of autonomous vehicles in real-world scenarios by improving their ability to handle diverse and complex driving situations.

Authors:  Asen Nachkov, Danda Pani Paudel, Luc Van Gool

Link:  https://arxiv.org/abs/2409.07965v1

Date: 2024-09-12

Summary:

Current methods to learn controllers for autonomous vehicles (AVs) focus on behavioural cloning. Being trained only on exact historic data, the resulting agents often generalize poorly to novel scenarios. Simulators provide the opportunity to go beyond offline datasets, but they are still treated as complicated black boxes, only used to update the global simulation state. As a result, these RL algorithms are slow, sample-inefficient, and prior-agnostic. In this work, we leverage a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers on the large-scale Waymo Open Motion Dataset. Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of the environment dynamics serve as a useful prior to help the agent learn a more grounded policy. We combine this setup with a recurrent architecture that can efficiently propagate temporal information across long simulated trajectories. This APG method allows us to learn robust, accurate, and fast policies, while only requiring widely-available expert trajectories, instead of scarce expert actions. We compare to behavioural cloning and find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
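
A toy PyTorch example of the core idea, backpropagating through a differentiable dynamics model to obtain analytic policy gradients (a 1D point mass, not the Waymo-scale simulator used in the paper):

```python
import torch
import torch.nn as nn

# Differentiable "simulator": a 1D point mass with position/velocity state.
def step(state, action, dt=0.1):
    pos, vel = state[..., 0], state[..., 1]
    vel = vel + dt * action.squeeze(-1)
    pos = pos + dt * vel
    return torch.stack([pos, vel], dim=-1)

policy = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
optim = torch.optim.Adam(policy.parameters(), lr=1e-2)
target = torch.tensor([1.0, 0.0])           # reach position 1 with zero velocity

for it in range(200):
    state = torch.zeros(16, 2)               # batch of rollouts starting from rest
    loss = 0.0
    for _ in range(20):                       # unrolled horizon
        action = policy(state)
        state = step(state, action)           # gradients flow through the dynamics
        loss = loss + ((state - target) ** 2).sum(-1).mean()
    optim.zero_grad()
    loss.backward()                           # analytic gradient through the rollout
    optim.step()

print("final mean state:", state.mean(0).detach())
```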

--------------------------------------------------------------------------------------------------------

A framework for measuring the training efficiency of a neural architecture

This paper presents an experimental framework for measuring the training efficiency of neural network architectures. By analyzing Convolutional Neural Networks and Bayesian equivalents on standard image classification tasks, the researchers provide insights into how training efficiency varies across different stopping criteria, model sizes, and learning tasks. This research has implications for the design and optimization of neural network architectures, potentially leading to more efficient and effective AI systems. It could inform decisions in model selection, hyperparameter tuning, and resource allocation in AI development projects across various domains.

Authors:  Eduardo Cueto-Mendoza, John D. Kelleher

Link:  https://arxiv.org/abs/2409.07925v1

Date: 2024-09-12

Summary:

Measuring efficiency in neural network system development is an open research problem. This paper presents an experimental framework to measure the training efficiency of a neural architecture. To demonstrate our approach, we analyze the training efficiency of Convolutional Neural Networks and Bayesian equivalents on the MNIST and CIFAR-10 tasks. Our results show that training efficiency decays as training progresses and varies across different stopping criteria for a given neural model and learning task. We also find a non-linear relationship between training stopping criteria, model size, and training efficiency. Furthermore, we illustrate the potential confounding effects of overtraining on measuring the training efficiency of a neural architecture. Regarding relative training efficiency across different architectures, our results indicate that CNNs are more efficient than BCNNs on both datasets. More generally, as a learning task becomes more complex, the relative difference in training efficiency between different architectures becomes more pronounced.
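
One simple way to operationalize such a measurement (a sketch of the general idea; the paper defines its efficiency measure and stopping criteria more carefully) is to track performance gained per unit of cumulative training cost:

```python
def training_efficiency(accuracy_per_epoch, cost_per_epoch):
    """Cumulative accuracy per unit of cumulative training cost, epoch by epoch.

    `cost_per_epoch` can be wall-clock seconds, energy, or FLOPs; the curve
    typically decays as training progresses and depends on when you stop.
    """
    curve = []
    total_cost = 0.0
    for acc, cost in zip(accuracy_per_epoch, cost_per_epoch):
        total_cost += cost
        curve.append(acc / total_cost)
    return curve

# Illustrative numbers (not measurements from the paper).
acc = [0.60, 0.72, 0.78, 0.81, 0.82, 0.825]
cost = [10.0] * len(acc)   # e.g. seconds per epoch
print([round(e, 4) for e in training_efficiency(acc, cost)])
```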

--------------------------------------------------------------------------------------------------------

An End-to-End Approach for Chord-Conditioned Song Generation

This study introduces a novel approach to song generation that incorporates chord information to improve musical performance and control precision. The proposed Chord-Conditioned Song Generator (CSG) uses a robust cross-attention mechanism to integrate extracted chord information into the generation process. This research has potential applications in music production, AI-assisted composition, and interactive music creation tools. It could lead to more sophisticated and musically coherent AI-generated songs, opening up new possibilities for creative collaboration between humans and AI in the music industry.

Authors:  Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu

Link:  https://arxiv.org/abs/2409.06307v1

Date: 2024-09-10

Summary:

The Song Generation task aims to synthesize music composed of vocals and accompaniment from given lyrics. While the existing method, Jukebox, has explored this task, its constrained control over the generations often leads to deficiencies in musical performance. To mitigate the issue, we introduce an important concept from music composition, namely chords, to song generation networks. Chords form the foundation of accompaniment and provide the vocal melody with associated harmony. Given the inaccuracy of automatic chord extractors, we devise a robust cross-attention mechanism augmented with a dynamic weight sequence to integrate extracted chord information into song generation and reduce frame-level flaws, and propose a novel model termed the Chord-Conditioned Song Generator (CSG) based on it. Experimental evidence demonstrates that our proposed method outperforms other approaches in terms of musical performance and control precision of generated songs.
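
The conditioning mechanism can be sketched in PyTorch as cross-attention from song frames to chord embeddings (an illustration of the mechanism only; CSG's dynamic weight sequence and full architecture are not reproduced here):

```python
import torch
import torch.nn as nn

class ChordCrossAttention(nn.Module):
    """Let song-frame representations attend to a sequence of chord embeddings."""
    def __init__(self, d_model=256, n_chords=24, n_heads=4):
        super().__init__()
        self.chord_emb = nn.Embedding(n_chords, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frames, chord_ids):
        # frames: (B, T_frames, d_model); chord_ids: (B, T_chords) integer chord labels
        chords = self.chord_emb(chord_ids)
        attended, _ = self.attn(query=frames, key=chords, value=chords)
        return self.norm(frames + attended)       # residual connection

if __name__ == "__main__":
    layer = ChordCrossAttention()
    frames = torch.randn(2, 100, 256)             # 100 audio frames per song
    chord_ids = torch.randint(0, 24, (2, 25))     # 25 chord tokens per song
    print(layer(frames, chord_ids).shape)         # torch.Size([2, 100, 256])
```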

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.