Eye On AI

Week Ending 6.30.2024

RESEARCH WATCH: 6.30.2024

EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models

This paper addresses the growing need for accessible and efficient chronic disease prediction tools. By leveraging five years of Electronic Health Records from Taiwan and utilizing Large Language Multimodal Models, the researchers have developed a platform that can predict chronic disease risk from clinical notes and blood test values. This innovative approach integrates with front-end web and mobile applications, allowing for real-time risk assessment. The platform's potential applications include empowering patients with self-monitoring tools, assisting healthcare providers in early intervention, and improving overall public health outcomes by enabling more proactive and personalized healthcare strategies.

Authors:  Chun-Chieh Liao, Wei-Ting Kuo, I-Hsuan Hu, Yen-Chen Shih, Jun-En Ding, Feng Liu, Fang-Ming Hung

Link:  https://arxiv.org/abs/2406.18087v1

Date: 2024-06-26

Summary:

Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, little research has focused on predicting chronic disease risk and building application systems from clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan's hospital database between 2017 and 2021 as an AI database. Furthermore, we developed an EHR-based chronic disease prediction platform utilizing Large Language Multimodal Models (LLMMs), successfully integrating with frontend web and mobile applications for prediction. This prediction platform can also connect to the hospital's backend database, providing physicians with real-time risk assessment diagnostics. The demonstration link can be found at https://www.youtube.com/watch?v=oqmL9DEDFgA.

--------------------------------------------------------------------------------------------------------

Few-Shot Medical Image Segmentation with High-Fidelity Prototypes

In the realm of medical imaging, accurate segmentation is crucial for diagnosis and treatment planning. This paper tackles the challenge of Few-shot Semantic Segmentation in complex medical images, where traditional models often fall short. The proposed Detail Self-refined Prototype Network (DSPNet) offers a novel approach to constructing high-fidelity prototypes that better represent both foreground objects and complex backgrounds. This advancement has significant implications for medical image analysis, potentially improving diagnostic accuracy, treatment planning, and research in fields such as radiology and pathology. The model's ability to perform well with limited labeled data could accelerate the adoption of AI in various medical imaging applications.

Authors:  Song Tang, Shaxu Yan, Xiaozhi Qi, Jianxin Gao, Mao Ye, Jianwei Zhang, Xiatian Zhu

Link:  https://arxiv.org/abs/2406.18074v1

Date: 2024-06-26

Summary:

Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Although prototype-based approaches have achieved substantial success, existing models are limited to imaging scenarios with clearly distinct objects and backgrounds that are not highly complex, e.g., natural images. This makes such models suboptimal for medical imaging, where neither condition holds. To address this problem, we propose a novel Detail Self-refined Prototype Network (DSPNet) to construct high-fidelity prototypes that represent the object foreground and the background more comprehensively. Specifically, to construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner. Considering that the background often has no apparent semantic relation in the spatial dimensions, we integrate channel-specific structural information under sparse channel-aware regulation. Extensive experiments on three challenging medical image benchmarks show the superiority of DSPNet over previous state-of-the-art methods.
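
For readers new to prototype-based few-shot segmentation, the sketch below shows the standard masked-average-pooling baseline that prototype networks such as DSPNet build on: a single prototype per class, matched to query pixels by cosine similarity. It is a minimal illustration under assumed tensor shapes, not the paper's detail self-refined prototypes or its channel-aware background handling.

```python
import torch
import torch.nn.functional as F

def masked_average_prototype(support_feats, support_mask):
    """Baseline class prototype: mask-weighted average of support features.
    support_feats: (C, H, W) feature map; support_mask: (H, W) binary mask."""
    mask = support_mask.unsqueeze(0)                                   # (1, H, W)
    proto = (support_feats * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1)
    return proto                                                       # (C,)

def segment_query(query_feats, prototype, threshold=0.5):
    """Label each query pixel by cosine similarity to the prototype."""
    sim = F.cosine_similarity(query_feats, prototype[:, None, None], dim=0)
    return (sim > threshold).float()                                   # (H, W) mask
```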

--------------------------------------------------------------------------------------------------------

Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents

This research addresses a critical challenge in deep reinforcement learning (DRL): achieving robustness without sacrificing performance. The authors introduce S-DQN and S-PPO, novel algorithms that significantly improve the effectiveness of smoothed robust DRL agents. These advancements could have far-reaching implications in fields requiring reliable AI decision-making, such as autonomous vehicles, robotics, and financial trading systems. By enhancing both clean rewards and robustness, these algorithms pave the way for more trustworthy and efficient AI systems in real-world applications where safety and reliability are paramount.

Authors:  Chung-En Sun, Sicun Gao, Tsui-Wei Weng

Link:  https://arxiv.org/abs/2406.18062v1

Date: 2024-06-26

Summary:

Robustness remains a paramount concern in deep reinforcement learning (DRL), with randomized smoothing emerging as a key technique for enhancing this attribute. However, a notable gap exists in the performance of current smoothed DRL agents, often characterized by significantly low clean rewards and weak robustness. In response to this challenge, our study introduces innovative algorithms aimed at training effective smoothed robust DRL agents. We propose S-DQN and S-PPO, novel approaches that demonstrate remarkable improvements in clean rewards, empirical robustness, and robustness guarantee across standard RL benchmarks. Notably, our S-DQN and S-PPO agents not only significantly outperform existing smoothed agents by an average factor of 2.16× under the strongest attack, but also surpass previous robustly-trained agents by an average factor of 2.13×. This represents a significant leap forward in the field. Furthermore, we introduce Smoothed Attack, which is 1.89× more effective in decreasing the rewards of smoothed agents than existing adversarial attacks.
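
As background, randomized smoothing for RL agents is usually realized by averaging value estimates over noise-perturbed observations at decision time. The snippet below is a minimal sketch of that generic recipe, assuming a q_network callable that maps one observation to per-action values; it is not the paper's S-DQN/S-PPO training procedure or its robustness certificates.

```python
import numpy as np

def smoothed_q_values(q_network, state, sigma=0.1, n_samples=100, seed=0):
    """Monte-Carlo estimate of a smoothed Q-function: average the Q-values of
    Gaussian-perturbed copies of the observation."""
    rng = np.random.default_rng(seed)
    noisy = state + sigma * rng.standard_normal((n_samples,) + state.shape)
    q = np.stack([q_network(s) for s in noisy])   # (n_samples, n_actions)
    return q.mean(axis=0)

def smoothed_action(q_network, state, sigma=0.1, n_samples=100):
    """Act greedily with respect to the smoothed Q-values."""
    return int(np.argmax(smoothed_q_values(q_network, state, sigma, n_samples)))
```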

--------------------------------------------------------------------------------------------------------

UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos

As video content continues to proliferate, efficient summarization techniques become increasingly important. This paper introduces a comprehensive approach to video summarization that combines both visual and textual modalities. The proposed UBiSS framework and the accompanying BIDS dataset represent a significant step forward in multimodal video understanding. Potential applications include improved video search engines, content recommendation systems, and automated video editing tools. This research could transform how we interact with and extract information from video content across various domains, from entertainment and education to surveillance and scientific research.

Authors:  Yuting Mei, Linli Yao, Qin Jin

Link:  https://arxiv.org/abs/2406.16301v1

Date: 2024-06-24

Summary:

With the surge in the amount of video data, video summarization techniques, including visual-modal (VM) and textual-modal (TM) summarization, are attracting more and more attention. However, unimodal summarization inevitably loses the rich semantics of the video. In this paper, we focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV). Specifically, we first construct a large-scale dataset, BIDS, in (video, VM-Summary, TM-Summary) triplet format. Unlike traditional processing methods, our construction procedure contains a VM-Summary extraction algorithm aiming to preserve the most salient content within long videos. Based on BIDS, we propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously. We further optimize our model with a list-wise ranking-based objective to improve its capacity to capture highlights. Lastly, we propose a metric, NDCG_MS, to provide a joint evaluation of the bimodal summary. Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines. Code and data are available at https://github.com/MeiYutingg/UBiSS.
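
The abstract does not spell out NDCG_MS, so the sketch below only shows the standard NDCG building block that a joint bimodal metric would presumably extend; the function and its use here are assumptions, not the authors' definition.

```python
import numpy as np

def ndcg(relevances, k=None):
    """Standard NDCG@k over graded relevances listed in predicted rank order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    if ideal.sum() == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    return float((rel * discounts).sum() / (ideal * discounts).sum())

print(ndcg([3, 2, 0, 1], k=3))  # e.g. graded relevance of the top-3 selected clips
```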

--------------------------------------------------------------------------------------------------------

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

This paper addresses a significant gap in the capabilities of current multimodal large language models (MLLMs): understanding webpage screenshots and generating corresponding HTML code. The Web2Code benchmark and dataset provide a valuable resource for training and evaluating MLLMs in this domain. Potential applications include automated web development tools, accessibility improvements for visually impaired users, and enhanced web scraping capabilities. This research could accelerate the development of AI-assisted web design and coding tools, potentially revolutionizing how websites are created and maintained.

Authors:  Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen

Link:  https://arxiv.org/abs/2406.20098v1

Date: 2024-06-28

Summary:

Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.
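
From the abstract, each instruction-tuning example pairs a rendered webpage image and an instruction with the page's HTML, plus auxiliary QA pairs about the page content. The record below is a hypothetical shape for such an example; the field names and values are illustrative, not the released schema.

```python
# Hypothetical Web2Code-style training record (field names are assumptions).
example = {
    "image": "renders/page_000123.png",             # screenshot of the rendered webpage
    "instruction": "Generate the HTML code that reproduces this webpage.",
    "response": "<!DOCTYPE html><html>...</html>",  # ground-truth HTML (truncated here)
    "qa_pairs": [                                   # extra natural-language QA about the page
        {"question": "What is the page title?", "answer": "Example Landing Page"},
    ],
}
```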

--------------------------------------------------------------------------------------------------------

MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

MUMU introduces an innovative approach to multimodal image generation, allowing for the creation of images from prompts that combine both text and images. This research opens up exciting possibilities in creative tools, visual communication, and content creation. Potential applications include advanced photo editing software, educational tools for visual storytelling, and enhanced user interfaces for design applications. By enabling more intuitive and flexible image generation, MUMU could transform how we create and manipulate visual content across various industries and creative fields.

Authors:  William Berman, Alexander Peysakhovich

Link:  https://arxiv.org/abs/2406.18790v1

Date: 2024-06-26

Summary:

We train a model to generate images from multimodal prompts of interleaved text and images such as "a <picture of a man> man and his <picture of a dog> dog in an <picture of a cartoon> animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.

--------------------------------------------------------------------------------------------------------

Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging

This paper tackles the challenge of making large language models (LLMs) more accessible and deployable in resource-constrained environments. The proposed Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA) method offers a novel approach to model compression that preserves performance while significantly reducing model size. Potential applications include deploying powerful language models on mobile devices, improving AI accessibility in regions with limited computational resources, and reducing the environmental impact of AI by lowering energy consumption. This research could democratize access to advanced AI technologies across various sectors.

Authors:  Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Xi Chen, Cunhang Fan, Zhao Lv, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui

Link:  https://arxiv.org/abs/2406.16330v1

Date: 2024-06-24

Summary:

While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach that uses manifold learning and the Normalized Pairwise Information Bottleneck (NPIB) measure to merge similar layers, reducing model size while preserving essential performance. We evaluate MKA on multiple benchmark datasets and various LLMs. Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods. Moreover, when coupled with quantization, MKA delivers even greater compression. Specifically, on the MMLU dataset using the Llama3-8B model, MKA achieves a compression ratio of 43.75% with a minimal performance decrease of only 2.82%. The proposed MKA method offers a resource-efficient and performance-preserving model compression technique for LLMs.
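
A toy version of the layer-merging idea looks like the sketch below: score how similar adjacent layers are from their hidden representations, and fold sufficiently similar neighbours together by averaging their weights. Plain cosine similarity stands in for the paper's manifold-learning and NPIB measure, and the threshold and weight-averaging rule are assumptions for illustration.

```python
import numpy as np

def layer_similarity(h_i, h_j):
    """Cosine similarity between flattened hidden representations of two layers
    (a crude stand-in for the manifold/NPIB alignment used in MKA)."""
    a, b = h_i.ravel(), h_j.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def merge_adjacent_layers(weights, hiddens, threshold=0.95):
    """Greedily fold a layer into its predecessor (by averaging their weight
    tensors, assumed to share a shape) whenever their representations are similar."""
    merged = [weights[0]]
    for i in range(1, len(weights)):
        if layer_similarity(hiddens[i - 1], hiddens[i]) > threshold:
            merged[-1] = 0.5 * (merged[-1] + weights[i])
        else:
            merged.append(weights[i])
    return merged   # fewer layers than the input when merges occur
```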

--------------------------------------------------------------------------------------------------------

Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning

This paper addresses the challenges in Cross-lingual Cross-modal Retrieval (CCR), a crucial task for multilingual web search and content discovery. The proposed 1-to-K contrastive learning method improves consistency across languages and modalities in retrieval tasks. Potential applications include more accurate and fair multilingual image search engines, improved content recommendation systems for global platforms, and enhanced cross-cultural information retrieval tools. This research could significantly impact how we access and interact with multilingual and multimodal content on the internet, promoting better global information exchange.

Authors:  Zhijie Nie, Richong Zhang, Zhangchi Feng, Hailang Huang, Xudong Liu

Link:  https://arxiv.org/abs/2406.18254v1

Date: 2024-06-26

Summary:

Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search, which aims to break the barriers between modality and language simultaneously and achieve image-text retrieval in the multi-lingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; in particular, methods based on contrastive learning on large-scale data have significantly improved retrieval tasks. However, these methods directly follow existing pre-training methods in the cross-lingual or cross-modal domain, leading to two problems of inconsistency in CCR: methods in the cross-lingual style suffer from intra-modal error propagation, resulting in inconsistent recall performance across languages over the whole dataset, while methods in the cross-modal style suffer from inter-modal optimization direction bias, resulting in inconsistent ranks across languages within each instance, which cannot be reflected by Recall@K. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method, which treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-trained data, achieving a new state of the art.
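
One plausible reading of Mean Rank Variance, based only on the abstract's description (rank inconsistency across languages within each instance), is sketched below; the authors' exact formulation may differ.

```python
import numpy as np

def mean_rank_variance(ranks):
    """ranks[i, l]: rank of the correct target for instance i when queried in
    language l. Returns the mean over instances of the variance across languages."""
    ranks = np.asarray(ranks, dtype=float)
    return float(ranks.var(axis=1).mean())

print(mean_rank_variance([[1, 1, 1], [2, 2, 2]]))   # 0.0  -> perfectly consistent
print(mean_rank_variance([[1, 5, 9], [1, 1, 20]]))  # > 0  -> inconsistent across languages
```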

--------------------------------------------------------------------------------------------------------

In vivo and in vitro study of resorbable magnesium wires for medical implants: Mg purity, surface quality, Zn alloying and polymer coating

This research explores the potential of magnesium-based wires for medical implants, addressing challenges such as corrosion and hydrogen gas generation. By investigating various compositions and surface treatments, the study aims to optimize magnesium wires for bone-support applications. Potential applications include biodegradable orthopedic implants, such as cerclage and fixation wires, that could eliminate the need for secondary removal surgeries. This work could revolutionize orthopedic and reconstructive surgeries by providing implants that support healing and then safely dissolve, improving patient outcomes and reducing healthcare costs.

Authors:  K. Tesar, J. Lunackova, M. Jex, M. Zaloudkova, R. Vrbova, M. Bartos, P. Klein, L. Vistejnova, J. Duskova, E. Filova, Z. Sucharda, M. Steinerova, S. Habr, K. Balik, A. Singh

Link:  https://arxiv.org/abs/2406.18172v1

Date: 2024-06-26

Summary:

Magnesium is an excellent material in terms of biocompatibility and its corrosion products can serve as an active source for new bone formation. However, localized corrosion and H2 generation limit the potential of Mg-based implants. Utilizing low-alloyed Mg-Zn wires can strongly reduce problems with large H2 bubbles and improve the mechanical properties considerably while maintaining excellent long-term biocompatibility. Acidic pickling and a polymer coating can be effectively used to lower the rate of in vivo degradation. In this work, microstructural, mechanical, and in vitro characterization of 250 μm and 300 μm extruded wires made from ultra-pure Mg, commercially pure Mg, Mg-0.15Zn, Mg-0.4Zn and Mg-1Zn was performed. Additionally, Mg-0.4Zn wires together with a variant coated with a copolymer of L-lactide and ε-caprolactone were tested in vivo on artificially damaged Wistar rat femurs. Based on the observed Mg-induced osteogenesis, polymer-coated Mg wires with a small addition of Zn are a promising material for bone-support applications, such as cerclage and fixation wires.

--------------------------------------------------------------------------------------------------------

Carrot and Stick: Inducing Self-Motivation with Positive & Negative Feedback

This paper introduces the CASTIC dataset, which explores self-motivation strategies through both positive and negative feedback. By providing a computational perspective on self-motivation, this research opens up new possibilities in fields such as education, workplace management, and personal development. Potential applications include AI-powered coaching tools, personalized learning systems that adapt to individual motivation styles, and improved human-computer interaction designs that better support user engagement and goal achievement. This work could contribute to more effective digital tools for personal and professional growth.

Authors:  Jimin Sohn, Jeihee Cho, Junyong Lee, Songmu Heo, Ji-Eun Han, David R. Mortensen

Link:  https://arxiv.org/abs/2406.16521v1

Date: 2024-06-24

Summary:

Positive thinking is thought to be an important component of self-motivation in various practical fields such as education and the workplace. Previous work, including sentiment transfer and positive reframing, has focused on the positive side of language. However, self-motivation that drives people to reach their goals has not yet been studied from a computational perspective. Moreover, negative feedback has not yet been explored, even though both positive and negative feedback are necessary for growing self-motivation. To facilitate self-motivation, we propose the CArrot and STICk (CASTIC) dataset, consisting of 12,590 sentences with 5 different strategies for enhancing self-motivation. Our data and code are publicly available.

--------------------------------------------------------------------------------------------------------

Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos

This research presents a novel approach to error detection in robot-assisted minimally invasive surgery (RMIS) using a Chain-of-Gesture (COG) prompting framework. By mimicking the decision-making processes of expert surgeons, this method offers real-time error detection without relying on accurate gesture identification. Potential applications include enhanced safety measures in robotic surgery, improved surgical training programs, and the development of more intelligent surgical assistance systems. This work could significantly contribute to reducing surgical errors, improving patient outcomes, and advancing the field of robotic surgery.

Authors:  Zhimin Shao, Jialang Xu, Danail Stoyanov, Evangelos B. Mazomenos, Yueming Jin

Link:  https://arxiv.org/abs/2406.19217v1

Date: 2024-06-27

Summary:

Despite significant advancements in robotic systems and surgical data science, ensuring safe and optimal execution in robot-assisted minimally invasive surgery (RMIS) remains a complex challenge. Current surgical error detection methods involve two parts: identifying surgical gestures and then detecting errors within each gesture clip. These methods seldom consider the rich contextual and semantic information inherent in surgical videos, limiting their performance due to reliance on accurate gesture identification. Motivated by chain-of-thought prompting in natural language processing, this letter presents a novel, real-time, end-to-end error detection framework, Chain-of-Gesture (COG) prompting, leveraging contextual information from surgical videos. This encompasses two reasoning modules designed to mimic the decision-making processes of expert surgeons. Concretely, we first design a Gestural-Visual Reasoning module, which utilizes transformer and attention architectures for gesture prompting, while the second, a Multi-Scale Temporal Reasoning module, employs a multi-stage temporal convolutional network with both slow and fast paths for temporal information extraction. We extensively validate our method on the public benchmark RMIS dataset JIGSAWS. Our method encapsulates the reasoning processes inherent to surgical activities, enabling it to outperform the state-of-the-art by 4.6% in F1 score, 4.6% in Accuracy, and 5.9% in Jaccard index while processing each frame in 6.69 milliseconds on average, demonstrating the great potential of our approach in enhancing the safety and efficacy of RMIS procedures and surgical education. The code will be available.
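
A rough skeleton of the two-module design, assuming per-frame visual features as input, might look like the sketch below: a transformer branch producing gesture-aware features feeds a temporal-convolution branch with slow and fast paths before frame-wise error classification. Layer sizes, the single-stage convolutions, and the module wiring are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoBranchErrorDetector(nn.Module):
    """Illustrative gesture-reasoning + multi-scale temporal-reasoning skeleton."""
    def __init__(self, feat_dim=256, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.gesture_reasoning = nn.TransformerEncoder(layer, num_layers=2)
        self.slow_path = nn.Conv1d(feat_dim, feat_dim, kernel_size=7, padding=3)
        self.fast_path = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, frame_feats):               # (batch, time, feat_dim)
        g = self.gesture_reasoning(frame_feats)   # gesture-aware features
        t = g.transpose(1, 2)                     # (batch, feat_dim, time) for Conv1d
        multi_scale = torch.cat([self.slow_path(t), self.fast_path(t)], dim=1)
        return self.classifier(multi_scale.transpose(1, 2))  # per-frame error logits
```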

--------------------------------------------------------------------------------------------------------

360 in the Wild: Dataset for Depth Prediction and View Synthesis

This paper introduces a large-scale dataset of 360-degree videos captured in diverse real-world environments. By providing camera pose and depth map information for each image, this dataset addresses the scarcity of panoramic image datasets with essential metadata. Potential applications include improved virtual reality experiences, more accurate 3D reconstruction from single images, and enhanced computer vision algorithms for omnidirectional cameras. This research could accelerate advancements in fields such as autonomous navigation, immersive media, and environmental mapping.

Authors:  Kibaek Park, Francois Rameau, Jaesik Park, In So Kweon

Link:  https://arxiv.org/abs/2406.18898v1

Date: 2024-06-27

Summary:

The large abundance of perspective camera datasets facilitated the emergence of novel learning-based strategies for various tasks, such as camera localization, single image depth estimation, or view synthesis. However, panoramic or omnidirectional image datasets, including essential information, such as pose and depth, are mostly made with synthetic scenes. In this work, we introduce a large-scale 360° video dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Hence, this dataset exhibits very diversified environments (e.g., indoor and outdoor) and contexts (e.g., with and without moving objects). Each of the 25K images constituting our dataset is provided with its respective camera's pose and depth map. We illustrate the relevance of our dataset for two main tasks, namely, single image depth estimation and view synthesis.

--------------------------------------------------------------------------------------------------------

Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other

This paper presents Learnable Singular value Increment (LSI), an advanced solution for addressing accuracy decay in quantized Large Language Models (LLMs). By combining the merits of existing quantization error compensation methods, LSI achieves state-of-the-art performance across various quantization settings. Potential applications include deploying powerful language models on resource-constrained devices, enabling edge computing for natural language processing tasks, and reducing the computational costs of running LLMs in data centers. This research could significantly contribute to making advanced AI models more accessible and energy-efficient.

Authors:  Yifei Gao, Jie Ou, Lei Wang, Yuting Xiao, Zhiyuan Xiang, Ruiting Dai, Jun Cheng

Link:  https://arxiv.org/abs/2406.16299v1

Date: 2024-06-24

Summary:

Emergent Large Language Models (LLMs) are distinguished from traditional language models by their extraordinary performance and powerful deduction capacity. However, the computational and storage costs of these LLMs are staggering, so quantization has become a prominent topic. To address the accuracy decay caused by quantization, two streams of work in post-training quantization stand out. One uses other weights to compensate for existing quantization error, while the other transfers the quantization difficulty to other parts of the model. Combining both merits, we introduce Learnable Singular value Increment (LSI) as an advanced solution. LSI uses Singular Value Decomposition to extract the singular values of the weights and makes them learnable, helping the weights compensate for each other conditioned on the activations. Incorporating LSI with existing techniques, we achieve state-of-the-art performance in diverse quantization settings, whether in weight-only, weight-activation, or extremely low-bit scenarios. By unleashing the potential of LSI, efficient fine-tuning of quantized models is no longer a prohibitive problem.
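
The core mechanism, as described, amounts to freezing an SVD of the quantized weight and learning a small increment on its singular values. The toy module below illustrates that idea; the paper additionally conditions the compensation on activations and plugs into existing quantization pipelines, which this sketch omits.

```python
import torch
import torch.nn as nn

class LSILinear(nn.Module):
    """Toy learnable-singular-value-increment layer over a frozen quantized weight."""
    def __init__(self, w_quantized: torch.Tensor):    # shape (out_features, in_features)
        super().__init__()
        U, S, Vh = torch.linalg.svd(w_quantized, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.delta = nn.Parameter(torch.zeros_like(S))  # learnable increment on singular values

    def forward(self, x):                               # x: (batch, in_features)
        w = self.U @ torch.diag(self.S + self.delta) @ self.Vh
        return x @ w.T
```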

--------------------------------------------------------------------------------------------------------

Video-Infinity: Distributed Long Video Generation

This paper introduces Video-Infinity, a distributed inference pipeline that enables the generation of long-form videos using multiple GPUs. By addressing the challenges of memory requirements and processing time, this method allows for the creation of videos up to 2,300 frames long in just minutes. Potential applications include advanced video editing tools, improved special effects generation for film and television, and more sophisticated AI-generated content for social media and advertising. This research could revolutionize video production workflows and open up new possibilities in visual storytelling and digital content creation.

Authors:  Zhenxiong Tan, Xingyi Yang, Songhua Liu, Xinchao Wang

Link:  https://arxiv.org/abs/2406.16260v1

Date: 2024-06-24

Summary:

Diffusion models have recently achieved remarkable results for video generation. Despite the encouraging performance, the generated videos are typically constrained to a small number of frames, resulting in clips lasting merely a few seconds. The primary challenges in producing longer videos include the substantial memory requirements and the extended processing time required on a single GPU. A straightforward solution would be to split the workload across multiple GPUs, which, however, leads to two issues: (1) ensuring all GPUs communicate effectively to share timing and context information, and (2) modifying existing video diffusion models, which are usually trained on short sequences, to create longer videos without additional training. To tackle these, in this paper we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. Specifically, we propose two coherent mechanisms: Clip parallelism and Dual-scope attention. Clip parallelism optimizes the gathering and sharing of context information across GPUs, minimizing communication overhead, while Dual-scope attention modulates the temporal self-attention to balance local and global contexts efficiently across the devices. Together, the two mechanisms join forces to distribute the workload and enable the fast generation of long videos. Under a setup of 8 Nvidia 6000 Ada GPUs (48 GB each), our method generates videos of up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than prior methods.
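
The local/global balance that Dual-scope attention aims for can be pictured with a small index-selection sketch: each frame attends densely to a neighbourhood and sparsely to a strided set of frames spanning the whole clip. Window and stride sizes below are assumptions for illustration; this is not the paper's implementation or its cross-GPU communication scheme.

```python
def dual_scope_context(t, num_frames, local_window=8, global_stride=16):
    """Frame indices that frame t attends to: a dense local window plus
    a sparse, strided global sample across the clip."""
    local = [i for i in range(t - local_window, t + local_window + 1) if 0 <= i < num_frames]
    global_ = list(range(0, num_frames, global_stride))
    return sorted(set(local) | set(global_))

print(len(dual_scope_context(t=500, num_frames=2300)))  # far fewer keys than full attention
```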

--------------------------------------------------------------------------------------------------------

"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models

This paper highlights a critical security threat in Retrieval-Augmented Generative (RAG) models, demonstrating how adversaries can manipulate model behavior by injecting deceptive content into knowledge bases. This research exposes the vulnerabilities of systems that rely on publicly accessible data sources. Potential applications include developing more robust security measures for AI systems, improving the integrity of machine-generated content, and enhancing the trustworthiness of AI-assisted decision-making tools. This work could significantly impact the design and deployment of AI systems across various sectors, from information retrieval to automated content generation.

Authors:  Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Song Wang, Jundong Li, Tianlong Chen, Huan Liu

Link:  https://arxiv.org/abs/2406.19417v1

Date: 2024-06-26

Summary:

Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases, improving their performance in applications like fact-checking and information searching. In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases by injecting deceptive content into the retrieval database, intentionally changing the model's behavior. This threat is critical as it mirrors real-world usage scenarios where RAG systems interact with publicly accessible knowledge bases, such as web scrapings and user-contributed data pools. We target a realistic setting where the adversary has no knowledge of users' queries, the knowledge base data, or the LLM parameters. We demonstrate that it is possible to exploit the model successfully through crafted content uploads with access to the retriever. Our findings emphasize an urgent need for security measures in the design and deployment of RAG systems to prevent potential manipulation and ensure the integrity of machine-generated content.
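
The threat model can be illustrated with a toy retriever: an attacker who can write to a shared knowledge base plants a passage crafted to match likely queries, and the retriever then hands it to the LLM as context. The sketch below uses TF-IDF retrieval purely for illustration and is not the paper's attack procedure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Pizza dough is made from flour, water, yeast and salt.",
    "Cheese should be added before baking the pizza.",
]
# Adversarial injection into the publicly writable knowledge base.
corpus.append("Adding glue to the sauce helps keep cheese on pizza.")

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform(["How do I keep cheese on my pizza?"])

scores = cosine_similarity(query_vector, doc_vectors)[0]
print(corpus[scores.argmax()])  # the injected passage wins retrieval and reaches the LLM
```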

--------------------------------------------------------------------------------------------------------

DeepSense-V2V: A Vehicle-to-Vehicle Multi-Modal Sensing, Localization, and Communications Dataset

As intelligent transport systems evolve, high-speed vehicle-to-vehicle (V2V) communication becomes crucial for coordination and safety. This paper introduces the first large-scale multi-modal dataset for studying mmWave V2V communications. Using a two-vehicle testbed equipped with various sensors, the researchers collected data across diverse driving conditions. This dataset provides realistic scenarios for developing effective communication strategies, particularly for high-frequency bands where signal propagation is challenging. Potential applications include improving autonomous vehicle coordination, enhancing road safety systems, and developing more efficient traffic management solutions. This resource could accelerate advancements in V2V communication technologies and intelligent transportation systems.

Authors:  Joao Morais, Gouranga Charan, Nikhil Srinivas, Ahmed Alkhateeb

Link:  https://arxiv.org/abs/2406.17908v1

Date: 2024-06-25

Summary:

High data rate and low-latency vehicle-to-vehicle (V2V) communication are essential for future intelligent transport systems to enable coordination, enhance safety, and support distributed computing and intelligence requirements. Developing effective communication strategies, however, demands realistic test scenarios and datasets. This is important at the high-frequency bands where more spectrum is available, yet harvesting this bandwidth is challenged by the need for directional transmission and the sensitivity of signal propagation to blockages. This work presents the first large-scale multi-modal dataset for studying mmWave vehicle-to-vehicle communications. It presents a two-vehicle testbed that comprises data from a 360-degree camera, four radars, four 60 GHz phased arrays, a 3D lidar, and two precise GPS receivers. The dataset contains vehicles driving during the day and night for 120 km in intercity and rural settings, with speeds up to 100 km per hour. More than one million objects were detected across all images, from trucks to bicycles. This work further includes detailed dataset statistics that demonstrate the coverage of various situations and highlight how this dataset can enable novel machine-learning applications.

--------------------------------------------------------------------------------------------------------

Dynamic Scheduling for Vehicle-to-Vehicle Communications Enhanced Federated Learning

This paper explores the potential of vehicular federated learning (VFL) enhanced by direct vehicle-to-vehicle (V2V) communications. The researchers propose a V2V-enhanced dynamic scheduling algorithm to optimize VFL training performance while considering energy constraints and vehicle mobility. This approach could significantly improve edge computing capabilities in connected vehicle networks. Potential applications include more efficient traffic prediction systems, enhanced autonomous driving algorithms, and improved real-time decision-making for connected vehicles. By leveraging the distributed nature of vehicular networks, this research could lead to more robust and adaptive intelligent transportation systems.

Authors:  Jintao Yan, Tan Chen, Yuxuan Sun, Zhaojun Nan, Sheng Zhou, Zhisheng Niu

Link:  https://arxiv.org/abs/2406.17470v1

Date: 2024-06-25

Summary:

Leveraging the computing and sensing capabilities of vehicles, vehicular federated learning (VFL) has been applied to edge training for connected vehicles. The dynamic and interconnected nature of vehicular networks presents unique opportunities to harness direct vehicle-to-vehicle (V2V) communications, enhancing VFL training efficiency. In this paper, we formulate a stochastic optimization problem to optimize the VFL training performance, considering the energy constraints and mobility of vehicles, and propose a V2V-enhanced dynamic scheduling (VEDS) algorithm to solve it. The model aggregation requirements of VFL and the limited transmission time due to mobility result in a stepwise objective function, which presents challenges in solving the problem. We thus propose a derivative-based drift-plus-penalty method to convert the long-term stochastic optimization problem to an online mixed integer nonlinear programming (MINLP) problem, and provide a theoretical analysis to bound the performance gap between the online solution and the offline optimal solution. Further analysis of the scheduling priority reduces the original problem into a set of convex optimization problems, which are efficiently solved using the interior-point method. Experimental results demonstrate that compared with the state-of-the-art benchmarks, the proposed algorithm enhances the image classification accuracy on the CIFAR-10 dataset by 3.18% and reduces the average displacement errors on the Argoverse trajectory prediction dataset by 10.21%.
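
The drift-plus-penalty idea the paper builds on can be sketched in its textbook form: at each slot, pick the action that minimizes V times the negative utility plus the virtual queue times the energy cost, then update the queue that tracks cumulative energy over-use. The action set, weights, and budget below are assumptions for illustration; the paper's derivative-based variant and MINLP solution are not reproduced.

```python
def drift_plus_penalty_step(actions, energy_queue, energy_budget, V=10.0):
    """One slot of a generic Lyapunov drift-plus-penalty scheduler."""
    best = min(actions, key=lambda a: -V * a["utility"] + energy_queue * a["energy"])
    energy_queue = max(energy_queue + best["energy"] - energy_budget, 0.0)
    return best, energy_queue

# Example: schedule a V2V-assisted model upload or stay idle this slot.
actions = [
    {"name": "upload_via_v2v", "utility": 1.0, "energy": 0.8},
    {"name": "idle", "utility": 0.0, "energy": 0.0},
]
choice, queue = drift_plus_penalty_step(actions, energy_queue=0.0, energy_budget=0.5)
print(choice["name"], queue)
```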

--------------------------------------------------------------------------------------------------------

EMVD dataset: a dataset of extreme vocal distortion techniques used in heavy metal

This paper introduces the Extreme Metal Vocals Dataset, a unique collection of audio recordings featuring extreme vocal techniques used in heavy metal music. Comprising 760 audio excerpts from 27 different singers, the dataset covers various distortion techniques and vocal effects across different pitch ranges. This resource fills a gap in audio processing research by providing isolated recordings of these specialized vocal techniques. Potential applications include improving music genre classification algorithms, developing more accurate vocal effect processing tools, and enhancing automatic transcription systems for extreme metal music. This dataset could also contribute to voice pathology research and the development of vocal training applications.

Authors:  Modan Tailleur, Julien Pinquier, Laurent Millot, Corsin Vogel, Mathieu Lagrange

Link:  https://arxiv.org/abs/2406.17732v1

Date: 2024-06-24

Summary:

In this paper, we introduce the Extreme Metal Vocals Dataset, which comprises a collection of recordings of extreme vocal techniques performed within the realm of heavy metal music. The dataset consists of 760 audio excerpts ranging from 1 second to 30 seconds in length, totaling about 100 minutes of audio material, roughly composed of 60 minutes of distorted voices and 40 minutes of clear voice recordings. These vocal recordings are from 27 different singers and are provided without accompanying musical instruments or post-processing effects. The distortion taxonomy within this dataset encompasses four distinct distortion techniques and three vocal effects, all performed in different pitch ranges. The performance of a state-of-the-art deep learning model is evaluated on two different classification tasks related to vocal techniques, demonstrating the potential of this resource for the audio processing community.

--------------------------------------------------------------------------------------------------------

Ice on curved surfaces: defect rings and differential local dynamics

This paper examines the behavior of ice systems on curved surfaces, focusing on the role of defect rings in locally constrained dynamics. The researchers demonstrate how curvature affects the "flippability" of different polygon shapes in the ice lattice. This study provides insights into the fundamental physics of ice-like systems and could have implications for understanding similar phenomena in other materials. Potential applications include developing new materials with tunable properties, improving models of phase transitions in complex systems, and advancing our understanding of quantum many-body systems. This research could also contribute to the design of artificial spin ice systems for quantum computing applications.

Authors:  Adhitya Sivaramakrishnan, R. Ganesh

Link:  https://arxiv.org/abs/2406.19453v1

Date: 2024-06-27

Summary:

Ice systems are prototypes of locally constrained dynamics. This is exemplified in Coulomb-liquid phases where a large space of configurations is sampled, each satisfying local ice rules. Dynamics proceeds through 'flipping' rings, i.e., through reversing arrows running along the edges of a polygon. We examine the role of defect rings in such phases, with square-ice as a testing ground. When placed on a curved surface, the underlying square lattice will form defects such as triangles or pentagons. We show that triangular defects are statistically more 'flippable' than the background. In contrast, pentagons and larger polygons are less flippable. In fact, flippability decreases monotonically with ring size, as seen from a Pauling-like argument. As an explicit demonstration, we wrap the square ice model on a sphere. We start from an octahedron and perform repeated rectifications, producing a series of clusters with sphere-like geometry. They contain a fixed number of defect triangles in an otherwise square lattice. We numerically enumerate all ice-rule-satisfying configurations. Indeed, triangles are flippable in a larger fraction of configurations than quadrilaterals. The obtained flippabilities are in broad agreement with the Pauling-like estimates. As a minimal model for dynamics, we construct a Hamiltonian with quantum tunnelling terms that flip rings. The resulting ground state is a superposition of all ice configurations. The dominant contribution to its energy comes from localized resonance within triangles. Our results suggest local dynamics as a promising observable for experiments in spin ice and artificial ice systems. They also point to hierarchical dynamics in materials such as ice V that contain rings of multiple sizes.
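
The Pauling-like trend can be reproduced with a back-of-envelope count: if each edge arrow on an n-sided ring were independent and equally likely to point either way, the ring would circulate head-to-tail, and hence be flippable, in 2 of the 2^n assignments. The enumeration below gives only this naive estimate, ignoring vertex correlations, and is not the paper's exact configuration count.

```python
from itertools import product

def flippable_fraction(n_edges):
    """Fraction of independent arrow assignments on an n-sided ring that
    circulate head-to-tail (all 'clockwise' or all 'counter-clockwise')."""
    count = sum(1 for arrows in product((0, 1), repeat=n_edges) if len(set(arrows)) == 1)
    return count / 2 ** n_edges

for n in (3, 4, 5):                      # triangle, square, pentagon
    print(n, flippable_fraction(n))      # 0.25, 0.125, 0.0625 -- decreases with ring size
```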

--------------------------------------------------------------------------------------------------------

The Remarkable Robustness of LLMs: Stages of Inference?

This paper investigates the unexpected resilience of Large Language Models (LLMs) to structural interventions such as layer deletion and swapping. The researchers propose a four-stage inference process common across different LLM architectures. This work provides valuable insights into the inner workings of LLMs and their ability to maintain performance despite significant alterations. Potential applications include developing more efficient model compression techniques, improving model interpretability, and creating more robust AI systems. Understanding these stages of inference could also lead to the design of more adaptable and fault-tolerant language models for various natural language processing tasks.

Authors:  Vedang Lad, Wes Gurnee, Max Tegmark

Link:  https://arxiv.org/abs/2406.19384v1

Date: 2024-06-27

Summary:

We demonstrate and investigate the remarkable robustness of Large Language Models by deleting and swapping adjacent layers. We find that deleting and swapping interventions retain 72-95% of the original model's prediction accuracy without fine-tuning, and that models with more layers exhibit more robustness. Based on the results of the layer-wise intervention and further experiments, we hypothesize the existence of four universal stages of inference across eight different models: detokenization, feature engineering, prediction ensembling, and residual sharpening. The first stage integrates local information, lifting raw token representations into higher-level contextual representations. Next is the iterative refinement of task and entity-specific features. Then, the second half of the model begins with a phase transition, where hidden representations align more with the vocabulary space due to specialized model components. Finally, the last layer sharpens the next-token distribution by eliminating obsolete features that add noise to the prediction.
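
The kind of intervention studied can be reproduced with any decoder-only model whose blocks sit in an nn.ModuleList; the minimal sketch below uses GPT-2 as a stand-in (the paper evaluates eight models and measures how much prediction accuracy survives such edits).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

blocks = model.transformer.h                     # nn.ModuleList of transformer blocks
del blocks[6]                                    # delete one middle layer ...
# blocks[4], blocks[5] = blocks[5], blocks[4]    # ... or swap two adjacent layers instead
model.config.n_layer = len(blocks)               # keep the config consistent

with torch.no_grad():
    ids = tok("The capital of France is", return_tensors="pt").input_ids
    logits = model(ids, use_cache=False).logits
print(tok.decode([logits[0, -1].argmax().item()]))  # is the next-token prediction still sensible?
```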

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.