Eye On AI

Week Ending 6.16.2024

RESEARCH WATCH: 6.16.2024

MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

MeshAnything is a novel approach to generating high-quality 3D meshes from various 3D representations. By treating mesh extraction as a generation problem, it produces artist-created meshes aligned with specified shapes, improving efficiency and quality. This could revolutionize 3D asset production for industries like gaming, animation, and visualization.

Authors:  Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Jiaxiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu, Guosheng Lin, Chi Zhang

Link:  https://arxiv.org/abs/2406.10163v1

Date: 2024-06-14

Summary:

Recently, 3D assets created via reconstruction and generation have matched the quality of manually crafted assets, highlighting their potential for replacement. However, this potential is largely unrealized because these assets always need to be converted to meshes for 3D industry applications, and the meshes produced by current mesh extraction methods are significantly inferior to Artist-Created Meshes (AMs), i.e., meshes created by human artists. Specifically, current mesh extraction methods rely on dense faces and ignore geometric features, leading to inefficiencies, complicated post-processing, and lower representation quality. To address these issues, we introduce MeshAnything, a model that treats mesh extraction as a generation problem, producing AMs aligned with specified shapes. By converting 3D assets in any 3D representation into AMs, MeshAnything can be integrated with various 3D asset production methods, thereby enhancing their application across the 3D industry. The architecture of MeshAnything comprises a VQ-VAE and a shape-conditioned decoder-only transformer. We first learn a mesh vocabulary using the VQ-VAE, then train the shape-conditioned decoder-only transformer on this vocabulary for shape-conditioned autoregressive mesh generation. Our extensive experiments show that our method generates AMs with hundreds of times fewer faces, significantly improving storage, rendering, and simulation efficiencies, while achieving precision comparable to previous methods.
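
To make the pipeline concrete, here is a minimal PyTorch sketch of the two components the abstract names: a VQ-VAE-style codebook that turns per-face features into discrete mesh tokens, and a decoder-only transformer that predicts those tokens autoregressively with a shape embedding prepended as conditioning. The dimensions, the face featurizer, and the shape encoder are placeholder assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MeshVQVAE(nn.Module):
    """Toy VQ step: map per-face features to indices of the nearest codebook entry."""
    def __init__(self, n_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def quantize(self, face_feats):                         # face_feats: (B, F, dim)
        flat = face_feats.reshape(-1, face_feats.size(-1))  # (B*F, dim)
        d = torch.cdist(flat, self.codebook.weight)         # distances to codes
        return d.argmin(dim=-1).reshape(face_feats.shape[:-1])  # discrete mesh tokens

class ShapeConditionedDecoder(nn.Module):
    """Decoder-only transformer: predicts the next mesh token given a shape embedding
    (prepended as a prefix) and the tokens generated so far."""
    def __init__(self, n_codes=1024, dim=256, n_layers=4, shape_dim=1024):
        super().__init__()
        self.tok = nn.Embedding(n_codes, dim)
        self.shape_proj = nn.Linear(shape_dim, dim)   # e.g. pooled point-cloud encoder output
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, n_codes)

    def forward(self, tokens, shape_emb):             # tokens: (B, T), shape_emb: (B, shape_dim)
        x = torch.cat([self.shape_proj(shape_emb)[:, None], self.tok(tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(x, mask=mask)                 # causal self-attention
        return self.head(h[:, :-1])                   # next-token logits for positions 0..T-1

vq = MeshVQVAE()
tokens = vq.quantize(torch.randn(2, 16, 256))         # (2, 16) discrete mesh tokens
dec = ShapeConditionedDecoder()
logits = dec(tokens, torch.randn(2, 1024))            # (2, 16, 1024) next-token logits
print(tokens.shape, logits.shape)
```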

--------------------------------------------------------------------------------------------------------

Biomarker based Cancer Classification using an Ensemble with Pre-trained Models

This paper introduces a novel ensemble model for biomarker-based cancer classification, leveraging pre-trained models. The proposed approach achieves high accuracy and robustness, especially on imbalanced datasets, paving the way for non-invasive and personalized cancer detection and monitoring through liquid biopsies.

Authors:  Chongmin Lee, Jihie Kim

Link:  https://arxiv.org/abs/2406.10087v1

Date: 2024-06-14

Summary:

Certain cancer types, notably pancreatic cancer, are difficult to detect at an early stage, underscoring the importance of discovering causal relationships between biomarkers and cancer so that cancer can be identified efficiently. By allowing for the detection and monitoring of specific biomarkers through a non-invasive method, liquid biopsies enhance the precision and efficacy of medical interventions, advancing the move towards personalized healthcare. Several machine learning algorithms, such as Random Forest and SVM, are utilized for classification, yet they are inefficient because they require hyperparameter tuning. We leverage a meta-trained Hyperfast model for classifying cancer, accomplishing the highest AUC of 0.9929 while also achieving robustness on highly imbalanced datasets compared to other ML algorithms in several binary classification tasks (e.g., breast invasive carcinoma, BRCA vs. non-BRCA). We also propose a novel ensemble model combining the pre-trained Hyperfast model, XGBoost, and LightGBM for multi-class classification tasks, achieving an incremental increase in accuracy (0.9464) while using only 500 PCA features, in contrast to previous studies that used more than 2,000 features for similar results.
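
A hedged sketch of the ensemble recipe on stand-in data: reduce the feature matrix to 500 principal components, then average class probabilities from XGBoost and LightGBM. The commented-out HyperFastClassifier line assumes the hyperfast package exposes a scikit-learn-style classifier and can be enabled if it is installed; the data here is random, not real biomarker data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# from hyperfast import HyperFastClassifier   # assumed scikit-learn-style interface

# Stand-in biomarker matrix: 800 samples x 2000 features, 3 cancer classes.
rng = np.random.default_rng(0)
X, y = rng.random((800, 2000)), rng.integers(0, 3, 800)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Reduce to 500 principal components, matching the paper's feature budget.
pca = PCA(n_components=500).fit(X_tr)
Z_tr, Z_te = pca.transform(X_tr), pca.transform(X_te)

# Soft-voting ensemble: average predicted class probabilities across members.
models = [XGBClassifier(eval_metric="mlogloss"), LGBMClassifier()]
# models.append(HyperFastClassifier())        # the meta-trained member, if installed
probas = []
for m in models:
    m.fit(Z_tr, y_tr)
    probas.append(m.predict_proba(Z_te))
pred = np.mean(probas, axis=0).argmax(axis=1)
print("ensemble accuracy:", (pred == y_te).mean())
```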

--------------------------------------------------------------------------------------------------------

Localizing Events in Videos with Multimodal Queries

Localizing Events in Videos with Multimodal Queries presents a new benchmark, ICQ, for evaluating models' ability to localize events in videos using multimodal semantic queries composed of images and text. This could advance video understanding and retrieval, benefiting applications like video search and video foundation models.

Authors:  Gengyuan Zhang, Mang Ling Ada Fok, Yan Xia, Yansong Tang, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

Link:  https://arxiv.org/abs/2406.10079v1

Date: 2024-06-14

Summary:

Video understanding is a pivotal task in the digital era, yet the dynamic and multi-event nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically natural language descriptions of the target event's semantics. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset ICQ-Highlight. Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text to adjust the image's semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization.
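
One simple way to adapt an off-the-shelf model to multimodal queries (an illustrative sketch, not necessarily one of the paper's three adaptation methods) is to blend CLIP embeddings of the reference image and the refinement text into a single query vector and score candidate frames against it:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def multimodal_query_embedding(ref_image, refinement, alpha=0.5):
    """Blend the reference-image and refinement-text embeddings into one query."""
    with torch.no_grad():
        img = model.get_image_features(**proc(images=ref_image, return_tensors="pt"))
        txt = model.get_text_features(**proc(text=[refinement], return_tensors="pt", padding=True))
    img, txt = img / img.norm(dim=-1, keepdim=True), txt / txt.norm(dim=-1, keepdim=True)
    q = alpha * img + (1 - alpha) * txt
    return q / q.norm(dim=-1, keepdim=True)

def best_frame(frames, query):
    """Index of the frame whose CLIP embedding is most similar to the query."""
    with torch.no_grad():
        f = model.get_image_features(**proc(images=frames, return_tensors="pt"))
    f = f / f.norm(dim=-1, keepdim=True)
    return int((f @ query.T).squeeze(-1).argmax())

frames = [Image.new("RGB", (224, 224), c) for c in ("red", "green", "blue")]
query = multimodal_query_embedding(frames[2], "a mostly blue scene")
print("best frame:", best_frame(frames, query))
```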

--------------------------------------------------------------------------------------------------------

Group and Shuffle: Efficient Structured Orthogonal Parametrization

Group and Shuffle proposes a new class of structured matrices for efficient orthogonal parametrization, improving parameter and computational efficiency in tasks like fine-tuning pre-trained models. This could enhance the training and deployment of large neural networks across various domains.

Authors:  Mikhail Gorbunov, Nikolay Yudin, Vera Soboleva, Aibek Alanov, Alexey Naumov, Maxim Rakhuba

Link:  https://arxiv.org/abs/2406.10019v1

Date: 2024-06-14

Summary:

The increasing size of neural networks has led to a growing demand for methods of efficient fine-tuning. Recently, an orthogonal fine-tuning paradigm was introduced that uses orthogonal matrices for adapting the weights of a pretrained model. In this paper, we introduce a new class of structured matrices, which unifies and generalizes structured classes from previous works. We examine properties of this class and build a structured orthogonal parametrization upon it. We then use this parametrization to modify the orthogonal fine-tuning framework, improving parameter and computational efficiency. We empirically validate our method on different domains, including the adaptation of text-to-image diffusion models and downstream task fine-tuning in language modeling. Additionally, we adapt our construction for orthogonal convolutions and conduct experiments with 1-Lipschitz neural networks.
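
The paper's exact parametrization is in the preprint, but the name suggests a familiar pattern; the numpy sketch below shows one plausible reading, assumed here only for illustration: block-diagonal orthogonal factors ("group") interleaved with a fixed channel permutation ("shuffle"), whose product is orthogonal yet far cheaper to store and apply than a dense orthogonal matrix.

```python
import numpy as np

def block_diag_orthogonal(dim, block, rng):
    """Block-diagonal matrix whose blocks are random orthogonal matrices (via QR)."""
    M = np.zeros((dim, dim))
    for i in range(0, dim, block):
        Q, _ = np.linalg.qr(rng.standard_normal((block, block)))
        M[i:i + block, i:i + block] = Q
    return M

def channel_shuffle_perm(dim, groups):
    """ShuffleNet-style permutation matrix: interleave channels across groups."""
    return np.eye(dim)[np.arange(dim).reshape(groups, -1).T.ravel()]

rng = np.random.default_rng(0)
dim, block = 16, 4
W = block_diag_orthogonal(dim, block, rng)
P = channel_shuffle_perm(dim, groups=dim // block)
V = block_diag_orthogonal(dim, block, rng)
GS = W @ P @ V                                  # structured orthogonal matrix
print(np.allclose(GS @ GS.T, np.eye(dim)))      # True: product of orthogonal factors
```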

--------------------------------------------------------------------------------------------------------

Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming

Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming assesses state-of-the-art models' performance on standardized tests designed to evaluate computational thinking and problem-solving skills. This could guide the development of more capable generative models for educational applications and human-AI collaboration.

Authors:  Victor-Alexandru Pădurean, Adish Singla

Link:  https://arxiv.org/abs/2406.09891v1

Date: 2024-06-14

Summary:

Generative models have demonstrated human-level proficiency in various benchmarks across domains like programming, natural sciences, and general knowledge. Despite these promising results on competitive benchmarks, they still struggle with seemingly simple problem-solving tasks typically carried out by elementary-level students. How do state-of-the-art models perform on standardized tests designed to assess computational thinking and problem-solving skills at schools? In this paper, we curate a novel benchmark involving computational thinking tests grounded in elementary visual programming domains. Our initial results show that state-of-the-art models like GPT-4o and Llama3 barely match the performance of an average school student. To further boost the performance of these models, we fine-tune them using a novel synthetic data generation methodology. The key idea is to develop a comprehensive dataset using symbolic methods that capture different skill levels, ranging from recognition of visual elements to multi-choice quizzes to synthesis-style tasks. We showcase how various aspects of symbolic information in synthetic data help improve fine-tuned models' performance. We will release the full implementation and datasets to facilitate further research on enhancing computational thinking in generative models.

--------------------------------------------------------------------------------------------------------

Talking Heads: Understanding Inter-layer Communication in Transformer Language Models

Talking Heads explores the inter-layer communication mechanisms in transformer language models, revealing an intricate interpretable structure. This understanding could facilitate future analysis of complex behaviors in large language models and improve their performance on specific tasks.

Authors:  Jack Merullo, Carsten Eickhoff, Ellie Pavlick

Link:  https://arxiv.org/abs/2406.09519v1

Date: 2024-06-13

Summary:

Although it is known that transformer language models (LMs) pass features from early layers to later layers, it is not well understood how this information is represented and routed by the model. By analyzing a particular mechanism LMs use to accomplish this, we find that it is also used to recall items from a list, and show that this mechanism can explain an otherwise arbitrary-seeming sensitivity of the model to the order of items in the prompt. Specifically, we find that models write into low-rank subspaces of the residual stream to represent features which are then read out by specific later layers, forming low-rank communication channels between layers. By decomposing attention head weight matrices with the Singular Value Decomposition (SVD), we find that previously described interactions between heads separated by one or more layers can be predicted via analysis of their weight matrices. We show that it is possible to manipulate the internal model representations as well as edit model weights based on the mechanism we discover in order to significantly improve performance on our synthetic Laundry List task, which requires recall from a list, often improving task accuracy by over 20%. Our analysis reveals a surprisingly intricate interpretable structure learned from language model pretraining, and helps us understand why sophisticated LMs sometimes fail in simple domains, facilitating future analysis of more complex behaviors.
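
A toy illustration of the low-rank communication-channel idea, using random weights rather than a real LM: take the SVD of an earlier head's output projection, keep its top singular directions as the "written" subspace, and measure how much of a later head's query projection reads from that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 16, 2

# Earlier head writes into the residual stream through W_O (d_head x d_model).
W_O = rng.standard_normal((d_head, d_model))
# Later head reads from the residual stream through W_Q (d_model x d_head).
W_Q = rng.standard_normal((d_model, d_head))

# Low-rank subspace the writer uses most: top right-singular vectors of W_O.
_, S, Vt = np.linalg.svd(W_O)
channel = Vt[:rank]                      # (rank, d_model) orthonormal basis of the written subspace

# Fraction of the reader's weight energy that lies inside that channel.
# For random weights this is about rank / d_model; a communicating head pair in a
# trained LM would be expected to show a much larger fraction.
overlap = np.linalg.norm(channel @ W_Q) ** 2 / np.linalg.norm(W_Q) ** 2
print(f"fraction of reader weight norm inside the rank-{rank} channel: {overlap:.3f}")
```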

--------------------------------------------------------------------------------------------------------

Towards Vision-Language Geo-Foundation Model: A Survey

Towards Vision-Language Geo-Foundation Model provides a comprehensive survey of Vision-Language Geo-Foundation Models (VLGFMs), which leverage multimodal geospatial data for diverse geo-perceptive capabilities. This could advance Earth observation and geospatial applications, such as disaster management and environmental monitoring.

Authors:  Yue Zhou, Litong Feng, Yiping Ke, Xue Jiang, Junchi Yan, Xue Yang, Wayne Zhang

Link:  https://arxiv.org/abs/2406.09385v1

Date: 2024-06-13

Summary:

Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation. Numerous geospatial image-text pair datasets and VLFMs fine-tuned on them have been proposed recently. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research significance. Then, we systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications of various multimodal geospatial tasks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To the best of our knowledge, this is the first comprehensive literature review of VLGFMs. We keep tracing related works at https://github.com/zytx121/Awesome-VLGFM.

--------------------------------------------------------------------------------------------------------

3D Building Generation in Minecraft via Large Language Models

3D Building Generation in Minecraft via Large Language Models demonstrates the potential of large language models for generating 3D buildings in the sandbox game Minecraft. This could pave the way for more sophisticated procedural content generation in gaming and virtual world creation.

Authors:  Shiying Hu, Zengrong Huang, Chengpeng Hu, Jialin Liu

Link:  https://arxiv.org/abs/2406.08751v1

Date: 2024-06-13

Summary:

Recently, procedural content generation has exhibited considerable advancements in the domain of 2D game level generation, such as Super Mario Bros. and Sokoban, through large language models (LLMs). To further validate the capabilities of LLMs, this paper explores how LLMs contribute to the generation of 3D buildings in a sandbox game, Minecraft. We propose a Text to Building in Minecraft (T2BM) model, which involves refining prompts, decoding an interlayer representation, and repairing the output. Facades, indoor scenes, and functional blocks like doors are supported in the generation. Experiments are conducted to evaluate the completeness and satisfaction of buildings generated via LLMs. The results show that LLMs hold significant potential for 3D building generation. Given appropriate prompts, LLMs can generate correct buildings in Minecraft with complete structures and incorporate specific building blocks such as windows and beds, meeting the specified requirements of human users.
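
The generate-then-repair loop can be sketched in a few lines; llm_propose_layout below is a hypothetical stand-in for an actual LLM call, and the repair rule is illustrative rather than T2BM's exact procedure:

```python
def llm_propose_layout(prompt: str) -> dict:
    """Hypothetical stand-in for an LLM call that returns a block layout."""
    blocks = [{"pos": (x, y, z), "type": "stone"}
              for x in range(5) for z in range(5) for y in range(4)
              if x in (0, 4) or z in (0, 4) or y in (0, 3)]   # hollow 5x4x5 shell
    return {"blocks": blocks}

def repair(layout: dict) -> dict:
    """Patch a common defect: carve a doorway and place a door on the front wall."""
    if not any(b["type"] == "door" for b in layout["blocks"]):
        doorway = {(2, 0, 0), (2, 1, 0)}
        layout["blocks"] = [b for b in layout["blocks"] if b["pos"] not in doorway]
        layout["blocks"].append({"pos": (2, 0, 0), "type": "door"})
    return layout

building = repair(llm_propose_layout("a small stone cottage"))
print(any(b["type"] == "door" for b in building["blocks"]))   # True
```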

--------------------------------------------------------------------------------------------------------

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

TC-Bench is a new benchmark that evaluates the temporal compositionality of video generation models, assessing their ability to incorporate concept emergence and relation transitions over time. This could drive improvements in video generation models and their applications in media production and content creation.

Authors:  Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, William Yang Wang

Link:  https://arxiv.org/abs/2406.08656v1

Date: 2024-06-12

Summary:

Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.

--------------------------------------------------------------------------------------------------------

HDNet: Physics-Inspired Neural Network for Flow Estimation based on Helmholtz Decomposition

HDNet is a physics-inspired neural network that performs Helmholtz decomposition of flow fields, separating them into divergence-only and curl-only components. This could enhance flow estimation in scientific imaging and fluid dynamics simulations.

Authors:  Miao Qi, Ramzi Idoughi, Wolfgang Heidrich

Link:  https://arxiv.org/abs/2406.08570v1

Date: 2024-06-12

Summary:

Flow estimation problems are ubiquitous in scientific imaging. Often, the underlying flows are subject to physical constraints that can be exploited in the flow estimation; for example, incompressible (divergence-free) flows are expected for many fluid experiments, while irrotational (curl-free) flows arise in the analysis of optical distortions and wavefront sensing. In this work, we propose a Physics-Inspired Neural Network (PINN) named HDNet, which performs a Helmholtz decomposition of an arbitrary flow field, i.e., it decomposes the input flow into a divergence-only and a curl-only component. HDNet can be trained exclusively on synthetic data generated by reverse Helmholtz decomposition, which we call Helmholtz synthesis. As a PINN, HDNet is fully differentiable and can easily be integrated into arbitrary flow estimation problems.
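
Helmholtz synthesis is easy to reproduce on a small grid: build a curl-free field as the gradient of a random scalar potential and a divergence-free field as the rotated gradient of a random stream function, then sum them to obtain a training input whose ground-truth decomposition is known. The numpy sketch below uses arbitrary grid-size and smoothing choices, not the paper's settings.

```python
import numpy as np

def smooth_random_field(n, rng):
    """Smooth random scalar field: low-pass filter white noise in Fourier space."""
    noise = rng.standard_normal((n, n))
    ky, kx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    lowpass = np.exp(-(kx**2 + ky**2) / (2 * 0.05**2))
    return np.real(np.fft.ifft2(np.fft.fft2(noise) * lowpass))

rng = np.random.default_rng(0)
n = 64
phi = smooth_random_field(n, rng)     # scalar potential  -> curl-free component
psi = smooth_random_field(n, rng)     # stream function   -> divergence-free component

# Helmholtz synthesis: flow = grad(phi) + rot(psi)
dphi_dy, dphi_dx = np.gradient(phi)
dpsi_dy, dpsi_dx = np.gradient(psi)
curl_free = np.stack([dphi_dx, dphi_dy])      # gradient of the potential
div_free = np.stack([dpsi_dy, -dpsi_dx])      # 90-degree-rotated gradient of the stream function
flow = curl_free + div_free                   # network input; the two parts are the labels

# Sanity checks: both quantities are near zero up to discretization/boundary error.
div = np.gradient(div_free[0], axis=1) + np.gradient(div_free[1], axis=0)
curl = np.gradient(curl_free[1], axis=1) - np.gradient(curl_free[0], axis=0)
print(np.abs(div).mean(), np.abs(curl).mean())
```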

--------------------------------------------------------------------------------------------------------

TasTe: Teaching Large Language Models to Translate through Self-Reflection

TasTe introduces a novel framework for enhancing the machine translation capabilities of large language models (LLMs) through self-reflection. By iteratively generating preliminary translations, self-assessing, and refining them, TasTe enables LLMs to leverage their instruction-following abilities more effectively, yielding higher-quality translations comparable to supervised neural machine translation systems. This approach holds promise for advancing LLM-based translation applications.

Authors:  Yutong Wang, Jiali Zeng, Xuebo Liu, Fandong Meng, Jie Zhou, Min Zhang

Link:  https://arxiv.org/abs/2406.08434v1

Date: 2024-06-12

Summary:

Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks. Techniques like instruction tuning have effectively enhanced the proficiency of LLMs in the downstream task of machine translation. However, the existing approaches fail to yield satisfactory translation outputs that match the quality of supervised neural machine translation (NMT) systems. One plausible explanation for this discrepancy is that the straightforward prompts employed in these methodologies are unable to fully exploit the acquired instruction-following capabilities. To this end, we propose the TasTe framework, which stands for translating through self-reflection. The self-reflection process includes two stages of inference. In the first stage, LLMs are instructed to generate preliminary translations and conduct self-assessments on these translations simultaneously. In the second stage, LLMs are tasked to refine these preliminary translations according to the evaluation results. The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods. Our work presents a promising approach to unleash the potential of LLMs and enhance their capabilities in MT. The codes and datasets are open-sourced at https://github.com/YutongWang1216/ReflectionLLMMT.
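
A minimal sketch of the two-stage loop, with call_llm as a hypothetical stand-in for whatever chat model is used and prompts that are illustrative rather than the paper's templates:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-model call (API client or local model)."""
    raise NotImplementedError("plug in your LLM client here")

def taste_translate(src: str, src_lang="German", tgt_lang="English") -> str:
    # Stage 1: draft translation plus a self-assessment in a single pass.
    stage1 = call_llm(
        f"Translate the following {src_lang} sentence into {tgt_lang}, then rate "
        f"your own translation as Good, Medium, or Bad and explain briefly.\n\n{src}"
    )
    # Stage 2: refine the draft according to the self-assessment.
    return call_llm(
        f"Here is a draft translation and a self-assessment of it:\n{stage1}\n\n"
        f"Produce an improved final {tgt_lang} translation of:\n{src}"
    )
```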

--------------------------------------------------------------------------------------------------------

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

MMWorld is a comprehensive benchmark for evaluating the "world model" abilities of Multimodal Large Language Models (MLLMs) in interpreting and reasoning about complex real-world dynamics captured in videos. Encompassing multi-discipline and multi-faceted reasoning tasks, MMWorld provides a challenging testbed for assessing and driving improvements in MLLMs' video understanding capabilities, benefiting applications like autonomous systems and multimedia analysis.

Authors:  Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang

Link:  https://arxiv.org/abs/2406.08407v2

Date: 2024-06-13

Summary:

Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

--------------------------------------------------------------------------------------------------------

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

This paper proposes the MCT Self-Refine (MCTSr) algorithm, an innovative combination of Large Language Models (LLMs) and Monte Carlo Tree Search (MCTS), to enhance LLM performance in complex mathematical reasoning tasks like Olympiad-level problems. By leveraging systematic exploration and self-refinement mechanisms, MCTSr improves decision-making accuracy and reliability, advancing LLM applications in domains requiring strategic and mathematical reasoning.

Authors:  Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang

Link:  https://arxiv.org/abs/2406.07394v2

Date: 2024-06-13

Summary:

This paper introduces the MCT Self-Refine (MCTSr) algorithm, an innovative integration of Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS), designed to enhance performance in complex mathematical reasoning tasks. Addressing the challenges of accuracy and reliability in LLMs, particularly in strategic and mathematical reasoning, MCTSr leverages systematic exploration and heuristic self-refine mechanisms to improve decision-making frameworks within LLMs. The algorithm constructs a Monte Carlo search tree through iterative processes of Selection, self-refine, self-evaluation, and Backpropagation, utilizing an improved Upper Confidence Bound (UCB) formula to optimize the exploration-exploitation balance. Extensive experiments demonstrate MCTSr's efficacy in solving Olympiad-level mathematical problems, significantly improving success rates across multiple datasets, including GSM8K, GSM Hard, MATH, and Olympiad-level benchmarks such as Math Odyssey, AIME, and OlympiadBench. The study advances the application of LLMs in complex reasoning tasks and sets a foundation for future AI integration, enhancing decision-making accuracy and reliability in LLM-driven applications.
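
A compact sketch of the loop under stated assumptions: answers are tree nodes, selection uses a standard UCB score, expansion asks the model to refine the selected answer, evaluation asks it to score the refinement, and the reward is backpropagated to the root. The refine and evaluate functions are hypothetical LLM-call stand-ins (random here), and the UCB constant and iteration budget are arbitrary.

```python
import math, random

def refine(answer: str) -> str:      # hypothetical LLM self-refine call
    return answer + " (refined)"

def evaluate(answer: str) -> float:  # hypothetical LLM self-evaluation in [0, 1]
    return random.random()

class Node:
    def __init__(self, answer, parent=None):
        self.answer, self.parent, self.children = answer, parent, []
        self.visits, self.value = 0, 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mctsr(seed_answer: str, iters=20) -> str:
    root = Node(seed_answer)
    for _ in range(iters):
        node = root
        while node.children:                              # selection by UCB
            node = max(node.children, key=lambda n: n.ucb())
        child = Node(refine(node.answer), parent=node)    # expansion via self-refine
        node.children.append(child)
        reward = evaluate(child.answer)                   # self-evaluation
        while child:                                      # backpropagation to the root
            child.visits += 1
            child.value += reward
            child = child.parent
    best = max(root.children, key=lambda n: n.visits) if root.children else root
    return best.answer

print(mctsr("x = 7"))
```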

--------------------------------------------------------------------------------------------------------

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

MLLMGuard is a multi-dimensional safety evaluation suite for Multimodal Large Language Models (MLLMs), addressing the critical need for comprehensive and rigorous assessment of safety risks. With a bilingual image-text evaluation dataset, utilities, and a lightweight evaluator, MLLMGuard covers key safety dimensions like privacy, bias, and toxicity, enabling more responsible deployment of MLLMs across practical applications.

Authors:  Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang

Link:  https://arxiv.org/abs/2406.07594v2

Date: 2024-06-13

Summary:

Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and fail to exhibit the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model to be evaluated lacks credibility, as it tends to exhibit a bias toward its own responses. In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with corresponding rich subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it integrates text-based and image-based red teaming techniques with meticulous annotation by human experts. This can prevent inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, a fully automated lightweight evaluator termed GuardRank is developed, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible.

--------------------------------------------------------------------------------------------------------

Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

MultiTrust is the first comprehensive and unified benchmark for evaluating the trustworthiness of Multimodal Large Language Models (MLLMs) across aspects like truthfulness, safety, robustness, fairness, and privacy. By revealing unexplored trustworthiness issues and risks, and providing a scalable toolbox, MultiTrust facilitates research into enhancing the reliability of MLLMs for high-stakes applications requiring trustworthy AI systems.

Authors:  Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, Hang Su, Yinpeng Dong, Jun Zhu

Link:  https://arxiv.org/abs/2406.07057v1

Date: 2024-06-11

Summary:

Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that the multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: https://multi-trust.github.io/.

--------------------------------------------------------------------------------------------------------

Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models

This paper introduces a novel method for appearance transfer in diffusion models, enabling the generation of images with the structure of a target image but painted with colors from a reference image, following semantic correspondences. This technique could benefit various applications in image editing, artistic creation, and visual content generation.

Authors:  Sooyeon Go, Kyungmook Choi, Minjung Shin, Youngjung Uh

Link:  https://arxiv.org/abs/2406.07008v1

Date: 2024-06-11

Summary:

As pretrained text-to-image diffusion models have become a useful tool for image synthesis, people want to specify the results in various ways. In this paper, we introduce a method to produce results with the same structure as a target image but painted with colors from a reference image, i.e., appearance transfer, especially following the semantic correspondence between the result and the reference. E.g., the resulting wing takes its color from the reference wing, not the reference head. Existing methods rely on the query-key similarity within the self-attention layer, usually producing defective results. To this end, we propose to find semantic correspondences and explicitly rearrange the features according to these correspondences. Extensive experiments show the superiority of our method in various aspects: preserving the structure of the target and reflecting the color from the reference according to the semantic correspondences, even when the two images are not aligned.
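
The core rearrangement step can be sketched directly, assuming per-location features have already been extracted for the target (structure) and reference (appearance) images: match each target location to its most similar reference location by cosine similarity and copy the reference feature there. This is a toy version of the idea, not the paper's full pipeline.

```python
import torch
import torch.nn.functional as F

def rearrange_by_correspondence(target_feats, reference_feats):
    """target_feats, reference_feats: (C, H, W) feature maps from any encoder.
    Returns reference features rearranged into the target's spatial layout."""
    C, H, W = target_feats.shape
    t = F.normalize(target_feats.reshape(C, -1).T, dim=1)     # (HW, C)
    r = F.normalize(reference_feats.reshape(C, -1).T, dim=1)  # (HW, C)
    sim = t @ r.T                                             # cosine similarity (HW, HW)
    match = sim.argmax(dim=1)                                 # best reference location per target location
    return reference_feats.reshape(C, -1)[:, match].reshape(C, H, W)

tgt, ref = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
print(rearrange_by_correspondence(tgt, ref).shape)            # torch.Size([64, 32, 32])
```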

--------------------------------------------------------------------------------------------------------

Evolving Subnetwork Training for Large Language Models

Evolving Subnetwork Training (EST) is a novel training paradigm for large language models that samples and gradually increases the size of subnetworks during training, leading to substantial computational savings without performance degradation. This approach addresses the high training costs of large language models, facilitating their further development and wider adoption across applications.

Authors:  Hanqi Li, Lu Chen, Da Ma, Zijian Wu, Su Zhu, Kai Yu

Link:  https://arxiv.org/abs/2406.06962v1

Date: 2024-06-11

Summary:

Large language models have ushered in a new era of artificial intelligence research. However, their substantial training costs hinder further development and widespread adoption. In this paper, inspired by the redundancy in the parameters of large language models, we propose a novel training paradigm: Evolving Subnetwork Training (EST). EST samples subnetworks from the layers of the large language model and from commonly used modules within each layer, Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP). By gradually increasing the size of the subnetworks during the training process, EST can save the cost of training. We apply EST to train the GPT2 and TinyLlama models, resulting in a 26.7% FLOPs saving for GPT2 and 25.0% for TinyLlama without an increase in loss on the pre-training dataset. Moreover, EST leads to performance improvements in downstream tasks, indicating that it benefits generalization. Additionally, we provide intuitive theoretical studies based on training dynamics and Dropout theory to ensure the feasibility of EST. Our code is available at https://github.com/OpenDFM/EST.
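
A minimal sketch of the sampling idea, reduced to layer-level sampling only (the paper also samples within-layer MHA and MLP modules): keep each block with a probability that grows over training, so early steps train cheap subnetworks and later steps train the full stack.

```python
import torch
import torch.nn as nn

class ESTTransformer(nn.Module):
    """Toy transformer stack whose blocks are stochastically skipped early in training."""
    def __init__(self, d_model=128, n_layers=8, n_heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x, keep_prob: float):
        for block in self.blocks:
            if self.training and torch.rand(()) > keep_prob:
                continue                       # this block sits out of the sampled subnetwork
            x = block(x)
        return x

def keep_schedule(step, total_steps, start=0.5):
    """Linearly grow the kept fraction from `start` to 1.0 over training."""
    return min(1.0, start + (1.0 - start) * step / total_steps)

model, x = ESTTransformer(), torch.randn(2, 16, 128)
for step in range(3):
    y = model(x, keep_prob=keep_schedule(step, total_steps=1000))
print(y.shape)   # torch.Size([2, 16, 128])
```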

--------------------------------------------------------------------------------------------------------

Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context

This paper proposes a framework grounded in behavioral economics to evaluate the decision-making behaviors of large language models (LLMs) under uncertainty, assessing risk preference, probability weighting, and loss aversion. By uncovering potential biases and misalignments with human norms, this work advocates for developing guidelines to ensure ethical deployment of LLMs in decision-making scenarios.

Authors:  Jingru Jia, Zehua Yuan, Junhao Pan, Paul McNamara, Deming Chen

Link:  https://arxiv.org/abs/2406.05972v1

Date: 2024-06-10

Summary:

When making decisions under uncertainty, individuals often deviate from rational behavior, which can be evaluated across three dimensions: risk preference, probability weighting, and loss aversion. Given the widespread use of large language models (LLMs) in decision-making processes, it is crucial to assess whether their behavior aligns with human norms and ethical expectations or exhibits potential biases. Several empirical studies have investigated the rationality and social behavior performance of LLMs, yet their internal decision-making tendencies and capabilities remain inadequately understood. This paper proposes a framework, grounded in behavioral economics, to evaluate the decision-making behaviors of LLMs. Through a multiple-choice-list experiment, we estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro. Our results reveal that LLMs generally exhibit patterns similar to humans, such as risk aversion and loss aversion, with a tendency to overweight small probabilities. However, there are significant variations in the degree to which these behaviors are expressed across different LLMs. We also explore their behavior when embedded with socio-demographic features, uncovering significant disparities. For instance, when modeled with attributes of sexual minority groups or physical disabilities, Claude-3-Opus displays increased risk aversion, leading to more conservative choices. These findings underscore the need for careful consideration of the ethical implications and potential biases in deploying LLMs in decision-making scenarios. Therefore, this study advocates for developing standards and guidelines to ensure that LLMs operate within ethical boundaries while enhancing their utility in complex decision-making environments.
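
To make the three dimensions concrete, here is a small prospect-theory-style calculation using standard textbook functional forms (not the paper's estimation procedure): a power value function with a loss-aversion coefficient and an inverse-S probability weighting function that overweights small probabilities. The parameter values are conventional illustrative choices.

```python
def value(x, alpha=0.88, lam=2.25):
    """Prospect-theory value function: concave for gains, steeper for losses."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** alpha)

def weight(p, gamma=0.61):
    """Inverse-S probability weighting: overweights small p, underweights large p."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

def prospect_value(lottery):
    """lottery: list of (probability, outcome) pairs."""
    return sum(weight(p) * value(x) for p, x in lottery)

risky = [(0.1, 100.0), (0.9, 0.0)]    # 10% chance of winning 100
safe = [(1.0, 9.0)]                   # a certain 9
print(weight(0.1))                    # ~0.19: the rare gain is overweighted relative to 0.1
print(prospect_value(risky), prospect_value(safe))
```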

--------------------------------------------------------------------------------------------------------

Automated Molecular Concept Generation and Labeling with Large Language Models

AutoMolCo is a novel framework that leverages Large Language Models (LLMs) to automatically generate and label predictive molecular concepts, iteratively refining them through LLM interactions. This automated approach surpasses the limitations of traditional concept-based models while maintaining explainability, promising to drive new discoveries in molecular science research by offering insights into predictions.

Authors:  Shichang Zhang, Botao Xia, Zimin Zhang, Qianli Wu, Fang Sun, Ziniu Hu, Yizhou Sun

Link:  https://arxiv.org/abs/2406.09612v1

Date: 2024-06-13

Summary:

Artificial intelligence (AI) is significantly transforming scientific research. Explainable AI methods, such as concept-based models (CMs), are promising for driving new scientific discoveries because they make predictions based on meaningful concepts and offer insights into the prediction process. In molecular science, however, explainable CMs are not as common compared to black-box models like Graph Neural Networks (GNNs), primarily due to their requirement for predefined concepts and manual labels for each instance, which demand domain knowledge and can be labor-intensive. This paper introduces a novel framework for Automated Molecular Concept (AutoMolCo) generation and labeling. AutoMolCo leverages the knowledge in Large Language Models (LLMs) to automatically generate predictive molecular concepts and label them for each molecule. These procedures are repeated through iterative interactions with LLMs to refine concepts, enabling simple linear models on the refined concepts to outperform GNNs and LLM in-context learning on several benchmarks. The whole AutoMolCo framework is automated without any human knowledge inputs in either concept generation, labeling, or refinement, thereby surpassing the limitations of extant CMs while maintaining their explainability and allowing easy intervention. Through systematic experiments on MoleculeNet and High-Throughput Experimentation (HTE) datasets, we demonstrate that the AutoMolCo-induced explainable CMs are beneficial and promising for molecular science research.

--------------------------------------------------------------------------------------------------------

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

JailbreakEval is a user-friendly toolkit for evaluating jailbreak attempts against Large Language Models (LLMs), addressing the lack of consensus on assessing the harmfulness of LLM responses to forbidden instructions. By providing a systematic taxonomy of evaluators and enabling customization, JailbreakEval simplifies the evaluation process and fosters inclusive standards for jailbreak research, benefiting the responsible development of LLMs.

Authors:  Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang

Link:  https://arxiv.org/abs/2406.09321v1

Date: 2024-06-13

Summary:

Jailbreak attacks aim to induce Large Language Models (LLMs) to generate harmful responses to forbidden instructions, presenting severe misuse threats to LLMs. Although research into jailbreak attacks and defenses is emerging, there is (surprisingly) no consensus on how to evaluate whether a jailbreak attempt is successful. In other words, the methods to assess the harmfulness of an LLM's response are varied, such as manual annotation or prompting GPT-4 in specific ways. Each approach has its own set of strengths and weaknesses, impacting their alignment with human values, as well as the time and financial cost. This diversity in evaluation presents challenges for researchers in choosing suitable evaluation methods and conducting fair comparisons across different jailbreak attacks and defenses. In this paper, we conduct a comprehensive analysis of jailbreak evaluation methodologies, drawing from nearly ninety jailbreak studies released between May 2023 and April 2024. Our study introduces a systematic taxonomy of jailbreak evaluators, offering in-depth insights into their strengths and weaknesses, along with the current status of their adaptation. Moreover, to facilitate subsequent research, we propose JailbreakEval, a user-friendly toolkit focusing on the evaluation of jailbreak attempts. It includes various well-known evaluators out of the box, so that users can obtain evaluation results with only a single command. JailbreakEval also allows users to customize their own evaluation workflow in a unified framework with ease of development and comparison. In summary, we regard JailbreakEval as a catalyst that simplifies the evaluation process in jailbreak research and fosters an inclusive standard for jailbreak evaluation within the community.

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.