Week Ending 7.7.2024

 

RESEARCH WATCH: 7.7.2024

 

Fine-Grained Multi-View Hand Reconstruction Using Inverse Rendering

This paper addresses the challenge of creating high-fidelity 3D hand models with detailed textures, which is crucial for advancing human-object interaction in virtual and augmented reality applications. The authors propose a novel method combining Graph Convolutional Networks and inverse rendering to reconstruct hand poses and intricate details from multi-view images. Their approach introduces a Hand Albedo and Mesh optimization module and a mesh-based neural rendering scheme to generate photo-realistic images while optimizing mesh geometry. This research could significantly improve hand tracking and gesture recognition in VR/AR environments, enhancing user experiences in gaming, remote collaboration, and medical training simulations.

Authors:  Qijun Gan, Wentong Li, Jinwei Ren, Jianke Zhu

Link:  https://arxiv.org/abs/2407.05680v1

Date: 2024-07-08

Summary:

Reconstructing high-fidelity hand models with intricate textures plays a crucial role in enhancing human-object interaction and advancing real-world applications. Although state-of-the-art methods excel at texture generation and image rendering, they often struggle to accurately capture geometric details, while learning-based approaches, despite offering better robustness and faster inference, tend to produce smoother results and require substantial amounts of training data. To address these issues, we present a novel fine-grained multi-view hand mesh reconstruction method that leverages inverse rendering to restore hand poses and intricate details. First, our approach predicts a parametric hand mesh model from multi-view images using a Graph Convolutional Network (GCN) based method. We further introduce a novel Hand Albedo and Mesh (HAM) optimization module that refines both the hand mesh and textures while preserving the mesh topology. In addition, we propose an effective mesh-based neural rendering scheme that simultaneously generates photo-realistic images and optimizes mesh geometry by fusing a pre-trained rendering network with vertex features. Comprehensive experiments on InterHand2.6M, DeepHandMesh, and a dataset collected by ourselves show that our approach outperforms state-of-the-art methods in both reconstruction accuracy and rendering quality. Code and dataset are publicly available at https://github.com/agnJason/FMHR.

--------------------------------------------------------------------------------------------------------

See Further for Parameter Efficient Fine-tuning by Standing on the Shoulders of Decomposition

As large language models continue to grow in size and capability, fine-tuning them for specific tasks becomes increasingly expensive and impractical. This paper tackles the challenge of parameter-efficient fine-tuning (PEFT) by providing a comprehensive mathematical analysis of existing methods from a decomposition perspective. The authors introduce two novel PEFT methods and a framework to enhance performance across various applications. This research could lead to more efficient and cost-effective ways of adapting large language models to specific domains or tasks, potentially democratizing access to state-of-the-art AI technologies for researchers and developers with limited computational resources.

Authors:  Chongjie Si, Xiaokang Yang, Wei Shen

Link:  https://arxiv.org/abs/2407.05417v1

Date: 2024-07-07

Summary:

The rapid expansion of large foundation models within the pre-training and fine-tuning framework has underscored that larger models often yield better results. However, the scaling up of large foundation models has led to soaring costs in fine-tuning and parameter storage, rendering extensive adaptations impractical. This challenge has sparked the development of parameter-efficient fine-tuning (PEFT), which focuses on optimizing a select subset of parameters while keeping the rest fixed, significantly lowering computational and storage overheads. While recent years have witnessed significant success in PEFT, the fundamental principles behind these methods remain underexplored. Here we take a first step toward unifying these approaches by dissecting them from a decomposition perspective. We initiate a comprehensive mathematical analysis of these methods, allowing us to delve deeply into their underlying mechanisms, and we explore the reasons behind the variations in performance among different techniques. Furthermore, inspired by our theoretical analysis, we introduce two novel PEFT methods alongside a simple yet effective framework designed to enhance the performance of PEFT techniques across various applications. Our empirical validations, conducted across multiple datasets, demonstrate the efficacy of these methods, showcasing both theoretical validity and practical performance improvements under the guidance of our analytical findings. We believe our work will deepen researchers' understanding of PEFT and other techniques, prompting further contemplation and advancing research across the whole community.
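
To make the decomposition perspective concrete, here is a minimal, hypothetical PyTorch sketch of one well-known PEFT technique, LoRA, in which the weight update is decomposed into two trainable low-rank factors while the original weights stay frozen. This is illustrative background only, not one of the paper's new methods; the rank and scaling values are arbitrary.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen linear layer plus a trainable low-rank update: y = W x + (alpha/r) * B A x."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                 # only the decomposed update is trained
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    layer = LoRALinear(nn.Linear(768, 768))
    out = layer(torch.randn(4, 768))                    # only A and B receive gradients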

--------------------------------------------------------------------------------------------------------

Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns?

This paper explores the capabilities of Large Language Models (LLMs) in solving complex mathematical problems with multiple unknowns. The authors introduce a new benchmark, BeyondX, designed to challenge LLMs with progressively more complex problems. They also propose a "Formulate-and-Solve" strategy to handle problems with an arbitrary number of unknowns. This research could have significant implications for advancing AI's mathematical reasoning abilities, potentially leading to more powerful tools for scientific computing, engineering simulations, and automated problem-solving in fields such as physics, economics, and operations research.

Authors:  Kuei-Chun Kao, Ruochen Wang, Cho-Jui Hsieh

Link:  https://arxiv.org/abs/2407.05134v1

Date: 2024-07-06

Summary:

Large Language Models (LLMs) have demonstrated remarkable performance in solving math problems, a hallmark of human intelligence. However, despite high success rates on current benchmarks, these benchmarks often feature simple problems with only one or two unknowns, which do not sufficiently challenge LLMs' reasoning capacities. This paper introduces a novel benchmark, BeyondX, designed to address these limitations by incorporating problems with multiple unknowns. Recognizing the challenges of proposing multi-unknown problems from scratch, we developed BeyondX using an innovative automated pipeline that progressively increases complexity by expanding the number of unknowns in simpler problems. An empirical study on BeyondX reveals that the performance of existing LLMs, even those fine-tuned specifically on math tasks, decreases significantly as the number of unknowns increases, with a performance drop of up to 70% observed in GPT-4. To tackle these challenges, we propose the Formulate-and-Solve strategy, a generalized prompting approach that effectively handles problems with an arbitrary number of unknowns. Our findings reveal that this strategy not only enhances LLM performance on the BeyondX benchmark but also provides deeper insights into the computational limits of LLMs when faced with more complex mathematical challenges.
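
The "formulate, then solve" division of labor can be illustrated with a toy example: once a word problem has been translated into a system of equations, a symbolic solver can handle an arbitrary number of unknowns. The sketch below, using SymPy with made-up equations, is not the paper's prompting pipeline, only an illustration of the multi-unknown setting BeyondX targets.

    from sympy import symbols, Eq, solve

    # Hypothetical three-unknown word problem already "formulated" into equations
    # (in a Formulate-and-Solve setup, an LLM would produce these from the problem text).
    x, y, z = symbols("x y z")
    equations = [
        Eq(x + y + z, 30),    # e.g., total count of three item types
        Eq(x - y, 2),         # e.g., a stated difference
        Eq(z - 2*y, 0),       # e.g., one quantity is twice another
    ]
    print(solve(equations, (x, y, z)))   # {x: 9, y: 7, z: 14}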

--------------------------------------------------------------------------------------------------------

Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

This paper presents a novel approach to enhancing multimodal large language models (MLLMs) by integrating fine-grained external knowledge from specialized vision models. The authors propose embedding this knowledge directly into a spatial embedding map as a visual prompt, improving the model's ability to understand detailed visual elements. This research could lead to more accurate and context-aware visual question answering systems, potentially improving applications in areas such as medical image analysis, satellite imagery interpretation, and automated visual inspection in manufacturing.

Authors:  Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip Torr, Lu Yuan

Link:  https://arxiv.org/abs/2407.04681v1

Date: 2024-07-05

Summary:

In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs, limiting their ability to answer questions requiring an understanding of detailed or localized visual elements. Drawing inspiration from the Retrieval-Augmented Generation (RAG) concept, this paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models (e.g., instance segmentation/OCR models), into MLLMs. This is a promising yet underexplored direction for enhancing MLLMs' performance. Our approach diverges from concurrent works, which transform external knowledge into additional text prompts, necessitating the model to indirectly learn the correspondence between visual content and text coordinates. Instead, we propose embedding fine-grained knowledge information directly into a spatial embedding map as a visual prompt. This design can be effortlessly incorporated into various MLLMs, such as LLaVA and Mipha, considerably improving their visual understanding performance. Through rigorous experiments, we demonstrate that our method can enhance MLLM performance across nine benchmarks, amplifying their fine-grained context-aware capabilities.
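
As a rough illustration of the spatial-embedding idea, the hypothetical sketch below embeds a per-pixel class map (e.g., from an instance segmentation model) into a feature grid and adds it to the vision encoder's output. The module names, shapes, and the fusion-by-addition choice are assumptions for illustration, not the paper's implementation.

    import torch
    import torch.nn as nn

    class SpatialKnowledgePrompt(nn.Module):
        """Toy sketch: turn per-pixel class IDs into a spatial embedding map and
        add it to the vision encoder's feature grid as a visual prompt."""
        def __init__(self, num_classes: int, dim: int):
            super().__init__()
            self.embed = nn.Embedding(num_classes + 1, dim)   # +1 for "no external knowledge"

        def forward(self, vision_feats, class_map):
            # vision_feats: (B, H, W, dim); class_map: (B, H, W) integer labels
            return vision_feats + self.embed(class_map)

    prompt = SpatialKnowledgePrompt(num_classes=80, dim=256)
    feats = torch.randn(1, 24, 24, 256)
    labels = torch.randint(0, 81, (1, 24, 24))
    out = prompt(feats, labels)   # knowledge-augmented features fed to the MLLM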

--------------------------------------------------------------------------------------------------------

How Reliable and Stable are Explanations of XAI Methods?

This study investigates the reliability and stability of various Explainable Artificial Intelligence (XAI) methods when faced with data perturbations. The authors create a pipeline to test different XAI methods using the diabetes dataset and four machine learning models. Their findings reveal that most XAI methods are sensitive to perturbations, highlighting the need for caution when interpreting explanations generated by these methods. This research is crucial for developing more robust and trustworthy AI systems, particularly in high-stakes applications such as healthcare diagnostics, financial decision-making, and autonomous vehicle control.

Authors:  José Ribeiro, Lucas Cardoso, Vitor Santos, Eduardo Carvalho, Níkolas Carneiro, Ronnie Alves

Link:  https://arxiv.org/abs/2407.03108v1

Date: 2024-07-03

Summary:

Black box models are increasingly being used in people's daily lives. Alongside this increase, Explainable Artificial Intelligence (XAI) methods have emerged to generate additional explanations of how a model makes certain predictions. Methods such as Dalex, Eli5, eXirt, Lofo and Shap offer different proposals and methodologies for explaining black box models in a model-agnostic way. With these methods come questions such as "How reliable and stable are XAI methods?". To shed light on this question, this research builds a pipeline that runs experiments using the diabetes dataset and four machine learning models (LGBM, MLP, DT and KNN), creates different levels of perturbation of the test data, and generates explanations from the eXirt method regarding model confidence as well as feature relevance rankings from all of the XAI methods mentioned, in order to measure their stability under perturbation. The results show that eXirt was able to identify the most reliable models among those used, and that current XAI methods are sensitive to perturbations, with the exception of one specific method.
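
A stripped-down version of such a stability check might look like the sketch below, which uses scikit-learn's permutation importance as a stand-in for the ranking-based XAI methods in the paper and scikit-learn's diabetes regression data as a stand-in dataset, then compares feature-relevance rankings before and after Gaussian perturbation with Kendall's tau. The noise levels and model choice are illustrative.

    import numpy as np
    from scipy.stats import kendalltau
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

    def feature_importance(model, X, y):
        # Stand-in "XAI method": permutation importance as a feature-relevance score
        imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
        return imp.importances_mean

    base_imp = feature_importance(model, X_te, y_te)
    rng = np.random.default_rng(0)
    for noise in (0.05, 0.1, 0.2):                          # increasing perturbation levels
        X_pert = X_te + rng.normal(0.0, noise * X_te.std(axis=0), X_te.shape)
        tau, _ = kendalltau(base_imp, feature_importance(model, X_pert, y_te))
        print(f"noise level {noise}: rank stability (Kendall tau) = {tau:.2f}")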

--------------------------------------------------------------------------------------------------------

Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text

This paper addresses the growing concern of AI-generated text misuse by proposing a RoBERTa-BiLSTM based classifier to distinguish between AI-generated and human-written text. The authors' model achieved an accuracy of 80.83% in the SemEval 2024 competition. This research contributes to the development of automated systems for identifying machine-generated text, which could have significant applications in journalism, academia, and online content moderation. Such tools could help maintain the integrity of written content and combat the spread of AI-generated misinformation.

Authors:  Jainit Sushil Bafna, Hardik Mittal, Suyash Sethia, Manish Shrivastava, Radhika Mamidi

Link:  https://arxiv.org/abs/2407.02978v1

Date: 2024-07-03

Summary:

Large Language Models (LLMs) have showcased impressive abilities in generating fluent responses to diverse user queries. However, concerns regarding the potential misuse of such texts in journalism, educational, and academic contexts have surfaced. SemEval 2024 introduces the task of Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, aiming to develop automated systems for identifying machine-generated text and detecting potential misuse. In this paper, we i) propose a RoBERTa-BiLSTM based classifier designed to classify text into two categories, AI-generated or human-written, and ii) conduct a comparative study of our model against baseline approaches to evaluate its effectiveness. This paper contributes to the advancement of automatic text detection systems in addressing the challenges posed by machine-generated text misuse. Our architecture ranked 46th out of 125 on the official leaderboard, with an accuracy of 80.83%.
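
For readers curious what a RoBERTa-BiLSTM detector of this kind might look like, here is a minimal sketch using Hugging Face Transformers and PyTorch; the hidden size, pooling choice, and training details are assumptions rather than the authors' exact configuration.

    import torch
    import torch.nn as nn
    from transformers import RobertaModel, RobertaTokenizer

    class RobertaBiLSTMClassifier(nn.Module):
        def __init__(self, hidden: int = 256):
            super().__init__()
            self.encoder = RobertaModel.from_pretrained("roberta-base")
            self.lstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                                batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, 2)       # AI-generated vs. human-written

        def forward(self, input_ids, attention_mask):
            tokens = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            seq, _ = self.lstm(tokens)                  # contextualize RoBERTa features with a BiLSTM
            return self.head(seq[:, 0])                 # classify from the first position

    tok = RobertaTokenizer.from_pretrained("roberta-base")
    batch = tok(["Example passage to classify."], return_tensors="pt",
                truncation=True, padding=True)
    logits = RobertaBiLSTMClassifier()(batch["input_ids"], batch["attention_mask"])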

--------------------------------------------------------------------------------------------------------

Zero-X: A Blockchain-Enabled Open-Set Federated Learning Framework for Zero-Day Attack Detection in IoV

This paper introduces Zero-X, an innovative security framework for detecting both zero-day and known attacks in the Internet of Vehicles (IoV). The authors combine deep neural networks with Open-Set Recognition and use blockchain technology to facilitate trusted and decentralized federated learning. This research addresses the critical need for robust cybersecurity measures in connected vehicle ecosystems, potentially enhancing the safety and reliability of future intelligent transportation systems. The framework's privacy-preserving approach could also promote collaboration between different stakeholders in the automotive and security industries.

Authors:  Abdelaziz Amara korba, Abdelwahab Boualouache, Yacine Ghamri-Doudane

Link:  https://arxiv.org/abs/2407.02969v1

Date: 2024-07-03

Summary:

The Internet of Vehicles (IoV) is a crucial technology for Intelligent Transportation Systems (ITS) that integrates vehicles with the Internet and other entities. The emergence of 5G and the forthcoming 6G networks presents an enormous potential to transform the IoV by enabling ultra-reliable, low-latency, and high-bandwidth communications. Nevertheless, as connectivity expands, cybersecurity threats have become a significant concern. The issue has been further exacerbated by the rising number of zero-day (0-day) attacks, which can exploit unknown vulnerabilities and bypass existing Intrusion Detection Systems (IDSs). In this paper, we propose Zero-X, an innovative security framework that effectively detects both 0-day and N-day attacks. The framework achieves this by combining deep neural networks with Open-Set Recognition (OSR). Our approach introduces a novel scheme that uses blockchain technology to facilitate trusted and decentralized federated learning (FL) of the Zero-X framework. This scheme also prioritizes privacy preservation, enabling both Connected and Autonomous Vehicles (CAVs) and Security Operation Centers (SOCs) to contribute their unique knowledge while protecting the privacy of their sensitive data. To the best of our knowledge, this is the first work to leverage OSR in combination with privacy-preserving FL to identify both 0-day and N-day attacks in the realm of IoV. In-depth experiments on two recent network traffic datasets show that the proposed framework achieves a high detection rate while minimizing the false positive rate. Comparison with related work shows that the Zero-X framework outperforms existing solutions.
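
Open-Set Recognition itself can be approximated very simply: predictions for known attack classes are accepted only when the classifier is confident, and everything else is flagged as a potential zero-day. The toy sketch below uses a softmax-confidence threshold as a stand-in for the paper's OSR mechanism; the threshold value is arbitrary.

    import numpy as np

    def open_set_classify(logits, threshold=0.8):
        """Toy open-set rule: keep known-class predictions only when confident;
        flag low-confidence flows as 'unknown' (potential zero-day)."""
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        preds = probs.argmax(axis=1)
        preds[probs.max(axis=1) < threshold] = -1   # -1 = unknown / zero-day candidate
        return preds

    logits = np.array([[4.0, 0.5, 0.2],    # confidently a known class
                       [1.1, 1.0, 0.9]])   # ambiguous -> flagged as unknown
    print(open_set_classify(logits))       # [ 0 -1]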

--------------------------------------------------------------------------------------------------------

Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials

This paper presents Meta 3D AssetGen, an advanced text-to-3D generation system that produces high-quality meshes with texture and material control. The system outputs physically-based rendering (PBR) materials, allowing for realistic relighting of generated 3D objects. This technology could revolutionize content creation for video games, virtual reality experiences, and 3D modeling in industries such as architecture and product design. By enabling rapid generation of detailed 3D assets from text descriptions, AssetGen could significantly reduce production time and costs in these fields.

Authors:  Yawar Siddiqui, Tom Monnier, Filippos Kokkinos, Mahendra Kariya, Yanir Kleiman, Emilien Garreau, Oran Gafni, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, David Novotny

Link:  https://arxiv.org/abs/2407.02445v1

Date: 2024-07-02

Summary:

We present Meta 3D AssetGen (AssetGen), a significant advancement in text-to-3D generation which produces faithful, high-quality meshes with texture and material control. Compared to works that bake shading into the 3D object's appearance, AssetGen outputs physically-based rendering (PBR) materials, supporting realistic relighting. AssetGen first generates several views of the object with factored shaded and albedo appearance channels, and then reconstructs colours, metalness and roughness in 3D, using a deferred shading loss for efficient supervision. It also uses a signed distance function to represent the 3D shape more reliably and introduces a corresponding loss for direct shape supervision. This is implemented using fused kernels for high memory efficiency. After mesh extraction, a texture refinement transformer operating in UV space significantly improves sharpness and details. AssetGen achieves a 17% improvement in Chamfer Distance and 40% in LPIPS over the best concurrent work for few-view reconstruction, and a human preference of 72% over the best industry competitors of comparable speed, including those that support PBR. Project page with generated assets: https://assetgen.github.io

--------------------------------------------------------------------------------------------------------

SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

This paper introduces SoP, a framework for designing jailbreak prompts to bypass the safety alignments of large language models (LLMs). Inspired by the concept of social facilitation, SoP generates and optimizes multiple jailbreak characters to overcome LLM guardrails. While this research raises ethical concerns, it contributes to understanding LLM vulnerabilities and could lead to the development of more robust safety measures. The findings could inform the design of future LLMs and help create more effective defense strategies against malicious attacks on AI systems.

Authors:  Yan Yang, Zeguan Xiao, Xin Lu, Hongru Wang, Hailiang Huang, Guanhua Chen, Yun Chen

Link:  https://arxiv.org/abs/2407.01902v1

Date: 2024-07-02

Summary:

The widespread applications of large language models (LLMs) have brought about concerns regarding their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework to design jailbreak prompts automatically. Inspired by the social facilitation concept, SoP generates and optimizes multiple jailbreak characters to bypass the guardrails of the target LLM. Different from previous work which relies on proprietary LLMs or seed jailbreak templates crafted by human expertise, SoP can generate and optimize the jailbreak prompt in a cold-start scenario using open-sourced LLMs without any seed jailbreak templates. Experimental results show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, while also exploring defense strategies against the jailbreak attack designed by SoP. Code is available at https://github.com/Yang-Yan-Yang-Yan/SoP.

--------------------------------------------------------------------------------------------------------

Multi-Object Hallucination in Vision-Language Models

This study investigates the phenomenon of multi-object hallucination in large vision language models (LVLMs), where models invent non-existent objects or become distracted when focusing on multiple objects simultaneously. The authors introduce a new evaluation protocol called ROPE and analyze factors contributing to hallucination. This research is crucial for improving the reliability of LVLMs in real-world applications such as visual assistive technologies, autonomous navigation systems, and image-based search engines, where accurate multi-object recognition is essential.

Authors:  Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai

Link:  https://arxiv.org/abs/2407.06192v1

Date: 2024-07-08

Summary:

Large vision language models (LVLMs) often suffer from object hallucination, producing objects not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. With comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we found that (1) LVLMs suffer more hallucinations when focusing on multiple objects compared to a single object; (2) the tested object class distribution affects hallucination behaviors, indicating that LVLMs may follow shortcuts and spurious correlations; and (3) hallucinatory behaviors are influenced by data-specific factors, salience and frequency, and model intrinsic behaviors. We hope to enable LVLMs to recognize and reason about multiple objects that often occur in realistic visual scenes, provide insights, and quantify our progress towards mitigating the issues.
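
In spirit, protocols like ROPE score how often a model names the wrong object when several objects are probed within a single image. The toy scorer below is only a simplified illustration of that idea, not the official ROPE implementation or its scoring rules.

    def multi_object_hallucination_rate(predicted, ground_truth):
        """Fraction of probed objects the model names incorrectly
        (simplified per-query scoring, not the official ROPE protocol)."""
        wrong = sum(p != g for p, g in zip(predicted, ground_truth))
        return wrong / len(ground_truth)

    # Model asked to name the objects at three referred locations in one image:
    print(multi_object_hallucination_rate(["dog", "frisbee", "bench"],
                                          ["dog", "frisbee", "tree"]))   # 0.33...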

--------------------------------------------------------------------------------------------------------

Vision-Language Models under Cultural and Inclusive Considerations

This paper addresses the need for more inclusive and culturally diverse vision-language models (VLMs) to assist visually impaired individuals. The authors create a culture-centric evaluation benchmark by filtering the VizWiz dataset and evaluating several VLMs in culturally diverse settings. While showing promise, the study identifies challenges such as hallucination and misalignment between automatic metrics and human judgment. This research could lead to more accurate and culturally sensitive image description technologies, improving accessibility for visually impaired users from diverse backgrounds and enhancing cross-cultural communication in visual assistance applications.

Authors:  Antonia Karamolegkou, Phillip Rust, Yong Cao, Ruixiang Cui, Anders Søgaard, Daniel Hershcovich

Link:  https://arxiv.org/abs/2407.06177v1

Date: 2024-07-08

Summary:

Large vision-language models (VLMs) can assist visually impaired people by describing images from their daily lives. Current evaluation datasets may not reflect diverse cultural user backgrounds or the situational context of this use case. To address this problem, we create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset with images taken by people who are blind. We then evaluate several VLMs, investigating their reliability as visual assistants in a culturally diverse setting. While our results for state-of-the-art models are promising, we identify challenges such as hallucination and misalignment of automatic evaluation metrics with human judgment. We make our survey, data, code, and model outputs publicly available.

--------------------------------------------------------------------------------------------------------

Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

This study proposes a novel approach to classifying traffic accidents using scene graphs integrated with vision-language models. By representing traffic scenes as graphs with vehicles as nodes and their relationships as edges, the authors develop a multi-stage, multimodal pipeline for accident classification. The method achieves improved accuracy on the Detection of Traffic Anomaly (DoTA) benchmark. This research could significantly enhance autonomous driving and road monitoring systems, leading to better accident prevention, faster emergency response, and improved traffic management by enabling more accurate and context-aware accident classification.

Authors:  Aaron Lohner, Francesco Compagno, Jonathan Francis, Alessandro Oltramari

Link:  https://arxiv.org/abs/2407.05910v1

Date: 2024-07-08

Summary:

Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can appear in a wide variety of forms, and understanding what type of accident is taking place may be useful to prevent it from reoccurring. The task of being able to classify a traffic scene as a specific type of accident is the focus of this work. We approach the problem by likening a traffic scene to a graph, where objects such as cars can be represented as nodes, and relative distances and directions between them as edges. This representation of an accident can be referred to as a scene graph, and is used as input for an accident classifier. Better results can be obtained with a classifier that fuses the scene graph input with representations from vision and language. This work introduces a multi-stage, multimodal pipeline to pre-process videos of traffic accidents, encode them as scene graphs, and align this representation with vision and language modalities for accident classification. When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, representing an increase of close to 5 percentage points from the case where scene graph information is not taken into account.
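
A scene graph of this kind is straightforward to sketch: detected objects become nodes and pairwise spatial relations become edge attributes. The hypothetical example below uses NetworkX with 2D centroids and distance/bearing edges as a simplification of the relations the paper encodes.

    import math
    import networkx as nx

    def build_scene_graph(objects):
        """Toy scene graph: each detected object is a node; edges store relative
        distance and bearing between object centroids."""
        g = nx.Graph()
        for name, (x, y) in objects.items():
            g.add_node(name, x=x, y=y)
        names = list(objects)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                dx, dy = objects[b][0] - objects[a][0], objects[b][1] - objects[a][1]
                g.add_edge(a, b,
                           distance=math.hypot(dx, dy),
                           bearing_deg=math.degrees(math.atan2(dy, dx)))
        return g

    frame = {"car_1": (0.0, 0.0), "car_2": (4.0, 3.0), "pedestrian_1": (1.0, -2.0)}
    g = build_scene_graph(frame)
    print(g.edges(data=True))   # graph encoding fed to the accident classifier alongside vision/language features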

--------------------------------------------------------------------------------------------------------

Knowledge Composition using Task Vectors with Learned Anisotropic Scaling

This paper introduces aTLAS, an algorithm for combining parameter blocks in pre-trained models using learned anisotropic scaling. The method enhances knowledge composition and transfer, improving task arithmetic, few-shot recognition, and test-time adaptation. The authors demonstrate that their approach allows for more disentangled task vectors, better generalization with limited data, and reduced memory footprint. This research could lead to more efficient and flexible transfer learning techniques, potentially revolutionizing how AI models are adapted for new tasks and domains, especially in resource-constrained environments.

Authors:  Frederic Z. Zhang, Paul Albert, Cristian Rodriguez-Opazo, Anton van den Hengel, Ehsan Abbasnejad

Link:  https://arxiv.org/abs/2407.02880v1

Date: 2024-07-03

Summary:

Pre-trained models produce strong generic representations that can be adapted via fine-tuning. The learned weight difference relative to the pre-trained model, known as a task vector, characterises the direction and stride of fine-tuning. The significance of task vectors is such that simple arithmetic operations on them can be used to combine diverse representations from different domains. This paper builds on these properties of task vectors and aims to answer (1) whether components of task vectors, particularly parameter blocks, exhibit similar characteristics, and (2) how such blocks can be used to enhance knowledge composition and transfer. To this end, we introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level. We show that such linear combinations explicitly exploit the low intrinsic dimensionality of pre-trained models, with only a few coefficients being the learnable parameters. Furthermore, composition of parameter blocks leverages the already learned representations, thereby reducing the dependency on large amounts of data. We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives. In particular, we show that (1) learned anisotropic scaling allows task vectors to be more disentangled, causing less interference in composition; (2) task vector composition excels with scarce or no labeled data and is less prone to domain shift, thus leading to better generalisability; (3) mixing the most informative parameter blocks across different task vectors prior to training can reduce the memory footprint and improve the flexibility of knowledge transfer. Moreover, we show the potential of aTLAS as a PEFT method, particularly with less data, and demonstrate its scalability.
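
Task-vector arithmetic is easy to sketch: a task vector is the per-parameter difference between fine-tuned and pre-trained weights, and anisotropic scaling assigns each parameter block of each task vector its own coefficient. The toy example below illustrates only the composition step; in aTLAS these coefficients are learned, which the sketch does not show.

    import torch

    def task_vector(pretrained, finetuned):
        """Task vector: per-parameter difference between fine-tuned and pre-trained weights."""
        return {k: finetuned[k] - pretrained[k] for k in pretrained}

    def compose(pretrained, task_vectors, coeffs):
        """Anisotropic composition (sketch): each parameter block of each task vector
        is scaled by its own coefficient before being added to the pre-trained weights."""
        merged = {k: v.clone() for k, v in pretrained.items()}
        for tv, c in zip(task_vectors, coeffs):
            for k in merged:
                merged[k] += c[k] * tv[k]       # c[k]: one (learnable) scalar per block
        return merged

    pre = {"layer.weight": torch.zeros(2, 2)}
    ft_a = {"layer.weight": torch.ones(2, 2)}
    ft_b = {"layer.weight": 2 * torch.ones(2, 2)}
    tvs = [task_vector(pre, ft_a), task_vector(pre, ft_b)]
    coeffs = [{"layer.weight": torch.tensor(0.5)}, {"layer.weight": torch.tensor(0.25)}]
    print(compose(pre, tvs, coeffs)["layer.weight"])   # 0.5*1 + 0.25*2 = 1.0 everywhere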

--------------------------------------------------------------------------------------------------------

Without Pain -- Clustering Categorical Data Using a Bayesian Mixture of Finite Mixtures of Latent Class Analysis Models

This paper presents a Bayesian approach for clustering multivariate categorical data using a two-layer mixture of finite mixtures model. The method allows for unknown cluster numbers and within-cluster variable associations. The authors demonstrate its effectiveness using artificial data and a low back pain dataset. This research could have significant applications in various fields dealing with categorical data, such as medical diagnosis, market segmentation, and social science research, by providing more accurate and interpretable clustering results for complex categorical datasets.

Authors:  Gertraud Malsiner-Walli, Bettina Grün, Sylvia Frühwirth-Schnatter

Link:  https://arxiv.org/abs/2407.05431v1

Date: 2024-07-07

Summary:

We propose a Bayesian approach for model-based clustering of multivariate categorical data where variables are allowed to be associated within clusters and the number of clusters is unknown. The approach uses a two-layer mixture of finite mixtures model where the cluster distributions are approximated using latent class analysis models. A careful specification of priors with suitable hyperparameter values is crucial to identify the two-layer structure and obtain a parsimonious cluster solution. We outline the Bayesian estimation based on Markov chain Monte Carlo sampling with the telescoping sampler and describe how to obtain an identified clustering model by resolving the label switching issue. Empirical demonstrations in a simulation study using artificial data as well as a data set on low back pain indicate the good clustering performance of the proposed approach, provided hyperparameters are selected which induce sufficient shrinkage.
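
The two-layer structure is easiest to see generatively: draw a cluster, draw a latent class within that cluster, then draw each categorical item from class-specific probabilities. The sketch below simulates data from such a model with made-up parameters; it does not implement the paper's prior specification, MCMC estimation, or telescoping sampler.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_two_layer_lca(n, cluster_weights, class_weights, item_probs):
        """Generative sketch of the two-layer model: cluster -> latent class within
        the cluster -> each categorical item drawn from class-specific probabilities."""
        data, clusters = [], []
        for _ in range(n):
            k = rng.choice(len(cluster_weights), p=cluster_weights)       # cluster
            c = rng.choice(len(class_weights[k]), p=class_weights[k])     # latent class in cluster k
            row = [rng.choice(len(p), p=p) for p in item_probs[k][c]]     # categorical items
            data.append(row)
            clusters.append(k)
        return np.array(data), np.array(clusters)

    # Two clusters, each approximated by two latent classes over three binary items.
    cluster_weights = [0.6, 0.4]
    class_weights = [[0.5, 0.5], [0.7, 0.3]]
    item_probs = [[[[0.9, 0.1]] * 3, [[0.2, 0.8]] * 3],
                  [[[0.5, 0.5]] * 3, [[0.1, 0.9]] * 3]]
    X, z = sample_two_layer_lca(200, cluster_weights, class_weights, item_probs)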

--------------------------------------------------------------------------------------------------------

InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

This paper introduces InverseCoder, a novel approach to improving instruction-tuned code large language models (LLMs) by generating data from the model itself rather than relying on closed-source LLMs. The authors propose INVERSE-INSTRUCT, which summarizes instructions from code snippets and uses them for further fine-tuning. This method shows improved performance across various coding tasks. The research could lead to more powerful and accessible code generation tools, potentially revolutionizing software development practices and making advanced coding capabilities more widely available to developers of all skill levels.

Authors:  Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen

Link:  https://arxiv.org/abs/2407.05700v1

Date: 2024-07-08

Summary:

Recent advancements in open-source code large language models (LLMs) have demonstrated remarkable coding abilities by fine-tuning on the data generated from powerful closed-source LLMs such as GPT-3.5 and GPT-4 for instruction tuning. This paper explores how to further improve an instruction-tuned code LLM by generating data from itself rather than querying closed-source LLMs. Our key observation is the misalignment between the translation of formal and informal languages: translating formal language (i.e., code) to informal language (i.e., natural language) is more straightforward than the reverse. Based on this observation, we propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of the reverse. Specifically, given an instruction tuning corpus for code and the resulting instruction-tuned code LLM, we ask the code LLM to generate additional high-quality instructions for the original corpus through code summarization and self-evaluation. Then, we fine-tune the base LLM on the combination of the original corpus and the self-generated one, which yields a stronger instruction-tuned LLM. We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks, including Python text-to-code generation, multilingual coding, and data-science code generation.
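
The inverse data flow can be sketched in a few lines: summarize code into an instruction, let the model self-evaluate the pairing, and keep only the pairs that pass. In the hypothetical sketch below, `llm` is a placeholder callable and the prompts are illustrative, not the paper's.

    def inverse_instruct(code_snippets, llm):
        """Sketch of the inverse data flow: code -> summarized instruction -> self-evaluated
        (instruction, response) pairs. `llm` is any callable mapping a prompt to text."""
        pairs = []
        for code in code_snippets:
            instruction = llm("Summarize what the following code does as a single "
                              "natural-language programming task:\n" + code)
            verdict = llm("Does this code correctly solve the task?\nTask: " + instruction +
                          "\nCode:\n" + code + "\nAnswer yes or no.")
            if verdict.strip().lower().startswith("yes"):
                pairs.append({"instruction": instruction, "response": code})
        return pairs   # appended to the original corpus for another round of fine-tuning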

--------------------------------------------------------------------------------------------------------

Harnessing the Power of LLMs: Automating Unit Test Generation for High-Performance Computing

This study explores the use of large language models (LLMs) for generating unit tests in high-performance computing software, addressing the challenges of complex logic and parallel processing. The authors demonstrate that LLMs can produce mostly correct and comprehensive unit tests for C++ parallel programs, despite some limitations. This research could significantly improve software quality assurance in scientific and high-performance computing applications, reducing the time and expertise required for creating effective unit tests and potentially leading to more robust and reliable software in these critical domains.

Authors:  Rabimba Karanjai, Aftab Hussain, Md Rafiqul Islam Rabin, Lei Xu, Weidong Shi, Mohammad Amin Alipour

Link:  https://arxiv.org/abs/2407.05202v1

Date: 2024-07-06

Summary:

Unit testing is crucial in software engineering for ensuring quality. However, it's not widely used in parallel and high-performance computing software, particularly scientific applications, due to their smaller, diverse user base and complex logic. These factors make unit testing challenging and expensive, as it requires specialized knowledge and existing automated tools are often ineffective. To address this, we propose an automated method for generating unit tests for such software, considering their unique features like complex logic and parallel processing. Recently, large language models (LLMs) have shown promise in coding and testing. We explored the capabilities of Davinci (text-davinci-002) and ChatGPT (gpt-3.5-turbo) in creating unit tests for C++ parallel programs. Our results show that LLMs can generate mostly correct and comprehensive unit tests, although they have some limitations, such as repetitive assertions and blank test cases.

--------------------------------------------------------------------------------------------------------

LaRa: Efficient Large-Baseline Radiance Fields

This paper presents LaRa, a novel method for efficient large-baseline radiance field reconstruction. The approach combines local and global reasoning in transformer layers, representing scenes as Gaussian Volumes with Group Attention Layers. The method shows high fidelity in reconstructing 360-degree radiance fields and robustness to zero-shot and out-of-domain testing. This research could significantly advance 3D scene reconstruction and novel view synthesis, with potential applications in virtual and augmented reality, computer graphics, and robotics, enabling more realistic and efficient 3D environment modeling from diverse viewpoints.

Authors:  Anpei Chen, Haofei Xu, Stefano Esposito, Siyu Tang, Andreas Geiger

Link:  https://arxiv.org/abs/2407.04699v1

Date: 2024-07-05

Summary:

Radiance field methods have achieved photorealistic novel view synthesis and geometry reconstruction, but they are mostly applied in per-scene optimization or small-baseline settings. While several recent works investigate feed-forward reconstruction with large baselines by utilizing transformers, they all operate with a standard global attention mechanism and hence ignore the local nature of 3D reconstruction. We propose a method that unifies local and global reasoning in transformer layers, resulting in improved quality and faster convergence. Our model represents scenes as Gaussian Volumes and combines this with an image encoder and Group Attention Layers for efficient feed-forward reconstruction. Experimental results demonstrate that our model, trained for two days on four GPUs, achieves high fidelity in reconstructing 360° radiance fields and robustness to zero-shot and out-of-domain testing.

--------------------------------------------------------------------------------------------------------

Waterfall: Framework for Robust and Scalable Text Watermarking

This paper introduces Waterfall, a training-free framework for robust and scalable text watermarking applicable across multiple text types and languages. The method uses LLMs as paraphrasers for watermarking and achieves better scalability, verifiability, and efficiency compared to existing methods. This research could have significant implications for protecting intellectual property in text and code, potentially revolutionizing how digital content is secured against unauthorized use or modification, particularly in the face of increasingly sophisticated AI-based attacks.

Authors:  Gregory Kang Ruey Lau, Xinyuan Niu, Hieu Dao, Jiangwei Chen, Chuan-Sheng Foo, Bryan Kian Hsiang Low

Link:  https://arxiv.org/abs/2407.04411v1

Date: 2024-07-05

Summary:

Protecting intellectual property (IP) of text such as articles and code is increasingly important, especially as sophisticated attacks become possible, such as paraphrasing by large language models (LLMs) or even unauthorized training of LLMs on copyrighted text to infringe such IP. However, existing text watermarking methods are not robust enough against such attacks nor scalable to millions of users for practical implementation. In this paper, we propose Waterfall, the first training-free framework for robust and scalable text watermarking applicable across multiple text types (e.g., articles, code) and languages supportable by LLMs, for general text and LLM data provenance. Waterfall comprises several key innovations, such as being the first to use LLMs as paraphrasers for watermarking, along with a novel combination of techniques that are surprisingly effective in achieving robust verifiability and scalability. We empirically demonstrate that Waterfall achieves significantly better scalability, robust verifiability, and computational efficiency compared to SOTA article-text watermarking methods, and also show how it can be directly applied to watermarking code.

--------------------------------------------------------------------------------------------------------

Improving LLM Abilities in Idiomatic Translation

This study focuses on enhancing large language models' ability to translate idiomatic expressions while preserving cultural nuances and linguistic style. The authors propose two methods: one using cosine similarity scores between idiom meanings, and another using LLMs to find corresponding idioms in the target language. The research also includes the development of a low-resource Urdu idiom dataset. This work could significantly improve machine translation quality, particularly for idiomatic expressions, leading to better cross-cultural communication and more accurate preservation of cultural context in translated texts.

Authors:  Sundesh Donthi, Maximilian Spencer, Om Patel, Joon Doh, Eid Rodan

Link:  https://arxiv.org/abs/2407.03518v1

Date: 2024-07-03

Summary:

For large language models (LLMs) like NLLB and GPT, translating idioms remains a challenge. Our goal is to enhance translation fidelity by improving LLM processing of idiomatic language while preserving the original linguistic style. This has a significant social impact, as it preserves cultural nuances and ensures translated texts retain their intent and emotional resonance, fostering better cross-cultural communication. Previous work has utilized knowledge bases like IdiomKB by providing the LLM with the meaning of an idiom to use in translation. Although this method yielded better results than a direct translation, it is still limited in its ability to preserve idiomatic writing style across languages. In this research, we expand upon the knowledge base to find corresponding idioms in the target language. Our research performs translations using two methods: the first employs the SentenceTransformers model to compute cosine similarity scores between the meanings of the original and target language idioms, selecting the best idiom (the Cosine Similarity Lookup method); the second uses an LLM to find a corresponding idiom in the target language for use in the translation (the LLM-generated idiom method). As a baseline, we performed a direct translation without providing additional information. Human evaluations on English -> Chinese and Chinese -> English translations show that the Cosine Similarity Lookup method outperformed the others in all GPT-4o translations. To further build upon IdiomKB, we developed a low-resource Urdu dataset containing Urdu idioms and their translations. Despite dataset limitations, the Cosine Similarity Lookup method shows promise, potentially overcoming language barriers and enabling the exploration of diverse literary works in Chinese and Urdu. For access to the code and replication of our experiments, please visit (https://github.com/ANON13222/ITR).
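
The Cosine Similarity Lookup idea can be sketched directly with the sentence-transformers library: embed the source idiom's meaning and the stored meanings of candidate target-language idioms, then pick the closest one. The model name and knowledge-base entries below are illustrative stand-ins, not the paper's data.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def best_target_idiom(source_meaning, target_idioms):
        """Pick the target-language idiom whose stored meaning is closest
        (by cosine similarity) to the source idiom's meaning."""
        meanings = [meaning for _, meaning in target_idioms]
        sims = util.cos_sim(model.encode([source_meaning]), model.encode(meanings))[0]
        return target_idioms[int(sims.argmax())][0]

    # Hypothetical knowledge-base entries: (target-language idiom, English gloss of its meaning)
    candidates = [("scapegoat idiom", "to be blamed for something one did not do"),
                  ("icing-on-the-cake idiom", "an extra benefit added to something already good")]
    print(best_target_idiom("being wrongly blamed for another's mistake", candidates))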

--------------------------------------------------------------------------------------------------------

Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models

This paper explores the use of vision-language models (VLMs) to enhance the adaptability of legged robots in complex environments. The authors propose VLM-Predictive Control, a system that combines in-context adaptation and future planning to enable robots to handle unexpected obstacles and scenarios. The method is evaluated on real-world obstacle courses using a quadruped robot. This research could revolutionize robotics in challenging environments, with potential applications in search and rescue operations, disaster response, and autonomous exploration of hazardous or inaccessible areas.

Authors:  Annie S. Chen, Alec M. Lessing, Andy Tang, Govind Chada, Laura Smith, Sergey Levine, Chelsea Finn

Link:  https://arxiv.org/abs/2407.02666v1

Date: 2024-07-02

Summary:

Legged robots are physically capable of navigating a diverse variety of environments and overcoming a wide range of obstructions. For example, in a search and rescue mission, a legged robot could climb over debris, crawl through gaps, and navigate out of dead ends. However, the robot's controller needs to respond intelligently to such varied obstacles, and this requires handling unexpected and unusual scenarios successfully. This presents an open challenge to current learning methods, which often struggle with generalization to the long tail of unexpected situations without heavy human supervision. To address this issue, we investigate how to leverage the broad knowledge about the structure of the world and commonsense reasoning capabilities of vision-language models (VLMs) to aid legged robots in handling difficult, ambiguous situations. We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection with VLMs: (1) in-context adaptation over previous robot interactions and (2) planning multiple skills into the future and replanning. We evaluate VLM-PC on several challenging real-world obstacle courses, involving dead ends and climbing and crawling, on a Go1 quadruped robot. Our experiments show that by reasoning over the history of interactions and future plans, VLMs enable the robot to autonomously perceive, navigate, and act in a wide range of complex scenarios that would otherwise require environment-specific engineering or human guidance.
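
At its core, the loop described here keeps the robot's recent interactions in context, asks the VLM for a short plan over available skills, executes the first skill, and replans. The sketch below is a schematic of that loop with placeholder `vlm` and `robot` interfaces; it is not the authors' system or API.

    def vlm_pc_loop(vlm, robot, skills, steps=20, horizon=3):
        """Schematic of the VLM-PC idea: maintain an interaction history for
        in-context adaptation, plan several skills ahead, execute one, replan.
        `vlm` and `robot` are placeholder interfaces, not the paper's."""
        history = []
        for _ in range(steps):
            image = robot.camera()
            plan = vlm(image=image, history=history,
                       prompt=f"Choose the next {horizon} skills from {skills} to make progress.")
            skill = plan[0] if plan else "walk_forward"   # fall back to a default skill
            outcome = robot.execute(skill)
            history.append({"skill": skill, "outcome": outcome})   # feeds in-context adaptation
        return history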

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.