Week Ending 6.23.2024

 

RESEARCH WATCH: 6.23.2024

 

Improving Interpretability and Robustness for the Detection of AI-Generated Images

This paper addresses the growing challenge of detecting AI-generated images as generative models become more sophisticated. The researchers focus on improving the robustness of detection methods, particularly those using CLIP embeddings. They propose techniques to enhance interpretability and cross-model generalization, including removing harmful embedding components and selecting the best-performing attention heads. The study also introduces a new dataset for AI-generated image detection. This research has important applications in media forensics, combating disinformation, and maintaining trust in visual content in an era of increasingly realistic AI-generated imagery.

Authors:  Tatiana Gaintseva, Laida Kushnareva, German Magai, Irina Piontkovskaya, Sergey Nikolenko, Martin Benning, Serguei Barannikov, Gregory Slabaugh

Link:  https://arxiv.org/abs/2406.15035v1

Date: 2024-06-21

Summary:

With growing abilities of generative models, artificial content detection becomes an increasingly important and difficult task. However, all popular approaches to this problem suffer from poor generalization across domains and generative models. In this work, we focus on the robustness of AI-generated image (AIGI) detectors. We analyze existing state-of-the-art AIGI detection methods based on frozen CLIP embeddings and show how to interpret them, shedding light on how images produced by various AI generators differ from real ones. Next we propose two ways to improve robustness: based on removing harmful components of the embedding vector and based on selecting the best performing attention heads in the image encoder model. Our methods increase the mean out-of-distribution (OOD) classification score by up to 6% for cross-model transfer. We also propose a new dataset for AIGI detection and use it in our evaluation; we believe this dataset will help boost further research. The dataset and code are provided as a supplement.
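
A minimal sketch of the two robustness ideas above, assuming a linear probe on frozen CLIP embeddings with toy random features standing in for real ones: components whose removal improves accuracy on an unseen generator are treated as harmful, and per-head slices of the embedding are ranked by how well they transfer. The selection criteria here are illustrative simplifications, not the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D, H = 64, 8                                   # toy embedding dim and number of attention heads
X_tr, y_tr = rng.normal(size=(600, D)), rng.integers(0, 2, 600)    # seen generator (train)
X_ood, y_ood = rng.normal(size=(200, D)), rng.integers(0, 2, 200)  # unseen generator (eval)

def ood_acc(Xa, ya, Xb, yb):
    """Accuracy of a linear probe trained on (Xa, ya), evaluated out-of-distribution."""
    return LogisticRegression(max_iter=500).fit(Xa, ya).score(Xb, yb)

# (a) drop "harmful" embedding components: those whose removal improves cross-model transfer
base = ood_acc(X_tr, y_tr, X_ood, y_ood)
harmful = [j for j in range(D)
           if ood_acc(np.delete(X_tr, j, axis=1), y_tr,
                      np.delete(X_ood, j, axis=1), y_ood) > base]

# (b) score contiguous per-head slices of the embedding and keep the best-transferring heads
hd = D // H
head_acc = [ood_acc(X_tr[:, h*hd:(h+1)*hd], y_tr, X_ood[:, h*hd:(h+1)*hd], y_ood)
            for h in range(H)]
best_heads = np.argsort(head_acc)[-3:]

print(f"baseline OOD acc {base:.2f}, {len(harmful)} harmful components, best heads {best_heads}")
```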

--------------------------------------------------------------------------------------------------------

Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video

This study evaluates the sports understanding capabilities of large language models (LLMs) through a comprehensive benchmark. The researchers test LLMs on tasks ranging from basic sports rules to complex, context-specific reasoning, using various learning strategies. They also assess video language models for multimodal sports understanding. The findings highlight critical challenges in sports-related NLP tasks. This research has potential applications in sports analytics, automated commentary, and enhancing AI assistants' ability to engage in sports-related discussions, potentially revolutionizing how we interact with and analyze sports content.

Authors:  Zhengbang Yang, Haotian Xia, Jingxi Li, Zezhi Chen, Zhuangdi Zhu, Weining Shen

Link:  https://arxiv.org/abs/2406.14877v1

Date: 2024-06-21

Summary:

Understanding sports is crucial for the advancement of Natural Language Processing (NLP) due to its intricate and dynamic nature. Reasoning over complex sports scenarios has posed significant challenges to current NLP technologies which require advanced cognitive capabilities. Toward addressing the limitations of existing benchmarks on sports understanding in the NLP field, we extensively evaluated mainstream large language models for various sports tasks. Our evaluation spans from simple queries on basic rules and historical facts to complex, context-specific reasoning, leveraging strategies from zero-shot to few-shot learning, and chain-of-thought techniques. In addition to unimodal analysis, we further assessed the sports reasoning capabilities of mainstream video language models to bridge the gap in multimodal sports understanding benchmarking. Our findings highlighted the critical challenges of sports understanding for NLP. We proposed a new benchmark based on a comprehensive overview of existing sports datasets and provided extensive error analysis which we hope can help identify future research priorities in this field.
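
To make the evaluation setup concrete, here is a hedged sketch of the three prompting strategies the paper compares, applied to a simple sports-rule question; the question, exemplars, and the query_llm() stub are illustrative placeholders rather than benchmark items.

```python
# Zero-shot, few-shot, and chain-of-thought prompts for the same sports question.
QUESTION = "In basketball, how many personal fouls disqualify a player in an NBA game?"

zero_shot = f"Answer the question.\nQ: {QUESTION}\nA:"

few_shot = (
    "Q: How many players per side are on the court in volleyball?\nA: 6\n"
    "Q: How long is a regulation soccer match (excluding stoppage time)?\nA: 90 minutes\n"
    f"Q: {QUESTION}\nA:"
)

chain_of_thought = (
    f"Q: {QUESTION}\n"
    "A: Let's think step by step about the NBA rulebook before giving a final answer."
)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM is being benchmarked."""
    raise NotImplementedError

for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot),
                     ("chain-of-thought", chain_of_thought)]:
    print(f"--- {name} ---\n{prompt}\n")
```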

--------------------------------------------------------------------------------------------------------

V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data

This paper presents a novel approach to consistently remove glasses from videos while preserving identity and video content. The researchers use synthetic data generated from a pretrained diffusion model to train their system, overcoming the lack of paired real-world data. The method demonstrates significant improvements over existing approaches and shows potential for generalization to other local video editing tasks. This technology could have applications in film and video production, virtual try-on systems for eyewear, and privacy-preserving video editing, offering new possibilities for consistent and realistic video manipulation.

Authors:  Rotem Shalev-Arkushin, Aharon Azulay, Tavi Halperin, Eitan Richardson, Amit H. Bermano, Ohad Fried

Link:  https://arxiv.org/abs/2406.14510v1

Date: 2024-06-20

Summary:

Diffusion-based generative models have recently shown remarkable image and video editing capabilities. However, local video editing, particularly removal of small attributes like glasses, remains a challenge. Existing methods either alter the videos excessively, generate unrealistic artifacts, or fail to perform the requested edit consistently throughout the video. In this work, we focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos. Due to the lack of paired data, we adopt a weakly supervised approach and generate synthetic imperfect data, using an adjusted pretrained diffusion model. We show that despite data imperfection, by learning from our generated data and leveraging the prior of pretrained diffusion models, our model is able to perform the desired edit consistently while preserving the original video content. Furthermore, we exemplify the generalization ability of our method to other local video editing tasks by applying it successfully to facial sticker-removal. Our approach demonstrates significant improvement over existing methods, showcasing the potential of leveraging synthetic data and strong video priors for local video editing tasks.

--------------------------------------------------------------------------------------------------------

PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions

PoseBench is a comprehensive benchmark designed to evaluate the robustness of pose estimation models against real-world corruptions. The study tests 60 representative models across three datasets for human and animal pose estimation, using various types of corruption. The findings reveal vulnerabilities in state-of-the-art models and explore design considerations to improve robustness. This research has significant implications for improving the reliability of pose estimation in real-world applications such as human-machine interaction, autonomous driving, and computer vision systems in challenging environments.

Authors:  Sihan Ma, Jing Zhang, Qiong Cao, Dacheng Tao

Link:  https://arxiv.org/abs/2406.14367v1

Date: 2024-06-20

Summary:

Pose estimation aims to accurately identify anatomical keypoints in humans and animals using monocular images, which is crucial for various applications such as human-machine interaction, embodied AI, and autonomous driving. While current models show promising results, they are typically trained and tested on clean data, potentially overlooking the corruption during real-world deployment and thus posing safety risks in practical scenarios. To address this issue, we introduce PoseBench, a comprehensive benchmark designed to evaluate the robustness of pose estimation models against real-world corruption. We evaluated 60 representative models, including top-down, bottom-up, heatmap-based, regression-based, and classification-based methods, across three datasets for human and animal pose estimation. Our evaluation involves 10 types of corruption in four categories: 1) blur and noise, 2) compression and color loss, 3) severe lighting, and 4) masks. Our findings reveal that state-of-the-art models are vulnerable to common real-world corruptions and exhibit distinct behaviors when tackling human and animal pose estimation tasks. To improve model robustness, we delve into various design considerations, including input resolution, pre-training datasets, backbone capacity, post-processing, and data augmentations. We hope that our benchmark will serve as a foundation for advancing research in robust pose estimation. The benchmark and source code will be released at https://xymsh.github.io/PoseBench
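
As a rough illustration of the benchmark protocol, the sketch below applies a few corruptions from the families named in the abstract (blur/noise, compression, severe lighting) and compares keypoint error on clean versus corrupted frames; run_pose_model() is a placeholder for the estimator under test, and the corruption parameters are arbitrary.

```python
import io
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def corrupt(img: Image.Image, kind: str) -> Image.Image:
    """Apply one toy corruption from the blur/noise, compression, or lighting families."""
    if kind == "gaussian_blur":
        return img.filter(ImageFilter.GaussianBlur(radius=3))
    if kind == "gaussian_noise":
        arr = np.asarray(img).astype(np.float32)
        arr += np.random.normal(0, 25, arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    if kind == "jpeg":
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=10)
        return Image.open(io.BytesIO(buf.getvalue()))
    if kind == "low_light":
        return ImageEnhance.Brightness(img).enhance(0.3)
    raise ValueError(kind)

def run_pose_model(img: Image.Image) -> np.ndarray:
    """Placeholder: returns (K, 2) predicted keypoint coordinates."""
    raise NotImplementedError

def mean_keypoint_error(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# Usage sketch (assumes a `frame` image and ground-truth `gt_kpts` are available):
# clean_err = mean_keypoint_error(run_pose_model(frame), gt_kpts)
# for kind in ["gaussian_blur", "gaussian_noise", "jpeg", "low_light"]:
#     err = mean_keypoint_error(run_pose_model(corrupt(frame, kind)), gt_kpts)
#     print(kind, err - clean_err)   # degradation relative to clean input
```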

--------------------------------------------------------------------------------------------------------

Enhancing robustness of data-driven SHM models: adversarial training with circle loss

This paper addresses the vulnerability of machine learning models used in Structural Health Monitoring (SHM) to adversarial examples. The researchers propose an adversarial training method using circle loss to optimize feature distances and improve model robustness. The approach shows substantial improvements over existing defense mechanisms. This research has important applications in improving the reliability and safety of SHM systems used in aerospace, civil, and mechanical infrastructure, potentially enhancing the detection of structural issues and preventing failures in critical structures.

Authors:  Xiangli Yang, Xijie Deng, Hanwei Zhang, Yang Zou, Jianxi Yang

Link:  https://arxiv.org/abs/2406.14232v1

Date: 2024-06-20

Summary:

Structural health monitoring (SHM) is critical to safeguarding the safety and reliability of aerospace, civil, and mechanical infrastructure. Machine learning-based data-driven approaches have gained popularity in SHM due to advancements in sensors and computational power. However, machine learning models used in SHM are vulnerable to adversarial examples -- even small changes in input can lead to different model outputs. This paper aims to address this problem by discussing adversarial defenses in SHM. In this paper, we propose an adversarial training method for defense, which uses circle loss to optimize the distance between features in training to keep examples away from the decision boundary. Through this simple yet effective constraint, our method demonstrates substantial improvements in model robustness, surpassing existing defense mechanisms.
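
A minimal PyTorch sketch of the defense idea, assuming a model that exposes both features and logits: adversarial examples are crafted with FGSM, and the standard circle loss (Sun et al., 2020) is added so same-class features stay compact and cross-class features stay apart. The attack, model interface, and hyperparameters are illustrative stand-ins for the authors' setup.

```python
import torch
import torch.nn.functional as F

def circle_loss(features, labels, m=0.25, gamma=64.0):
    """Circle loss over cosine similarities of all pairs in the batch."""
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t()
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sp = sim[same & ~eye]                       # positive-pair similarities
    sn = sim[~same]                             # negative-pair similarities
    ap = torch.clamp_min(1 + m - sp.detach(), 0.0)
    an = torch.clamp_min(sn.detach() + m, 0.0)
    logit_p = -gamma * ap * (sp - (1 - m))
    logit_n = gamma * an * (sn - m)
    return F.softplus(torch.logsumexp(logit_p, 0) + torch.logsumexp(logit_n, 0))

def fgsm(model, x, y, eps=0.01):
    """One-step FGSM attack on the classification loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv)["logits"], y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

def train_step(model, optimizer, x, y, lam=1.0):
    """Adversarial training step; model(x) is assumed to return {'features': ..., 'logits': ...}."""
    x_adv = fgsm(model, x, y)
    out_clean, out_adv = model(x), model(x_adv)
    feats = torch.cat([out_clean["features"], out_adv["features"]])
    labels = torch.cat([y, y])
    loss = F.cross_entropy(out_adv["logits"], y) + lam * circle_loss(feats, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```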

--------------------------------------------------------------------------------------------------------

MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models

MR-BEN is a new benchmark designed to assess the meta-reasoning capabilities of large language models (LLMs). It requires LLMs to locate and analyze errors in automatically generated reasoning steps across various subjects. The benchmark reveals limitations in current LLMs and highlights the gap between open-source and closed-source models in reasoning capabilities. This research has implications for improving LLMs' problem-solving and decision-making abilities, with potential applications in education, scientific research, and complex problem-solving across multiple domains.

Authors:  Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia

Link:  https://arxiv.org/abs/2406.13975v1

Date: 2024-06-20

Summary:

Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, it has been increasingly challenging to evaluate the reasoning capability of LLMs. Concretely, existing outcome-based benchmarks begin to saturate and become less sufficient to monitor the progress. To this end, we present a process-based benchmark MR-BEN that demands a meta reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts, covering various subjects such as physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, open-source models are seemingly comparable to GPT-4 on outcome-based benchmarks, but they lag far behind on our benchmark, revealing the underlying reasoning capability gap between them. Our dataset and codes are available on https://randolph-zeng.github.io/Mr-Ben.github.io/.
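
A hedged sketch of what a process-based, meta-reasoning item looks like in practice: the model sees a step-numbered solution and must decide whether it is correct and, if not, point to the first wrong step. The example item, prompt wording, and ask_model() stub are invented for illustration, not drawn from MR-BEN.

```python
EXAMPLE = {
    "question": "A train travels 120 km in 2 hours. How far does it travel in 5 hours at the same speed?",
    "steps": [
        "Step 1: Speed = 120 km / 2 h = 60 km/h.",
        "Step 2: Distance in 5 hours = 60 km/h * 5 h = 350 km.",   # deliberate arithmetic slip
    ],
    "first_error_step": 2,
}

def build_prompt(item: dict) -> str:
    steps = "\n".join(item["steps"])
    return (
        f"Question: {item['question']}\n"
        f"Proposed solution:\n{steps}\n\n"
        "Is the solution correct? If not, give the number of the first incorrect step "
        "and briefly explain the error. Answer as: verdict=<correct|incorrect>, step=<n>."
    )

def ask_model(prompt: str) -> str:
    """Placeholder for the LLM under evaluation."""
    raise NotImplementedError

def score(item: dict, answer: str) -> bool:
    """Toy metric: did the model point at the annotated first-error step?"""
    return f"step={item['first_error_step']}" in answer.replace(" ", "")
```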

--------------------------------------------------------------------------------------------------------

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World

ClinicalLab introduces a comprehensive suite for evaluating and aligning medical agents powered by large language models (LLMs) in clinical diagnostics. It includes ClinicalBench, a multi-departmental diagnostic evaluation benchmark, and novel metrics for assessing LLMs in clinical tasks. The researchers also propose ClinicalAgent, an end-to-end clinical agent aligned with real-world practices. This research has significant potential to improve the accuracy and reliability of AI-assisted medical diagnostics across multiple specialties, potentially enhancing patient care and clinical decision-making.

Authors:  Weixiang Yan, Haitian Liu, Tengxiao Wu, Qian Chen, Wen Wang, Haoyuan Chai, Jiayi Wang, Weishan Zhao, Yixin Zhang, Renjun Zhang, Li Zhu

Link:  https://arxiv.org/abs/2406.13890v1

Date: 2024-06-19

Summary:

LLMs have achieved significant performance progress in various NLP applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the characteristics of multiple departments and specializations in modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with the real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluations of end-to-end real clinical scenarios. These limitations in benchmarks in turn obstruct advancements of LLMs and agents for medicine. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.

--------------------------------------------------------------------------------------------------------

LangTopo: Aligning Language Descriptions of Graphs with Tokenized Topological Modeling

This paper introduces LangTopo, a framework that aligns graph structure modeling with natural language understanding at the token level. The approach enables large language models (LLMs) to learn graph structure modeling capabilities from Graph Neural Networks (GNNs), allowing LLMs to handle graph-structured data independently. This research has potential applications in enhancing LLMs' performance on graph-related tasks, with implications for social network analysis, molecular structure prediction, and other fields that rely on graph data processing.

Authors:  Zhong Guan, Hongke Zhao, Likang Wu, Ming He, Jianpin Fan

Link:  https://arxiv.org/abs/2406.13250v1

Date: 2024-06-19

Summary:

Recently, large language models (LLMs) have been widely researched in the field of graph machine learning due to their outstanding abilities in language comprehension and learning. However, the significant gap between natural language tasks and topological structure modeling poses a nonnegligible challenge. Specifically, since natural language descriptions are not sufficient for LLMs to understand and process graph-structured data, fine-tuned LLMs perform even worse than some traditional GNN models on graph tasks, lacking inherent modeling capabilities for graph structures. Existing research overly emphasizes LLMs' understanding of semantic information captured by external models, while inadequately exploring graph topological structure modeling, thereby overlooking the genuine capabilities that LLMs lack. Consequently, in this paper, we introduce a new framework, LangTopo, which aligns graph structure modeling with natural language understanding at the token level. LangTopo quantifies the graph structure modeling capabilities of GNNs and LLMs by constructing a codebook for the graph modality and performs consistency maximization. This process aligns the text description of LLM with the topological modeling of GNN, allowing LLM to learn the ability of GNN to capture graph structures, enabling LLM to handle graph-structured data independently. We demonstrate the effectiveness of our proposed method on multiple datasets.
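
A small sketch of the token-level alignment idea under simplifying assumptions: a shared codebook soft-quantizes graph-side (GNN) and text-side (LLM) representations of the same nodes, and a KL consistency term pushes the two assignment distributions together so the LLM absorbs the GNN's structural signal. Dimensions, the soft-assignment form, and the objective are illustrative choices, not the LangTopo implementation.

```python
import torch
import torch.nn.functional as F

class SharedCodebook(torch.nn.Module):
    def __init__(self, num_codes=256, dim=128):
        super().__init__()
        self.codes = torch.nn.Parameter(torch.randn(num_codes, dim))

    def assign(self, x, tau=1.0):
        """Soft assignment of each row of x to codebook entries (distances -> softmax)."""
        d = torch.cdist(x, self.codes)            # (N, num_codes) Euclidean distances
        return F.softmax(-d / tau, dim=-1)

def consistency_loss(gnn_feats, llm_feats, codebook):
    """KL divergence between graph-side and text-side code assignments for the same nodes."""
    p_gnn = codebook.assign(gnn_feats)
    p_llm = codebook.assign(llm_feats)
    return F.kl_div(p_llm.log(), p_gnn, reduction="batchmean")

# Toy usage: 32 nodes whose GNN embeddings and projected LLM hidden states share dim 128.
codebook = SharedCodebook()
gnn_feats = torch.randn(32, 128)      # from a graph neural network
llm_feats = torch.randn(32, 128)      # LLM representations projected to the same space
print(float(consistency_loss(gnn_feats, llm_feats, codebook)))
```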

--------------------------------------------------------------------------------------------------------

Informed along the road: roadway capacity driven graph convolution network for network-wide traffic prediction

This study presents the Roadway Capacity Driven Graph Convolution Network (RCDGCN) model for predicting network-wide traffic states. The model incorporates static and dynamic roadway capacity attributes in spatio-temporal settings. Evaluated on real-world datasets, RCDGCN outperformed baseline methods in forecasting accuracy. This research has significant applications in transportation system management, potentially improving traffic flow prediction, urban planning, and intelligent transportation systems.

Authors:  Zilin Bian, Jingqin Gao, Kaan Ozbay, Fan Zuo, Dachuan Zuo, Zhenning Li

Link:  https://arxiv.org/abs/2406.13057v1

Date: 2024-06-18

Summary:

While deep learning has shown success in predicting traffic states, most methods treat it as a general prediction task without considering transportation aspects. Recently, graph neural networks have proven effective for this task, but few incorporate external factors that impact roadway capacity and traffic flow. This study introduces the Roadway Capacity Driven Graph Convolution Network (RCDGCN) model, which incorporates static and dynamic roadway capacity attributes in spatio-temporal settings to predict network-wide traffic states. The model was evaluated on two real-world datasets with different transportation factors: the ICM-495 highway network and an urban network in Manhattan, New York City. Results show RCDGCN outperformed baseline methods in forecasting accuracy. Analyses, including ablation experiments, weight analysis, and case studies, investigated the effect of capacity-related factors. The study demonstrates the potential of using RCDGCN for transportation system management.
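
The sketch below shows the basic ingredient in miniature: a single graph convolution over a toy road network whose node features concatenate a traffic observation with static capacity attributes (lanes, speed limit) and a dynamic one (lane closure). It is not the RCDGCN architecture, which adds spatio-temporal modeling on top of this idea.

```python
import torch

def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops, as in a standard GCN."""
    A_hat = A + torch.eye(A.size(0))
    d = A_hat.sum(1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

class CapacityGCNLayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = torch.nn.Linear(in_dim, out_dim)

    def forward(self, A_norm, X):
        return torch.relu(A_norm @ self.lin(X))

# Toy road network: 5 segments in a chain; features = speed + 2 static + 1 dynamic attribute.
A = torch.tensor([[0, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=torch.float)
speeds  = torch.rand(5, 1) * 60                                                    # current speed
static  = torch.tensor([[2, 55], [3, 55], [2, 45], [4, 65], [2, 45]], dtype=torch.float)  # lanes, limit
dynamic = torch.tensor([[0], [1], [0], [0], [0]], dtype=torch.float)               # lane-closure flag
X = torch.cat([speeds, static, dynamic], dim=1)

layer = CapacityGCNLayer(in_dim=4, out_dim=8)
print(layer(normalized_adjacency(A), X).shape)    # torch.Size([5, 8])
```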

--------------------------------------------------------------------------------------------------------

Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?

This paper challenges the assumption that more detailed image captions generated by Large Vision-Language Models (LVLMs) always lead to more object hallucinations. The researchers propose a new decoding strategy, Differentiated Beam Decoding (DBD), and novel evaluation metrics. Their approach demonstrates that it's possible to generate detailed descriptions while maintaining low hallucination levels. This research has implications for improving the accuracy and reliability of image captioning systems, with potential applications in accessibility technologies, content description for visually impaired users, and automated image analysis in various fields.

Authors:  Mingqian Feng, Yunlong Tang, Zeliang Zhang, Chenliang Xu

Link:  https://arxiv.org/abs/2406.12663v1

Date: 2024-06-18

Summary:

Large Vision-Language Models (LVLMs) excel in integrating visual and linguistic contexts to produce detailed content, facilitating applications such as image captioning. However, using LVLMs to generate descriptions often faces the challenge of object hallucination (OH), where the output text misrepresents actual objects in the input image. While previous studies attribute the occurrence of OH to the inclusion of more details, our study finds technical flaws in existing metrics, leading to unreliable evaluations of models and conclusions about OH. This has sparked a debate on the question: Do more details always introduce more hallucinations in LVLM-based image captioning?   In this paper, we address this debate by proposing a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation metrics: CLIP-Precision, CLIP-Recall, and CLIP-F1. DBD decodes the wealth of information hidden in visual input into distinct language representations called unit facts in parallel. This decoding is achieved via a well-designed differential score that guides the parallel search and candidate screening. The selected unit facts are then aggregated to generate the final caption. Our proposed metrics evaluate the comprehensiveness and accuracy of image captions by comparing the embedding groups of ground-truth image regions and generated text partitions. Extensive experiments on the Visual Genome dataset validate the effectiveness of our approach, demonstrating that it produces detailed descriptions while maintaining low hallucination levels.
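
Reading the metric description informally, here is a sketch of how CLIP-Precision, CLIP-Recall, and CLIP-F1 could be computed from embeddings of ground-truth image regions and generated caption segments; the encoder placeholders and the max-matching form are assumptions, and the paper's exact definitions may differ.

```python
import numpy as np

def embed_text(sentences):   # placeholder for a CLIP text encoder
    raise NotImplementedError

def embed_regions(regions):  # placeholder for a CLIP image encoder on cropped regions
    raise NotImplementedError

def clip_precision_recall_f1(text_emb: np.ndarray, region_emb: np.ndarray):
    """text_emb: (T, D) caption segments, region_emb: (R, D) image regions; rows L2-normalized."""
    sim = text_emb @ region_emb.T                 # (T, R) cosine similarities
    precision = sim.max(axis=1).mean()            # each generated segment should match some region
    recall = sim.max(axis=0).mean()               # each region should be covered by some segment
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return precision, recall, f1
```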

--------------------------------------------------------------------------------------------------------

PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers

This paper explores using large language models (LLMs) for complex decision-making tasks. The researchers introduce a new benchmark, Decision QA (DQA), and propose PlanRAG, an innovative retrieval-augmented generation technique. PlanRAG generates a decision-making plan before retrieving relevant data, outperforming existing methods in two scenarios derived from strategy video games. This approach has potential applications in business analytics, strategic planning, and automated decision-making systems. It could revolutionize how AI assists in complex decision-making processes across various industries, from finance to logistics, by combining language understanding with structured data analysis.

Authors:  Myeonghwa Lee, Seonho An, Min-Soo Kim

Link:  https://arxiv.org/abs/2406.12430v1

Date: 2024-06-18

Summary:

In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of determining the best decision, $d_{best}$, for a decision-making question $Q$, business rules $R$, and a database $D$. Since no existing benchmark examines Decision QA, we propose the Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates a plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario. We release our code and benchmark at https://github.com/myeon9h/PlanRAG.
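
A hedged sketch of the iterative plan-then-retrieval loop as described in the abstract: draft a plan, issue data-analysis queries for it, and either decide or re-plan once results arrive. llm(), run_query(), and the prompts are placeholders; the released repository is the authoritative implementation.

```python
def llm(prompt: str) -> str:
    """Placeholder for the planning/answering language model."""
    raise NotImplementedError

def run_query(query: str):
    """Placeholder: execute a data-analysis query against the database D."""
    raise NotImplementedError

def plan_rag(question: str, rules: str, schema: str, max_iters: int = 3) -> str:
    plan = llm(f"Question: {question}\nBusiness rules: {rules}\nDatabase schema: {schema}\n"
               "Write a step-by-step plan for the data analysis needed to decide.")
    evidence = []
    decision = ""
    for _ in range(max_iters):
        queries = llm(f"Plan:\n{plan}\nWrite the queries needed for the next unfinished step.")
        for q in queries.splitlines():
            if q.strip():
                evidence.append((q, run_query(q)))
        decision = llm(f"Plan:\n{plan}\nResults so far:\n{evidence}\n"
                       "If you can decide, answer 'DECISION: <best decision>'. "
                       "Otherwise revise the plan.")
        if decision.startswith("DECISION:"):
            return decision
        plan = decision      # treat the output as a revised plan and iterate
    return decision
```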

--------------------------------------------------------------------------------------------------------

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model

This study addresses the critical challenge of safety alignment in Vision Language Models (VLMs). The researchers introduce SPA-VL, a large-scale dataset covering various harmfulness domains and categories. By training VLMs on this dataset, the models show improved harmlessness and helpfulness while maintaining core capabilities. This research has significant implications for developing safer AI systems in multimodal applications, such as content moderation, assistive technologies for the visually impaired, and AI-powered image analysis tools. It could help mitigate risks associated with AI-generated content and ensure more responsible deployment of VLMs in real-world scenarios.

Authors:  Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, Jing Shao

Link:  https://arxiv.org/abs/2406.12030v1

Date: 2024-06-17

Summary:

The emergence of Vision Language Models (VLMs) has brought unprecedented advances in understanding multimodal information. The combination of textual and visual semantics in VLMs is highly complex and diverse, making the safety alignment of these models challenging. Furthermore, due to the limited study on the safety alignment of VLMs, there is a lack of large-scale, high-quality datasets. To address these limitations, we propose a Safety Preference Alignment dataset for Vision Language Models named SPA-VL. In terms of breadth, SPA-VL covers 6 harmfulness domains, 13 categories, and 53 subcategories, and contains 100,788 samples of the quadruple (question, image, chosen response, rejected response). In terms of depth, the responses are collected from 12 open- (e.g., QwenVL) and closed-source (e.g., Gemini) VLMs to ensure diversity. The experimental results indicate that models trained with alignment techniques on the SPA-VL dataset exhibit substantial improvements in harmlessness and helpfulness while maintaining core capabilities. SPA-VL, as a large-scale, high-quality, and diverse dataset, represents a significant milestone in ensuring that VLMs achieve both harmlessness and helpfulness. We have made our code https://github.com/EchoseChen/SPA-VL-RLHF and SPA-VL dataset url https://huggingface.co/datasets/sqrti/SPA-VL publicly available.
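
Since each SPA-VL sample is a (question, image, chosen response, rejected response) quadruple, it plugs directly into preference-optimization objectives. The sketch below shows standard DPO over per-sample log-probabilities as one such alignment technique; it is not necessarily the exact training recipe used in the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss; inputs are (batch,) summed log-probs of each response
    given (image, question), from the VLM policy and a frozen reference copy."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy shapes only; in practice these log-probs come from scoring the chosen and
# rejected responses of each SPA-VL quadruple under the policy and reference models.
b = 4
print(float(dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))))
```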

--------------------------------------------------------------------------------------------------------

Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs

This paper proposes a novel framework to enhance the planning capabilities of large language models (LLMs) in complex question-answering tasks. By utilizing planning data derived from knowledge graphs, the researchers improve LLMs' ability to handle multi-step reasoning and information retrieval. This approach has potential applications in advanced search engines, intelligent tutoring systems, and AI-powered research assistants. It could significantly enhance the ability of AI systems to break down complex queries, retrieve relevant information, and provide more accurate and comprehensive answers across various domains of knowledge.

Authors:  Junjie Wang, Mingyang Chen, Binbin Hu, Dan Yang, Ziqi Liu, Yue Shen, Peng Wei, Zhiqiang Zhang, Jinjie Gu, Jun Zhou, Jeff Z. Pan, Wen Zhang, Huajun Chen

Link:  https://arxiv.org/abs/2406.14282v1

Date: 2024-06-20

Summary:

Improving the performance of large language models (LLMs) in complex question-answering (QA) scenarios has always been a research focal point. Recent studies have attempted to enhance LLMs' performance by combining step-wise planning with external retrieval. While effective for advanced models like GPT-3.5, smaller LLMs face challenges in decomposing complex questions, necessitating supervised fine-tuning. Previous work has relied on manual annotation and knowledge distillation from teacher LLMs, which are time-consuming and not accurate enough. In this paper, we introduce a novel framework for enhancing LLMs' planning capabilities by using planning data derived from knowledge graphs (KGs). LLMs fine-tuned with this data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval. Evaluations on multiple datasets, including our newly proposed benchmark, highlight the effectiveness of our framework and the benefits of KG-derived planning data.
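
One plausible way such planning supervision could be derived, sketched under assumptions: a multi-hop path in the KG yields both a composite question and its natural step-by-step retrieval plan, which then becomes a fine-tuning pair for the planner LM. The path, question template, and output format are invented for illustration and may differ from the paper's construction.

```python
# A two-hop KG path and the (question, plan) fine-tuning pair it induces.
path = [("Inception", "directed_by", "Christopher Nolan"),
        ("Christopher Nolan", "born_in", "London")]
question = "In which city was the director of Inception born?"
answer = path[-1][2]          # gold answer, usable for end-to-end supervision

plan_steps = []
for i, (head, rel, tail) in enumerate(path):
    subject = f"'{head}'" if i == 0 else "the answer of the previous step"
    plan_steps.append(f"Step {i + 1}: retrieve the '{rel}' of {subject}.")

training_example = {
    "instruction": "Decompose the question into retrieval steps before answering.",
    "input": question,
    "output": "\n".join(plan_steps),
}
print(training_example["output"])
```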

--------------------------------------------------------------------------------------------------------

Adaptive Selection for Homogeneous Tools: An Instantiation in the RAG Scenario

This study focuses on improving the cost-effectiveness of tool selection in AI systems, particularly in the context of Retrieval-Augmented Generation (RAG). The researchers propose a method to predict both performance and cost for task completion, optimizing tool selection. This approach has potential applications in resource-efficient AI systems, cloud computing services, and automated workflow optimization. By balancing performance and cost, this method could lead to more economical and efficient AI-powered systems in various industries, from customer service chatbots to large-scale data analysis platforms.

Authors:  Feiteng Mu, Yong Jiang, Liwen Zhang, Chu Liu, Wenjie Li, Pengjun Xie, Fei Huang

Link:  https://arxiv.org/abs/2406.12429v1

Date: 2024-06-18

Summary:

Current research on tool learning primarily focuses on selecting the most effective tool from a wide array of options, often overlooking cost-effectiveness, a crucial factor in human problem-solving. In this paper, we address the selection of homogeneous tools by predicting both their performance and the associated cost required to accomplish a given task. We then assign queries to the optimal tools in a cost-effective manner. Our experimental results demonstrate that our method achieves higher performance at a lower cost compared to strong baseline approaches.
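
A minimal sketch of cost-aware routing among homogeneous tools: predict both performance and cost per tool for each query, then assign the query to the tool with the best utility. The predictors and the linear utility (performance minus a cost penalty) are illustrative stand-ins for the paper's learned models.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Tool:
    name: str
    predict_perf: Callable[[str], float]   # estimated task success for this query
    predict_cost: Callable[[str], float]   # estimated cost (latency, tokens, or dollars)

def select_tool(query: str, tools: Sequence[Tool], lam: float = 0.5) -> Tool:
    """Pick the tool maximizing predicted performance minus lam * predicted cost."""
    return max(tools, key=lambda t: t.predict_perf(query) - lam * t.predict_cost(query))

# Toy usage: a small cheap retriever vs. a large expensive one in a RAG stack.
tools = [
    Tool("small-retriever", lambda q: 0.70, lambda q: 0.1),
    Tool("large-retriever", lambda q: 0.85, lambda q: 0.6),
]
print(select_tool("Who wrote the cited 2021 survey on graph learning?", tools).name)
```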

--------------------------------------------------------------------------------------------------------

Machine Learning Techniques in Automatic Music Transcription: A Systematic Survey

This comprehensive survey explores the state of Automatic Music Transcription (AMT) using machine learning techniques. The researchers review current methodologies, progress, and limitations in converting audio signals to symbolic music notation. This study has implications for music education, digital music libraries, and music production tools. Improved AMT systems could revolutionize how we interact with and analyze music, enabling more efficient music archiving, assisting in music composition, and enhancing accessibility for musicians with hearing impairments.

Authors:  Fatemeh Jamshidi, Gary Pike, Amit Das, Richard Chapman

Link:  https://arxiv.org/abs/2406.15249v1

Date: 2024-06-20

Summary:

In the domain of Music Information Retrieval (MIR), Automatic Music Transcription (AMT) emerges as a central challenge, aiming to convert audio signals into symbolic notations like musical notes or sheet music. This systematic review accentuates the pivotal role of AMT in music signal analysis, emphasizing its importance due to the intricate and overlapping spectral structure of musical harmonies. Through a thorough examination of existing machine learning techniques utilized in AMT, we explore the progress and constraints of current models and methodologies. Despite notable advancements, AMT systems have yet to match the accuracy of human experts, largely due to the complexities of musical harmonies and the need for nuanced interpretation. This review critically evaluates both fully automatic and semi-automatic AMT systems, emphasizing the importance of minimal user intervention and examining various methodologies proposed to date. By addressing the limitations of prior techniques and suggesting avenues for improvement, our objective is to steer future research towards fully automated AMT systems capable of accurately and efficiently translating intricate audio signals into precise symbolic representations. This study not only synthesizes the latest advancements but also lays out a road-map for overcoming existing challenges in AMT, providing valuable insights for researchers aiming to narrow the gap between current systems and human-level transcription accuracy.

--------------------------------------------------------------------------------------------------------

Ranking LLMs by compression

This paper proposes a novel method for evaluating large language models (LLMs) based on their ability to compress information. By conceptualizing understanding as information compression, the researchers demonstrate a correlation between compression ratio and model performance on various NLP tasks. This approach offers a potentially more efficient and generalizable way to assess LLM capabilities. It could have significant implications for model selection, benchmarking, and optimization in AI research and development, potentially leading to more efficient and effective language models across various applications.

Authors:  Peijia Guo, Ziguang Li, Haibo Hu, Chao Huang, Ming Li, Rui Zhang

Link:  https://arxiv.org/abs/2406.14171v1

Date: 2024-06-20

Summary:

We conceptualize the process of understanding as information compression, and propose a method for ranking large language models (LLMs) based on lossless data compression. We demonstrate the equivalence of compression length under arithmetic coding with cumulative negative log probabilities when using a large language model as a prior, that is, the pre-training phase of the model is essentially the process of learning the optimal coding length. At the same time, the evaluation metric compression ratio can be obtained without actual compression, which greatly saves overhead. In this paper, we use five large language models as priors for compression, then compare their performance on challenging natural language processing tasks, including sentence completion, question answering, and coreference resolution. Experimental results show that compression ratio and model performance are positively correlated, so it can be used as a general metric to evaluate large language models.
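
The core identity is easy to operationalize: under arithmetic coding with the LM as prior, the compressed length equals (up to rounding) the cumulative negative log2 token probability, so a compression ratio falls out of an ordinary likelihood evaluation without running a compressor. The sketch below uses GPT-2 purely as a small stand-in for the five LLMs compared in the paper.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Large language models can be ranked by how well they compress text."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, labels=ids)          # mean next-token cross-entropy, in nats

n_predicted = ids.size(1) - 1             # the first token has no prediction
bits = out.loss.item() * n_predicted / math.log(2)   # cumulative -log2 p(token)
raw_bits = len(text.encode("utf-8")) * 8
print(f"estimated compressed size: {bits:.1f} bits, compression ratio: {raw_bits / bits:.2f}")
```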

--------------------------------------------------------------------------------------------------------

Insect Identification in the Wild: The AMI Dataset

This study introduces a large-scale machine learning benchmark for insect recognition in the wild, addressing the critical need for better insect monitoring tools. The researchers provide curated datasets and baseline algorithms for this challenging task. This work has important applications in biodiversity conservation, agricultural pest management, and ecological research. Improved insect identification systems could enhance our understanding of ecosystem health, aid in early detection of invasive species, and contribute to more sustainable agricultural practices through targeted pest control.

Authors:  Aditya Jain, Fagner Cunha, Michael James Bunsen, Juan Sebastián Cañas, Léonard Pasi, Nathan Pinoy, Flemming Helsing, JoAnne Russo, Marc Botham, Michael Sabourin, Jonathan Fréchette, Alexandre Anctil, Yacksecari Lopez, Eduardo Navarro, Filonila Perez Pimentel, Ana Cecilia Zamora, José Alejandro Ramirez Silva, Jonathan Gagnon, Tom August, Kim Bjerge, Alba Gomez Segura, Marc Bélisle, Yves Basset, Kent P. McFarland, David Roy, Toke Thomas Høye, Maxim Larrivée, David Rolnick

Link:  https://arxiv.org/abs/2406.12452v1

Date: 2024-06-18

Summary:

Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study insects, and have proposed computer vision algorithms as an answer for scalable data processing. However, insect monitoring in the wild poses unique challenges that have not yet been addressed within computer vision, including the combination of long-tailed data, extremely similar classes, and significant distribution shifts. We provide the first large-scale machine learning benchmarks for fine-grained insect recognition, designed to match real-world tasks faced by ecologists. Our contributions include a curated dataset of images from citizen science platforms and museums, and an expert-annotated dataset drawn from automated camera traps across multiple continents, designed to test out-of-distribution generalization under field conditions. We train and evaluate a variety of baseline algorithms and introduce a combination of data augmentation techniques that enhance generalization across geographies and hardware setups. Code and datasets are made publicly available.

--------------------------------------------------------------------------------------------------------

Landscape More Secure Than Portrait? Zooming Into the Directionality of Digital Images With Security Implications

This paper explores how image orientation affects security in various digital image applications. The researchers demonstrate that accounting for directionality can improve the performance of security-related tasks such as steganalysis and synthetic image detection. This work has implications for digital forensics, cybersecurity, and media authentication. By considering image orientation in security algorithms, we could develop more robust systems for detecting manipulated or artificially generated images, enhancing trust in digital media and improving security measures in various digital platforms.

Authors:  Benedikt Lorch, Rainer Böhme

Link:  https://arxiv.org/abs/2406.15206v1

Date: 2024-06-21

Summary:

The orientation in which a source image is captured can affect the resulting security in downstream applications. One reason for this is that many state-of-the-art methods in media security assume that image statistics are similar in the horizontal and vertical directions, allowing them to reduce the number of features (or trainable weights) by merging coefficients. We show that this artificial symmetrization tends to suppress important properties of natural images and common processing operations, causing a loss of performance. We also observe the opposite problem, where unaddressed directionality causes learning-based methods to overfit to a single orientation. These are vulnerable to manipulation if an adversary chooses inputs with the less common orientation. This paper takes a comprehensive approach, identifies and systematizes causes of directionality at several stages of a typical acquisition pipeline, measures their effect, and demonstrates for three selected security applications (steganalysis, forensic source identification, and the detection of synthetic images) how the performance of state-of-the-art methods can be improved by properly accounting for directionality.
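
A toy numerical illustration of the directionality point, using synthetic pixels: after a horizontal-only processing step, horizontal and vertical neighbor-difference statistics no longer match, which is exactly the asymmetry that symmetrized features would average away.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 384)).astype(np.float64)   # stand-in for a photo
img = (img + np.roll(img, 1, axis=1)) / 2   # simulate a mild horizontal-only smoothing step

dh = np.diff(img, axis=1)   # horizontal neighbor differences
dv = np.diff(img, axis=0)   # vertical neighbor differences
print(f"horizontal diff variance: {dh.var():.1f}")
print(f"vertical diff variance:   {dv.var():.1f}")   # roughly twice the horizontal value here
```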

--------------------------------------------------------------------------------------------------------

Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition

This study investigates the capabilities of large language models (LLMs) in recognizing lexical entailment relations among verbs. The researchers find that while LLMs perform moderately well, they still face challenges in fully solving this task. This work has implications for natural language understanding, semantic analysis, and AI-powered language tools. Improving LLMs' ability to recognize lexical entailment could enhance machine translation, text summarization, and question-answering systems, leading to more nuanced and accurate language processing across various applications.

Authors:  Candida M. Greco, Lucio La Cava, Andrea Tagarelli

Link:  https://arxiv.org/abs/2406.14894v1

Date: 2024-06-21

Summary:

Verbs form the backbone of language, providing structure and meaning to sentences. Yet, their intricate semantic nuances pose a longstanding challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases, namely WordNet and HyperLex. Our findings reveal that the models can tackle the lexical entailment recognition task with moderately good performance, although at varying degrees of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models' performance. However, perfectly solving the task remains an unmet challenge for all examined LLMs, which calls for further research on this topic.
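
A hedged sketch of the evaluation setup: the model is asked whether one verb lexically entails another, in zero-shot and few-shot variants, over verb pairs of the kind found in WordNet and HyperLex. The pairs, prompt wording, and ask_model() stub are illustrative placeholders, not the paper's prompts or data.

```python
# (verb1, verb2, does verb1 lexically entail verb2?)
PAIRS = [("snore", "sleep", True), ("whisper", "talk", True), ("run", "eat", False)]

def zero_shot_prompt(v1: str, v2: str) -> str:
    return f"Does the verb '{v1}' lexically entail the verb '{v2}'? Answer yes or no."

def few_shot_prompt(v1: str, v2: str) -> str:
    examples = ("Q: Does 'limp' lexically entail 'walk'?\nA: yes\n"
                "Q: Does 'sing' lexically entail 'dance'?\nA: no\n")
    return examples + f"Q: Does '{v1}' lexically entail '{v2}'?\nA:"

def ask_model(prompt: str) -> str:
    """Placeholder for whichever LLM is under evaluation."""
    raise NotImplementedError

def accuracy(pairs, prompt_fn) -> float:
    correct = sum(ask_model(prompt_fn(v1, v2)).strip().lower().startswith("yes") == gold
                  for v1, v2, gold in pairs)
    return correct / len(pairs)
```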

--------------------------------------------------------------------------------------------------------

Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata

This paper introduces Multi-Meta-RAG, a method to improve Retrieval-Augmented Generation (RAG) for complex, multi-hop queries. By using database filtering with LLM-extracted metadata, the researchers enhance the selection of relevant documents for answering multi-step questions. This approach has potential applications in advanced search engines, intelligent assistants, and AI-powered research tools. It could significantly improve the ability of AI systems to handle complex queries that require connecting multiple pieces of information, enhancing their usefulness in fields such as scientific research, legal analysis, and investigative journalism.

Authors:  Mykhailo Poliakov, Nadiya Shvai

Link:  https://arxiv.org/abs/2406.13213v1

Date: 2024-06-19

Summary:

Retrieval-augmented generation (RAG) enables retrieval of relevant information from an external knowledge source and allows large language models (LLMs) to answer queries over previously unseen document collections. However, traditional RAG applications have been shown to perform poorly in answering multi-hop questions, which require retrieving and reasoning over multiple elements of supporting evidence. We introduce a new method called Multi-Meta-RAG, which uses database filtering with LLM-extracted metadata to improve RAG's selection of documents relevant to the question from various sources. While database filtering is specific to a set of questions from a particular domain and format, we found that Multi-Meta-RAG greatly improves the results on the MultiHop-RAG benchmark. The code is available at https://github.com/mxpoliakov/Multi-Meta-RAG.
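
A sketch of the two-stage idea under assumptions: an LLM first extracts structured metadata from the multi-hop question (e.g., which sources and dates it refers to), and dense retrieval then runs only over chunks whose metadata matches. extract_metadata_llm(), embed(), and the document schema are placeholders, not the released implementation.

```python
import numpy as np

def extract_metadata_llm(question: str) -> dict:
    """Placeholder: e.g. {'source': ['TechCrunch', 'The Verge'], 'published_after': '2023-10-01'}."""
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    """Placeholder for the dense embedding model used by the vector store."""
    raise NotImplementedError

def retrieve(question: str, docs: list[dict], k: int = 5) -> list[dict]:
    meta = extract_metadata_llm(question)

    def matches(d: dict) -> bool:
        if "source" in meta and d["source"] not in meta["source"]:
            return False
        if "published_after" in meta and d["date"] < meta["published_after"]:
            return False
        return True

    candidates = [d for d in docs if matches(d)]                       # stage 1: metadata filter
    q = embed(question)
    candidates.sort(key=lambda d: -float(q @ embed(d["text"])))        # stage 2: dense retrieval
    return candidates[:k]
```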

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.