Week Ending 6.2.2024
RESEARCH WATCH: 6.2.2024
MALT: Multi-scale Action Learning Transformer for Online Action Detection
Online action detection aims to identify ongoing actions from streaming video in real time. This has applications in surveillance, human-computer interaction, and assistive technologies. The proposed MALT model captures multi-scale action features and filters irrelevant frames efficiently, improving performance on benchmark datasets.
Authors: Zhipeng Yang, Ruoyu Wang, Yang Tan, Liping Xie
Link: https://arxiv.org/abs/2405.20892v1
Date: 2024-05-31
Summary:
Online action detection (OAD) aims to identify ongoing actions from streaming video in real time, without access to future frames. Since these actions manifest at varying scales of granularity, ranging from coarse to fine, projecting an entire set of action frames to a single latent encoding may result in a lack of local information, necessitating the acquisition of action features across multiple scales. In this paper, we propose a multi-scale action learning transformer (MALT), which includes a novel recurrent decoder (used for feature fusion) with fewer parameters that can be trained more efficiently. A hierarchical encoder with multiple encoding branches is further proposed to capture multi-scale action features. The output of each preceding branch is incrementally fed into the subsequent branch as part of a cross-attention calculation, so that output features transition from coarse to fine as the branches deepen. We also introduce an explicit frame scoring mechanism employing sparse attention, which filters irrelevant frames more efficiently without requiring an additional network. The proposed method achieved state-of-the-art performance on two benchmark datasets (THUMOS'14 and TVSeries), outperforming all existing models used for comparison, with gains of 0.2% mAP on THUMOS'14 and 0.1% mcAP on TVSeries.
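The hierarchical coarse-to-fine encoder and the frame-scoring filter lend themselves to a compact sketch. Below is a minimal PyTorch illustration based only on this summary; the branch count, dimensions, and median-based filtering rule are assumptions, not the authors' configuration.

```python
# Minimal sketch of MALT's hierarchical-encoder idea: each branch refines
# the previous branch's output via cross-attention over the frame stream,
# while a learned frame score masks out likely-irrelevant frames.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, dim=256, num_branches=3, num_heads=4):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_branches)]
        )
        self.frame_scorer = nn.Linear(dim, 1)   # explicit frame scoring

    def forward(self, frames):                  # frames: (batch, time, dim)
        # Keep only the higher-scoring half of the frames (a simple stand-in
        # for the paper's sparse-attention relevance filter).
        scores = self.frame_scorer(frames).squeeze(-1)              # (B, T)
        mask = scores < scores.median(dim=1, keepdim=True).values   # True = drop

        out = frames
        for branch in self.branches:
            # The previous branch's output queries the filtered frame
            # sequence, so features transition from coarse to fine.
            out, _ = branch(out, frames, frames, key_padding_mask=mask)
        return out, scores

x = torch.randn(2, 64, 256)                     # 2 streams, 64 frames each
features, frame_scores = HierarchicalEncoder()(x)
print(features.shape)                           # torch.Size([2, 64, 256])
```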
--------------------------------------------------------------------------------------------------------
GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning
Knowledge graphs represent human knowledge as triplets forming a graph. Question answering over knowledge graphs is an important task for leveraging this structured data. The proposed GNN-RAG method combines graph neural networks with large language models, achieving state-of-the-art performance on benchmarks by effectively reasoning over the graph structure.
Authors: Costas Mavromatis, George Karypis
Link: https://arxiv.org/abs/2405.20139v1
Date: 2024-05-30
Summary:
Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural-language questions by grounding the reasoning in the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG. In this work, we introduce GNN-RAG, a novel method for combining the language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and answer candidates are extracted to represent KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for the final KGQA. Furthermore, we develop a retrieval augmentation (RA) technique to further boost KGQA performance with GNN-RAG. Experimental results show that GNN-RAG achieves state-of-the-art performance on two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a tuned 7B LLM. In addition, GNN-RAG excels on multi-hop and multi-entity questions, outperforming competing approaches by 8.9–15.5 percentage points in answer F1.
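The retrieve-then-verbalize step is concrete enough to sketch. Here is a minimal illustration with networkx, assuming a toy KG and a plain "head -> relation -> tail" template; the GNN candidate scoring is elided.

```python
# Sketch of GNN-RAG's path extraction: after a GNN proposes answer
# candidates, extract shortest KG paths from question entities to the
# candidates and verbalize them for the LLM prompt.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Jamaica", "Kingston", relation="capital")
kg.add_edge("Kingston", "Bob_Marley", relation="birthplace_of")

def verbalize_paths(graph, question_entities, answer_candidates):
    paths = []
    for q in question_entities:
        for a in answer_candidates:
            try:
                nodes = nx.shortest_path(graph, q, a)
            except nx.NetworkXNoPath:
                continue
            # Render each hop as "head -> relation -> tail".
            hops = [f"{u} -> {graph[u][v]['relation']} -> {v}"
                    for u, v in zip(nodes, nodes[1:])]
            paths.append(" ; ".join(hops))
    return "\n".join(paths)

context = verbalize_paths(kg, ["Jamaica"], ["Bob_Marley"])
print(context)
# Jamaica -> capital -> Kingston ; Kingston -> birthplace_of -> Bob_Marley
prompt = f"Reasoning paths:\n{context}\nQuestion: Who was born in the capital of Jamaica?"
```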
--------------------------------------------------------------------------------------------------------
Large language models struggle in low-resource languages due to lack of quality training data. This work proposes a method to construct cross-lingual instruction data, allowing language models to better understand and follow instructions in low-resource languages. This can improve accessibility and utility of language AI across diverse languages.
Authors: Chong Li, Wen Yang, Jiajun Zhang, Jinliang Lu, Shaonan Wang, Chengqing Zong
Link: https://arxiv.org/abs/2405.19744v1
Date: 2024-05-30
Summary:
Large language models respond well in high-resource languages like English but struggle in low-resource languages, likely due to the lack of high-quality instruction-following data in these languages. Directly translating English samples into these languages can be a solution, but it is unreliable, leading to responses with translation errors that lack language-specific or cultural knowledge. To address this issue, we propose a novel method to construct cross-lingual instruction-following samples with the instruction in English and the response in a low-resource language. Specifically, the language model first learns to generate appropriate English instructions for natural web texts in other languages, which serve as responses. The candidate cross-lingual instruction-tuning samples are then refined and diversified. We have employed this method to build a large-scale cross-lingual instruction-tuning dataset covering 10 languages, named X-Instruction. The instruction data built using our method incorporate more language-specific knowledge than the naive translation method. Experimental results show that the response quality of the model tuned on X-Instruction greatly exceeds that of a model distilled from a powerful teacher model, reaching or even surpassing that of ChatGPT. In addition, we find that models tuned on cross-lingual instruction-following samples can follow instructions in the output language without further tuning.
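The construction loop (web text as response, generated English instruction, then filtering) can be sketched as follows; the prompt wording, stub generator, and length filter are illustrative assumptions, not the paper's pipeline.

```python
# Hedged sketch of the X-Instruction construction idea: a web text in the
# target language is treated as the *response*, and a model writes the
# English *instruction* it would answer.
def generate(prompt: str) -> str:
    # Stub standing in for the instruction-generation model.
    return "Summarize the main argument of this article in one paragraph."

def build_cross_lingual_samples(web_texts, min_len=20):
    samples = []
    for text in web_texts:
        instruction = generate(
            "Write the English instruction that the following text "
            f"best answers:\n\n{text}"
        )
        # Refinement stage, heavily simplified: drop degenerate pairs.
        if len(instruction) >= min_len and instruction not in text:
            samples.append({"instruction": instruction, "response": text})
    return samples

data = build_cross_lingual_samples(["<natural web text in Swahili>"])
print(data[0]["instruction"])
```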
--------------------------------------------------------------------------------------------------------
Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding
Vision-language models can assist clinicians by analyzing medical images and engaging in dialog. However, ensuring models are properly grounded in clinical knowledge is challenging. Dr-LLaVA uses symbolic medical representations to create tuning data and rewards, aligning the model with expert reasoning without costly human involvement.
Authors: Shenghuan Sun, Gregory M. Goldgof, Alexander Schubert, Zhiqing Sun, Thomas Hartvigsen, Atul J. Butte, Ahmed Alaa
Link: https://arxiv.org/abs/2405.19567v1
Date: 2024-05-29
Summary:
Vision-Language Models (VLMs) can support clinicians by analyzing medical images and engaging in natural language interactions to assist in diagnostic and treatment tasks. However, VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information. This challenge is particularly pronounced in the medical domain, where we require VLM outputs not only to be accurate in single interactions but also to be consistent with clinical reasoning and diagnostic pathways throughout multi-turn conversations. For this purpose, we propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge. These representations are utilized to (i) generate GPT-4-guided visual instruction tuning data at scale, simulating clinician-VLM conversations with demonstrations of clinical reasoning, and (ii) create an automatic reward function that evaluates the clinical validity of VLM generations throughout clinician-VLM interactions. Our algorithm eliminates the need for human involvement in training data generation or reward model construction, reducing costs compared to standard reinforcement learning with human feedback (RLHF). We apply our alignment algorithm to develop Dr-LLaVA, a conversational VLM finetuned for analyzing bone marrow pathology slides, demonstrating strong performance in multi-turn medical conversations.
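One way to picture the symbolic reward is as a check of each conversation turn against an ordered diagnostic pathway. The sketch below is an invented illustration of that idea; the pathway labels and scoring weights are not the authors' clinical rules.

```python
# Toy automatic reward: a diagnostic pathway is an ordered list of claims,
# and a conversation earns reward for staying consistent with it.
PATHWAY = ["adequate_image", "cell_morphology_assessed", "blast_count", "diagnosis"]

def clinical_reward(turn_labels):
    """turn_labels: symbolic label extracted from each VLM turn, in order."""
    reward, expected = 0.0, 0
    for label in turn_labels:
        if expected < len(PATHWAY) and label == PATHWAY[expected]:
            reward += 1.0          # turn follows the pathway
            expected += 1
        else:
            reward -= 0.5          # out-of-order or ungrounded claim
    return reward / len(PATHWAY)

print(clinical_reward(["adequate_image", "blast_count"]))  # skipped step -> penalized
```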
--------------------------------------------------------------------------------------------------------
Developing decision support systems for complex urban problems requires extensive domain expertise. This work leverages large language models to automatically generate ontologies from research articles and technical documents. The generated knowledge graphs can enhance data modeling, simulation coupling, and workflow for urban systems like freight transportation optimization.
Authors: Jose Tupayachi, Haowen Xu, Olufemi A. Omitaomu, Mustafa Can Camur, Aliza Sharmin, Xueping Li
Link: https://arxiv.org/abs/2405.19255v1
Date: 2024-05-29
Summary:
The incorporation of Artificial Intelligence (AI) models into various optimization systems is on the rise. Yet, addressing complex urban and environmental management problems normally requires in-depth domain science and informatics expertise. This expertise is essential for deriving data- and simulation-driven insights for informed decision support. In this context, we investigate the potential of leveraging pre-trained Large Language Models (LLMs). By adopting the ChatGPT API as the reasoning core, we outline an integrated workflow that encompasses natural language processing, Methontology-based prompt tuning, and transformers. This workflow automates the creation of scenario-based ontologies from existing research articles and technical manuals of urban datasets and simulations. The outcomes of our methodology are knowledge graphs in widely adopted ontology languages (e.g., OWL, RDF, SPARQL). These facilitate the development of urban decision support systems by enhancing data and metadata modeling, the integration of complex datasets, the coupling of multi-domain simulation models, and the formulation of decision-making metrics and workflows. The feasibility of our methodology is evaluated through a comparative analysis that juxtaposes our AI-generated ontology with the well-known Pizza Ontology employed in tutorials for popular ontology software (e.g., Protégé). We close with a real-world case study of optimizing the complex urban system of multi-modal freight transportation by generating ontologies of various domain data and simulations to support informed decision-making.
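The final step, turning LLM output into a machine-checkable ontology, might look like the following sketch, where a canned stub stands in for the ChatGPT API call and rdflib validates the emitted Turtle; the freight classes are illustrative only.

```python
# Sketch: an LLM emits Turtle triples for a scenario; rdflib parses them
# into a knowledge graph and re-serializes to RDF/XML for downstream tools.
from rdflib import Graph

def llm_generate_turtle(scenario: str) -> str:
    # Placeholder for the ChatGPT-API call described in the paper.
    return """
    @prefix ex: <http://example.org/freight#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    ex:Terminal a rdfs:Class .
    ex:TransportLeg a rdfs:Class .
    ex:RailLeg a rdfs:Class ; rdfs:subClassOf ex:TransportLeg .
    """

g = Graph()
g.parse(data=llm_generate_turtle("multi-modal freight"), format="turtle")
print(len(g), "triples parsed")          # basic sanity check on LLM output
print(g.serialize(format="xml")[:80])    # RDF/XML view of the ontology
```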
--------------------------------------------------------------------------------------------------------
Verifiably Robust Conformal Prediction
Conformal prediction provides statistically valid prediction sets but can be vulnerable to adversarial attacks. This work introduces a framework that leverages neural network verification to recover coverage guarantees under adversarial perturbations, supporting arbitrary norms and regression tasks and improving robustness over existing approaches.
Authors: Linus Jeary, Tom Kuipers, Mehran Hosseini, Nicola Paoletti
Link: https://arxiv.org/abs/2405.18942v1
Date: 2024-05-29
Summary:
Conformal Prediction (CP) is a popular uncertainty quantification method that provides distribution-free, statistically valid prediction sets, assuming that training and test data are exchangeable. In such a case, CP's prediction sets are guaranteed to cover the (unknown) true test output with a user-specified probability. Nevertheless, this guarantee is violated when the data is subjected to adversarial attacks, which often result in a significant loss of coverage. Recently, several approaches have been put forward to recover CP guarantees in this setting. These approaches leverage variations of randomised smoothing to produce conservative sets which account for the effect of the adversarial perturbations. They are, however, limited in that they only support $\ell^2$-bounded perturbations and classification tasks. This paper introduces VRCP (Verifiably Robust Conformal Prediction), a new framework that leverages recent neural network verification methods to recover coverage guarantees under adversarial attacks. Our VRCP method is the first to support perturbations bounded by arbitrary norms including $\ell^1$, $\ell^2$, and $\ell^\infty$, as well as regression tasks. We evaluate and compare our approach on image classification tasks (CIFAR10, CIFAR100, and TinyImageNet) and regression tasks for deep reinforcement learning environments. In every case, VRCP achieves above nominal coverage and yields significantly more efficient and informative prediction regions than the SotA.
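The core recipe, calibrate as usual, then evaluate test-time scores under verified logit bounds, can be sketched in a few lines. Here a toy linear model makes the $\ell^\infty$ bounds exact; for a real network, `verified_bounds` would be an interface to a verification tool, and the margin score is one possible choice, not necessarily the paper's.

```python
# Split-conformal calibration plus verified worst-case scores at test time.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))                  # toy linear classifier, 3 classes

def verified_bounds(x, eps):
    # Assumed verifier interface: sound per-class logit bounds over the
    # l_inf ball of radius eps around x. For this linear model the bounds
    # are exact, since |w_k . delta| <= eps * ||w_k||_1.
    z = x @ W
    slack = eps * np.abs(W).sum(axis=0)
    return z - slack, z + slack

# 1) Calibration: standard conformal margin scores on clean data.
X_cal = rng.standard_normal((200, 5))
y_cal = rng.integers(0, 3, 200)
Z = X_cal @ W
scores = np.array([np.delete(z, y).max() - z[y] for z, y in zip(Z, y_cal)])
n, alpha = len(scores), 0.1
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# 2) Inference: include a label only if even its worst-case margin over
#    the perturbation ball stays within the calibration quantile.
def vrcp_set(x, eps=0.05):
    lo, hi = verified_bounds(x, eps)
    return [y for y in range(3) if np.delete(hi, y).max() - lo[y] <= q]

print(vrcp_set(rng.standard_normal(5)))
```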
--------------------------------------------------------------------------------------------------------
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning
Recent text-to-music models enable editing capabilities like changing styles or instrument stems. However, training specific models for each editing task is inefficient. Instruct-MusicGen finetunes a pretrained music model to follow text editing instructions for various tasks using a parameter-efficient architecture.
Authors: Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon
Link: https://arxiv.org/abs/2405.18386v2
Date: 2024-05-29
Summary:
Recent advances in text-to-music editing, which employ text queries to modify music (e.g., by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To combine the strengths and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach involves a modification of the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen introduces only 8% additional parameters to the original MusicGen model and trains for only 5K steps, yet it achieves superior performance across all tasks compared to existing baselines and demonstrates performance comparable to models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.
--------------------------------------------------------------------------------------------------------
Cognitive Insights and Stable Coalition Matching for Fostering Multi-Agent Cooperation
Theory of mind and cognitive abilities facilitate cooperation in human social interactions. This work explores leveraging theory of mind in multi-agent AI systems through a coalition matching mechanism that accounts for agents' belief alignment and specialized skills to foster cooperative behavior.
Authors: Jiaqi Shao, Tianjun Yuan, Tao Lin, Xuanyu Cao, Bing Luo
Link: https://arxiv.org/abs/2405.18044v1
Date: 2024-05-28
Summary:
Cognitive abilities, such as Theory of Mind (ToM), play a vital role in facilitating cooperation in human social interactions. However, our study reveals that agents with higher ToM abilities may not necessarily exhibit better cooperative behavior than those with lower ToM abilities. To address this challenge, we propose a novel coalition matching mechanism that leverages the strengths of agents with different ToM levels by explicitly considering belief alignment and specialized abilities when forming coalitions. Our matching algorithm seeks stable coalitions that maximize the potential for cooperative behavior and ensure long-term viability. By incorporating cognitive insights into the design of multi-agent systems, our work demonstrates the potential of leveraging ToM to create more sophisticated and human-like coordination strategies that foster cooperation and improve overall system performance.
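A toy version of belief-aligned coalition formation: coalition value rewards shared beliefs plus complementary ToM levels, and a greedy pairing stands in for the paper's stable matching algorithm. The utility form and agent data are assumptions for illustration.

```python
# Greedy pairwise coalition formation over a belief-alignment utility.
import itertools
import numpy as np

agents = {
    "a": {"tom": 2, "beliefs": np.array([1.0, 0.0])},
    "b": {"tom": 0, "beliefs": np.array([0.9, 0.1])},
    "c": {"tom": 1, "beliefs": np.array([0.0, 1.0])},
    "d": {"tom": 2, "beliefs": np.array([0.1, 0.9])},
}

def value(i, j):
    ai, aj = agents[i], agents[j]
    alignment = float(ai["beliefs"] @ aj["beliefs"])   # shared beliefs
    complement = abs(ai["tom"] - aj["tom"])            # mixed ToM levels
    return alignment + 0.5 * complement

# Repeatedly form the highest-value pair; with symmetric utilities this
# greedy matching leaves no pair that would both prefer to defect.
pairs, free = [], set(agents)
while len(free) > 1:
    i, j = max(itertools.combinations(free, 2), key=lambda p: value(*p))
    pairs.append((i, j))
    free -= {i, j}
print(pairs)
```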
--------------------------------------------------------------------------------------------------------
The Evolution of Multimodal Model Architectures
This paper categorizes four prevalent architectural patterns for integrating multimodal inputs in neural models. Systematically understanding these architectures can guide selection for different data modalities and model capabilities as multimodal AI continues advancing.
Authors: Shakti N. Wadekar, Abhishek Chaurasia, Aman Chadha, Eugenio Culurciello
Link: https://arxiv.org/abs/2405.17927v1
Date: 2024-05-28
Summary:
This work uniquely identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain. Distinct from recent survey papers that present general information on multimodal architectures, this research conducts a comprehensive exploration of architectural details and identifies four specific architectural types. The types are distinguished by their respective methodologies for integrating multimodal inputs into the deep neural network model. The first two types (Types A and B) deeply fuse multimodal inputs within the internal layers of the model, whereas the following two types (Types C and D) facilitate early fusion at the input stage. Type-A employs standard cross-attention, whereas Type-B utilizes custom-designed layers for modality fusion within the internal layers. On the other hand, Type-C utilizes modality-specific encoders, while Type-D leverages tokenizers to process the modalities at the model's input stage. The identified architecture types aid the monitoring of any-to-any multimodal model development. Notably, Type-C and Type-D are currently favored in the construction of any-to-any multimodal models. Type-C, distinguished by its non-tokenizing multimodal model architecture, is emerging as a viable alternative to Type-D, which utilizes input-tokenizing techniques. To assist in model selection, this work highlights the advantages and disadvantages of each architecture type based on data and compute requirements, architecture complexity, scalability, simplification of adding modalities, training objectives, and any-to-any multimodal generation capability.
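A shape-level sketch of Type-A fusion, standard cross-attention inside an internal layer, helps fix the taxonomy: under Type-D the same two modalities would instead be tokenized and concatenated at the input. Dimensions here are arbitrary, not drawn from any surveyed model.

```python
# Type-A style block: deep fusion via cross-attention inside the network.
import torch
import torch.nn as nn

class TypeABlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, image):
        # Text attends to itself, then to image features: fusion happens
        # inside the layer, not at the input stage.
        text = self.norm1(text + self.self_attn(text, text, text)[0])
        text = self.norm2(text + self.cross_attn(text, image, image)[0])
        return text

txt, img = torch.randn(1, 16, 256), torch.randn(1, 49, 256)
print(TypeABlock()(txt, img).shape)   # torch.Size([1, 16, 256])
```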
--------------------------------------------------------------------------------------------------------
Robust Perception and Navigation of Autonomous Surface Vehicles in Challenging Environments
Autonomous surface vehicles with AI offer advantages for environmental monitoring over traditional methods. However, coastal areas present challenges like obstacles, uncharted areas, and accessibility. This research develops robust perception, navigation and decision-making for autonomous monitoring in complex maritime environments.
Authors: Mingi Jeong
Link: https://arxiv.org/abs/2405.17657v1
Date: 2024-05-27
Summary:
Research on coastal regions traditionally involves methods like manual sampling, monitoring buoys, and remote sensing, but these methods face challenges in spatially and temporally diverse regions of interest. Autonomous surface vehicles (ASVs) with artificial intelligence (AI) are being explored and are recognized by the International Maritime Organization (IMO) as vital for future ecosystem understanding. However, there is not yet a mature technology for autonomous environmental monitoring, owing to typically complex coastal situations: (1) many static (e.g., buoys, docks) and dynamic (e.g., boats) obstacles not compliant with the rules of the road (COLREGs); (2) uncharted or uncertain information (e.g., non-updated nautical charts); and (3) high-cost ASVs that are inaccessible to the community and citizen science, contributing to technology illiteracy. To address the above challenges, my research involves both system and algorithmic development: (1) a robotic boat system for stable and reliable in-water monitoring, (2) maritime perception to detect and track obstacles (such as buoys and boats), and (3) navigational decision-making with multiple-obstacle avoidance and multi-objective optimization.
--------------------------------------------------------------------------------------------------------
Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection
While text-to-image diffusion models excel at image generation and editing, applying them across diverse modalities like 3D scenes and videos often requires separate models. This unified framework combines self-attention injection techniques to enable editing panoramas, 3D scenes, and videos using a single 2D image model.
Authors: Gihyun Kwon, Jangho Park, Jong Chul Ye
Link: https://arxiv.org/abs/2405.16823v1
Date: 2024-05-27
Summary:
While text-to-image models have achieved impressive capabilities in image generation and editing, their application across various modalities often necessitates training separate models. Inspired by existing methods of single-image editing with self-attention injection and video editing with shared attention, we propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D image text-to-image (T2I) diffusion model. Specifically, we design a sampling method that facilitates editing consecutive images while maintaining semantic consistency, by utilizing shared self-attention features during both reference and consecutive image sampling processes. Experimental results confirm that our method enables editing across diverse modalities, including 3D scenes, videos, and panorama images.
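The shared self-attention idea can be illustrated generically: cache keys and values from the reference image's sampling pass and let consecutive images attend over them too. This sketch is a bare attention module, not the authors' diffusion pipeline.

```python
# Generic shared self-attention: consecutive samples attend over cached
# reference keys/values, encouraging semantic consistency across images.
import torch
import torch.nn.functional as F

class SharedSelfAttention(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, dim * 3)
        self.cache = None                        # K/V from the reference pass

    def forward(self, x, share=False):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if share and self.cache is not None:
            ref_k, ref_v = self.cache
            # Attend over reference *and* current tokens jointly.
            k = torch.cat([ref_k, k], dim=1)
            v = torch.cat([ref_v, v], dim=1)
        else:
            self.cache = (k.detach(), v.detach())
        attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v

attn = SharedSelfAttention()
ref = attn(torch.randn(1, 10, 64))               # reference sampling step
nxt = attn(torch.randn(1, 10, 64), share=True)   # consecutive image reuses K/V
```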
--------------------------------------------------------------------------------------------------------
Existing methods for controllable image animation via diffusion models are often limited to specific motion domains or lack fine control. MOFA-Video proposes motion field adapters that generate dense flows from sparse control signals like landmarks or trajectories, enabling controllable animation across various conditions.
Authors: Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng
Link: https://arxiv.org/abs/2405.20222v1
Date: 2024-05-30
Summary:
We present MOFA-Video, an advanced controllable image animation method that generates video from a given image using various additional control signals (such as reference human landmarks, manual trajectories, or even another provided video) or their combinations. This differs from previous methods, which work only in a specific motion domain or show weak control ability with a diffusion prior. To achieve our goal, we design several domain-aware motion field adapters (i.e., MOFA-Adapters) to control the generated motions in the video generation pipeline. For the MOFA-Adapters, we consider the temporal motion consistency of the video: dense motion flow is first generated from the given sparse control conditions, and then the multi-scale features of the given image are warped as guidance for stable video diffusion generation. We train the two motion adapters for manual trajectories and human landmarks individually, since both contain sparse information about the control. After training, the MOFA-Adapters in different domains can also work together for more controllable video generation.
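The sparse-to-dense step can be pictured with simple Gaussian-weighted interpolation, a hand-rolled stand-in for the learned MOFA-Adapter; the grid size, kernel width, and control points below are arbitrary.

```python
# Densify a handful of sparse motion hints (trajectory/landmark
# displacements) into a per-pixel flow field.
import numpy as np

def densify_flow(points, vectors, h=64, w=64, sigma=8.0):
    """points: (N, 2) pixel coords; vectors: (N, 2) sparse motion hints."""
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(float)
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    wgt = np.exp(-d2 / (2 * sigma ** 2))               # (H*W, N)
    wgt /= wgt.sum(axis=1, keepdims=True) + 1e-8
    flow = wgt @ vectors                               # dense (H*W, 2) field
    return flow.reshape(h, w, 2)

pts = np.array([[16, 16], [48, 48]], float)
vec = np.array([[2.0, 0.0], [0.0, -3.0]])
dense = densify_flow(pts, vec)
print(dense.shape, dense[16, 16])   # (64, 64, 2); approx [2., 0.] near point 1
```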
--------------------------------------------------------------------------------------------------------
AI Safety: A Climb To Armageddon?
Rather than mitigating existential risk, certain AI safety measures may exacerbate it under key assumptions like inevitability of failure and correlation between power and harm severity. This work examines response strategies and intrinsic challenges in the AI safety landscape.
Authors: Herman Cappelen, Josh Dever, John Hawthorne
Link: https://arxiv.org/abs/2405.19832v1
Date: 2024-05-30
Summary:
This paper presents an argument that certain AI safety measures, rather than mitigating existential risk, may instead exacerbate it. Under certain key assumptions - the inevitability of AI failure, the expected correlation between an AI system's power at the point of failure and the severity of the resulting harm, and the tendency of safety measures to enable AI systems to become more powerful before failing - safety efforts have negative expected utility. The paper examines three response strategies: Optimism, Mitigation, and Holism. Each faces challenges stemming from intrinsic features of the AI safety landscape that we term Bottlenecking, the Perfection Barrier, and Equilibrium Fluctuation. The surprising robustness of the argument forces a re-examination of core assumptions around AI safety and points to several avenues for further research.
--------------------------------------------------------------------------------------------------------
The rise of AI brings challenges for developing relevant education and workforce skills for human-AI collaboration. This paper reviews lifelong learning needs from the perspective of computational thinking competencies to enable effective human-AI interaction serving sustainable development goals.
Authors: Margarida Romero
Link: https://arxiv.org/abs/2405.19837v1
Date: 2024-05-30
Summary:
The rapid advancement of artificial intelligence (AI) has brought significant challenges to the education and workforce skills required to take advantage of AI for human-AI collaboration in the workplace. As AI continues to reshape industries and job markets, the need to define how AI literacy can be considered in lifelong learning has become increasingly critical (Cetindamar et al., 2022; Laupichler et al., 2022; Romero et al., 2023). Like any new technology, AI is the subject of both hopes and fears, and what it entails today presents major challenges (Cugurullo & Acheampong, 2023; Villani et al., 2018). It also raises profound questions about our own humanity. Will the machine surpass the intelligence of the humans who designed it? What will be the relationship between so-called AI and our human intelligences? How could human-AI collaboration be regulated in a way that serves the Sustainable Development Goals (SDGs)? This paper provides a review of the challenges of lifelong learning in the era of AI from a computational thinking, critical thinking, and creative competencies perspective, highlighting the implications for management and leadership in organizations.
--------------------------------------------------------------------------------------------------------
Predicting parking availability in real-time is valuable for reducing traffic congestion, especially in dense cities. This work introduces a new dataset capturing parking data across Singapore along with spatial/temporal factors, and a deep learning approach demonstrating improved availability forecasting.
Authors: Huaiwu Zhang, Yutong Xia, Siru Zhong, Kun Wang, Zekun Tong, Qingsong Wen, Roger Zimmermann, Yuxuan Liang
Link: https://arxiv.org/abs/2405.18910v1
Date: 2024-05-29
Summary:
The increasing number of vehicles highlights the need for efficient parking space management. Predicting real-time Parking Availability (PA) can help mitigate traffic congestion and the corresponding social problems, which is a pressing issue in densely populated cities like Singapore. In this study, we aim to collectively predict future PA across Singapore with complex factors from various domains. The contributions in this paper are listed as follows: (1) A New Dataset: We introduce the SINPA dataset, containing a year's worth of PA data from 1,687 parking lots in Singapore, enriched with various spatial and temporal factors. (2) A Data-Driven Approach: We present DeepPA, a novel deep-learning framework, to collectively and efficiently predict future PA across thousands of parking lots. (3) Extensive Experiments and Deployment: DeepPA demonstrates a 9.2% reduction in prediction error for up to 3-hour forecasts compared to existing advanced models. Furthermore, we implement DeepPA in a practical web-based platform to provide real-time PA predictions to aid drivers and inform urban planning in Singapore. We release the dataset and source code at https://github.com/yoshall/SINPA.
--------------------------------------------------------------------------------------------------------
Prior work showed the feasibility of detecting respiratory insufficiency via speech analysis using COVID-19 patient data. This study collects data from other respiratory conditions, finding that models trained only on COVID-19 do not generalize well, suggesting distinct acoustic signatures across respiratory pathologies.
Authors: Marcelo Matheus Gauy, Larissa Cristina Berti, Arnaldo Cândido Jr, Augusto Camargo Neto, Alfredo Goldman, Anna Sara Shafferman Levin, Marcus Martins, Beatriz Raposo de Medeiros, Marcelo Queiroz, Ester Cerdeira Sabino, Flaviane Romani Fernandes Svartman, Marcelo Finger
Link: https://arxiv.org/abs/2405.17569v1
Date: 2024-05-27
Summary:
This work investigates Artificial Intelligence (AI) systems that detect respiratory insufficiency (RI) by analyzing speech audio, thus treating speech as an RI biomarker. Previous works collected RI data (P1) from COVID-19 patients during the first phase of the pandemic and trained modern AI models, such as CNNs and Transformers, which achieved 96.5% accuracy, showing the feasibility of RI detection via AI. Here, we collect RI patient data (P2) with several causes besides COVID-19, aiming to extend AI-based RI detection. We also collected control data from hospital patients without RI. We show that the considered models, when trained on P1, do not generalize to P2, indicating that COVID-19 RI has features that may not be found in all RI types.
--------------------------------------------------------------------------------------------------------
A Multi-Source Retrieval Question Answering Framework Based on RAG
Traditional retrieval-augmented generation relies on initially retrieved context, which can be erroneous or incomplete. This multi-source framework combines GPT-based retrieval using its broad knowledge with web retrieval for fine-grained knowledge gathering to enhance the quality and accuracy of generated outputs.
Authors: Ridong Wu, Shuhong Chen, Xiangbiao Su, Yuankai Zhu, Yifei Liao, Jianming Wu
Link: https://arxiv.org/abs/2405.19207v1
Date: 2024-05-29
Summary:
With the rapid development of large-scale language models, Retrieval-Augmented Generation (RAG) has been widely adopted. However, existing RAG paradigms are inevitably influenced by erroneous retrieval information, reducing the reliability and correctness of generated results. To improve the relevance of retrieved information, this study proposes a method that replaces traditional retrievers with GPT-3.5, leveraging its vast corpus knowledge to generate retrieval information. We also propose a web-retrieval-based method for fine-grained knowledge retrieval, utilizing the reasoning capability of GPT-3.5 to perform semantic partitioning of the problem. To mitigate hallucination in GPT retrieval and reduce noise in web retrieval, we propose a multi-source retrieval framework, named MSRAG, which combines GPT retrieval with web retrieval. Experiments on multiple knowledge-intensive QA datasets demonstrate that the proposed framework outperforms existing RAG frameworks in enhancing the overall efficiency and accuracy of QA systems.
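The multi-source layout reduces to a small orchestration loop. In this sketch, `llm` and `web_search` are placeholder stubs (not a specific API), and the semantic-partitioning step is a single prompt.

```python
# One context comes from the LLM's parametric knowledge ("GPT retrieval"),
# one from web search; both are merged into the generation prompt.
def llm(prompt: str) -> str:
    return "stub: parametric-knowledge passage"        # plug in GPT-3.5 here

def web_search(query: str) -> list[str]:
    return ["stub: web snippet 1", "stub: web snippet 2"]

def msrag_answer(question: str) -> str:
    # Source 1: ask the model to *write down* what it knows (GPT retrieval).
    gpt_context = llm(f"List the facts needed to answer: {question}")
    # Source 2: fine-grained web retrieval, after semantic partitioning of
    # the question into sub-questions.
    subqueries = llm(f"Split into sub-questions: {question}").split("\n")
    web_context = [s for q in subqueries for s in web_search(q)]
    context = "\n".join([gpt_context, *web_context])
    return llm(f"Context:\n{context}\n\nAnswer the question: {question}")

print(msrag_answer("When was the first transatlantic cable laid?"))
```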
--------------------------------------------------------------------------------------------------------
Don't Forget to Connect! Improving RAG with Graph-based Reranking
Retrieval results in RAG systems are not always clearly relevant. This work introduces a graph neural network-based reranker to capture connections between documents as well as semantic information from abstract meaning representations, outperforming large language model reranking.
Authors: Jialin Dong, Bahare Fatemi, Bryan Perozzi, Lin F. Yang, Anton Tsitsulin
Link: https://arxiv.org/abs/2405.18414v1
Date: 2024-05-28
Summary:
Retrieval Augmented Generation (RAG) has greatly improved the performance of Large Language Model (LLM) responses by grounding generation with context from existing documents. These systems work well when documents are clearly relevant to a question context. But what about when a document has partial information, or less obvious connections to the context? And how should we reason about connections between documents? In this work, we seek to answer these two core questions about RAG generation. We introduce G-RAG, a reranker based on graph neural networks (GNNs) between the retriever and reader in RAG. Our method combines both connections between documents and semantic information (via Abstract Meaning Representation graphs) to provide a context-informed ranker for RAG. G-RAG outperforms state-of-the-art approaches while having a smaller computational footprint. Additionally, we assess the performance of PaLM 2 as a reranker and find it to significantly underperform G-RAG. This result emphasizes the importance of reranking for RAG even when using Large Language Models.
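A toy rendering of the reranking idea: documents become nodes, an edge appears when their AMR graphs share a concept, and one round of neighbor averaging (a one-layer GNN stand-in) lets connected documents reinforce each other. The features, mixing weights, and overlap test are simplified assumptions.

```python
# Graph-based reranking over retrieved documents.
import numpy as np

docs = ["d0", "d1", "d2"]
amr_concepts = [{"treaty", "sign"}, {"treaty", "date"}, {"recipe"}]
relevance = np.array([0.6, 0.5, 0.55])      # initial retriever scores

# Edge if two documents' AMR graphs share at least one concept.
A = np.array([[len(amr_concepts[i] & amr_concepts[j]) > 0 and i != j
               for j in range(3)] for i in range(3)], dtype=float)
deg = A.sum(1) + 1e-8

# One message-passing step: mix each document's score with its neighbors'.
reranked = 0.5 * relevance + 0.5 * (A @ relevance) / deg
order = np.argsort(-reranked)
print([docs[i] for i in order])   # d2 falls to last: no connections support it
```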
--------------------------------------------------------------------------------------------------------
Speech Loudness in Broadcasting and Streaming
Loudness regulation has improved consistency but created issues like excessively low speech levels impacting intelligibility. This work proposes deep learning-based methods to isolate and analyze speech loudness, defining "critical passages" with likely intelligibility issues to guide content production and personalized user experiences.
Authors: Matteo Torcoli, Mhd Modar Halimeh, Thomas Leitz, Yannik Grewe, Michael Kratschmer, Bernhard Neugebauer, Adrian Murtaza, Harald Fuchs, Emanuël A. P. Habets
Link: https://arxiv.org/abs/2405.17364v1
Date: 2024-05-27
Summary:
The introduction and regulation of loudness in broadcasting and streaming brought clear benefits to the audience, e.g., a level of uniformity across programs and channels. Yet, speech loudness is frequently reported as being too low in certain passages, which can hinder the full understanding and enjoyment of movies and TV programs. This paper proposes expanding the set of loudness-based measures typically used in the industry. We focus on speech loudness, and we show that, when clean speech is not available, Deep Neural Networks (DNNs) can be used to isolate the speech signal and thus estimate speech loudness more precisely than speech-gated loudness. Moreover, we define critical passages, i.e., passages in which speech is likely to be hard to understand. Critical passages are defined based on the local Speech Loudness Deviation (SLD) and the local Speech-to-Background Loudness Difference (SBLD), as SLD and SBLD significantly contribute to intelligibility and listening effort. In contrast to other more comprehensive measures of intelligibility and listening effort, SLD and SBLD can be straightforwardly measured, are intuitive, and, most importantly, can be easily controlled by adjusting the speech level in the mix or by enabling personalization at the user's end. Finally, examples are provided that show how the detection of critical passages can support the evaluation and control of the speech signal during and after content production.
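Since SLD and SBLD are straightforward loudness differences, the critical-passage check is easy to sketch. The thresholds below are illustrative, not the paper's values, and the per-passage loudness measurements are assumed to come from an upstream DNN speech separator.

```python
# Flag critical passages from per-passage speech/background loudness.
import numpy as np

speech_lufs = np.array([-25.0, -24.0, -33.0, -23.5])      # per-passage speech
background_lufs = np.array([-35.0, -30.0, -31.0, -36.0])  # per-passage background

overall_speech = -24.0                   # programme-level speech loudness
sld = speech_lufs - overall_speech       # Speech Loudness Deviation (LU)
sbld = speech_lufs - background_lufs     # Speech-to-Background Difference (LU)

# Critical: speech dips well below its norm AND barely clears the
# background; both conditions drive intelligibility and listening effort.
critical = (sld < -6.0) & (sbld < 5.0)
for i in np.flatnonzero(critical):
    print(f"passage {i}: SLD={sld[i]:+.1f} LU, SBLD={sbld[i]:+.1f} LU")
```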
--------------------------------------------------------------------------------------------------------
Deep Reinforcement Learning for Intrusion Detection in IoT: A Survey
Defending IoT networks against new complex attacks requires advanced intrusion detection systems. This survey comprehensively reviews deep reinforcement learning techniques applied to IoT intrusion detection across wireless sensor networks, healthcare, and other domains, analyzing performance metrics and utilized datasets.
Authors: Afrah Gueriani, Hamza Kheddar, Ahmed Cherif Mazari
Link: https://arxiv.org/abs/2405.20038v1
Date: 2024-05-30
Summary:
The rise of new, complex attack scenarios in Internet of Things (IoT) environments necessitates more advanced and intelligent cyber defense techniques, such as Intrusion Detection Systems (IDSs), which are responsible for detecting and mitigating malicious activities in IoT networks without human intervention. To address this issue, deep reinforcement learning (DRL) has been proposed in recent years to automatically tackle intrusions and attacks. In this paper, a comprehensive survey of DRL-based IDSs for IoT is presented. Furthermore, in this survey, the state-of-the-art DRL-based IDS methods are classified into five categories: wireless sensor network (WSN), deep Q-network (DQN), healthcare, hybrid, and other techniques. In addition, the most crucial performance metrics, namely accuracy, recall, precision, false negative rate (FNR), false positive rate (FPR), and F-measure, are detailed in order to evaluate the performance of each proposed method. The paper also summarizes the datasets utilized in the surveyed studies.
--------------------------------------------------------------------------------------------------------