Eye On AI

Week Ending 7.21.2024

RESEARCH WATCH: 7.21.2024

NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Language Learning and Group Communication

This paper introduces an extended framework for simulating language learning and group communication using neural network agents. Building on the previous NeLLCom framework, NeLLCom-X incorporates more realistic role-alternating agents and group communication. This allows researchers to investigate the interplay between language learnability, communication pressures, and group size effects. The framework has been validated by replicating key findings from prior research on the emergence of word-order and case-marking trade-offs. NeLLCom-X opens up possibilities for future simulations of various linguistic aspects, emphasizing the importance of interaction and group dynamics in language evolution. This research could have applications in understanding language development and creating more natural language processing systems.

Authors:  Yuchen Lian, Tessa Verhoef, Arianna Bisazza

Link:  https://arxiv.org/abs/2407.13999v1

Date: 2024-07-19

Summary:

Recent advances in computational linguistics include simulating the emergence of human-like languages with interacting neural network agents, starting from sets of random symbols. The recently introduced NeLLCom framework (Lian et al., 2023) allows agents to first learn an artificial language and then use it to communicate, with the aim of studying the emergence of specific linguistic properties. We extend this framework (NeLLCom-X) by introducing more realistic role-alternating agents and group communication in order to investigate the interplay between language learnability, communication pressures, and group-size effects. We validate NeLLCom-X by replicating key findings from prior research simulating the emergence of a word-order/case-marking trade-off. Next, we investigate how interaction affects linguistic convergence and the emergence of the trade-off. The novel framework facilitates future simulations of diverse linguistic aspects, emphasizing the importance of interaction and group dynamics in language evolution.
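
For intuition, here is a minimal Python sketch of the role-alternating group setup described above, in which randomly paired agents take turns as speaker and listener each round. All class and method names are hypothetical illustrations, not the NeLLCom-X API.

```python
# Minimal sketch of role-alternating group communication, in the spirit of
# NeLLCom-X. Names and structure are hypothetical illustrations.
import random

class Agent:
    def __init__(self, agent_id):
        self.agent_id = agent_id

    def speak(self, meaning):
        # A neural speaker would map a meaning to an utterance; we return a
        # placeholder string so the loop runs end to end.
        return f"utterance-for-{meaning}"

    def listen(self, utterance):
        # A neural listener would decode the utterance back to a meaning.
        return utterance.removeprefix("utterance-for-")

def communication_round(agents, meanings):
    """Pair agents at random; each pair plays both speaker and listener."""
    random.shuffle(agents)
    for a, b in zip(agents[0::2], agents[1::2]):
        for speaker, listener in ((a, b), (b, a)):  # role alternation
            meaning = random.choice(meanings)
            success = listener.listen(speaker.speak(meaning)) == meaning
            # A training signal based on communicative success would be
            # applied here (e.g., a reinforcement-style update).

group = [Agent(i) for i in range(6)]  # group size is an experimental variable
for _ in range(100):
    communication_round(group, meanings=["A", "B", "C"])
```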

--------------------------------------------------------------------------------------------------------

Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems

This paper presents a novel web agent called Agent-E, which introduces several architectural improvements over previous state-of-the-art web agents. These include a hierarchical architecture, a flexible DOM distillation and denoising method, and the concept of "change observation," which guides the agent toward more accurate performance. The researchers evaluated Agent-E on the WebVoyager benchmark dataset, demonstrating superior performance compared to other text and multi-modal web agents. The paper also synthesizes learnings from Agent-E's development into general design principles for agentic systems. This research could have significant implications for developing more efficient and effective AI agents for web navigation and task completion.

Authors:  Tamer Abuelsaad, Deepak Akkil, Prasenjit Dey, Ashish Jagmohan, Aditya Vempaty, Ravi Kokku

Link:  https://arxiv.org/abs/2407.13032v1

Date: 2024-07-17

Summary:

AI agents are changing the way work gets done, in both consumer and enterprise domains. However, the design patterns and architectures for building highly capable agents or multi-agent systems are still developing, and the understanding of the implications of various design choices and algorithms is still evolving. In this paper, we present our work on building a novel web agent, Agent-E (code available at https://github.com/EmergenceAI/Agent-E). Agent-E introduces numerous architectural improvements over prior state-of-the-art web agents, such as a hierarchical architecture, a flexible DOM distillation and denoising method, and the concept of "change observation" to guide the agent towards more accurate performance. We first present the results of an evaluation of Agent-E on the WebVoyager benchmark dataset and show that Agent-E beats other SOTA text and multi-modal web agents on this benchmark in most categories by 10-30%. We then synthesize our learnings from the development of Agent-E into general design principles for developing agentic systems. These include the use of domain-specific primitive skills, the importance of distillation and de-noising of environmental observations, the advantages of a hierarchical architecture, and the role of agentic self-improvement in enhancing agent efficiency and efficacy as the agent gathers experience.
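
The DOM distillation idea lends itself to a short illustration: prune a raw DOM tree down to the elements an agent can act on, discarding presentational noise before handing the page to the model. The dict-based tree and tag whitelist below are simplifying assumptions, not Agent-E's actual implementation.

```python
# Hedged sketch of DOM distillation: keep interactive elements and the
# ancestors needed to reach them, drop purely presentational subtrees.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

def distill(node):
    """Return a pruned copy of `node` keeping only interactive descendants."""
    children = [c for c in (distill(ch) for ch in node.get("children", [])) if c]
    if node.get("tag") in INTERACTIVE_TAGS or children:
        return {"tag": node["tag"],
                "text": node.get("text", ""),
                "children": children}
    return None  # nothing actionable below this node

dom = {"tag": "div", "children": [
    {"tag": "span", "text": "decorative"},
    {"tag": "button", "text": "Submit"},
]}
print(distill(dom))  # keeps the button, drops the span
```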

--------------------------------------------------------------------------------------------------------

A Methodology Establishing Linear Convergence of Adaptive Gradient Methods under PL Inequality

This paper establishes a methodology for proving linear convergence of adaptive gradient methods under the Polyak-Łojasiewicz (PL) inequality. The researchers focus on AdaGrad and Adam, two popular adaptive gradient methods, and demonstrate their linear convergence for smooth cost functions satisfying the PL inequality. The theoretical framework presented follows a simple and unified approach applicable to both batch and stochastic gradients. This research provides a deeper understanding of the convergence properties of these widely used optimization algorithms, potentially leading to more efficient training of machine learning models, particularly in deep learning applications.

Authors:  Kushal Chakrabarti, Mayank Baranwal

Link:  https://arxiv.org/abs/2407.12629v1

Date: 2024-07-17

Summary:

Adaptive gradient-descent optimizers are the standard choice for training neural network models. Despite their faster convergence than gradient descent and remarkable performance in practice, adaptive optimizers are not as well understood as vanilla gradient descent. One reason is that the dynamic update of the learning rate that helps these methods converge faster also makes their analysis intricate. In particular, the simple gradient-descent method converges at a linear rate for a class of optimization problems, whereas the practically faster adaptive gradient methods lack such a theoretical guarantee. The Polyak-Łojasiewicz (PL) inequality is the weakest known condition for which linear convergence of gradient descent and its momentum variants has been proved. Therefore, in this paper, we prove that AdaGrad and Adam, two well-known adaptive gradient methods, converge linearly when the cost function is smooth and satisfies the PL inequality. Our theoretical framework follows a simple and unified approach, applicable to both batch and stochastic gradients, which can potentially be utilized in analyzing linear convergence of other variants of Adam.
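
For reference, the PL inequality and the classical linear-rate bound it yields for gradient descent on an L-smooth function; this is the style of guarantee the paper extends to AdaGrad and Adam:

```latex
% PL inequality: for some \mu > 0 and all x,
\[
  \tfrac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr),
\]
% under which gradient descent with step size 1/L contracts the
% suboptimality gap at a linear rate:
\[
  f(x_{k+1}) - f^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)\bigl(f(x_{k}) - f^{*}\bigr).
\]
```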

--------------------------------------------------------------------------------------------------------

Large Vision-Language Models as Emotion Recognizers in Context Awareness

This paper explores the potential of Large Vision-Language Models (LVLMs) for context-aware emotion recognition (CAER). The researchers investigate three paradigms: fine-tuning LVLMs on CAER datasets, designing zero-shot and few-shot patterns for scenarios with limited data, and incorporating Chain-of-Thought (CoT) to enhance reasoning ability. The study demonstrates that LVLMs can achieve competitive performance in CAER tasks across different paradigms, with particularly strong results in few-shot settings. This research could lead to more accurate and adaptable emotion recognition systems with applications in human-computer interaction, sentiment analysis, and affective computing.

Authors:  Yuxuan Lei, Dingkang Yang, Zhaoyu Chen, Jiawei Chen, Peng Zhai, Lihua Zhang

Link:  https://arxiv.org/abs/2407.11300v1

Date: 2024-07-16

Summary:

Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues. Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images. However, their knowledge is confined to specific training datasets and may reflect the subjective emotional biases of the annotators. Furthermore, acquiring large amounts of labeled data is often challenging in real-world applications. In this paper, we systematically explore the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task from three paradigms: 1) We fine-tune LVLMs on two CAER datasets, which is the most common way to transfer large models to downstream tasks. 2) We design zero-shot and few-shot patterns to evaluate the performance of LVLMs in scenarios with limited data, or even completely unseen scenarios. In this case, a training-free framework is proposed to fully exploit the In-Context Learning (ICL) capabilities of LVLMs. Specifically, we develop an image similarity-based ranking algorithm to retrieve examples; subsequently, the instructions, retrieved examples, and the test example are combined and fed to LVLMs to obtain the corresponding sentiment judgment. 3) To leverage the rich knowledge base of LVLMs, we incorporate Chain-of-Thought (CoT) into our framework to enhance the model's reasoning ability and provide interpretable results. Extensive experiments and analyses demonstrate that LVLMs achieve competitive performance on the CAER task across different paradigms. Notably, the superior performance in few-shot settings indicates the feasibility of LVLMs for accomplishing specific tasks without extensive training.
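
The training-free ICL recipe above reduces to a ranking-and-prompting loop. Below is a minimal numpy sketch of the image similarity-based example retrieval; the embedding model and prompt assembly are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of retrieval for in-context learning: rank labeled examples by
# image-embedding cosine similarity to the test image, keep the top k.
import numpy as np

def top_k_examples(test_emb, example_embs, k=3):
    """Indices of the k most similar candidate in-context examples."""
    a = test_emb / np.linalg.norm(test_emb)
    b = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = b @ a
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
example_embs = rng.normal(size=(100, 512))   # embeddings of labeled images
test_emb = rng.normal(size=512)              # embedding of the test image
idx = top_k_examples(test_emb, example_embs)
# The retrieved examples, the instructions, and the test image are then
# concatenated into the LVLM prompt to elicit a sentiment judgment.
print(idx)
```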

--------------------------------------------------------------------------------------------------------

iHuman: Instant Animatable Digital Humans From Monocular Videos

This paper presents a method for creating animatable 3D digital humans from monocular videos using Gaussian splatting. The researchers developed a novel pipeline that achieves accurate 3D mesh-type modeling of the human body, allowing for better animations. Key aspects of the method include implicit modeling of surface displacements and color spherical harmonics, binding 3D Gaussians to triangular faces of the body template, and a new technique for rendering and supervising normals. The method achieves state-of-the-art results in rendering and 3D reconstruction performance, with significantly faster training times than competitors. This research could revolutionize the creation of personalized 3D avatars for applications in virtual reality, gaming, and digital media production.

Authors:  Pramish Paudel, Anubhav Khanal, Ajad Chhatkuli, Danda Pani Paudel, Jyoti Tandukar

Link:  https://arxiv.org/abs/2407.11174v1

Date: 2024-07-15

Summary:

Personalized 3D avatars require an animatable representation of digital humans. Doing so instantly from monocular videos offers scalability to a broad class of users and wide-scale applications. In this paper, we present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos. Our method utilizes the efficiency of Gaussian splatting to model both 3D geometry and appearance. However, we observed that naively optimizing Gaussian splats results in inaccurate geometry, thereby leading to poor animations. This work demonstrates the need for accurate 3D mesh-type modelling of the human body for animatable digitization through Gaussian splats. This is achieved by developing a novel pipeline that benefits from three key aspects: (a) implicit modelling of the surface's displacements and the color's spherical harmonics; (b) binding of 3D Gaussians to the respective triangular faces of the body template; (c) a novel technique to render normals followed by their auxiliary supervision. Our exhaustive experiments on three different benchmark datasets demonstrate the state-of-the-art results of our method in limited-time settings. In fact, our method is faster by an order of magnitude (in terms of training time) than its closest competitor. At the same time, we achieve superior rendering and 3D reconstruction performance under changes of pose.
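
Point (b), binding 3D Gaussians to triangle faces, can be illustrated in a few lines of numpy: each Gaussian center is expressed in barycentric coordinates of its face, so it follows the body template rigidly when the mesh is animated. Shapes and values below are illustrative, not the paper's code.

```python
# Illustrative sketch of binding Gaussians to mesh faces via barycentric
# coordinates; one Gaussian per face for simplicity.
import numpy as np

def gaussian_centers(vertices, faces, barycentric):
    """vertices: (V,3); faces: (F,3) vertex indices; barycentric: (F,3).
    Returns one Gaussian center per face."""
    tri = vertices[faces]                       # (F, 3, 3) triangle corners
    return np.einsum("fc,fcd->fd", barycentric, tri)

verts = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
faces = np.array([[0, 1, 2], [0, 1, 3]])
bary = np.full((2, 3), 1 / 3)                   # centroids of each face
print(gaussian_centers(verts, faces, bary))
# When `verts` are deformed by the skeleton, the bound Gaussians follow,
# which is what keeps the splats consistent with the animated body.
```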

--------------------------------------------------------------------------------------------------------

BiasScanner: Automatic Detection and Classification of News Bias to Strengthen Democracy

This paper introduces BiasScanner, an application designed to strengthen democracy by helping news consumers scrutinize online articles for bias. The system uses a pre-trained large language model to identify and classify over two dozen types of media bias at the sentence level. BiasScanner is implemented as a lightweight, privacy-respecting browser plug-in that highlights potentially biased sentences, provides explanations for each classification decision, and offers a summary analysis for each article. This tool could play a crucial role in promoting media literacy and combating the spread of disinformation in the digital age.

Authors:  Tim Menzner, Jochen L. Leidner

Link:  https://arxiv.org/abs/2407.10829v1

Date: 2024-07-15

Summary:

The increasing consumption of news online in the 21st century has coincided with increased publication of disinformation, biased reporting, hate speech, and other unwanted Web content. We describe BiasScanner, an application that aims to strengthen democracy by supporting news consumers in scrutinizing the news articles they read online. BiasScanner comprises a server-side pre-trained large language model that identifies biased sentences in news articles and a front-end Web browser plug-in. At the time of writing, BiasScanner can identify and classify more than two dozen types of media bias at the sentence level, making it the most fine-grained model and the only deployed application (automatic system in use) of its kind. It was implemented in a lightweight and privacy-respecting manner, and in addition to highlighting likely biased sentences it also provides explanations for each classification decision as well as a summary analysis for each news article. While prior research has addressed news bias detection, we are not aware of any work that has resulted in a deployed browser plug-in (cf. biasscanner.org for a Web demo).
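
Conceptually, the scanning loop is simple: split an article into sentences, classify each, and surface the flagged ones. The sketch below stands in for that pipeline; classify_sentence is a hypothetical placeholder for the server-side LLM call, and the heuristic and label set shown are illustrative, not BiasScanner's actual taxonomy.

```python
# Conceptual sketch of sentence-level bias scanning.
import re

BIAS_TYPES = ["loaded language", "unsubstantiated claim", "opinionated", "none"]

def classify_sentence(sentence):
    # Hypothetical stand-in for the server-side LLM call; a trivial keyword
    # heuristic so the sketch runs end to end.
    if "reckless" in sentence:
        return ("loaded language", "evaluative adjective presented as fact")
    return ("none", "")

def scan_article(text):
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        label, why = classify_sentence(sentence)
        if label != "none":
            results.append({"sentence": sentence, "bias": label, "why": why})
    return results  # the plug-in highlights these and shows a summary

print(scan_article("The senator's reckless scheme failed. Turnout was 54%."))
```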

--------------------------------------------------------------------------------------------------------

On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers

This paper explores on-device training of deep neural networks (DNNs) for Cortex-M microcontroller units (MCUs). The researchers present a method enabling efficient training of DNNs completely on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. They demonstrate the feasibility of their approach on multiple vision and time-series datasets, providing insights into the trade-offs between training accuracy, memory overhead, energy consumption, and latency on real hardware. This research could lead to more adaptive and efficient edge AI systems, enabling smart devices to learn and improve their performance in real-time without relying on cloud computing resources.

Authors:  Mark Deutel, Frank Hannig, Christopher Mutschler, Jürgen Teich

Link:  https://arxiv.org/abs/2407.10734v1

Date: 2024-07-15

Summary:

On-device training of DNNs allows models to adapt and fine-tune to newly collected data or changing domains while deployed on microcontroller units (MCUs). However, DNN training is a resource-intensive task, making the implementation and execution of DNN training algorithms on MCUs challenging due to low processor speeds, constrained throughput, limited floating-point support, and memory constraints. In this work, we explore on-device training of DNNs for Cortex-M MCUs. We present a method that enables efficient training of DNNs completely in place on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. We demonstrate the feasibility of our approach on multiple vision and time-series datasets and provide insights into the tradeoff between training accuracy, memory overhead, energy, and latency on real hardware.
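
The two ideas named in the abstract, fully quantized training and dynamic partial gradient updates, can be sketched in a few lines: keep weights in int8 with per-tensor scales, and skip the update for layers outside a chosen mask each step. The numpy code below is a simplified illustration, not the paper's Cortex-M implementation.

```python
# Rough sketch: int8 weights with per-tensor scales, plus a per-step mask
# selecting which layers receive a gradient update.
import numpy as np

def quantize(x):
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def fqt_step(layers, grads, lr, update_mask):
    """layers: list of (int8 weights, scale); grads: float gradients;
    update_mask: which layers are updated this step."""
    for i, ((w_q, w_s), g) in enumerate(zip(layers, grads)):
        if not update_mask[i]:
            continue  # skipped layers cost no gradient memory or write traffic
        w = dequantize(w_q, w_s) - lr * g
        layers[i] = quantize(w)

rng = np.random.default_rng(0)
layers = [quantize(rng.normal(size=(4, 4))) for _ in range(3)]
grads = [rng.normal(size=(4, 4)) for _ in range(3)]
fqt_step(layers, grads, lr=0.01, update_mask=[True, False, True])
```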

--------------------------------------------------------------------------------------------------------

Learning Rapid Turning, Aerial Reorientation, and Balancing using Manipulator as a Tail

This paper investigates the use of a 6-DoF manipulator as a tail for quadruped robots to enhance their physical capabilities. The researchers developed a reinforcement learning-based controller for the robot equipped with the manipulator. Experimental results show that robots with manipulators outperform those without in tasks such as rapid turning, aerial reorientation, and balancing. This innovative approach could lead to more versatile and agile quadruped robots with applications in search and rescue, exploration, and industrial automation, combining the benefits of legged locomotion with manipulation capabilities.

Authors:  Insung Yang, Jemin Hwangbo

Link:  https://arxiv.org/abs/2407.10420v1

Date: 2024-07-15

Summary:

In this research, we investigated the innovative use of a manipulator as a tail in quadruped robots to augment their physical capabilities. Previous studies have primarily focused on enhancing various abilities by attaching robotic tails that function solely as tails on quadruped robots. While these tails improve the performance of the robots, they come with several disadvantages, such as increased overall weight and higher costs. To mitigate these limitations, we propose the use of a 6-DoF manipulator as a tail, allowing it to serve both as a tail and as a manipulator. To control this highly complex robot, we developed a controller based on reinforcement learning for the robot equipped with the manipulator. Our experimental results demonstrate that robots equipped with a manipulator outperform those without a manipulator in tasks such as rapid turning, aerial reorientation, and balancing. These results indicate that the manipulator can improve the agility and stability of quadruped robots, similar to a tail, in addition to its manipulation capabilities.

--------------------------------------------------------------------------------------------------------

An Empirical Study of Mamba-based Pedestrian Attribute Recognition

This paper explores the potential of Mamba, a linear complexity model, for pedestrian attribute recognition (PAR) tasks. The researchers design and adapt Mamba into two typical PAR frameworks: a text-image fusion approach and a pure vision Mamba multi-label recognition framework. They also investigate various hybrid Mamba-Transformer variants. The study provides insights into the effectiveness of Mamba for PAR tasks and multi-label recognition, potentially leading to more efficient and accurate systems for applications in surveillance, security, and urban analytics.

Authors:  Xiao Wang, Weizhe Kong, Jiandong Jin, Shiao Wang, Ruichong Gao, Qingchuan Ma, Chenglong Li, Jin Tang

Link:  https://arxiv.org/abs/2407.10374v1

Date: 2024-07-15

Summary:

Current strong pedestrian attribute recognition models are developed based on Transformer networks, which are computationally heavy. Recently proposed models with linear complexity (e.g., Mamba) have garnered significant attention and have achieved a good balance between accuracy and computational cost across a variety of visual tasks. Relevant review articles also suggest that while these models can perform well on some pedestrian attribute recognition datasets, they are generally weaker than the corresponding Transformer models. To further tap into the potential of the novel Mamba architecture for PAR tasks, this paper designs and adapts Mamba into two typical PAR frameworks, i.e., the text-image fusion approach and the pure vision Mamba multi-label recognition framework. It is found that interacting with attribute tags as additional input does not always lead to an improvement; specifically, Vim can be enhanced, but VMamba cannot. This paper further designs various hybrid Mamba-Transformer variants and conducts thorough experimental validations. These experimental results indicate that simply enhancing Mamba with a Transformer does not always lead to performance improvements but yields better results under certain settings. We hope this empirical study can further inspire research in Mamba for PAR, and even extend into the domain of multi-label recognition, through the design of these network structures and comprehensive experimentation. The source code of this work will be released at https://github.com/Event-AHU/OpenPAR.

--------------------------------------------------------------------------------------------------------

How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines

This paper investigates the privacy preservation capabilities of low-frequency audio recording for studying social dynamics in real-world settings. The researchers examine the degree to which low-frequency speech ensures verbal privacy by simulating potential privacy attacks in various noise environments. They also explore the trade-off between voice activity detection performance and privacy preservation. This study provides valuable insights for developing privacy-preserving wearable devices and audio processing techniques, with potential applications in social science research, workplace analytics, and personal privacy protection.

Authors:  Ailin Liu, Pepijn Vunderink, Jose Vargas Quiros, Chirag Raman, Hayley Hung

Link:  https://arxiv.org/abs/2407.13266v1

Date: 2024-07-18

Summary:

Low-frequency audio has been proposed as a promising privacy-preserving modality to study social dynamics in real-world settings. To this end, researchers have developed wearable devices that can record audio at frequencies as low as 1250 Hz to mitigate the automatic extraction of the verbal content of speech that may contain private details. This paper investigates the validity of this hypothesis, examining the degree to which low-frequency speech ensures verbal privacy. It includes simulating a potential privacy attack in various noise environments. Further, it explores the trade-off between the performance of voice activity detection, which is fundamental for understanding social behavior, and privacy preservation. The evaluation incorporates subjective human intelligibility and automatic speech recognition performance, comprehensively analyzing the delicate balance between effective social behavior analysis and preserving verbal privacy.
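
The recording step under study amounts to low-pass filtering speech so that only content below roughly 1250 Hz survives. A minimal sketch using scipy (filter order and library choice are illustrative assumptions):

```python
# Sketch of privacy-preserving low-frequency capture: attenuate everything
# above the cutoff so most verbal content becomes unintelligible.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def to_low_frequency(audio, fs, cutoff_hz=1250, order=8):
    """Keep only the band below cutoff_hz."""
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, audio)

fs = 16_000                                  # typical speech sampling rate
t = np.arange(fs) / fs
speech_like = np.sin(2*np.pi*300*t) + np.sin(2*np.pi*3000*t)
private = to_low_frequency(speech_like, fs)  # 3 kHz component is attenuated
```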

--------------------------------------------------------------------------------------------------------

PASTA: Controllable Part-Aware Shape Generation with Autoregressive Transformers

This paper presents PASTA, an autoregressive transformer architecture for generating high-quality 3D shapes. The model comprises two main components: an autoregressive transformer that generates objects as a sequence of cuboidal primitives, and a blending network that composes these sequences into high-quality meshes. PASTA can perform shape generation from diverse inputs, including partial objects, text, and images, and allows for size-guided generation. This research could have significant implications for 3D content creation in various fields, including computer graphics, virtual reality, and industrial design.

Authors:  Songlin Li, Despoina Paschalidou, Leonidas Guibas

Link:  https://arxiv.org/abs/2407.13677v1

Date: 2024-07-18

Summary:

The increased demand for tools that automate the 3D content creation process has led to tremendous progress in deep generative models that can generate diverse 3D objects of high fidelity. In this paper, we present PASTA, an autoregressive transformer architecture for generating high-quality 3D shapes. PASTA comprises two main components: an autoregressive transformer that generates objects as a sequence of cuboidal primitives, and a blending network, implemented with a transformer decoder, that composes the sequences of cuboids and synthesizes high-quality meshes for each object. Our model is trained in two stages: first we train our autoregressive generative model using only annotated cuboidal parts as supervision, and next we train our blending network using explicit 3D supervision in the form of watertight meshes. Evaluations on various ShapeNet objects showcase the ability of our model to perform shape generation from diverse inputs, e.g., from scratch, from a partial object, or from text and images, as well as size-guided generation by explicitly conditioning on a bounding box that defines the object's boundaries. Moreover, as our model considers the underlying part-based structure of a 3D object, we are able to select a specific part and produce shapes with meaningful variations of this part. As evidenced by our experiments, our model generates 3D shapes that are both more realistic and diverse than existing part-based and non-part-based methods, while at the same time being simpler to implement and train.

--------------------------------------------------------------------------------------------------------

Transforming Agency. On the mode of existence of Large Language Models

This paper investigates the ontological characterization of Large Language Models (LLMs) like ChatGPT, focusing on their status as agents. The authors argue that LLMs fail to meet necessary conditions for autonomous agency according to embodied theories of mind. They propose characterizing LLMs as interlocutors or linguistic automata, capable of engaging in non-purposeful yet purpose-structured tasks. The paper also explores how LLM-human coupling can produce new forms of agency. This research contributes to our understanding of AI systems and their relationship to human agency, with implications for AI ethics, philosophy of mind, and human-computer interaction.

Authors:  Xabier E. Barandiaran, Lola S. Almendros

Link:  https://arxiv.org/abs/2407.10735v2

Date: 2024-07-16

Summary:

This paper investigates the ontological characterization of Large Language Models (LLMs) like ChatGPT. Between inflationary and deflationary accounts, we pay special attention to their status as agents. This requires explaining in detail the architecture, processing, and training procedures that enable LLMs to display their capacities, and the extensions used to turn LLMs into agent-like systems. After a systematic analysis, we conclude that an LLM fails to meet the necessary and sufficient conditions for autonomous agency in the light of embodied theories of mind: the individuality condition (it is not the product of its own activity, and it is not even directly affected by it), the normativity condition (it does not generate its own norms or goals), and, partially, the interactional asymmetry condition (it is not the origin and sustained source of its interaction with the environment). If not agents, then... what are LLMs? We argue that ChatGPT should be characterized as an interlocutor or linguistic automaton, a library-that-talks, devoid of (autonomous) agency but capable of engaging performatively in non-purposeful yet purpose-structured and purpose-bounded tasks. When interacting with humans, a "ghostly" component of the human-machine interaction makes it possible to enact genuine conversational experiences with LLMs. Despite their lack of sensorimotor and biological embodiment, LLMs' textual embodiment (the training corpus) and resource-hungry computational embodiment significantly transform existing forms of human agency. Beyond assisted and extended agency, the LLM-human coupling can produce midtended forms of agency, closer to the production of intentional agency than to the extended instrumentality of any previous technologies.

--------------------------------------------------------------------------------------------------------

Benchmarking Vision Language Models for Cultural Understanding

This paper introduces CulturalVQA, a visual question-answering benchmark for assessing Vision Language Models' (VLMs) geo-diverse cultural understanding. The dataset includes image-question pairs representing cultures from 11 countries across 5 continents, probing understanding of various cultural facets. Benchmarking results reveal disparities in VLMs' cultural understanding across regions and cultural aspects. This research highlights areas where VLMs lack cultural understanding and provides a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures, with potential applications in developing more culturally sensitive AI systems.

Authors:  Shravan Nayak, Kanishk Jain, Rabiul Awal, Siva Reddy, Sjoerd van Steenkiste, Lisa Anne Hendricks, Karolina Stańczak, Aishwarya Agrawal

Link:  https://arxiv.org/abs/2407.10920v2

Date: 2024-07-18

Summary:

Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has typically been assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparities in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America but significantly lower performance for Africa. We observe disparities in their performance across cultural facets too, with clothing, rituals, and traditions seeing higher performance than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.

--------------------------------------------------------------------------------------------------------

Towards Enhanced Classification of Abnormal Lung sound in Multi-breath: A Light Weight Multi-label and Multi-head Attention Classification Method

This study aims to develop an auxiliary diagnostic system for classifying abnormal lung respiratory sounds using a multi-label learning approach and multi-head attention mechanism. The researchers employ a lightweight and highly accurate model, using a two-dimensional label set to represent multiple respiratory sound characteristics. This research could lead to improved automatic diagnosis of lung respiratory sound abnormalities, with potential clinical applications in respiratory medicine and telemedicine.

Authors:  Yi-Wei Chua, Yun-Chien Cheng

Link:  https://arxiv.org/abs/2407.10828v1

Date: 2024-07-15

Summary:

This study aims to develop an auxiliary diagnostic system for classifying abnormal lung respiratory sounds, enhancing the accuracy of automatic abnormal breath sound classification through an innovative multi-label learning approach and multi-head attention mechanism. Addressing the issue of class imbalance and lack of diversity in existing respiratory sound datasets, our study employs a lightweight and highly accurate model, using a two-dimensional label set to represent multiple respiratory sound characteristics. Our method achieved a 59.2% ICBHI score in the four-category task on the ICBHI2017 dataset, demonstrating its advantages in terms of being lightweight and highly accurate. This study not only improves the accuracy of automatic diagnosis of lung respiratory sound abnormalities but also opens new possibilities for clinical applications.
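
A minimal PyTorch sketch of the two named ingredients, multi-head attention over a spectrogram-frame sequence and a multi-label (sigmoid + BCE) output for co-occurring sound characteristics such as crackles and wheezes, is shown below. All dimensions are illustrative assumptions, not the paper's configuration.

```python
# Sketch: multi-head attention pooling + multi-label classification head.
import torch
import torch.nn as nn

class LungSoundClassifier(nn.Module):
    def __init__(self, feat_dim=64, n_heads=4, n_labels=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, n_labels)

    def forward(self, frames):                    # frames: (batch, time, feat)
        attended, _ = self.attn(frames, frames, frames)
        return self.head(attended.mean(dim=1))    # one logit per label

model = LungSoundClassifier()
x = torch.randn(8, 100, 64)                       # batch of spectrogram clips
target = torch.randint(0, 2, (8, 2)).float()      # e.g., [crackle, wheeze]
loss = nn.BCEWithLogitsLoss()(model(x), target)   # multi-label objective:
# each label gets an independent sigmoid, so labels can co-occur.
```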

--------------------------------------------------------------------------------------------------------

NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

This paper introduces NTSEBENCH, a new dataset designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models. The dataset comprises multiple-choice questions featuring both visual and textual general aptitude questions that do not rely on rote learning. The authors establish baselines using state-of-the-art LLMs and VLMs and propose four modeling strategies to handle different modalities. This benchmark could drive advancements in AI systems capable of more complex reasoning tasks, with potential applications in education, cognitive science, and AI development.

Authors:  Pranshu Pandya, Agney S Talwarr, Vatsal Gupta, Tushar Kataria, Vivek Gupta, Dan Roth

Link:  https://arxiv.org/abs/2407.10380v1

Date: 2024-07-15

Summary:

Cognitive textual and visual reasoning tasks, such as puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. While LLMs and VLMs, through extensive training on large amounts of human-curated data, have attained a high level of pseudo-human intelligence in some common-sense reasoning tasks, they still struggle with more complex reasoning tasks that require cognitive understanding. In this work, we introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models. The dataset comprises 2,728 multiple-choice questions, with a total of 4,642 images across 26 categories, sampled from the NTSE examination conducted nationwide in India, featuring both visual and textual general aptitude questions that do not rely on rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and proprietary models, we propose four distinct modeling strategies to handle the different modalities (text and images) in the dataset instances.

--------------------------------------------------------------------------------------------------------

On the use of Probabilistic Forecasting for Network Analysis in Open RAN

This paper proposes the use of probabilistic forecasting techniques as a radio App (rApp) within the Open RAN architecture. The researchers investigate and compare different probabilistic and single-point forecasting methods to estimate the utilization and resource demands of Physical Resource Blocks (PRBs) of cellular base stations. Their evaluations demonstrate the advantages of probabilistic forecasting techniques over traditional methods. This research could lead to more efficient and reliable resource management in 5G and future cellular networks.

Authors:  Vaishnavi Kasuluru, Luis Blanco, Engin Zeydan

Link:  https://arxiv.org/abs/2407.14375v1

Date: 2024-07-19

Summary:

Unlike other single-point Artificial Intelligence (AI)-based prediction techniques, such as Long Short-Term Memory (LSTM), probabilistic forecasting techniques (e.g., DeepAR and Transformer) provide a range of possible outcomes and associated probabilities that enable decision makers to make more informed and robust decisions. At the same time, the Open RAN architecture has emerged as a revolutionary approach for mobile networks, aiming at openness, interoperability, and innovation in the RAN ecosystem. In this paper, we propose the use of probabilistic forecasting techniques as a radio App (rApp) within the Open RAN architecture. We investigate and compare different probabilistic and single-point forecasting methods and algorithms to estimate the utilization and resource demands of Physical Resource Blocks (PRBs) of cellular base stations. Through our evaluations, we demonstrate the numerical advantages of probabilistic forecasting techniques over traditional single-point forecasting methods and show that they are capable of providing more accurate and reliable estimates. In particular, DeepAR clearly outperforms single-point forecasting techniques such as LSTM and Seasonal-Naive (SN) baselines, as well as other probabilistic forecasting techniques such as Simple-Feed-Forward (SFF) and Transformer neural networks.
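
One way to see why probabilistic forecasts matter here is how they are scored: the pinball (quantile) loss rewards calibrated quantiles of PRB demand rather than a single point estimate. A short numpy sketch with made-up numbers:

```python
# Pinball (quantile) loss: the standard score for a predicted quantile.
import numpy as np

def pinball_loss(y_true, y_pred_q, q):
    """Average quantile loss of predicted q-th quantiles y_pred_q."""
    diff = y_true - y_pred_q
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([40., 55., 70.])        # observed PRB utilization (%)
p90 = np.array([60., 68., 80.])           # predicted 90th-percentile demand
print(pinball_loss(y_true, p90, q=0.9))
# A point forecaster offers only one value per step; a DeepAR-style model
# supplies the full quantile set needed for risk-aware resource provisioning.
```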

--------------------------------------------------------------------------------------------------------

Enhancing Biomedical Knowledge Discovery for Diseases: An End-To-End Open-Source Framework

This paper introduces an open-source end-to-end framework for constructing knowledge around specific diseases directly from raw text. The researchers create two annotated datasets focused on Rett syndrome and Alzheimer's disease and conduct extensive benchmarking to explore optimal modeling strategies for semantic relation detection. This framework could accelerate biomedical research by enabling more efficient knowledge discovery from the vast amount of published literature.

Authors:  Christos Theodoropoulos, Andrei Catalin Coman, James Henderson, Marie-Francine Moens

Link:  https://arxiv.org/abs/2407.13492v1

Date: 2024-07-18

Summary:

The ever-growing volume of biomedical publications creates a critical need for efficient knowledge discovery. In this context, we introduce an open-source end-to-end framework designed to construct knowledge around specific diseases directly from raw text. To facilitate research in disease-related knowledge discovery, we create two annotated datasets focused on Rett syndrome and Alzheimer's disease, enabling the identification of semantic relations between biomedical entities. Extensive benchmarking explores various ways to represent relations and entity representations, offering insights into optimal modeling strategies for semantic relation detection and highlighting language models' competence in knowledge discovery. We also conduct probing experiments using different layer representations and attention scores to explore transformers' ability to capture semantic relations.
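
As a flavor of what "ways to represent relations" means in practice, the generic baseline below pools token embeddings over each entity span and concatenates them for a relation classifier. This is an assumed illustration of one common choice, not the paper's specific models.

```python
# Sketch of a simple entity-pair relation representation.
import numpy as np

def relation_features(token_embs, span_a, span_b):
    """token_embs: (seq_len, dim); spans are (start, end) token indices."""
    head = token_embs[span_a[0]:span_a[1]].mean(axis=0)
    tail = token_embs[span_b[0]:span_b[1]].mean(axis=0)
    return np.concatenate([head, tail])   # input to a relation classifier

rng = np.random.default_rng(0)
sentence_embs = rng.normal(size=(30, 768))   # e.g., transformer outputs
feats = relation_features(sentence_embs, (4, 6), (12, 15))
print(feats.shape)                            # (1536,)
```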

--------------------------------------------------------------------------------------------------------

Reducing Barriers to the Use of Marginalised Music Genres in AI

This paper explores the challenges and opportunities associated with using marginalised genres of music with AI models. The researchers identify several eXplainable AI (XAI) opportunities and emphasize the importance of working with small datasets to strengthen cultural representation and address bias issues in deep learning models. This research could lead to more inclusive and diverse AI-generated music and contribute to preserving and promoting underrepresented musical traditions.

Authors:  Nick Bryan-Kinns, Zijin Li

Link:  https://arxiv.org/abs/2407.13439v1

Date: 2024-07-18

Summary:

AI systems for high-quality music generation typically rely on extremely large musical datasets to train the AI models. This creates barriers to generating music beyond the genres represented in dominant datasets, such as Western classical music or pop music. We undertook a 4-month international research project, summarised in this paper, to explore the eXplainable AI (XAI) challenges and opportunities associated with reducing barriers to using marginalised genres of music with AI models. XAI opportunities identified included improving the transparency and control of AI models, explaining the ethics and bias of AI models, fine-tuning large models with small datasets to reduce bias, and explaining style-transfer opportunities with AI models. Participants in the research emphasised that whilst it is hard to use small datasets such as marginalised music with AI, such approaches strengthen cultural representation of underrepresented cultures and contribute to addressing issues of bias in deep learning models. We are now building on this project to bring together a global International Responsible AI Music community, and we invite people to join our network.

--------------------------------------------------------------------------------------------------------

Exploring the Use of Abusive Generative AI Models on Civitai

This paper presents the first comprehensive empirical study of an AI-Generated Content (AIGC) social platform, focusing on its potential for generating abusive content. The researchers construct a dataset covering Civitai, the largest available AIGC social platform, and explore the characteristics of content and discuss moderation strategies. This study provides valuable insights for governing AIGC platforms and addressing ethical concerns related to generative AI technologies.

Authors:  Yiluo Wei, Yiming Zhu, Pan Hui, Gareth Tyson

Link:  https://arxiv.org/abs/2407.12876v1

Date: 2024-07-16

Summary:

The rise of generative AI is transforming the landscape of digital imagery and exerting a significant influence on online creative communities. This has led to the emergence of AI-Generated Content (AIGC) social platforms, such as Civitai. These distinctive social platforms allow users to build and share their own generative AI models, thereby enhancing the potential for more diverse artistic expression. Designed in the vein of social networks, they also provide artists with the means to showcase their creations (generated from the models), engage in discussions, and obtain feedback, thus nurturing a sense of community. Yet this openness also raises concerns about the abuse of such platforms, e.g., using models to disseminate deceptive deepfakes or infringe upon copyrights. To explore this, we conduct the first comprehensive empirical study of an AIGC social platform, focusing on its use for generating abusive content. As an exemplar, we construct a comprehensive dataset covering Civitai, the largest available AIGC social platform. Based on this dataset of 87K models and 2M images, we explore the characteristics of content and discuss strategies for moderation to better govern these platforms.

--------------------------------------------------------------------------------------------------------

Machine Learning in Communications: A Road to Intelligent Transmission and Processing

This overview article discusses the roles of machine learning in intelligent wireless communications, as well as its features, challenges, and practical considerations. The authors highlight how AI and machine learning are transforming traditional communication systems, leading to more adaptive, efficient, and intelligent algorithms. This research provides a roadmap for the future of wireless communications, with potential applications in 5G and beyond, Internet of Things (IoT), and smart cities.

Authors:  Shixiong Wang, Geoffrey Ye Li

Link:  https://arxiv.org/abs/2407.11595v1

Date: 2024-07-16

Summary:

Prior to the era of artificial intelligence and big data, wireless communications primarily followed a conventional research route involving problem analysis, model building and calibration, algorithm design and tuning, and holistic and empirical verification. However, this methodology often encountered limitations when dealing with large-scale and complex problems and managing dynamic and massive data, resulting in inefficiencies and limited performance of traditional communication systems and methods. As such, wireless communications have embraced the revolutionary impact of artificial intelligence and machine learning, giving birth to more adaptive, efficient, and intelligent systems and algorithms. This technological shift opens a road to intelligent information transmission and processing. This overview article discusses the typical roles of machine learning in intelligent wireless communications, as well as its features, challenges, and practical considerations.

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.