Eye On AI

Week Ending 7.21.2024

RESEARCH WATCH: 7.21.2024

NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Language Learning and Group Communication

This paper introduces an extended framework for simulating language learning and group communication using neural network agents. Building on the previous NeLLCom framework, NeLLCom-X incorporates more realistic role-alternating agents and group communication. This allows researchers to investigate the interplay between language learnability, communication pressures, and group size effects. The framework has been validated by replicating key findings from prior research on the emergence of word-order and case-marking trade-offs. NeLLCom-X opens up possibilities for future simulations of various linguistic aspects, emphasizing the importance of interaction and group dynamics in language evolution. This research could have applications in understanding language development and creating more natural language processing systems.

Authors:  Yuchen Lian, Tessa Verhoef, Arianna Bisazza

Link:  https://arxiv.org/abs/2407.13999v1

Date: 2024-07-19

Summary:

Recent advances in computational linguistics include simulating the emergence of human-like languages with interacting neural network agents, starting from sets of random symbols. The recently introduced NeLLCom framework (Lian et al., 2023) allows agents to first learn an artificial language and then use it to communicate, with the aim of studying the emergence of specific linguistic properties. We extend this framework (NeLLCom-X) by introducing more realistic role-alternating agents and group communication in order to investigate the interplay between language learnability, communication pressures, and group-size effects. We validate NeLLCom-X by replicating key findings from prior research simulating the emergence of a word-order/case-marking trade-off. Next, we investigate how interaction affects linguistic convergence and the emergence of the trade-off. The novel framework facilitates future simulations of diverse linguistic aspects, emphasizing the importance of interaction and group dynamics in language evolution.
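
For intuition, here is a minimal Python sketch of the role-alternating group setup described above, in which randomly paired agents take turns as speaker and listener each round. All class and method names are hypothetical illustrations, not the NeLLCom-X API.

```python
# Minimal sketch of role-alternating group communication, in the spirit of
# NeLLCom-X. Names and structure are hypothetical illustrations.
import random

class Agent:
    def __init__(self, agent_id):
        self.agent_id = agent_id

    def speak(self, meaning):
        # A neural speaker would map a meaning to an utterance; we return a
        # placeholder string so the loop runs end to end.
        return f"utterance-for-{meaning}"

    def listen(self, utterance):
        # A neural listener would decode the utterance back to a meaning.
        return utterance.removeprefix("utterance-for-")

def communication_round(agents, meanings):
    """Pair agents at random; each pair plays both speaker and listener."""
    random.shuffle(agents)
    for a, b in zip(agents[0::2], agents[1::2]):
        for speaker, listener in ((a, b), (b, a)):  # role alternation
            meaning = random.choice(meanings)
            success = listener.listen(speaker.speak(meaning)) == meaning
            # A training signal based on communicative success would be
            # applied here (e.g., a reinforcement-style update).

group = [Agent(i) for i in range(6)]  # group size is an experimental variable
for _ in range(100):
    communication_round(group, meanings=["A", "B", "C"])
```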

--------------------------------------------------------------------------------------------------------

Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems

This paper presents a novel web agent called Agent-E, which introduces several architectural improvements over previous state-of-the-art web agents. These include a hierarchical architecture, a flexible DOM distillation and denoising method, and the concept of "change observation," which guides the agent toward more accurate performance. The researchers evaluated Agent-E on the WebVoyager benchmark dataset, demonstrating superior performance compared to other text and multi-modal web agents. The paper also synthesizes learnings from Agent-E's development into general design principles for agentic systems. This research could have significant implications for developing more efficient and effective AI agents for web navigation and task completion.

Authors:  Tamer Abuelsaad, Deepak Akkil, Prasenjit Dey, Ashish Jagmohan, Aditya Vempaty, Ravi Kokku

Link:  https://arxiv.org/abs/2407.13032v1

Date: 2024-07-17

Summary:

AI agents are changing the way work gets done, in both consumer and enterprise domains. However, the design patterns and architectures for building highly capable agents or multi-agent systems are still developing, and the understanding of the implications of various design choices and algorithms is still evolving. In this paper, we present our work on building a novel web agent, Agent-E (code available at https://github.com/EmergenceAI/Agent-E). Agent-E introduces numerous architectural improvements over prior state-of-the-art web agents, such as a hierarchical architecture, a flexible DOM distillation and denoising method, and the concept of "change observation" to guide the agent towards more accurate performance. We first present the results of an evaluation of Agent-E on the WebVoyager benchmark dataset and show that Agent-E beats other SOTA text and multi-modal web agents on this benchmark in most categories by 10-30%. We then synthesize our learnings from the development of Agent-E into general design principles for developing agentic systems. These include the use of domain-specific primitive skills, the importance of distillation and de-noising of environmental observations, the advantages of a hierarchical architecture, and the role of agentic self-improvement in enhancing agent efficiency and efficacy as the agent gathers experience.
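
The DOM distillation idea lends itself to a short illustration: prune a raw DOM tree down to the elements an agent can act on, discarding presentational noise before handing the page to the model. The dict-based tree and tag whitelist below are simplifying assumptions, not Agent-E's actual implementation.

```python
# Hedged sketch of DOM distillation: keep interactive elements and the
# ancestors needed to reach them, drop purely presentational subtrees.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

def distill(node):
    """Return a pruned copy of `node` keeping only interactive descendants."""
    children = [c for c in (distill(ch) for ch in node.get("children", [])) if c]
    if node.get("tag") in INTERACTIVE_TAGS or children:
        return {"tag": node["tag"],
                "text": node.get("text", ""),
                "children": children}
    return None  # nothing actionable below this node

dom = {"tag": "div", "children": [
    {"tag": "span", "text": "decorative"},
    {"tag": "button", "text": "Submit"},
]}
print(distill(dom))  # keeps the button, drops the span
```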

--------------------------------------------------------------------------------------------------------

A Methodology Establishing Linear Convergence of Adaptive Gradient Methods under PL Inequality

This paper establishes a methodology for proving linear convergence of adaptive gradient methods under the Polyak-Łojasiewicz (PL) inequality. The researchers focus on AdaGrad and Adam, two popular adaptive gradient methods, and demonstrate their linear convergence for smooth cost functions satisfying the PL inequality. The theoretical framework presented follows a simple and unified approach applicable to both batch and stochastic gradients. This research provides a deeper understanding of the convergence properties of these widely used optimization algorithms, potentially leading to more efficient training of machine learning models, particularly in deep learning applications.

Authors:  Kushal Chakrabarti, Mayank Baranwal

Link:  https://arxiv.org/abs/2407.12629v1

Date: 2024-07-17

Summary:

Adaptive gradient-descent optimizers are the standard choice for training neural network models. Despite their faster convergence than gradient descent and remarkable performance in practice, adaptive optimizers are not as well understood as vanilla gradient descent. One reason is that the dynamic update of the learning rate that helps these methods converge faster also makes their analysis intricate. In particular, the simple gradient-descent method converges at a linear rate for a class of optimization problems, whereas the practically faster adaptive gradient methods lack such a theoretical guarantee. The Polyak-Łojasiewicz (PL) inequality is the weakest known condition for which linear convergence of gradient descent and its momentum variants has been proved. Therefore, in this paper, we prove that AdaGrad and Adam, two well-known adaptive gradient methods, converge linearly when the cost function is smooth and satisfies the PL inequality. Our theoretical framework follows a simple and unified approach, applicable to both batch and stochastic gradients, which can potentially be utilized in analyzing linear convergence of other variants of Adam.
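
For reference, the PL inequality and the classical linear-rate bound it yields for gradient descent on an L-smooth function; this is the style of guarantee the paper extends to AdaGrad and Adam:

```latex
% PL inequality: for some \mu > 0 and all x,
\[
  \tfrac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr),
\]
% under which gradient descent with step size 1/L contracts the
% suboptimality gap at a linear rate:
\[
  f(x_{k+1}) - f^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)\bigl(f(x_{k}) - f^{*}\bigr).
\]
```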

--------------------------------------------------------------------------------------------------------

Large Vision-Language Models as Emotion Recognizers in Context Awareness

This paper explores the potential of Large Vision-Language Models (LVLMs) for context-aware emotion recognition (CAER). The researchers investigate three paradigms: fine-tuning LVLMs on CAER datasets, designing zero-shot and few-shot patterns for scenarios with limited data, and incorporating Chain-of-Thought (CoT) to enhance reasoning ability. The study demonstrates that LVLMs can achieve competitive performance in CAER tasks across different paradigms, with particularly strong results in few-shot settings. This research could lead to more accurate and adaptable emotion recognition systems with applications in human-computer interaction, sentiment analysis, and affective computing.

Authors:  Yuxuan Lei, Dingkang Yang, Zhaoyu Chen, Jiawei Chen, Peng Zhai, Lihua Zhang

Link:  https://arxiv.org/abs/2407.11300v1

Date: 2024-07-16

Summary:

Context-aware emotion recognition (CAER) is a complex and significant task that requires perceiving emotions from various contextual cues. Previous approaches primarily focus on designing sophisticated architectures to extract emotional cues from images. However, their knowledge is confined to specific training datasets and may reflect the subjective emotional biases of the annotators. Furthermore, acquiring large amounts of labeled data is often challenging in real-world applications. In this paper, we systematically explore the potential of leveraging Large Vision-Language Models (LVLMs) to empower the CAER task from three paradigms: 1) We fine-tune LVLMs on two CAER datasets, which is the most common way to transfer large models to downstream tasks. 2) We design zero-shot and few-shot patterns to evaluate the performance of LVLMs in scenarios with limited data, or even completely unseen scenarios. In this case, a training-free framework is proposed to fully exploit the In-Context Learning (ICL) capabilities of LVLMs. Specifically, we develop an image similarity-based ranking algorithm to retrieve examples; subsequently, the instructions, retrieved examples, and the test example are combined and fed to LVLMs to obtain the corresponding sentiment judgment. 3) To leverage the rich knowledge base of LVLMs, we incorporate Chain-of-Thought (CoT) into our framework to enhance the model's reasoning ability and provide interpretable results. Extensive experiments and analyses demonstrate that LVLMs achieve competitive performance on the CAER task across different paradigms. Notably, the superior performance in few-shot settings indicates the feasibility of LVLMs for accomplishing specific tasks without extensive training.
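
The training-free ICL recipe above reduces to a ranking-and-prompting loop. Below is a minimal numpy sketch of the image similarity-based example retrieval; the embedding model and prompt assembly are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of retrieval for in-context learning: rank labeled examples by
# image-embedding cosine similarity to the test image, keep the top k.
import numpy as np

def top_k_examples(test_emb, example_embs, k=3):
    """Indices of the k most similar candidate in-context examples."""
    a = test_emb / np.linalg.norm(test_emb)
    b = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = b @ a
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
example_embs = rng.normal(size=(100, 512))   # embeddings of labeled images
test_emb = rng.normal(size=512)              # embedding of the test image
idx = top_k_examples(test_emb, example_embs)
# The retrieved examples, the instructions, and the test image are then
# concatenated into the LVLM prompt to elicit a sentiment judgment.
print(idx)
```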

--------------------------------------------------------------------------------------------------------

iHuman: Instant Animatable Digital Humans From Monocular Videos

This paper presents a method for creating animatable 3D digital humans from monocular videos using Gaussian splatting. The researchers developed a novel pipeline that achieves accurate 3D mesh-type modeling of the human body, allowing for better animations. Key aspects of the method include implicit modeling of surface displacements and color spherical harmonics, binding 3D Gaussians to triangular faces of the body template, and a new technique for rendering and supervising normals. The method achieves state-of-the-art results in rendering and 3D reconstruction performance, with significantly faster training times than competitors. This research could revolutionize the creation of personalized 3D avatars for applications in virtual reality, gaming, and digital media production.

Authors:  Pramish Paudel, Anubhav Khanal, Ajad Chhatkuli, Danda Pani Paudel, Jyoti Tandukar

Link:  https://arxiv.org/abs/2407.11174v1

Date: 2024-07-15

Summary:

Personalized 3D avatars require an animatable representation of digital humans. Doing so instantly from monocular videos offers scalability to a broad class of users and wide-scale applications. In this paper, we present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos. Our method utilizes the efficiency of Gaussian splatting to model both 3D geometry and appearance. However, we observed that naively optimizing Gaussian splats results in inaccurate geometry, thereby leading to poor animations. This work demonstrates the need for accurate 3D mesh-type modelling of the human body for animatable digitization through Gaussian splats. This is achieved by developing a novel pipeline that benefits from three key aspects: (a) implicit modelling of the surface's displacements and the color's spherical harmonics; (b) binding of 3D Gaussians to the respective triangular faces of the body template; (c) a novel technique to render normals followed by their auxiliary supervision. Our exhaustive experiments on three different benchmark datasets demonstrate the state-of-the-art results of our method in limited-time settings. In fact, our method is faster by an order of magnitude (in terms of training time) than its closest competitor. At the same time, we achieve superior rendering and 3D reconstruction performance under changes of pose.
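
Point (b), binding 3D Gaussians to triangle faces, can be illustrated in a few lines of numpy: each Gaussian center is expressed in barycentric coordinates of its face, so it follows the body template rigidly when the mesh is animated. Shapes and values below are illustrative, not the paper's code.

```python
# Illustrative sketch of binding Gaussians to mesh faces via barycentric
# coordinates; one Gaussian per face for simplicity.
import numpy as np

def gaussian_centers(vertices, faces, barycentric):
    """vertices: (V,3); faces: (F,3) vertex indices; barycentric: (F,3).
    Returns one Gaussian center per face."""
    tri = vertices[faces]                       # (F, 3, 3) triangle corners
    return np.einsum("fc,fcd->fd", barycentric, tri)

verts = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
faces = np.array([[0, 1, 2], [0, 1, 3]])
bary = np.full((2, 3), 1 / 3)                   # centroids of each face
print(gaussian_centers(verts, faces, bary))
# When `verts` are deformed by the skeleton, the bound Gaussians follow,
# which is what keeps the splats consistent with the animated body.
```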

--------------------------------------------------------------------------------------------------------

BiasScanner: Automatic Detection and Classification of News Bias to Strengthen Democracy

This paper introduces BiasScanner, an application designed to strengthen democracy by helping news consumers scrutinize online articles for bias. The system uses a pre-trained large language model to identify and classify over two dozen types of media bias at the sentence level. BiasScanner is implemented as a lightweight, privacy-respecting browser plug-in that highlights potentially biased sentences, provides explanations for each classification decision, and offers a summary analysis for each article. This tool could play a crucial role in promoting media literacy and combating the spread of disinformation in the digital age.

Authors:  Tim Menzner, Jochen L. Leidner

Link:  https://arxiv.org/abs/2407.10829v1

Date: 2024-07-15

Summary:

The increasing consumption of news online in the 21st century has coincided with increased publication of disinformation, biased reporting, hate speech, and other unwanted Web content. We describe BiasScanner, an application that aims to strengthen democracy by supporting news consumers in scrutinizing the news articles they read online. BiasScanner comprises a server-side pre-trained large language model that identifies biased sentences in news articles and a front-end Web browser plug-in. At the time of writing, BiasScanner can identify and classify more than two dozen types of media bias at the sentence level, making it the most fine-grained model and the only deployed application (automatic system in use) of its kind. It was implemented in a lightweight and privacy-respecting manner, and in addition to highlighting likely biased sentences it also provides explanations for each classification decision as well as a summary analysis for each news article. While prior research has addressed news bias detection, we are not aware of any work that has resulted in a deployed browser plug-in (cf. biasscanner.org for a Web demo).
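
Conceptually, the scanning loop is simple: split an article into sentences, classify each, and surface the flagged ones. The sketch below stands in for that pipeline; classify_sentence is a hypothetical placeholder for the server-side LLM call, and the heuristic and label set shown are illustrative, not BiasScanner's actual taxonomy.

```python
# Conceptual sketch of sentence-level bias scanning.
import re

BIAS_TYPES = ["loaded language", "unsubstantiated claim", "opinionated", "none"]

def classify_sentence(sentence):
    # Hypothetical stand-in for the server-side LLM call; a trivial keyword
    # heuristic so the sketch runs end to end.
    if "reckless" in sentence:
        return ("loaded language", "evaluative adjective presented as fact")
    return ("none", "")

def scan_article(text):
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        label, why = classify_sentence(sentence)
        if label != "none":
            results.append({"sentence": sentence, "bias": label, "why": why})
    return results  # the plug-in highlights these and shows a summary

print(scan_article("The senator's reckless scheme failed. Turnout was 54%."))
```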

--------------------------------------------------------------------------------------------------------

On-Device Training of Fully Quantized Deep Neural Networks on Cortex-M Microcontrollers

This paper explores on-device training of deep neural networks (DNNs) for Cortex-M microcontroller units (MCUs). The researchers present a method enabling efficient training of DNNs completely on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. They demonstrate the feasibility of their approach on multiple vision and time-series datasets, providing insights into the trade-offs between training accuracy, memory overhead, energy consumption, and latency on real hardware. This research could lead to more adaptive and efficient edge AI systems, enabling smart devices to learn and improve their performance in real-time without relying on cloud computing resources.

Authors:  Mark Deutel, Frank Hannig, Christopher Mutschler, Jürgen Teich

Link:  https://arxiv.org/abs/2407.10734v1

Date: 2024-07-15

Summary:

On-device training of DNNs allows models to adapt and fine-tune to newly collected data or changing domains while deployed on microcontroller units (MCUs). However, DNN training is a resource-intensive task, making the implementation and execution of DNN training algorithms on MCUs challenging due to low processor speeds, constrained throughput, limited floating-point support, and memory constraints. In this work, we explore on-device training of DNNs for Cortex-M MCUs. We present a method that enables efficient training of DNNs completely in place on the MCU using fully quantized training (FQT) and dynamic partial gradient updates. We demonstrate the feasibility of our approach on multiple vision and time-series datasets and provide insights into the tradeoff between training accuracy, memory overhead, energy, and latency on real hardware.
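
The two ideas named in the abstract, fully quantized training and dynamic partial gradient updates, can be sketched in a few lines: keep weights in int8 with per-tensor scales, and skip the update for layers outside a chosen mask each step. The numpy code below is a simplified illustration, not the paper's Cortex-M implementation.

```python
# Rough sketch: int8 weights with per-tensor scales, plus a per-step mask
# selecting which layers receive a gradient update.
import numpy as np

def quantize(x):
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def fqt_step(layers, grads, lr, update_mask):
    """layers: list of (int8 weights, scale); grads: float gradients;
    update_mask: which layers are updated this step."""
    for i, ((w_q, w_s), g) in enumerate(zip(layers, grads)):
        if not update_mask[i]:
            continue  # skipped layers cost no gradient memory or write traffic
        w = dequantize(w_q, w_s) - lr * g
        layers[i] = quantize(w)

rng = np.random.default_rng(0)
layers = [quantize(rng.normal(size=(4, 4))) for _ in range(3)]
grads = [rng.normal(size=(4, 4)) for _ in range(3)]
fqt_step(layers, grads, lr=0.01, update_mask=[True, False, True])
```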

--------------------------------------------------------------------------------------------------------

Learning Rapid Turning, Aerial Reorientation, and Balancing using Manipulator as a Tail

This paper investigates the use of a 6-DoF manipulator as a tail for quadruped robots to enhance their physical capabilities. The researchers developed a reinforcement learning-based controller for the robot equipped with the manipulator. Experimental results show that robots with manipulators outperform those without in tasks such as rapid turning, aerial reorientation, and balancing. This innovative approach could lead to more versatile and agile quadruped robots with applications in search and rescue, exploration, and industrial automation, combining the benefits of legged locomotion with manipulation capabilities.

Authors:  Insung Yang, Jemin Hwangbo

Link:  https://arxiv.org/abs/2407.10420v1

Date: 2024-07-15

Summary:

In this research, we investigated the innovative use of a manipulator as a tail in quadruped robots to augment their physical capabilities. Previous studies have primarily focused on enhancing various abilities by attaching robotic tails that function solely as tails on quadruped robots. While these tails improve the performance of the robots, they come with several disadvantages, such as increased overall weight and higher costs. To mitigate these limitations, we propose the use of a 6-DoF manipulator as a tail, allowing it to serve both as a tail and as a manipulator. To control this highly complex robot, we developed a controller based on reinforcement learning for the robot equipped with the manipulator. Our experimental results demonstrate that robots equipped with a manipulator outperform those without a manipulator in tasks such as rapid turning, aerial reorientation, and balancing. These results indicate that the manipulator can improve the agility and stability of quadruped robots, similar to a tail, in addition to its manipulation capabilities.

--------------------------------------------------------------------------------------------------------

An Empirical Study of Mamba-based Pedestrian Attribute Recognition

This paper explores the potential of Mamba, a linear complexity model, for pedestrian attribute recognition (PAR) tasks. The researchers design and adapt Mamba into two typical PAR frameworks: a text-image fusion approach and a pure vision Mamba multi-label recognition framework. They also investigate various hybrid Mamba-Transformer variants. The study provides insights into the effectiveness of Mamba for PAR tasks and multi-label recognition, potentially leading to more efficient and accurate systems for applications in surveillance, security, and urban analytics.

Authors:  Xiao Wang, Weizhe Kong, Jiandong Jin, Shiao Wang, Ruichong Gao, Qingchuan Ma, Chenglong Li, Jin Tang

Link:  https://arxiv.org/abs/2407.10374v1

Date: 2024-07-15

Summary:

Current strong pedestrian attribute recognition models are developed based on Transformer networks, which are computationally heavy. Recently proposed models with linear complexity (e.g., Mamba) have garnered significant attention and have achieved a good balance between accuracy and computational cost across a variety of visual tasks. Relevant review articles also suggest that while these models can perform well on some pedestrian attribute recognition datasets, they are generally weaker than the corresponding Transformer models. To further tap into the potential of the novel Mamba architecture for PAR tasks, this paper designs and adapts Mamba into two typical PAR frameworks, i.e., the text-image fusion approach and the pure vision Mamba multi-label recognition framework. It is found that interacting with attribute tags as additional input does not always lead to an improvement; specifically, Vim can be enhanced, but VMamba cannot. This paper further designs various hybrid Mamba-Transformer variants and conducts thorough experimental validations. These experimental results indicate that simply enhancing Mamba with a Transformer does not always lead to performance improvements but yields better results under certain settings. We hope this empirical study can further inspire research in Mamba for PAR, and even extend into the domain of multi-label recognition, through the design of these network structures and comprehensive experimentation. The source code of this work will be released at https://github.com/Event-AHU/OpenPAR.

--------------------------------------------------------------------------------------------------------

How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines

This paper investigates the privacy preservation capabilities of low-frequency audio recording for studying social dynamics in real-world settings. The researchers examine the degree to which low-frequency speech ensures verbal privacy by simulating potential privacy attacks in various noise environments. They also explore the trade-off between voice activity detection performance and privacy preservation. This study provides valuable insights for developing privacy-preserving wearable devices and audio processing techniques, with potential applications in social science research, workplace analytics, and personal privacy protection.

Authors:  Ailin Liu, Pepijn Vunderink, Jose Vargas Quiros, Chirag Raman, Hayley Hung

Link:  https://arxiv.org/abs/2407.13266v1

Date: 2024-07-18

Summary:

Low-frequency audio has been proposed as a promising privacy-preserving modality to study social dynamics in real-world settings. To this end, researchers have developed wearable devices that can record audio at frequencies as low as 1250 Hz to mitigate the automatic extraction of the verbal content of speech that may contain private details. This paper investigates the validity of this hypothesis, examining the degree to which low-frequency speech ensures verbal privacy. It includes simulating a potential privacy attack in various noise environments. Further, it explores the trade-off between the performance of voice activity detection, which is fundamental for understanding social behavior, and privacy preservation. The evaluation incorporates subjective human intelligibility and automatic speech recognition performance, comprehensively analyzing the delicate balance between effective social behavior analysis and preserving verbal privacy.
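
The recording step under study amounts to low-pass filtering speech so that only content below roughly 1250 Hz survives. A minimal sketch using scipy (filter order and library choice are illustrative assumptions):

```python
# Sketch of privacy-preserving low-frequency capture: attenuate everything
# above the cutoff so most verbal content becomes unintelligible.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def to_low_frequency(audio, fs, cutoff_hz=1250, order=8):
    """Keep only the band below cutoff_hz."""
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, audio)

fs = 16_000                                  # typical speech sampling rate
t = np.arange(fs) / fs
speech_like = np.sin(2*np.pi*300*t) + np.sin(2*np.pi*3000*t)
private = to_low_frequency(speech_like, fs)  # 3 kHz component is attenuated
```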

--------------------------------------------------------------------------------------------------------

PASTA: Controllable Part-Aware Shape Generation with Autoregressive Transformers

This paper presents PASTA, an autoregressive transformer architecture for generating high-quality 3D shapes. The model comprises two main components: an autoregressive transformer that generates objects as a sequence of cuboidal primitives, and a blending network that composes these sequences into high-quality meshes. PASTA can perform shape generation from diverse inputs, including partial objects, text, and images, and allows for size-guided generation. This research could have significant implications for 3D content creation in various fields, including computer graphics, virtual reality, and industrial design.

Authors:  Songlin Li, Despoina Paschalidou, Leonidas Guibas

Link:  https://arxiv.org/abs/2407.13677v1

Date: 2024-07-18

Summary:

The increased demand for tools that automate the 3D content creation process has led to tremendous progress in deep generative models that can generate diverse 3D objects of high fidelity. In this paper, we present PASTA, an autoregressive transformer architecture for generating high-quality 3D shapes. PASTA comprises two main components: an autoregressive transformer that generates objects as a sequence of cuboidal primitives, and a blending network, implemented with a transformer decoder, that composes the sequences of cuboids and synthesizes high-quality meshes for each object. Our model is trained in two stages: first we train our autoregressive generative model using only annotated cuboidal parts as supervision, and next we train our blending network using explicit 3D supervision in the form of watertight meshes. Evaluations on various ShapeNet objects showcase the ability of our model to perform shape generation from diverse inputs, e.g., from scratch, from a partial object, or from text and images, as well as size-guided generation by explicitly conditioning on a bounding box that defines the object's boundaries. Moreover, as our model considers the underlying part-based structure of a 3D object, we are able to select a specific part and produce shapes with meaningful variations of this part. As evidenced by our experiments, our model generates 3D shapes that are both more realistic and diverse than existing part-based and non-part-based methods, while at the same time being simpler to implement and train.

--------------------------------------------------------------------------------------------------------

Transforming Agency. On the mode of existence of Large Language Models

This paper investigates the ontological characterization of Large Language Models (LLMs) like ChatGPT, focusing on their status as agents. The authors argue that LLMs fail to meet necessary conditions for autonomous agency according to embodied theories of mind. They propose characterizing LLMs as interlocutors or linguistic automata, capable of engaging in non-purposeful yet purpose-structured tasks. The paper also explores how LLM-human coupling can produce new forms of agency. This research contributes to our understanding of AI systems and their relationship to human agency, with implications for AI ethics, philosophy of mind, and human-computer interaction.

Authors:  Xabier E. Barandiaran, Lola S. Almendros

Link:  https://arxiv.org/abs/2407.10735v2

Date: 2024-07-16

Summary:

This paper investigates the ontological characterization of Large Language Models (LLMs) like ChatGPT. Between inflationary and deflationary accounts, we pay special attention to their status as agents. This requires explaining in detail the architecture, processing, and training procedures that enable LLMs to display their capacities, and the extensions used to turn LLMs into agent-like systems. After a systematic analysis, we conclude that an LLM fails to meet the necessary and sufficient conditions for autonomous agency in the light of embodied theories of mind: the individuality condition (it is not the product of its own activity, and it is not even directly affected by it), the normativity condition (it does not generate its own norms or goals), and, partially, the interactional asymmetry condition (it is not the origin and sustained source of its interaction with the environment). If not agents, then... what are LLMs? We argue that ChatGPT should be characterized as an interlocutor or linguistic automaton, a library-that-talks, devoid of (autonomous) agency but capable of engaging performatively in non-purposeful yet purpose-structured and purpose-bounded tasks. When interacting with humans, a "ghostly" component of the human-machine interaction makes it possible to enact genuine conversational experiences with LLMs. Despite their lack of sensorimotor and biological embodiment, LLMs' textual embodiment (the training corpus) and resource-hungry computational embodiment significantly transform existing forms of human agency. Beyond assisted and extended agency, the LLM-human coupling can produce midtended forms of agency, closer to the production of intentional agency than to the extended instrumentality of any previous technologies.

--------------------------------------------------------------------------------------------------------

Benchmarking Vision Language Models for Cultural Understanding

This paper introduces CulturalVQA, a visual question-answering benchmark for assessing Vision Language Models' (VLMs) geo-diverse cultural understanding. The dataset includes image-question pairs representing cultures from 11 countries across 5 continents, probing understanding of various cultural facets. Benchmarking results reveal disparities in VLMs' cultural understanding across regions and cultural aspects. This research highlights areas where VLMs lack cultural understanding and provides a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures, with potential applications in developing more culturally sensitive AI systems.

Authors:  Shravan Nayak, Kanishk Jain, Rabiul Awal, Siva Reddy, Sjoerd van Steenkiste, Lisa Anne Hendricks, Karolina Stańczak, Aishwarya Agrawal

Link:  https://arxiv.org/abs/2407.10920v2

Date: 2024-07-18

Summary:

Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has typically been assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparities in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America but significantly lower performance for Africa. We observe disparities in their performance across cultural facets too, with clothing, rituals, and traditions seeing higher performance than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.

--------------------------------------------------------------------------------------------------------

Towards Enhanced Classification of Abnormal Lung sound in Multi-breath: A Light Weight Multi-label and Multi-head Attention Classification Method

This study aims to develop an auxiliary diagnostic system for classifying abnormal lung respiratory sounds using a multi-label learning approach and multi-head attention mechanism. The researchers employ a lightweight and highly accurate model, using a two-dimensional label set to represent multiple respiratory sound characteristics. This research could lead to improved automatic diagnosis of lung respiratory sound abnormalities, with potential clinical applications in respiratory medicine and telemedicine.

Authors:  Yi-Wei Chua, Yun-Chien Cheng

Link:  https://arxiv.org/abs/2407.10828v1

Date: 2024-07-15

Summary:

This study aims to develop an auxiliary diagnostic system for classifying abnormal lung respiratory sounds, enhancing the accuracy of automatic abnormal breath sound classification through an innovative multi-label learning approach and multi-head attention mechanism. Addressing the issue of class imbalance and lack of diversity in existing respiratory sound datasets, our study employs a lightweight and highly accurate model, using a two-dimensional label set to represent multiple respiratory sound characteristics. Our method achieved a 59.2% ICBHI score in the four-category task on the ICBHI2017 dataset, demonstrating its advantages in terms of being lightweight and highly accurate. This study not only improves the accuracy of automatic diagnosis of lung respiratory sound abnormalities but also opens new possibilities for clinical applications.
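
A minimal PyTorch sketch of the two named ingredients, multi-head attention over a spectrogram-frame sequence and a multi-label (sigmoid + BCE) output for co-occurring sound characteristics such as crackles and wheezes, is shown below. All dimensions are illustrative assumptions, not the paper's configuration.

```python
# Sketch: multi-head attention pooling + multi-label classification head.
import torch
import torch.nn as nn

class LungSoundClassifier(nn.Module):
    def __init__(self, feat_dim=64, n_heads=4, n_labels=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, n_labels)

    def forward(self, frames):                    # frames: (batch, time, feat)
        attended, _ = self.attn(frames, frames, frames)
        return self.head(attended.mean(dim=1))    # one logit per label

model = LungSoundClassifier()
x = torch.randn(8, 100, 64)                       # batch of spectrogram clips
target = torch.randint(0, 2, (8, 2)).float()      # e.g., [crackle, wheeze]
loss = nn.BCEWithLogitsLoss()(model(x), target)   # multi-label objective:
# each label gets an independent sigmoid, so labels can co-occur.
```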

--------------------------------------------------------------------------------------------------------

NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

This paper introduces NTSEBENCH, a new dataset designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models. The dataset comprises multiple-choice questions featuring both visual and textual general aptitude questions that do not rely on rote learning. The authors establish baselines using state-of-the-art LLMs and VLMs and propose four modeling strategies to handle different modalities. This benchmark could drive advancements in AI systems capable of more complex reasoning tasks, with potential applications in education, cognitive science, and AI development.

Authors:  Pranshu Pandya, Agney S Talwarr, Vatsal Gupta, Tushar Kataria, Vivek Gupta, Dan Roth

Link:  https://arxiv.org/abs/2407.10380v1

Date: 2024-07-15

Summary:

Cognitive textual and visual reasoning tasks, such as puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. While LLMs and VLMs, through extensive training on large amounts of human-curated data, have attained a high level of pseudo-human intelligence in some common-sense reasoning tasks, they still struggle with more complex reasoning tasks that require cognitive understanding. In this work, we introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models. The dataset comprises 2,728 multiple-choice questions, with a total of 4,642 images across 26 categories, sampled from the NTSE examination conducted nationwide in India, featuring both visual and textual general aptitude questions that do not rely on rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and proprietary models, we propose four distinct modeling strategies to handle the different modalities (text and images) in the dataset instances.

--------------------------------------------------------------------------------------------------------

On the use of Probabilistic Forecasting for Network Analysis in Open RAN

This paper proposes the use of probabilistic forecasting techniques as a radio App (rApp) within the Open RAN architecture. The researchers investigate and compare different probabilistic and single-point forecasting methods to estimate the utilization and resource demands of Physical Resource Blocks (PRBs) of cellular base stations. Their evaluations demonstrate the advantages of probabilistic forecasting techniques over traditional methods. This research could lead to more efficient and reliable resource management in 5G and future cellular networks.

Authors:  Vaishnavi Kasuluru, Luis Blanco, Engin Zeydan

Link:  https://arxiv.org/abs/2407.14375v1

Date: 2024-07-19

Summary:

Unlike other single-point Artificial Intelligence (AI)-based prediction techniques, such as Long Short-Term Memory (LSTM), probabilistic forecasting techniques (e.g., DeepAR and Transformer) provide a range of possible outcomes and associated probabilities that enable decision makers to make more informed and robust decisions. At the same time, the Open RAN architecture has emerged as a revolutionary approach for mobile networks, aiming at openness, interoperability, and innovation in the RAN ecosystem. In this paper, we propose the use of probabilistic forecasting techniques as a radio App (rApp) within the Open RAN architecture. We investigate and compare different probabilistic and single-point forecasting methods and algorithms to estimate the utilization and resource demands of Physical Resource Blocks (PRBs) of cellular base stations. Through our evaluations, we demonstrate the numerical advantages of probabilistic forecasting techniques over traditional single-point forecasting methods and show that they are capable of providing more accurate and reliable estimates. In particular, DeepAR clearly outperforms single-point forecasting techniques such as LSTM and Seasonal-Naive (SN) baselines, as well as other probabilistic forecasting techniques such as Simple-Feed-Forward (SFF) and Transformer neural networks.
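
One way to see why probabilistic forecasts matter here is how they are scored: the pinball (quantile) loss rewards calibrated quantiles of PRB demand rather than a single point estimate. A short numpy sketch with made-up numbers:

```python
# Pinball (quantile) loss: the standard score for a predicted quantile.
import numpy as np

def pinball_loss(y_true, y_pred_q, q):
    """Average quantile loss of predicted q-th quantiles y_pred_q."""
    diff = y_true - y_pred_q
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([40., 55., 70.])        # observed PRB utilization (%)
p90 = np.array([60., 68., 80.])           # predicted 90th-percentile demand
print(pinball_loss(y_true, p90, q=0.9))
# A point forecaster offers only one value per step; a DeepAR-style model
# supplies the full quantile set needed for risk-aware resource provisioning.
```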

--------------------------------------------------------------------------------------------------------

Enhancing Biomedical Knowledge Discovery for Diseases: An End-To-End Open-Source Framework

This paper introduces an open-source end-to-end framework for constructing knowledge around specific diseases directly from raw text. The researchers create two annotated datasets focused on Rett syndrome and Alzheimer's disease and conduct extensive benchmarking to explore optimal modeling strategies for semantic relation detection. This framework could accelerate biomedical research by enabling more efficient knowledge discovery from the vast amount of published literature.

Authors:  Christos Theodoropoulos, Andrei Catalin Coman, James Henderson, Marie-Francine Moens

Link:  https://arxiv.org/abs/2407.13492v1

Date: 2024-07-18

Summary:

The ever-growing volume of biomedical publications creates a critical need for efficient knowledge discovery. In this context, we introduce an open-source end-to-end framework designed to construct knowledge around specific diseases directly from raw text. To facilitate research in disease-related knowledge discovery, we create two annotated datasets focused on Rett syndrome and Alzheimer's disease, enabling the identification of semantic relations between biomedical entities. Extensive benchmarking explores various ways to represent relations and entity representations, offering insights into optimal modeling strategies for semantic relation detection and highlighting language models' competence in knowledge discovery. We also conduct probing experiments using different layer representations and attention scores to explore transformers' ability to capture semantic relations.
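
As a flavor of what "ways to represent relations" means in practice, the generic baseline below pools token embeddings over each entity span and concatenates them for a relation classifier. This is an assumed illustration of one common choice, not the paper's specific models.

```python
# Sketch of a simple entity-pair relation representation.
import numpy as np

def relation_features(token_embs, span_a, span_b):
    """token_embs: (seq_len, dim); spans are (start, end) token indices."""
    head = token_embs[span_a[0]:span_a[1]].mean(axis=0)
    tail = token_embs[span_b[0]:span_b[1]].mean(axis=0)
    return np.concatenate([head, tail])   # input to a relation classifier

rng = np.random.default_rng(0)
sentence_embs = rng.normal(size=(30, 768))   # e.g., transformer outputs
feats = relation_features(sentence_embs, (4, 6), (12, 15))
print(feats.shape)                            # (1536,)
```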

--------------------------------------------------------------------------------------------------------

Reducing Barriers to the Use of Marginalised Music Genres in AI

This paper explores the challenges and opportunities associated with using marginalised genres of music with AI models. The researchers identify several eXplainable AI (XAI) opportunities and emphasize the importance of working with small datasets to strengthen cultural representation and address bias issues in deep learning models. This research could lead to more inclusive and diverse AI-generated music and contribute to preserving and promoting underrepresented musical traditions.

Authors:  Nick Bryan-Kinns, Zijin Li

Link:  https://arxiv.org/abs/2407.13439v1

Date: 2024-07-18

Summary:

AI systems for high-quality music generation typically rely on extremely large musical datasets to train the AI models. This creates barriers to generating music beyond the genres represented in dominant datasets, such as Western classical music or pop music. We undertook a 4-month international research project, summarised in this paper, to explore the eXplainable AI (XAI) challenges and opportunities associated with reducing barriers to using marginalised genres of music with AI models. XAI opportunities identified included improving the transparency and control of AI models, explaining the ethics and bias of AI models, fine-tuning large models with small datasets to reduce bias, and explaining style-transfer opportunities with AI models. Participants in the research emphasised that whilst it is hard to use small datasets such as marginalised music with AI, such approaches strengthen cultural representation of underrepresented cultures and contribute to addressing issues of bias in deep learning models. We are now building on this project to bring together a global International Responsible AI Music community, and we invite people to join our network.

--------------------------------------------------------------------------------------------------------

Exploring the Use of Abusive Generative AI Models on Civitai

This paper presents the first comprehensive empirical study of an AI-Generated Content (AIGC) social platform, focusing on its potential for generating abusive content. The researchers construct a dataset covering Civitai, the largest available AIGC social platform, and explore the characteristics of content and discuss moderation strategies. This study provides valuable insights for governing AIGC platforms and addressing ethical concerns related to generative AI technologies.

Authors:  Yiluo Wei, Yiming Zhu, Pan Hui, Gareth Tyson

Link:  https://arxiv.org/abs/2407.12876v1

Date: 2024-07-16

Summary:

The rise of generative AI is transforming the landscape of digital imagery and exerting a significant influence on online creative communities. This has led to the emergence of AI-Generated Content (AIGC) social platforms, such as Civitai. These distinctive social platforms allow users to build and share their own generative AI models, thereby enhancing the potential for more diverse artistic expression. Designed in the vein of social networks, they also provide artists with the means to showcase their creations (generated from the models), engage in discussions, and obtain feedback, thus nurturing a sense of community. Yet this openness also raises concerns about the abuse of such platforms, e.g., using models to disseminate deceptive deepfakes or infringe upon copyrights. To explore this, we conduct the first comprehensive empirical study of an AIGC social platform, focusing on its use for generating abusive content. As an exemplar, we construct a comprehensive dataset covering Civitai, the largest available AIGC social platform. Based on this dataset of 87K models and 2M images, we explore the characteristics of content and discuss strategies for moderation to better govern these platforms.

--------------------------------------------------------------------------------------------------------

Machine Learning in Communications: A Road to Intelligent Transmission and Processing

This overview article discusses the roles of machine learning in intelligent wireless communications, as well as its features, challenges, and practical considerations. The authors highlight how AI and machine learning are transforming traditional communication systems, leading to more adaptive, efficient, and intelligent algorithms. This research provides a roadmap for the future of wireless communications, with potential applications in 5G and beyond, Internet of Things (IoT), and smart cities.

Authors:  Shixiong Wang, Geoffrey Ye Li

Link:  https://arxiv.org/abs/2407.11595v1

Date: 2024-07-16

Summary:

Prior to the era of artificial intelligence and big data, wireless communications primarily followed a conventional research route involving problem analysis, model building and calibration, algorithm design and tuning, and holistic and empirical verification. However, this methodology often encountered limitations when dealing with large-scale and complex problems and managing dynamic and massive data, resulting in inefficiencies and limited performance of traditional communication systems and methods. As such, wireless communications have embraced the revolutionary impact of artificial intelligence and machine learning, giving birth to more adaptive, efficient, and intelligent systems and algorithms. This technological shift opens a road to intelligent information transmission and processing. This overview article discusses the typical roles of machine learning in intelligent wireless communications, as well as its features, challenges, and practical considerations.

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.