Eye On AI

Week Ending 7.14.2024

RESEARCH WATCH: 7.14.2024

Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos

The Sensorium 2023 Benchmark Competition addresses a crucial gap in computational neuroscience by providing a standardized benchmark for dynamic models of the mouse visual system. This competition features a large-scale dataset from the primary visual cortex of mice, including neuronal responses to dynamic stimuli and behavioral measurements. The benchmark aims to advance our understanding of biological visual information processing and improve predictive models connecting biological and machine vision. Potential applications include developing more accurate artificial neural networks, enhancing brain-computer interfaces, and advancing medical imaging techniques for diagnosing and treating visual disorders.

Authors:  Polina Turishcheva, Paul G. Fahey, Michaela Vystrčilová, Laura Hansel, Rachel Froebe, Kayla Ponder, Yongrong Qiu, Konstantin F. Willeke, Mohammad Bashiri, Ruslan Baikulov, Yu Zhu, Lei Ma, Shan Yu, Tiejun Huang, Bryan M. Li, Wolf De Wulf, Nina Kudryashova, Matthias H. Hennig, Nathalie L. Rochefort, Arno Onken, Eric Wang, Zhiwei Ding, Andreas S. Tolias, Fabian H. Sinz, Alexander S Ecker

Link:  https://arxiv.org/abs/2407.09100v1

Date: 2024-07-12

Summary:

Understanding how biological visual systems process information is challenging because of the nonlinear relationship between visual input and neuronal responses. Artificial neural networks allow computational neuroscientists to create predictive models that connect biological and machine vision. Machine learning has benefited tremendously from benchmarks that compare different models on the same task under standardized conditions. However, there was no standardized benchmark to identify state-of-the-art dynamic models of the mouse visual system. To address this gap, we established the Sensorium 2023 Benchmark Competition with dynamic input, featuring a new large-scale dataset from the primary visual cortex of ten mice. This dataset includes responses from 78,853 neurons to 2 hours of dynamic stimuli per neuron, together with behavioral measurements such as running speed, pupil dilation, and eye movements. The competition ranked models in two tracks based on predictive performance for neuronal responses on a held-out test set: one focusing on predicting in-domain natural stimuli and another on out-of-distribution (OOD) stimuli to assess model generalization. As part of the NeurIPS 2023 competition track, we received more than 160 model submissions from 22 teams. Several new architectures for predictive models were proposed, and the winning teams improved the previous state-of-the-art model by 50%. Access to the dataset as well as the benchmarking infrastructure will remain online at www.sensorium-competition.net.

--------------------------------------------------------------------------------------------------------

CADC: Encoding User-Item Interactions for Compressing Recommendation Model Training Data

CADC (Collaborative Aware Data Compression) tackles the growing challenge of training deep learning recommendation models (DLRMs) with exponentially increasing datasets. This innovative approach focuses on compressing training data while preserving crucial collaborative information from user-item interactions. By enriching user and item embeddings with interaction history, CADC allows for significant dataset reduction without compromising model accuracy. This method has potential applications in e-commerce, content recommendation systems, and personalized advertising, where it could dramatically reduce computational resources required for training while maintaining the effectiveness of recommendation algorithms.

Authors:  Hossein Entezari Zarch, Abdulla Alshabanah, Chaoyi Jiang, Murali Annavaram

Link:  https://arxiv.org/abs/2407.08108v1

Date: 2024-07-11

Summary:

Deep learning recommendation models (DLRMs) are at the heart of the current e-commerce industry. However, the amount of training data used to train these large models is growing exponentially, leading to substantial training hurdles. The training dataset contains two primary types of information: content-based information (features of users and items) and collaborative information (interactions between users and items). One approach to reduce the training dataset is to remove user-item interactions. But that significantly diminishes collaborative information, which is crucial for maintaining accuracy due to its inclusion of interaction histories. This loss profoundly impacts DLRM performance. This paper makes an important observation that if one can capture the user-item interaction history to enrich the user and item embeddings, then the interaction history can be compressed without losing model accuracy. Thus, this work, Collaborative Aware Data Compression (CADC), takes a two-step approach to training dataset compression. In the first step, we use matrix factorization of the user-item interaction matrix to create a novel embedding representation for both the users and items. Once the user and item embeddings are enriched by the interaction history information, the approach then applies uniform random sampling of the training dataset to drastically reduce the training dataset size while minimizing the drop in model accuracy. The source code of CADC is available at https://anonymous.4open.science/r/DSS-RM-8C1D/README.md.
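
The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small dense interaction matrix, uses truncated SVD as one possible form of matrix factorization, and the names factorize_interactions and uniform_sample are hypothetical.

```python
import numpy as np

def factorize_interactions(interactions, rank):
    """Step 1 (assumed form): truncated SVD of the user-item matrix.

    Returns (user_emb, item_emb) such that user_emb @ item_emb.T is the
    best rank-`rank` approximation of `interactions`.
    """
    u, s, vt = np.linalg.svd(interactions, full_matrices=False)
    root = np.sqrt(s[:rank])           # split singular values evenly
    user_emb = u[:, :rank] * root      # shape: (n_users, rank)
    item_emb = vt[:rank, :].T * root   # shape: (n_items, rank)
    return user_emb, item_emb

def uniform_sample(records, keep_fraction, seed=0):
    """Step 2: keep a uniformly random subset of the training records."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(len(records) * keep_fraction)))
    idx = rng.choice(len(records), size=n_keep, replace=False)
    return [records[i] for i in sorted(idx)]

# Toy data: 3 users x 4 items, 1.0 marks an observed interaction.
interactions = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
], dtype=float)

user_emb, item_emb = factorize_interactions(interactions, rank=2)
records = [(int(u), int(i)) for u, i in zip(*np.nonzero(interactions))]
subset = uniform_sample(records, keep_fraction=0.5)
```

In this sketch the embeddings are computed from the full interaction matrix before any sampling, so the collaborative signal is captured once and the downstream model can then be trained on the smaller subset.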

--------------------------------------------------------------------------------------------------------

A Neurosymbolic Approach to Adaptive Feature Extraction in SLAM

The neurosymbolic approach to adaptive feature extraction in SLAM (Simultaneous Localization and Mapping) addresses the limitations of existing tracking approaches in dynamically changing environments. By combining domain knowledge from traditional SLAM methods with data-driven learning, this method aims to create more adaptable and efficient SLAM pipelines. The focus on synthesizing the feature extraction module demonstrates improved pose error reduction compared to state-of-the-art feature extractors. Potential applications include enhancing autonomous robots, self-driving vehicles, and mixed-reality headsets, particularly in safety-critical scenarios where accurate and reliable tracking is essential.

Authors:  Yasra Chandio, Momin A. Khan, Khotso Selialia, Luis Garcia, Joseph DeGol, Fatima M. Anwar

Link:  https://arxiv.org/abs/2407.06889v1

Date: 2024-07-09

Summary:

Autonomous robots, autonomous vehicles, and humans wearing mixed-reality headsets require accurate and reliable tracking services for safety-critical applications in dynamically changing real-world environments. However, the existing tracking approaches, such as Simultaneous Localization and Mapping (SLAM), do not adapt well to environmental changes and boundary conditions despite extensive manual tuning. On the other hand, while deep learning-based approaches can better adapt to environmental changes, they typically demand substantial data for training and often lack flexibility in adapting to new domains. To solve this problem, we propose leveraging the neurosymbolic program synthesis approach to construct adaptable SLAM pipelines that integrate the domain knowledge from traditional SLAM approaches while leveraging data to learn complex relationships. While the approach can synthesize end-to-end SLAM pipelines, we focus on synthesizing the feature extraction module. We first devise a domain-specific language (DSL) that can encapsulate domain knowledge on the important attributes for feature extraction and the real-world performance of various feature extractors. Our neurosymbolic architecture then undertakes adaptive feature extraction, optimizing parameters via learning while employing symbolic reasoning to select the most suitable feature extractor. Our evaluations demonstrate that our approach, neurosymbolic Feature EXtraction (nFEX), yields higher-quality features. It also reduces the pose error observed for the state-of-the-art baseline feature extractors ORB and SIFT by up to 90% and up to 66%, respectively, thereby enhancing the system's efficiency and adaptability to novel environments.

--------------------------------------------------------------------------------------------------------

Collaborative Design of AI-Enhanced Learning Activities

This paper presents a formative intervention designed to enhance AI literacy among educators and EdTech specialists. The approach focuses on collaborative design of AI-enhanced learning activities, aiming to empower education professionals to effectively incorporate AI into their teaching practices. By exploring various activities that integrate AI literacy in education, participants develop skills to leverage AI while maintaining critical awareness of its implications. Potential applications include improving personalized learning experiences, enhancing student engagement, and preparing educators to adapt to the rapidly evolving landscape of AI in education.

Authors:  Margarida Romero

Link:  https://arxiv.org/abs/2407.06660v1

Date: 2024-07-09

Summary:

Artificial intelligence has accelerated innovations in different aspects of citizens' lives. Many contexts have already addressed technology-enhanced learning, but educators at different educational levels now need to develop AI literacy and the ability to integrate appropriate AI usage into their teaching. We take into account this objective, along with the creative learning design, to create a formative intervention that enables preservice teachers, in-service teachers, and EdTech specialists to effectively incorporate AI into their teaching practices. We developed the formative intervention with Terra Numerica and Maison de l'Intelligence Artificielle in two phases in order to enhance their understanding of AI and foster its creative application in learning design. Participants reflect on AI's potential in teaching and learning by exploring different activities that can integrate AI literacy in education, including its ethical considerations and potential for innovative pedagogy. The approach emphasises not only acculturating professionals to AI but also empowering them to collaboratively design AI-enhanced educational activities that promote learner engagement and personalised learning experiences. Through this process, participants in the workshops develop the skills and mindset necessary to effectively leverage AI while maintaining a critical awareness of its implications in education.

--------------------------------------------------------------------------------------------------------

T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models

T2VSafetyBench introduces a comprehensive benchmark for evaluating the safety of text-to-video generative models. As these models become more advanced, concerns about potential security risks and unethical content generation increase. This benchmark covers 12 critical aspects of video generation safety and employs a malicious prompt dataset to assess model vulnerabilities. The findings highlight the urgency of prioritizing video safety in the era of generative AI. Potential applications include improving content moderation systems, developing safer AI models for creative industries, and informing policy-making regarding AI-generated video content.

Authors:  Yibo Miao, Yifan Zhu, Yinpeng Dong, Lijia Yu, Jun Zhu, Xiao-Shan Gao

Link:  https://arxiv.org/abs/2407.05965v1

Date: 2024-07-08

Summary:

The recent development of Sora leads to a new era in text-to-video (T2V) generation. Along with this comes the rising concern about its security risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover fewer aspects and do not address the unique temporal risk inherent in video generation. To bridge this research gap, we introduce T2VSafetyBench, a new benchmark designed for conducting safety-critical assessments of text-to-video models. We define 12 critical aspects of video generation safety and construct a malicious prompt dataset using LLMs and jailbreaking prompt attacks. Based on our evaluation results, we draw several important findings, including: 1) no single model excels in all aspects, with different models showing various strengths; 2) the correlation between GPT-4 assessments and manual reviews is generally high; 3) there is a trade-off between the usability and safety of text-to-video generative models. This indicates that as the field of video generation rapidly advances, safety risks are set to surge, highlighting the urgency of prioritizing video safety. We hope that T2VSafetyBench can provide insights for better understanding the safety of video generation in the era of generative AI.

--------------------------------------------------------------------------------------------------------

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Video-STaR (Video Self-Training with augmented Reasoning) presents a novel approach to video instruction tuning for Large Vision Language Models (LVLMs). This method enables the utilization of any labeled video dataset for training, addressing the limitations of existing video instruction tuning datasets. By cycling between instruction generation and finetuning, Video-STaR improves general video understanding and adapts LVLMs to novel downstream tasks. Potential applications include enhancing video search and retrieval systems, improving content recommendation algorithms, and advancing automated video analysis tools for various industries such as media, security, and education.

Authors:  Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy

Link:  https://arxiv.org/abs/2407.06189v1

Date: 2024-07-08

Summary:

The performance of Large Vision Language Models (LVLMs) is dependent on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity as they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist - however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer. The answers are then filtered only to those that contain the original video labels, and the LVLM is then re-trained on the generated dataset. By only training on generated answers that contain the correct video labels, Video-STaR utilizes these existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance in (I) general video QA, where TempCompass performance improved by 10%, and (II) on downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.
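
The filtering step described above can be sketched as follows. This is a simplified illustration, not the paper's code: it assumes labels are plain strings and that a case-insensitive substring match is enough to decide whether an answer contains a label. The record fields and the function names are hypothetical.

```python
def contains_label(answer, label):
    """Assumed check: case-insensitive substring match."""
    return label.lower() in answer.lower()

def filter_generated(candidates):
    """Keep only generated answers that mention the original video label."""
    return [c for c in candidates if contains_label(c["answer"], c["label"])]

# Toy candidates: each pairs a model-generated answer with the dataset label.
candidates = [
    {
        "video": "clip_001",
        "question": "What is the person doing?",
        "answer": "The person is juggling three balls.",
        "label": "juggling",
    },
    {
        "video": "clip_002",
        "question": "What is the person doing?",
        "answer": "Someone is cooking dinner.",
        "label": "surfing",
    },
]

# Only answers consistent with the label survive and would be used
# as training data in the next fine-tuning cycle.
kept = filter_generated(candidates)
```

Labels that are numeric or structured (for example, a diving score) would need a different matching rule than the substring check assumed here.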

--------------------------------------------------------------------------------------------------------

Are They the Same Picture? Adapting Concept Bottleneck Models for Human-AI Collaboration in Image Retrieval

CHAIR (Concept Bottleneck Model for Human-AI Collaboration in Image Retrieval) introduces an innovative approach to image retrieval that allows for human intervention in AI models. By adapting Concept Bottleneck Models, CHAIR enables humans to correct intermediate concepts, improving generated embeddings and allowing for flexible levels of intervention. This method has potential applications in wildlife conservation, healthcare diagnostics, and other fields where image retrieval plays a crucial role. It could significantly enhance the efficiency and accuracy of image search systems while leveraging human expertise in a time-efficient manner.

Authors:  Vaibhav Balloli, Sara Beery, Elizabeth Bondi-Kelly

Link:  https://arxiv.org/abs/2407.08908v1

Date: 2024-07-12

Summary:

Image retrieval plays a pivotal role in applications from wildlife conservation to healthcare, for finding individual animals or relevant images to aid diagnosis. Although deep learning techniques for image retrieval have advanced significantly, their imperfect real-world performance often necessitates including human expertise. Human-in-the-loop approaches typically rely on humans completing the task independently and then combining their opinions with an AI model in various ways, as these models offer very little interpretability or correctability. To allow humans to intervene in the AI model instead, thereby saving human time and effort, we adapt the Concept Bottleneck Model (CBM) and propose CHAIR. CHAIR (a) enables humans to correct intermediate concepts, which helps improve the generated embeddings, and (b) allows for flexible levels of intervention that accommodate varying levels of human expertise for better retrieval. To show the efficacy of CHAIR, we demonstrate that our method performs better than similar models on image retrieval metrics without any external intervention. Furthermore, we also showcase how human intervention helps further improve retrieval performance, thereby achieving human-AI complementarity.

--------------------------------------------------------------------------------------------------------

A Chatbot for Asylum-Seeking Migrants in Europe

ACME (A Chatbot for asylum-seeking Migrants in Europe) is an innovative application of computational argumentation to assist migrants in identifying the highest level of protection they can apply for. This system aims to contribute to more sustainable migration by reducing the workload on territorial commissions, courts, and humanitarian organizations supporting asylum applicants. Potential applications extend beyond migration, as the underlying technology could be adapted to other complex decision-making processes in legal, administrative, or social service contexts where individuals need guidance navigating complex systems and regulations.

Authors:  Bettina Fazzinga, Elena Palmieri, Margherita Vestoso, Luca Bolognini, Andrea Galassi, Filippo Furfaro, Paolo Torroni

Link:  https://arxiv.org/abs/2407.09197v1

Date: 2024-07-12

Summary:

We present ACME: A Chatbot for asylum-seeking Migrants in Europe. ACME relies on computational argumentation and aims to help migrants identify the highest level of protection they can apply for. This would contribute to a more sustainable migration by reducing the load on territorial commissions, Courts, and humanitarian organizations supporting asylum applicants. We describe the context, system architectures, technologies, and the case study used to run the demonstration.

--------------------------------------------------------------------------------------------------------

Learning With Generalised Card Representations for "Magic: The Gathering"

This research explores generalised card representations for the game "Magic: The Gathering," addressing the challenges of deck building in collectable card games. By developing representations that can generalize to unseen cards, the study aims to extend the real-world utility of AI-based deck building. The findings demonstrate the potential for AI to deeply understand card quality and strategy, even for new cards. Potential applications include enhancing player experience in digital card games, assisting game designers in balancing new card sets, and developing more sophisticated AI opponents for training and entertainment purposes.

Authors:  Timo Bertram, Johannes Fürnkranz, Martin Müller

Link:  https://arxiv.org/abs/2407.05879v1

Date: 2024-07-08

Summary:

A defining feature of collectable card games is the deck building process prior to actual gameplay, in which players form their decks according to some restrictions. Learning to build decks is difficult for players and models alike due to the large card variety and highly complex semantics, as well as requiring meaningful card and deck representations when aiming to utilise AI. In addition, regular releases of new card sets lead to unforeseeable fluctuations in the available card pool, thus affecting possible deck configurations and requiring continuous updates. Previous Game AI approaches to building decks have often been limited to fixed sets of possible cards, which greatly limits their utility in practice. In this work, we explore possible card representations that generalise to unseen cards, thus greatly extending the real-world utility of AI-based deck building for the game "Magic: The Gathering". We study such representations based on numerical, nominal, and text-based features of cards, card images, and meta information about card usage from third-party services. Our results show that while the particular choice of generalised input representation has little effect on learning to predict human card selections among known cards, the performance on new, unseen cards can be greatly improved. Our generalised model is able to predict 55% of human choices on completely unseen cards, thus showing a deep understanding of card quality and strategy.

--------------------------------------------------------------------------------------------------------

Introducing VaDA: Novel Image Segmentation Model for Maritime Object Segmentation Using New Dataset

The VaDA (Vertical and Detail Attention) model and OASIs (Ocean AI Segmentation Initiatives) dataset address the challenges of maritime object segmentation in computer vision AI. This research aims to improve object recognition in challenging maritime environments, which is crucial for advancing autonomous navigation systems in the shipping industry. The proposed model and dataset could have significant applications in maritime safety, autonomous shipping, coastal surveillance, and environmental monitoring. Additionally, the introduced Integrated Figure of Calculation Performance (IFCP) evaluation method could be valuable for assessing real-time performance of AI models in maritime applications.

Authors:  Yongjin Kim, Jinbum Park, Sanha Kang, Hanguen Kim

Link:  https://arxiv.org/abs/2407.09005v1

Date: 2024-07-12

Summary:

The maritime shipping industry is undergoing rapid evolution driven by advancements in computer vision artificial intelligence (AI). Consequently, research on AI-based object recognition models for maritime transportation is steadily growing, leveraging advancements in sensor technology and computing performance. However, object recognition in maritime environments faces challenges such as light reflection, interference, intense lighting, and various weather conditions. To address these challenges, high-performance deep learning algorithms tailored to maritime imagery and high-quality datasets specialized for maritime scenes are essential. Existing AI recognition models and datasets have limited suitability for composing autonomous navigation systems. Therefore, in this paper, we propose a Vertical and Detail Attention (VaDA) model for maritime object segmentation and a new model evaluation method, the Integrated Figure of Calculation Performance (IFCP), to verify its suitability for the system in real-time. Additionally, we introduce a benchmark maritime dataset, OASIs (Ocean AI Segmentation Initiatives), to standardize model performance evaluation across diverse maritime environments. The OASIs dataset and details are available at our website: https://www.navlue.com/dataset

--------------------------------------------------------------------------------------------------------

A Matter of Mindset? Features and Processes of Newsroom-based Corporate Communication in Times of Artificial Intelligence

This study explores the transformation of corporate communication through the adoption of AI-powered newsrooms. As companies seek to streamline their communication strategies, corporate newsrooms have emerged as organizational hubs for agile, topic-oriented communication. The research, based on interviews with Swiss communication experts, reveals how AI is being integrated into newsrooms for routine tasks and innovative applications. The findings highlight the delicate balance between optimizing and stabilizing communication structures, as well as the urgent need for AI regulation in corporate communication. This work provides valuable insights for companies looking to establish or manage AI-enhanced corporate newsrooms in the evolving landscape of business communication.

Authors:  Tobias Rohrbach, Mykola Makhortykh

Link:  https://arxiv.org/abs/2407.06604v1

Date: 2024-07-09

Summary:

Many companies adopt the corporate newsroom model to streamline their corporate communication. This article addresses why and how corporate newsrooms transform corporate communication following the rise of artificial intelligence (AI) systems. It draws on original data from 13 semi-structured interviews with executive communication experts in large Swiss companies which use corporate newsrooms. Interviews show that corporate newsrooms serve as an organisational (rather than spatial) coordination body for topic-oriented and agile corporate communication. To enable their functionality, it is crucial to find the right balance between optimising and stabilising communication structures. Newsrooms actively adopt AI both to facilitate routine tasks and enable more innovative applications, such as living data archives and channel translations. Interviews also highlight an urgent need for AI regulation for corporate communication. The article's findings provide important insights into the practical challenges and coping strategies for establishing and managing corporate newsrooms and how newsrooms can be transformed by AI.

--------------------------------------------------------------------------------------------------------

Application of Artificial Intelligence in Supporting Healthcare Professionals and Caregivers in Treatment of Autistic Children

This paper addresses the challenges in diagnosing and treating Autism Spectrum Disorder (ASD) by exploring the potential of Artificial Intelligence. The researchers developed a sophisticated algorithm that analyzes facial and bodily expressions of children to detect ASD. Utilizing deep learning models, specifically the Xception and ResNet50V2 architectures, the study achieved high accuracy in ASD diagnosis. This research demonstrates the transformative potential of AI in improving ASD management, offering new tools for healthcare professionals and caregivers. The findings could lead to earlier, more accurate diagnoses and personalized treatment plans, potentially revolutionizing the approach to ASD care and improving outcomes for affected individuals and their families.

Authors:  Hossein Mohammadi Rouzbahani, Hadis Karimipour

Link:  https://arxiv.org/abs/2407.08902v1

Date: 2024-07-12

Summary:

Autism Spectrum Disorder (ASD) represents a multifaceted neurodevelopmental condition marked by difficulties in social interaction, communication impediments, and repetitive behaviors. Despite progress in understanding ASD, its diagnosis and treatment continue to pose significant challenges due to the variability in symptomatology and the necessity for multidisciplinary care approaches. This paper investigates the potential of Artificial Intelligence (AI) to augment the capabilities of healthcare professionals and caregivers in managing ASD. We have developed a sophisticated algorithm designed to analyze facial and bodily expressions during daily activities of both autistic and non-autistic children, leading to the development of a powerful deep learning-based autism detection system. Our study demonstrated that AI models, specifically the Xception and ResNet50V2 architectures, achieved high accuracy in diagnosing ASD. This research highlights the transformative potential of AI in improving the diagnosis, treatment, and comprehensive management of ASD.

--------------------------------------------------------------------------------------------------------

Agglomerative Clustering in Uniform and Proportional Feature Spaces

This study explores advanced pattern comparison techniques in the context of scientific modeling, artificial intelligence, and pattern recognition. The researchers focus on a hierarchical agglomerative clustering approach based on the coincidence similarity index, which offers advantages such as strict comparisons between similar patterns, inherent normalization, and robustness to noise and outliers. The paper compares this method with two other hierarchical clustering approaches, including the traditional Ward's method. By examining performance in both uniform and proportional feature spaces, the study provides valuable insights into the effectiveness of different clustering methodologies. These findings have potential applications in data analysis, machine learning, and pattern recognition across various scientific disciplines.

Authors:  Alexandre Benatti, Luciano da F. Costa

Link:  https://arxiv.org/abs/2407.08604v1

Date: 2024-07-11

Summary:

Pattern comparison represents a fundamental and crucial aspect of scientific modeling, artificial intelligence, and pattern recognition. Four main approaches have typically been applied for pattern comparison: (i) distances; (ii) statistical joint variation; (iii) projections; and (iv) similarity indices, each with their specific characteristics. In addition to arguing for intrinsic interesting properties of multiset-based similarity approaches, the present work describes a hierarchical agglomerative clustering approach based on them, which inherits several interesting characteristics of the coincidence similarity index, including strict comparisons allowing distinguishing between closely similar patterns, inherent normalization, as well as substantial robustness to the presence of noise and outliers in datasets. Two other hierarchical clustering approaches are considered, namely a multiset-based method as well as the traditional Ward's approach. After characterizing uniform and proportional feature spaces and presenting the main basic concepts and methods, a comparison of relative performance between the three considered hierarchical methods is reported and discussed, with several interesting and important results. In particular, though intrinsically suitable for implementing proportional comparisons, the coincidence similarity methodology also works effectively in several types of data in uniform feature spaces.
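
The abstract does not spell out the coincidence similarity index. The sketch below assumes the multiset-based form used in related work, where the index is the product of the Jaccard index and the interiority (overlap) index. It is restricted to non-negative, non-zero feature vectors, and the function names are hypothetical.

```python
import numpy as np

def jaccard(x, y):
    """Multiset Jaccard index: sum of minima over sum of maxima."""
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

def interiority(x, y):
    """Interiority (overlap) index: sum of minima over the smaller total."""
    return np.minimum(x, y).sum() / min(x.sum(), y.sum())

def coincidence(x, y):
    """Assumed definition: product of the Jaccard and interiority indices."""
    return jaccard(x, y) * interiority(x, y)

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.0])   # identical to a
c = np.array([2.0, 1.0, 3.0])   # two entries swapped

same = coincidence(a, b)        # identical patterns give 1.0
close = coincidence(a, c)       # (5/7) * (5/6) = 25/42
```

Because the two factors are multiplied, the score drops faster than the Jaccard index alone as patterns diverge, which is consistent with the "strict comparisons" property mentioned in the summary.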

--------------------------------------------------------------------------------------------------------

Paving the way toward foundation models for irregular and unaligned Satellite Image Time Series

This paper introduces ALISE (ALIgned Sits Encoder), a novel approach to address the challenges in creating foundation models for satellite image time series (SITS). Unlike previous models that struggle with irregular and unaligned temporal data, ALISE incorporates spatial, spectral, and temporal dimensions of SITS while producing aligned latent representations. The model uses a flexible query mechanism and a multi-view framework, integrating instance discrimination with masked autoencoding. ALISE's effectiveness is demonstrated through downstream tasks including crop segmentation, land cover segmentation, and unsupervised crop change detection. This research paves the way for more accurate and versatile remote sensing applications, potentially improving agricultural monitoring, land use analysis, and environmental change detection.

Authors:  Iris Dumeur, Silvia Valero, Jordi Inglada

Link:  https://arxiv.org/abs/2407.08448v1

Date: 2024-07-11

Summary:

Although several foundation models for satellite remote sensing imagery have recently been proposed, they fail to address major challenges of real/operational applications. Indeed, embeddings that don't take into account the spectral, spatial and temporal dimensions of the data as well as the irregular or unaligned temporal sampling are of little use for most real-world uses. As a consequence, we propose an ALIgned Sits Encoder (ALISE), a novel approach that leverages the spatial, spectral, and temporal dimensions of irregular and unaligned SITS while producing aligned latent representations. Unlike SSL models currently available for SITS, ALISE incorporates a flexible query mechanism to project the SITS into a common and learned temporal projection space. Additionally, thanks to a multi-view framework, we explore integrating instance discrimination alongside a masked autoencoding task for SITS. The quality of the produced representation is assessed through three downstream tasks: crop segmentation (PASTIS), land cover segmentation (MultiSenGE), and a novel crop change detection dataset. Furthermore, the change detection task is performed without supervision. The results suggest that the use of aligned representations is more effective than previous SSL methods for linear probing segmentation tasks.

--------------------------------------------------------------------------------------------------------

An efficient method to automate tooth identification and 3D bounding box extraction from Cone Beam CT Images

This paper presents an innovative method for automating tooth identification and 3D bounding box extraction from Cone Beam Computed Tomography (CBCT) images. The proposed approach addresses the challenges of accurately analyzing dental pathologies, particularly in the presence of artifacts from fillings and restorations. By dividing 3D images into axial slices and using a single-stage object detector, the method pinpoints and labels individual teeth, creating three-dimensional representations. This automation has been successfully integrated into the dental analysis tool Dentomo, potentially revolutionizing dental diagnostics and treatment planning. The technology could enhance the accuracy and efficiency of dental procedures, improve patient care, and facilitate more precise orthodontic and surgical interventions.

Authors:  Ignacio Garrido Botella, Ignacio Arranz Águeda, Juan Carlos Armenteros Carmona, Oleg Vorontsov, Fernando Bayón Robledo, Evgeny Solovykh, Obrubov Aleksandr Andreevich, Adrián Alonso Barriuso

Link:  https://arxiv.org/abs/2407.05892v2

Date: 2024-07-10

Summary:

Accurate identification, localization, and segregation of teeth from Cone Beam Computed Tomography (CBCT) images are essential for analyzing dental pathologies. Modeling an individual tooth can be challenging and intricate to accomplish, especially when fillings and other restorations introduce artifacts. This paper proposes a method for automatically detecting, identifying, and extracting teeth from CBCT images. Our approach involves dividing the three-dimensional images into axial slices for image detection. Teeth are pinpointed and labeled using a single-stage object detector. Subsequently, bounding boxes are delineated and identified to create three-dimensional representations of each tooth. The proposed solution has been successfully integrated into the dental analysis tool Dentomo.
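The slice-then-merge idea in this summary can be illustrated with a minimal Python sketch. The abstract does not say how per-slice detections are combined, so the rule used here (take the union of all 2D boxes that share a tooth label across axial slices) is an assumption, and the names SliceDetection and merge_to_3d are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class SliceDetection:
    """One 2D detection on a single axial slice (hypothetical structure)."""
    z: int          # index of the axial slice
    label: str      # tooth identifier predicted by the 2D detector
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# (x_min, y_min, z_min, x_max, y_max, z_max)
Box3D = Tuple[float, float, int, float, float, int]


def merge_to_3d(detections: Iterable[SliceDetection]) -> Dict[str, Box3D]:
    """Assumed merge rule: union of all same-label 2D boxes over the slices."""
    boxes: Dict[str, Box3D] = {}
    for d in detections:
        if d.label not in boxes:
            # First sighting of this tooth: the 3D box is one slice thick.
            boxes[d.label] = (d.x_min, d.y_min, d.z, d.x_max, d.y_max, d.z)
        else:
            b = boxes[d.label]
            # Grow the box so it also covers the new slice detection.
            boxes[d.label] = (
                min(b[0], d.x_min), min(b[1], d.y_min), min(b[2], d.z),
                max(b[3], d.x_max), max(b[4], d.y_max), max(b[5], d.z),
            )
    return boxes


if __name__ == "__main__":
    # Made-up detections for two teeth; the labels are illustrative only.
    example = [
        SliceDetection(z=10, label="11", x_min=5, y_min=5, x_max=9, y_max=9),
        SliceDetection(z=11, label="11", x_min=4, y_min=6, x_max=10, y_max=8),
        SliceDetection(z=11, label="21", x_min=20, y_min=5, x_max=25, y_max=9),
    ]
    for label, box in merge_to_3d(example).items():
        print(label, box)
```

A real pipeline would likely also filter low-confidence detections and handle a tooth that receives different labels on different slices; neither is covered here.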

--------------------------------------------------------------------------------------------------------

GeNet: A Multimodal LLM-Based Co-Pilot for Network Topology and Configuration

GeNet introduces a novel multimodal co-pilot for enterprise network engineers, addressing the complex and error-prone nature of communication network engineering. Leveraging a large language model (LLM), GeNet streamlines network design workflows by interpreting and updating network topologies and device configurations based on user intents. The framework's ability to accurately interpret network topology images has been demonstrated through evaluations on enterprise network scenarios. GeNet's potential to reduce network engineers' efforts and accelerate design processes could significantly improve efficiency in enterprise network management. Additionally, the research highlights the importance of precise topology understanding when handling network modifications, paving the way for more intelligent and automated network design tools.

Authors:  Beni Ifland, Elad Duani, Rubin Krief, Miro Ohana, Aviram Zilberman, Andres Murillo, Ofir Manor, Ortal Lavi, Hikichi Kenji, Asaf Shabtai, Yuval Elovici, Rami Puzis

Link:  https://arxiv.org/abs/2407.08249v1

Date: 2024-07-11

Summary:

Communication network engineering in enterprise environments is traditionally a complex, time-consuming, and error-prone manual process. Most research on network engineering automation has concentrated on configuration synthesis, often overlooking changes in the physical network topology. This paper introduces GeNet, a multimodal co-pilot for enterprise network engineers. GeNet is a novel framework that leverages a large language model (LLM) to streamline network design workflows. It uses visual and textual modalities to interpret and update network topologies and device configurations based on user intents. GeNet was evaluated on enterprise network scenarios adapted from Cisco certification exercises. Our results demonstrate GeNet's ability to interpret network topology images accurately, potentially reducing network engineers' efforts and accelerating network design processes in enterprise environments. Furthermore, we show the importance of precise topology understanding when handling intents that require modifications to the network's topology.

--------------------------------------------------------------------------------------------------------

Variational Best-of-N Alignment

This paper introduces Variational Best-of-N (vBoN) alignment, a novel approach to improve the efficiency of aligning language models with human preferences. While the Best-of-N (BoN) algorithm is effective, it's computationally expensive, reducing sampling throughput. vBoN aims to mimic BoN's performance through fine-tuning, minimizing backward KL divergence to the BoN distribution. This method, analogous to mean-field variational inference, potentially reduces inference cost significantly. Although not as effective as BoN in alignment, vBoN shows promise by appearing more frequently on the Pareto frontier of reward and KL divergence compared to models trained with KL-constrained RL objectives. This research could lead to more efficient language model training and deployment in various natural language processing applications.

Authors:  Afra Amini, Tim Vieira, Ryan Cotterell

Link:  https://arxiv.org/abs/2407.06057v1

Date: 2024-07-08

Summary:

Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the language model to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the language model to minimize backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on a controlled generation task suggest that while variational BoN is not as effective as BoN in aligning language models, it is close to BoN performance, as vBoN appears more often on the Pareto frontier of reward and KL divergence than models trained with a KL-constrained RL objective.
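The plain BoN procedure described in this summary (draw N samples, return the one with the highest reward) can be sketched in a few lines of Python. This illustrates only the inference-time baseline, not the vBoN fine-tuning objective. The sampler and reward function are toy stand-ins for a language model and a reward model, and every name in the block is hypothetical.

```python
import random
from typing import Callable


def best_of_n(sample: Callable[[], str],
              reward: Callable[[str], float],
              n: int) -> str:
    """Draw n candidates and return the one the reward function scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # n calls to the sampler: this is the factor-of-N cost the summary mentions.
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=reward)


def make_toy_sampler(seed: int) -> Callable[[], str]:
    """Toy stand-in for a language model: picks a word at random."""
    rng = random.Random(seed)
    vocab = ["ok", "good", "great", "excellent"]
    return lambda: rng.choice(vocab)


def toy_reward(text: str) -> float:
    """Toy stand-in for a reward model: longer strings score higher."""
    return float(len(text))


if __name__ == "__main__":
    sampler = make_toy_sampler(seed=0)
    print(best_of_n(sampler, toy_reward, n=4))
```

Each returned output costs n sampler calls, which is the throughput penalty that motivates distilling BoN into the model itself.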

--------------------------------------------------------------------------------------------------------

Swiss DINO: Efficient and Versatile Vision Framework for On-device Personal Object Search

Swiss DINO addresses the growing trend of incorporating vision systems in robotic home appliances for personalized object search. This framework tackles the challenges of distinguishing between fine-grained classes in cluttered environments and meeting strict on-device resource requirements. Based on the DINOv2 transformer model, Swiss DINO offers a simple yet effective solution for one-shot personal object search without requiring adaptation training. The research demonstrates significant improvements in segmentation and recognition accuracy compared to lightweight solutions, while also reducing inference time and GPU consumption compared to heavier transformer-based models. This technology has potential applications in smart home systems, personal robotics, and adaptive vision systems for various mobile devices.

Authors:  Kirill Paramonov, Jia-Xing Zhong, Umberto Michieli, Jijoong Moon, Mete Ozay

Link:  https://arxiv.org/abs/2407.07541v1

Date: 2024-07-10

Summary:

In this paper, we address a recent trend in robotic home appliances to include vision systems on personal devices, capable of personalizing the appliances on the fly. In particular, we formulate and address an important technical task of personal object search, which involves localization and identification of personal items of interest on images captured by robotic appliances, with each item referenced only by a few annotated images. The task is crucial for robotic home appliances and mobile systems, which need to process personal visual scenes or to operate with particular personal objects (e.g., for grasping or navigation). In practice, personal object search presents two main technical challenges. First, a robot vision system needs to be able to distinguish between many fine-grained classes, in the presence of occlusions and clutter. Second, the strict resource requirements for the on-device system restrict the usage of most state-of-the-art methods for few-shot learning and often prevent on-device adaptation. In this work, we propose Swiss DINO: a simple yet effective framework for one-shot personal object search based on the recent DINOv2 transformer model, which was shown to have strong zero-shot generalization properties. Swiss DINO handles challenging on-device personalized scene understanding requirements and does not require any adaptation training. We show significant improvement (up to 55%) in segmentation and recognition accuracy compared to the common lightweight solutions, and significant footprint reduction of backbone inference time (up to 100x) and GPU consumption (up to 10x) compared to the heavy transformer-based solutions.

--------------------------------------------------------------------------------------------------------

On LLM Wizards: Identifying Large Language Models' Behaviors for Wizard of Oz Experiments

This paper explores the use of Large Language Models (LLMs) as "Wizards" in Wizard of Oz (WoZ) experiments, a research method traditionally involving human role-playing to simulate not-yet-available technologies. The study addresses the lack of methodological guidance for responsibly applying LLMs in WoZ experiments and evaluating their role-playing abilities. Through two LLM-powered WoZ studies, the researchers develop an experiment lifecycle for safely integrating LLMs into such experiments and interpreting the resulting data. They also introduce a heuristic-based evaluation framework for assessing LLMs' role-playing capabilities. This research could significantly impact human-computer interaction studies, allowing for more scalable and cost-effective user behavior analysis and technology design exploration.

Authors:  Jingchao Fang, Nikos Arechiga, Keiichi Namaoshi, Nayeli Bravo, Candice Hogan, David A. Shamma

Link:  https://arxiv.org/abs/2407.08067v1

Date: 2024-07-10

Summary:

The Wizard of Oz (WoZ) method is a widely adopted research approach where a human Wizard "role-plays" a not readily available technology and interacts with participants to elicit user behaviors and probe the design space. With the growing ability for modern large language models (LLMs) to role-play, one can apply LLMs as Wizards in WoZ experiments with better scalability and lower cost than the traditional approach. However, methodological guidance on responsibly applying LLMs in WoZ experiments and a systematic evaluation of LLMs' role-playing ability are lacking. Through two LLM-powered WoZ studies, we take the first step towards identifying an experiment lifecycle for researchers to safely integrate LLMs into WoZ experiments and interpret data generated from settings that involve Wizards role-played by LLMs. We also contribute a heuristic-based evaluation framework that allows the estimation of LLMs' role-playing ability in WoZ experiments and reveals LLMs' behavior patterns at scale.

--------------------------------------------------------------------------------------------------------

MEEG and AT-DGNN: Advancing EEG Emotion Recognition with Music and Graph Learning

This paper introduces the MEEG dataset and AT-DGNN framework, advancing EEG-based emotion recognition through music and graph learning. The MEEG dataset captures a wide range of emotional responses to music, enabling in-depth analysis of brainwave patterns in musical contexts. The AT-DGNN framework combines an attention-based temporal learner with a dynamic graph neural network to model EEG data dynamics across varying brain network topologies. Achieving superior performance in arousal and valence prediction, this approach outperforms state-of-the-art methods. The research not only enhances our understanding of brain emotional processing but also improves emotion recognition technologies for brain-computer interfaces. Potential applications include personalized music therapy, emotion-aware AI systems, and advanced neurofeedback techniques.

Authors:  Minghao Xiao, Zhengxi Zhu, Wenyu Wang, Meixia Qu

Link:  https://arxiv.org/abs/2407.05550v1

Date: 2024-07-08

Summary:

Recent advances in neuroscience have elucidated the crucial role of coordinated brain region activities during cognitive tasks. To explore this complexity, we introduce the MEEG dataset, a comprehensive multi-modal music-induced electroencephalogram (EEG) dataset, and the Attention-based Temporal Learner with Dynamic Graph Neural Network (AT-DGNN), a novel framework for EEG-based emotion recognition. The MEEG dataset captures a wide range of emotional responses to music, enabling an in-depth analysis of brainwave patterns in musical contexts. The AT-DGNN combines an attention-based temporal learner with a dynamic graph neural network (DGNN) to accurately model the local and global graph dynamics of EEG data across varying brain network topology. Our evaluations show that AT-DGNN achieves superior performance, with an accuracy (ACC) of 83.06% in arousal and 85.31% in valence, outperforming state-of-the-art (SOTA) methods on the MEEG dataset. Comparative analyses with traditional datasets like DEAP highlight the effectiveness of our approach and underscore the potential of music as a powerful medium for emotion induction. This study not only advances our understanding of the brain's emotional processing, but also enhances the accuracy of emotion recognition technologies in brain-computer interfaces (BCI), leveraging both graph-based learning and the emotional impact of music. The source code and dataset are available at https://github.com/xmh1011/AT-DGNN.

--------------------------------------------------------------------------------------------------------

