Eye On AI

Week Ending 11.3.2024

RESEARCH WATCH: 11.3.2024

In-Context Fine-Tuning for Time-Series Foundation Models

The paper addresses the growing need for adaptable time-series forecasting models by introducing a novel in-context fine-tuning approach. While traditional forecasting models often struggle with domain adaptation, this research presents a foundation model that can leverage multiple related time-series examples during inference to improve predictions. The approach demonstrates superior performance compared to conventional methods and matches the effectiveness of explicitly fine-tuned models. This innovation could benefit various fields requiring accurate time-series forecasting, such as financial markets, weather prediction, and demand forecasting in supply chains.

Authors:  Abhimanyu Das, Matthew Faw, Rajat Sen, Yichen Zhou

Link:  https://arxiv.org/abs/2410.24087v1

Date: 2024-10-31

Summary:

Motivated by the recent success of time-series foundation models for zero-shot forecasting, we present a methodology for in-context fine-tuning of a time-series foundation model. In particular, we design a pretrained foundation model that can be prompted (at inference time) with multiple time-series examples, in order to forecast a target time-series into the future. Our foundation model is specifically trained to utilize examples from multiple related time-series in its context window (in addition to the history of the target time-series) to help it adapt to the specific distribution of the target domain at inference time. We show that such a foundation model that uses in-context examples at inference time can obtain much better performance on popular forecasting benchmarks compared to supervised deep learning methods, statistical models, as well as other time-series foundation models. Interestingly, our in-context fine-tuning approach even rivals the performance of a foundation model that is explicitly fine-tuned on the target domain.
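
To make the idea concrete, here is a minimal sketch of how such a context might be assembled, with related series supplied as in-context examples ahead of the target history. The layout, separator sentinel, and length cap are illustrative assumptions; the paper defines the actual input format.

```python
import numpy as np

def build_context(target_history, support_series, max_len=512, sep=np.nan):
    """Assemble an in-context input for a time-series foundation model.

    Hypothetical layout (the paper's exact format may differ): each related
    series is appended as an in-context example, separated by a sentinel,
    followed by the history of the target series to be forecast.
    """
    parts = []
    for s in support_series:                      # related time-series examples
        parts.append(np.asarray(s, dtype=float))
        parts.append(np.array([sep]))             # separator between series
    parts.append(np.asarray(target_history, dtype=float))
    context = np.concatenate(parts)
    return context[-max_len:]                     # keep the most recent points

# Toy usage: two related series plus the target's own history.
ctx = build_context(
    target_history=np.sin(np.linspace(0, 6, 60)),
    support_series=[np.sin(np.linspace(0, 6, 60) + 0.3),
                    np.sin(np.linspace(0, 6, 60) + 0.7)],
)
print(ctx.shape)
```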

--------------------------------------------------------------------------------------------------------

Leveraging LLMs for MT in Crisis Scenarios: a blueprint for low-resource languages

This research tackles the critical challenge of rapid, accurate communication during crisis situations for low-resource languages. The study presents a framework combining Large Language Models with community-driven corpus development to enhance machine translation capabilities. By focusing on crisis-specific scenarios and using the COVID-19 pandemic as a case study, the research provides a practical blueprint for developing emergency response translation systems. This approach could significantly improve humanitarian aid efforts, emergency response coordination, and crisis communication in regions where traditional translation resources are limited.

Authors:  Séamus Lankford, Andy Way

Link:  https://arxiv.org/abs/2410.23890v1

Date: 2024-10-31

Summary:

In an evolving landscape of crisis communication, the need for robust and adaptable Machine Translation (MT) systems is more pressing than ever, particularly for low-resource languages. This study presents a comprehensive exploration of leveraging Large Language Models (LLMs) and Multilingual LLMs (MLLMs) to enhance MT capabilities in such scenarios. By focusing on the unique challenges posed by crisis situations where speed, accuracy, and the ability to handle a wide range of languages are paramount, this research outlines a novel approach that combines the cutting-edge capabilities of LLMs with fine-tuning techniques and community-driven corpus development strategies. At the core of this study is the development and empirical evaluation of MT systems tailored for two low-resource language pairs, illustrating the process from initial model selection and fine-tuning through to deployment. Bespoke systems are developed and modelled on the recent Covid-19 pandemic. The research highlights the importance of community involvement in creating highly specialised, crisis-specific datasets and compares custom GPTs with NLLB-adapted MLLM models. It identifies fine-tuned MLLM models as offering superior performance compared with their LLM counterparts. A scalable and replicable model for rapid MT system development in crisis scenarios is outlined. Our approach enhances the field of humanitarian technology by offering a blueprint for developing multilingual communication systems during emergencies.
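
For readers who want a starting point, the sketch below runs an off-the-shelf NLLB checkpoint (the MLLM family the study adapts) via the transformers library; crisis-specific fine-tuning would train this same model on a community-built corpus. The checkpoint size and language codes are illustrative choices, not the paper's exact setup.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-distilled-600M"   # size chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Wash your hands frequently and avoid large gatherings."
inputs = tokenizer(text, return_tensors="pt")
out = model.generate(
    **inputs,
    # Force the decoder to start in the target language (Irish, as an example).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("gle_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```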

--------------------------------------------------------------------------------------------------------

TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes

This study introduces a novel approach to modeling human visual attention patterns by combining deep learning with point process theory. While previous models focused primarily on predicting where people look, TPP-Gaze innovatively models both spatial and temporal aspects of gaze behavior. This advancement could have significant applications in human-computer interaction, virtual reality design, user experience testing, and cognitive science research. The model's superior performance across multiple datasets suggests its potential for improving how we understand and predict human visual attention patterns.

Authors:  Alessandro D'Amelio, Giuseppe Cartella, Vittorio Cuculo, Manuele Lucchi, Marcella Cornia, Rita Cucchiara, Giuseppe Boccignone

Link:  https://arxiv.org/abs/2410.23409v1

Date: 2024-10-30

Summary:

Attention guides our gaze to fixate the proper location of the scene and holds it there for the required amount of time given current processing demands, before shifting to the next one. As such, gaze deployment is crucially a temporal process. Existing computational models have made significant strides in predicting the spatial aspects of observers' visual scanpaths (where to look), while often relegating the temporal facet of attention dynamics (when) to the background. In this paper we present TPP-Gaze, a novel and principled approach to modelling scanpath dynamics based on Neural Temporal Point Processes (TPP), which jointly learns the temporal dynamics of fixation positions and durations, integrating deep learning methodologies with point process theory. We conduct extensive experiments across five publicly available datasets. Our results show the overall superior performance of the proposed model compared to state-of-the-art approaches. Source code and trained models are publicly available at: https://github.com/phuselab/tppgaze.
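
As a rough illustration of the modelling idea, the sketch below writes down a tiny neural temporal point process over fixations: a recurrent encoder parameterizes an intensity for when the next fixation occurs and a distribution over where. The architecture and likelihood here are simplified assumptions; TPP-Gaze's actual design is in the paper and repository.

```python
import torch
import torch.nn as nn

class TinyGazeTPP(nn.Module):
    """Minimal sketch of a neural temporal point process over fixations.

    Each event is (x, y, duration). A GRU encodes the history; from its state
    we parameterize an intensity for the next fixation onset and a Gaussian
    over its spatial location. Illustrative only, not TPP-Gaze itself.
    """
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(input_size=3, hidden_size=hidden, batch_first=True)
        self.intensity = nn.Linear(hidden, 1)   # rate of next fixation onset
        self.loc = nn.Linear(hidden, 2)         # mean of next (x, y)

    def neg_log_likelihood(self, events, dts):
        # events: (B, T, 3) fixations; dts: (B, T) inter-fixation intervals.
        h, _ = self.rnn(events)
        lam = torch.nn.functional.softplus(self.intensity(h)).squeeze(-1)
        # Constant-intensity approximation between events:
        # log-lik = sum log lam(t_i) - integral of lam (approximated by lam*dt).
        log_lik_time = (torch.log(lam + 1e-8) - lam * dts).sum(dim=1)
        mu = self.loc(h)
        # Unit-variance Gaussian over the *next* location (targets shifted by one).
        sq_err = ((events[:, 1:, :2] - mu[:, :-1]) ** 2).sum(-1)
        log_lik_space = -0.5 * sq_err.sum(dim=1)
        return -(log_lik_time + log_lik_space).mean()

model = TinyGazeTPP()
ev = torch.randn(4, 10, 3)            # toy fixation sequences
dt = torch.rand(4, 10)                # toy inter-fixation intervals
print(model.neg_log_likelihood(ev, dt).item())
```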

--------------------------------------------------------------------------------------------------------

EMMA: End-to-End Multimodal Model for Autonomous Driving

EMMA represents a significant advancement in autonomous driving technology by integrating multiple tasks into a unified language-based framework. Built on a multimodal large language model foundation, it processes camera data to generate driving outputs, including trajectories and object detection. While showing promising results in motion planning and object detection, it is currently limited in the number of image frames it can process and is computationally expensive. This research could influence future autonomous vehicle development by demonstrating the potential of language models in complex driving tasks.

Authors:  Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, James Guo, Dragomir Anguelov, Mingxing Tan

Link:  https://arxiv.org/abs/2410.23262v1

Date: 2024-10-30

Summary:

We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small number of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar, and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.
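
The abstract's key design choice, representing non-sensor inputs and outputs as plain text, can be pictured with a toy prompt builder. The field names and waypoint format below are invented for illustration, not EMMA's actual schema.

```python
# Illustrative only: how driving inputs/outputs might be serialized as text in
# a unified language space, per the abstract. All names here are assumptions.
def make_planning_prompt(nav_instruction, ego_status, task="plan_trajectory"):
    return (
        f"[task] {task}\n"
        f"[navigation] {nav_instruction}\n"
        f"[ego] speed={ego_status['speed_mps']:.1f} m/s, "
        f"heading={ego_status['heading_deg']:.0f} deg\n"
        "[output] waypoints as (x, y) in meters, 0.5 s apart:"
    )

prompt = make_planning_prompt(
    "turn left at the next intersection",
    {"speed_mps": 8.2, "heading_deg": 94},
)
print(prompt)
# A model response would then be parsed back from text, e.g.:
# "(0.0, 0.0) (3.9, 0.4) (7.6, 1.5) ..."
```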

--------------------------------------------------------------------------------------------------------

Partial Channel Dependence with Channel Masks for Time Series Foundation Models

This paper addresses a fundamental challenge in time series foundation models by introducing partial channel dependence (PCD) to handle varying dependencies between channels. The research presents a novel channel mask approach that captures both relative and absolute dependencies between channels within datasets. This innovation could improve performance across various time series applications, including forecasting, classification, imputation, and anomaly detection. The method's effectiveness in both few-shot and zero-shot scenarios makes it particularly valuable for real-world applications with limited data availability.

Authors:  Seunghan Lee, Taeyoung Park, Kibok Lee

Link:  https://arxiv.org/abs/2410.23222v1

Date: 2024-10-30

Summary:

Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of large-scale TS datasets. However, previous efforts have primarily focused on designing model architectures to address explicit heterogeneity among datasets such as various numbers of channels, while often overlooking implicit heterogeneity such as varying dependencies between channels. In this work, we introduce the concept of partial channel dependence (PCD), which enables a more sophisticated adjustment of channel dependencies based on dataset-specific information. To achieve PCD, we propose a channel mask that captures the relationships between channels within a dataset using two key components: 1) a correlation matrix that encodes relative dependencies between channels, and 2) domain parameters that learn the absolute dependencies specific to each dataset, refining the correlation matrix. We validate the effectiveness of PCD across four tasks in TS including forecasting, classification, imputation, and anomaly detection, under diverse settings, including few-shot and zero-shot scenarios with both TS foundation models and single-task models. Code is available at https://github.com/seunghan96/CM.
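
A minimal sketch of the channel-mask idea follows: a batch-estimated correlation matrix supplies the relative dependencies, and a learnable per-dataset parameter refines it into the final mask. The exact refinement and parameterization in the paper may differ.

```python
import torch
import torch.nn as nn

class ChannelMask(nn.Module):
    """Sketch of a channel mask for partial channel dependence (PCD).

    Combines (1) a data-driven correlation matrix (relative dependencies)
    with (2) learnable per-dataset domain parameters (absolute dependencies).
    Schematic rendering of the idea, not the paper's exact formulation.
    """
    def __init__(self, n_channels):
        super().__init__()
        self.domain = nn.Parameter(torch.zeros(n_channels, n_channels))

    def forward(self, x):
        # x: (B, C, T). Correlation across time per batch item, then averaged.
        xc = x - x.mean(dim=-1, keepdim=True)
        cov = xc @ xc.transpose(1, 2) / (x.shape[-1] - 1)
        std = cov.diagonal(dim1=1, dim2=2).clamp_min(1e-8).sqrt()
        corr = cov / (std.unsqueeze(-1) * std.unsqueeze(-2))
        corr = corr.mean(dim=0)                 # (C, C) relative dependencies
        # Refine with learned, dataset-specific absolute dependencies.
        return torch.sigmoid(self.domain) * corr

x = torch.randn(8, 4, 96)       # batch of 4-channel series, length 96
print(ChannelMask(4)(x).shape)  # torch.Size([4, 4])
```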

--------------------------------------------------------------------------------------------------------

From Hype to Reality: The Road Ahead of Deploying DRL in 6G Networks

This research explores the practical implementation of Deep Reinforcement Learning (DRL) in 6G network management. As networks evolve to meet demands for massive connectivity and ultra-low latency, traditional management approaches become insufficient. The study demonstrates DRL's potential through applications in wireless access control, baseband function placement, and network slicing coordination. While highlighting DRL's promise, it also addresses practical deployment challenges and solutions. This work could guide the development of more efficient and adaptive network management systems for next-generation communications.

Authors:  Haiyuan Li, Hari Madhukumar, Peizheng Li, Yiran Teng, Shuangyi Yan, Dimitra Simeonidou

Link:  https://arxiv.org/abs/2410.23086v1

Date: 2024-10-30

Summary:

The industrial landscape is rapidly evolving with the advent of 6G applications, which demand massive connectivity, high computational capacity, and ultra-low latency. These requirements present new challenges, which can no longer be efficiently addressed by conventional strategies. In response, this article underscores the transformative potential of Deep Reinforcement Learning (DRL) for 6G, highlighting its advantages over classic machine learning solutions in meeting the demands of 6G. The necessity of DRL is further validated through three DRL applications in an end-to-end communication procedure, including wireless access control, baseband function placement, and network slicing coordination. However, DRL-based network management initiatives are far from mature. We extend the discussion to identify the challenges of applying DRL in practical networks and explore potential solutions along with their respective limitations. In the end, these insights are validated through a practical DRL deployment in managing network slices on the testbed.

--------------------------------------------------------------------------------------------------------

VPO: Leveraging the Number of Votes in Preference Optimization

This paper advances the field of language model training by introducing a novel approach that better incorporates human preference data. By leveraging user voting statistics through Bayesian estimation, VPO improves upon existing preference optimization methods like DPO and IPO. The framework distinguishes between controversial and obvious generation pairs, leading to better alignment with diverse subjective preferences. This advancement could enhance the development of language models that better reflect human preferences and values, improving applications in content generation, dialogue systems, and human-AI interaction.

Authors:  Jae Hyeon Cho, Minkyung Park, Byung-Jun Lee

Link:  https://arxiv.org/abs/2410.22891v1

Date: 2024-10-30

Summary:

Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets are typically created by selecting preferred sentences through a voting process involving multiple individuals, as opinions can vary due to the subjective nature of human preferences. While the number of votes offers insight into whether a sentence pair is clearly preferable or controversial, current methods do not fully leverage this information. In this paper, we introduce a technique that leverages user voting data to better align with diverse subjective preferences. We employ the Bayesian Minimum Mean Square Error (Bayesian MMSE) estimator to model the probability that one generation is preferable to another. Using this estimated probability as a target, we develop the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs. We show that previous algorithms, such as DPO and Identity Preference Optimization (IPO), can be extended using the proposed framework, termed VDPO and VIPO. Our experiments demonstrate that these proposed algorithms outperform various existing methods, including their base algorithms.
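
The core computation is easy to sketch: a posterior-mean (Bayesian MMSE) estimate of the preference probability from vote counts becomes a soft target in a DPO-style loss. The uniform Beta(1,1) prior and loss form below are assumptions for illustration, not necessarily the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def vote_target(votes_w, votes_l, prior_a=1.0, prior_b=1.0):
    """Posterior-mean (MMSE) estimate that the 'winning' response is preferred,
    from raw vote counts, assuming a Beta(prior_a, prior_b) prior."""
    return (votes_w + prior_a) / (votes_w + votes_l + prior_a + prior_b)

def vdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, votes_w, votes_l, beta=0.1):
    """DPO extended with vote-based soft targets (VDPO-style sketch).

    With p near 1 (an 'obvious' pair) this approaches standard DPO; with p
    near 0.5 (a 'controversial' pair) the pull on the margin weakens.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    p = vote_target(votes_w, votes_l)
    return -(p * F.logsigmoid(margin) + (1 - p) * F.logsigmoid(-margin)).mean()

# Toy example: a 9-vs-1 pair (obvious) and a 6-vs-4 pair (controversial).
lw, ll = torch.tensor([-4.0, -4.0]), torch.tensor([-5.0, -5.0])
rw, rl = torch.tensor([-4.2, -4.2]), torch.tensor([-4.8, -4.8])
print(vdpo_loss(lw, ll, rw, rl, torch.tensor([9., 6.]), torch.tensor([1., 4.])))
```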

--------------------------------------------------------------------------------------------------------

Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising

This research addresses the challenge of real-time speech enhancement for automatic speech recognition in dynamic environments. The study proposes an adaptive approach combining neural beamforming with run-time optimization for simultaneous dereverberation and denoising. This innovation could significantly improve speech recognition systems' performance in challenging acoustic environments, benefiting applications like virtual assistants, teleconferencing systems, and hearing aids. The method's ability to adapt to varying conditions makes it particularly valuable for real-world applications.

Authors:  Yoto Fujita, Aditya Arie Nugraha, Diego Di Carlo, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii

Link:  https://arxiv.org/abs/2410.22805v1

Date: 2024-10-30

Summary:

This paper describes speech enhancement for real-time automatic speech recognition (ASR) in real environments. A standard approach to this task is to use neural beamforming that can work efficiently in an online manner. It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming. The performance of such a supervised approach, however, is drastically degraded under mismatched conditions. This calls for run-time adaptation of the DNN. Although the ground-truth speech spectrogram required for adaptation is not available at run time, blind dereverberation and separation methods such as weighted prediction error (WPE) and fast multichannel nonnegative matrix factorization (FastMNMF) can be used for generating pseudo ground-truth data from a mixture. Based on this idea, a prior work proposed a dual-process system based on a cascade of WPE and minimum variance distortionless response (MVDR) beamforming asynchronously fine-tuned by block-online FastMNMF. To integrate the dereverberation capability into neural beamforming and make it fine-tunable at run time, we propose to use weighted power minimization distortionless response (WPD) beamforming, a unified version of WPE and minimum power distortionless response (MPDR), whose joint dereverberation and denoising filter is estimated using a DNN. We evaluated the impact of run-time adaptation under various conditions with different numbers of speakers, reverberation times, and signal-to-noise ratios (SNRs).
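
Schematically, the run-time adaptation loop looks like the sketch below: a blind separation method (FastMNMF in the paper, stubbed here) turns observed mixtures into pseudo targets, on which the mask-estimating DNN is fine-tuned online. Shapes, loss, and the stand-in modules are placeholders rather than the paper's exact procedure.

```python
import torch

def runtime_adapt(mask_dnn, mixture_batches, blind_separator, steps=1, lr=1e-4):
    """Sketch of run-time adaptation for a neural beamformer.

    The ground-truth spectrogram is unavailable at run time, so a blind
    method produces pseudo ground-truth masks from the observed mixture,
    and the mask DNN is fine-tuned on them.
    """
    opt = torch.optim.Adam(mask_dnn.parameters(), lr=lr)
    for mix in mixture_batches:              # mix: (B, T, F) mixture frames
        pseudo = blind_separator(mix)        # pseudo ground-truth masks
        for _ in range(steps):
            pred = mask_dnn(mix)
            loss = torch.nn.functional.mse_loss(pred, pseudo)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return mask_dnn

# Toy usage with stand-ins for the DNN and the blind separator.
dnn = torch.nn.Sequential(torch.nn.Linear(257, 257), torch.nn.Sigmoid())
blind = lambda m: m.clamp(0, 1).detach()     # placeholder for FastMNMF output
batches = [torch.rand(2, 100, 257) for _ in range(3)]
runtime_adapt(dnn, batches, blind)
```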

--------------------------------------------------------------------------------------------------------

Offline Behavior Distillation

This paper introduces a novel approach to efficient reinforcement learning by distilling expert behavior from sub-optimal data. The research presents theoretical frameworks for behavior distillation and proposes an improved objective with linear discount complexity. This innovation could significantly impact the field of robotics and autonomous systems by enabling more efficient learning from existing datasets. The method's ability to work with low-reward offline data makes it particularly valuable for real-world applications where collecting optimal training data is challenging.

Authors:  Shiye Lei, Sen Zhang, Dacheng Tao

Link:  https://arxiv.org/abs/2410.22728v1

Date: 2024-10-30

Summary:

Massive reinforcement learning (RL) data are typically collected to train policies offline without the need for interactions, but the large data volume can cause training inefficiencies. To tackle this issue, we formulate offline behavior distillation (OBD), which synthesizes limited expert behavioral data from sub-optimal RL data, enabling rapid policy learning. We propose two naive OBD objectives, DBC and PBC, which measure distillation performance via the decision difference between policies trained on distilled data and either offline data or a near-expert policy. Due to intractable bi-level optimization, the OBD objective is difficult to minimize to small values, which deteriorates PBC by its distillation performance guarantee with quadratic discount complexity O(1/(1−γ)^2). We theoretically establish the equivalence between the policy performance and action-value weighted decision difference, and introduce action-value weighted PBC (Av-PBC) as a more effective OBD objective. By optimizing the weighted decision difference, Av-PBC achieves a superior distillation guarantee with linear discount complexity O(1/(1−γ)). Extensive experiments on multiple D4RL datasets reveal that Av-PBC offers significant improvements in OBD performance, fast distillation convergence speed, and robust cross-architecture/optimizer generalization.
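
The flavor of the Av-PBC objective can be conveyed with a short sketch: the decision difference against a near-expert policy is weighted by action values, so high-stakes states dominate the distillation signal. The weighting scheme below is a schematic assumption, not the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def av_pbc_loss(student_logits, expert_actions, expert_q):
    """Sketch of an action-value weighted decision-difference objective.

    The per-state decision difference (here, cross-entropy against the
    near-expert action) is weighted by the expert's action values, so states
    where the chosen action matters more contribute more to the loss.
    """
    per_state = F.cross_entropy(student_logits, expert_actions, reduction="none")
    w = expert_q - expert_q.min()           # nonnegative weights from Q-values
    w = w / (w.sum() + 1e-8)
    return (w * per_state).sum()

logits = torch.randn(16, 4, requires_grad=True)   # student policy on 16 states
actions = torch.randint(0, 4, (16,))              # near-expert actions
q = torch.rand(16)                                # expert action values
print(av_pbc_loss(logits, actions, q).item())
```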

--------------------------------------------------------------------------------------------------------

ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising

ContextIQ presents a sophisticated approach to contextual advertising through multimodal video analysis. The system employs multiple experts focusing on video, audio, transcript, and metadata to create rich video representations. This innovation addresses the growing need for privacy-compliant, context-aware advertising solutions in the expanding video content landscape. The system's ability to understand complex video content at a granular level could revolutionize digital advertising by enabling more relevant and engaging ad placements while maintaining brand safety.

Authors:  Ashutosh Chaubey, Anoubhav Agarwaal, Sartaki Sinha Roy, Aayush Agarwal, Susmita Ghose

Link:  https://arxiv.org/abs/2410.22233v1

Date: 2024-10-29

Summary:

Contextual advertising serves ads that are aligned to the content that the user is viewing. The rapid growth of video content on social platforms and streaming services, along with privacy concerns, has increased the need for contextual advertising. Placing the right ad in the right context creates a seamless and pleasant ad viewing experience, resulting in higher audience engagement and, ultimately, better ad monetization. From a technology standpoint, effective contextual advertising requires a video retrieval system capable of understanding complex video content at a very granular level. Current text-to-video retrieval models based on joint multimodal training demand large datasets and computational resources, limiting their practicality and lacking the key functionalities required for ad ecosystem integration. We introduce ContextIQ, a multimodal expert-based video retrieval system designed specifically for contextual advertising. ContextIQ utilizes modality-specific experts (video, audio, transcript (captions), and metadata such as objects, actions, and emotions) to create semantically rich video representations. We show that our system, without joint training, achieves better or comparable results to state-of-the-art models and commercial solutions on multiple text-to-video retrieval benchmarks. Our ablation studies highlight the benefits of leveraging multiple modalities for enhanced video retrieval accuracy instead of using a vision-language model alone. Furthermore, we show how video retrieval systems such as ContextIQ can be used for contextual advertising in an ad ecosystem while also addressing concerns related to brand safety and filtering inappropriate content.
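
Because the experts are not jointly trained, retrieval reduces to late fusion of per-expert similarity scores, as in the sketch below; the fusion weights and normalization are assumptions.

```python
import numpy as np

def retrieve(query_emb, expert_embs, weights=None):
    """Sketch of late fusion across modality-specific experts (no joint training).

    expert_embs maps expert name -> (n_videos, d) embeddings in a space shared
    with that expert's query encoder. Scores are cosine similarities, fused by
    a weighted sum; the fusion scheme here is an illustrative assumption.
    """
    weights = weights or {k: 1.0 for k in expert_embs}
    def cos(a, b):
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b) + 1e-8)
        return a @ b
    scores = sum(weights[k] * cos(v, query_emb[k]) for k, v in expert_embs.items())
    return np.argsort(-scores)        # video indices, best match first

# Toy usage: 5 videos, 3 experts (video, transcript, audio), 8-dim embeddings.
rng = np.random.default_rng(0)
experts = {k: rng.normal(size=(5, 8)) for k in ["video", "transcript", "audio"]}
query = {k: rng.normal(size=8) for k in experts}
print(retrieve(query, experts))
```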

--------------------------------------------------------------------------------------------------------

Data streaming platform for crowd-sourced vehicle dataset generation

This research presents an edge-cloud platform designed to facilitate the collection and processing of vehicle sensor data. The system addresses key challenges in Data Spaces, including data sovereignty, governance, and privacy, while enabling real-time data streaming from vehicles to development servers. This innovation could accelerate the development of advanced driver assistance systems and autonomous driving technologies by providing a robust infrastructure for collecting and analyzing real-world driving data. The platform's performance analysis with various connectivity technologies provides valuable insights for implementation.

Authors:  Felipe Mogollon, Zaloa Fernandez, Angel Martin, Juan Diego Ortega, Gorka Velez

Link:  https://arxiv.org/abs/2410.21934v1

Date: 2024-10-29

Summary:

Vehicles are sophisticated machines equipped with sensors that provide real-time data for onboard driving assistance systems. Due to the wide variety of traffic, road, and weather conditions, continuous system enhancements are essential. Connectivity allows vehicles to transmit previously unknown data, expanding datasets and accelerating the development of new data models. This enables faster identification and integration of novel data, improving system reliability and reducing time to market. Data Spaces aim to create a data-driven, interconnected, and innovative data economy, where edge and cloud infrastructures support a virtualised IoT platform that connects data sources and development servers. This paper proposes an edge-cloud data platform to connect car data producers with multiple and heterogeneous services, addressing key challenges in Data Spaces, such as data sovereignty, governance, interoperability, and privacy. The paper also evaluates the data platform's performance limits for text, image, and video data workloads, examines the impact of connectivity technologies, and assesses latencies. The results show that latencies drop to 33ms with 5G connectivity when pipelining data to consuming applications hosted at the edge, compared to around 77ms when crossing both edge and cloud processing infrastructures. The results offer guidance on the necessary processing assets to avoid bottlenecks in car data platforms.

--------------------------------------------------------------------------------------------------------

Knowledge-Guided Prompt Learning for Request Quality Assurance in Public Code Review

This paper introduces KP-PCR, a novel approach to improving code review request quality in public software development communities. By combining knowledge-guided prompt learning with code analysis, the system helps predict request necessity and recommend appropriate tags. This innovation could enhance the efficiency of public code review processes by improving request visibility and matching reviewers with appropriate expertise. The system's strong performance in both prediction and recommendation tasks suggests its potential to streamline software development workflows.

Authors:  Lin Li, Xinchun Yu, Xinyu Chen, Peng Liang

Link:  https://arxiv.org/abs/2410.21673v1

Date: 2024-10-29

Summary:

Public Code Review (PCR) is an assistant to the internal code review of the development team, in the form of a public Software Question Answering (SQA) community, to help developers access high-quality and efficient review services. Current methods on PCR mainly focus on the reviewer's perspective, including finding a capable reviewer, predicting comment quality, and recommending/generating review comments. However, how to satisfy the review necessity requests posted by developers is not well studied; satisfying them can increase request visibility, which in turn acts as a prerequisite for better review responses. To this end, we propose Knowledge-guided Prompt learning for Public Code Review (KP-PCR) to achieve developer-based code review request quality assurance (i.e., the request necessity prediction and tag recommendation subtasks). Specifically, we reformulate the two subtasks via 1) text prompt tuning, which converts both of them into a Masked Language Model (MLM) task by constructing prompt templates using hard prompts; 2) knowledge and code prefix tuning, which introduces external knowledge by soft prompts and uses data flow diagrams to characterize code snippets. Finally, both the request necessity prediction and tag recommendation subtasks output predicted results through an answer engineering module. In addition, we analyze the time complexity of KP-PCR, whose introduction of knowledge relies on a lightweight prefix-based operation. Experimental results on the PCR dataset for the period 2011-2023 demonstrate that our KP-PCR outperforms baselines by 8.3%-28.8% in the request necessity prediction and by 0.1%-29.5% in the tag recommendation. The code implementation is released at https://github.com/WUT-IDEA/KP-PCR.
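
The text-prompt-tuning step can be pictured with a toy hard-prompt template that recasts necessity prediction as masked language modelling; the wording and verbalizer mapping below are illustrative, not the paper's exact templates.

```python
# Illustrative hard-prompt template for the request-necessity subtask, cast as
# masked language modelling per the abstract. All wording is an assumption.
def necessity_prompt(request_title, request_body):
    return (
        f"Code review request: {request_title}. {request_body} "
        "Is a review necessary for this request? [MASK]."
    )

# An MLM scores verbalizer tokens at the [MASK] position; a mapping such as
# {"yes": necessary, "no": unnecessary} turns token scores into the prediction.
print(necessity_prompt(
    "Null check in parser",
    "Should I guard against empty input before tokenizing?",
))
```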

--------------------------------------------------------------------------------------------------------

Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers

This research addresses the challenge of online finetuning in Decision Transformers, particularly when pretrained with low-reward offline data. By incorporating TD3 gradients into the finetuning process, the study demonstrates improved performance in online adaptation. This advancement could significantly impact reinforcement learning applications, particularly in robotics and autonomous systems where adapting to new situations is crucial. The theoretical analysis provides valuable insights for future improvements in decision transformer architectures.

Authors:  Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

Link:  https://arxiv.org/abs/2410.24108v1

Date: 2024-10-31

Summary:

Decision Transformers have recently emerged as a new and compelling paradigm for offline Reinforcement Learning (RL), completing a trajectory in an autoregressive way. While improvements have been made to overcome initial shortcomings, online finetuning of decision transformers has been surprisingly under-explored. The widely adopted state-of-the-art Online Decision Transformer (ODT) still struggles when pretrained with low-reward offline data. In this paper, we theoretically analyze the online finetuning of the decision transformer, showing that the commonly used Return-To-Go (RTG), when far from the expected return, hampers the online finetuning process. This problem, however, is well-addressed by the value function and advantage of standard RL algorithms. As suggested by our analysis, in our experiments we find that simply adding TD3 gradients to the finetuning process of ODT effectively improves its online finetuning performance, especially if ODT is pretrained with low-reward offline data. These findings provide new directions to further improve decision transformers.
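
The recipe itself is simple to sketch: keep ODT's action loss and add a TD3-style actor term that pushes predicted actions toward higher critic values. The MSE simplification of ODT's objective and the weight below are assumptions for illustration.

```python
import torch
import torch.nn as nn

def combined_finetune_loss(action_pred, action_target, critic, states, w=0.1):
    """Sketch of adding a TD3-style actor term to ODT's action loss during
    online finetuning. ODT's own loss is simplified to MSE here, and the
    weighting w is an assumption; see the paper for the exact setup.
    """
    odt_term = nn.functional.mse_loss(action_pred, action_target)
    # TD3 actor gradient: maximize the critic at the predicted actions.
    td3_term = -critic(torch.cat([states, action_pred], dim=-1)).mean()
    return odt_term + w * td3_term

# Toy usage: 8 transitions, 4-dim states, 2-dim actions, a stand-in critic.
critic = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))
states = torch.randn(8, 4)
a_pred = torch.randn(8, 2, requires_grad=True)
a_tgt = torch.randn(8, 2)
loss = combined_finetune_loss(a_pred, a_tgt, critic, states)
loss.backward()
print(loss.item())
```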

--------------------------------------------------------------------------------------------------------

Estimating Causal Effects of Text Interventions Leveraging LLMs

CausalDANN introduces a novel approach to estimating the causal effects of textual interventions using large language models. This research addresses the challenge of understanding how text modifications impact social systems, particularly when direct interventions are impractical. The method's ability to handle arbitrary text interventions and adapt to domain shifts makes it valuable for social media analysis, content moderation, and policy development. This innovation could help researchers and practitioners better understand the impact of textual modifications on user behavior.

Authors:  Siyi Guo, Myrl G. Marmarelis, Fred Morstatter, Kristina Lerman

Link:  https://arxiv.org/abs/2410.21474v1

Date: 2024-10-28

Summary:

Quantifying the effect of textual interventions in social systems, such as reducing anger in social media posts to see its impact on engagement, poses significant challenges. Direct interventions on real-world systems are often infeasible, necessitating reliance on observational data. Traditional causal inference methods, typically designed for binary or discrete treatments, are inadequate for handling the complex, high-dimensional nature of textual data. This paper addresses these challenges by proposing a novel approach, CausalDANN, to estimate causal effects using text transformations facilitated by large language models (LLMs). Unlike existing methods, our approach accommodates arbitrary textual interventions and leverages text-level classifiers with domain adaptation ability to produce robust effect estimates against domain shifts, even when only the control group is observed. This flexibility in handling various text interventions is a key advancement in causal estimation for textual data, offering opportunities to better understand human behaviors and develop effective policies within social systems.
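
At a very high level, the pipeline can be sketched as: transform each control post with an LLM, score both versions with a (domain-adapted) outcome classifier, and take the mean difference. The stand-in functions below are placeholders for those components, not CausalDANN itself.

```python
import numpy as np

def estimate_text_effect(posts, transform, predict_outcome):
    """Estimate the effect of a text intervention as the mean difference in
    predicted outcomes between transformed and original posts, using only
    control-group data. A very high-level mirror of the abstract's setup.
    """
    y_control = np.array([predict_outcome(p) for p in posts])
    y_treated = np.array([predict_outcome(transform(p)) for p in posts])
    return (y_treated - y_control).mean()

# Toy stand-ins: the "intervention" tones down anger markers; the "classifier"
# is a length proxy. In the real system, both are learned models.
transform = lambda p: p.replace("!!!", ".").lower()   # e.g., LLM: "reduce anger"
predict = lambda p: float(len(p))                     # placeholder classifier
posts = ["THIS IS OUTRAGEOUS!!!", "Mildly annoyed today."]
print(estimate_text_effect(posts, transform, predict))
```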

--------------------------------------------------------------------------------------------------------

Fair Division with Market Values

This theoretical research explores the challenge of fairly dividing indivisible goods among agents while considering both subjective valuations and market values. The study provides important insights into the existence and impossibility of various fairness guarantees in this context. This work has practical applications in resource allocation problems, such as estate division, task assignment, and market design. The findings could influence the development of fair division algorithms that balance individual preferences with market considerations.

Authors:  Siddharth Barman, Soroush Ebadian, Mohamad Latifian, Nisarg Shah

Link:  https://arxiv.org/abs/2410.23137v1

Date: 2024-10-30

Summary:

We introduce a model of fair division with market values, where indivisible goods must be partitioned among agents with (additive) subjective valuations, and each good additionally has a market value. The market valuation can be viewed as a separate additive valuation that holds identically across all the agents. We seek allocations that are simultaneously fair with respect to the subjective valuations and with respect to the market valuation. We show that an allocation that satisfies stochastically-dominant envy-freeness up to one good (SD-EF1) with respect to both the subjective valuations and the market valuation does not always exist, but the weaker guarantee of EF1 with respect to the subjective valuations along with SD-EF1 with respect to the market valuation can be guaranteed. We also study a number of other guarantees such as Pareto optimality, EFX, and MMS. In addition, we explore non-additive valuations and extend our model to cake-cutting. Along the way, we identify several tantalizing open questions.
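
For reference, the EF1 guarantee at the center of these results can be stated compactly (SD-EF1 strengthens it by requiring the comparison to hold under every valuation consistent with the stochastic-dominance order):

```latex
% EF1: an allocation (A_1, \dots, A_n) is envy-free up to one good w.r.t.
% additive valuations v_i if every pairwise envy vanishes after removing
% some single good from the envied bundle (vacuously true if A_j is empty):
\forall i, j \ \text{with}\ A_j \neq \emptyset:\quad
\exists g \in A_j \ \text{such that}\quad
v_i(A_i) \;\ge\; v_i(A_j \setminus \{g\}).
```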

--------------------------------------------------------------------------------------------------------

Human-inspired Grasping Strategies of Fresh Fruits and Vegetables Applied to Robotic Manipulation

This research examines human grasping strategies for handling fresh fruits and vegetables to improve robotic manipulation capabilities. By analyzing human approaches and implementing them in a robotic system with a multi-fingered compliant gripper, the study provides valuable insights for industrial applications. This work could significantly impact the automation of food handling in logistics and retail sectors, where the manipulation of diverse, delicate items remains a significant challenge. The evaluation using industrial Key Performance Indicators ensures practical relevance.

Authors:  Romeo Orsolino, Mykhaylo Marfeychuk, Mariana de Paula Assis Fonseca, Mario Baggetta, Wesley Wimshurst, Francesco Porta, Morgan Clarke, Giovanni Berselli, Jelizaveta Konstantinova

Link:  https://arxiv.org/abs/2410.22893v1

Date: 2024-10-30

Summary:

Robotic manipulation of fresh fruits and vegetables, including the grasping of multiple loose items, has strong industrial demand but remains a challenging task. This paper outlines the distinctive manipulation strategies used by humans to pick loose fruits and vegetables, with the aim of better adapting them to robotic manipulation of diverse items. In this work we present a first version of a robotic setup designed to pick different single or multiple fresh items, featuring a multi-fingered compliant robotic gripper. We analyse human grasping strategies from the perspective of industrial Key Performance Indicators (KPIs) used in the logistics sector. The robotic system was validated using the same KPIs, as well as taking into account human performance and strategies. This paper lays the foundation for future development of the robotic demonstrator for intelligent manipulation of fresh fruits and vegetables, and outlines the need for generic approaches to handle the complexity of the task.

--------------------------------------------------------------------------------------------------------

EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents

EMOS introduces an innovative framework for managing heterogeneous multi-robot systems using large language models. The system's unique approach includes a "Robot Resume" feature that allows agents to understand and utilize their physical capabilities through URDF files and kinematics tools. This advancement could revolutionize multi-robot coordination in complex environments, particularly in scenarios requiring diverse robot capabilities. Applications could include warehouse automation, disaster response, and manufacturing environments where different types of robots need to collaborate effectively.

Authors:  Junting Chen, Checheng Yu, Xunzhe Zhou, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, Lin Shao

Link:  https://arxiv.org/abs/2410.22662v1

Date: 2024-10-30

Summary:

Heterogeneous multi-robot systems (HMRS) have emerged as a powerful approach for tackling complex tasks that single robots cannot manage alone. Current large-language-model-based multi-agent systems (LLM-based MAS) have shown success in areas like software development and operating systems, but applying these systems to robot control presents unique challenges. In particular, the capabilities of each agent in a multi-robot system are inherently tied to the physical composition of the robots, rather than predefined roles. To address this issue, we introduce a novel multi-agent framework designed to enable effective collaboration among heterogeneous robots with varying embodiments and capabilities, along with a new benchmark named Habitat-MAS. One of our key designs is Robot Resume: Instead of adopting human-designed role play, we propose a self-prompted approach, where agents comprehend robot URDF files and call robot kinematics tools to generate descriptions of their physics capabilities to guide their behavior in task planning and action execution. The Habitat-MAS benchmark is designed to assess how a multi-agent framework handles tasks that require embodiment-aware reasoning, which includes 1) manipulation, 2) perception, 3) navigation, and 4) comprehensive multi-floor object rearrangement. The experimental results indicate that the robot's resume and the hierarchical design of our multi-agent system are essential for the effective operation of the heterogeneous multi-robot system within this intricate problem context.
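
The URDF-parsing half of a Robot Resume is easy to sketch with the standard library; EMOS additionally calls kinematics tools, and the summary wording below is an assumption rather than the framework's actual format.

```python
import xml.etree.ElementTree as ET

def robot_resume(urdf_path):
    """Sketch of the 'Robot Resume' idea: read a URDF and produce a short
    natural-language capability description an LLM agent could self-prompt
    with. Covers only the URDF-parsing half of what EMOS describes.
    """
    root = ET.parse(urdf_path).getroot()
    name = root.get("name", "unnamed robot")
    links = [l.get("name") for l in root.findall("link")]
    joints = [(j.get("name"), j.get("type")) for j in root.findall("joint")]
    movable = [n for n, t in joints if t in ("revolute", "prismatic", "continuous")]
    return (
        f"Robot '{name}' has {len(links)} links and {len(joints)} joints, "
        f"{len(movable)} of them actuated ({', '.join(movable) or 'none'}). "
        "Use this to judge which manipulation or navigation subtasks it can take."
    )

# Usage (assuming a URDF file is available locally):
# print(robot_resume("fetch.urdf"))
```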

--------------------------------------------------------------------------------------------------------

BongLLaMA: LLaMA for Bangla Language

BongLLaMA addresses the significant gap in language model support for Bangla, the world's fifth most spoken language. By fine-tuning LLaMA on extensive Bangla corpora and instruction-tuning datasets, this research provides a valuable resource for the Bangla-speaking community. This development could significantly impact natural language processing applications for Bangla, including machine translation, content generation, and educational tools. The model's public availability makes it a valuable baseline for future research in Bangla language processing.

Authors:  Abdullah Khan Zehady, Safi Al Mamun, Naymul Islam, Santu Karmaker

Link:  https://arxiv.org/abs/2410.21200v1

Date: 2024-10-28

Summary:

Bangla (or "Bengali") is a language spoken by approximately 240 million native speakers and around 300 million people worldwide. Despite being the 5th largest spoken language in the world, Bangla is still a "low-resource" language, and existing pretrained language models often struggle to perform well on Bangla Language Processing (BLP) tasks. This work addresses this gap by introducing BongLLaMA (i.e., Bangla-LLaMA), an open-source large language model fine-tuned exclusively on large Bangla corpora and instruction-tuning datasets. We present our methodology, data augmentation techniques, fine-tuning details, and comprehensive benchmarking results showcasing the utility of BongLLaMA on BLP tasks. We believe BongLLaMA will serve as the new standard baseline for Bangla Language Models and, thus, facilitate future benchmarking studies focused on this widely-spoken yet "low-resource" language. All BongLLaMA models are available for public use at https://huggingface.co/BanglaLLM.
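
Since the models are public, usage follows the standard transformers pattern sketched below. The repository id is a placeholder, not a real checkpoint name; pick an actual one from the BanglaLLM page linked above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id: substitute a real checkpoint from https://huggingface.co/BanglaLLM
model_id = "BanglaLLM/<bongllama-checkpoint>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "বাংলাদেশের রাজধানী কোথায়?"   # "Where is the capital of Bangladesh?"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```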

--------------------------------------------------------------------------------------------------------

Belief in the Machine: Investigating Epistemological Blind Spots of Language Models

This comprehensive study examines the epistemological capabilities of modern language models, revealing important limitations in their understanding of truth, belief, and knowledge. Through the KaBLE dataset, the research identifies key challenges in how models handle false scenarios, personal beliefs, and the nature of knowledge. These findings have significant implications for the deployment of language models in critical sectors like healthcare, law, and journalism, where accurate epistemological reasoning is essential for reliable decision-making.

Authors:  Mirac Suzgun, Tayfun Gur, Federico Bianchi, Daniel E. Ho, Thomas Icard, Dan Jurafsky, James Zou

Link:  https://arxiv.org/abs/2410.21195v1

Date: 2024-10-28

Summary:

As language models (LMs) become integral to fields like healthcare, law, and journalism, their ability to differentiate between fact, belief, and knowledge is essential for reliable decision-making. Failure to grasp these distinctions can lead to significant consequences in areas such as medical diagnosis, legal judgments, and dissemination of fake news. Despite this, current literature has largely focused on more complex issues such as theory of mind, overlooking more fundamental epistemic challenges. This study systematically evaluates the epistemic reasoning capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE, consisting of 13,000 questions across 13 tasks. Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios, particularly in belief-related tasks. Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data, which raises concerns for applications in healthcare and counseling, where engaging with a person's beliefs is critical. Third, we identify a salient bias in how LMs process first-person versus third-person beliefs, performing better on third-person tasks (80.7%) compared to first-person tasks (54.4%). Fourth, LMs lack a robust understanding of the factive nature of knowledge, namely, that knowledge inherently requires truth. Fifth, LMs rely on linguistic cues for fact-checking and sometimes bypass the deeper reasoning. These findings highlight significant concerns about current LMs' ability to reason about truth, belief, and knowledge while emphasizing the need for advancements in these areas before broad deployment in critical sectors.

--------------------------------------------------------------------------------------------------------

DiaMond: Dementia Diagnosis with Multi-Modal Vision Transformers Using MRI and PET

DiaMond presents an innovative approach to dementia diagnosis by combining MRI and PET data using vision Transformers. The framework's novel bi-attention mechanism and multi-modal normalization enable effective integration of different imaging modalities, leading to improved diagnostic accuracy. This advancement could significantly impact clinical practice by enhancing the differential diagnosis of various forms of dementia, particularly Alzheimer's Disease and frontotemporal dementia. The robust performance across multiple datasets suggests strong potential for real-world clinical applications.

Authors:  Yitong Li, Morteza Ghahremani, Youssef Wally, Christian Wachinger

Link:  https://arxiv.org/abs/2410.23219v1

Date: 2024-10-30

Summary:

Diagnosing dementia, particularly for Alzheimer's Disease (AD) and frontotemporal dementia (FTD), is complex due to overlapping symptoms. While magnetic resonance imaging (MRI) and positron emission tomography (PET) data are critical for the diagnosis, integrating these modalities in deep learning faces challenges, often resulting in suboptimal performance compared to using single modalities. Moreover, the potential of multi-modal approaches in differential diagnosis, which holds significant clinical importance, remains largely unexplored. We propose a novel framework, DiaMond, to address these issues with vision Transformers to effectively integrate MRI and PET. DiaMond is equipped with self-attention and a novel bi-attention mechanism that synergistically combine MRI and PET, alongside a multi-modal normalization to reduce redundant dependency, thereby boosting the performance. DiaMond significantly outperforms existing multi-modal methods across various datasets, achieving a balanced accuracy of 92.4% in AD diagnosis, 65.2% for AD-MCI-CN classification, and 76.5% in differential diagnosis of AD and FTD. We also validated the robustness of DiaMond in a comprehensive ablation study. The code is available at https://github.com/ai-med/DiaMond.
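
The bi-attention idea can be sketched as two cross-attention passes, each modality querying the other, followed by normalization and fusion; DiaMond's actual block and multi-modal normalization are defined in the paper and released code.

```python
import torch
import torch.nn as nn

class BiAttention(nn.Module):
    """Sketch of a bi-directional cross-attention block fusing MRI and PET
    token sequences: each modality attends to the other and the results are
    combined. Illustrative of the general mechanism only, not DiaMond itself.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.mri_to_pet = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pet_to_mri = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mri_tok, pet_tok):
        a, _ = self.mri_to_pet(mri_tok, pet_tok, pet_tok)  # MRI queries PET
        b, _ = self.pet_to_mri(pet_tok, mri_tok, mri_tok)  # PET queries MRI
        return torch.cat([self.norm(mri_tok + a), self.norm(pet_tok + b)], dim=1)

mri = torch.randn(2, 16, 64)          # toy MRI patch tokens
pet = torch.randn(2, 16, 64)          # toy PET patch tokens
print(BiAttention()(mri, pet).shape)  # torch.Size([2, 32, 64])
```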

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.