Week Ending 12.15.2024
RESEARCH WATCH: 12.15.2024
Relational Neurosymbolic Markov Models
Sequential modeling is central to AI applications like reinforcement learning and NLP, but current approaches face a trade-off between performance and constraint satisfaction. While deep models like transformers excel at sequential tasks, they cannot guarantee that crucial safety constraints are satisfied. This paper introduces NeSy-MMs, combining the power of deep learning with logical constraints in a differentiable framework. This innovation could be particularly valuable in safety-critical applications where both performance and constraint satisfaction are essential, such as autonomous systems, medical diagnosis sequences, or financial trading algorithms.
Authors: Lennert De Smet, Gabriele Venturato, Luc De Raedt, Giuseppe Marra
Link: https://arxiv.org/abs/2412.13023v1
Date: 2024-12-17
Summary:
Sequential problems are ubiquitous in AI, such as in reinforcement learning or natural language processing. State-of-the-art deep sequential models, like transformers, excel in these settings but fail to guarantee the satisfaction of constraints necessary for trustworthy deployment. In contrast, neurosymbolic AI (NeSy) provides a sound formalism to enforce constraints in deep probabilistic models but scales exponentially on sequential problems. To overcome these limitations, we introduce relational neurosymbolic Markov models (NeSy-MMs), a new class of end-to-end differentiable sequential models that integrate and provably satisfy relational logical constraints. We propose a strategy for inference and learning that scales on sequential settings, and that combines approximate Bayesian inference, automated reasoning, and gradient estimation. Our experiments show that NeSy-MMs can solve problems beyond the current state-of-the-art in neurosymbolic AI and still provide strong guarantees with respect to desired properties. Moreover, we show that our models are more interpretable and that constraints can be adapted at test time to out-of-distribution scenarios.
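To make the constraint-satisfaction idea concrete, here is a minimal, hedged sketch of the general principle behind provable guarantees in neural sequential models: mask the learned next-state distribution with a (hypothetical) relational constraint so that forbidden transitions receive zero probability by construction. This is an illustration only, not the paper's actual inference and learning pipeline.

# Minimal sketch (not NeSy-MMs' exact construction): a neural transition model
# whose next-state distribution is masked by a hard constraint, so the
# constraint holds for every sampled trajectory by construction.
import torch
import torch.nn as nn

NUM_STATES = 6

def allowed(prev_state: int, next_state: int) -> bool:
    """Hypothetical relational constraint: never revisit the previous state
    and never enter the 'unsafe' state 0 from states 4 or 5."""
    if next_state == prev_state:
        return False
    if next_state == 0 and prev_state in (4, 5):
        return False
    return True

class ConstrainedMarkovModel(nn.Module):
    def __init__(self, num_states: int = NUM_STATES, hidden: int = 32):
        super().__init__()
        self.emb = nn.Embedding(num_states, hidden)
        self.logits = nn.Linear(hidden, num_states)  # neural transition scores

    def next_state_dist(self, prev_state: torch.Tensor) -> torch.Tensor:
        scores = self.logits(self.emb(prev_state))             # (batch, num_states)
        mask = torch.tensor([[allowed(int(p), s) for s in range(NUM_STATES)]
                             for p in prev_state])              # (batch, num_states)
        scores = scores.masked_fill(~mask, float("-inf"))       # forbid transitions
        return torch.softmax(scores, dim=-1)                    # zero mass on them

model = ConstrainedMarkovModel()
dist = model.next_state_dist(torch.tensor([4, 1]))
print(dist)  # the row for prev_state=4 puts exactly zero probability on states 0 and 4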
--------------------------------------------------------------------------------------------------------
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
As AI systems increasingly work with both visual and textual information, there's a growing need to evaluate how well they can reason across modalities. Current benchmarks only test text outputs, missing crucial visual reasoning capabilities. CoMT introduces a novel evaluation framework that requires models to demonstrate visual operations like creation, deletion, updates, and selection. This benchmark could be particularly valuable for developing and testing AI systems in fields like medical imaging, architectural design, or visual instruction following, where both understanding and manipulating visual information is crucial.
Authors: Zihui Cheng, Qiguang Chen, Jin Zhang, Hao Fei, Xiaocheng Feng, Wanxiang Che, Min Li, Libo Qin
Link: https://arxiv.org/abs/2412.12932v1
Date: 2024-12-17
Summary:
Large Vision-Language Models (LVLMs) have recently demonstrated amazing success in multi-modal tasks, including advancements in Multi-modal Chain-of-Thought (MCoT) reasoning. Despite these successes, current benchmarks still follow a traditional paradigm with multi-modal input and text-modal output, which leads to significant drawbacks such as missing visual operations and vague expressions. Motivated by this, we introduce a novel Chain of Multi-modal Thought (CoMT) benchmark to address these limitations. Different from the traditional MCoT benchmark, CoMT requires both multi-modal input and multi-modal reasoning output, aiming to mimic human-like reasoning that inherently integrates visual operation. Specifically, CoMT consists of four categories: (1) Visual Creation, (2) Visual Deletion, (3) Visual Update, and (4) Visual Selection to comprehensively explore complex visual operations and concise expression in real scenarios. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches. We hope that CoMT can inspire more research on introducing multi-modal generation into the reasoning process.
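To illustrate what "multi-modal reasoning output" means in practice, the toy item below interleaves images and text in both the input and the reasoning chain. The field names and file paths are invented for illustration and are not CoMT's actual schema.

# Hedged illustration only: a hypothetical CoMT-style item in which both the
# input and the reasoning chain mix images and text.
example_item = {
    "category": "Visual Update",                       # one of the four CoMT categories
    "input": [
        {"type": "image", "path": "floorplan.png"},
        {"type": "text",  "content": "Move the sofa to the opposite wall. "
                                     "Show the updated layout and explain the change."},
    ],
    "reasoning_chain": [
        {"type": "text",  "content": "Locate the sofa along the north wall."},
        {"type": "image", "path": "step1_sofa_highlighted.png"},
        {"type": "text",  "content": "Redraw the room with the sofa on the south wall."},
        {"type": "image", "path": "step2_updated_floorplan.png"},
    ],
    "answer": {"type": "image", "path": "final_layout.png"},
}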
--------------------------------------------------------------------------------------------------------
Differential Alignment for Domain Adaptive Object Detection
Object detection systems often struggle when deployed in new environments different from their training data. This research tackles this challenge by introducing a novel approach to domain adaptation that considers the varying importance of different image regions and object instances. By using prediction discrepancy and uncertainty-based alignment, the system can better adapt to new domains without requiring additional labeling. This advancement could be particularly valuable in real-world applications like surveillance systems, autonomous vehicles, or medical imaging where deployment conditions often differ from training data.
Authors: Xinyu He, Xinhui Li, Xiaojie Guo
Link: https://arxiv.org/abs/2412.12830v1
Date: 2024-12-17
Summary:
Domain adaptive object detection (DAOD) aims to generalize an object detector trained on labeled source-domain data to a target domain without annotations, the core principle of which is source-target feature alignment. Typically, existing approaches employ adversarial learning to align the distributions of the source and target domains as a whole, barely considering the varying significance of distinct regions, say instances under different circumstances and foreground vs. background areas, during feature alignment. To overcome this shortcoming, we investigate a differential feature alignment strategy. Specifically, a prediction-discrepancy feedback instance alignment module (dubbed PDFA) is designed to adaptively assign higher weights to instances of higher teacher-student detection discrepancy, effectively handling heavier domain-specific information. Additionally, an uncertainty-based foreground-oriented image alignment module (UFOA) is proposed to explicitly guide the model to focus more on regions of interest. Extensive experiments on widely-used DAOD datasets together with ablation studies are conducted to demonstrate the efficacy of our proposed method and reveal its superiority over other SOTA alternatives. Our code is available at https://github.com/EstrellaXyu/Differential-Alignment-for-DAOD.
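A rough sketch of the intuition behind prediction-discrepancy feedback follows: weight each instance's domain-alignment loss by how much the teacher and student detectors disagree on it, so harder, more domain-specific instances count more. The exact PDFA module and weighting function are the paper's own; this is only an illustrative approximation.

# Hedged sketch: per-instance weights from teacher-student disagreement,
# applied to an adversarial domain-classification loss.
import torch
import torch.nn.functional as F

def discrepancy_weights(student_probs: torch.Tensor,
                        teacher_probs: torch.Tensor) -> torch.Tensor:
    """Per-instance weight from symmetric KL divergence between the teacher and
    student class distributions; larger disagreement -> larger weight."""
    d1 = F.kl_div(student_probs.log(), teacher_probs, reduction="none").sum(-1)
    d2 = F.kl_div(teacher_probs.log(), student_probs, reduction="none").sum(-1)
    d = 0.5 * (d1 + d2)
    return 1.0 + d / (d.mean() + 1e-8)   # normalized, always >= ~1

def weighted_instance_alignment_loss(domain_logits: torch.Tensor,
                                     domain_labels: torch.Tensor,
                                     weights: torch.Tensor) -> torch.Tensor:
    """Per-instance domain-classification loss, re-weighted by discrepancy."""
    per_instance = F.binary_cross_entropy_with_logits(
        domain_logits, domain_labels, reduction="none")
    return (weights * per_instance).mean()

# toy usage with random stand-ins for detector outputs
student = torch.softmax(torch.randn(8, 5), dim=-1)
teacher = torch.softmax(torch.randn(8, 5), dim=-1)
w = discrepancy_weights(student, teacher)
loss = weighted_instance_alignment_loss(torch.randn(8), torch.rand(8).round(), w)
print(w.shape, loss.item())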
--------------------------------------------------------------------------------------------------------
Lagrangian Index Policy for Restless Bandits with Average Reward
Multi-armed bandit problems are fundamental to many real-world decision-making scenarios, from web optimization to resource allocation. This paper presents the Lagrangian Index Policy (LIP) as an alternative to the traditional Whittle Index Policy (WIP) for restless bandits. The approach shows particular promise in cases where WIP performs poorly, and requires less memory for implementation. The method could be particularly valuable in applications like web crawling and minimizing the age of information in communication systems, where optimal resource allocation is crucial.
Authors: Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah
Link: https://arxiv.org/abs/2412.12641v1
Date: 2024-12-17
Summary:
We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with the performance of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Even though in most cases their performances are very similar, in the cases when WIP shows bad performance, LIP continues to perform very well. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous scheme for WIP. We calculate analytically the Lagrangian index for the restart model, which describes optimal web crawling and the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in the case of homogeneous bandits as the number of arms goes to infinity, based on exchangeability and de Finetti's theorem.
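Whatever index is used, Whittle or Lagrangian, the run-time policy is the same: score every arm's current state with its index and activate the arms with the highest scores. The sketch below shows only that step; computing the Lagrangian indices themselves (analytically for the restart model, or via the paper's RL schemes) is the substantive part of the work.

# Hedged sketch of executing any index policy under a per-step activation budget.
import numpy as np

def index_policy_step(indices_per_arm: np.ndarray, budget_m: int) -> np.ndarray:
    """indices_per_arm[i] = index of arm i in its current state.
    Returns a 0/1 activation vector with exactly budget_m ones."""
    action = np.zeros_like(indices_per_arm, dtype=int)
    top = np.argsort(indices_per_arm)[::-1][:budget_m]   # highest-index arms
    action[top] = 1
    return action

# toy usage: 10 arms, activate 3 per step
rng = np.random.default_rng(0)
indices = rng.normal(size=10)
print(index_policy_step(indices, budget_m=3))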
--------------------------------------------------------------------------------------------------------
Smoothness Really Matters: A Simple yet Effective Approach for Unsupervised Graph Domain Adaptation
Graph Neural Networks (GNNs) often struggle when applied to new domains with different structural characteristics. This paper introduces TDSS, a method that smooths the structure of target domain graphs to better align with source domains, while preserving essential properties. This innovative approach could be particularly valuable in applications like drug discovery, where molecular graphs from different sources need to be analyzed, or in social network analysis where network structures vary across platforms. The method's simplicity and effectiveness make it particularly attractive for practical applications.
Authors: Wei Chen, Guo Ye, Yakun Wang, Zhao Zhang, Libang Zhang, Daxin Wang, Zhiqiang Zhang, Fuzhen Zhuang
Link: https://arxiv.org/abs/2412.11654v1
Date: 2024-12-16
Summary:
Unsupervised Graph Domain Adaptation (UGDA) seeks to bridge distribution shifts between domains by transferring knowledge from labeled source graphs to given unlabeled target graphs. Existing UGDA methods primarily focus on aligning features in the latent space learned by graph neural networks (GNNs) across domains, often overlooking structural shifts, resulting in limited effectiveness when addressing structurally complex transfer scenarios. Given the sensitivity of GNNs to local structural features, even slight discrepancies between source and target graphs could lead to significant shifts in node embeddings, thereby reducing the effectiveness of knowledge transfer. To address this issue, we introduce a novel approach for UGDA called Target-Domain Structural Smoothing (TDSS). TDSS is a simple and effective method designed to perform structural smoothing directly on the target graph, thereby mitigating structural distribution shifts and ensuring the consistency of node representations. Specifically, by integrating smoothing techniques with neighborhood sampling, TDSS maintains the structural coherence of the target graph while mitigating the risk of over-smoothing. Our theoretical analysis shows that TDSS effectively reduces target risk by improving model smoothness. Empirical results on three real-world datasets demonstrate that TDSS outperforms recent state-of-the-art baselines, achieving significant improvements across six transfer scenarios. The code is available at https://github.com/cwei01/TDSS.
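A minimal sketch of the smoothing idea, under the assumption that it amounts to mixing each target node's features with the mean of a small sampled neighborhood; the paper defines the actual operator and its theoretical guarantees.

# Hedged sketch of target-side structural smoothing with neighborhood sampling.
import numpy as np

def smooth_target_features(x: np.ndarray, adj_list: list,
                           alpha: float = 0.3, k: int = 5,
                           seed: int = 0) -> np.ndarray:
    """x: (num_nodes, dim) node features; adj_list[i]: list of neighbors of node i."""
    rng = np.random.default_rng(seed)
    out = x.copy()
    for i, neigh in enumerate(adj_list):
        if not neigh:
            continue
        sampled = rng.choice(neigh, size=min(k, len(neigh)), replace=False)
        # mix the node's own features with the mean of its sampled neighbors
        out[i] = (1 - alpha) * x[i] + alpha * x[sampled].mean(axis=0)
    return out

# toy usage on a 4-node path graph
x = np.random.randn(4, 8)
adj = [[1], [0, 2], [1, 3], [2]]
x_smooth = smooth_target_features(x, adj)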
--------------------------------------------------------------------------------------------------------
ACE-M³: Automatic Capability Evaluator for Multimodal Medical Models
As multimodal AI models increasingly enter healthcare, there's a crucial need for reliable evaluation methods. This paper introduces an automated evaluator specifically designed for medical AI systems that work with both images and text. The system uses a branch-merge architecture and novel optimization strategy to assess medical AI models efficiently. This tool could be particularly valuable for healthcare institutions and medical AI developers who need to validate their systems' performance before deployment, ensuring both accuracy and reliability in clinical settings.
Authors: Xiechi Zhang, Shunfan Zheng, Linlin Wang, Gerard de Melo, Zhu Cao, Xiaoling Wang, Liang He
Link: https://arxiv.org/abs/2412.11453v1
Date: 2024-12-16
Summary:
As multimodal large language models (MLLMs) gain prominence in the medical field, the need for precise evaluation methods to assess their effectiveness has become critical. While benchmarks provide a reliable means to evaluate the capabilities of MLLMs, traditional metrics like ROUGE and BLEU employed for open domain evaluation only focus on token overlap and may not align with human judgment. Although human evaluation is more reliable, it is labor-intensive, costly, and not scalable. LLM-based evaluation methods have proven promising, but to date, there is still an urgent need for open-source multimodal LLM-based evaluators in the medical field. To address this issue, we introduce ACE-M³, an open-sourced Automatic Capability Evaluator for Multimodal Medical Models specifically designed to assess the question answering abilities of medical MLLMs. It first utilizes a branch-merge architecture to provide both detailed analysis and a concise final score based on standard medical evaluation criteria. Subsequently, a reward token-based direct preference optimization (RTDPO) strategy is incorporated to save training time without compromising performance of our model. Extensive experiments have demonstrated the effectiveness of our ACE-M³ model (https://huggingface.co/collections/AIUSRTMP/ace-m3-67593297ff391b93e3e5d068) in evaluating the capabilities of medical MLLMs.
--------------------------------------------------------------------------------------------------------
Segment-Level Diffusion (SLD): A Framework for Long-Form Text Generation
Generating long, coherent text has been a persistent challenge for diffusion-based language models. Current approaches either focus too narrowly on token-level patterns or struggle with long passages. This paper introduces SLD, a framework that breaks down long text generation into manageable segments while maintaining coherence through robust representation training and improved guidance. This approach could be particularly valuable for applications requiring long-form content generation, such as article writing, story generation, or dialogue summarization, where maintaining consistency and context over extended passages is crucial.
Authors: Xiaochen Zhu, Georgi Karadzhov, Chenxi Whitehouse, Andreas Vlachos
Link: https://arxiv.org/abs/2412.11333v1
Date: 2024-12-15
Summary:
Diffusion models have shown promise in text generation but often struggle with generating long, coherent, and contextually accurate text. Token-level diffusion overlooks word-order dependencies and enforces short output windows, while passage-level diffusion struggles with learning robust representation for long-form text. To address these challenges, we propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance. By segmenting long-form outputs into separate latent representations and decoding them with an autoregressive decoder, SLD simplifies diffusion predictions and improves scalability. Experiments on XSum, ROCStories, DialogSum, and DeliData demonstrate that SLD achieves competitive or superior performance in fluency, coherence, and contextual compatibility across automatic and human evaluation metrics compared with other diffusion and autoregressive baselines. Ablation studies further validate the effectiveness of our segmentation and representation learning strategies.
--------------------------------------------------------------------------------------------------------
ProFe: Communication-Efficient Decentralized Federated Learning via Distillation and Prototypes
Decentralized Federated Learning faces significant challenges in managing communication costs and model aggregation, especially when dealing with diverse data distributions. This paper presents ProFe, combining knowledge distillation, prototype learning, and quantization to optimize communication in federated learning environments. By reducing communication costs while maintaining model performance, this approach could be particularly valuable in scenarios with limited bandwidth or privacy concerns, such as healthcare networks, IoT deployments, or distributed enterprise systems where efficient, private collaboration is essential.
Authors: Pedro Miguel Sánchez Sánchez, Enrique Tomás Martínez Beltrán, Miguel Fernández Llamas, Gérôme Bovet, Gregorio Martínez Pérez, Alberto Huertas Celdrán
Link: https://arxiv.org/abs/2412.11207v1
Date: 2024-12-15
Summary:
Decentralized Federated Learning (DFL) trains models in a collaborative and privacy-preserving manner while removing model centralization risks and improving communication bottlenecks. However, DFL faces challenges in efficient communication management and model aggregation within decentralized environments, especially with heterogeneous data distributions. Thus, this paper introduces ProFe, a novel communication optimization algorithm for DFL that combines knowledge distillation, prototype learning, and quantization techniques. ProFe utilizes knowledge from large local models to train smaller ones for aggregation, incorporates prototypes to better learn unseen classes, and applies quantization to reduce data transmitted during communication rounds. The performance of ProFe has been validated and compared to the literature by using benchmark datasets like MNIST, CIFAR10, and CIFAR100. Results showed that the proposed algorithm reduces communication costs by up to roughly 40-50% while maintaining or improving model performance. In addition, it adds roughly 20% training time due to the increased complexity, introducing a trade-off.
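Two of ProFe's ingredients, prototypes and quantization, are easy to illustrate: each client can share compact, quantized per-class prototype embeddings rather than full model weights. The sketch below is a hedged illustration of that communication saving only; the distillation component and the actual ProFe protocol are not reproduced.

# Hedged sketch: per-class prototypes, quantized to 8 bits before transmission.
import numpy as np

def class_prototypes(embeddings: np.ndarray, labels: np.ndarray,
                     num_classes: int) -> np.ndarray:
    """Mean embedding per class; zeros for classes unseen by this client."""
    dim = embeddings.shape[1]
    protos = np.zeros((num_classes, dim), dtype=np.float32)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = embeddings[mask].mean(axis=0)
    return protos

def quantize_uint8(x: np.ndarray):
    """Uniform 8-bit quantization; returns payload plus scale/offset to dequantize."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q: np.ndarray, scale: float, lo: float) -> np.ndarray:
    return q.astype(np.float32) * scale + lo

# toy usage: 100 local samples, 16-dim embeddings, 10 classes
emb = np.random.randn(100, 16).astype(np.float32)
lab = np.random.randint(0, 10, size=100)
q, s, lo = quantize_uint8(class_prototypes(emb, lab, num_classes=10))
print(q.nbytes, "bytes sent instead of", emb.nbytes)   # 160 vs 6400 bytes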
--------------------------------------------------------------------------------------------------------
Predicting Survival of Hemodialysis Patients using Federated Learning
Accurate survival prediction for hemodialysis patients is crucial for optimizing kidney transplant waiting lists, but sensitive medical data is often siloed across different healthcare centers. This paper explores using Federated Learning to combine insights from multiple centers without sharing sensitive patient data. This approach could be particularly valuable for healthcare providers and transplant centers, enabling them to make more informed decisions about patient prioritization while maintaining privacy and regulatory compliance. The study specifically focuses on data from NephroPlus, India's largest private network of dialysis centers.
Authors: Abhiram Raju, Praneeth Vepakomma
Link: https://arxiv.org/abs/2412.10919v1
Date: 2024-12-14
Summary:
Hemodialysis patients who are on donor lists for kidney transplant may get misidentified, delaying their wait time. Thus, predicting their survival time is crucial for optimizing waiting lists and personalizing treatment plans. Predicting survival times for patients often requires large quantities of high-quality but sensitive data. This data is siloed, and since individual datasets are smaller and less diverse, locally trained survival models do not perform as well as centralized ones. Hence, we propose the use of Federated Learning in the context of predicting survival for hemodialysis patients. Federated Learning (FL) can achieve comparatively better performance than local models while not sharing data between centers. However, despite the increased use of such technologies, the application of FL to survival analysis, and even more so to dialysis patients, remains sparse. This paper studies the performance of FL for data of hemodialysis patients from NephroPlus, the largest private network of dialysis centers in India.
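For readers unfamiliar with the mechanics, the sketch below shows the basic federated averaging step such a study builds on: each center trains locally, and only model parameters, never patient records, leave the site. The paper's actual survival model and FL configuration are its own.

# Hedged sketch of federated averaging across dialysis centers.
import numpy as np

def federated_average(client_weights: list, client_sizes: list) -> dict:
    """client_weights[k]: {param_name: np.ndarray} from center k;
    client_sizes[k]: number of local patients used for weighting."""
    total = sum(client_sizes)
    avg = {}
    for name in client_weights[0]:
        avg[name] = sum(w[name] * (n / total)
                        for w, n in zip(client_weights, client_sizes))
    return avg

# toy usage: three centers with a tiny linear survival-risk model
params = lambda: {"coef": np.random.randn(12), "bias": np.random.randn(1)}
global_params = federated_average([params(), params(), params()], [520, 310, 175])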
--------------------------------------------------------------------------------------------------------
Spurious Isospin Breaking in the In-medium Similarity Renormalization Group
In the context of nuclear physics, accurate calculation of theoretical corrections to superallowed beta decay rates is crucial for testing fundamental physics principles. This paper identifies and addresses artificial isospin symmetry breaking introduced by computational frameworks. By providing remedies for these spurious effects, the research contributes to more accurate predictions of nuclear properties. This work could be particularly valuable for nuclear physics experiments and theoretical calculations where precise understanding of isospin symmetry breaking is essential for testing fundamental physics theories.
Authors: A. Farren, S. R. Stroberg
Link: https://arxiv.org/abs/2412.10693v1
Date: 2024-12-14
Summary:
Robustly quantifying the uncertainty in the isospin-related theoretical correction δ_C to superallowed beta decay rates is vital for a correct assessment of CKM unitarity. To this end, we identify the sources of artificial or spurious isospin symmetry breaking (ISB) introduced by the IMSRG many-body framework at a computational level and provide remedies. We test our best policy for preventing spurious ISB by evaluating δ_C.
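For context, following the standard convention of the superallowed beta-decay literature (this framing is ours, not restated in the abstract): δ_C quantifies how much isospin-symmetry breaking reduces the Fermi matrix element, and it enters the corrected Ft value from which V_ud, and hence CKM unitarity, is tested,

    |M_F|^2 = |M_F^0|^2 (1 - \delta_C),
    \mathcal{F}t = ft\,(1 + \delta_R')(1 - \delta_C + \delta_{NS}),

so any spurious, purely computational ISB generated inside the IMSRG evolution biases δ_C and propagates directly into that test.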
--------------------------------------------------------------------------------------------------------
Client-Side Patching against Backdoor Attacks in Federated Learning
Federated learning systems are vulnerable to backdoor attacks from malicious participants, especially in scenarios with heterogeneous data distributions. This paper proposes a novel defense mechanism using adversarial learning and model patching on the client side. By effectively reducing backdoor accuracy while maintaining performance on clean data, this approach could be particularly valuable for organizations implementing federated learning in security-sensitive applications, such as healthcare, financial services, or collaborative research where protecting against malicious participants is crucial.
Authors: Borja Molina Coronado
Link: https://arxiv.org/abs/2412.10605v1
Date: 2024-12-13
Summary:
Federated learning is a versatile framework for training models in decentralized environments. However, the trust placed in clients makes federated learning vulnerable to backdoor attacks launched by malicious participants. While many defenses have been proposed, they often fall short when facing heterogeneous data distributions among participating clients. In this paper, we propose a novel defense mechanism for federated learning systems designed to mitigate backdoor attacks on the client side. Our approach leverages adversarial learning techniques and model patching to neutralize the impact of backdoor attacks. Through extensive experiments on the MNIST and Fashion-MNIST datasets, we demonstrate that our defense effectively reduces backdoor accuracy, outperforming existing state-of-the-art defenses, such as LFighter, FLAME, and RoseAgg, in i.i.d. and non-i.i.d. scenarios, while maintaining competitive or superior accuracy on clean data.
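The abstract does not spell out the patching procedure, so the sketch below shows one common client-side recipe in the same spirit, offered as an assumption-laden illustration rather than the paper's method: craft adversarial perturbations of clean local data against the received global model and fine-tune the model to keep the correct labels on them, which tends to weaken trigger-dependent shortcuts.

# Hedged sketch of generic client-side patching with adversarial fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def fgsm_perturb(model, x, y, eps=0.05):
    """One-step adversarial example against the received global model
    (inputs assumed to lie in [0, 1])."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).detach().clamp(0, 1)

def patch_client_side(model, loader, epochs=1, lr=1e-3, eps=0.05):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x_adv = fgsm_perturb(model, x, y, eps)
            opt.zero_grad()
            # keep correct behaviour on both clean and perturbed inputs
            loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
            loss.backward()
            opt.step()
    return model

# toy usage on random data standing in for a client's local set
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
data = TensorDataset(torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,)))
patch_client_side(net, DataLoader(data, batch_size=16))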
--------------------------------------------------------------------------------------------------------
Label-template based Few-Shot Text Classification with Contrastive Learning
Few-shot text classification remains challenging, particularly in leveraging limited labeled examples effectively. This paper proposes a framework that better utilizes class label information through label templates and contrastive learning. The approach could be particularly valuable in scenarios where obtaining labeled data is expensive or time-consuming, such as specialized document classification, sentiment analysis in new domains, or content moderation for emerging categories, where systems need to learn from just a few examples while maintaining high accuracy.
Authors: Guanghua Hou, Shuhui Cao, Deqiang Ouyang, Ning Wang
Link: https://arxiv.org/abs/2412.10110v1
Date: 2024-12-13
Summary:
As an algorithmic framework for learning to learn, meta-learning provides a promising solution for few-shot text classification. However, most existing research fails to give enough attention to class labels. The traditional framework, which builds the meta-learner on prototype networks, relies heavily on inter-class variance and is easily influenced by noise. To address these limitations, we propose a simple and effective few-shot text classification framework. In particular, the corresponding label templates are embedded into input sentences to fully utilize the potential value of class labels, guiding the pre-trained model to generate more discriminative text representations through the semantic information conveyed by labels. With the continuous influence of label semantics, supervised contrastive learning is utilized to model the interaction information between support samples and query samples. Furthermore, the averaging mechanism is replaced with an attention mechanism to highlight vital semantic information. To verify the proposed scheme, four typical datasets are employed to assess the performance of different methods. Experimental results demonstrate that our method achieves substantial performance enhancements and outperforms existing state-of-the-art models on few-shot text classification tasks.
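Two of the described ingredients, template-augmented inputs and supervised contrastive learning, can be sketched compactly. The template wording below is hypothetical, and the paper's attention pooling and episodic few-shot setup are not reproduced.

# Hedged sketch: label-template input construction plus a supervised contrastive loss.
import torch
import torch.nn.functional as F

def add_label_template(text: str, label_name: str) -> str:
    """Prepend a simple hypothetical label template to inject label semantics."""
    return f"This sentence is about {label_name}. {text}"

def supervised_contrastive_loss(z: torch.Tensor, y: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """z: (n, d) embeddings, y: (n,) labels. Same-class pairs are pulled together."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / temperature                          # (n, n) similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_counts
    return loss[pos_mask.any(1)].mean()                    # skip anchors with no positive

# toy usage: 6 support sentences from 2 classes, random stand-in embeddings
emb = torch.randn(6, 32)
labels = torch.tensor([0, 0, 0, 1, 1, 1])
print(add_label_template("The movie dragged on forever.", "negative sentiment"))
print(supervised_contrastive_loss(emb, labels))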
--------------------------------------------------------------------------------------------------------
Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning
Complex reasoning remains a challenge for Large Language Models (LLMs). While existing methods like Chain-of-Thought help, they typically perform single-pass reasoning without revisiting potential errors. This paper introduces Forest-of-Thought, integrating multiple reasoning trees with sparse activation and dynamic self-correction strategies. This approach could be particularly valuable in applications requiring complex logical reasoning, such as mathematical problem-solving, legal analysis, or scientific research, where the ability to explore multiple reasoning paths and correct errors is crucial.
Authors: Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, Yunhe Wang
Link: https://arxiv.org/abs/2412.09078v1
Date: 2024-12-12
Summary:
Large Language Models (LLMs) have shown remarkable abilities across various language tasks, but solving complex reasoning problems remains a challenge. While existing methods like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) enhance reasoning by decomposing problems or structuring prompts, they typically perform a single pass of reasoning and may fail to revisit flawed paths, compromising accuracy. To address this, we propose a novel reasoning framework called Forest-of-Thought (FoT), which integrates multiple reasoning trees to leverage collective decision-making for solving complex logical problems. FoT utilizes sparse activation strategies to select the most relevant reasoning paths, improving both efficiency and accuracy. Additionally, we introduce a dynamic self-correction strategy that enables real-time error correction and learning from past mistakes, as well as consensus-guided decision making strategies to optimize correctness and computational resources. Experimental results demonstrate that the FoT framework, combined with these strategies, significantly enhances the reasoning capabilities of LLMs, enabling them to solve complex tasks with greater precision and efficiency.
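Below is a heavily stubbed sketch of the control flow the abstract suggests (multiple reasoning trees, sparse activation, consensus voting, with a retry standing in for dynamic self-correction); the real tree search, activation criteria, and correction strategies are considerably richer.

# Hedged sketch of a forest-of-thought style controller with stubbed tree solvers.
from collections import Counter
import random

def run_reasoning_tree(question: str, seed: int):
    """Stub for one tree-of-thought style solver; returns (answer, confidence)."""
    random.seed(seed)
    answer = random.choice(["42", "42", "41"])   # placeholder for a real LLM tree search
    return answer, random.uniform(0.5, 1.0)

def forest_of_thought(question: str, num_trees: int = 8,
                      active_trees: int = 4, threshold: float = 0.5) -> str:
    # sparse activation: only keep the most confident subset of trees
    results = [run_reasoning_tree(question, seed=s) for s in range(num_trees)]
    results.sort(key=lambda r: r[1], reverse=True)
    answers = [a for a, _ in results[:active_trees]]
    # consensus-guided decision: accept the majority answer if it is strong enough
    answer, count = Counter(answers).most_common(1)[0]
    if count / active_trees >= threshold:
        return answer
    # otherwise expand the forest and try again (a stand-in for self-correction)
    return forest_of_thought(question, num_trees * 2, active_trees * 2, threshold)

print(forest_of_thought("What is 6 * 7?"))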
--------------------------------------------------------------------------------------------------------
Large Concept Models: Language Modeling in a Sentence Representation Space
Current Language Models operate at the token level, unlike humans who think in higher-level concepts. This paper introduces an architecture operating on language- and modality-agnostic "concepts," initially implemented as sentence-level representations. Using the SONAR embedding space, the model supports multiple languages and modalities. This approach could be particularly valuable for applications requiring cross-lingual understanding, summarization, or content expansion, especially in multilingual environments where traditional token-based approaches may struggle with semantic coherence across languages.
Authors: LCM team, Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk
Link: https://arxiv.org/abs/2412.08821v2
Date: 2024-12-15
Summary:
LLMs have revolutionized the field of artificial intelligence and have emerged as the de-facto tool for many tasks. The current established technology of LLMs is to process input and generate output at the token level. This is in sharp contrast to humans who operate at multiple levels of abstraction, well beyond single words, to analyze information and to generate creative content. In this paper, we present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. Hence, we build a "Large Concept Model". In this study, as proof of feasibility, we assume that a concept corresponds to a sentence, and use an existing sentence embedding space, SONAR, which supports up to 200 languages in both text and speech modalities. The Large Concept Model is trained to perform autoregressive sentence prediction in an embedding space. We explore multiple approaches, namely MSE regression, variants of diffusion-based generation, and models operating in a quantized SONAR space. These explorations are performed using 1.6B parameter models and training data in the order of 1.3T tokens. We then scale one architecture to a model size of 7B parameters and training data of about 2.7T tokens. We perform an experimental evaluation on several generative tasks, namely summarization and a new task of summary expansion. Finally, we show that our model exhibits impressive zero-shot generalization performance to many languages, outperforming existing LLMs of the same size. The training code of our models is freely available.
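The simplest variant mentioned, MSE regression, can be sketched as a causal transformer trained to predict the next sentence embedding from the preceding ones. Random vectors stand in for SONAR embeddings here; dimensions and architectural details are illustrative assumptions, not the paper's configuration.

# Hedged sketch: autoregressive next-sentence-embedding prediction with MSE loss.
import torch
import torch.nn as nn

class NextConceptRegressor(nn.Module):
    def __init__(self, dim: int = 1024, layers: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # causal mask so position t only attends to sentences <= t
        t = sent_embs.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.head(self.backbone(sent_embs, mask=mask))

model = NextConceptRegressor()
doc = torch.randn(2, 7, 1024)          # stand-in for SONAR sentence embeddings
pred = model(doc[:, :-1])              # predict the embedding of sentence t+1
loss = nn.functional.mse_loss(pred, doc[:, 1:])
loss.backward()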
--------------------------------------------------------------------------------------------------------
TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning
The paper presents an open-source design for an affordable, flexible holonomic mobile manipulator robot. Using powered casters for full directional control, the design simplifies mobile manipulation tasks and enables easier data collection for imitation learning. This platform could be particularly valuable for robotics researchers and developers working on household automation, as it provides an accessible way to collect human-guided demonstrations and develop learned policies for common household tasks, while its open-source nature encourages collaboration and innovation in the field.
Authors: Jimmy Wu, William Chong, Robert Holmberg, Aaditya Prasad, Yihuai Gao, Oussama Khatib, Shuran Song, Szymon Rusinkiewicz, Jeannette Bohg
Link: https://arxiv.org/abs/2412.10447v1
Date: 2024-12-11
Summary:
Exploiting the promise of recent advances in imitation learning for mobile manipulation will require the collection of large numbers of human-guided demonstrations. This paper proposes an open-source design for an inexpensive, robust, and flexible mobile manipulator that can support arbitrary arms, enabling a wide range of real-world household mobile manipulation tasks. Crucially, our design uses powered casters to enable the mobile base to be fully holonomic, able to control all planar degrees of freedom independently and simultaneously. This feature makes the base more maneuverable and simplifies many mobile manipulation tasks, eliminating the kinematic constraints that create complex and time-consuming motions in nonholonomic bases. We equip our robot with an intuitive mobile phone teleoperation interface to enable easy data acquisition for imitation learning. In our experiments, we use this interface to collect data and show that the resulting learned policies can successfully perform a variety of common household mobile manipulation tasks.
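A small kinematic illustration of what "fully holonomic" buys: the base can track any planar twist (vx, vy, omega), whereas a differential-drive base cannot command lateral velocity at all. The powered-caster drive that realizes this, which is the paper's hardware contribution, is not modeled here.

# Hedged sketch: integrating an arbitrary body-frame twist, which only a
# holonomic base can follow directly.
import numpy as np

def integrate_twist(pose: np.ndarray, vx: float, vy: float, omega: float,
                    dt: float = 0.02) -> np.ndarray:
    """pose = (x, y, theta) in the world frame; twist given in the body frame."""
    x, y, th = pose
    dx = (vx * np.cos(th) - vy * np.sin(th)) * dt
    dy = (vx * np.sin(th) + vy * np.cos(th)) * dt
    return np.array([x + dx, y + dy, th + omega * dt])

pose = np.zeros(3)
for _ in range(50):                      # holonomic: sideways motion while rotating
    pose = integrate_twist(pose, vx=0.0, vy=0.3, omega=0.5)
print(pose)                              # a differential-drive base would need vy == 0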
--------------------------------------------------------------------------------------------------------
Climate Aware Deep Neural Networks (CADNN) for Wind Power Simulation
Wind power forecasting is crucial for integrating renewable energy into power grids effectively. This paper proposes using Deep Neural Networks with climate datasets to improve wind power predictions. The approach leverages CMIP climate projections and compares various neural network architectures for optimal performance. This could be particularly valuable for power grid operators, energy companies, and policy makers who need accurate wind power forecasts to manage grid stability, optimize energy dispatch, and plan for future renewable energy integration.
Authors: Ali Forootani, Danial Esmaeili Aliabadi, Daniela Thraen
Link: https://arxiv.org/abs/2412.12160v1
Date: 2024-12-11
Summary:
Wind power forecasting plays a critical role in modern energy systems, facilitating the integration of renewable energy sources into the power grid. Accurate prediction of wind energy output is essential for managing the inherent intermittency of wind power, optimizing energy dispatch, and ensuring grid stability. This paper proposes the use of Deep Neural Network (DNN)-based predictive models that leverage climate datasets, including wind speed, atmospheric pressure, temperature, and other meteorological variables, to improve the accuracy of wind power simulations. In particular, we focus on the Coupled Model Intercomparison Project (CMIP) datasets, which provide climate projections, as inputs for training the DNN models. These models aim to capture the complex nonlinear relationships between the CMIP-based climate data and actual wind power generation at wind farms located in Germany. Our study compares various DNN architectures, specifically Multilayer Perceptron (MLP), Long Short-Term Memory (LSTM) networks, and Transformer-enhanced LSTM models, to identify the best configuration among these architectures for climate-aware wind power simulation. The implementation of this framework involves the development of a Python package (CADNN) designed to support multiple tasks, including statistical analysis of the climate data, data visualization, preprocessing, DNN training, and performance evaluation. We demonstrate that the DNN models, when integrated with climate data, significantly enhance forecasting accuracy. This climate-aware approach offers a deeper understanding of the time-dependent climate patterns that influence wind power generation, providing more accurate predictions and making it adaptable to other geographical regions.
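As a hedged illustration of one architecture in the comparison, the sketch below maps a window of climate variables to a wind power prediction with an LSTM; the CADNN package's actual models, features, and preprocessing are its own.

# Hedged sketch: LSTM regressor from climate-variable windows to wind power.
import torch
import torch.nn as nn

class WindPowerLSTM(nn.Module):
    def __init__(self, num_features: int = 4, hidden: int = 64, layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(num_features, hidden, num_layers=layers,
                            batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features), e.g. wind speed, pressure, temperature, humidity
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # power prediction from the last time step

model = WindPowerLSTM()
window = torch.randn(16, 48, 4)            # 16 windows, 48 hourly steps, 4 variables
pred = model(window)                       # (16, 1) predicted wind power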
--------------------------------------------------------------------------------------------------------
Discrete Subgraph Sampling for Interpretable Graph based Visual Question Answering
As AI systems become more complex, explaining their decision-making processes becomes increasingly important. This paper focuses on making visual question answering systems more interpretable by integrating discrete subset sampling methods. Rather than adding explanations after the fact, the system generates explanatory subgraphs alongside its predictions. This approach could be particularly valuable in applications where understanding the AI's reasoning process is crucial, such as medical diagnosis, autonomous vehicle decision-making, or educational systems where transparency in visual reasoning is essential for user trust and system validation.
Authors: Pascal Tilli, Ngoc Thang Vu
Link: https://arxiv.org/abs/2412.08263v1
Date: 2024-12-11
Summary:
Explainable artificial intelligence (XAI) aims to make machine learning models more transparent. While many approaches focus on generating explanations post-hoc, interpretable approaches, which generate the explanations intrinsically alongside the predictions, are relatively rare. In this work, we integrate different discrete subset sampling methods into a graph-based visual question answering system to compare their effectiveness in generating interpretable explanatory subgraphs intrinsically. We evaluate the methods on the GQA dataset and show that the integrated methods effectively mitigate the performance trade-off between interpretability and answer accuracy, while also achieving strong co-occurrences between answer and question tokens. Furthermore, we conduct a human evaluation to assess the interpretability of the generated subgraphs using a comparative setting with the extended Bradley-Terry model, showing that the answer and question token co-occurrence metrics strongly correlate with human preferences. Our source code is publicly available.
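One member of the discrete subset sampling family the paper compares, Gumbel top-k with a straight-through estimator, can be sketched in a few lines: sample the k most relevant scene-graph edges while keeping the selection differentiable. The details below are illustrative, not the paper's exact configuration.

# Hedged sketch: differentiable selection of k explanatory edges via Gumbel top-k.
import torch

def gumbel_topk_edges(edge_logits: torch.Tensor, k: int, tau: float = 1.0):
    """edge_logits: (num_edges,) learned relevance scores.
    Returns a hard 0/1 mask whose gradient flows through a soft relaxation."""
    gumbel = -torch.log(-torch.log(torch.rand_like(edge_logits)))
    perturbed = (edge_logits + gumbel) / tau
    soft = torch.softmax(perturbed, dim=-1)              # differentiable relaxation
    topk = perturbed.topk(k).indices
    hard = torch.zeros_like(soft).scatter_(0, topk, 1.0)
    return hard + (soft - soft.detach())                 # straight-through trick

# toy usage: keep the 3 most relevant of 10 scene-graph edges
logits = torch.randn(10, requires_grad=True)
mask = gumbel_topk_edges(logits, k=3)
(mask * torch.randn(10)).sum().backward()                # gradients reach the logits
print(mask.detach())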
--------------------------------------------------------------------------------------------------------
Go-Oracle: Automated Test Oracle for Go Concurrency Bugs
Concurrency bugs in Go programs present a significant challenge for developers, particularly due to Go's dual concurrency mechanisms. This paper introduces an automated approach to classify test executions as pass or fail using transformer-based neural networks trained on execution traces. Developed in collaboration with Bytedance developers, this tool could be particularly valuable for software development teams working with Go, especially in infrastructure systems where concurrent programming is common. By automating the test oracle process, it could significantly reduce the time and expertise needed to identify concurrency bugs.
Authors: Foivos Tsimpourlas, Chao Peng, Carlos Rosuero, Ping Yang, Ajitha Rajan
Link: https://arxiv.org/abs/2412.08061v1
Date: 2024-12-11
Summary:
The Go programming language has gained significant traction for developing software, especially in various infrastructure systems. Nonetheless, concurrency bugs have become a prevalent issue within Go, presenting a unique challenge due to the language's dual concurrency mechanisms: communicating sequential processes and shared memory. Detecting concurrency bugs and accurately classifying program executions as pass or fail presents an immense challenge, even for domain experts. We conducted a survey with expert developers at Bytedance that confirmed this challenge. Our work seeks to address the test oracle problem for Go programs, to automatically classify test executions as pass or fail. This problem has not been investigated in the literature for Go programs owing to its distinctive programming model. Our approach involves collecting both passing and failing execution traces from various subject Go programs. We capture a comprehensive array of execution events using the native Go execution tracer. Subsequently, we preprocess and encode these traces before training a transformer-based neural network to effectively classify the traces as either passing or failing. The evaluation of our approach encompasses 8 subject programs sourced from the GoBench repository. These subject programs are routinely used as benchmarks in an industry setting. Encouragingly, our test oracle, Go-Oracle, demonstrates high accuracies even when operating with a limited dataset, showcasing the efficacy and potential of our methodology. Developers at Bytedance strongly agreed that they would use the Go-Oracle tool over the current practice of manual inspections to classify tests for Go programs as pass or fail.
--------------------------------------------------------------------------------------------------------
Neptune: The Long Orbit to Benchmarking Long Video Understanding
Understanding long videos remains a significant challenge in AI, with most existing datasets and models focusing on short clips. This paper introduces a semi-automatic pipeline for generating challenging question-answer sets for long video understanding. Using large language and vision models to generate dense, time-aligned captions and questions, Neptune could be particularly valuable for developing and evaluating AI systems that need to understand extended video content, such as surveillance systems, educational video analysis, or content moderation platforms where comprehending long-form video context is crucial.
Authors: Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, Tobias Weyand
Link: https://arxiv.org/abs/2412.09582v1
Date: 2024-12-12
Summary:
This paper describes a semi-automatic pipeline to generate challenging question-answer-decoy sets for understanding long videos. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. In order to mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs), to automatically generate dense, time-aligned video captions, as well as tough question answer decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and consists of a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open source model-based metric GEM to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune
--------------------------------------------------------------------------------------------------------
Continuous Patient Monitoring with AI: Real-Time Analysis of Video in Hospital Care Settings
Healthcare settings require constant patient monitoring, particularly for high-risk patients. This paper presents an AI-driven platform for continuous video monitoring in hospitals, developed by LookDeep Health. The system analyzes patient behavior and interactions in real-time, with particular focus on fall prevention and safety monitoring. This technology could be particularly valuable for hospitals and care facilities looking to enhance patient safety, reduce adverse events, and optimize staff resource allocation, especially in settings with high-risk or vulnerable patient populations where continuous monitoring is essential but labor-intensive.
Authors: Paolo Gabriel, Peter Rehani, Tyler Troy, Tiffany Wyatt, Michael Choma, Narinder Singh
Link: https://arxiv.org/abs/2412.13152v1
Date: 2024-12-17
Summary:
This study introduces an AI-driven platform for continuous and passive patient monitoring in hospital settings, developed by LookDeep Health. Leveraging advanced computer vision, the platform provides real-time insights into patient behavior and interactions through video analysis, securely storing inference results in the cloud for retrospective evaluation. The dataset, compiled in collaboration with 11 hospital partners, encompasses over 300 high-risk fall patients and over 1,000 days of inference, enabling applications such as fall detection and safety monitoring for vulnerable patient populations. To foster innovation and reproducibility, an anonymized subset of this dataset is publicly available. The AI system detects key components in hospital rooms, including individual presence and role, furniture location, motion magnitude, and boundary crossings. Performance evaluation demonstrates strong accuracy in object detection (macro F1-score = 0.92) and patient-role classification (F1-score = 0.98), as well as reliable trend analysis for the "patient alone" metric (mean logistic regression accuracy = 0.82 ± 0.15). These capabilities enable automated detection of patient isolation, wandering, or unsupervised movement, key indicators for fall risk and other adverse events. This work establishes benchmarks for validating AI-driven patient monitoring systems, highlighting the platform's potential to enhance patient safety and care by providing continuous, data-driven insights into patient behavior and interactions.
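As a hedged illustration of how a "patient alone" signal could be derived from the detections the platform reports (individual presence and role), the sketch below computes a frame-level flag and a moving-average trend; LookDeep's actual metric definition and trend model are their own.

# Hedged sketch: frame-level "patient alone" flag and a smoothed occupancy trend.
import numpy as np

def patient_alone_series(frames: list) -> np.ndarray:
    """frames[t] = list of detected person roles in frame t,
    e.g. ['patient'], ['patient', 'nurse'], []."""
    return np.array([1 if ("patient" in roles and len(roles) == 1) else 0
                     for roles in frames])

def smoothed_trend(signal: np.ndarray, window: int = 30) -> np.ndarray:
    """Moving-average fraction of time the patient is alone over ~window frames."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="valid")

# toy usage: 40 frames alone, 10 with a nurse present, 20 alone again
frames = [["patient"]] * 40 + [["patient", "nurse"]] * 10 + [["patient"]] * 20
print(smoothed_trend(patient_alone_series(frames))[:5])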
--------------------------------------------------------------------------------------------------------