Week Ending 7.28.2024
RESEARCH WATCH: 7.28.2024
Lessons from Learning to Spin "Pens"
This paper explores in-hand manipulation of pen-like objects, a crucial skill for using many everyday tools. Current learning-based methods struggle with this task due to limited high-quality demonstrations and simulation-to-real-world gaps. The researchers use reinforcement learning to train an oracle policy in simulation, generating a dataset for pre-training and real-world trajectory replay. They then fine-tune the policy using real-world data. With fewer than 50 trajectories, their approach can rotate various pen-like objects for multiple revolutions. This work could improve robotic manipulation capabilities for tasks involving tools like screwdrivers or hammers in manufacturing, maintenance, or household robotics applications.
Authors: Jun Wang, Ying Yuan, Haichuan Che, Haozhi Qi, Yi Ma, Jitendra Malik, Xiaolong Wang
Link: https://arxiv.org/abs/2407.18902v1
Date: 2024-07-26
Summary:
In-hand manipulation of pen-like objects is an important skill in our daily lives, as many tools such as hammers and screwdrivers are similarly shaped. However, current learning-based methods struggle with this task due to a lack of high-quality demonstrations and the significant gap between simulation and the real world. In this work, we push the boundaries of learning-based in-hand manipulation systems by demonstrating the capability to spin pen-like objects. We first use reinforcement learning to train an oracle policy with privileged information and generate a high-fidelity trajectory dataset in simulation. This serves two purposes: 1) pre-training a sensorimotor policy in simulation; 2) conducting open-loop trajectory replay in the real world. We then fine-tune the sensorimotor policy using these real-world trajectories to adapt it to real-world dynamics. With fewer than 50 trajectories, our policy learns to rotate more than ten pen-like objects with different physical properties for multiple revolutions. We present a comprehensive analysis of our design choices and share the lessons learned during development.
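The three-stage recipe above lends itself to a compact sketch. Below is a minimal, hedged outline of the control flow only; every stage function (train_oracle, pretrain, replay_on_robot, finetune) and the oracle's rollout method are assumed callables supplied by the caller, not the authors' published API:

```python
# Sketch of the sim-to-real recipe described in the abstract. All stage
# functions and oracle.rollout are hypothetical stand-ins for the paper's
# components; only the control flow mirrors the abstract.

def build_spinning_policy(train_oracle, pretrain, replay_on_robot, finetune,
                          sim, robot, max_real_trajs=50, n_sim_trajs=10_000):
    # Stage 1: RL with privileged information yields an oracle policy
    # and a high-fidelity trajectory dataset in simulation.
    oracle = train_oracle(sim, privileged=True)
    sim_trajs = [oracle.rollout(sim) for _ in range(n_sim_trajs)]

    # Stage 2: the dataset serves two purposes: pre-training a sensorimotor
    # policy, and open-loop trajectory replay on the real robot.
    policy = pretrain(sim_trajs)
    real_trajs = [t for t in replay_on_robot(robot, sim_trajs) if t.success]

    # Stage 3: fine-tune on fewer than 50 successful real trajectories
    # to adapt the policy to real-world dynamics.
    return finetune(policy, real_trajs[:max_real_trajs])
```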
--------------------------------------------------------------------------------------------------------
AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild
This study addresses the challenge of 3D hand reconstruction in unconstrained environments, which is hindered by a lack of diverse in-the-wild datasets. The researchers propose AttentionHand, a method for generating controllable hand images guided by text prompts. By creating numerous varied hand images aligned with 3D labels, they build a new dataset to bridge the gap between indoor and outdoor scenes. The approach uses four input modalities and employs text and visual attention stages in a diffusion-based pipeline. This technology could enhance hand tracking in augmented reality, sign language recognition, or gesture-based interfaces for various applications.
Authors: Junho Park, Kyeongbo Kong, Suk-Ju Kang
Link: https://arxiv.org/abs/2407.18034v1
Date: 2024-07-25
Summary:
Recently, a significant amount of research has been conducted on 3D hand reconstruction for use in various forms of human-computer interaction. However, 3D hand reconstruction in the wild is challenging due to the extreme lack of in-the-wild 3D hand datasets. Especially when hands are in complex poses, such as interacting hands, problems like appearance similarity, self-occlusion, and depth ambiguity make the task even more difficult. To overcome these issues, we propose AttentionHand, a novel method for text-driven controllable hand image generation. Since AttentionHand can generate numerous and varied in-the-wild hand images well-aligned with 3D hand labels, we can acquire a new 3D hand dataset and reduce the domain gap between indoor and outdoor scenes. Our method takes four easy-to-use modalities (i.e., an RGB image, a hand mesh image from the 3D label, a bounding box, and a text prompt). These modalities are embedded into the latent space in the encoding phase. Then, in the text attention stage, hand-related tokens from the given text prompt are attended to highlight hand-related regions of the latent embedding. After the highlighted embedding is fed to the visual attention stage, hand-related regions in the embedding are attended by conditioning on global and local hand mesh images with the diffusion-based pipeline. In the decoding phase, the final feature is decoded into new hand images, which are well-aligned with the given hand mesh image and text prompt. As a result, AttentionHand achieves state-of-the-art performance among text-to-hand image generation models, and 3D hand mesh reconstruction improves when additionally trained with hand images generated by AttentionHand.
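Read as a dataflow, the pipeline in this abstract reduces to four composed stages. A sketch under that reading, with every component an injected callable rather than anything from the authors' code:

```python
# Dataflow sketch of the AttentionHand pipeline as the abstract orders it.
# encode / text_attend / visual_attend / decode are assumed components.

def attentionhand_generate(encode, text_attend, visual_attend, decode,
                           rgb, mesh_image, bbox, prompt):
    z = encode(rgb, mesh_image, bbox, prompt)   # embed the four modalities
    z = text_attend(z, prompt)                  # highlight hand-related tokens
    z = visual_attend(z, mesh_image)            # condition on global/local mesh
    return decode(z)                            # image aligned with mesh + text
```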
--------------------------------------------------------------------------------------------------------
OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
This paper introduces OVR, a large-scale dataset of annotated temporal repetitions in videos. Containing over 72,000 annotated videos from Kinetics and Ego4D, it covers both third-person and first-person viewpoints across diverse actions. The dataset includes repetition counts, start/end times, and free-form descriptions of repeating elements. The researchers also propose a baseline transformer model, OVRCounter, for localizing and counting repetitions in videos up to 320 frames long. This dataset and model could benefit action recognition, sports analysis, exercise tracking, and automated video indexing applications, enabling more nuanced understanding of repetitive actions in diverse video content.
Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Andrew Zisserman
Link: https://arxiv.org/abs/2407.17085v1
Date: 2024-07-24
Summary:
We introduce a dataset of annotations of temporal repetitions in videos. The dataset, OVR (pronounced as over), contains annotations for over 72K videos, with each annotation specifying the number of repetitions, the start and end time of the repetitions, and also a free-form description of what is repeating. The annotations are provided for videos sourced from Kinetics and Ego4D, and consequently cover both Exo and Ego viewing conditions, with a huge variety of actions and activities. Moreover, OVR is almost an order of magnitude larger than previous datasets for video repetition. We also propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos that are up to 320 frames long. The model is trained and evaluated on the OVR dataset, and its performance assessed with and without using text to specify the target class to count. The performance is also compared to a prior repetition counting model. The dataset is available for download at: https://sites.google.com/view/openvocabreps/
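The annotation fields named in the abstract (count, start/end times, free-form description) suggest a simple record type. A sketch of one plausible schema; the dataset's actual format may differ:

```python
from dataclasses import dataclass

@dataclass
class RepetitionAnnotation:
    video_id: str      # source clip from Kinetics or Ego4D
    count: int         # number of repetitions
    start_s: float     # start of the repeating segment, in seconds
    end_s: float       # end of the repeating segment, in seconds
    description: str   # open-vocabulary text describing what repeats

ann = RepetitionAnnotation("kinetics_abc123", 7, 2.4, 9.8,
                           "person skipping rope")
```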
--------------------------------------------------------------------------------------------------------
Patched RTC: evaluating LLMs for diverse software development tasks
This paper presents Patched Round-Trip Correctness (Patched RTC), a novel evaluation technique for Large Language Models (LLMs) in software development tasks like bug fixing and code review. It extends the original Round-Trip Correctness method to work with any LLM and task, offering a self-evaluating framework without human intervention. The study implements Patched RTC in an open-source framework called patchwork, allowing transparent evaluation across various workflows. This approach could significantly improve the assessment and selection of LLMs for software development tasks, potentially enhancing code quality, reducing bugs, and streamlining the development process in various industries.
Authors: Asankhaya Sharma
Link: https://arxiv.org/abs/2407.16557v1
Date: 2024-07-23
Summary:
This paper introduces Patched Round-Trip Correctness (Patched RTC), a novel evaluation technique for Large Language Models (LLMs) applied to diverse software development tasks, particularly focusing on "outer loop" activities such as bug fixing, code review, and documentation updates. Patched RTC extends the original Round-Trip Correctness method to work with any LLM and downstream task, offering a self-evaluating framework that measures consistency and robustness of model responses without human intervention. The study demonstrates a correlation between Patched RTC scores and task-specific accuracy metrics, presenting it as an alternative to the LLM-as-Judge paradigm for open-domain task evaluation. We implement Patched RTC in an open-source framework called patchwork, allowing for transparent evaluation during inference across various patchflows. Experiments comparing GPT-3.5 and GPT-4 models across different software development tasks reveal that Patched RTC effectively distinguishes model performance and task difficulty. The paper also explores the impact of consistency prompts on improving model accuracy, suggesting that Patched RTC can guide prompt refinement and model selection for complex software development workflows.
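The round-trip idea can be sketched in a few lines: generate an output for a request, ask the model to reconstruct the request from the output, and score the agreement across several samples. Here `llm` and `similarity` are assumed callables; the actual implementation lives in the patchwork framework:

```python
# Hedged sketch of round-trip correctness scoring, not patchwork's API.

def patched_rtc_score(llm, similarity, request, n_samples=4):
    scores = []
    for _ in range(n_samples):
        output = llm(f"Perform this task:\n{request}")
        # Backward pass: infer what request would have produced `output`.
        reconstruction = llm(f"State the request this response answers:\n{output}")
        scores.append(similarity(request, reconstruction))
    # High mean similarity across samples indicates consistent, robust output.
    return sum(scores) / len(scores)
```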
--------------------------------------------------------------------------------------------------------
Psychomatics -- A Multidisciplinary Framework for Understanding Artificial Minds
This paper introduces Psychomatics, a multidisciplinary framework bridging cognitive science, linguistics, and computer science to understand how Large Language Models (LLMs) process information compared to human cognition. The researchers focus on language development and use, drawing parallels between LLMs and biological systems. They highlight LLMs' ability to map and manipulate complex linguistic patterns but note their limitations in experiential, emotional, and embodied aspects of cognition. This framework could inform the development of more human-like AI systems, potentially improving natural language interfaces, enhancing AI-human interaction, and deepening our understanding of both artificial and biological intelligence.
Authors: Giuseppe Riva, Fabrizia Mantovani, Brenda K. Wiederhold, Antonella Marchetti, Andrea Gaggioli
Link: https://arxiv.org/abs/2407.16444v1
Date: 2024-07-23
Summary:
Although LLMs and other artificial intelligence systems demonstrate cognitive skills similar to humans, like concept learning and language acquisition, the way they process information fundamentally differs from biological cognition. To better understand these differences, this paper introduces Psychomatics, a multidisciplinary framework bridging cognitive science, linguistics, and computer science. It aims to better understand the high-level functioning of LLMs, focusing specifically on how LLMs acquire, learn, remember, and use information to produce their outputs. To achieve this goal, Psychomatics relies on a comparative methodology, starting from a theory-driven research question (is the process of language development and use different in humans and LLMs?) and drawing parallels between LLMs and biological systems. Our analysis shows how LLMs can map and manipulate complex linguistic patterns in their training data. Moreover, LLMs can follow Grice's Cooperative Principle to provide relevant and informative responses. However, human cognition draws from multiple sources of meaning, including experiential, emotional, and imaginative facets, which transcend mere language processing and are rooted in our social and developmental trajectories. Moreover, current LLMs lack physical embodiment, reducing their ability to make sense of the intricate interplay between perception, action, and cognition that shapes human understanding and expression. Ultimately, Psychomatics holds the potential to yield transformative insights into the nature of language, cognition, and intelligence, both artificial and biological. By drawing parallels between LLMs and human cognitive processes, Psychomatics can inform the development of more robust and human-like AI systems.
--------------------------------------------------------------------------------------------------------
A deeper look at depth pruning of LLMs
This study explores different block importance metrics for pruning Large Language Models (LLMs), including adaptive metrics like Shapley value. The researchers extend their analysis to individual self-attention and feed-forward layers, finding that self-attention layers are more amenable to pruning. They also investigate performance recovery techniques using lightweight adapters. This work could lead to more efficient LLM deployment, reducing computational resources and energy consumption while maintaining performance. Potential applications include optimizing language models for mobile devices, improving real-time language processing in resource-constrained environments, and making large-scale language models more accessible for various applications.
Authors: Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, Pavlo Molchanov
Link: https://arxiv.org/abs/2407.16286v1
Date: 2024-07-23
Summary:
Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as Shapley value in addition to static ones explored in prior work. We show that adaptive metrics exhibit a trade-off in performance between tasks, i.e., improvement on one task may degrade performance on another due to differences in the computed block influences. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the self-attention layers to be more amenable to pruning, even allowing removal of up to 33% of the self-attention layers without incurring any performance degradation on MMLU for Mistral 7b (a significant reduction in the costly maintenance of the KV-cache). Finally, we look at simple performance recovery techniques to emulate the pruned layers by training lightweight additive bias or low-rank linear adapters. Performance recovery using emulated updates avoids performance degradation for the initial blocks (up to 5% absolute improvement on MMLU), which is either competitive or superior to the learning-based technique.
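For the adaptive metric, a Monte Carlo estimate of per-block Shapley values can be sketched as permutation sampling, where each block's contribution is its marginal effect on a downstream score. `evaluate(active_blocks)` is an assumed callable returning, say, MMLU accuracy with only those blocks active; the paper's exact estimator may differ:

```python
import random

def shapley_block_importance(n_blocks, evaluate, n_permutations=20):
    contrib = [0.0] * n_blocks
    for _ in range(n_permutations):
        order = random.sample(range(n_blocks), n_blocks)  # random permutation
        active, prev = set(), evaluate(frozenset())
        for b in order:
            active.add(b)
            score = evaluate(frozenset(active))
            contrib[b] += score - prev     # marginal gain of adding block b
            prev = score
    return [c / n_permutations for c in contrib]
```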
--------------------------------------------------------------------------------------------------------
DDK: Distilling Domain Knowledge for Efficient Large Language Models
This paper introduces DDK, a new Large Language Model (LLM) distillation framework that dynamically adjusts the composition of the distillation dataset based on domain performance differences between teacher and student models. This approach addresses the issue of uneven knowledge transfer across domains in existing distillation methods. The researchers demonstrate that DDK significantly improves student model performance, outperforming both continuously pretrained baselines and existing knowledge distillation methods. This technique could lead to more efficient and effective LLMs for domain-specific applications, such as specialized chatbots, industry-specific language models, or tailored language understanding systems for various sectors.
Authors: Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
Link: https://arxiv.org/abs/2407.16154v1
Date: 2024-07-23
Summary:
Despite their advanced capabilities across various applications, large language models (LLMs) impose significant computational and storage demands. Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a high-performing LLM (i.e., the teacher model). Prevailing techniques in LLM distillation typically use a black-box model API to generate high-quality pretraining and alignment datasets, or utilize white-box distillation by altering the loss function to better transfer knowledge from the teacher LLM. However, these methods ignore the knowledge differences between the student and teacher LLMs across domains. This results in excessive focus on domains with minimal performance gaps and insufficient attention to domains with large gaps, reducing overall performance. In this paper, we introduce a new LLM distillation framework called DDK, which dynamically adjusts the composition of the distillation dataset in a smooth manner according to the domain performance differences between the teacher and student models, making the distillation process more stable and effective. Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin.
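The core mechanism admits a small sketch: re-weight the distillation mixture toward domains where the student trails the teacher most, with smoothing so proportions change gradually. The update rule and names below are assumptions, not DDK's published formula:

```python
import numpy as np

def domain_mixture(teacher_scores, student_scores, prev_weights, alpha=0.5):
    gaps = np.maximum(np.asarray(teacher_scores)
                      - np.asarray(student_scores), 1e-6)
    target = gaps / gaps.sum()                  # emphasize large-gap domains
    weights = alpha * np.asarray(prev_weights) + (1 - alpha) * target
    return weights / weights.sum()              # per-domain sampling proportions

# Student trails badly on the second domain, so it gets the largest share:
print(domain_mixture([0.80, 0.70, 0.90], [0.75, 0.40, 0.88],
                     [1 / 3, 1 / 3, 1 / 3]))
```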
--------------------------------------------------------------------------------------------------------
Towards Robust Knowledge Tracing Models via k-Sparse Attention
This study proposes sparseKT, a framework to improve the robustness and generalization of attention-based Deep Learning Knowledge Tracing (DLKT) models. The researchers incorporate a k-selection module to pick items with the highest attention scores, using two sparsification heuristics. This approach helps DLKT models focus on relevant student interactions and achieves comparable predictive performance to state-of-the-art models. The sparseKT framework could enhance personalized learning systems, adaptive educational technologies, and intelligent tutoring systems by providing more accurate and robust predictions of student knowledge and performance across various educational contexts.
Authors: Shuyan Huang, Zitao Liu, Xiangyu Zhao, Weiqi Luo, Jian Weng
Link: https://arxiv.org/abs/2407.17097v1
Date: 2024-07-24
Summary:
Knowledge tracing (KT) is the problem of predicting students' future performance based on their historical interaction sequences. With its capability of capturing contextual long-term dependencies, the attention mechanism has become an essential component in many deep learning based KT (DLKT) models. In spite of the impressive performance achieved by these attentional DLKT models, many of them run the risk of overfitting, especially on small-scale educational datasets. Therefore, in this paper, we propose sparseKT, a simple yet effective framework to improve the robustness and generalization of attention-based DLKT approaches. Specifically, we incorporate a k-selection module to pick only the items with the highest attention scores. We propose two sparsification heuristics: (1) soft-thresholding sparse attention and (2) top-K sparse attention. We show that sparseKT helps attentional KT models discard irrelevant student interactions and achieves predictive performance comparable to 11 state-of-the-art KT models on three publicly available real-world educational datasets. To encourage reproducible research, we make our data and code publicly available at https://github.com/pykt-team/pykt-toolkit (our model is also merged into the pyKT benchmark at https://pykt.org/).
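The second heuristic, top-K sparse attention, is straightforward to sketch: keep only the k largest attention scores per query and renormalize, so the model attends to the k most relevant past interactions. A generic PyTorch version, not the authors' module:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_top=8):
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (B, Tq, Tk)
    k_eff = min(k_top, scores.size(-1))
    kth = scores.topk(k_eff, dim=-1).values[..., -1:]    # k-th largest per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                 # renormalized over top-k

q = k = v = torch.randn(2, 16, 32)
out = topk_sparse_attention(q, k, v)                     # (2, 16, 32)
```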
--------------------------------------------------------------------------------------------------------
From Sands to Mansions: Enabling Automatic Full-Life-Cycle Cyberattack Construction with LLM
This paper introduces AURORA, an automatic end-to-end cyberattack construction and emulation framework leveraging Large Language Models (LLMs). AURORA can autonomously build multi-stage cyberattack plans based on Cyber Threat Intelligence reports, construct emulation infrastructures, and execute attack procedures. The framework incorporates a wider range of attack techniques than professional red teams and can construct attacks and infrastructures in minutes without human intervention. This technology could significantly enhance cybersecurity testing and evaluation, helping organizations identify vulnerabilities, improve defense strategies, and train security personnel more effectively against advanced and evolving cyber threats.
Authors: Lingzhi Wang, Jiahui Wang, Kyle Jung, Kedar Thiagarajan, Emily Wei, Xiangmin Shen, Yan Chen, Zhenyuan Li
Link: https://arxiv.org/abs/2407.16928v1
Date: 2024-07-24
Summary:
The escalating battles between attackers and defenders in cybersecurity make it imperative to test and evaluate defense capabilities from the attackers' perspective. However, constructing full-life-cycle cyberattacks and performing red team emulations requires significant time and domain knowledge from security experts. Existing cyberattack simulation frameworks face challenges such as limited technical coverage, inability to conduct full-life-cycle attacks, and the need for manual infrastructure building. These limitations hinder the quality and diversity of the constructed attacks. In this paper, we leverage the capabilities of Large Language Models (LLMs) in summarizing knowledge from existing attack intelligence and generating executable machine code based on human knowledge, and propose AURORA, an automatic end-to-end cyberattack construction and emulation framework. AURORA can autonomously build multi-stage cyberattack plans based on Cyber Threat Intelligence (CTI) reports, construct the emulation infrastructures, and execute the attack procedures. We also developed an attack procedure knowledge graph to integrate knowledge about attack techniques throughout the full life cycle of advanced cyberattacks from various sources. We constructed and evaluated more than 20 full-life-cycle cyberattacks based on existing CTI reports. Compared to previous attack simulation frameworks, AURORA can construct multi-step attacks and their infrastructures in several minutes without human intervention. Furthermore, AURORA incorporates a wider range (40% more) of attack techniques into the constructed attacks more efficiently than professional red teams. To benefit further research, we open-sourced the dataset containing the execution files and infrastructures of 20 emulated cyberattacks.
--------------------------------------------------------------------------------------------------------
Take a Step and Reconsider: Sequence Decoding for Self-Improved Neural Combinatorial Optimization
This paper presents a novel sequence decoding method for self-improved learning in Neural Combinatorial Optimization (NCO). The approach uses sampling without replacement and incrementally follows the best solution found, repeating the process from intermediate partial solutions. By modifying the policy to ignore previously sampled sequences, it increases solution diversity. The method shows strong performance on various optimization problems, including the Traveling Salesman, Capacitated Vehicle Routing, and Job Shop Scheduling Problems. This technique could enhance optimization algorithms for logistics, supply chain management, manufacturing scheduling, and other industries relying on complex combinatorial problem-solving.
Authors: Jonathan Pirnay, Dominik G. Grimm
Link: https://arxiv.org/abs/2407.17206v1
Date: 2024-07-24
Summary:
The constructive approach within Neural Combinatorial Optimization (NCO) treats a combinatorial optimization problem as a finite Markov decision process, where solutions are built incrementally through a sequence of decisions guided by a neural policy network. To train the policy, recent research is shifting toward a 'self-improved' learning methodology that addresses the limitations of reinforcement learning and supervised approaches. Here, the policy is iteratively trained in a supervised manner, with solutions derived from the current policy serving as pseudo-labels. The way these solutions are obtained from the policy determines the quality of the pseudo-labels. In this paper, we present a simple and problem-independent sequence decoding method for self-improved learning based on sampling sequences without replacement. We incrementally follow the best solution found and repeat the sampling process from intermediate partial solutions. By modifying the policy to ignore previously sampled sequences, we force it to consider only unseen alternatives, thereby increasing solution diversity. Experimental results for the Traveling Salesman and Capacitated Vehicle Routing Problem demonstrate its strong performance. Furthermore, our method outperforms previous NCO approaches on the Job Shop Scheduling Problem.
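The decoding loop reads as: sample candidates, skip sequences already seen, commit to a prefix of the best solution so far, and resample from that partial solution. A toy sketch with `sample_completion(prefix)` and `reward(seq)` as assumed callables (proper sampling without replacement would use the policy masking the paper describes):

```python
def self_improved_decode(sample_completion, reward, horizon,
                         rounds=4, samples_per_round=16, step=2):
    prefix, seen, best = [], set(), None
    for _ in range(rounds):
        for _ in range(samples_per_round):
            seq = tuple(prefix) + tuple(sample_completion(prefix))
            if seq in seen:
                continue                        # only unseen alternatives count
            seen.add(seq)
            if best is None or reward(seq) > reward(best):
                best = seq
        if best is not None:                    # follow the best solution found
            prefix = list(best[:len(prefix) + step])
        if len(prefix) >= horizon:
            break
    return best                                 # pseudo-label for training
```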
--------------------------------------------------------------------------------------------------------
Advancing Brain Imaging Analysis Step-by-step via Progressive Self-paced Learning
This study introduces the Progressive Self-Paced Distillation (PSPD) framework for brain imaging analysis, addressing challenges such as heterogeneity, individual variations, and small dataset sizes. PSPD employs an adaptive and progressive pacing and distillation mechanism, allowing for dynamic curriculum adjustments based on past and present model states. The framework demonstrates superior performance and generalization capabilities across various convolutional neural networks using the Alzheimer's Disease Neuroimaging Initiative dataset. This approach could significantly improve medical image analysis, potentially enhancing early diagnosis, treatment planning, and research in neurodegenerative diseases and other brain disorders.
Authors: Yanwu Yang, Hairui Chen, Jiesi Hu, Xutao Guo, Ting Ma
Link: https://arxiv.org/abs/2407.16128v1
Date: 2024-07-23
Summary:
Recent advancements in deep learning have reshaped the development of brain imaging analysis. However, several challenges remain, such as heterogeneity, individual variations, and the tension between the high dimensionality and small size of brain imaging datasets. These issues complicate the learning process, preventing models from capturing intrinsic, meaningful patterns and potentially leading to suboptimal performance due to biases and overfitting. Curriculum learning (CL) presents a promising solution by organizing training examples from simple to complex, mimicking the human learning process and potentially fostering the development of more robust and accurate models. Despite its potential, the inherent limitations posed by small initial training datasets present significant challenges, including overfitting and poor generalization. In this paper, we introduce the Progressive Self-Paced Distillation (PSPD) framework, employing an adaptive and progressive pacing and distillation mechanism. This allows for dynamic curriculum adjustments based on the states of both past and present models. The past model serves as a teacher, guiding the current model with gradually refined curriculum knowledge and helping prevent the loss of previously acquired knowledge. We validate PSPD's efficacy and adaptability across various convolutional neural networks using the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, underscoring its superiority in enhancing model performance and generalization capabilities. The source code for this approach will be released at https://github.com/Hrychen7/PSPD.
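One way to read the mechanism is a per-sample pacing weight (easy examples first, with the threshold raised each epoch) combined with distillation from the previous model acting as teacher. A sketch under those assumptions; PSPD's actual pacing and distillation terms may differ:

```python
import torch
import torch.nn.functional as F

def pspd_loss(student_logits, past_teacher_logits, labels,
              pace_threshold, kd_weight=0.5, tau=2.0):
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    w = (ce.detach() < pace_threshold).float()      # self-paced: easy samples first
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(past_teacher_logits / tau, dim=-1),
                  reduction="none").sum(-1) * tau ** 2
    return (w * ce + kd_weight * kd).mean()         # raise threshold over epochs
```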
--------------------------------------------------------------------------------------------------------
GenRec: A Flexible Data Generator for Recommendations
This paper presents GenRec, a novel framework for generating synthetic user-item interactions that exhibit realistic properties observed in recommendation scenarios. Based on a stochastic generative process using latent factor modeling, GenRec offers high flexibility and a wide range of hyper-parameters for customizing interaction generation. The framework addresses the scarcity of realistic datasets in benchmarking recommender systems and social network analysis methods. GenRec could be valuable for researchers and developers in e-commerce, social media, content streaming platforms, and other recommendation-based systems, enabling more robust testing and development of recommendation algorithms without relying on sensitive user data.
Authors: Erica Coppolillo, Simone Mungari, Ettore Ritacco, Giuseppe Manco
Link: https://arxiv.org/abs/2407.16594v1
Date: 2024-07-23
Summary:
The scarcity of realistic datasets poses a significant challenge in benchmarking recommender systems and social network analysis methods and techniques. A common and effective solution is to generate synthetic data that simulates realistic interactions. However, although various methods have been proposed, the existing literature still lacks generators that are fully adaptable and allow easy manipulation of the underlying data distributions and structural properties. To address this issue, the present work introduces GenRec, a novel framework for generating synthetic user-item interactions that exhibit realistic and well-known properties observed in recommendation scenarios. The framework is built on a stochastic generative process grounded in latent factor modeling. Here, the latent factors can be exploited to yield long-tailed preference distributions, while at the same time characterizing subpopulations of users and topic-based item clusters. Notably, the proposed framework is highly flexible and offers a wide range of hyper-parameters for customizing the generation of user-item interactions. The code used to perform the experiments is publicly available at https://anonymous.4open.science/r/GenRec-DED3.
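A toy version of such a latent-factor generative process fits in a dozen lines: user and item factors set interaction propensities, and a power-law popularity term induces the long tail. Hyper-parameter names below are illustrative, not GenRec's:

```python
import numpy as np

def generate_interactions(n_users=1000, n_items=500, dim=8,
                          tail_exponent=1.2, density=0.01, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_users, dim))            # user latent factors
    V = rng.normal(size=(n_items, dim))            # item latent factors
    popularity = np.arange(1, n_items + 1) ** -tail_exponent
    logits = U @ V.T + np.log(popularity)          # affinity + long-tail bias
    probs = 1 / (1 + np.exp(-logits))
    probs *= density * probs.size / probs.sum()    # calibrate overall density
    return rng.random((n_users, n_items)) < probs  # boolean interaction matrix

X = generate_interactions()
print(X.sum(), "interactions; most popular items:", X.sum(0).argsort()[-5:])
```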
--------------------------------------------------------------------------------------------------------
On the Use of Immersive Digital Technologies for Designing and Operating UAVs
This paper provides a comprehensive overview of current research and developments involving immersive digital technologies, such as Digital Twin (DT) and Extended Reality (XR), for Unmanned Aerial Vehicles (UAVs). The authors explore the integration of these technologies with Artificial Intelligence algorithms to create more intelligent, adaptive, and responsive UAV systems. They also discuss research gaps and suggest future directions. This work could inform the development of advanced UAV control systems, enhancing their use in applications such as communication relay networks, disaster response, precision agriculture, and urban planning, by improving situational awareness and decision-making capabilities.
Authors: Yousef Emami, Kai Li, Luis Almeida, Wei Ni
Link: https://arxiv.org/abs/2407.16288v1
Date: 2024-07-23
Summary:
Unmanned Aerial Vehicles (UAVs) provide agile and safe solutions to communication relay networks, offering improved throughput. However, their modeling and control present challenges, and real-world deployment is hindered by the gap between simulation and reality. Moreover, enhancing situational awareness is critical. Several works in the literature proposed integrating UAV operation with immersive digital technologies, such as Digital Twin (DT) and Extended Reality (XR), to address these challenges. This paper provides a comprehensive overview of current research and developments involving immersive digital technologies for UAVs, including the latest advancements and emerging trends. We also explore the integration of DT and XR with Artificial Intelligence (AI) algorithms to create more intelligent, adaptive, and responsive UAV systems. Finally, we provide discussions, identify gaps in current research, and suggest future directions for studying the application of immersive technologies in UAVs, fostering further innovation and development in this field. We envision the fusion of DTs with XR will transform how UAVs operate, offering tools that enhance visualization, improve decision-making, and enable effective collaboration.
--------------------------------------------------------------------------------------------------------
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
This paper introduces AppWorld Engine, a high-quality execution environment of 9 day-to-day apps operable via 457 APIs, and AppWorld Benchmark, a suite of 750 natural, diverse, and challenging autonomous agent tasks. These tools address the gap in existing benchmarks for tool use, which typically cover only simple API call sequences. AppWorld supports robust programmatic evaluation with state-based unit tests, allowing for different task completion methods while checking for unexpected changes. This benchmark could significantly advance the development and evaluation of interactive coding agents, potentially improving AI assistants, automation tools, and software testing methodologies across various industries.
Authors: Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian
Link: https://arxiv.org/abs/2407.18901v1
Date: 2024-07-26
Summary:
Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, shopping apps) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.
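The state-based unit-test idea can be sketched as a snapshot diff: record app state before and after the agent runs, require the expected changes, and fail on any other difference (collateral damage). `snapshot` and the key names are assumptions, not AppWorld's actual API:

```python
def check_task(snapshot, run_agent, expected_changes):
    before = snapshot()                 # e.g., dict mapping app tables to state
    run_agent()
    after = snapshot()
    diff = {k: (before.get(k), after.get(k))
            for k in set(before) | set(after)
            if before.get(k) != after.get(k)}
    unexpected = set(diff) - set(expected_changes)
    assert not unexpected, f"collateral damage: {unexpected}"
    for key, want in expected_changes.items():
        assert after.get(key) == want, f"{key} not updated as required"
```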
--------------------------------------------------------------------------------------------------------
Adversarial Robust Decision Transformer: Enhancing Robustness of RvS via Minimax Returns-to-go
This study proposes the Adversarial Robust Decision Transformer (ARDT), a worst-case-aware Reinforcement Learning via Supervised Learning (RvS) algorithm. ARDT learns and conditions the policy on in-sample minimax returns-to-go, aligning the target return with the worst-case return learned through minimax expectile regression. This approach enhances robustness against powerful test-time adversaries in sequential games and continuous adversarial RL environments. ARDT could improve the performance and reliability of AI systems in adversarial or uncertain environments, with potential applications in game theory, cybersecurity, financial modeling, and autonomous systems operating in complex, dynamic scenarios.
Authors: Xiaohang Tang, Afonso Marques, Parameswaran Kamalaruban, Ilija Bogunovic
Link: https://arxiv.org/abs/2407.18414v1
Date: 2024-07-25
Summary:
Decision Transformer (DT), as one of the representative Reinforcement Learning via Supervised Learning (RvS) methods, has achieved strong performance in offline learning tasks by leveraging the powerful Transformer architecture for sequential decision-making. However, in adversarial environments, these methods can be non-robust, since the return is dependent on the strategies of both the decision-maker and adversary. Training a probabilistic model conditioned on observed return to predict action can fail to generalize, as the trajectories that achieve a return in the dataset might have done so due to a weak and suboptimal behavior adversary. To address this, we propose a worst-case-aware RvS algorithm, the Adversarial Robust Decision Transformer (ARDT), which learns and conditions the policy on in-sample minimax returns-to-go. ARDT aligns the target return with the worst-case return learned through minimax expectile regression, thereby enhancing robustness against powerful test-time adversaries. In experiments conducted on sequential games with full data coverage, ARDT can generate a maximin (Nash Equilibrium) strategy, the solution with the largest adversarial robustness. In large-scale sequential games and continuous adversarial RL environments with partial data coverage, ARDT demonstrates significantly superior robustness to powerful test-time adversaries and attains higher worst-case returns compared to contemporary DT methods.
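The minimax ingredient rests on expectile regression: with a small tau, the expectile approaches an in-sample minimum, so regressing returns-to-go at a low tau estimates the worst-case return to condition on. A generic loss sketch, not ARDT's full minimax pipeline:

```python
import torch

def expectile_loss(pred, target, tau=0.1):
    diff = target - pred
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))  # asymmetric L2
    return (weight * diff ** 2).mean()

# Small tau penalizes over-prediction far more than under-prediction, pulling
# estimates toward the lowest returns in the data, i.e., the return achieved
# against the strongest adversary.
```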
--------------------------------------------------------------------------------------------------------
Co-designing an AI Impact Assessment Report Template with AI Practitioners and AI Compliance Experts
This paper presents a co-designed template for AI impact assessment reports, developed through an iterative process with AI practitioners and compliance experts. The template is grounded in the EU AI Act, NIST's AI Risk Management Framework, and ISO 42001 AI Management System. It effectively provides necessary information for impact assessments and documents the broad impacts of AI systems. This tool could be valuable for companies developing AI systems, helping them ensure compliance with regulations, guide the design stage of AI uses, and assess potential impacts. It may also aid policymakers and regulators in standardizing AI impact assessments across industries.
Authors: Edyta Bogucka, Marios Constantinides, Sanja Šćepanović, Daniele Quercia
Link: https://arxiv.org/abs/2407.17374v1
Date: 2024-07-24
Summary:
In the evolving landscape of AI regulation, it is crucial for companies to conduct impact assessments and document their compliance through comprehensive reports. However, current reports lack grounding in regulations and often focus on specific aspects like privacy in relation to AI systems, without addressing the real-world uses of these systems. Moreover, there is no systematic effort to design and evaluate these reports with both AI practitioners and AI compliance experts. To address this gap, we conducted an iterative co-design process with 14 AI practitioners and 6 AI compliance experts and proposed a template for impact assessment reports grounded in the EU AI Act, NIST's AI Risk Management Framework, and ISO 42001 AI Management System. We evaluated the template by producing an impact assessment report for an AI-based meeting companion at a major tech company. A user study with 8 AI practitioners from the same company and 5 AI compliance experts from industry and academia revealed that our template effectively provides necessary information for impact assessments and documents the broad impacts of AI systems. Participants envisioned using the template not only at the pre-deployment stage for compliance but also as a tool to guide the design stage of AI uses.
--------------------------------------------------------------------------------------------------------
Dataset Distribution Impacts Model Fairness: Single vs. Multi-Task Learning
This study evaluates the performance of skin lesion classification using ResNet-based CNNs, focusing on patient sex variations in training data and three different learning strategies. The researchers present a linear programming method for generating datasets with varying patient sex and class labels. They find that sex-specific training data yields better results, single-task models exhibit sex bias, and datasets including male patients enhance model performance for the male subgroup. This research could inform the development of fairer and more accurate medical image analysis models, potentially improving diagnostic accuracy and reducing biases in healthcare applications.
Authors: Ralf Raumanns, Gerard Schouten, Josien P. W. Pluim, Veronika Cheplygina
Link: https://arxiv.org/abs/2407.17543v1
Date: 2024-07-24
Summary:
The influence of bias in datasets on the fairness of model predictions is a topic of ongoing research in various fields. We evaluate the performance of skin lesion classification using ResNet-based CNNs, focusing on patient sex variations in training data and three different learning strategies. We present a linear programming method for generating datasets with varying patient sex and class labels, taking into account the correlations between these variables. We evaluated the model performance using three different learning strategies: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our observations include: 1) sex-specific training data yields better results, 2) single-task models exhibit sex bias, 3) the reinforcement approach does not remove sex bias, 4) the adversarial model eliminates sex bias in cases involving only female patients, and 5) datasets that include male patients enhance model performance for the male subgroup, even when female patients are the majority. To generalise these findings, in future research, we will examine more demographic attributes, like age, and other possibly confounding factors, such as skin colour and artefacts in the skin lesions. We make all data and models available on GitHub.
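The linear-programming step can be illustrated with scipy: choose per-cell sample counts (sex x class) that balance the marginals while maximizing dataset size. The cells, availabilities, and constraints below are invented for illustration; the paper's formulation is more detailed:

```python
from scipy.optimize import linprog

# cells: (female,benign), (female,malignant), (male,benign), (male,malignant)
avail = [600, 200, 500, 300]          # images available per cell (assumed)
c = [-1, -1, -1, -1]                  # maximize total count = minimize -total
A_eq = [[1, 1, -1, -1],               # female count equals male count
        [1, -1, 1, -1]]               # benign count equals malignant count
b_eq = [0, 0]
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, a) for a in avail])
print([round(x) for x in res.x])      # per-cell counts to sample
```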
--------------------------------------------------------------------------------------------------------
Lawma: The Power of Specialization for Legal Tasks
This paper conducts a comprehensive study of 260 legal text classification tasks, comparing the performance of GPT-4 and a fine-tuned Llama 3 model. The researchers demonstrate that a lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points. They also show that a single model can be fine-tuned on all 260 tasks simultaneously with only a small loss in accuracy. This work could significantly impact empirical legal research, offering a more efficient and accurate alternative to traditional human annotation or prompting commercial models for legal text classification tasks.
Authors: Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, Michael Livermore
Link: https://arxiv.org/abs/2407.16615v1
Date: 2024-07-23
Summary:
Annotation and classification of legal text are central components of empirical legal research. Traditionally, these tasks are often delegated to trained research assistants. Motivated by the advances in language modeling, empirical legal scholars are increasingly turning to prompting commercial models, hoping that it will alleviate the significant cost of human annotation. Despite growing use, our understanding of how to best utilize large language models for legal tasks remains limited. We conduct a comprehensive study of 260 legal text classification tasks, nearly all new to the machine learning community. Starting from GPT-4 as a baseline, we show that it has non-trivial but highly varied zero-shot accuracy, often exhibiting performance that may be insufficient for legal work. We then demonstrate that a lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points. We find that larger models respond better to fine-tuning than smaller models. A few tens to hundreds of examples suffice to achieve high classification accuracy. Notably, we can fine-tune a single model on all 260 tasks simultaneously at a small loss in accuracy relative to having a separate model for each task. Our work points to a viable alternative to the predominant practice of prompting commercial models. For concrete legal tasks with some available labeled data, researchers are better off using a fine-tuned open-source model.
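Fine-tuning one model on all 260 tasks at once is, mechanically, a data-formatting question: prepend a task identifier to each example so a single set of weights serves every task. A sketch of that joint-dataset construction; the prompt format here is illustrative, not the paper's:

```python
def to_joint_examples(tasks):
    """tasks: {task_name: [(text, label), ...]} -> list of (prompt, target)."""
    examples = []
    for name, pairs in tasks.items():
        for text, label in pairs:
            prompt = f"Task: {name}\nText: {text}\nAnswer:"
            examples.append((prompt, f" {label}"))
    return examples

joint = to_joint_examples({
    "court_direction": [("The judgment below is reversed.", "reversed")],
    "issue_area": [("Defendant appeals the sentence imposed...", "criminal")],
})
```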
--------------------------------------------------------------------------------------------------------
Privacy Threats and Countermeasures in Federated Learning for Internet of Things: A Systematic Review
This systematic review analyzes recent literature to identify privacy threats in Federated Learning (FL) within Internet of Things (IoT) environments and evaluates defensive measures to mitigate these threats. The researchers identify various privacy threats, including inference attacks, poisoning attacks, and eavesdropping, along with defensive measures such as Differential Privacy and Secure Multi-Party Computation. This work could inform the development of more secure and privacy-preserving FL systems for IoT applications, potentially improving data protection in smart homes, industrial IoT, healthcare IoT, and other connected environments.
Authors: Adel ElZemity, Budi Arief
Link: https://arxiv.org/abs/2407.18096v1
Date: 2024-07-25
Summary:
Federated Learning (FL) in the Internet of Things (IoT) environments can enhance machine learning by utilising decentralised data, but at the same time, it might introduce significant privacy and security concerns due to the constrained nature of IoT devices. This represents a research challenge that we aim to address in this paper. We systematically analysed recent literature to identify privacy threats in FL within IoT environments, and evaluate the defensive measures that can be employed to mitigate these threats. Using a Systematic Literature Review (SLR) approach, we searched five publication databases (Scopus, IEEE Xplore, Wiley, ACM, and Science Direct), collating relevant papers published between 2017 and April 2024, a period which spans from the introduction of FL until now. Guided by the PRISMA protocol, we selected 49 papers to focus our systematic review on. We analysed these papers, paying special attention to the privacy threats and defensive measures, specifically within the context of IoT, using inclusion and exclusion criteria tailored to highlight recent advances and critical insights. We identified various privacy threats, including inference attacks, poisoning attacks, and eavesdropping, along with defensive measures such as Differential Privacy and Secure Multi-Party Computation. These defences were evaluated for their effectiveness in protecting privacy without compromising the functional integrity of FL in IoT settings. Our review underscores the necessity for robust and efficient privacy-preserving strategies tailored for IoT environments. Notably, there is a need for strategies against replay, evasion, and model stealing attacks. Exploring lightweight defensive measures and emerging technologies such as blockchain may help improve the privacy of FL in IoT, leading to the creation of FL models that can operate under variable network conditions.
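One defence named above, Differential Privacy, is easy to sketch in the FL setting via the Gaussian mechanism: clip each client update, then add calibrated noise before aggregation. Parameter choices below are illustrative only:

```python
import numpy as np

def dp_aggregate(client_updates, clip_norm=1.0, noise_multiplier=1.1, seed=0):
    rng = np.random.default_rng(seed)
    clipped = []
    for u in client_updates:
        u = np.asarray(u, dtype=float)
        scale = min(1.0, clip_norm / max(np.linalg.norm(u), 1e-12))
        clipped.append(u * scale)                  # bound each client's influence
    agg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=agg.shape)
    return agg + noise                             # privatized global update

print(dp_aggregate([np.ones(4), 2 * np.ones(4), -np.ones(4)]))
```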
--------------------------------------------------------------------------------------------------------
Sublinear Regret for a Class of Continuous-Time Linear-Quadratic Reinforcement Learning Problems
This study presents a model-free approach to reinforcement learning for continuous-time linear-quadratic control problems where the volatility of state processes depends on both state and control variables. The researchers devise an actor-critic algorithm to learn the optimal policy parameter directly and introduce a novel exploration schedule. They prove that the algorithm achieves a sublinear regret bound and provide convergence rates. This work could enhance control systems in various applications, such as robotics, autonomous vehicles, and financial modeling, where continuous-time dynamics and state-dependent volatility are important considerations.
Authors: Yilie Huang, Yanwei Jia, Xun Yu Zhou
Link: https://arxiv.org/abs/2407.17226v1
Date: 2024-07-24
Summary:
We study reinforcement learning (RL) for a class of continuous-time linear-quadratic (LQ) control problems for diffusions where volatility of the state processes depends on both state and control variables. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly. Our main contributions include the introduction of a novel exploration schedule and a regret analysis of the proposed algorithm. We provide the convergence rate of the policy parameter to the optimal one, and prove that the algorithm achieves a regret bound of O(N^{3/4}) up to a logarithmic factor. We conduct a simulation study to validate the theoretical results and demonstrate the effectiveness and reliability of the proposed algorithm. We also perform numerical comparisons between our method and those of the recent model-based stochastic LQ RL studies adapted to the state- and control-dependent volatility setting, demonstrating a better performance of the former in terms of regret bounds.
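For readers unfamiliar with the setting, the phrase "volatility depends on both state and control" corresponds to the standard controlled-diffusion form below; the notation is assumed here, not copied from the paper:

```latex
% LQ diffusion with state- and control-dependent volatility:
dX_t = (A X_t + B u_t)\,dt + (C X_t + D u_t)\,dW_t,
% with the algorithm's regret over N episodes bounded, up to a
% logarithmic factor, by
\mathrm{Regret}(N) = \widetilde{O}\!\left(N^{3/4}\right).
```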
--------------------------------------------------------------------------------------------------------