Week Ending 5.19.2024

 

RESEARCH WATCH: 5.19.2024

 

LoRA Learns Less and Forgets Less

Low-Rank Adaptation (LoRA) enables memory-efficient fine-tuning of large language models by training only low-rank weight perturbations. This study compares LoRA's performance to full fine-tuning across programming and math domains. While it underperforms full fine-tuning, LoRA provides beneficial regularization: it maintains capabilities outside the target domain and generates more diverse outputs. Understanding LoRA's strengths could guide applications aiming to specialize large models efficiently.

Authors:  Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham

Link:  https://arxiv.org/abs/2405.09673v1

Date: 2024-05-15

Summary:

Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (≈100K prompt-response pairs) and continued pretraining (≈10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain. We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
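
To make the mechanism concrete, here is a minimal sketch of a LoRA linear layer in PyTorch; the dimensions and hyperparameters (r, alpha) are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA layer: y = base(x) + (alpha / r) * x A^T B^T, with the base weight frozen."""
    def __init__(self, in_features, out_features, r=16, alpha=32):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False                     # freeze pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))   # B A starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # Full finetuning would update the base weight directly; LoRA trains
        # only the rank-r perturbation B @ A, where r << min(in, out).
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(4096, 4096, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 2 * 4096 * 16, vs 4096^2 for full finetuning
```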

--------------------------------------------------------------------------------------------------------

Adaptation of Distinct Semantics for Uncertain Areas in Polyp Segmentation

Accurate polyp segmentation from colonoscopy images can aid polyp diagnosis and surgery planning. However, achieving high segmentation performance is challenging due to polyp variations. This work introduces ADSNet, an architecture that modifies misclassified details and recovers weak features for improved polyp segmentation. With its ability to flexibly integrate different encoders and decoders, ADSNet could enhance computer-aided polyp detection systems for better clinical outcomes.

Authors:  Quang Vinh Nguyen, Van Thong Huynh, Soo-Hyung Kim

Link:  https://arxiv.org/abs/2405.07523v1

Date: 2024-05-13

Summary:

Colonoscopy is a common and practical method for detecting and treating polyps. Segmenting polyps from colonoscopy images is useful for diagnosis and surgery planning. Nevertheless, achieving excellent segmentation performance is still difficult because of polyp characteristics like shape, color, condition, and poor distinction from the surrounding context. This work presents a novel architecture, Adaptation of Distinct Semantics for Uncertain Areas in Polyp Segmentation (ADSNet), which modifies misclassified details and recovers weak features that would otherwise vanish and go undetected at the final stage. The architecture consists of a complementary trilateral decoder that produces an early global map. A continuous attention module modifies the semantics of high-level features to analyze two separate semantics of the early global map. The method is evaluated on polyp benchmarks for both learning ability and generalization ability; experimental results demonstrate strong correction and recovery ability, leading to better segmentation performance than other state-of-the-art methods on the polyp image segmentation task. Notably, the proposed architecture can be flexibly combined with other CNN-based encoders, Transformer-based encoders, and decoder backbones.

--------------------------------------------------------------------------------------------------------

I-CTRL: Imitation to Control Humanoid Robots Through Constrained Reinforcement Learning

Retargeting human motions to humanoid robots often sacrifices physics feasibility for visual fidelity, hindering practical robot deployment. I-CTRL employs constrained reinforcement learning to refine retargeted motions, achieving physics-based imitation that resembles the reference human trajectory. This framework's ability to follow large motion datasets with a single agent could streamline controlling humanoid robots, advancing applications like robotic assistants mimicking human dexterity.

Authors:  Yashuai Yan, Esteve Valls Mascaro, Tobias Egle, Dongheui Lee

Link:  https://arxiv.org/abs/2405.08726v1

Date: 2024-05-14

Summary:

This paper addresses the critical need for refining robot motions that, despite achieving high visual similarity through human-to-humanoid retargeting methods, fall short of practical execution in the physical realm. Existing techniques in the graphics community often prioritize visual fidelity over physics-based feasibility, posing a significant challenge for deploying bipedal systems in practical applications. Our research introduces a constrained reinforcement learning algorithm that produces physics-based, high-quality motion imitation on legged humanoid robots, enhancing motion resemblance while successfully following the reference human trajectory. We name our framework I-CTRL. By reformulating the motion imitation problem as constrained refinement over non-physics-based retargeted motions, our framework excels in motion imitation with simple and unique rewards that generalize across four robots. Moreover, our framework can follow large-scale motion datasets with a single RL agent. The proposed approach signifies a crucial step forward in advancing the control of bipedal robots, emphasizing the importance of aligning visual and physical realism for successful motion imitation.
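
The summary does not spell out the constraint formulation; the sketch below shows one standard way to set up constrained RL via Lagrangian relaxation, with hypothetical imitation-reward and constraint-cost signals standing in for the paper's rewards:

```python
import torch

def constrained_update_terms(imitation_reward, constraint_cost, lam, cost_limit=0.1):
    """Lagrangian relaxation of constrained RL:
       max_pi E[r]  s.t.  E[c] <= d   becomes   min_{lam>=0} max_pi E[r] - lam * (E[c] - d).
    Returns a policy loss (to minimize) and a dual loss for the multiplier lambda."""
    policy_loss = -(imitation_reward.mean()
                    - lam.detach() * (constraint_cost.mean() - cost_limit))
    # Dual ascent: lambda grows while the constraint is violated, shrinks otherwise
    # (lambda is typically clamped to stay non-negative after each step).
    dual_loss = -lam * (constraint_cost.mean().detach() - cost_limit)
    return policy_loss, dual_loss

lam = torch.tensor(1.0, requires_grad=True)
reward = torch.rand(128)        # e.g., similarity to the reference human motion
cost = torch.rand(128) * 0.3    # e.g., joint-limit or balance violations
policy_loss, dual_loss = constrained_update_terms(reward, cost, lam)
```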

--------------------------------------------------------------------------------------------------------

Societal Adaptation to Advanced AI

While AI risk mitigation often focuses on controlling AI development, this approach becomes less feasible as advanced AI proliferates. This paper proposes increasing societal adaptation by reducing negative impacts from AI diffusion. A conceptual framework identifies interventions to avoid, defend against, and remedy AI misuse across scenarios like election manipulation. Implementing this cycle could build resilience against advanced AI risks, benefiting governments, industries and society.

Authors:  Jamie Bernardi, Gabriel Mukobi, Hilary Greaves, Lennart Heim, Markus Anderljung

Link:  https://arxiv.org/abs/2405.10295v1

Date: 2024-05-16

Summary:

Existing strategies for managing risks from advanced AI systems often focus on affecting what AI systems are developed and how they diffuse. However, this approach becomes less feasible as the number of developers of advanced AI grows, and impedes beneficial use-cases as well as harmful ones. In response, we urge a complementary approach: increasing societal adaptation to advanced AI, that is, reducing the expected negative impacts from a given level of diffusion of a given AI capability. We introduce a conceptual framework which helps identify adaptive interventions that avoid, defend against and remedy potentially harmful uses of AI systems, illustrated with examples in election manipulation, cyberterrorism, and loss of control to AI decision-makers. We discuss a three-step cycle that society can implement to adapt to AI. Increasing society's ability to implement this cycle builds its resilience to advanced AI. We conclude with concrete recommendations for governments, industry, and third-parties.

--------------------------------------------------------------------------------------------------------

From Sora What We Can See: A Survey of Text-to-Video Generation

Generating videos from text inputs, as showcased by OpenAI's Sora, is a milestone towards artificial general intelligence. This survey comprehensively reviews text-to-video generation algorithms categorized by evolutionary generators, pursuit of excellence, and achieving realism. It outlines widely used datasets, evaluation metrics, and identifies open challenges. The insights could guide future research advancing this emerging field with applications in content creation, virtual environments, and human-AI interaction.

Authors:  Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoying Zhang, Wenqi Li, Haoran Duan, Bo Wei, Rajiv Ranjan

Link:  https://arxiv.org/abs/2405.10674v1

Date: 2024-05-17

Summary:

With impressive achievements made, artificial intelligence is on the path toward artificial general intelligence. Sora, developed by OpenAI and capable of minute-level world simulation, can be considered a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we disassemble Sora from the perspective of text-to-video generation and conduct a comprehensive review of the literature, trying to answer the question "From Sora What We Can See". Specifically, after basic preliminaries regarding the general algorithms are introduced, the literature is categorized along three mutually orthogonal dimensions: evolutionary generators, excellent pursuit, and realistic panorama. Subsequently, the widely used datasets and metrics are organized in detail. Last but most importantly, we identify several challenges and open problems in this domain and propose potential future directions for research and development.

--------------------------------------------------------------------------------------------------------

The Platonic Representation Hypothesis

This paper argues that representations learned by AI models, especially deep networks, are converging towards a shared statistical model representing an "ideal reality" akin to Plato's concept. It surveys literature examples of representation convergence and demonstrates convergence across vision and language modalities as models grow larger. Understanding the forces driving this hypothesized convergence and its implications could influence the design of more capable and generalizable AI systems.

Authors:  Minyoung Huh, Brian Cheung, Tongzhou Wang, Phillip Isola

Link:  https://arxiv.org/abs/2405.07987v1

Date: 2024-05-13

Summary:

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.
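
One way to quantify such convergence is to compare the representational geometry two models induce over the same inputs. The sketch below uses linear CKA, a standard kernel-alignment metric (the paper's own alignment measure may differ), on synthetic stand-ins for features from two models:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representations of the same n inputs.
    X: (n, d1), Y: (n, d2) feature matrices from two different models."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
Z = rng.normal(size=(512, 64))                    # shared latent "reality"
feats_model_a = Z @ rng.normal(size=(64, 256))    # stand-ins for two models'
feats_model_b = Z @ rng.normal(size=(64, 128))    # embeddings of the same data
# Noticeably higher than for independent features: both spaces inherit Z's geometry.
print(round(linear_cka(feats_model_a, feats_model_b), 3))
```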

--------------------------------------------------------------------------------------------------------

TimeX++: Learning Time-Series Explanations with Information Bottleneck

Explaining deep learning models for time series data is crucial for interpretability in applications like environmental monitoring. TimeX++ introduces an information bottleneck-based objective to learn high-quality time series explanations avoiding trivial solutions and distribution shift issues. Its parametric network generates explanation-embedded instances preserving label information. Outperforming baselines, TimeX++ could enhance the transparency of time series models for applications requiring reliable explanations.

Authors:  Zichuan Liu, Tianchun Wang, Jimeng Shi, Xu Zheng, Zhuomin Chen, Lei Song, Wenqian Dong, Jayantha Obeysekera, Farhad Shirani, Dongsheng Luo

Link:  https://arxiv.org/abs/2405.09308v1

Date: 2024-05-15

Summary:

Explaining deep learning models operating on time series data is crucial in various applications that require interpretable and transparent insights from time series signals. In this work, we investigate this problem from an information-theoretic perspective and show that most existing measures of explainability may suffer from trivial solutions and distributional shift issues. To address these issues, we introduce a simple yet practical objective function for time series explainable learning. The design of the objective function builds upon the principle of the information bottleneck (IB), modifying the IB objective to avoid trivial solutions and distributional shift issues. We further present TimeX++, a novel explanation framework that leverages a parametric network to produce explanation-embedded instances that are both in-distribution and label-preserving. We evaluate TimeX++ on both synthetic and real-world datasets, comparing its performance against leading baselines, and validate its practical efficacy through case studies in a real-world environmental application. Quantitative and qualitative evaluations show that TimeX++ outperforms baselines across all datasets, demonstrating a substantial improvement in explanation quality for time series data. The source code is available at https://github.com/zichuan-liu/TimeXplusplus.
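
A hedged sketch of the IB-flavored training signal described above: a parametric explainer produces a saliency mask, masked-out steps are replaced by a reference signal so the explanation-embedded instance stays in-distribution, and the loss trades label preservation against mask compactness. The names and exact loss form are illustrative, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def ib_explanation_loss(model, explainer, x, reference, beta=0.1):
    """model: frozen time-series classifier; explainer: network emitting
    per-timestep logits shaped like x; reference: in-distribution baseline signal."""
    mask = torch.sigmoid(explainer(x))             # saliency in [0, 1]
    x_expl = mask * x + (1 - mask) * reference     # explanation-embedded instance
    with torch.no_grad():
        p_full = F.softmax(model(x), dim=-1)       # label distribution on original
    log_p_expl = F.log_softmax(model(x_expl), dim=-1)
    label_term = F.kl_div(log_p_expl, p_full, reduction="batchmean")  # preserve label info
    sparsity_term = mask.mean()                    # compress: explain with few timesteps
    return label_term + beta * sparsity_term
```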

--------------------------------------------------------------------------------------------------------

Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation

While large-scale pretraining benefits robot learning, current paradigms only pretrain visual representations from abundant data. This work uses contact microphones as tactile sensors, leveraging audio-visual pretraining to boost manipulation performance in low-data regimes common in robotics. As the first approach doing multisensory pretraining for robotic manipulation, it could improve capabilities of robots operating in contact-rich tasks like assistants grasping objects.

Authors:  Jared Mejia, Victoria Dean, Tess Hellebrekers, Abhinav Gupta

Link:  https://arxiv.org/abs/2405.08576v1

Date: 2024-05-14

Summary:

Although pre-training on a large amount of data is beneficial for robot learning, current paradigms only perform large-scale pretraining for visual representations, whereas representations for other modalities are trained from scratch. In contrast to the abundance of visual data, it is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing. Such pretraining becomes increasingly crucial in the low-data regimes common in robotics applications. In this paper, we address this gap by using contact microphones as an alternative tactile sensor. Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pretraining to obtain representations that boost the performance of robotic manipulation. To the best of our knowledge, our method is the first approach leveraging large-scale multisensory pre-training for robotic manipulation. For supplementary information including videos of real robot experiments, please see https://sites.google.com/view/hearing-touch.
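
The core idea, treating contact-microphone signals as audio so that large-scale audio pretrained encoders can featurize them, can be sketched as follows; `audio_encoder` and the concatenation-based fusion are illustrative assumptions, not the authors' exact models:

```python
import torch
import torchaudio

# Contact-microphone waveforms are "audio", so a standard mel front-end applies.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

def fused_observation(image_feats, contact_mic_waveform, audio_encoder):
    """image_feats: (B, d_img) visual features; contact_mic_waveform: (1, samples).
    audio_encoder is a placeholder for an audio-visually pretrained network."""
    spec = mel(contact_mic_waveform)               # (1, 64, frames) "tactile" spectrogram
    audio_feats = audio_encoder(spec.unsqueeze(0)) # reuse large-scale pretraining
    # The manipulation policy consumes a single multisensory embedding.
    return torch.cat([image_feats, audio_feats.flatten(1)], dim=-1)
```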

--------------------------------------------------------------------------------------------------------

Progressive enhancement and restoration for mural images under low-light and defected conditions based on multi-receptive field strategy

Ancient murals provide invaluable cultural insights but suffer continuous damage and poor visibility from low lighting. This work proposes MER, a two-stage model to enhance and restore damaged, low-light mural images. With commendable performance and a dedicated website utilizing MER, this approach could aid archaeological efforts by batch processing mural images to digitally preserve this cultural heritage.

Authors:  Xiameng Wei, Binbin Fan, Ying Wang, Yanxiang Feng, Laiyi Fu

Link:  https://arxiv.org/abs/2405.08245v1

Date: 2024-05-14

Summary:

Ancient murals are valuable cultural heritage with great archaeological value. Through their content, they provide insights into ancient religions, ceremonies, folklore, and more. However, due to long-term oxidation and inadequate protection, ancient murals have suffered continuous damage, including peeling and mold. Additionally, since ancient murals were typically painted indoors, the light intensity in images captured by digital devices is often low, and the poor visibility hampers further restoration of damaged areas. To address the escalating damage to ancient frescoes and facilitate batch restoration at archaeological sites, we propose a two-stage restoration model called MER (Mural Enhancement and Restoration net) for ancient murals that are damaged and captured in low light. Our two-stage model not only enhances the visual quality of restored images but also achieves commendable results in relevant metric evaluations compared with other competitors. Furthermore, we have launched a website dedicated to the restoration of ancient mural paintings, utilizing the proposed model. Code is available at https://gitee.com/bbfan2024/MER.git.

--------------------------------------------------------------------------------------------------------

No Joke: An Embodied Conversational Agent Greeting Older Adults with Humour or a Smile Unrelated to Initial Acceptance

As conversational AI assistants for older adults increase, understanding factors influencing their initial acceptance is important. This study evaluated if positive first impressions like laughter or smiles from an embodied agent impacted older adults' attitudes. Countering expectations, results showed no effects beyond general technology attitudes. These insights could guide designing acceptable user onboarding experiences for deploying conversational AI assistants among older populations.

Authors:  Ge "Rikaku" Li, Katie Seaborn

Link:  https://arxiv.org/abs/2405.08242v1

Date: 2024-05-14

Summary:

Embodied conversational agents (ECAs) are increasingly being developed for older adults as assistants or companions. Older adults may not be familiar with ECAs, influencing uptake and acceptability. First impressions can correlate strongly with subsequent judgments, even of computer agents, and could influence acceptance. Using the circumplex model of affect, we developed three versions of an ECA -- laughing, smiling, and neutral in expression -- to evaluate how positive first impressions affect acceptance. Results from 249 older adults indicated no statistically significant effects except for general attitudes towards technology and intelligent agents. This questions the potential of laughter, jokes, puns, and smiles as a method of initial engagement for older adults.

--------------------------------------------------------------------------------------------------------

On-device Online Learning and Semantic Management of TinyML Systems

Tiny Machine Learning (TinyML) enables real-time on-device AI on embedded systems, but practical implementation faces challenges. This study bridges the gap between prototyping single TinyML models and developing reliable production systems by proposing: online learning to adapt models to evolving data, federated meta-learning for generalizing across heterogeneous devices with scarce data, and semantic management for jointly handling diverse models and devices at scale. Evaluated on real-world applications like image classification and audio sensing, the proposed methods could facilitate widespread adoption of efficient AI on resource-constrained edge devices.

Authors:  Haoyu Ren, Xue Li, Darko Anicic, Thomas A. Runkler

Link:  https://arxiv.org/abs/2405.07601v2

Date: 2024-05-15

Summary:

Recent advances in Tiny Machine Learning (TinyML) empower low-footprint embedded devices for real-time on-device Machine Learning. While many acknowledge the potential benefits of TinyML, its practical implementation presents unique challenges. This study aims to bridge the gap between prototyping single TinyML models and developing reliable TinyML systems in production: (1) Embedded devices operate in dynamically changing conditions. Existing TinyML solutions primarily focus on inference, with models trained offline on powerful machines and deployed as static objects. However, static models may underperform in the real world due to evolving input data distributions. We propose online learning to enable training on constrained devices, adapting local models towards the latest field conditions. (2) Nevertheless, current on-device learning methods struggle with heterogeneous deployment conditions and the scarcity of labeled data when applied across numerous devices. We introduce federated meta-learning incorporating online learning to enhance model generalization, facilitating rapid learning. This approach ensures optimal performance among distributed devices by knowledge sharing. (3) Moreover, TinyML's pivotal advantage is widespread adoption. Embedded devices and TinyML models prioritize extreme efficiency, leading to diverse characteristics ranging from memory and sensors to model architectures. Given their diversity and non-standardized representations, managing these resources becomes challenging as TinyML systems scale up. We present semantic management for the joint management of models and devices at scale. We demonstrate our methods through a basic regression example and then assess them in three real-world TinyML applications: handwritten character image classification, keyword audio classification, and smart building presence detection, confirming our approaches' effectiveness.
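
A minimal sketch of the on-device online-learning idea, assuming a frozen deployed backbone and a small trainable head updated as field samples stream in (layer sizes, optimizer, and names are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # deployed feature extractor
for p in backbone.parameters():
    p.requires_grad = False                               # inference-only on the device
head = nn.Linear(32, 4)                                   # small head adapted in the field
opt = torch.optim.SGD(head.parameters(), lr=1e-2)

def online_step(x, y):
    """One streaming update, adapting the local model toward the latest field conditions."""
    loss = F.cross_entropy(head(backbone(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(online_step(torch.randn(1, 64), torch.tensor([2])))
```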

--------------------------------------------------------------------------------------------------------

Building a Luganda Text-to-Speech Model From Crowdsourced Data

Text-to-speech (TTS) development for African languages like Luganda is limited by the scarcity of high-quality single-speaker recordings. This work shows that utilizing multiple speakers with similar intonation from crowdsourced data like Common Voice, combined with preprocessing to enhance audio quality, can significantly improve Luganda TTS over existing approaches. The proposed methodology enables developing more natural-sounding TTS systems for low-resource languages, benefiting applications requiring speech interfaces in local languages.

Authors:  Sulaiman Kagumire, Andrew Katumba, Joyce Nakatumba-Nabende, John Quinn

Link:  https://arxiv.org/abs/2405.10211v1

Date: 2024-05-16

Summary:

Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged 20-49. Although the generated speech is intelligible, it is still of lower quality than speech from a model trained on studio-grade recordings. This is due to the insufficient data preprocessing methods applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations, as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can improve by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation, determined by subjectively listening to and comparing their voice recordings. In addition to trimming out silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to filter recordings with an estimated MOS over 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
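
The curation pipeline described above can be sketched roughly as follows, where `enhance` and `estimate_mos` are hypothetical callables standing in for the pretrained speech-enhancement and non-intrusive MOS-prediction models, and the trim threshold is an illustrative choice:

```python
import librosa

MOS_THRESHOLD = 3.5   # keep only clips with high predicted perceptual quality

def preprocess_clip(path, enhance, estimate_mos, sr=22050):
    """Returns a cleaned waveform, or None if the clip fails the quality filter."""
    wav, _ = librosa.load(path, sr=sr)
    wav, _ = librosa.effects.trim(wav, top_db=30)   # drop leading/trailing silence
    wav = enhance(wav)                              # reduce background noise
    return wav if estimate_mos(wav) >= MOS_THRESHOLD else None
```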

--------------------------------------------------------------------------------------------------------

Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model

Recent advances in generative speech language models (SLMs) operating on discrete speech tokens present a new paradigm for text-to-speech synthesis. This paper evaluates the capabilities and limitations of this innovative approach through extensive experiments across dimensions like speaking style, intelligibility, and speaker consistency. While excelling at generating varied prosody and spontaneous behaviors, SLM-based TTS lags in intelligibility and speaker consistency compared to conventional TTS. These insights benchmark SLM-based speech synthesis and guide future advancements leveraging the scalability and context-awareness of generative SLMs.

Authors:  Siyang Wang, Éva Székely

Link:  https://arxiv.org/abs/2405.09768v1

Date: 2024-05-16

Summary:

Recent advances in generative language modeling applied to discrete speech tokens have presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes face issues such as unintelligibility and the inclusion of non-speech noises or hallucination. As the adoption of this innovative paradigm in speech synthesis increases, there is a clear need for an in-depth evaluation of its capabilities and limitations. In this paper, we evaluate TTS from a discrete token-based SLM through both automatic metrics and listening tests. We examine five key dimensions: speaking style, intelligibility, speaker consistency, prosodic variation, and spontaneous behaviour. Our results highlight the model's strength in generating varied prosody and spontaneous outputs. It is also rated higher in naturalness and context appropriateness in listening tests than a conventional TTS. However, the model's performance in intelligibility and speaker consistency lags behind traditional TTS. Additionally, we show that increasing the scale of SLMs offers a modest boost in robustness. Our findings aim to serve as a benchmark for future advancements in generative SLMs for speech synthesis.
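
Two of the evaluated dimensions lend themselves to simple automatic proxies; the sketch below assumes an external ASR system (for transcripts) and a speaker-embedding encoder `embed`, and is not necessarily the paper's exact metric implementation:

```python
import numpy as np
from jiwer import wer

def intelligibility(reference_texts, asr_transcripts):
    """Intelligibility proxy: word error rate of an ASR system on the synthesized speech."""
    return wer(reference_texts, asr_transcripts)

def speaker_consistency(embed, utterances):
    """Speaker-consistency proxy: mean pairwise cosine similarity of speaker
    embeddings across generated utterances (embed is a hypothetical encoder)."""
    E = np.stack([embed(u) for u in utterances])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    return sims[np.triu_indices(len(E), k=1)].mean()
```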

--------------------------------------------------------------------------------------------------------

Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection

While large language models (LLMs) demonstrate strong zero-shot performance across tasks, their ability to process potentially idiomatic language remains unclear. This work evaluates the performance of LLMs like GPT-4 on three idiomaticity datasets, comparing them to fine-tuned task-specific models. Although LLMs give competitive results with consistent scale improvements, they do not match fine-tuned models' performance even at the largest scales. The investigation provides benchmarks for LLMs on idiomatic language understanding, crucial for applications involving processing of colloquial expressions.

Authors:  Dylan Phelps, Thomas Pickard, Maggie Mi, Edward Gow-Smith, Aline Villavicencio

Link:  https://arxiv.org/abs/2405.09279v1

Date: 2024-05-15

Summary:

Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
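
A zero-shot prompting setup of the kind evaluated here might look like the following sketch; the prompt wording and the `llm` callable (local or software-as-a-service) are illustrative, not the paper's:

```python
PROMPT = """Decide whether the expression in the sentence is used idiomatically
or literally. Answer with exactly one word: "idiomatic" or "literal".

Expression: {mwe}
Sentence: {sentence}
Answer:"""

def classify_idiomaticity(llm, mwe, sentence):
    """llm: any text-completion callable mapping a prompt string to a response string."""
    answer = llm(PROMPT.format(mwe=mwe, sentence=sentence)).strip().lower()
    return "idiomatic" if answer.startswith("idiom") else "literal"

# e.g. classify_idiomaticity(my_model, "kick the bucket",
#                            "He kicked the bucket after a long illness.")
```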

--------------------------------------------------------------------------------------------------------

How to Surprisingly Consider Recommendations? A Knowledge-Graph-based Approach Relying on Complex Network Metrics

Traditional recommender systems focus on similarity, often suggesting popular items rather than surfacing unexpectedly relevant options. This study proposes a knowledge graph-based approach to incorporate unexpectedness into recommendations by reranking based on their impact on structural network metrics of user-item graphs. Evaluated on music listening and movie viewing data, reranking via complex network metrics leads to more surprising yet relevant recommendation lists compared to conventional techniques. This approach could help recommenders balance familiarity and novelty when suggesting content.

Authors:  Oliver Baumann, Durgesh Nandini, Anderson Rossanez, Mirco Schoenfeld, Julio Cesar dos Reis

Link:  https://arxiv.org/abs/2405.08465v1

Date: 2024-05-14

Summary:

Traditional recommendation proposals, including content-based and collaborative filtering, usually focus on similarity between items or users. Existing approaches lack ways of introducing unexpectedness into recommendations, prioritizing globally popular items over exposing users to unforeseen items. This investigation aims to design and evaluate a novel layer on top of recommender systems suited to incorporate relational information and suggest items with a user-defined degree of surprise. We propose a Knowledge Graph (KG) based recommender system that encodes user interactions on item catalogs. Our study explores whether network-level metrics on KGs can influence the degree of surprise in recommendations. We hypothesize that surprisingness correlates with certain network metrics, treating user profiles as subgraphs within a larger catalog KG. The resulting solution reranks recommendations based on their impact on structural graph metrics, contributing to the optimization of recommendation lists to reflect these metrics. We experimentally evaluate our approach on two datasets of LastFM listening histories and synthetic Netflix viewing profiles. We find that reranking items based on complex network metrics leads to a more unexpected and surprising composition of recommendation lists.
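
The reranking idea can be sketched with networkx: score each candidate by how much adding it to the user's profile subgraph shifts a structural metric, then sort. The metric choice here (average clustering) is an illustrative assumption, not necessarily one the paper uses:

```python
import networkx as nx

def surprise_score(catalog_kg, user_items, candidate, metric=nx.average_clustering):
    """Score a candidate by the perturbation it causes to a structural metric
    of the user's profile subgraph within the catalog knowledge graph."""
    base = catalog_kg.subgraph(user_items)
    extended = catalog_kg.subgraph(set(user_items) | {candidate})
    return abs(metric(extended) - metric(base))

def rerank(catalog_kg, user_items, candidates):
    """Rerank an existing recommendation list from most to least surprising."""
    return sorted(candidates,
                  key=lambda c: surprise_score(catalog_kg, user_items, c),
                  reverse=True)
```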

--------------------------------------------------------------------------------------------------------

Building imaginary-time thermal field theory with artificial neural networks

This study introduces a neural network-based approach to estimate effective field theory actions from configurations following the Boltzmann distribution at different temperatures. Employing continuous mixture autoregressive networks on simple quantum field configurations, the methodology can construct actions at specified temperatures and interpolate to intermediate temperatures. With its ability to explore phase diagrams in detail, this neural network approach to thermal field theories could provide insights into phenomena like phase transitions and critical behavior.

Authors:  Tian Xu, Lingxiao Wang, Lianyi He, Kai Zhou, Yin Jiang

Link:  https://arxiv.org/abs/2405.10493v1

Date: 2024-05-17

Summary:

In this study, we introduce a novel approach in quantum field theories to estimate the action using artificial neural networks (ANNs). The estimation is achieved by learning on system configurations governed by the Boltzmann factor e^{-S} at different temperatures within the imaginary-time formalism of thermal field theory. We focus on a 0+1 dimensional quantum field with kink/anti-kink configurations to demonstrate the feasibility of the method. Continuous-mixture autoregressive networks (CANs) enable the construction of accurate effective actions with tractable probability density estimation. Our numerical results demonstrate that this methodology not only facilitates the construction of effective actions at specified temperatures but also adeptly estimates the action at intermediate temperatures using data from both lower and higher temperature ensembles. This capability is especially valuable for the detailed exploration of phase diagrams.
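
The underlying relation is that configurations drawn from p(φ) ∝ e^{-S(φ)} let any density estimator with tractable log-likelihood recover the action up to an additive constant, S(φ) = -log p(φ) + const. A toy sketch of this step, using a known Gaussian distribution in place of a trained CAN:

```python
import torch

def effective_action(log_prob_model, phi):
    """Given a fitted density with tractable log-likelihood, the effective
    action is recovered (up to an additive constant) as -log p(phi)."""
    return -log_prob_model(phi)

# Toy check with a known Gaussian "theory": S(phi) = phi^2 / 2 + const.
gaussian = torch.distributions.Normal(0.0, 1.0)
phi = torch.linspace(-2, 2, 5)
print(effective_action(gaussian.log_prob, phi))  # quadratic in phi, as expected
```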

--------------------------------------------------------------------------------------------------------

Harnessing the power of longitudinal medical imaging for eye disease prognosis using Transformer-based sequence modeling

While deep learning excels at diagnosing diseases from medical images, standard approaches only assess the current state, neglecting prognostic value from longitudinal patient data. For slowly progressing eye diseases, this work proposes the Longitudinal Transformer for Survival Analysis (LTSA) to forecast future disease risk from sequences of fundus images over time. Significantly outperforming single-timepoint baselines on AMD and glaucoma prognosis, LTSA demonstrates how leveraging longitudinal imaging could improve treatment planning by dynamically monitoring disease progression.

Authors:  Gregory Holste, Mingquan Lin, Ruiwen Zhou, Fei Wang, Lei Liu, Qi Yan, Sarah H. Van Tassel, Kyle Kovacs, Emily Y. Chew, Zhiyong Lu, Zhangyang Wang, Yifan Peng

Link:  https://arxiv.org/abs/2405.08780v1

Date: 2024-05-14

Summary:

Deep learning has enabled breakthroughs in automated diagnosis from medical imaging, with many successful applications in ophthalmology. However, standard medical image classification approaches only assess disease presence at the time of acquisition, neglecting the common clinical setting of longitudinal imaging. For slow, progressive eye diseases like age-related macular degeneration (AMD) and primary open-angle glaucoma (POAG), patients undergo repeated imaging over time to track disease progression and forecasting the future risk of developing disease is critical to properly plan treatment. Our proposed Longitudinal Transformer for Survival Analysis (LTSA) enables dynamic disease prognosis from longitudinal medical imaging, modeling the time to disease from sequences of fundus photography images captured over long, irregular time periods. Using longitudinal imaging data from the Age-Related Eye Disease Study (AREDS) and Ocular Hypertension Treatment Study (OHTS), LTSA significantly outperformed a single-image baseline in 19/20 head-to-head comparisons on late AMD prognosis and 18/20 comparisons on POAG prognosis. A temporal attention analysis also suggested that, while the most recent image is typically the most influential, prior imaging still provides additional prognostic value.
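
A hedged sketch of the longitudinal-transformer idea: per-visit image features plus an embedding of the irregular acquisition time, a temporal transformer, and a discrete-time hazard head. Dimensions, the time encoding, and last-token pooling are assumptions for illustration, not the authors' exact LTSA architecture:

```python
import torch
import torch.nn as nn

class LongitudinalPrognosis(nn.Module):
    def __init__(self, image_encoder, d=256, n_intervals=20):
        super().__init__()
        self.encode = image_encoder                   # maps an image to a (d,) feature
        self.time_mlp = nn.Sequential(nn.Linear(1, d), nn.ReLU(), nn.Linear(d, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.hazard = nn.Linear(d, n_intervals)       # discrete-time survival head

    def forward(self, images, visit_times):
        """images: (B, T, C, H, W); visit_times: (B, T), irregularly spaced."""
        feats = torch.stack([self.encode(images[:, t])
                             for t in range(images.size(1))], dim=1)   # (B, T, d)
        feats = feats + self.time_mlp(visit_times.unsqueeze(-1))       # inject timing
        h = self.temporal(feats)
        return torch.sigmoid(self.hazard(h[:, -1]))   # hazard per future interval
```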

--------------------------------------------------------------------------------------------------------

The Power of Combined Modalities in Interactive Robot Learning

This research explores how combining diverse feedback modalities, termed "meta-modalities", impacts interactive robot learning from humans. Unlike prior work focusing on individual modalities, this study evaluates their combined effect through user studies. The findings reveal that while different modalities are perceived distinctly, their combination significantly improves learning outcomes and usability compared to traditional preference/scalar feedback. These insights open avenues for enhancing interactive robot learning systems by offering richer feedback options.

Authors:  Helen Beierling, Anna-Lisa Vollmer

Link:  https://arxiv.org/abs/2405.07817v1

Date: 2024-05-13

Summary:

This study contributes to the evolving field of robot learning in interaction with humans, examining the impact of diverse input modalities on learning outcomes. It introduces the concept of "meta-modalities" which encapsulate additional forms of feedback beyond the traditional preference and scalar feedback mechanisms. Unlike prior research that focused on individual meta-modalities, this work evaluates their combined effect on learning outcomes. Through a study with human participants, we explore user preferences for these modalities and their impact on robot learning performance. Our findings reveal that while individual modalities are perceived differently, their combination significantly improves learning behavior and usability. This research not only provides valuable insights into the optimization of human-robot interactive task learning but also opens new avenues for enhancing the interactive freedom and scaffolding capabilities provided to users in such settings.

--------------------------------------------------------------------------------------------------------

Realistic Evaluation of Toxicity in Large Language Models

While large language models (LLMs) are indispensable tools, their exposure to large datasets introduces risks of generating toxic or biased content despite safeguards. This paper introduces the Thoroughly Engineered Toxicity (TET) dataset containing adversarially crafted prompts to bypass LLM safety layers. Extensive evaluations on popular LLMs demonstrate TET's pivotal role as a rigorous toxicity benchmark, highlighting concerning model behaviors missed by normal prompts. Such realistic evaluations are crucial for assessing and mitigating toxicity risks as LLMs become ubiquitous.

Authors:  Tinh Son Luong, Thanh-Thien Le, Linh Ngo Van, Thien Huu Nguyen

Link:  https://arxiv.org/abs/2405.10659v1

Date: 2024-05-17

Summary:

Large language models (LLMs) have become integral to our professional workflows and daily lives. Nevertheless, these machine companions of ours have a critical flaw: the huge amount of data that endows them with vast and diverse knowledge also exposes them to inevitable toxicity and bias. While most LLMs incorporate defense mechanisms to prevent the generation of harmful content, these safeguards can be easily bypassed with minimal prompt engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity (TET) dataset, comprising manually crafted prompts designed to nullify the protective layers of such models. Through extensive evaluations, we demonstrate the pivotal role of TET in providing a rigorous benchmark for evaluating toxicity awareness in several popular LLMs: it highlights toxicity in LLMs that might remain hidden when using normal prompts, thus revealing subtler issues in their behavior.
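
An evaluation loop of this kind might look like the sketch below, using Detoxify as one off-the-shelf toxicity scorer (the paper's actual scoring setup may differ); `generate` is a hypothetical model-inference callable:

```python
from detoxify import Detoxify  # one off-the-shelf toxicity classifier

def benchmark_toxicity(generate, prompts, threshold=0.5):
    """Feed adversarially crafted prompts to a model and score its outputs.
    Returns the fraction of prompts that elicited a toxic completion."""
    scorer = Detoxify("original")
    outputs = [generate(p) for p in prompts]
    scores = scorer.predict(outputs)["toxicity"]
    flagged = sum(s > threshold for s in scores)
    return flagged / len(prompts)
```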

--------------------------------------------------------------------------------------------------------

BonnBot-I Plus: A Bio-diversity Aware Precise Weed Management Robotic Platform

Integrating ecological considerations into precision weeding robots is a modern agricultural challenge. This article presents the BonnBot-I Plus robotic platform which factors in bio-diversity concerns during weed management in arable farms. A novel observation model enhances weeding performance, while field experiments demonstrate the system's capability in handling diverse weed scenarios with minimal losses. By balancing effective weed control with preserving beneficial plant diversity, such bio-diversity aware robotic platforms could promote sustainable farming practices.

Authors:  Alireza Ahmadi, Michael Halstead, Claus Smitt, Chris McCool

Link:  https://arxiv.org/abs/2405.09118v1

Date: 2024-05-15

Summary:

In this article, we focus on the critical task of plant protection in arable farms, addressing a modern challenge in agriculture: integrating ecological considerations into the operational strategy of precision weeding robots like BonnBot-I. This article presents recent advancements in weed management algorithms and the real-world performance of BonnBot-I at the University of Bonn's Klein-Altendorf campus. We present a novel rolling-view observation model for BonnBot-I's weed monitoring section, which leads to an average absolute weeding performance enhancement of 3.4%. Furthermore, for the first time, we show how precision weeding robots can account for bio-diversity-aware concerns in challenging weeding scenarios. We carried out comprehensive weeding experiments in sugar-beet fields, covering both weed-only and mixed crop-weed situations, and introduced a new dataset compatible with precision weeding. Our real-field experiments revealed that our weeding approach can handle diverse weed distributions, with a minimal loss of only 11.66% attributable to intervention planning and 14.7% to vision system limitations, highlighting required improvements to the vision system.

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.