Week Ending 5.12.2024

 

RESEARCH WATCH: 5.12.2024

 

GLHF: General Learned Evolutionary Algorithm Via Hyper Functions

Pretrained optimization models promise to solve new optimization challenges efficiently, but current models face limitations in efficiency and generalization. This paper proposes GPOM, a novel pretrained black-box optimization model tailored for continuous optimization tasks. GPOM constructs a population-based model, demonstrating superior performance over existing methods on benchmarks and robotic control tasks, especially for high-dimensional problems. Its strong generalization capabilities across diverse settings highlight GPOM's potential for applications requiring efficient optimization solutions.

Authors:  Xiaobin Li, Kai Wu, Yujian Betterest Li, Xiaoyu Zhang, Handing Wang, Jing Liu

Link:  https://arxiv.org/abs/2405.03728v1

Date: 2024-05-06

Summary:

Pretrained Optimization Models (POMs) leverage knowledge gained from optimizing various tasks, providing efficient solutions for new optimization challenges through direct usage or fine-tuning. Current POMs, however, suffer from inefficiencies and limited generalization abilities; our proposed general pretrained optimization model (GPOM) addresses these shortcomings. GPOM constructs a population-based pretrained Black-Box Optimization (BBO) model tailored for continuous optimization. Evaluation on the BBOB benchmark and two robot control tasks demonstrates that GPOM outperforms other pretrained BBO models significantly, especially for high-dimensional tasks. Its direct optimization performance exceeds that of state-of-the-art evolutionary algorithms and POMs. Furthermore, GPOM exhibits robust generalization capabilities across diverse task distributions, dimensions, population sizes, and optimization horizons.
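
As a point of reference for the kind of continuous black-box optimization task GPOM targets, the sketch below runs a minimal hand-designed (mu, lambda) evolution strategy on a BBOB-style sphere function. It is not GPOM (whose update rule is learned and pretrained); the test function, dimensionality, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sphere(x):
    # BBOB-style separable sphere function (illustrative test problem).
    return np.sum(x ** 2, axis=-1)

def mu_lambda_es(f, dim=32, mu=10, lam=40, sigma=0.3, iters=200, seed=0):
    """Minimal (mu, lambda) evolution strategy: a hand-designed baseline of the
    kind GPOM is meant to replace with a learned, pretrained update rule."""
    rng = np.random.default_rng(seed)
    mean = rng.uniform(-5, 5, size=dim)               # population centre
    for _ in range(iters):
        offspring = mean + sigma * rng.standard_normal((lam, dim))
        fitness = f(offspring)
        elite = offspring[np.argsort(fitness)[:mu]]   # select the mu best
        mean = elite.mean(axis=0)                     # recombine
        sigma *= 0.99                                 # simple step-size decay
    return mean, f(mean[None, :])[0]

if __name__ == "__main__":
    x_best, f_best = mu_lambda_es(sphere)
    print(f"best value after 200 iterations: {f_best:.4e}")
```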

--------------------------------------------------------------------------------------------------------

ERATTA: Extreme RAG for Table To Answers with Large Language Models

Large language models with retrieval-augmented generation have seen widespread use, but their applications have been either generic or narrowly domain-specific. This work proposes a unique system leveraging multiple language models for data authentication, query routing, retrieval, and prompting to enable scalable question answering from large, varying data tables. With real-time response times under 10 seconds and high confidence across domains, this system shows promise for applications requiring efficient information extraction from enterprise data products.

Authors:  Sohini Roychowdhury, Marko Krema, Anvar Mahammad, Brian Moore, Arijit Mukherjee, Punit Prakashchandra

Link:  https://arxiv.org/abs/2405.03963v1

Date: 2024-05-07

Summary:

Large language models (LLMs) with retrieval-augmented generation (RAG) have been the optimal choice for scalable generative AI solutions in the recent past. However, the use-cases that incorporate RAG with LLMs have been either generic or extremely domain specific, thereby questioning the scalability and generalizability of RAG-LLM approaches. In this work, we propose a unique LLM-based system where multiple LLMs can be invoked to enable data authentication, user query routing, data retrieval, and custom prompting for question answering capabilities from data tables that are highly varying and large in size. Our system is tuned to extract information from enterprise-level data products and furnish real-time responses under 10 seconds. One prompt manages user-to-data authentication, followed by three prompts to route the query, fetch data, and generate customizable natural-language responses. Additionally, we propose a five-metric scoring module that detects and reports hallucinations in the LLM responses. Our proposed system and scoring metrics achieve >90% confidence scores across hundreds of user queries in the sustainability, financial health and social media domains. Extensions to the proposed extreme RAG architectures can enable heterogeneous source querying using LLMs.
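
As a rough illustration of the prompt chain the summary describes (one authentication prompt followed by three prompts for routing, data fetching, and response generation, plus a hallucination-scoring module), here is a minimal Python skeleton. The function names, the `call_llm` placeholder, and the token-overlap scoring stub are assumptions for illustration; the paper's actual prompts and five-metric scorer are not reproduced.

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM endpoint the deployment uses (assumption)."""
    raise NotImplementedError

@dataclass
class TableQAResult:
    answer: str
    confidence: float  # stand-in for the paper's five-metric hallucination score

def answer_table_question(user_id: str, query: str) -> TableQAResult:
    # Prompt 1: user-to-data authentication (which tables may this user see?).
    allowed_tables = call_llm(f"List tables accessible to user {user_id}.")
    # Prompt 2: route the query to the relevant table(s).
    route = call_llm(f"Given tables {allowed_tables}, which answer: {query}?")
    # Prompt 3: fetch/summarize the relevant rows from the routed table.
    rows = call_llm(f"Extract rows from {route} relevant to: {query}")
    # Prompt 4: generate the customized natural-language response.
    answer = call_llm(f"Answer '{query}' using only these rows:\n{rows}")
    confidence = score_hallucination(answer, rows)
    return TableQAResult(answer=answer, confidence=confidence)

def score_hallucination(answer: str, evidence: str) -> float:
    """Toy proxy: fraction of answer tokens that appear in the evidence."""
    tokens = answer.lower().split()
    if not tokens:
        return 0.0
    supported = sum(t in evidence.lower() for t in tokens)
    return supported / len(tokens)
```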

--------------------------------------------------------------------------------------------------------

TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

While recent text-to-video models can generate high-quality single-scene videos, generating coherent multi-scene videos remains challenging. This paper introduces TALC, a framework enhancing text-to-video models to recognize temporal alignment between video scenes and text descriptions, enabling generation of visually consistent multi-scene videos adhering to text. With significant performance gains, TALC has potential applications in fields like filmmaking, advertising, and content creation.

Authors:  Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang

Link:  https://arxiv.org/abs/2405.04682v1

Date: 2024-05-07

Summary:

Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models often produce single-scene video clips that depict an entity performing a particular action (e.g., "a red panda climbing a tree"). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real world (e.g., "a red panda climbing a tree" followed by "the red panda sleeps on the top of the tree"). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., "a red panda climbing a tree") and second scene description (e.g., "the red panda sleeps on the top of the tree"), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions while remaining visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which averages visual consistency and text adherence using human evaluation. The project website is https://talc-mst2v.github.io/.
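
The core conditioning idea, as the summary describes it, is to give each temporal segment of the video the embedding of its own scene caption rather than one pooled prompt embedding. A minimal sketch of that assignment step follows; the even frame split, frame count, and embedding size are assumptions, and the real mechanism operates inside a diffusion T2V architecture.

```python
import numpy as np

def time_aligned_conditioning(scene_text_embs, total_frames):
    """Assign each frame the text embedding of its own scene (TALC-style
    alignment), instead of one pooled embedding for the whole prompt.

    scene_text_embs: (num_scenes, emb_dim) array, one embedding per caption.
    Returns: (total_frames, emb_dim) per-frame conditioning matrix.
    """
    num_scenes, emb_dim = scene_text_embs.shape
    # Split the video evenly across scenes (an assumption; the alignment in the
    # actual model need not be a uniform split).
    bounds = np.linspace(0, total_frames, num_scenes + 1).astype(int)
    cond = np.zeros((total_frames, emb_dim))
    for s in range(num_scenes):
        cond[bounds[s]:bounds[s + 1]] = scene_text_embs[s]
    return cond

# Two-scene example: "a red panda climbing a tree" then
# "the red panda sleeps on the top of the tree".
scene_embs = np.random.randn(2, 512)   # stand-ins for text-encoder outputs
frame_cond = time_aligned_conditioning(scene_embs, total_frames=16)
print(frame_cond.shape)  # (16, 512): frames 0-7 see scene 1, frames 8-15 scene 2
```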

--------------------------------------------------------------------------------------------------------

A First Step in Using Machine Learning Methods to Enhance Interaction Analysis for Embodied Learning Environments

Analyzing complex multimodal data from embodied mixed-reality learning environments is crucial but labor-intensive. This study combines machine learning and multimodal analytics to support researchers' efforts, presenting a case study to investigate students' learning progressions through visualizing their states, actions, gaze, affect, and movement. This approach can simplify researchers' tasks and provide insights into embodied learning, with potential applications in educational research and learning environment design.

Authors:  Joyce Fonteles, Eduardo Davalos, Ashwin T. S., Yike Zhang, Mengxi Zhou, Efrat Ayalon, Alicia Lane, Selena Steinberg, Gabriella Anton, Joshua Danish, Noel Enyedy, Gautam Biswas

Link:  https://arxiv.org/abs/2405.06203v1

Date: 2024-05-10

Summary:

Investigating children's embodied learning in mixed-reality environments, where they collaboratively simulate scientific processes, requires analyzing complex multimodal data to interpret their learning and coordination behaviors. Learning scientists have developed Interaction Analysis (IA) methodologies for analyzing such data, but this requires researchers to watch hours of videos to extract and interpret students' learning patterns. Our study aims to simplify researchers' tasks, using Machine Learning and Multimodal Learning Analytics to support the IA processes. Our study combines machine learning algorithms and multimodal analyses to support and streamline researcher efforts in developing a comprehensive understanding of students' scientific engagement through their movements, gaze, and affective responses in a simulated scenario. To facilitate an effective researcher-AI partnership, we present an initial case study to determine the feasibility of visually representing students' states, actions, gaze, affect, and movement on a timeline. Our case study focuses on a specific science scenario where students learn about photosynthesis. The timeline allows us to investigate the alignment of critical learning moments identified by multimodal and interaction analysis, and uncover insights into students' temporal learning progressions.

--------------------------------------------------------------------------------------------------------

Precision Rehabilitation for Patients Post-Stroke based on Electronic Health Records and Machine Learning

This study utilized natural language processing and machine learning on electronic health records data to identify impactful rehabilitation exercises and predict functional outcomes for post-stroke patients. Identifying key exercises and accurately forecasting patient-specific progress can inform precision rehabilitation approaches, potentially improving post-stroke care and recovery.

Authors:  Fengyi Gao, Xingyu Zhang, Sonish Sivarajkumar, Parker Denny, Bayan Aldhahwani, Shyam Visweswaran, Ryan Shi, William Hogan, Allyn Bove, Yanshan Wang

Link:  https://arxiv.org/abs/2405.05993v1

Date: 2024-05-09

Summary:

In this study, we utilized statistical analysis and machine learning methods to examine whether rehabilitation exercises can improve post-stroke patients' functional abilities, as well as to forecast the improvement in functional abilities. Our dataset comprises patients' rehabilitation exercises and demographic information recorded in unstructured electronic health record (EHR) data and free-text rehabilitation procedure notes. We collected data for 265 stroke patients from the University of Pittsburgh Medical Center. We employed a pre-existing natural language processing (NLP) algorithm to extract data on rehabilitation exercises and developed a rule-based NLP algorithm to extract Activity Measure for Post-Acute Care (AM-PAC) scores, covering the basic mobility (BM) and applied cognitive (AC) domains, from procedure notes. Changes in AM-PAC scores were classified based on the minimal clinically important difference (MCID), and significance was assessed using Friedman and Wilcoxon tests. To identify impactful exercises, we used Chi-square tests, Fisher's exact tests, and logistic regression for odds ratios. Additionally, we developed five machine learning models (logistic regression (LR), AdaBoost (ADB), support vector machine (SVM), gradient boosting (GB), and random forest (RF)) to predict outcomes in functional ability. Statistical analyses revealed significant associations between functional improvements and specific exercises. The RF model achieved the best performance in predicting functional outcomes. In this study, we identified three rehabilitation exercises that significantly contributed to post-stroke patients' functional ability improvement in the first two months. Additionally, the successful application of a machine learning model to predict patient-specific functional outcomes underscores the potential for precision rehabilitation.
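
To make the analysis pipeline concrete, the sketch below mirrors two of the steps named in the summary (a Wilcoxon signed-rank test on paired AM-PAC scores and a random forest predicting MCID-level improvement) on synthetic data with SciPy and scikit-learn. The feature set, MCID threshold, and score distributions are illustrative, not the study's.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 265  # cohort size mentioned in the summary; the data here are synthetic

# Synthetic AM-PAC basic-mobility scores at baseline and after two months.
ampac_before = rng.normal(40, 6, n)
ampac_after = ampac_before + rng.normal(3, 4, n)

# Wilcoxon signed-rank test on paired before/after scores.
stat, p = wilcoxon(ampac_after, ampac_before)
print(f"Wilcoxon statistic={stat:.1f}, p={p:.3g}")

# Label improvement by a minimal clinically important difference (illustrative).
MCID = 3.3
improved = (ampac_after - ampac_before >= MCID).astype(int)

# Toy features: age and indicators for three rehabilitation exercises.
X = np.column_stack([
    rng.normal(68, 10, n),          # age
    rng.integers(0, 2, (n, 3)),     # exercise A/B/C performed?
])
X_tr, X_te, y_tr, y_te = train_test_split(X, improved, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```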

--------------------------------------------------------------------------------------------------------

From Algorithm to Hardware: A Survey on Efficient and Safe Deployment of Deep Neural Networks

Deep neural networks face significant challenges in deployment due to high memory, energy, and computation costs. This comprehensive survey covers recent research on model compression techniques, hardware accelerator design, and security approaches to enable efficient and secure deployment of DNNs. With insights into optimizing DNNs from algorithms to hardware, this survey can guide development of high-performance, cost-effective, and secure AI solutions across various applications.

Authors:  Xue Geng, Zhe Wang, Chunyun Chen, Qing Xu, Kaixin Xu, Chao Jin, Manas Gupta, Xulei Yang, Zhenghua Chen, Mohamed M. Sabry Aly, Jie Lin, Min Wu, Xiaoli Li

Link:  https://arxiv.org/abs/2405.06038v1

Date: 2024-05-09

Summary:

Deep neural networks (DNNs) have been widely used in many artificial intelligence (AI) tasks. However, deploying them brings significant challenges due to the huge cost of memory, energy, and computation. To address these challenges, researchers have developed various model compression techniques such as model quantization and model pruning. Recently, there has been a surge in research on compression methods that achieve model efficiency while retaining performance. Furthermore, more and more works focus on customizing DNN hardware accelerators to better leverage these model compression techniques. In addition to efficiency, preserving security and privacy is critical for deploying DNNs. However, the vast and diverse body of related works can be overwhelming. This inspires us to conduct a comprehensive survey on recent research toward the goal of high-performance, cost-efficient, and safe deployment of DNNs. Our survey first covers the mainstream model compression techniques such as model quantization, model pruning, knowledge distillation, and optimizations of non-linear operations. We then introduce recent advances in designing hardware accelerators that can adapt to efficient model compression approaches. Additionally, we discuss how homomorphic encryption can be integrated to secure DNN deployment. Finally, we discuss several issues, such as hardware evaluation, generalization, and integration of various compression approaches. Overall, we aim to provide a big picture of efficient DNNs, from algorithms to hardware accelerators and security perspectives.
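
Of the compression techniques the survey covers, post-training quantization is the simplest to illustrate. The sketch below applies symmetric uniform int8 quantization to a weight tensor; it is a generic textbook example, not a specific scheme from the survey.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric uniform quantization of a float tensor to int8.
    Returns the quantized integers and the scale needed to dequantize."""
    scale = np.max(np.abs(w)) / 127.0 if np.any(w) else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)   # a fake weight matrix
q, scale = quantize_int8(w)
err = np.mean((w - dequantize(q, scale)) ** 2)
print(f"int8 storage is 4x smaller; mean squared rounding error = {err:.2e}")
```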

--------------------------------------------------------------------------------------------------------

Guiding the Way: A Comprehensive Examination of AI Guidelines in Global Media

As AI adoption in news media increases, this study analyzes 37 AI guidelines from 17 countries, revealing shared principles like transparency, fairness, and journalistic value preservation. Highlighting uneven geographical distribution, the findings serve as a resource for media organizations, policymakers and stakeholders working towards an inclusive, equitable digital future for journalism aided by responsible AI implementation.

Authors:  M. F. de-Lima-Santos, W. N. Yeung, T. Dodds

Link:  https://arxiv.org/abs/2405.04706v1

Date: 2024-05-07

Summary:

With the increasing adoption of artificial intelligence (AI) technologies in the news industry, media organizations have begun publishing guidelines that aim to promote the responsible, ethical, and unbiased implementation of AI-based technologies. These guidelines are expected to serve journalists and media workers by establishing best practices and a framework that helps them navigate ever-evolving AI tools. Drawing on institutional theory and digital inequality concepts, this study analyzes 37 AI guidelines for media purposes in 17 countries. Our analysis reveals key thematic areas, such as transparency, accountability, fairness, privacy, and the preservation of journalistic values. Results highlight shared principles and best practices that emerge from these guidelines, including the importance of human oversight, explainability of AI systems, disclosure of automated content, and protection of user data. However, the geographical distribution of these guidelines, which highlights the dominance of Western nations, particularly North America and Europe, can deepen ongoing concerns about power asymmetries in AI adoption and the resulting isomorphism outside these regions. Our results may serve as a resource for news organizations, policymakers, and stakeholders looking to navigate complex AI development toward a more inclusive and equitable digital future for the media industry worldwide.

--------------------------------------------------------------------------------------------------------

When Are Combinations of Humans and AI Useful?

Through a meta-analysis of over 100 studies, this work provides insights into when combining humans and AI is beneficial over either alone. Identifying performance trade-offs across tasks and human-AI performance differences, the findings point to promising ways to improve human-AI collaboration systems for various applications involving decision-making and content creation.

Authors:  Michelle Vaccaro, Abdullah Almaatouq, Thomas Malone

Link:  https://arxiv.org/abs/2405.06087v1

Date: 2024-05-09

Summary:

Inspired by the increasing use of AI to augment humans, researchers have studied human-AI systems involving different tasks, systems, and populations. Despite such a large body of work, we lack a broad conceptual understanding of when combinations of humans and AI are better than either alone. Here, we addressed this question by conducting a meta-analysis of over 100 recent experimental studies reporting over 300 effect sizes. First, we found that, on average, human-AI combinations performed significantly worse than the best of humans or AI alone. Second, we found performance losses in tasks that involved making decisions and significantly greater gains in tasks that involved creating content. Finally, when humans outperformed AI alone, we found performance gains in the combination, but when the AI outperformed humans alone we found losses. These findings highlight the heterogeneity of the effects of human-AI collaboration and point to promising avenues for improving human-AI systems.
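
The headline numbers come from pooling effect sizes across studies. As a reference for what such pooling involves, the sketch below computes an inverse-variance weighted (fixed-effect) mean over synthetic effect sizes; the paper's actual meta-analysis, with random effects and moderators, is more involved and is not reproduced here.

```python
import numpy as np

def fixed_effect_mean(effects, variances):
    """Inverse-variance weighted mean effect size and its standard error."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    pooled = np.sum(weights * effects) / np.sum(weights)
    se = np.sqrt(1.0 / np.sum(weights))
    return pooled, se

# Synthetic per-study effects: positive = the human-AI combination beats the
# better of human or AI alone, negative = it does worse.
effects = [0.20, -0.35, 0.05, -0.50, 0.10, -0.15]
variances = [0.04, 0.02, 0.05, 0.03, 0.06, 0.02]

g, se = fixed_effect_mean(effects, variances)
print(f"pooled effect g = {g:.2f} (95% CI {g - 1.96*se:.2f} to {g + 1.96*se:.2f})")
```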

--------------------------------------------------------------------------------------------------------

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

This paper introduces a benchmark to test the ability of current interpretability methods to detect "alignment fakers" - AI language models pretending to be aligned while evaluated but misbehaving when given the opportunity. With promising detection strategies identified, this work has implications for developing trustworthy and transparently behaving AI systems for high-stakes applications.

Authors:  Joshua Clymer, Caden Juang, Severin Field

Link:  https://arxiv.org/abs/2405.05466v1

Date: 2024-05-08

Summary:

Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while evaluated and misbehave when they have a good opportunity. Can current interpretability methods catch these 'alignment fakers?' To answer this question, we introduce a benchmark that consists of 324 pairs of LLMs fine-tuned to select actions in role-play scenarios. One model in each pair is consistently benign (aligned). The other model misbehaves in scenarios where it is unlikely to be caught (alignment faking). The task is to identify the alignment faking model using only inputs where the two models behave identically. We test five detection strategies, one of which identifies 98% of alignment-fakers.
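
The benchmark itself is easy to restate as an evaluation harness: given the (benign, faker) model pairs and inputs on which each pair behaves identically, a detection strategy must name the faker. A minimal sketch of that harness follows; the `detector` callable is a stand-in for the paper's five strategies, which are not reproduced here.

```python
import random

def evaluate_detector(model_pairs, benign_inputs, detector, seed=0):
    """Score a detection strategy on a Poser-style benchmark: each pair holds
    one consistently benign model and one alignment-faking model, and the
    detector only sees inputs on which the two behave identically.

    `detector` is any callable (model_a, model_b, inputs) -> 0 or 1 naming
    which model it believes is the faker (e.g., by probing internals).
    """
    rng = random.Random(seed)
    correct = 0
    for benign_model, faker_model in model_pairs:
        # Randomize presentation order so the detector cannot cheat on position.
        if rng.random() < 0.5:
            models, faker_index = (benign_model, faker_model), 1
        else:
            models, faker_index = (faker_model, benign_model), 0
        correct += int(detector(models[0], models[1], benign_inputs) == faker_index)
    return correct / len(model_pairs)
```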

--------------------------------------------------------------------------------------------------------

VACO: a Multi-perspective Development of a Therapeutic and Motivational Virtual Robotic Agent for Concentration for children with ADHD

Presenting a novel approach leveraging AI to support attention training for children with ADHD through a motivational virtual agent, this work outlines a participatory development process incorporating perspectives from parents, clinicians, developers and end-users. The system's potential to aid ADHD therapy by adapting software to users' needs highlights the value of inclusive, multi-stakeholder AI development for healthcare applications.

Authors:  Birte Richter, Ira-Katharina Petras, Anna-Lisa Vollmer, Ayla Luong, Michael Siniatchkin, Britta Wrede

Link:  https://arxiv.org/abs/2405.03354v1

Date: 2024-05-06

Summary:

In this work, we present (i) a novel approach to how artificial intelligence can support therapy aimed at improving the concentration of children with Attention Deficit Hyperactivity Disorder (ADHD) through motivational attention training with a virtual robotic agent, and (ii) a development process in which different stakeholders are included with their perspectives. To this end, we present three participatory approaches to include the perspectives of different stakeholders. An online survey (Study I) was conducted with parents in Germany with the aim of ascertaining whether they would use software to promote their children's attention, what influences their attitude towards using it, and what requirements it would have to meet. About half of the parents would be willing to use software to promote attention. To develop the software as close to practice as possible, one of the developers took part in an intensive ADHD training program with the aim of testing which of its elements are technically feasible. Afterward, a first prototype was presented to clinicians (Study II) to make further adjustments. A first feasibility test (Study III) was conducted with the end users to check whether the system works and whether children and adolescents can use it. Attentional performance software offers multiple opportunities in the treatment of ADHD if the system is adapted to the needs of the practitioner and end user. This development process requires a lot of time and close interdisciplinary collaboration.

--------------------------------------------------------------------------------------------------------

Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media

Substance use disorders have far-reaching impacts, necessitating data-driven research to understand trends and effects. This paper introduces Reddit-Impacts, a named entity recognition dataset curated from social media discussions on substance use experiences. By enabling automatic detection of clinical and social impacts from text data, the dataset can facilitate insights into how substance use affects individual health and societal dynamics, informing effective public health strategies.

Authors:  Yao Ge, Sudeshna Das, Karen O'Connor, Mohammed Ali Al-Garadi, Graciela Gonzalez-Hernandez, Abeed Sarker

Link:  https://arxiv.org/abs/2405.06145v1

Date: 2024-05-09

Summary:

Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use: its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances including, but not limited to, opioids, stimulants, and benzodiazepines. Our objective is to create a resource that can enable the development of systems that automatically detect clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies. In addition to creating the annotated dataset, we applied several machine learning models to establish baseline performances. Specifically, we experimented with transformer models such as BERT and RoBERTa, the few-shot learning model DANN (leveraging the full training dataset), and GPT-3.5 with one-shot learning for automatic NER of clinical and social impacts. The dataset has been made available through the 2024 SMM4H shared tasks.
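
The transformer baselines mentioned (BERT, RoBERTa) are standard token-classification fine-tunes. A hedged sketch of how such a baseline is typically set up with the Hugging Face transformers API follows; the BIO label set is hypothetical, and the model's predictions are meaningless until it is fine-tuned on the released data.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label set for clinical and social impact spans.
labels = ["O", "B-ClinicalImpact", "I-ClinicalImpact",
          "B-SocialImpact", "I-SocialImpact"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

text = "Lost my job and my sleep is wrecked since I started using again."
enc = tokenizer(text, return_tensors="pt")
logits = model(**enc).logits          # (1, num_tokens, num_labels), untrained head
pred_ids = logits.argmax(-1)[0].tolist()
print([labels[i] for i in pred_ids])  # placeholder output until fine-tuned
```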

--------------------------------------------------------------------------------------------------------

Towards Less Biased Data-driven Scoring with Deep Learning-Based End-to-end Database Search in Tandem Mass Spectrometry

Identifying peptides from tandem mass spectrometry data is crucial for understanding protein functions, but traditional database search methods face limitations. This work proposes DeepSearch, a deep learning-based end-to-end search method adopting a data-driven scoring approach and enabling zero-shot profiling of modifications. With reduced bias and robust performance across datasets, DeepSearch offers a promising direction for advancing peptide identification and proteomics research.

Authors:  Yonghan Yu, Ming Li

Link:  https://arxiv.org/abs/2405.06511v1

Date: 2024-05-08

Summary:

Peptide identification in mass spectrometry-based proteomics is crucial for understanding protein function and dynamics. Traditional database search methods, though widely used, rely on heuristic scoring functions, and statistical estimation must be introduced to achieve a higher identification rate. Here, we introduce DeepSearch, the first deep learning-based end-to-end database search method for tandem mass spectrometry. DeepSearch leverages a modified transformer-based encoder-decoder architecture under the contrastive learning framework. Unlike conventional methods that rely on ion-to-ion matching, DeepSearch adopts a data-driven approach to score peptide spectrum matches. DeepSearch is also the first deep learning-based method that can profile variable post-translational modifications in a zero-shot manner. We showed that DeepSearch's scoring scheme expressed less bias and did not require any statistical estimation. We validated DeepSearch's accuracy and robustness across various datasets, including those from species with diverse protein compositions and a modification-enriched dataset. DeepSearch sheds new light on database search methods in tandem mass spectrometry.
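
DeepSearch scores peptide-spectrum matches with embeddings trained under a contrastive objective. The sketch below shows that scoring idea with an InfoNCE-style softmax over cosine similarities between stand-in embeddings; the encoder, embedding size, and temperature are assumptions, not the paper's.

```python
import numpy as np

def contrastive_scores(spectrum_emb, peptide_embs, temperature=0.07):
    """Cosine-similarity scores between one spectrum and candidate peptides,
    normalized with a softmax as in InfoNCE-style contrastive training."""
    s = spectrum_emb / np.linalg.norm(spectrum_emb)
    p = peptide_embs / np.linalg.norm(peptide_embs, axis=1, keepdims=True)
    logits = p @ s / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Stand-ins for encoder outputs: one observed spectrum and 1000 candidate
# peptides from the database within its precursor mass window.
rng = np.random.default_rng(0)
spectrum = rng.standard_normal(256)
candidates = rng.standard_normal((1000, 256))
scores = contrastive_scores(spectrum, candidates)
print("best-scoring candidate index:", int(scores.argmax()))
```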

--------------------------------------------------------------------------------------------------------

Prototype2Code: End-to-end Front-end Code Generation from UI Design Prototypes

While UI-to-code technologies streamline development, existing approaches face robustness and quality issues. Prototype2Code introduces an end-to-end framework incorporating design linting, structure optimization, and responsive layout support to generate more readable, maintainable code aligning closely with design prototypes. Outperforming commercial tools, Prototype2Code can enhance developer productivity and meet industrial needs for front-end applications across devices.

Authors:  Shuhong Xiao, Yunnong Chen, Jiazhi Li, Liuqing Chen, Lingyun Sun, Tingting Zhou

Link:  https://arxiv.org/abs/2405.04975v1

Date: 2024-05-08

Summary:

UI-to-code technology has streamlined the front-end development process, reducing repetitive tasks for engineers. Prior research mainly uses design prototypes as inputs, with the effectiveness of the generated code heavily dependent on the prototypes' quality, leading to compromised robustness. Moreover, these approaches also exhibit shortcomings in code quality, including issues such as disorganized UI structures and the inability to support responsive layouts. To address these challenges, we introduce Prototype2Code, which achieves end-to-end front-end code generation aligned with business demands. For Prototype2Code, we incorporate design linting into the workflow, addressing the detection of fragmented elements and perceptual groups and enhancing the robustness of the generated outcomes. By optimizing the hierarchical structure and intelligently recognizing UI element types, Prototype2Code generates code that is more readable and structurally clearer. To meet responsive design requirements, Prototype2Code primarily supports the flexbox layout model, ensuring code compatibility across various device sizes. To validate its efficacy, we compare Prototype2Code with the commercial code generation platform CodeFun and with Screenshot-to-code based on GPT-4 with vision. Employing the structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and mean squared error (MSE) for visual similarity assessment, Prototype2Code's rendered UI effects align most closely with the design prototypes, exhibiting the smallest errors. We also conduct a user study with five experienced front-end engineers, inviting them to review and revise code generated by the three methods. As a result, Prototype2Code surpasses the other methods in readability, usability, and maintainability, better meeting the business needs of industrial development.
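
For reference, the three visual-similarity metrics used in the evaluation (SSIM, PSNR, and MSE) can be computed with scikit-image as in the sketch below; the image paths are placeholders, and the comparison assumes both screenshots are rendered at the same resolution.

```python
import numpy as np
from skimage.io import imread
from skimage.metrics import (mean_squared_error, peak_signal_noise_ratio,
                             structural_similarity)

def ui_similarity(prototype_path, rendered_path):
    """Compare a design prototype image with a rendered UI screenshot."""
    a = imread(prototype_path, as_gray=True).astype(np.float64)
    b = imread(rendered_path, as_gray=True).astype(np.float64)
    assert a.shape == b.shape, "screenshots must be rendered at the same size"
    return {
        "MSE": mean_squared_error(a, b),
        "PSNR": peak_signal_noise_ratio(a, b, data_range=1.0),
        "SSIM": structural_similarity(a, b, data_range=1.0),
    }

# Placeholder paths; in practice these are the prototype export and a browser
# screenshot of the generated page.
# print(ui_similarity("prototype.png", "rendered.png"))
```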

--------------------------------------------------------------------------------------------------------

SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems

Reinforcement learning often suffers from memory constraints impacting training efficiency. SwiftRL explores Processing-In-Memory architectures to accelerate RL workloads, achieving near-linear scaling by optimizing RL algorithms like Q-learning for PIM hardware. With superior performance over CPUs and GPUs demonstrated, SwiftRL can enable more efficient RL applications by leveraging emerging PIM systems.

Authors:  Kailash Gogineni, Sai Santosh Dayapule, Juan Gómez-Luna, Karthikeya Gogineni, Peng Wei, Tian Lan, Mohammad Sadrosadati, Onur Mutlu, Guru Venkataramani

Link:  https://arxiv.org/abs/2405.03967v1

Date: 2024-05-07

Summary:

Reinforcement Learning (RL) trains agents to learn optimal behavior by maximizing reward signals from experience datasets. However, RL training often faces memory limitations, leading to execution latencies and prolonged training times. To overcome this, SwiftRL explores Processing-In-Memory (PIM) architectures to accelerate RL workloads. We achieve near-linear performance scaling by implementing RL algorithms like Tabular Q-learning and SARSA on UPMEM PIM systems and optimizing for the hardware. Our experiments on OpenAI Gym environments using UPMEM hardware demonstrate superior performance compared to CPU and GPU implementations.
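
For context, the tabular Q-learning that SwiftRL ports to UPMEM PIM hardware fits in a few lines; below is the standard CPU/NumPy version on a toy environment from Gymnasium (the maintained successor of OpenAI Gym), included only to show the update rule the PIM implementation accelerates. Environment choice and hyperparameters are illustrative.

```python
import numpy as np
import gymnasium as gym   # successor package to OpenAI Gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1   # learning rate, discount, exploration

rng = np.random.default_rng(0)
for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if rng.random() < eps:
            action = env.action_space.sample()
        else:
            action = int(Q[state].argmax())
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # tabular Q-learning update: the kernel a PIM port parallelizes
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print("greedy policy:", Q.argmax(axis=1))
```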

--------------------------------------------------------------------------------------------------------

Position: Leverage Foundational Models for Black-Box Optimization

While large language models have driven innovation across AI domains, their impact on black-box optimization for experimental design has been limited. This position paper frames black-box optimization around foundational sequence models, discussing how language models can revolutionize optimization by enriching task comprehension from text, devising superior strategies via flexible sequence modeling, and enhancing performance prediction for unseen search spaces.

Authors:  Xingyou Song, Yingtao Tian, Robert Tjarko Lange, Chansoo Lee, Yujin Tang, Yutian Chen

Link:  https://arxiv.org/abs/2405.03547v2

Date: 2024-05-09

Summary:

Undeniably, Large Language Models (LLMs) have stirred an extraordinary wave of innovation in the machine learning research domain, resulting in substantial impact across diverse fields such as reinforcement learning, robotics, and computer vision. Their incorporation has been rapid and transformative, marking a significant paradigm shift in the field of machine learning research. However, the field of experimental design, grounded on black-box optimization, has been much less affected by such a paradigm shift, even though integrating LLMs with optimization presents a unique landscape ripe for exploration. In this position paper, we frame the field of black-box optimization around sequence-based foundation models and organize their relationship with previous literature. We discuss the most promising ways foundational language models can revolutionize optimization, which include harnessing the vast wealth of information encapsulated in free-form text to enrich task comprehension, utilizing highly flexible sequence models such as Transformers to engineer superior optimization strategies, and enhancing performance prediction over previously unseen search spaces.

--------------------------------------------------------------------------------------------------------

Active Sensing for Multiuser Beam Tracking with Reconfigurable Intelligent Surface

This paper tackles beam tracking between an access point and multiple mobile users, where the access point dynamically adjusts beamformers and a reconfigurable intelligent surface based on received pilots. Proposing a deep learning framework using recurrent and graph neural networks to leverage channel state trajectory, it demonstrates improved performance over existing data-driven schemes for this challenging active sensing problem.

Authors:  Han Han, Tao Jiang, Wei Yu

Link:  https://arxiv.org/abs/2405.03129v1

Date: 2024-05-06

Summary:

This paper studies a beam tracking problem in which an access point (AP), in collaboration with a reconfigurable intelligent surface (RIS), dynamically adjusts its downlink beamformers and the reflection pattern at the RIS in order to maintain reliable communications with multiple mobile user equipments (UEs). Specifically, the mobile UEs send uplink pilots to the AP periodically during the channel sensing intervals; the AP then adaptively configures the beamformers and the RIS reflection coefficients for subsequent data transmission based on the received pilots. This is an active sensing problem, because channel sensing involves configuring the RIS coefficients during the pilot stage and the optimal sensing strategy should exploit the trajectory of channel state information (CSI) from previously received pilots. Finding an analytical solution to such an active sensing problem is very challenging. In this paper, we propose a deep learning framework utilizing a recurrent neural network (RNN) to automatically summarize the time-varying CSI obtained from the periodically received pilots into state vectors. These state vectors are then mapped to the AP beamformers and RIS reflection coefficients for subsequent downlink data transmissions, as well as the RIS reflection coefficients for the next round of uplink channel sensing. The mappings from the state vectors to the downlink beamformers and the RIS reflection coefficients for both channel sensing and downlink data transmission are performed using graph neural networks (GNNs) to account for the interference among the UEs. Simulations demonstrate significant and interpretable performance improvement of the proposed approach over the existing data-driven methods with nonadaptive channel sensing schemes.
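
As a rough picture of the learning architecture described above, the sketch below uses a GRU to summarize pilots received over successive sensing rounds into a state vector, and a linear head to map that state to RIS phase shifts and per-user beamformers. It is a toy stand-in: the paper uses graph neural networks to capture inter-user interference, and all dimensions here are invented.

```python
import torch
import torch.nn as nn

class ActiveSensingSketch(nn.Module):
    """Toy version of the pipeline in the summary: a GRU summarizes received
    pilots over sensing rounds into a state vector, and a small linear head
    (standing in for the paper's GNN) maps the state to RIS phase shifts and
    complex per-user beamformers."""
    def __init__(self, pilot_dim=64, hidden=128, n_ris=256, n_users=4, n_ant=8):
        super().__init__()
        self.gru = nn.GRU(pilot_dim, hidden, batch_first=True)
        self.to_ris = nn.Linear(hidden, n_ris)                  # RIS phase shifts
        self.to_beam = nn.Linear(hidden, n_users * 2 * n_ant)   # real+imag parts
        self.n_users, self.n_ant = n_users, n_ant

    def forward(self, pilots):                # pilots: (batch, rounds, pilot_dim)
        _, h = self.gru(pilots)               # h: (1, batch, hidden)
        state = h[-1]                         # summarized CSI state vector
        ris_phase = torch.pi * torch.tanh(self.to_ris(state))
        w = self.to_beam(state).view(-1, self.n_users, self.n_ant, 2)
        beamformers = torch.view_as_complex(w.contiguous())
        return ris_phase, beamformers

model = ActiveSensingSketch()
ris, w = model(torch.randn(2, 5, 64))         # 2 samples, 5 sensing rounds
print(ris.shape, w.shape)                     # (2, 256) and (2, 4, 8)
```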

--------------------------------------------------------------------------------------------------------

Are EEG-to-Text Models Working?

Current open-vocabulary EEG-to-text translation models face crucial limitations around inflated metrics from flawed evaluation setups and lack of noise input benchmarking. This work analyzes these limitations, proposing methodologies to differentiate models truly learning from EEG signals versus memorizing data. The findings highlight needs for stricter evaluation practices to develop robust brain-computer interfaces and communication systems.

Authors:  Hyejeong Jo, Yiqian Yang, Juhyeok Han, Yiqun Duan, Hui Xiong, Won Hee Lee

Link:  https://arxiv.org/abs/2405.06459v1

Date: 2024-05-10

Summary:

This work critically analyzes existing models for open-vocabulary EEG-to-Text translation. We identify a crucial limitation: previous studies often employed implicit teacher-forcing during evaluation, artificially inflating performance metrics. Additionally, they lacked a critical benchmark: comparing model performance on pure noise inputs. We propose a methodology to differentiate between models that truly learn from EEG signals and those that simply memorize training data. Our analysis reveals that model performance on noise data can be comparable to that on EEG data. These findings highlight the need for stricter evaluation practices in EEG-to-Text research, emphasizing transparent reporting and rigorous benchmarking with noise inputs. This approach will lead to more reliable assessments of model capabilities and pave the way for robust EEG-to-Text communication systems.
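
The proposed sanity check is straightforward to express in code: decode the same trained model on real EEG features and on shape-matched Gaussian noise, both without teacher forcing, and compare the text metrics. The sketch below is a hedged outline of that comparison; the model and metric are placeholders, not the paper's implementation.

```python
import numpy as np

def decode_without_teacher_forcing(model, inputs):
    """Placeholder: autoregressive decoding from EEG features, with no gold
    tokens fed back in during generation."""
    return [model(x) for x in inputs]

def corpus_metric(predictions, references):
    """Placeholder for BLEU/ROUGE or any standard text-generation metric."""
    raise NotImplementedError

def noise_benchmark(model, eeg_inputs, references, rng=np.random.default_rng(0)):
    """Compare decoding quality on real EEG versus shape-matched Gaussian noise.
    If the two scores are close, the model is not actually reading the EEG."""
    noise_inputs = [rng.standard_normal(x.shape) for x in eeg_inputs]
    score_eeg = corpus_metric(decode_without_teacher_forcing(model, eeg_inputs), references)
    score_noise = corpus_metric(decode_without_teacher_forcing(model, noise_inputs), references)
    return {"eeg": score_eeg, "noise": score_noise, "gap": score_eeg - score_noise}
```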

--------------------------------------------------------------------------------------------------------

A review on discriminative self-supervised learning methods

Self-supervised learning enables extracting robust features from unlabeled data in computer vision. This comprehensive review examines the evolution and current landscape of discriminative self-supervised approaches like contrastive, distillation, and clustering methods that leverage abundant unlabeled data. With comparative analyses on benchmarks, it provides valuable perspective on this increasingly important paradigm.

Authors:  Nikolaos Giakoumoglou, Tania Stathaki

Link:  https://arxiv.org/abs/2405.04969v1

Date: 2024-05-08

Summary:

In the field of computer vision, self-supervised learning has emerged as a method to extract robust features from unlabeled data, where models derive labels autonomously from the data itself, without the need for manual annotation. This paper provides a comprehensive review of discriminative approaches to self-supervised learning within the domain of computer vision, examining their evolution and current status. Through an exploration of various methods including contrastive, self-distillation, knowledge distillation, feature decorrelation, and clustering techniques, we investigate how these approaches leverage the abundance of unlabeled data. Finally, we compare self-supervised learning methods on the standard ImageNet classification benchmark.
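
Among the families reviewed, self-distillation methods (e.g., BYOL, DINO) share one compact ingredient that is easy to show in isolation: the teacher network is an exponential moving average (EMA) of the student. A minimal PyTorch sketch of that update follows; the backbone and momentum value are illustrative.

```python
import copy
import torch
import torch.nn as nn

def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996):
    """Teacher weights track the student as an exponential moving average,
    the update shared by self-distillation methods such as BYOL and DINO."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

# Toy usage: the teacher starts as a frozen copy of the student backbone.
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

# ... after each optimizer step on the student's self-supervised loss:
ema_update(teacher, student)
```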

--------------------------------------------------------------------------------------------------------

Detecting music deepfakes is easy but actually hard

As generative models proliferate, detecting artificially created content like music deepfakes is critical. This paper demonstrates training highly accurate classifiers to detect music deepfakes but cautions that achieving high accuracy is insufficient. It exposes potential pitfalls around calibration, robustness, generalization, interpretability and recourse, providing a nuanced perspective on developing reliable detection tools against emerging threats.

Authors:  Darius Afchar, Gabriel Meseguer Brocal, Romain Hennequin

Link:  https://arxiv.org/abs/2405.04181v1

Date: 2024-05-07

Summary:

In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. The ability to create credible minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and fake reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a music deepfake detector, a tool that will help in the regulation of music forgery. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We step back from the straightforward ML framework and expose many facets that could be problematic with such a deployed detector: calibration, robustness to audio manipulation, generalisation to unseen models, interpretability, and possibility for recourse. This second part serves as a position statement on future research directions in the field and a caveat to a flourishing market of fake content checkers.
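
Calibration is the first of the pitfalls the paper lists. As a reference point, the sketch below computes expected calibration error (ECE) for a binary fake/real detector on a handful of synthetic scores; the bin count is a common default and the numbers are made up.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: average gap between predicted confidence and observed accuracy,
    weighted by how many examples fall in each confidence bin."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    confidences = np.where(probs >= 0.5, probs, 1.0 - probs)
    predictions = (probs >= 0.5).astype(int)
    bins = np.linspace(0.5, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        if i == n_bins - 1:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            acc = (predictions[mask] == labels[mask]).mean()
            conf = confidences[mask].mean()
            ece += mask.mean() * abs(acc - conf)
    return ece

# Synthetic detector outputs: P(fake) for 8 tracks, with ground-truth labels.
p_fake = [0.99, 0.97, 0.95, 0.90, 0.60, 0.55, 0.10, 0.02]
is_fake = [1, 1, 1, 0, 1, 0, 0, 0]
print(f"ECE = {expected_calibration_error(p_fake, is_fake):.3f}")
```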

--------------------------------------------------------------------------------------------------------

Thoughtful Things: Building Human-Centric Smart Devices with Small Language Models

While smart devices automate tasks, understanding and controlling their complex behaviors remains challenging for users. This work proposes "thoughtful things": devices leveraging lightweight on-device language models to interpret unconstrained voice commands and explain their responses. With implementations deployed on real hardware, the proposed framework, which combines formal modeling and data synthesis, can make smart devices truly user-friendly without cloud dependencies.

Authors:  Evan King, Haoxiang Yu, Sahil Vartak, Jenna Jacob, Sangsu Lee, Christine Julien

Link:  https://arxiv.org/abs/2405.03821v1

Date: 2024-05-06

Summary:

Everyday devices like light bulbs and kitchen appliances are now embedded with so many features and automated behaviors that they have become complicated to actually use. While such "smart" capabilities can better support users' goals, the task of learning the "ins and outs" of different devices is daunting. Voice assistants aim to solve this problem by providing a natural language interface to devices, yet such assistants cannot understand loosely-constrained commands, they lack the ability to reason about and explain devices' behaviors to users, and they rely on connectivity to intrusive cloud infrastructure. Toward addressing these issues, we propose thoughtful things: devices that leverage lightweight, on-device language models to take actions and explain their behaviors in response to unconstrained user commands. We propose an end-to-end framework that leverages formal modeling, automated training data synthesis, and generative language models to create devices that are both capable and thoughtful in the presence of unconstrained user goals and inquiries. Our framework requires no labeled data and can be deployed on-device, with no cloud dependency. We implement two thoughtful things (a lamp and a thermostat) and deploy them on real hardware, evaluating their practical performance.
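
The framework's combination of formal modeling and automated training-data synthesis can be sketched in miniature: describe the device as a small formal schema and expand command/action/explanation templates over it to produce fine-tuning data for the on-device model. Everything below (the thermostat schema, templates, and field names) is a hypothetical illustration, not the paper's implementation.

```python
import json
import random

# Hypothetical formal model of a thermostat: its settings and their ranges.
DEVICE_MODEL = {
    "setpoint": {"min": 15, "max": 30, "unit": "C"},
    "mode": ["off", "heat", "cool"],
}

# Templates expanded over the device model to synthesize training triples.
COMMAND_TEMPLATES = [
    ("make it {temp} degrees", "set_setpoint({temp})",
     "I set the target temperature to {temp} {unit}."),
    ("switch to {mode} mode", "set_mode('{mode}')",
     "I changed the mode to {mode}."),
]

def synthesize_examples(n=100, seed=0):
    """Generate synthetic (command, action, explanation) triples for fine-tuning
    a small on-device language model."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        cmd, act, expl = rng.choice(COMMAND_TEMPLATES)
        fields = {
            "temp": rng.randint(DEVICE_MODEL["setpoint"]["min"],
                                DEVICE_MODEL["setpoint"]["max"]),
            "mode": rng.choice(DEVICE_MODEL["mode"]),
            "unit": DEVICE_MODEL["setpoint"]["unit"],
        }
        examples.append({
            "command": cmd.format(**fields),
            "action": act.format(**fields),
            "explanation": expl.format(**fields),
        })
    return examples

print(json.dumps(synthesize_examples(n=2), indent=2))
```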

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.