Week Ending 12.1.2024

RESEARCH WATCH: 12.1.2024

Reverse Thinking Makes LLMs Stronger Reasoners

Reasoning remains a hard problem for large language models. This research introduces Reverse-Enhanced Thinking (RevThink), a framework that mimics human cognitive processes by training large language models to reason both forward and backward. By augmenting datasets with structured reasoning paths and training with multi-task learning objectives, the researchers report an average 13.53% improvement over the student model's zero-shot performance across commonsense, mathematical, and logical reasoning tasks. The approach is also sample-efficient: using only 10% of the correct forward reasoning from the training data, it outperforms standard fine-tuning trained on 10x more, and it generalizes well to held-out datasets.

Authors: Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, Tomas Pfister

Link: https://arxiv.org/abs/2411.19865v1

Date: 2024-11-29

Summary:

Reverse thinking plays a crucial role in human reasoning. Humans can reason not only from a problem to a solution but also in reverse, i.e., start from the solution and reason towards the problem. This often enhances overall reasoning performance as it enables consistency checks between their forward and backward thinking. To enable Large Language Models (LLMs) to perform reverse thinking, we introduce Reverse-Enhanced Thinking (RevThink), a framework composed of data augmentation and learning objectives. In RevThink, we augment the dataset by collecting structured forward-backward reasoning from a teacher model, consisting of: (1) the original question, (2) forward reasoning, (3) backward question, and (4) backward reasoning. We then employ three objectives to train a smaller student model in a multi-task learning fashion: (a) generate forward reasoning from a question, (b) generate a backward question from a question, and (c) generate backward reasoning from the backward question. Experiments across 12 datasets covering commonsense, math, and logical reasoning show an average 13.53% improvement over the student model's zero-shot performance and a 6.84% improvement over the strongest knowledge distillation baselines. Moreover, our method demonstrates sample efficiency -- using only 10% of the correct forward reasoning from the training data, it outperforms a standard fine-tuning method trained on 10x more forward reasoning. RevThink also exhibits strong generalization to out-of-distribution held-out datasets.
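
To make the three training objectives concrete, the sketch below expands one teacher-augmented record into three (input, target) pairs, one per objective (a), (b), and (c). The class name, the bracketed task prefixes, and the toy arithmetic record are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AugmentedExample:
    """One teacher-augmented record with the four fields named in the abstract."""
    question: str
    forward_reasoning: str
    backward_question: str
    backward_reasoning: str


def to_training_pairs(ex):
    """Expand one record into (input, target) pairs, one per objective.

    The bracketed task prefixes are hypothetical; they only serve to tell
    the student model which of the three tasks it is being asked to do.
    """
    return [
        # (a) question -> forward reasoning
        ("[forward] " + ex.question, ex.forward_reasoning),
        # (b) question -> backward question
        ("[make-backward] " + ex.question, ex.backward_question),
        # (c) backward question -> backward reasoning
        ("[backward] " + ex.backward_question, ex.backward_reasoning),
    ]


# Toy record, invented for illustration only.
example = AugmentedExample(
    question="Ann has 3 apples and buys 2 more. How many does she have?",
    forward_reasoning="3 + 2 = 5, so she has 5 apples.",
    backward_question="Ann has 5 apples after buying 2. How many did she start with?",
    backward_reasoning="5 - 2 = 3, so she started with 3 apples.",
)
pairs = to_training_pairs(example)
```

Mixing all three pairs into one training set is what makes the setup multi-task; how the objectives are weighted is not stated in the abstract and is left out here.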

--------------------------------------------------------------------------------------------------------

Q-learning-based Model-free Safety Filter

Safety is paramount in robotics, especially in complex and unpredictable environments. This innovative research presents a model-free safety filter framework that addresses critical challenges in robotic control systems. By leveraging Q-learning and introducing a novel reward formulation, the researchers developed a flexible, plug-and-play approach to safeguard task-specific policies. The method's key strength lies in its ability to filter out potentially unsafe actions without relying on complex system models. Demonstrated through simulations and real-world experiments with a soft robotic limb, this approach offers a promising solution for enhancing robotic system safety across diverse applications.

Authors: Guo Ning Sue, Yogita Choudhary, Richard Desatnik, Carmel Majidi, John Dolan, Guanya Shi

Link: https://arxiv.org/abs/2411.19809v1

Date: 2024-11-29

Summary:

Ensuring safety via safety filters in real-world robotics presents significant challenges, particularly when the system dynamics are complex or unavailable. To handle this issue, learning-based safety filters, which can be classified as model-based or model-free methods, have recently gained popularity. Existing model-based approaches require various assumptions on the system model (e.g., control-affine), which limits their application in complex systems, and existing model-free approaches need substantial modifications to standard RL algorithms and lack versatility. This paper proposes a simple, plug-and-play, and effective model-free safety filter learning framework. We introduce a novel reward formulation and use Q-learning to learn Q-value functions to safeguard arbitrary task-specific nominal policies by filtering out their potentially unsafe actions. The threshold used in the filtering process is supported by our theoretical analysis. Due to its model-free nature and simplicity, our framework can be seamlessly integrated with various RL algorithms. We validate the proposed approach through simulations on double integrator and Dubins car systems and demonstrate its effectiveness in real-world experiments with a soft robotic limb.
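
The filtering step can be pictured with a minimal sketch. It assumes a tabular safety Q-function stored in a dictionary, a scalar threshold, and a fallback that picks the action with the highest safety value. The reward formulation and the threshold derived in the paper are not given in the abstract, so the numbers and the fallback rule below are placeholders.

```python
def filter_action(q_safe, state, nominal_action, actions, threshold):
    """Pass the nominal action through if its safety value clears the threshold.

    q_safe maps (state, action) -> learned safety Q-value (assumed tabular here).
    If the nominal action is judged unsafe, fall back to the action with the
    highest safety value; this fallback rule is an assumption of the sketch.
    """
    if q_safe[(state, nominal_action)] >= threshold:
        return nominal_action
    return max(actions, key=lambda a: q_safe[(state, a)])


# Toy Q-table for a single state with three actions (values are made up).
ACTIONS = ["left", "stay", "right"]
Q_SAFE = {
    ("s0", "left"): -1.0,
    ("s0", "stay"): 0.4,
    ("s0", "right"): 0.9,
}

kept = filter_action(Q_SAFE, "s0", "stay", ACTIONS, threshold=0.0)
overridden = filter_action(Q_SAFE, "s0", "left", ACTIONS, threshold=0.0)
```

Because the filter only needs Q-values and a nominal action, it can wrap any task policy without touching how that policy was trained, which is the plug-and-play property the abstract emphasizes.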

--------------------------------------------------------------------------------------------------------

Improving generalization of robot locomotion policies via Sharpness-Aware Reinforcement Learning

Bridging the gap between simulation and real-world robotic performance remains a significant challenge in robotics. This research introduces a novel approach integrating sharpness-aware optimization into gradient-based reinforcement learning algorithms. By focusing on finding flatter minima in the loss landscape, the method enhances policy robustness to environmental variations and action perturbations. The approach demonstrates improved action noise tolerance and generalization, particularly in contact-rich environments. This research offers a promising pathway to more adaptable and reliable robotic locomotion systems, potentially revolutionizing how robots interact with complex, dynamic environments.

Authors: Severin Bochem, Eduardo Gonzalez-Sanchez, Yves Bicker, Gabriele Fadini

Link: https://arxiv.org/abs/2411.19732v1

Date: 2024-11-29

Summary:

Reinforcement learning often requires extensive training data. Simulation-to-real transfer offers a promising approach to address this challenge in robotics. While differentiable simulators offer improved sample efficiency through exact gradients, they can be unstable in contact-rich environments and may lead to poor generalization. This paper introduces a novel approach integrating sharpness-aware optimization into gradient-based reinforcement learning algorithms. Our simulation results demonstrate that our method, tested on contact-rich environments, significantly enhances policy robustness to environmental variations and action perturbations while maintaining the sample efficiency of first-order methods. Specifically, our approach improves action noise tolerance compared to standard first-order methods and achieves generalization comparable to zeroth-order methods. This improvement stems from finding flatter minima in the loss landscape, associated with better generalization. Our work offers a promising solution to balance efficient learning and robust sim-to-real transfer in robotics, potentially bridging the gap between simulation and real-world performance.
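
For readers unfamiliar with sharpness-aware optimization, here is a minimal sketch of the generic sharpness-aware minimization (SAM) update on a toy quadratic loss. It is not the paper's reinforcement learning integration, which the abstract does not detail; the learning rate, the radius rho, and the toy loss are assumptions.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One generic SAM update on a list of parameters.

    1. Take the gradient at w.
    2. Move to the nearby point w + rho * g / ||g|| (approximate worst case
       within a ball of radius rho).
    3. Update w using the gradient evaluated at that perturbed point.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def toy_loss(w):
    """Toy objective: sum of squares."""
    return sum(wi * wi for wi in w)


def toy_grad(w):
    """Analytic gradient of toy_loss."""
    return [2.0 * wi for wi in w]


w = [1.0, -2.0]
for _ in range(50):
    w = sam_step(w, toy_grad)
final_loss = toy_loss(w)
```

In a first-order RL setting the role of `grad_fn` would be played by the policy gradient obtained through the differentiable simulator; that mapping is an interpretation of the abstract, not a description of the authors' code.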

--------------------------------------------------------------------------------------------------------

BPQP: A Differentiable Convex Optimization Framework for Efficient End-to-End Learning

In the era of data-driven decision-making, efficient computational methods are crucial. This research introduces BPQP, a differentiable convex optimization framework designed to streamline end-to-end learning. By reformulating the backward pass as a simplified quadratic programming problem, the researchers report overall execution times that are typically an order of magnitude faster than other differentiable optimization layers. The framework can also take advantage of improved solvers as they become available, making it a flexible option for optimization problems in domains ranging from machine learning to operations research.

Authors: Jianming Pan, Zeqi Ye, Xiao Yang, Xu Yang, Weiqing Liu, Lewen Wang, Jiang Bian

Link: https://arxiv.org/abs/2411.19285v1

Date: 2024-11-28

Summary:

Data-driven decision-making processes increasingly utilize end-to-end learnable deep neural networks to render final decisions. Sometimes, the output of the forward functions in certain layers is determined by the solutions to mathematical optimization problems, leading to the emergence of differentiable optimization layers that permit gradient back-propagation. However, real-world scenarios often involve large-scale datasets and numerous constraints, presenting significant challenges. Current methods for differentiating optimization problems typically rely on implicit differentiation, which necessitates costly computations on the Jacobian matrices, resulting in low efficiency. In this paper, we introduce BPQP, a differentiable convex optimization framework designed for efficient end-to-end learning. To enhance efficiency, we reformulate the backward pass as a simplified and decoupled quadratic programming problem by leveraging the structural properties of the KKT matrix. This reformulation enables the use of first-order optimization algorithms in calculating the backward pass gradients, allowing our framework to potentially utilize any state-of-the-art solver. As solver technologies evolve, BPQP can continuously adapt and improve its efficiency. Extensive experiments on both simulated and real-world datasets demonstrate that BPQP achieves a significant improvement in efficiency--typically an order of magnitude faster in overall execution time compared to other differentiable optimization layers. Our results not only highlight the efficiency gains of BPQP but also underscore its superiority over differentiable optimization layer baselines.

--------------------------------------------------------------------------------------------------------

RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data

Wearable technology is rapidly evolving, and this research presents a cutting-edge approach to extracting meaningful insights from motion sensor data. The RelCon framework introduces a novel self-supervised learning method that uses a learnable distance measure to capture semantic similarities in accelerometer time-series segments. Trained on an extensive dataset of 1 billion segments from over 87,000 participants, the model demonstrates exceptional performance across multiple downstream tasks. This breakthrough has significant implications for health monitoring, fitness tracking, and understanding human movement patterns across diverse contexts.

Authors: Maxwell A. Xu, Jaya Narain, Gregory Darnell, Haraldur Hallgrimsson, Hyewon Jeong, Darren Forde, Richard Fineman, Karthik J. Raghuram, James M. Rehg, Shirley Ren

Link: https://arxiv.org/abs/2411.18822v1

Date: 2024-11-27

Summary:

We present RelCon, a novel self-supervised Relative Contrastive learning approach that uses a learnable distance measure in combination with a softened contrastive loss for training a motion foundation model from wearable sensors. The learnable distance measure captures motif similarity and domain-specific semantic information such as rotation invariance. The learned distance provides a measurement of semantic similarity between a pair of accelerometer time-series segments, which is used to measure the distance between an anchor and various other sampled candidate segments. The self-supervised model is trained on 1 billion segments from 87,376 participants from a large wearables dataset. The model achieves strong performance across multiple downstream tasks, encompassing both classification and regression. To our knowledge, we are the first to show the generalizability of a self-supervised learning model with motion data from wearables across distinct evaluation tasks.

--------------------------------------------------------------------------------------------------------

MM-Path: Multi-modal, Multi-granularity Path Representation Learning -- Extended Version

Intelligent transportation systems are becoming increasingly sophisticated, and this research offers a novel approach to path representation learning. The MM-Path framework integrates information from road networks and remote sensing images, addressing the challenges of semantic alignment across different data modalities. By developing a multi-granularity alignment strategy and a graph-based cross-modal fusion component, the researchers create a more comprehensive path representation model. This approach has profound implications for urban planning, navigation systems, and geospatial analysis, providing a more nuanced understanding of spatial environments.

Authors: Ronghui Xu, Hanyin Cheng, Chenjuan Guo, Hongfan Gao, Jilin Hu, Sean Bin Yang, Bin Yang

Link: https://arxiv.org/abs/2411.18428v2

Date: 2024-11-28

Summary:

Developing effective path representations has become increasingly essential across various fields within intelligent transportation. Although pre-trained path representation learning models have shown improved performance, they predominantly focus on the topological structures from single modality data, i.e., road networks, overlooking the geometric and contextual features associated with path-related images, e.g., remote sensing images. Similar to human understanding, integrating information from multiple modalities can provide a more comprehensive view, enhancing both representation accuracy and generalization. However, variations in information granularity impede the semantic alignment of road network-based paths (road paths) and image-based paths (image paths), while the heterogeneity of multi-modal data poses substantial challenges for effective fusion and utilization. In this paper, we propose a novel Multi-modal, Multi-granularity Path Representation Learning Framework (MM-Path), which can learn a generic path representation by integrating modalities from both road paths and image paths. To enhance the alignment of multi-modal data, we develop a multi-granularity alignment strategy that systematically associates nodes, road sub-paths, and road paths with their corresponding image patches, ensuring the synchronization of both detailed local information and broader global contexts. To address the heterogeneity of multi-modal data effectively, we introduce a graph-based cross-modal residual fusion component designed to comprehensively fuse information across different modalities and granularities. Finally, we conduct extensive experiments on two large-scale real-world datasets under two downstream tasks, validating the effectiveness of the proposed MM-Path. The code is available at: https://github.com/decisionintelligence/MM-Path.

--------------------------------------------------------------------------------------------------------

Randomized-Grid Search for Hyperparameter Tuning in Decision Tree Model to Improve Performance of Cardiovascular Disease Classification

Machine learning in healthcare diagnosis requires precise model optimization. This research introduces a novel Randomized-Grid Search method for hyperparameter tuning, specifically targeting decision tree models for cardiovascular disease classification. By combining the global exploration of random search with the focused approach of grid search, the method offers a more efficient and effective optimization strategy. The approach demonstrates improved model performance, accuracy, and generalization, presenting a promising solution for enhancing machine learning applications in medical diagnostics and potentially saving lives through more accurate disease prediction.

Authors: Abhay Kumar Pathak, Mrityunjay Chaubey, Manjari Gupta

Link: https://arxiv.org/abs/2411.18234v1

Date: 2024-11-27

Summary:

Cardiovascular disease refers to any critical condition that impacts the heart. Because heart diseases can be life-threatening, researchers are focusing on designing smart systems to accurately diagnose them based on electronic health data, with the aid of machine learning algorithms. Heart disease classification using machine learning (ML) algorithms such as Support Vector Machine (SVM), Naïve Bayes (NB), Decision Trees (DTs), and Random Forests (RFs) is often hindered by overfitting. These ML algorithms need extensive hyperparameter tuning. Random Search offers a faster and more efficient exploration of the hyperparameter space, but it may overlook optimal regions. Grid Search, though exhaustive, is computationally expensive and inefficient, particularly with high-dimensional data. To address these limitations, Randomized-Grid Search, a novel hybrid optimization method, is proposed that combines the global exploration strengths of Random Search with the focused and exhaustive search of Grid Search in the most promising regions. This hybrid approach efficiently balances exploration and exploitation. The proposed method optimizes the hyperparameters of a Decision Tree model and is applied to the UCI heart disease dataset for classification. It enhances model performance, providing improved accuracy, generalization, and computational efficiency. Experimental results demonstrate that Randomized-Grid Search outperforms traditional methods by significant margins. The proposed method provides a more effective solution for machine learning applications in healthcare diagnosis.
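
One plausible reading of the hybrid idea is a two-stage search: random sampling to locate a promising region, then an exhaustive grid confined to that region. The sketch below follows that reading on a made-up scoring function. The function names, the hyperparameter bounds, the neighbourhood radius, and the toy score are all assumptions; the abstract does not specify the authors' exact procedure.

```python
import itertools
import random


def randomized_grid_search(score_fn, bounds, n_random=20, radius=2, seed=0):
    """Two-stage search over integer hyperparameters.

    bounds maps each hyperparameter name to an inclusive (low, high) range.
    Returns (best random candidate, best candidate after local grid search).
    """
    rng = random.Random(seed)
    names = list(bounds)

    # Stage 1: global exploration with random samples.
    candidates = [
        {n: rng.randint(*bounds[n]) for n in names} for _ in range(n_random)
    ]
    coarse_best = max(candidates, key=score_fn)

    # Stage 2: exhaustive grid in a small neighbourhood of the best sample.
    axes = []
    for n in names:
        low, high = bounds[n]
        centre = coarse_best[n]
        axes.append(range(max(low, centre - radius), min(high, centre + radius) + 1))
    grid = [dict(zip(names, combo)) for combo in itertools.product(*axes)]
    fine_best = max(grid, key=score_fn)
    return coarse_best, fine_best


def toy_score(params):
    """Stand-in for cross-validated accuracy; peaks at depth 8, split 10."""
    return -((params["max_depth"] - 8) ** 2 + (params["min_samples_split"] - 10) ** 2)


BOUNDS = {"max_depth": (1, 30), "min_samples_split": (2, 50)}
coarse, fine = randomized_grid_search(toy_score, BOUNDS)
```

Because the local grid always contains the best random sample, the second stage can only match or improve on the first. In practice `toy_score` would be replaced by a cross-validated score of a decision tree trained with the given hyperparameters.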

--------------------------------------------------------------------------------------------------------

Machine Learning and Multi-source Remote Sensing in Forest Carbon Stock Estimation: A Review

Climate change mitigation relies on accurate environmental monitoring, and this comprehensive review explores the intersection of machine learning and remote sensing for forest carbon stock estimation. Analyzing 25 research papers, the study highlights the effectiveness of various machine learning methods, with Random Forest and Extreme Gradient Boosting showing particularly promising results. By utilizing multi-sensor approaches like Sentinel-1, Sentinel-2, and LiDAR, researchers can develop more precise and scalable methods for quantifying forest carbon, providing crucial insights for environmental conservation and climate policy.

Authors: Autumn Nguyen, Sulagna Saha

Link: https://arxiv.org/abs/2411.17624v1

Date: 2024-11-26

Summary:

Quantifying forest carbon is crucial for informing decisions and policies that will protect the planet. Machine learning (ML) and remote sensing (RS) techniques have been used to do this task more effectively, yet a systematic review of the most recent ML methods and RS combinations is lacking, especially one that considers forest characteristics. This study systematically analyzed 25 papers meeting strict inclusion criteria from over 80 related studies, identifying 28 ML methods and key combinations of RS data. Random Forest appeared most frequently (88% of studies), while Extreme Gradient Boosting showed superior performance in 75% of the studies in which it was compared with other methods. Sentinel-1 emerged as the most utilized remote sensing source, with multi-sensor approaches (e.g., Sentinel-1, Sentinel-2, and LiDAR) proving especially effective. Our findings provide grounds for recommending best practices in integrating machine learning and remote sensing for accurate and scalable forest carbon stock estimation.

--------------------------------------------------------------------------------------------------------

BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving

Mathematical modeling is a complex cognitive task, and this research aims to enhance Large Language Models' (LLMs) reasoning capabilities. By introducing the StructuredOR dataset with comprehensive modeling process annotations, the researchers developed BPP-Search, an advanced algorithm integrating reinforcement learning into a tree-of-thought structure. The approach significantly outperforms existing methods in solving mathematical modeling problems, demonstrating superior accuracy and efficiency. This breakthrough has potential applications in operations research, scientific problem-solving, and developing more sophisticated AI reasoning systems.

Authors: Teng Wang, Wing-Yin Yu, Zhenqi He, Zehua Liu, Xiongwei Han, Hailei Gong, Han Wu, Wei Shi, Ruifeng She, Fangzhou Zhu, Tao Zhong

Link: https://arxiv.org/abs/2411.17404v1

Date: 2024-11-26

Summary:

LLMs exhibit advanced reasoning capabilities, offering the potential to transform natural language questions into mathematical models. However, existing open-source operations research datasets lack detailed annotations of the modeling process, such as variable definitions, focusing solely on objective values, which hinders reinforcement learning applications. To address this, we release the StructuredOR dataset, annotated with comprehensive labels that capture the complete mathematical modeling process. We further propose BPP-Search, an algorithm that integrates reinforcement learning into a tree-of-thought structure using Beam search, a Process reward model, and a pairwise Preference algorithm. This approach enables efficient exploration of tree structures, avoiding exhaustive search while improving accuracy. Extensive experiments on StructuredOR, NL4OPT, and MAMO-ComplexLP datasets show that BPP-Search significantly outperforms state-of-the-art methods, including Chain-of-Thought, Self-Consistency, and Tree-of-Thought. In tree-based reasoning, BPP-Search also surpasses Process Reward Model combined with Greedy or Beam Search, demonstrating superior accuracy and efficiency, and enabling faster retrieval of correct solutions.
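
The combination of beam search, a step-level scorer, and a pairwise comparison can be sketched generically as follows. This is not the authors' implementation: `expand`, `score`, and `prefers` are hypothetical stand-ins for the LLM step generator, the process reward model, and the preference model, and the toy tree of integer tuples exists only to make the sketch runnable.

```python
def beam_search(root, expand, score, beam_width=2, depth=3):
    """Keep only the beam_width highest-scoring partial solutions per level."""
    beam = [root]
    for _ in range(depth):
        children = [child for node in beam for child in expand(node)]
        if not children:
            break
        beam = sorted(children, key=score, reverse=True)[:beam_width]
    return beam


def pick_by_pairwise_preference(candidates, prefers):
    """Return the candidate that wins the most head-to-head comparisons."""
    wins = [
        sum(1 for b in candidates if b is not a and prefers(a, b))
        for a in candidates
    ]
    return candidates[wins.index(max(wins))]


# Toy problem: a node is a tuple of chosen steps; each step is 1, 2, or 3.
def expand(node):
    return [node + (step,) for step in (1, 2, 3)]


def score(node):
    """Stand-in for a process reward model: here, simply the sum of steps."""
    return sum(node)


def prefers(a, b):
    """Stand-in for a pairwise preference model."""
    return score(a) > score(b)


finalists = beam_search((), expand, score, beam_width=2, depth=3)
best = pick_by_pairwise_preference(finalists, prefers)
```

The beam visits at most `beam_width` times the branching factor nodes per level instead of the full tree, which is the sense in which the search avoids exhaustive exploration.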

--------------------------------------------------------------------------------------------------------

Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos

In the digital age, content moderation is crucial for maintaining safe online environments. This research evaluates Large Language Models (LLMs) for detecting sensitive content across text, images, and videos. By comparing models like GPT, Gemini, and Llama, the study demonstrates LLMs' potential to outperform traditional content filtering techniques. The research highlights the models' ability to achieve higher accuracy and lower false positive/negative rates, offering a promising solution for websites, social media platforms, and video-sharing services to more effectively regulate and moderate potentially harmful content.

Authors: Nouar AlDahoul, Myles Joshua Toledo Tan, Harishwar Reddy Kasireddy, Yasir Zaki

Link: https://arxiv.org/abs/2411.17123v1

Date: 2024-11-26

Summary:

The widespread dissemination of hate speech, harassment, harmful and sexual content, and violence across websites and media platforms presents substantial challenges and provokes widespread concern among different sectors of society. Governments, educators, and parents are often at odds with media platforms about how to regulate, control, and limit the spread of such content. Technologies for detecting and censoring media content are a key solution to addressing these challenges. Techniques from natural language processing and computer vision have been used widely to automatically identify and filter out sensitive content such as offensive language, violence, nudity, and addiction across text, images, and videos, enabling platforms to enforce content policies at scale. However, existing methods still have limitations in achieving high detection accuracy with fewer false positives and false negatives. Therefore, more sophisticated algorithms for understanding the context of both text and images may open room for improvement in content censorship and help build a more efficient censorship system. In this paper, we evaluate existing LLM-based content moderation solutions such as the OpenAI moderation model and Llama-Guard3 and study their capabilities to detect sensitive content. Additionally, we explore recent LLMs such as GPT, Gemini, and Llama in identifying inappropriate content across media outlets. Various textual and visual datasets like X tweets, Amazon reviews, news articles, human photos, cartoons, sketches, and violence videos have been utilized for evaluation and comparison. The results demonstrate that LLMs outperform traditional techniques by achieving higher accuracy and lower false positive and false negative rates. This highlights the potential to integrate LLMs into websites, social media platforms, and video-sharing services for regulatory and content moderation purposes.

--------------------------------------------------------------------------------------------------------

Event-based Spiking Neural Networks for Object Detection: A Review of Datasets, Architectures, Learning Rules, and Implementation

Neuromorphic computing represents a promising frontier in energy-efficient artificial intelligence. This comprehensive review explores Spiking Neural Networks (SNNs) for computer vision object detection, analyzing 151 research articles. By examining various architectures, learning methods, and implementation techniques, the research provides insights into the potential of bio-inspired neural networks. The study highlights SNN's advantages in energy consumption, latency, and memory efficiency, offering a roadmap for developing more sophisticated and energy-efficient computer vision systems with potential applications in robotics, autonomous vehicles, and intelligent sensing technologies.

Authors: Craig Iaboni, Pramod Abichandani

Link: https://arxiv.org/abs/2411.17006v1

Date: 2024-11-26

Summary:

Spiking Neural Networks (SNNs) represent a biologically inspired paradigm offering an energy-efficient alternative to conventional artificial neural networks (ANNs) for Computer Vision (CV) applications. This paper presents a systematic review of datasets, architectures, learning methods, implementation techniques, and evaluation methodologies used in CV-based object detection tasks using SNNs. Based on an analysis of 151 journal and conference articles, the review codifies: 1) the effectiveness of fully connected, convolutional, and recurrent architectures; 2) the performance of direct unsupervised, direct supervised, and indirect learning methods; and 3) the trade-offs in energy consumption, latency, and memory in neuromorphic hardware implementations. An open-source repository along with detailed examples of Python code and resources for building SNN models, event-based data processing, and SNN simulations are provided. Key challenges in SNN training, hardware integration, and future directions for CV applications are also identified.

--------------------------------------------------------------------------------------------------------

Square ice Coulomb phase as a percolated vertex lattice

In condensed matter and statistical physics, understanding complex lattice systems remains a crucial challenge. This research explores the square ice model, a system in which a local constraint on spin arrangements produces an extensively degenerate ground state. Using a loop flip algorithm, the researchers characterize the system through vertex distributions rather than traditional spin textures. The study finds that the constrained vertex configurations closely resemble an almost ideal vertex gas, with cluster sizes well approximated by percolation theory. This approach advances the theoretical understanding of Coulomb phases and provides a practical method for spotting them in experiments on artificial square ice systems.

Authors: Johann Coraux, Nicodème Rougier, Benjamin Canals, Nicolas Rougemaille

Link: https://arxiv.org/abs/2411.16533v1

Date: 2024-11-25

Summary:

The square ice is a canonical example of a Coulomb phase in two dimensions: Its ground state is extensively degenerate and satisfies a local constraint on the spin arrangement (the so-called ice rule). In this paper, we use a loop flip algorithm to explore the properties of this ground state that we analyze not in terms of a spin texture, but rather in terms of a spatial distribution of ice-rule satisfying vertices. More specifically, we determine for various lattice sizes the average vertex populations characterizing the ice manifold, the pairwise vertex correlations, and the size distribution of vertex clusters. Comparing these results to those obtained from random, constraint-free vertex tilings, the square ice manifold is found to resemble an almost ideal vertex gas, and the cluster size distribution of ice-rule satisfying vertices is well approximated by percolation theory. Remarkably, this description remains reasonably accurate when monopoles are present in a dilute amount, allowing a direct comparison with experiments. Revising former experimental results on two artificial square ice systems, we illustrate the interest of our approach to spot the presence of a Coulomb phase from a vertex analysis.
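
As a small illustration of the ice rule itself, the sketch below enumerates every possible vertex of the square lattice and keeps those with two spins pointing in and two pointing out. The encoding (+1 for a spin pointing into the vertex, -1 for one pointing out) is an assumption made for this example; the loop flip algorithm used in the paper is not reproduced here.

```python
from itertools import product


def satisfies_ice_rule(vertex):
    """True when two spins point in and two point out of the vertex.

    vertex is a 4-tuple with +1 for a spin pointing in and -1 for one
    pointing out, so the ice rule is equivalent to a zero sum.
    """
    return sum(vertex) == 0


# All 2**4 ways to orient the four spins meeting at a vertex.
all_vertices = list(product((+1, -1), repeat=4))
ice_vertices = [v for v in all_vertices if satisfies_ice_rule(v)]

n_total = len(all_vertices)  # 16 configurations in total
n_ice = len(ice_vertices)    # 6 of them satisfy the ice rule
```

Vertices with a nonzero sum carry a net magnetic charge, which corresponds to the monopoles mentioned in the abstract.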

--------------------------------------------------------------------------------------------------------

Leveraging Foundation Models To learn the shape of semi-fluid deformable objects

Machine learning's frontier continues to expand with innovative approaches to object recognition and characterization. This research tackles the complex challenge of understanding semi-fluid deformable objects, particularly focusing on weld pool analysis. By leveraging foundation models and a novel knowledge distillation framework, the researchers developed a groundbreaking method for extracting object keypoints without extensive manual labeling. The approach uses a teacher-student model where foundation models guide a smaller generative network, achieving impressive results with a 13.4-pixel error in keypoint retrieval. This technique holds significant promise for robotics, manufacturing, and automated visual inspection, offering a versatile approach to characterizing complex, dynamic objects with minimal prior training.

Authors: Omar El Assal, Carlos M. Mateo, Sebastien Ciron, David Fofi

Link: https://arxiv.org/abs/2411.16802v1

Date: 2024-11-25

Summary:

One of the difficulties in manipulating deformable objects is their characterization and the detection of representative keypoints for the purpose of manipulation. Over the last decade, researchers have shown keen interest in characterizing and manipulating deformable objects of a non-fluid nature, such as clothes and ropes. Although several approaches to object characterization have been proposed, researchers have consistently needed pixel-level information about the object from images to extract relevant information. This is usually accomplished by means of segmentation networks trained on manually labeled data. In this paper, we address the characterization of the weld pool to define stable features that serve as information for further motion control objectives. We achieve this by employing different pipelines. The first characterizes fluid deformable objects using a generative model trained in a teacher-student framework. In the second, we leverage foundation models by using them as teachers to characterize the object in the image, without the need for any pre-training or dataset. The performance of knowledge distillation from foundation models into a smaller generative model shows prominent results in the characterization of deformable objects. The student network learned to retrieve the keypoints of the object with an error of 13.4 pixels. The teacher was evaluated on its capacity to retrieve pixel-level information represented by the object mask, achieving a mean Intersection over Union (mIoU) of 75.26%.
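
The two reported metrics can be illustrated with a short sketch. It assumes binary masks stored as nested lists, keypoints stored as (x, y) pixel coordinates, a keypoint error defined as the mean Euclidean distance, and an mIoU averaged over image pairs. The abstract does not spell out these definitions, so treat them as one common convention rather than the authors' exact evaluation protocol.

```python
import math


def iou(pred_mask, true_mask):
    """Intersection over Union of two same-sized binary masks."""
    intersection = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; avoid dividing by zero.
    return intersection / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (predicted, ground-truth) mask pairs."""
    return sum(iou(p, t) for p, t in mask_pairs) / len(mask_pairs)


def mean_keypoint_error(pred_points, true_points):
    """Mean Euclidean distance in pixels between matched keypoints."""
    distances = [math.dist(p, t) for p, t in zip(pred_points, true_points)]
    return sum(distances) / len(distances)


# Toy data, invented for illustration.
pred = [[1, 1], [0, 0]]
true = [[1, 0], [0, 0]]
toy_iou = iou(pred, true)  # 1 overlapping pixel out of 2 in the union
toy_error = mean_keypoint_error([(0, 0), (3, 4)], [(0, 0), (0, 0)])
```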

--------------------------------------------------------------------------------------------------------

A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation

As deep learning continues to revolutionize computer vision, understanding model reliability becomes paramount. This comprehensive review delves into probabilistic image segmentation, exploring how uncertainty quantification can enhance algorithmic performance and trustworthiness. By examining both aleatoric (data-based) and epistemic (model-based) uncertainties, the research provides critical insights into how machine learning models can express their own confidence levels. The study identifies four key applications: analyzing annotation inconsistencies, correlating prediction errors, expanding model hypothesis spaces, and improving active learning techniques. This work is crucial for high-stakes domains like medical imaging, autonomous systems, and scientific research, where understanding model limitations is as important as understanding its capabilities.

Authors: M. M. A. Valiuddin, R. J. G. van Sloun, C. G. A. Viviers, P. H. N. de With, F. van der Sommen

Link: https://arxiv.org/abs/2411.16370v1

Date: 2024-11-25

Summary:

Advancements in image segmentation play an integral role within the greater scope of Deep Learning-based computer vision. Furthermore, their widespread applicability in critical real-world tasks has given rise to challenges related to the reliability of such algorithms. Hence, uncertainty quantification has been extensively studied within this context, enabling expression of model ignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to prevent uninformed decision making. Due to the rapid adoption of Convolutional Neural Network (CNN)-based segmentation models in high-stakes applications, a substantial body of research has been published on this very topic, causing its swift expansion into a distinct field. This work provides a comprehensive overview of probabilistic segmentation by discussing fundamental concepts in uncertainty that govern advancements in the field as well as the application to various tasks. We identify that quantifying aleatoric and epistemic uncertainty approximates Bayesian inference w.r.t. either latent variables or model parameters, respectively. Moreover, the literature on both uncertainties traces back to four key applications: (1) quantifying statistical inconsistencies in the annotation process due to ambiguous images, (2) correlating prediction error with uncertainty, (3) expanding the model hypothesis space for better generalization, and (4) active learning. Then, a discussion follows that includes an overview of utilized datasets for each of the applications and a comparison of the available methods. We also highlight challenges related to architectures, uncertainty-based active learning, standardization and benchmarking, and recommendations for future work such as methods based on single forward passes and models that appropriately leverage volumetric data.
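
To make the aleatoric/epistemic distinction concrete, here is a minimal sketch of one widely used entropy-based decomposition for a single pixel. It assumes several class-probability vectors for that pixel, obtained for instance from an ensemble or from repeated stochastic forward passes. This is only one of the approaches a review like this would cover, and the function names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decompose_uncertainty(samples):
    """Split predictive uncertainty for one pixel into two parts.

    samples: list of class-probability vectors, one per sampled model.
    Returns (total, aleatoric, epistemic) where
      total     = entropy of the averaged prediction,
      aleatoric = average entropy of the individual predictions,
      epistemic = total - aleatoric (disagreement between the samples).
    """
    m = len(samples)
    k = len(samples[0])
    mean_p = [sum(s[c] for s in samples) / m for c in range(k)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(s) for s in samples) / m
    return total, aleatoric, total - aleatoric
```

If every sampled model predicts [0.5, 0.5], the uncertainty is entirely aleatoric: the models agree that the pixel is ambiguous. If half predict [1, 0] and half [0, 1], it is entirely epistemic: each model is confident, but they disagree.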

--------------------------------------------------------------------------------------------------------

UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

Human pose analysis is critical in digital media, computer vision, and interactive technologies. This research introduces UniPose, a groundbreaking framework using Large Language Models to comprehend, generate, and edit human poses across multiple modalities like images, text, and 3D SMPL poses. By employing a pose tokenizer and specialized visual encoders, UniPose enables seamless knowledge transfer across pose-related tasks. The framework represents a significant step towards a general-purpose system for human pose manipulation, with potential applications in animation, sports analysis, virtual reality, and human-computer interaction.

Authors: Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen

Link: https://arxiv.org/abs/2411.16781v1

Date: 2024-11-25

Summary:

Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we equip UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.

--------------------------------------------------------------------------------------------------------

Real-Time Anomaly Detection in Video Streams

Ensuring safety and security in dynamic environments requires advanced video analysis techniques. This thesis presents an innovative artificial intelligence system for real-time danger detection in video streams. By combining temporal and spatial analysis with object detection, human pose recognition, and motion analysis, the research develops a robust anomaly detection framework. Utilizing neural network models like YOLO and Convolutional Recurrent Neural Networks, the approach offers flexible processing for continuous and finite video streams, with potential applications in surveillance, public safety, industrial monitoring, and autonomous security systems.

Authors: Fabien Poirier

Link: https://arxiv.org/abs/2411.19731v1

Date: 2024-11-29

Summary:

This thesis is part of a CIFRE agreement between the company Othello and the LIASD laboratory. The objective is to develop an artificial intelligence system that can detect real-time dangers in a video stream. To achieve this, a novel approach combining temporal and spatial analysis has been proposed. Several avenues have been explored to improve anomaly detection by integrating object detection, human pose detection, and motion analysis. For result interpretability, techniques commonly used for image analysis, such as activation and saliency maps, have been extended to videos, and an original method has been proposed. The proposed architecture performs binary or multiclass classification depending on whether an alert or the cause needs to be identified. Numerous neural network models have been tested, and three of them have been selected. You Only Look Once (YOLO) has been used for spatial analysis, a Convolutional Recurrent Neural Network (CRNN) composed of VGG19 and a Gated Recurrent Unit (GRU) for temporal analysis, and a multi-layer perceptron for classification. These models handle different types of data and can be combined in parallel or in series. Although the parallel mode is faster, the serial mode is generally more reliable. For training these models, supervised learning was chosen, and two proprietary datasets were created. The first dataset focuses on objects that may play a potential role in anomalies, while the second consists of videos containing anomalies or non-anomalies. This approach allows for the processing of both continuous video streams and finite videos, providing greater flexibility in detection.

--------------------------------------------------------------------------------------------------------

Beautimeter: Harnessing GPT for Assessing Architectural and Urban Beauty based on the 15 Properties of Living Structure

Aesthetic evaluation has long been a subjective process, but this research introduces Beautimeter, a groundbreaking AI tool for architectural and urban design assessment. Leveraging Christopher Alexander's theory of living structure and GPT's natural language processing capabilities, the tool evaluates spatial environments based on 15 fundamental properties. By generating nuanced insights into architectural aesthetics, Beautimeter offers architects, urban planners, and designers a powerful analytical tool. The research demonstrates the potential of AI in enhancing design processes and creating more human-centered, emotionally resonant built environments.

Authors: Bin Jiang

Link: https://arxiv.org/abs/2411.19094v1

Date: 2024-11-28

Summary:

Beautimeter is a new tool powered by generative pre-trained transformer (GPT) technology, designed to evaluate architectural and urban beauty. Rooted in Christopher Alexander's theory of centers, this work builds on the idea that all environments possess, to varying degrees, an innate sense of life. Alexander identified 15 fundamental properties, such as levels of scale and thick boundaries, that characterize living structure, which Beautimeter uses as a basis for its analysis. By integrating GPT's advanced natural language processing capabilities, Beautimeter assesses the extent to which a structure embodies these 15 properties, enabling a nuanced evaluation of architectural and urban aesthetics. Using ChatGPT, the tool helps users generate insights into the perceived beauty and coherence of spaces. We conducted a series of case studies, evaluating images of architectural and urban environments, as well as carpets, paintings, and other artifacts. The results demonstrate Beautimeter's effectiveness in analyzing aesthetic qualities across diverse contexts. Our findings suggest that by leveraging GPT technology, Beautimeter offers architects, urban planners, and designers a powerful tool to create spaces that resonate deeply with people. This paper also explores the implications of such technology for architecture and urban design, highlighting its potential to enhance both the design process and the assessment of built environments. Keywords: Living structure, structural beauty, Christopher Alexander, AI in Design, human centered design

--------------------------------------------------------------------------------------------------------

DGNN-YOLO: Dynamic Graph Neural Networks with YOLO11 for Small Object Detection and Tracking in Traffic Surveillance

Traffic safety is a critical global challenge, and this research presents an innovative framework for small object detection and tracking. By integrating dynamic graph neural networks with YOLO11, the researchers developed a robust system for identifying and tracking pedestrians, cyclists, and motorbikes in complex traffic scenarios. The approach effectively models spatial-temporal relationships, overcoming challenges like occlusion and low resolution. With superior precision and recall rates, this technology offers significant potential for intelligent transportation systems, urban planning, and enhancing road safety through advanced monitoring and analysis.

Authors: Shahriar Soudeep, M. F. Mridha, Md Abrar Jahin, Nilanjan Dey

Link: https://arxiv.org/abs/2411.17251v1

Date: 2024-11-26

Summary:

Accurate detection and tracking of small objects such as pedestrians, cyclists, and motorbikes are critical for traffic surveillance systems, which are crucial in improving road safety and decision-making in intelligent transportation systems. However, traditional methods struggle with challenges such as occlusion, low resolution, and dynamic traffic conditions, necessitating innovative approaches to address these limitations. This paper introduces DGNN-YOLO, a novel framework integrating dynamic graph neural networks (DGNN) with YOLO11 to enhance small object detection and tracking in traffic surveillance systems. The framework leverages YOLO11's advanced spatial feature extraction capabilities for precise object detection and incorporates DGNN to dynamically model spatial-temporal relationships for robust real-time tracking. By constructing and updating graph structures, DGNN-YOLO effectively represents objects as nodes and their interactions as edges, ensuring adaptive and accurate tracking in complex and dynamic environments. Extensive experiments demonstrate that DGNN-YOLO consistently outperforms state-of-the-art methods in detecting and tracking small objects under diverse traffic conditions, achieving the highest precision (0.8382), recall (0.6875), and mAP@0.5:0.95 (0.6476), showcasing its robustness and scalability, particularly in challenging scenarios involving small and occluded objects. This work provides a scalable, real-time traffic surveillance and analysis solution, significantly contributing to intelligent transportation systems.
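
The idea of representing detected objects as nodes and their interactions as edges can be sketched as follows. The abstract does not say how DGNN-YOLO defines an interaction, so the proximity rule, the `radius` parameter, and the function name here are assumptions chosen purely for illustration.

```python
import math
from itertools import combinations

def build_frame_graph(detections, radius):
    """Build a per-frame graph from bounding boxes.

    detections: dict mapping a track id to a box (x1, y1, x2, y2) in pixels.
    radius: hypothetical distance threshold below which two objects are
            considered to interact.
    Returns (nodes, edges): nodes maps id -> box centre, edges is a set of
    (id_a, id_b) pairs with id_a < id_b.
    """
    nodes = {
        tid: ((x1 + x2) / 2, (y1 + y2) / 2)
        for tid, (x1, y1, x2, y2) in detections.items()
    }
    edges = set()
    for a, b in combinations(sorted(nodes), 2):
        if math.dist(nodes[a], nodes[b]) <= radius:
            edges.add((a, b))
    return nodes, edges
```

Calling this function once per frame yields a sequence of graphs whose nodes and edges change as objects move, appear, or leave the scene, which is the kind of input a dynamic graph model would consume.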

--------------------------------------------------------------------------------------------------------

Near-Field Wideband Beamforming for RIS Based on Fresnel Zone

Wireless communication technologies are continuously evolving, and this research addresses challenges in millimeter-wave and terahertz communication systems. By leveraging Fresnel zone properties, the researchers developed an innovative beamforming approach for Reconfigurable Intelligent Surfaces (RIS) that mitigates near-field beam split effects. The method ensures consistent signal focus across different frequencies, potentially revolutionizing high-bandwidth wireless communications. With implications for 6G networks, satellite communications, and advanced wireless infrastructure, this research represents a significant step towards more reliable and efficient communication technologies.

Authors: Qiumo Yu, Linglong Dai

Link: https://arxiv.org/abs/2411.18878v1

Date: 2024-11-28

Summary:

Reconfigurable intelligent surface (RIS) has emerged as a promising solution to overcome the challenges of high path loss and easy signal blockage in millimeter-wave (mmWave) and terahertz (THz) communication systems. With the increase of RIS aperture and system bandwidth, the near-field beam split effect emerges, which causes beams at different frequencies to focus on distinct physical locations, leading to a significant loss of beamforming gain. To address this problem, we leverage the property of Fresnel zones that the beam split disappears for RIS elements along a single Fresnel zone, and propose a beamforming design in two dimensions: along and across the Fresnel zones. The phase shifts of RIS elements along the same Fresnel zone are designed to be aligned, so that the signals reflected by these elements add up in phase at the receiver regardless of frequency. Then the expression of the equivalent channel is simplified to the Fourier transform of the reflective intensity across Fresnel zones modulated by the designed phase. Based on this relationship, we prove that a uniformly distributed in-band gain with aligned phase along the Fresnel zone leads to the upper bound of the achievable rate. Finally, we design the phase shifts of the RIS to approach this upper bound by adopting the stationary phase method as well as the Gerchberg-Saxton (GS) algorithm. Simulation results validate the effectiveness of our proposed Fresnel zone-based method in mitigating the near-field beam split effect.
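
The frequency independence along a Fresnel zone can be checked numerically with a heavily simplified sketch. It assumes a scalar free-space model with unit-amplitude reflections, and idealizes "elements along the same Fresnel zone" as elements with equal transmitter-to-element-to-receiver path length. All positions, frequencies, and function names are hypothetical and are not taken from the paper.

```python
import cmath
import math

C = 299_792_458.0  # speed of light in m/s

def path_length(tx, element, rx):
    """Total transmitter -> element -> receiver distance in metres."""
    return math.dist(tx, element) + math.dist(element, rx)

def normalized_gain(tx, rx, elements, phase_shifts, freq):
    """Magnitude of the coherent sum over elements, divided by their count."""
    total = 0j
    for element, shift in zip(elements, phase_shifts):
        propagation = -2 * math.pi * freq * path_length(tx, element, rx) / C
        total += cmath.exp(1j * (propagation + shift))
    return abs(total) / len(elements)

def center_frequency_shifts(tx, rx, elements, center_freq):
    """Phase shifts that cancel the propagation phase at center_freq only."""
    return [2 * math.pi * center_freq * path_length(tx, e, rx) / C
            for e in elements]

# Hypothetical geometry: RIS in the plane z = 0, terminals 3 m above it.
TX, RX = (-2.0, 0.0, 3.0), (2.0, 0.0, 3.0)
# Four elements with identical total path length (same zone, by symmetry).
SAME_ZONE = [(1.0, 0.5, 0.0), (-1.0, 0.5, 0.0), (1.0, -0.5, 0.0), (-1.0, -0.5, 0.0)]
# Four elements with different total path lengths (spread across zones).
ACROSS_ZONES = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
FC, OFFSET = 30e9, 1e9  # assumed carrier frequency and in-band offset in Hz
```

With identical phase shifts, `normalized_gain(TX, RX, SAME_ZONE, [0.0] * 4, f)` stays at 1 for any `f`, because all four paths share the same delay. For `ACROSS_ZONES`, shifts from `center_frequency_shifts` give a gain of 1 at `FC`, but the gain falls well below 1 at `FC + OFFSET`. That frequency-dependent loss is a toy version of the beam split the abstract describes.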

--------------------------------------------------------------------------------------------------------

Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding

Speeding up large language model inference is crucial for practical AI applications. This research introduces SVIP, a dynamic draft length policy for speculative decoding that adaptively determines sequence lengths based on token generation difficulty. By considering entropy in draft token distributions, the approach achieves significant walltime speedups in text generation tasks. The training-free method is compatible with existing speculative decoding techniques, offering a flexible solution for improving AI model performance. This research has broad implications for making AI text generation more efficient and responsive.

Authors: Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, Zhaopeng Tu

Link: https://arxiv.org/abs/2411.18462v1

Date: 2024-11-27

Summary:

Speculative Decoding (SD) has become an important technique in accelerating the inference speed of large language models. Conventional SD methods employ a fixed draft length, which ignores the token generation difficulty across tasks. Consequently, in this paper, we address such an issue and introduce SVIP - a difficulty-aware dynamic draft length policy for speculative decoding systems. Based on a theoretical lower bound of draft token acceptance rate and its inference-time approximation, SVIP adaptively determines the lengths of draft sequences based on the entropy of each draft token distribution. Experimental results on mainstream SD benchmarks and frameworks demonstrate the superior performance of SVIP, achieving up to 20% walltime speedup on SpecBench over baseline SD methods and 60% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is totally training-free and compatible with any existing SD methods that generate draft tokens autoregressively. Experimental results also show that SVIP yields consistent walltime improvement on top of GliDe & CaPE and EAGLE-2.
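
The general idea of an entropy-driven draft length can be sketched as below. This is a simplified stand-in and not SVIP's actual rule: the abstract says the policy is derived from a lower bound on the acceptance rate, whereas this sketch simply stops drafting when the draft distribution's entropy exceeds a fixed threshold. The `draft_step` callback, the threshold, and the function names are all assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def draft_until_uncertain(draft_step, max_len, entropy_threshold):
    """Draft tokens until the draft model looks uncertain.

    draft_step: hypothetical callable taking the tokens drafted so far and
                returning (next_token, probs) from the draft model.
    max_len: hard cap on the draft length.
    entropy_threshold: assumed cut-off; a high-entropy distribution is
                       treated as a sign that the next token is hard to guess.
    Returns the list of drafted tokens to hand to the target model.
    """
    drafted = []
    for _ in range(max_len):
        token, probs = draft_step(drafted)
        if entropy(probs) > entropy_threshold:
            break  # stop here and let the target model verify what we have
        drafted.append(token)
    return drafted
```

Easy continuations, where the draft model is confident, produce long drafts, while hard ones hand control back to the target model early instead of wasting draft steps on tokens that are likely to be rejected.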

--------------------------------------------------------------------------------------------------------

Artificial Intelligence, Research Watch | Craig Smith | December 2, 2024