Week Ending 2.9.2025
RESEARCH WATCH: 2.9.2025
Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
Multimodal large language models (MLLMs) struggle to interpret time visually, a fundamental cognitive skill. This research investigates MLLMs' capabilities in reading analogue clocks and yearly calendars through two curated datasets: ClockQA (various clock styles) and CalendarQA (calendar images with time-related questions). By testing visual recognition, numerical reasoning, and temporal inference, the study reveals significant challenges in time understanding for MLLMs. The research provides insights into machine perception of time, highlighting the complexity of translating visual time representations into comprehensible information across different clock designs and calendar formats.
Authors: Rohit Saxena, Aryo Pradipta Gema, Pasquale Minervini
Link: https://arxiv.org/abs/2502.05092v1
Date: 2025-02-07
Summary:
Understanding time from visual representations is a fundamental cognitive skill, yet it remains a challenge for multimodal large language models (MLLMs). In this work, we investigate the capabilities of MLLMs in interpreting time and date through analogue clocks and yearly calendars. To facilitate this, we curated a structured dataset comprising two subsets: 1) ClockQA, which covers various clock styles (standard, black-dial, no-second-hand, Roman numeral, and arrow-hand clocks) paired with time-related questions; and 2) CalendarQA, which consists of yearly calendar images with questions ranging from commonly known dates (e.g., Christmas, New Year's Day) to computationally derived ones (e.g., the 100th or 153rd day of the year). We aim to analyse how MLLMs perform visual recognition, numerical reasoning, and temporal inference when presented with time-related visual data. Our evaluations show that despite recent advancements, reliably understanding time remains a significant challenge for MLLMs.
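The "computationally derived" CalendarQA questions have ground truth that is easy to generate programmatically; the snippet below is an illustrative sketch of that arithmetic, not the dataset's construction code. The benchmark's difficulty lies in the model having to recover the same answer from a rendered calendar image.

```python
from datetime import date, timedelta

def nth_day_of_year(year: int, n: int) -> date:
    """Return the calendar date of the n-th day of the given year (1-indexed)."""
    return date(year, 1, 1) + timedelta(days=n - 1)

# Ground truth for questions like "What is the 100th (or 153rd) day of 2025?"
print(nth_day_of_year(2025, 100))  # 2025-04-10
print(nth_day_of_year(2025, 153))  # 2025-06-02
```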
--------------------------------------------------------------------------------------------------------
Vision-language-action (VLA) models show promise in robotics by translating visual and linguistic inputs into actions, but lack reliability. This study bridges VLA models with cognitive architectures by probing OpenVLA's hidden layers to uncover symbolic representations of object properties, relations, and action states. Experiments on pick-and-place tasks demonstrated high accuracy in encoding symbolic states across model layers. The research introduces an integrated DIARC-OpenVLA system that enables real-time state monitoring, potentially improving interpretability and robustness in robotic manipulation through enhanced understanding of internal model representations.
Authors: Hong Lu, Hengxu Li, Prithviraj Singh Shahani, Stephanie Herbers, Matthias Scheutz
Link: https://arxiv.org/abs/2502.04558v1
Date: 2025-02-06
Summary:
Vision-language-action (VLA) models hold promise as generalist robotics solutions by translating visual and linguistic inputs into robot actions, yet they lack reliability due to their black-box nature and sensitivity to environmental changes. In contrast, cognitive architectures (CA) excel in symbolic reasoning and state monitoring but are constrained by rigid predefined execution. This work bridges these approaches by probing OpenVLA's hidden layers to uncover symbolic representations of object properties, relations, and action states, enabling integration with a CA for enhanced interpretability and robustness. Through experiments on LIBERO-spatial pick-and-place tasks, we analyze the encoding of symbolic states across different layers of OpenVLA's Llama backbone. Our probing results show consistently high accuracies (> 0.90) for both object and action states across most layers, though contrary to our hypotheses, we did not observe the expected pattern of object states being encoded earlier than action states. We demonstrate an integrated DIARC-OpenVLA system that leverages these symbolic representations for real-time state monitoring, laying the foundation for more interpretable and reliable robotic manipulation.
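The layer-wise probing recipe can be sketched generically: cache hidden states from each layer of the backbone, then fit a simple classifier per layer against the symbolic labels. The snippet below is a hedged illustration with placeholder data and a hypothetical predicate name, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on one layer's activations and report held-out accuracy.

    hidden_states: (n_samples, d_model) activations cached from a single layer.
    labels: (n_samples,) binary symbolic state, e.g. a hypothetical
            "object_in_gripper" predicate read from the simulator.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# accuracies = [probe_layer(cached[layer], labels) for layer in range(num_layers)]
```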
--------------------------------------------------------------------------------------------------------
Hepatocellular carcinoma (HCC) is a leading cause of cancer mortality, with early detection crucial for survival. This study introduces the Hierarchical Sparse Query Transformer (HSQformer), an AI model combining convolutional and vision transformer technologies to enhance ultrasound HCC diagnosis. By leveraging sparse latent space representations, the model captures hierarchical details without complex adjustments. Tested across single-center, multi-center, and high-risk scenarios, HSQformer outperformed existing models and matched senior radiologists' diagnostic capabilities, demonstrating significant potential for improving early HCC screening and potentially saving lives through more accurate medical imaging analysis.
Authors: Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Lida Chen, Wei Wang, Qinghua Huang
Link: https://arxiv.org/abs/2502.03772v1
Date: 2025-02-06
Summary:
Hepatocellular carcinoma (HCC) ranks as the third leading cause of cancer-related mortality worldwide, with early detection being crucial for improving patient survival rates. However, early screening for HCC using ultrasound suffers from insufficient sensitivity and is highly dependent on the expertise of radiologists for interpretation. Leveraging the latest advancements in artificial intelligence (AI) in medical imaging, this study proposes an innovative Hierarchical Sparse Query Transformer (HSQformer) model that combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to enhance the accuracy of HCC diagnosis in ultrasound screening. The HSQformer leverages sparse latent space representations to capture hierarchical details at various granularities without the need for complex adjustments, and adopts a modular, plug-and-play design philosophy, ensuring the model's versatility and ease of use. The HSQformer's performance was rigorously tested across three distinct clinical scenarios: single-center, multi-center, and high-risk patient testing. In each of these settings, it consistently outperformed existing state-of-the-art models, such as ConvNext and SwinTransformer. Notably, the HSQformer even matched the diagnostic capabilities of senior radiologists and comprehensively surpassed those of junior radiologists. The experimental results from this study strongly demonstrate the effectiveness and clinical potential of AI-assisted tools in HCC screening. The full code is available at https://github.com/Asunatan/HSQformer.
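A generic "sparse latent query" block, in which a small set of learnable queries cross-attends to backbone features, can be sketched as follows. Dimensions, query counts, and layer composition are illustrative assumptions; the released implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class SparseQueryBlock(nn.Module):
    """A few learnable latent queries cross-attend to CNN/ViT feature tokens."""
    def __init__(self, dim: int = 256, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, dim) features from the convolutional or ViT branch
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        latent, _ = self.attn(q, feats, feats)   # sparse queries attend over dense features
        return latent + self.ffn(latent)         # (batch, num_queries, dim)
```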
--------------------------------------------------------------------------------------------------------
A Scalable Approach to Probabilistic Neuro-Symbolic Verification
Neuro-Symbolic AI integrates neural learning with symbolic reasoning, but verifying such systems is challenging. This research addresses the complexity of verifying probabilistic reasoning systems by proposing an approximate, relaxation-based verification approach. The study shows that exact verification is computationally hard and that the proposed technique scales exponentially better than solver-based solutions. By applying the technique to a real-world autonomous driving dataset, the research provides a pathway to formally verifying the robustness of probabilistic neuro-symbolic systems, potentially improving safety and reliability in critical domains like autonomous technology.
Authors: Vasileios Manginas, Nikolaos Manginas, Edward Stevinson, Sherwin Varghese, Nikos Katzouris, Georgios Paliouras, Alessio Lomuscio
Link: https://arxiv.org/abs/2502.03274v1
Date: 2025-02-05
Summary:
Neuro-Symbolic Artificial Intelligence (NeSy AI) has emerged as a promising direction for integrating neural learning with symbolic reasoning. In the probabilistic variant of such systems, a neural network first extracts a set of symbols from sub-symbolic input, which are then used by a symbolic component to reason in a probabilistic manner towards answering a query. In this work, we address the problem of formally verifying the robustness of such NeSy probabilistic reasoning systems, therefore paving the way for their safe deployment in critical domains. We analyze the complexity of solving this problem exactly, and show that it is $\mathrm{NP}^{\# \mathrm{P}}$-hard. To overcome this issue, we propose the first approach for approximate, relaxation-based verification of probabilistic NeSy systems. We demonstrate experimentally that the proposed method scales exponentially better than solver-based solutions and apply our technique to a real-world autonomous driving dataset, where we verify a safety property under large input dimensionalities and network sizes.
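To make the setting concrete, a toy probabilistic NeSy forward pass looks like the following: a network outputs symbol probabilities, and the symbolic layer computes the query probability by weighted model counting. This illustrates only the semantics being verified, not the paper's relaxation-based verifier.

```python
# Toy example: two symbols extracted by a network, assumed independent.
# Query: "must_brake" := red_light OR pedestrian.
def query_probability(p_red_light: float, p_pedestrian: float) -> float:
    return 1.0 - (1.0 - p_red_light) * (1.0 - p_pedestrian)

print(query_probability(0.9, 0.2))  # 0.92
# Verification asks: does this query probability stay above a safety threshold
# for every perturbed input the network might see within a given norm ball?
```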
--------------------------------------------------------------------------------------------------------
Intelligent Sensing-to-Action for Robust Autonomy at the Edge: Opportunities and Challenges
Autonomous edge computing in robotics, smart cities, and vehicles relies on seamless sensing-to-action loops for real-time decision-making. This article explores how proactive, context-aware adaptations can enhance efficiency by dynamically adjusting sensing and computation. By investigating multi-agent sensing-action loops and neuromorphic computing principles, the research highlights strategies for optimizing resource use, reducing latency, and improving cross-layer interdependencies. The study emphasizes the importance of end-to-end co-design that aligns algorithmic models with hardware and environmental dynamics, potentially revolutionizing energy-efficient autonomy in complex, dynamic environments.
Authors: Amit Ranjan Trivedi, Sina Tayebati, Hemant Kumawat, Nastaran Darabi, Divake Kumar, Adarsh Kumar Kosta, Yeshwanth Venkatesha, Dinithi Jayasuriya, Nethmi Jayasinghe, Priyadarshini Panda, Saibal Mukhopadhyay, Kaushik Roy
Link: https://arxiv.org/abs/2502.02692v1
Date: 2025-02-04
Summary:
Autonomous edge computing in robotics, smart cities, and autonomous vehicles relies on the seamless integration of sensing, processing, and actuation for real-time decision-making in dynamic environments. At its core is the sensing-to-action loop, which iteratively aligns sensor inputs with computational models to drive adaptive control strategies. These loops can adapt to hyper-local conditions, enhancing resource efficiency and responsiveness, but also face challenges such as resource constraints, synchronization delays in multi-modal data fusion, and the risk of cascading errors in feedback loops. This article explores how proactive, context-aware sensing-to-action and action-to-sensing adaptations can enhance efficiency by dynamically adjusting sensing and computation based on task demands, such as sensing a very limited part of the environment and predicting the rest. By guiding sensing through control actions, action-to-sensing pathways can improve task relevance and resource use, but they also require robust monitoring to prevent cascading errors and maintain reliability. Multi-agent sensing-action loops further extend these capabilities through coordinated sensing and actions across distributed agents, optimizing resource use via collaboration. Additionally, neuromorphic computing, inspired by biological systems, provides an efficient framework for spike-based, event-driven processing that conserves energy, reduces latency, and supports hierarchical control--making it ideal for multi-agent optimization. This article highlights the importance of end-to-end co-design strategies that align algorithmic models with hardware and environmental dynamics and improve cross-layer interdependencies to improve throughput, precision, and adaptability for energy-efficient edge autonomy in complex environments.
--------------------------------------------------------------------------------------------------------
Sample Complexity of Bias Detection with Subsampled Point-to-Subspace Distances
Bias detection under regulatory frameworks faces computational challenges because the number of subgroups to test grows exponentially with the number of protected attributes. This research reformulates bias detection as a point-to-subspace problem on the space of measures and shows that, for the supremum norm, it can be subsampled efficiently. By addressing the complexity of testing bias across many subgroups, the study provides probabilistically approximately correct (PAC) guarantees for bias detection. The approach is particularly relevant when reference data comes from surveys with inherent uncertainties, offering a more computationally feasible method for identifying potential biases in datasets and machine learning models.
Authors: German Martinez Matilla, Jakub Marecek
Link: https://arxiv.org/abs/2502.02623v1
Date: 2025-02-04
Summary:
Sample complexity of bias estimation is a lower bound on the runtime of any bias detection method. Many regulatory frameworks require the bias to be tested for all subgroups, whose number grows exponentially with the number of protected attributes. Unless one wishes to run bias detection with a doubly-exponential run-time, one would like the complexity of bias detection for a single subgroup to be polynomial. At the same time, the reference data may be based on surveys, and thus come with non-trivial uncertainty. Here, we reformulate bias detection as a point-to-subspace problem on the space of measures and show that, for the supremum norm, it can be subsampled efficiently. In particular, our probabilistically approximately correct (PAC) results are corroborated by tests on well-known instances.
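The subsampling idea can be illustrated with a toy discrete example: the supremum-norm gap between a subgroup's empirical distribution and a reference can be estimated from a random subset of categories, and the estimate concentrates as the subsample grows. This is a simplified illustration of the PAC-style behaviour the paper analyses, not its algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_norm_gap(p: np.ndarray, q: np.ndarray) -> float:
    """Exact supremum-norm distance between two discrete distributions."""
    return float(np.max(np.abs(p - q)))

def subsampled_gap(p: np.ndarray, q: np.ndarray, k: int) -> float:
    """Estimate the gap using only k randomly chosen categories."""
    idx = rng.choice(len(p), size=k, replace=False)
    return float(np.max(np.abs(p[idx] - q[idx])))

p = rng.dirichlet(np.ones(10_000))   # subgroup distribution
q = rng.dirichlet(np.ones(10_000))   # survey-based reference
print(sup_norm_gap(p, q), subsampled_gap(p, q, k=500))
```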
--------------------------------------------------------------------------------------------------------
Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning
Reinforcement learning from human feedback (RLHF) is crucial for aligning generative AI models. This study develops a novel approach to fine-tune diffusion models using continuous-time reinforcement learning, treating score matching as controls or actions. By formulating the process as a stochastic control problem, the research creates a new policy optimization framework for continuous-time RL. The method was validated on Text2Image models, demonstrating potential for enhancing generative AI's ability to align with input prompts and improve overall model performance.
Authors: Hanyang Zhao, Haoxian Chen, Ji Zhang, David D. Yao, Wenpin Tang
Link: https://arxiv.org/abs/2502.01819v1
Date: 2025-02-03
Summary:
Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with the input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to induced errors and is often not applicable to models with higher-order/black-box solvers. The objective of this study is to develop a disciplined approach to fine-tune diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, thereby making connections to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL, and illustrate its potential in enhancing the value network design space by leveraging the structural properties of diffusion models. We validate the advantages of our method by experiments in downstream tasks of fine-tuning the large-scale Text2Image model Stable Diffusion v1.5.
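Schematically, the continuous-time view treats the learned score as the control in the reverse-time dynamics and optimizes a terminal reward. A hedged sketch of this kind of stochastic-control objective (notation illustrative, not necessarily the paper's exact formulation):

$$dX_t = \big[f(X_t,t) + g(t)^2\, a_\theta(X_t,t)\big]\,dt + g(t)\,dW_t, \qquad \max_\theta\; \mathbb{E}\big[r(X_T;\,\text{prompt})\big] - \beta\,\mathrm{Reg}(a_\theta),$$

where the control $a_\theta$ plays the role of the score network being fine-tuned, $r$ scores how well the terminal sample matches the prompt, and $\mathrm{Reg}$ keeps the fine-tuned policy close to the pre-trained score.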
--------------------------------------------------------------------------------------------------------
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Retrieval-Augmented Generation (RAG) has primarily focused on textual content, leaving multi-modal video knowledge unexplored. VideoRAG introduces a framework for processing extremely long-context videos through a dual-channel architecture. By integrating graph-based textual knowledge grounding and multi-modal context encoding, the system can process unlimited-length videos while maintaining semantic dependencies. Evaluated on a benchmark of 160+ videos totaling 134+ hours, VideoRAG shows significant performance improvements in long video understanding, potentially revolutionizing how AI processes and comprehends extensive video content.
Authors: Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
Link: https://arxiv.org/abs/2502.01549v1
Date: 2025-02-03
Summary:
Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies through specialized multi-modal retrieval paradigms. Through comprehensive empirical evaluation on our proposed LongerVideos benchmark (comprising over 160 videos totaling 134+ hours across lecture, documentary, and entertainment categories), VideoRAG demonstrates substantial performance improvements over existing RAG alternatives and long video understanding methods. The source code of VideoRAG implementation and the benchmark dataset are openly available at: https://github.com/HKUDS/VideoRAG.
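At answer time, a query is matched against both channels and the results are fused into the LLM's context. The toy sketch below uses synthetic embeddings and cosine similarity purely to illustrate that dual-channel retrieval step; the actual graph construction, indices, and APIs are in the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: graph-grounded text snippets and per-clip visual embeddings.
text_snippets = ["lecture segment on diffusion models", "documentary segment on coral reefs"]
text_embs = rng.normal(size=(len(text_snippets), 64))
clip_embs = rng.normal(size=(10, 64))  # 10 video clips

def top_k(query_emb: np.ndarray, embs: np.ndarray, k: int) -> np.ndarray:
    sims = embs @ query_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb))
    return np.argsort(-sims)[:k]

query_emb = rng.normal(size=64)  # would come from embedding the user question
context = {
    "text": [text_snippets[i] for i in top_k(query_emb, text_embs, 1)],
    "clip_ids": top_k(query_emb, clip_embs, 3).tolist(),
}
# 'context' is what would be handed to the LLM for grounded answer generation.
```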
--------------------------------------------------------------------------------------------------------
A Statistical Learning Perspective on Semi-dual Adversarial Neural Optimal Transport Solvers
Neural network-based Optimal Transport (OT) is an emerging field with applications in domain translation, image processing, and computational biology. This research provides a theoretical investigation of adversarial minimax OT solvers from a statistical learning perspective. By establishing upper bounds on generalization error for approximate OT maps, the study offers insights into the mathematical properties of neural network-based OT methods. The work paves the way for more rigorous understanding of these solvers, potentially improving their reliability and performance across various domains.
Authors: Roman Tarasov, Petr Mokrov, Milena Gazdieva, Evgeny Burnaev, Alexander Korotin
Link: https://arxiv.org/abs/2502.01310v1
Date: 2025-02-03
Summary:
Neural network based Optimal Transport (OT) is a recent and fruitful direction in the generative modeling community. It finds its applications in various fields such as domain translation, image super-resolution, computational biology and others. Among the existing approaches to OT, of considerable interest are adversarial minimax solvers based on semi-dual formulations of OT problems. While promising, these methods lack theoretical investigation from a statistical learning perspective. Our work fills this gap by establishing upper bounds on the generalization error of an approximate OT map recovered by the minimax quadratic OT solver. Importantly, the bounds we derive depend solely on some standard statistical and mathematical properties of the considered functional classes (neural networks). While our analysis focuses on the quadratic OT, we believe that similar bounds could be derived for more general OT formulations, paving the promising direction for future research.
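For the quadratic cost analysed here, adversarial semi-dual solvers are typically built around a max-min objective over a potential $f$ and a transport map $T$, roughly of the form (a standard formulation stated from general knowledge, not quoted from the paper):

$$\sup_{f}\; \inf_{T}\; \int \Big[\tfrac{1}{2}\,\lVert x - T(x)\rVert^2 - f\big(T(x)\big)\Big]\, d\mathbb{P}(x) \;+\; \int f(y)\, d\mathbb{Q}(y),$$

with both $f$ and $T$ parameterized by neural networks; the paper's bounds control the generalization error of the map $T$ recovered from finite samples of $\mathbb{P}$ and $\mathbb{Q}$.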
--------------------------------------------------------------------------------------------------------
Layer by Layer: Uncovering Hidden Representations in Language Models
Conventional wisdom suggests that final layers of language models are most important for feature extraction and text generation. This research challenges this notion, demonstrating that intermediate layers can encode richer representations. By developing a unified framework of representation quality metrics, the study reveals how model layers balance information compression and signal preservation. The findings suggest that mid-depth embeddings can outperform final-layer outputs, opening new directions for model analysis and optimization across different architectures and domains.
Authors: Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv
Link: https://arxiv.org/abs/2502.02013v1
Date: 2025-02-04
Summary:
From extracting features to generating text, the outputs of large language models (LLMs) typically rely on their final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a wide range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each model layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer's performance. Through extensive experiments on 32 text-embedding tasks and comparisons across model architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features. These findings challenge the standard focus on final-layer embeddings and open new directions for model analysis and optimization, including strategic use of mid-layer representations for more robust and accurate AI systems.
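Extracting mid-layer embeddings to compare against the final layer is straightforward with standard tooling; a minimal sketch (the model name and mean-pooling choice are illustrative, not the paper's setup) is:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

texts = ["an example sentence", "another example"]
batch = tok(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# out.hidden_states is a tuple: (embedding layer, layer 1, ..., layer L).
mask = batch["attention_mask"].unsqueeze(-1)
layer_embeddings = [
    (h * mask).sum(1) / mask.sum(1)   # mean-pool each layer over valid tokens
    for h in out.hidden_states
]
# Evaluating each entry of layer_embeddings on a downstream task lets one compare
# mid-depth representations against the final layer.
```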
--------------------------------------------------------------------------------------------------------
Strategic Learning with Local Explanations as Feedback
This research explores algorithmic decision problems where agents respond strategically to decision makers' models. As demand grows for clear, actionable explanations, the study investigates how partial model disclosures can maximize utility without harming agents. By examining local and global explanation methods, the research proposes a framework for safe and effective model disclosure. The approach aims to balance decision maker outcomes with agent welfare, potentially revolutionizing how AI systems provide explanations and interact strategically in complex decision-making environments.
Authors: Kiet Q. H. Vo, Siu Lun Chau, Masahiro Kato, Yixin Wang, Krikamol Muandet
Link: https://arxiv.org/abs/2502.04058v1
Date: 2025-02-06
Summary:
We investigate algorithmic decision problems where agents can respond strategically to the decision maker's (DM) models. The demand for clear and actionable explanations from DMs to (potentially strategic) agents continues to rise. While prior work often treats explanations as full model disclosures, explanations in practice might convey only partial information, which can lead to misinterpretations and harmful responses. When full disclosure of the predictive model is neither feasible nor desirable, a key open question is how DMs can use explanations to maximise their utility without compromising agent welfare. In this work, we explore well-known local and global explanation methods, and establish a necessary condition to prevent explanations from misleading agents into self-harming actions. Moreover, with conditional homogeneity, we establish that action recommendation (AR)-based explanations are sufficient for non-harmful responses, akin to the revelation principle in information design. To operationalise AR-based explanations, we propose a simple algorithm to jointly optimise the predictive model and AR policy to balance DM outcomes with agent welfare. Our empirical results demonstrate the benefits of this approach as a more refined strategy for safe and effective partial model disclosure in algorithmic decision-making.
--------------------------------------------------------------------------------------------------------
Position: Emergent Machina Sapiens Urge Rethinking Multi-Agent Paradigms
As AI systems become more autonomous, challenges emerge in coordinating unaligned agents in shared environments. This paper advocates for reimagining multi-agent frameworks beyond predefined rules and static objectives. The researchers propose that AI agents should dynamically adjust goals, form coalitions, and evolve relationships through social feedback. By emphasizing self-organizing and context-aware system design, the study calls for a fundamental shift in understanding how autonomous AI agents can coexist and interact meaningfully in complex, collaborative scenarios.
Authors: Hepeng Li, Yuhong Liu, Jun Yan
Link: https://arxiv.org/abs/2502.04388v1
Date: 2025-02-05
Summary:
Artificially intelligent (AI) agents that are capable of autonomous learning and independent decision-making hold great promise for addressing complex challenges across domains like transportation, energy systems, and manufacturing. However, the surge in AI systems' design and deployment driven by various stakeholders with distinct and unaligned objectives introduces a crucial challenge: how can uncoordinated AI systems coexist and evolve harmoniously in shared environments without creating chaos? To address this, we advocate for a fundamental rethinking of existing multi-agent frameworks, such as multi-agent systems and game theory, which are largely limited to predefined rules and static objective structures. We posit that AI agents should be empowered to dynamically adjust their objectives, make compromises, form coalitions, and safely compete or cooperate through evolving relationships and social feedback. Through this paper, we call for a shift toward the emergent, self-organizing, and context-aware nature of these systems.
--------------------------------------------------------------------------------------------------------
Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator
Current preference optimization methods lack robust theoretical justification for sampling dispreferred completions. This research develops a novel framework by formulating preference optimization as minimizing negative log-likelihood and estimating normalization constants through sampling strategies. By proposing the Monte Carlo Preference Optimization (MC-PO) algorithm, the study introduces a method for more effectively selecting and sampling hard negative examples in AI model training, potentially improving alignment and performance across various benchmarks.
Authors: Zhuotong Chen, Fang Liu, Xuan Zhu, Yanjun Qi, Mohammad Ghavamzadeh
Link: https://arxiv.org/abs/2502.04567v1
Date: 2025-02-06
Summary:
Existing studies on preference optimization (PO) have centered on constructing pairwise preference data following simple heuristics, such as maximizing the margin between preferred and dispreferred completions based on human (or AI) ranked scores. However, none of these heuristics has a full theoretical justification. In this work, we develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose to estimate its normalization constant via a sampling strategy. As we will demonstrate, these estimative samples can act as dispreferred completions in PO. We then select contrastive divergence (CD) as the sampling strategy, and propose a novel MC-PO algorithm that applies the Monte Carlo (MC) kernel from CD to sample hard negatives w.r.t. the parameterized reward model. Finally, we propose the OnMC-PO algorithm, an extension of MC-PO to the online setting. On popular alignment benchmarks, MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
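The NLL view can be stated compactly. Writing the reward model as a Boltzmann distribution over completions (standard energy-based-model algebra; notation may differ from the paper's):

$$p_\theta(y \mid x) = \frac{\exp\big(r_\theta(x,y)\big)}{Z_\theta(x)}, \qquad -\nabla_\theta \log p_\theta(y^{+}\mid x) = -\nabla_\theta r_\theta(x, y^{+}) + \mathbb{E}_{y \sim p_\theta(\cdot\mid x)}\big[\nabla_\theta r_\theta(x, y)\big].$$

The intractable expectation arises from the normalization constant $Z_\theta(x)$; estimating it with samples is what turns those samples into dispreferred completions, and MC-PO draws such hard negatives with the Monte Carlo kernel from contrastive divergence.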
--------------------------------------------------------------------------------------------------------
Ensuring Reliability via Hyperparameter Selection: Review and Advances
Hyperparameter selection is critical in deploying AI models, especially for pre-trained systems. This paper reviews the Learn-Then-Test framework, exploring extensions for providing statistical guarantees on population risk measures. By approaching hyperparameter selection as a multiple hypothesis testing problem, the research offers insights into optimizing model selection across various scenarios. The study includes applications in communication systems and provides a theoretical foundation for more reliable and statistically grounded model development.
Authors: Amirmohammad Farzaneh, Osvaldo Simeone
Link: https://arxiv.org/abs/2502.04206v1
Date: 2025-02-06
Summary:
Hyperparameter selection is a critical step in the deployment of artificial intelligence (AI) models, particularly in the current era of foundational, pre-trained, models. By framing hyperparameter selection as a multiple hypothesis testing problem, recent research has shown that it is possible to provide statistical guarantees on population risk measures attained by the selected hyperparameter. This paper reviews the Learn-Then-Test (LTT) framework, which formalizes this approach, and explores several extensions tailored to engineering-relevant scenarios. These extensions encompass different risk measures and statistical guarantees, multi-objective optimization, the incorporation of prior knowledge and dependency structures into the hyperparameter selection process, as well as adaptivity. The paper also includes illustrative applications for communication systems.
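A minimal LTT-style loop, using a Hoeffding p-value and a Bonferroni correction for family-wise error control, looks roughly as follows; the risk function, candidate grid, and correction are illustrative choices, not the specific extensions reviewed in the paper.

```python
import numpy as np

def hoeffding_p_value(emp_risk: float, alpha: float, n: int) -> float:
    """P-value for H0: true risk > alpha, given the empirical risk on n i.i.d.
    samples of a [0, 1]-bounded loss (Hoeffding's inequality)."""
    gap = max(0.0, alpha - emp_risk)
    return float(np.exp(-2.0 * n * gap ** 2))

def learn_then_test(candidates, emp_risks, n, alpha=0.1, delta=0.05):
    """Return hyperparameters certified to have risk <= alpha with probability
    >= 1 - delta (Bonferroni over the candidate grid)."""
    threshold = delta / len(candidates)
    return [lam for lam, r in zip(candidates, emp_risks)
            if hoeffding_p_value(r, alpha, n) <= threshold]

# e.g. candidates = decision thresholds, emp_risks measured on a held-out set:
print(learn_then_test([0.3, 0.5, 0.7], [0.04, 0.08, 0.15], n=2000))  # -> [0.3]
```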
--------------------------------------------------------------------------------------------------------
ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization
The optimal bit-width for balancing model size and accuracy remains debated. ParetoQ introduces a unified framework comparing quantization across 1 to 4 bits, revealing a significant learning transition between 2 and 3 bits. The research demonstrates that ternary and 2-bit quantization can maintain performance while significantly reducing model size and computational requirements. By optimizing training schemes and quantization functions, the study offers promising approaches for creating more efficient large language models with minimal performance loss.
Authors: Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra
Link: https://arxiv.org/abs/2502.02631v1
Date: 2025-02-04
Summary:
The optimal bit-width for achieving the best trade-off between quantized model size and accuracy has been a subject of ongoing debate. While some advocate for 4-bit quantization, others propose that 1.58-bit offers superior results. However, the lack of a cohesive framework for different bits has left such conclusions relatively tenuous. We present ParetoQ, the first unified framework that facilitates rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. Our findings reveal a notable learning transition between 2 and 3 bits: For 3-bits and above, the fine-tuned models stay close to their original pre-trained distributions, whereas for learning 2-bit networks or below, the representations change drastically. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths. Remarkably, our ParetoQ ternary 600M-parameter model even outperforms the previous SoTA ternary 3B-parameter model in accuracy, using only one-fifth of the parameters. Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off and generally exceeds 4-bit and binary quantization. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
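A toy ternary quantizer with a straight-through estimator shows the basic mechanics being compared across bit widths; the threshold and scale below are simple illustrative choices rather than ParetoQ's refined quantization functions.

```python
import torch

def ternary_quantize(w: torch.Tensor) -> torch.Tensor:
    """Map latent weights to {-1, 0, +1} * scale in the forward pass while
    letting gradients flow to the latent weights (straight-through estimator)."""
    scale = w.abs().mean()
    q = torch.sign(w) * (w.abs() > 0.5 * scale).float() * scale
    return w + (q - w).detach()  # forward: q; backward: identity w.r.t. w

w = torch.randn(4, 4, requires_grad=True)
ternary_quantize(w).sum().backward()  # gradients reach the full-precision weights
print(w.grad.shape)                   # torch.Size([4, 4])
```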
--------------------------------------------------------------------------------------------------------
Regularized interpolation in 4D neural fields enables optimization of 3D printed geometries
3D printing faces challenges in producing geometries with precise properties. This research encodes volumetric representations into neural fields using a novel regularization strategy that minimizes output variations. By encouraging smooth interpolation between observed volumes, the approach allows extraction of 'imagined' 3D shapes under different manufacturing parameters. The framework enables data-driven optimization of geometric fidelity, potentially reducing post-processing, material waste, and production costs while helping manufacturers realize complex and feature-rich designs.
Authors: Christos Margadji, Andi Kuswoyo, Sebastian W. Pattinson
Link: https://arxiv.org/abs/2502.01517v1
Date: 2025-02-03
Summary:
The ability to accurately produce geometries with specified properties is perhaps the most important characteristic of a manufacturing process. 3D printing is marked by exceptional design freedom and complexity but is also prone to geometric and other defects that must be resolved for it to reach its full potential. Ultimately, this will require both astute design decisions and timely parameter adjustments to maintain stability that is challenging even with expert human operators. While machine learning is widely investigated in 3D printing, existing methods typically overlook spatial features that vary across prints and thus find it difficult to produce desired geometries. Here, we encode volumetric representations of printed parts into neural fields and apply a new regularization strategy, based on minimizing the partial derivative of the field's output with respect to a single, non-learnable parameter. By thus encouraging small input changes to yield only small output variations, we encourage smooth interpolation between observed volumes and hence realistic geometry predictions. This framework therefore allows the extraction of 'imagined' 3D shapes, revealing how a part would look if manufactured under previously unseen parameters. The resulting continuous field is used for data-driven optimization to maximize geometric fidelity between expected and produced geometries, reducing post-processing, material waste, and production costs. By optimizing process parameters dynamically, our approach enables advanced planning strategies, potentially allowing manufacturers to better realize complex and feature-rich designs.
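The regularizer described above penalizes the partial derivative of the field's output with respect to the non-learnable process parameter; a small sketch using autograd (the architecture and weighting are illustrative assumptions) is:

```python
import torch
import torch.nn as nn

# Field maps (x, y, z, p) -> occupancy, where p is a non-learnable print parameter.
field = nn.Sequential(nn.Linear(4, 128), nn.SiLU(), nn.Linear(128, 1))

def regularized_loss(coords: torch.Tensor, p: torch.Tensor,
                     target: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """coords: (N, 3) sample points; p: (N, 1) process parameter; target: (N, 1)."""
    p = p.clone().requires_grad_(True)
    out = field(torch.cat([coords, p], dim=1))
    recon = ((out - target) ** 2).mean()
    # Penalise d(output)/d(p) so that small parameter changes give small output
    # changes, encouraging smooth interpolation between observed volumes.
    dout_dp = torch.autograd.grad(out.sum(), p, create_graph=True)[0]
    return recon + lam * (dout_dp ** 2).mean()
```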
--------------------------------------------------------------------------------------------------------
An Annotated Reading of 'The Singer of Tales' in the LLM Era
This paper examines the Parry-Lord oral-formulaic theory through the lens of large language models (LLMs). By comparing oral narrative composition with LLM generation, the research explores similarities and differences in how stories are learned, composed, and transmitted. The study provides insights into narrative generation, highlighting potential implications for society and AI policy by drawing parallels between traditional oral storytelling and modern generative AI techniques.
Authors: Kush R. Varshney
Link: https://arxiv.org/abs/2502.05148v1
Date: 2025-02-07
Summary:
The Parry-Lord oral-formulaic theory was a breakthrough in understanding how oral narrative poetry is learned, composed, and transmitted by illiterate bards. In this paper, we provide an annotated reading of the mechanism underlying this theory through the lens of large language models (LLMs) and generative artificial intelligence (AI). We point out the similarities and differences between oral composition and LLM generation, and comment on the implications for society and AI policy.
--------------------------------------------------------------------------------------------------------
Large language models often struggle with complex reasoning tasks. This research proposes a novel approach to enhance LLM reasoning through a two-stage training paradigm: format tuning and self-improvement via reinforcement learning. The Chain-of-Action-Thought (COAT) reasoning method aims to internalize searching capabilities within a single LLM. Evaluated on mathematical reasoning benchmarks, the Satori model demonstrates improved performance and strong generalization across various tasks.
Authors: Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan
Link: https://arxiv.org/abs/2502.02508v1
Date: 2025-02-04
Summary:
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.
--------------------------------------------------------------------------------------------------------
Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review
As global populations age, technologies supporting elderly care become crucial. This systematic review examines deep learning facial expression recognition (FER) systems tailored for older adults. Analyzing 31 studies, the research highlights challenges in developing age-inclusive datasets and reliable emotion recognition technologies. The study emphasizes the need for diverse, privacy-conscious solutions that can support healthcare, mental health monitoring, and personalized care for elderly populations.
Authors: F. Xavier Gaya-Morey, Jose M. Buades-Rubio, Philippe Palanque, Raquel Lacuesta, Cristina Manresa-Yee
Link: https://arxiv.org/abs/2502.02618v1
Date: 2025-02-04
Summary:
The rapid aging of the global population has highlighted the need for technologies to support the elderly, particularly in healthcare and emotional well-being. Facial expression recognition (FER) systems offer a non-invasive means of monitoring emotional states, with applications in assisted living, mental health support, and personalized care. This study presents a systematic review of deep learning-based FER systems, focusing on their applications for the elderly population. Following a rigorous methodology, we analyzed 31 studies published over the last decade, addressing challenges such as the scarcity of elderly-specific datasets, class imbalances, and the impact of age-related facial expression differences. Our findings show that convolutional neural networks remain dominant in FER, especially lightweight versions suited to resource-constrained environments. However, existing datasets often lack diversity in age representation, and real-world deployment remains limited. Additionally, privacy concerns and the need for explainable artificial intelligence (XAI) emerged as key barriers to adoption. This review underscores the importance of developing age-inclusive datasets, integrating multimodal solutions, and adopting XAI techniques to enhance system usability, reliability, and trustworthiness. We conclude by offering recommendations for future research to bridge the gap between academic progress and real-world implementation in elderly care.
--------------------------------------------------------------------------------------------------------
LAST SToP For Modeling Asynchronous Time Series
Traditional time series analysis struggles with irregularly timed events. This research introduces a novel prompt design for large language models to handle asynchronous time series. By leveraging natural language descriptions of timestamped events, the approach extends time series analysis beyond forecasting to tasks like anomaly detection and data imputation. The study introduces Stochastic Soft Prompting, demonstrating improved performance across different domains and datasets.
Authors: Shubham Gupta, Thibaut Durand, Graham Taylor, Lilian W. Białokozowicz
Link: https://arxiv.org/abs/2502.01922v1
Date: 2025-02-04
Summary:
We present a novel prompt design for Large Language Models (LLMs) tailored to Asynchronous Time Series. Unlike regular time series, which assume values at evenly spaced time points, asynchronous time series consist of timestamped events occurring at irregular intervals, each described in natural language. Our approach effectively utilizes the rich natural language of event descriptions, allowing LLMs to benefit from their broad world knowledge for reasoning across different domains and tasks. This allows us to extend the scope of asynchronous time series analysis beyond forecasting to include tasks like anomaly detection and data imputation. We further introduce Stochastic Soft Prompting, a novel prompt-tuning mechanism that significantly improves model performance, outperforming existing fine-tuning methods such as QLoRA. Through extensive experiments on real-world datasets, we demonstrate that our approach achieves state-of-the-art performance across different tasks and datasets.
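A minimal sketch of serializing an asynchronous event stream into a prompt follows; the event contents, formatting, and task instruction are illustrative, not the paper's prompt template.

```python
from datetime import datetime

events = [
    (datetime(2025, 2, 3, 9, 15), "server CPU spiked to 95%"),
    (datetime(2025, 2, 3, 9, 40), "auto-scaler added two instances"),
    (datetime(2025, 2, 4, 2, 5),  "nightly backup started"),
]

lines = [f"[{t.isoformat(sep=' ')}] {desc}" for t, desc in events]
prompt = ("You are given a log of timestamped events:\n"
          + "\n".join(lines)
          + "\nQuestion: which event, if any, looks anomalous, and why?")
print(prompt)
```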
--------------------------------------------------------------------------------------------------------