Week Ending 8.4.2024
RESEARCH WATCH: 8.4.2024
SAM 2: Segment Anything in Images and Videos
SAM 2 represents a significant advancement in visual segmentation for both images and videos. Building on the success of the original Segment Anything Model, this new iteration introduces a data engine that improves model performance through user interaction. The result is a more accurate and efficient segmentation tool that can handle a wide range of tasks. With its ability to process videos in real-time and achieve better accuracy with fewer interactions, SAM 2 has potential applications in fields such as autonomous driving, medical imaging, and video editing. The release of the model, dataset, and interactive demo will likely accelerate research and development in visual perception tasks.
Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer
Link: https://arxiv.org/abs/2408.00714v1
Date: 2024-08-01
Summary:
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing a version of our model, the dataset and an interactive demo.
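To make the streaming-memory idea concrete, here is a minimal, library-free Python sketch of a per-frame segmenter that conditions on a rolling bank of recently predicted masks. Every class and function name below is hypothetical and only gestures at the architecture described above; it is not the released SAM 2 code.

```python
from collections import deque
import numpy as np

class StreamingMemorySegmenter:
    """Toy illustration of a streaming-memory video segmenter (not the real SAM 2)."""

    def __init__(self, memory_size=6):
        # Rolling bank holding (frame features, predicted mask) for recent frames.
        self.memory = deque(maxlen=memory_size)

    def segment_frame(self, frame, prompt_mask=None):
        features = frame.mean(axis=-1) / 255.0     # stand-in for an image encoder
        if prompt_mask is not None:                # user prompt (click/box/mask)
            mask = prompt_mask.astype(float)
        elif self.memory:                          # propagate the object from memory
            past_masks = [m for _, m in self.memory]
            mask = (np.mean(past_masks, axis=0) > 0.5).astype(float)
        else:
            mask = np.zeros_like(features)         # nothing prompted yet
        # A real memory encoder would fuse features and mask with attention.
        self.memory.append((features, mask))
        return mask

# Prompt once on the first frame, then stream the remaining frames.
video = np.random.randint(0, 256, size=(10, 64, 64, 3)).astype(float)
prompt = np.zeros((64, 64)); prompt[20:40, 20:40] = 1.0
seg = StreamingMemorySegmenter()
masks = [seg.segment_frame(video[0], prompt_mask=prompt)]
masks += [seg.segment_frame(f) for f in video[1:]]
print(len(masks), masks[-1].sum())
```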
--------------------------------------------------------------------------------------------------------
Palu: Compressing KV-Cache with Low-Rank Projection
Palu addresses a critical challenge in large language model deployment: memory consumption. By introducing a novel KV-Cache compression framework using low-rank projection, Palu significantly reduces memory usage while maintaining or even improving model accuracy. This innovation is particularly important as language models continue to grow in size and complexity. Potential applications include enabling the use of more powerful language models on devices with limited memory, improving the efficiency of cloud-based AI services, and reducing the cost of running large-scale language models. The publicly available code will allow researchers and developers to implement and build upon this technique in various natural language processing applications.
Authors: Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu
Link: https://arxiv.org/abs/2407.21118v1
Date: 2024-07-30
Summary:
KV-Cache compression methods generally sample a KV-Cache of effectual tokens or quantize it into lower bits. However, these methods cannot exploit the redundancy of the hidden dimension of KV tensors. This paper investigates a unique hidden dimension approach called Palu, a novel KV-Cache compression framework that utilizes low-rank projection. Palu decomposes the linear layers into low-rank matrices, caches the smaller intermediate states, and reconstructs the full keys and values on the fly. To improve accuracy, compression rate, and efficiency, Palu further encompasses (1) a medium-grained low-rank decomposition scheme, (2) an efficient rank search algorithm, (3) a low-rank-aware quantization algorithm, and (4) matrix fusion with optimized GPU kernels. Our extensive experiments with popular LLMs show that Palu can compress KV-Cache by more than 91.25% while maintaining significantly better accuracy (up to 1.19 lower perplexity) than state-of-the-art KV-Cache quantization methods at a similar or even higher memory usage. When compressing KV-Cache by 50%, Palu delivers up to 1.61x end-to-end speedup for the attention module. Our code is publicly available at https://github.com/shadowpa0327/Palu.
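The caching trick at the heart of Palu can be sketched in a few lines of NumPy: factor a key-projection matrix with a truncated SVD, store only the low-rank intermediate, and rebuild keys on demand. This is a single-head toy with assumed dimensions; the paper's medium-grained decomposition, rank search, quantization, and fused kernels are not reproduced.

```python
import numpy as np

d_model, d_head, rank, seq_len = 512, 128, 32, 1024
rng = np.random.default_rng(0)

W_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)  # key projection
H = rng.standard_normal((seq_len, d_model))                      # hidden states

# Offline: decompose W_k ~= A @ B with a truncated SVD of rank r.
U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :rank] * S[:rank]          # (d_model, rank)
B = Vt[:rank]                       # (rank, d_head)

# Online: cache only the small intermediate H @ A instead of the full keys.
latent_cache = H @ A                # (seq_len, rank)  -> what gets stored
K_approx = latent_cache @ B         # keys reconstructed on the fly

K_full = H @ W_k
compression = latent_cache.size / K_full.size
err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
print(f"cache size ratio: {compression:.2f}, relative error: {err:.3f}")
```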
--------------------------------------------------------------------------------------------------------
Evaluating Large Language Models for automatic analysis of teacher simulations
This research explores the use of Large Language Models (LLMs) in the context of teacher education, specifically for analyzing responses in Digital Simulations. By comparing the performance of different LLMs in identifying user behaviors, the study provides valuable insights for educators and developers of educational technology. The findings suggest that Llama 3 may be more suitable for scenarios where new characteristics need to be introduced regularly. This work has potential applications in improving the quality and efficiency of teacher training programs, automating the evaluation of teacher candidates' responses, and developing more adaptive and personalized educational simulations.
Authors: David de-Fitero-Dominguez, Mariano Albaladejo-González, Antonio Garcia-Cabot, Eva Garcia-Lopez, Antonio Moreno-Cediel, Erin Barno, Justin Reich
Link: https://arxiv.org/abs/2407.20360v1
Date: 2024-07-29
Summary:
Digital Simulations (DS) provide safe environments where users interact with an agent through conversational prompts, offering engaging learning experiences that can be used to train teacher candidates in realistic classroom scenarios. These simulations usually include open-ended questions, allowing teacher candidates to express their thoughts but complicating automatic response analysis. To address this issue, we have evaluated Large Language Models (LLMs) to identify characteristics (user behaviors) in the responses of DS for teacher education. We evaluated the performance of DeBERTaV3 and Llama 3, combined with zero-shot, few-shot, and fine-tuning. Our experiments discovered a significant variation in the LLMs' performance depending on the characteristic to identify. Additionally, we noted that DeBERTaV3 significantly reduced its performance when it had to identify new characteristics. In contrast, Llama 3 performed better than DeBERTaV3 in detecting new characteristics and showed more stable performance. Therefore, in DS where teacher educators need to introduce new characteristics because they change depending on the simulation or the educational objectives, Llama 3 is the more suitable choice. These results can guide other researchers in introducing LLMs to provide the highly demanded automatic evaluations in DS.
--------------------------------------------------------------------------------------------------------
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
MindSearch introduces a novel approach to information seeking and integration using Large Language Models. By mimicking human cognitive processes, this multi-agent framework addresses key challenges in complex information retrieval tasks. The system's ability to decompose queries, perform hierarchical searches, and integrate information from multiple sources has potential applications in academic research, business intelligence, and general knowledge acquisition. MindSearch's performance, which is comparable to or better than proprietary AI search engines, suggests it could be a valuable tool for enhancing human knowledge work and decision-making processes across various domains.
Authors: Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, Feng Zhao
Link: https://arxiv.org/abs/2407.20183v1
Date: 2024-07-29
Summary:
Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Inspired by the remarkable progress of Large Language Models, recent works attempt to solve this task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests often cannot be accurately and completely retrieved by the search engine in a single query, (2) the corresponding information to be integrated is spread over multiple web pages along with massive noise, and (3) a large number of web pages with long contents may quickly exceed the maximum context length of LLMs. Inspired by the cognitive process humans follow when solving these problems, we introduce MindSearch to mimic the human mind in web information seeking and integration, which can be instantiated by a simple yet effective LLM-based multi-agent framework. WebPlanner models the human process of multi-step information seeking as a dynamic graph construction process: it decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search results from WebSearcher. Tasked with each sub-question, WebSearcher performs hierarchical information retrieval with search engines and collects valuable information for WebPlanner. The multi-agent design of MindSearch enables the whole framework to seek and integrate information in parallel from a large number of web pages (e.g., more than 300) in 3 minutes, work that would otherwise take roughly 3 hours of human effort. MindSearch demonstrates significant improvement in response quality in terms of depth and breadth, on both closed-set and open-set QA problems. In addition, human evaluators prefer responses from MindSearch based on InternLM2.5-7B over those from the ChatGPT-Web and Perplexity.ai applications, which implies that MindSearch can already deliver a competitive alternative to proprietary AI search engines.
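The planner/searcher split can be pictured with a small, self-contained sketch in which a query is decomposed into sub-questions that are searched in parallel. The function names and the decomposition below are placeholders, not the MindSearch implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(query):
    """Stand-in for WebPlanner: decompose the query into atomic sub-questions."""
    return {
        "q1": f"Background facts needed for: {query}",
        "q2": f"Most recent developments related to: {query}",
        "q3": f"Conflicting viewpoints on: {query}",
    }

def web_search(sub_question):
    """Stand-in for WebSearcher: hierarchical retrieval would happen here."""
    return f"[snippets retrieved for: {sub_question}]"

def mind_search(query):
    graph = plan(query)                          # nodes = sub-questions
    with ThreadPoolExecutor() as pool:           # searchers run in parallel
        results = dict(zip(graph, pool.map(web_search, graph.values())))
    # A final LLM call would integrate the per-node evidence into one answer.
    return "\n".join(f"{k}: {v}" for k, v in results.items())

print(mind_search("impact of KV-cache compression on LLM serving cost"))
```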
--------------------------------------------------------------------------------------------------------
SmileyNet -- Towards the Prediction of the Lottery by Reading Tea Leaves with AI
SmileyNet presents a lighthearted exploration of neural network capabilities, combining elements of mood-based learning and Tasseology. While the paper's claim of predicting lottery outcomes should be taken with a grain of salt, it raises interesting questions about the potential biases and unexpected behaviors that can emerge in neural networks. The study's playful approach could have applications in exploring unconventional training methods, understanding the impact of initial biases on model performance, and developing more creative approaches to problem-solving in AI. It also serves as a reminder of the importance of critical thinking when evaluating AI claims and results.
Authors: Andreas Birk
Link: https://arxiv.org/abs/2407.21385v1
Date: 2024-07-31
Summary:
We introduce SmileyNet, a novel neural network with psychic abilities. It is inspired by the fact that a positive mood can lead to improved cognitive capabilities, including classification tasks. The network is hence presented in a first phase with smileys, and an encouraging loss function is defined to bias it into a good mood. SmileyNet is then used to forecast the flipping of a coin based on an established method of Tasseology, namely by reading tea leaves. Training and testing in this second phase are done with a high-fidelity simulation based on real-world pixels sampled from a professional tea-reading cup. SmileyNet achieves an amazing accuracy of 72% in correctly predicting the flip of a coin, whereas ResNet-34 and YOLOv5 achieve only 49% and 53%, respectively. It is then shown how multiple SmileyNets can be combined to win the lottery.
--------------------------------------------------------------------------------------------------------
MIST: A Simple and Scalable End-To-End 3D Medical Imaging Segmentation Framework
MIST addresses the need for standardized tools in medical imaging segmentation research. By providing a comprehensive framework for training, testing, and evaluating deep learning-based segmentation methods, MIST aims to facilitate fair comparisons and reproducible research in this field. The framework's modular design and ability to accommodate multiple architectures and loss functions make it a versatile tool for researchers and developers. Potential applications include accelerating the development of more accurate and efficient medical imaging segmentation algorithms, which could lead to improvements in disease diagnosis, treatment planning, and patient outcomes across various medical specialties.
Authors: Adrian Celaya, Evan Lim, Rachel Glenn, Brayden Mi, Alex Balsells, Tucker Netherton, Caroline Chung, Beatrice Riviere, David Fuentes
Link: https://arxiv.org/abs/2407.21343v1
Date: 2024-07-31
Summary:
Medical imaging segmentation is a highly active area of research, with deep learning-based methods achieving state-of-the-art results in several benchmarks. However, the lack of standardized tools for training, testing, and evaluating new methods makes the comparison of methods difficult. To address this, we introduce the Medical Imaging Segmentation Toolkit (MIST), a simple, modular, and end-to-end medical imaging segmentation framework designed to facilitate consistent training, testing, and evaluation of deep learning-based medical imaging segmentation methods. MIST standardizes data analysis, preprocessing, and evaluation pipelines, accommodating multiple architectures and loss functions. This standardization ensures reproducible and fair comparisons across different methods. We detail MIST's data format requirements, pipelines, and auxiliary features and demonstrate its efficacy using the BraTS Adult Glioma Post-Treatment Challenge dataset. Our results highlight MIST's ability to produce accurate segmentation masks and its scalability across multiple GPUs, showcasing its potential as a powerful tool for future medical imaging research and development.
--------------------------------------------------------------------------------------------------------
Implementing Streaming algorithm and k-means clusters to RAG
This research proposes an innovative approach to enhance Retrieval-augmented generation (RAG) by combining streaming algorithms and k-means clustering. The method aims to address the challenges of high memory consumption and slow updates when dealing with massive streaming data. By applying a streaming algorithm to update the index and using k-means clustering to group similar documents, the approach achieves improved accuracy and reduced memory usage. Potential applications include enhancing conversational AI systems, improving real-time information retrieval in dynamic environments, and enabling more efficient and accurate question-answering systems for large-scale, continuously updating datasets.
Authors: Haoyu Kang, Yuzhou Zhu, Yukun Zhong, Ke Wang
Link: https://arxiv.org/abs/2407.21300v1
Date: 2024-07-31
Summary:
Retrieval-augmented generation (RAG) has achieved great success in information retrieval to assist large models because it builds an external knowledge database. However, it also has many problems: it consumes a lot of memory because of the huge database, and when faced with massive streaming data, it is unable to update the established index database in time. To save the memory of building the database while maintaining accuracy, we propose a new approach combining a streaming algorithm and k-means clustering with RAG. Our approach applies a streaming algorithm to update the index and reduce memory consumption. We then use the k-means algorithm to cluster documents with high similarities together, which shortens query time. We conducted comparative experiments on four methods, and the results show that RAG with the streaming algorithm and k-means clustering performs well in accuracy and memory. For massive streaming data, we find that our method performs better than traditional RAG.
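A minimal sketch of the clustering side of the method: group document embeddings with mini-batch k-means (whose partial_fit gives a streaming-style index update) and restrict retrieval to the cluster nearest the query. Embeddings are random placeholders, and the paper's exact streaming algorithm is not reproduced.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
dim, n_docs, k = 64, 5000, 16
doc_emb = rng.standard_normal((n_docs, dim))        # placeholder embeddings

# Streaming-friendly clustering: the index can be updated batch by batch.
km = MiniBatchKMeans(n_clusters=k, random_state=0)
for batch in np.array_split(doc_emb, 10):
    km.partial_fit(batch)
labels = km.predict(doc_emb)

def retrieve(query_emb, top_n=5):
    # Search only inside the cluster whose centroid is closest to the query.
    cluster = km.predict(query_emb[None, :])[0]
    members = np.where(labels == cluster)[0]
    sims = doc_emb[members] @ query_emb
    return members[np.argsort(-sims)[:top_n]]

print(retrieve(rng.standard_normal(dim)))
```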
--------------------------------------------------------------------------------------------------------
Classification, Regression and Segmentation directly from k-Space in Cardiac MRI
This paper introduces KMAE, a Transformer-based model designed to process k-space data directly in cardiac Magnetic Resonance Imaging (CMR). By eliminating the need for conversion to the image domain, KMAE offers a novel approach to analyzing CMR data. The model's ability to handle classification, regression, and segmentation tasks directly from k-space data has potential applications in improving cardiac disease diagnosis, enhancing the efficiency of CMR analysis, and potentially reducing the time and computational resources required for cardiac imaging. The model's robust performance, even with undersampled k-space data, suggests it could be particularly useful in scenarios where rapid or low-dose imaging is necessary.
Authors: Ruochen Li, Jiazhen Pan, Youxiang Zhu, Juncheng Ni, Daniel Rueckert
Link: https://arxiv.org/abs/2407.20108v1
Date: 2024-07-29
Summary:
Cardiac Magnetic Resonance Imaging (CMR) is the gold standard for diagnosing cardiovascular diseases. Clinical diagnoses predominantly rely on magnitude-only Digital Imaging and Communications in Medicine (DICOM) images, omitting crucial phase information that might provide additional diagnostic benefits. In contrast, k-space is complex-valued and encompasses both magnitude and phase information, which humans cannot directly perceive. In this work, we propose KMAE, a Transformer-based model specifically designed to process k-space data directly, eliminating conventional intermediary conversion steps to the image domain. KMAE can handle critical cardiac disease classification, relevant phenotype regression, and cardiac morphology segmentation tasks. We utilize this model to investigate the potential of k-space-based diagnosis in cardiac MRI. Notably, this model achieves competitive classification and regression performance compared to image-domain methods such as Masked Autoencoders (MAEs), and delivers satisfactory segmentation performance with a myocardium dice score of 0.884. Last but not least, our model exhibits robust performance with consistent results even when the k-space is 8x undersampled. We encourage the MR community to explore the untapped potential of k-space and pursue end-to-end, automated diagnosis with reduced human intervention.
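The data handling the paper builds on, going from an image to complex-valued k-space and undersampling it by 8x, can be sketched with a 2D FFT and a line mask. The snippet shows only this forward model, not KMAE's transformer or masked-autoencoder training.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((128, 128))               # stand-in for a cardiac MR slice

# Forward model: k-space is the 2D Fourier transform of the image (complex-valued).
kspace = np.fft.fftshift(np.fft.fft2(img))

# 8x undersampling: keep every 8th phase-encode line plus a few central lines.
mask = np.zeros(128, dtype=bool)
mask[::8] = True
mask[60:68] = True                                   # low-frequency (central) lines
kspace_us = kspace * mask[None, :]

# A k-space model like KMAE would consume kspace_us (e.g. real/imag channels) directly;
# a conventional pipeline would first reconstruct an image, e.g. zero-filled:
img_zero_filled = np.fft.ifft2(np.fft.ifftshift(kspace_us)).real
print(kspace_us.shape, img_zero_filled.shape, round(mask.mean(), 3))
```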
--------------------------------------------------------------------------------------------------------
GNN-MolKAN: Harnessing the Power of KAN to Advance Molecular Representation Learning with GNNs
GNN-MolKAN introduces a novel approach to molecular representation learning by integrating Kolmogorov-Arnold Networks (KAN) into Graph Neural Networks (GNNs). This innovative method addresses challenges in molecular property prediction and drug design, offering improved performance, efficiency, and few-shot learning capabilities. The approach has potential applications in accelerating drug discovery processes, improving the accuracy of molecular property predictions, and enhancing the design of new materials. By consistently achieving competitive results across various datasets, GNN-MolKAN could become a valuable tool for researchers and practitioners in computational chemistry, materials science, and pharmaceutical development.
Authors: Ruifeng Li
Link: https://arxiv.org/abs/2408.01018v1
Date: 2024-08-02
Summary:
Effective molecular representation learning is crucial for molecular property prediction and drug design. However, existing approaches struggle with insufficient annotations and suboptimal architecture design. For instance, Graph Neural Networks (GNNs) suffer from over-squashing, causing the loss of important structural details in molecules, thus impairing molecular representations. In this work, we propose a new class of GNNs, GNN-MolKAN and its augmented variant, GNN-MolKAN+, that integrate the Kolmogorov-Arnold Networks (KAN) architecture from AI + Science into GNNs to address these challenges. Additionally, we introduce Adaptive FastKAN (AdFastKAN), an advanced KAN that offers increased stability and speed, further enhancing the performance of standard GNNs. Notably, our approach offers three key benefits: 1) Superior Performance: GNN-MolKAN and GNN-MolKAN+ demonstrate superior prediction ability, robust generalization to unseen scaffolds, and versatile transferability across different GNN architectures. 2) Efficiency: These models require less computational time and fewer parameters while matching or surpassing the state-of-the-art (SOTA) self-supervised methods. 3) Few-shot Learning Ability: GNN-MolKAN demonstrates great potential in few-shot learning scenarios, achieving an average improvement of 6.97% across few-shot benchmarks. Overall, we validate our architecture on 6 classification datasets, 6 regression datasets, and 4 few-shot learning datasets, consistently achieving highly competitive results across all of them.
--------------------------------------------------------------------------------------------------------
In-Hand Singulation and Scooping Manipulation with a 5 DOF Tactile Gripper
This paper presents a novel gripper design with five degrees of freedom and integrated tactile sensing capabilities. The gripper's ability to perform challenging tasks such as object singulation in granular media and precise credit card insertion demonstrates its potential for advanced robotic manipulation. Possible applications include warehouse automation, where the gripper could handle various objects in cluttered environments, and manufacturing processes requiring precise manipulation of small or delicate items. The high success rates achieved in the experiments suggest that this gripper design could significantly enhance the capabilities of robotic systems in industries requiring dexterous manipulation and tactile feedback.
Authors: Yuhao Zhou, Pokuang Zhou, Shaoxiong Wang, Yu She
Link: https://arxiv.org/abs/2408.00610v1
Date: 2024-08-01
Summary:
Manipulation tasks often require a high degree of dexterity, typically necessitating grippers with multiple degrees of freedom (DoF). While a robotic hand equipped with multiple fingers can execute precise and intricate manipulation tasks, the inherent redundancy stemming from its extensive DoF often adds unnecessary complexity. In this paper, we introduce the design of a tactile sensor-equipped gripper with two fingers and five DoF. We present a novel design integrating a GelSight tactile sensor, enhancing sensing capabilities and enabling finer control during specific manipulation tasks. To evaluate the gripper's performance, we conduct experiments involving two challenging tasks: 1) retrieving, singulating, and classifying various objects embedded in granular media, and 2) executing scooping manipulations of credit cards in confined environments to achieve precise insertion. Our results demonstrate the efficiency of the proposed approach, with a high success rate for singulation and classification tasks, particularly for spherical objects, reaching as high as 94.3%, and a 100% success rate for scooping and inserting credit cards.
--------------------------------------------------------------------------------------------------------
Bayesian Low-Rank LeArning (Bella): A Practical Approach to Bayesian Neural Networks
Bayesian neural networks offer improved robustness and resilience compared to traditional neural networks, but their high computational complexity has limited practical adoption. The Bella framework aims to address this by using low-rank perturbations of pre-trained neural network parameters to dramatically reduce the number of trainable parameters needed. This approach enables the implementation of sophisticated Bayesian techniques like Stein Variational Gradient Descent even for large models. Bella demonstrates effectiveness on large-scale tasks like ImageNet and visual question answering, potentially opening up practical applications of Bayesian deep learning in areas like computer vision, natural language processing, and multi-modal AI systems.
Authors: Bao Gia Doan, Afshar Shamsi, Xiao-Yu Guo, Arash Mohammadi, Hamid Alinejad-Rokny, Dino Sejdinovic, Damith C. Ranasinghe, Ehsan Abbasnejad
Link: https://arxiv.org/abs/2407.20891v1
Date: 2024-07-30
Summary:
The computational complexity of Bayesian learning is impeding its adoption in practical, large-scale tasks. Despite demonstrations of significant merits, such as improved robustness and resilience to unseen or out-of-distribution inputs over their non-Bayesian counterparts, the practical use of Bayesian neural networks has faded to near insignificance. In this study, we introduce an innovative framework to mitigate the computational burden of Bayesian neural networks (BNNs). Our approach follows the principle of Bayesian techniques based on deep ensembles, but significantly reduces their cost via multiple low-rank perturbations of parameters arising from a pre-trained neural network. Both the vanilla version of ensembles and more sophisticated schemes such as Bayesian learning with Stein Variational Gradient Descent (SVGD), previously deemed impractical for large models, can be seamlessly implemented within the proposed framework, called Bayesian Low-Rank LeArning (Bella). In a nutshell, i) Bella achieves a dramatic reduction in the number of trainable parameters required to approximate a Bayesian posterior; and ii) it not only maintains, but in some instances surpasses, the performance of conventional Bayesian learning methods and non-Bayesian baselines. Our results with large-scale tasks such as ImageNet, CAMELYON17, DomainNet, and VQA with CLIP and LLaVA demonstrate the effectiveness and versatility of Bella in building highly scalable and practical Bayesian deep models for real-world applications.
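The core construction, an ensemble built from low-rank perturbations of frozen pre-trained weights, can be sketched for a single linear layer as below. The rank-1 factors and dimensions are illustrative, and Bella's SVGD variant and large backbones are out of scope here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_members = 256, 10, 5

W0 = rng.standard_normal((d_in, d_out)) * 0.01       # frozen pre-trained weights
# Each ensemble member trains only a rank-1 perturbation (u_i, v_i): tiny overhead.
U = rng.standard_normal((n_members, d_in)) * 0.01
V = rng.standard_normal((n_members, d_out)) * 0.01

def member_logits(x, i):
    W_i = W0 + np.outer(U[i], V[i])                  # W0 + u_i v_i^T
    return x @ W_i

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

x = rng.standard_normal((8, d_in))
# Bayesian model averaging: average the members' predictive distributions.
posterior_mean = np.mean([softmax(member_logits(x, i)) for i in range(n_members)], axis=0)
print(posterior_mean.shape)                          # (8, 10)

extra = n_members * (d_in + d_out)
print(f"extra trainable params: {extra} vs full ensemble: {n_members * d_in * d_out}")
```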
--------------------------------------------------------------------------------------------------------
Unleash the Power of Ellipsis: Accuracy-enhanced Sparse Vector Technique with Exponential Noise
The Sparse Vector Technique (SVT) is a fundamental tool in differential privacy, allowing private analysis of datasets by answering queries with binary responses. Previous approaches used conservative privacy analysis, limiting accuracy. This paper introduces a new privacy analysis for SVT that considers its less informative nature, allowing for a broader range of noise types and identifying exponential noise as optimal. The authors develop methods to enhance SVT performance, including threshold correction and an appending strategy. This work could improve privacy-preserving data analysis in areas like medical research, user behavior analysis, and other fields where maintaining individual privacy is crucial.
Authors: Yuhan Liu, Sheng Wang, Yixuan Liu, Feifei Li, Hong Chen
Link: https://arxiv.org/abs/2407.20068v1
Date: 2024-07-29
Summary:
The Sparse Vector Technique (SVT) is one of the most fundamental tools in differential privacy (DP). It works as a backbone for adaptive data analysis by answering a sequence of queries on a given dataset, and gleaning useful information in a privacy-preserving manner. Unlike the typical private query releases that directly publicize the noisy query results, SVT is less informative -- it keeps the noisy query results to itself and only reveals a binary bit for each query, indicating whether the query result surpasses a predefined threshold. To provide a rigorous DP guarantee for SVT, prior works in the literature adopt a conservative privacy analysis by assuming the direct disclosure of noisy query results as in typical private query releases. This approach, however, hinders SVT from achieving higher query accuracy due to an overestimation of the privacy risks, which further leads to an excessive noise injection using the Laplacian or Gaussian noise for perturbation. Motivated by this, we provide a new privacy analysis for SVT by considering its less informative nature. Our analysis results not only broaden the range of applicable noise types for perturbation in SVT, but also identify the exponential noise as optimal among all evaluated noises (which, however, is usually deemed non-applicable in prior works). The main challenge in applying exponential noise to SVT is mitigating the sub-optimal performance due to the bias introduced by noise distributions. To address this, we develop a utility-oriented optimal threshold correction method and an appending strategy, which enhances the performance of SVT by increasing the precision and recall, respectively. The effectiveness of our proposed methods is substantiated both theoretically and empirically, demonstrating significant improvements up to 50% across evaluated metrics.
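A bare-bones sketch of SVT with one-sided exponential noise, the noise family the paper identifies as optimal: the noise scales below are placeholders rather than a calibrated privacy accounting, and the paper's threshold correction and appending strategy are omitted. Note that the one-sided noise introduces exactly the bias those corrections are designed to remove.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_vector(query_answers, threshold, cutoff, rho_scale=2.0, nu_scale=4.0):
    """Answer 'above threshold?' for a stream of queries, stopping after `cutoff` hits."""
    noisy_threshold = threshold + rng.exponential(rho_scale)   # perturb the threshold once
    outputs, hits = [], 0
    for q in query_answers:
        noisy_q = q + rng.exponential(nu_scale)                 # perturb each query answer
        above = bool(noisy_q >= noisy_threshold)
        outputs.append(above)                                   # only a binary bit is released
        hits += above
        if hits >= cutoff:                                      # privacy budget exhausted
            break
    return outputs

answers = rng.integers(0, 100, size=50).astype(float)
print(sparse_vector(answers, threshold=80, cutoff=3))
```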
--------------------------------------------------------------------------------------------------------
The Llama 3 Herd of Models
Llama 3 represents a significant advancement in foundation models for AI systems. This new set of language models natively supports multilingual tasks, coding, reasoning, and tool usage. The largest model boasts 405 billion parameters and a context window of up to 128,000 tokens. Llama 3 shows comparable performance to leading models like GPT-4 across various tasks. The paper also explores integrating image, video, and speech capabilities into Llama 3. These models could have wide-ranging applications in natural language processing, multimodal AI systems, and advanced AI assistants for various industries and research fields.
Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, 
Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao
Link: https://arxiv.org/abs/2407.21783v1
Date: 2024-07-31
Summary:
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
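For readers who want to try the released weights, a typical Hugging Face transformers invocation looks like the sketch below. The model id, chat format, and output indexing are assumptions based on Meta's public Hugging Face releases (access is gated by the license), and the 8B Instruct variant stands in for the 405B model, which is impractical to run locally; the exact output structure may vary across transformers versions.

```python
# pip install transformers accelerate
import torch
from transformers import pipeline

# Assumed model id from Meta's Hugging Face releases; gated behind a license agreement.
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

chat = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise research assistant."},
    {"role": "user", "content": "Summarize what a 128K-token context window enables."},
]
out = chat(messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])   # last message is the model's reply
```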
--------------------------------------------------------------------------------------------------------
TRGR: Transmissive RIS-aided Gait Recognition Through Walls
TRGR introduces a novel approach to gait recognition using radio frequency signals, enabling identification through walls. By employing transmissive reconfigurable intelligent surfaces (RIS) and a configuration alternating optimization algorithm, TRGR enhances signal quality for accurate recognition in challenging environments. The system uses only magnitude measurements of channel state information from a pair of transceivers. This technology could have significant applications in security and surveillance, allowing for non-invasive identification in complex environments. It could also be valuable in search and rescue operations, healthcare monitoring, and smart home systems where traditional visual recognition is impractical.
Authors: Yunlong Huang, Junshuo Liu, Jianan Zhang, Tiebin Mi, Xin Shi, Robert Caiming Qiu
Link: https://arxiv.org/abs/2407.21566v1
Date: 2024-07-31
Summary:
Gait recognition with radio frequency (RF) signals enables many potential applications requiring accurate identification. However, current systems require individuals to be within a line-of-sight (LOS) environment and struggle with low signal-to-noise ratio (SNR) when signals traverse concrete and thick walls. To address these challenges, we present TRGR, a novel transmissive reconfigurable intelligent surface (RIS)-aided gait recognition system. TRGR can recognize human identities through walls using only the magnitude measurements of channel state information (CSI) from a pair of transceivers. Specifically, by leveraging transmissive RIS alongside a configuration alternating optimization algorithm, TRGR enhances wall penetration and signal quality, enabling accurate gait recognition. Furthermore, a residual convolution network (RCNN) is proposed as the backbone network to learn robust human information. Experimental results confirm the efficacy of transmissive RIS, highlighting the significant potential of transmissive RIS in enhancing RF-based gait recognition systems. Extensive experimental results show that TRGR achieves an average accuracy of 97.88% in identifying persons when signals traverse concrete walls, demonstrating the effectiveness and robustness of TRGR.
--------------------------------------------------------------------------------------------------------
HBot is an innovative chatbot designed for healthcare applications in Traditional Chinese Medicine (TCM). It combines a 3D human body model with a knowledge graph to provide services such as knowledge Q&A, prescription recommendations, and acupoint searches. When specific acupoints are mentioned in conversations, the 3D body model highlights them, offering an intuitive visual aid. This system could significantly enhance TCM education and practice, making complex concepts more accessible to students and practitioners. It also has potential applications in telemedicine, patient education, and as a supportive tool for TCM clinics and hospitals.
Authors: Bolin Zhang, Zhiwei Yi, Jiahao Wang, Dianbo Sui, Zhiying Tu, Dianhui Chu
Link: https://arxiv.org/abs/2408.00481v1
Date: 2024-08-01
Summary:
The unique diagnosis and treatment techniques and remarkable clinical efficacy of traditional Chinese medicine (TCM) make it play an important role in the field of elderly care and healthcare, especially in the rehabilitation of some common chronic diseases of the elderly. Therefore, building a TCM chatbot for healthcare applications will help users obtain consultation services in a direct and natural way. However, concepts such as acupuncture points (acupoints) and meridians involved in TCM always appear in the consultation, which cannot be displayed intuitively. To this end, we develop a healthcare chatbot (HBot) based on a 3D human body model and a knowledge graph, which provides conversational services such as knowledge Q&A, prescription recommendation, moxibustion therapy recommendation, and acupoint search. When specific acupoints are involved in the conversations between the user and HBot, the 3D body jumps to the corresponding acupoints and highlights them. Moreover, HBot can also be used in training scenarios to accelerate the teaching process of TCM by intuitively displaying acupuncture points and knowledge cards. The demonstration video is available at https://www.youtube.com/watch?v=UhQhutSKkTU . Our code and dataset are publicly available at Gitee: https://gitee.com/plabrolin/interactive-3d-acup.git
--------------------------------------------------------------------------------------------------------
A Tutorial on the Use of Physics-Informed Neural Networks to Compute the Spectrum of Quantum Systems
This tutorial introduces Physics-Informed Neural Networks (PINNs) as a method for solving the Schrödinger equation and finding eigenvalues and eigenfunctions of quantum systems. PINNs use Automatic Differentiation to solve Integro-Differential Equations without relying on traditional mesh-based methods. The paper demonstrates how to find ground and excited states progressively and how to incorporate physical knowledge as inductive biases for faster convergence. This approach could revolutionize computational quantum physics, potentially leading to more efficient simulations of complex quantum systems in fields like materials science, quantum chemistry, and condensed matter physics.
Authors: Lorenzo Brevi, Antonio Mandarino, Enrico Prati
Link: https://arxiv.org/abs/2407.20669v1
Date: 2024-07-30
Summary:
Quantum many-body systems are of great interest for many research areas, including physics, biology and chemistry. However, their simulation is extremely challenging, due to the exponential growth of the Hilbert space with the system size, making it exceedingly difficult to parameterize the wave functions of large systems by using exact methods. Neural networks and machine learning in general are a way to face this challenge. For instance, methods like Tensor networks and Neural Quantum States are being investigated as promising tools to obtain the wave function of a quantum mechanical system. In this tutorial, we focus on a particularly promising class of deep learning algorithms. We explain how to construct a Physics-Informed Neural Network (PINN) able to solve the Schrödinger equation for a given potential, by finding its eigenvalues and eigenfunctions. This technique is unsupervised, and utilizes a novel computational method in a manner that has barely been explored. PINNs are a deep learning method that exploits Automatic Differentiation to solve Integro-Differential Equations in a mesh-free way. We show how to find both the ground and the excited states. The method discovers the states progressively by starting from the ground state. We explain how to introduce inductive biases in the loss to exploit further knowledge of the physical system. Such additional constraints allow for a faster and more accurate convergence. This technique can then be enhanced by a smart choice of collocation points in order to take advantage of the mesh-free nature of the PINN. The methods are made explicit by applying them to the infinite potential well and the particle in a ring, a challenging problem to be learned by an AI agent due to the presence of complex-valued eigenfunctions and degenerate states.
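The tutorial's recipe, minimizing the Schrödinger residual together with boundary and normalization constraints while treating the energy as a trainable parameter, can be condensed into a short PyTorch sketch for the infinite well on [0, 1] (with hbar = m = 1). Hyperparameters and network size are arbitrary choices for illustration, not the authors' settings.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
E = torch.nn.Parameter(torch.tensor(4.0))              # trainable eigenvalue
opt = torch.optim.Adam(list(net.parameters()) + [E], lr=1e-3)

def psi(x):
    # Hard-enforce psi(0) = psi(1) = 0 so only the PDE and normalization are penalized.
    return x * (1.0 - x) * net(x)

for step in range(2000):
    x = torch.rand(256, 1, requires_grad=True)          # collocation points in (0, 1)
    u = psi(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
    residual = -0.5 * d2u - E * u                        # -(1/2) psi'' = E psi inside the well
    norm = (u ** 2).mean()                               # Monte Carlo estimate of the norm
    loss = (residual ** 2).mean() + (norm - 1.0) ** 2    # PDE residual + normalization
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"learned E ~ {E.item():.3f}  (analytic ground state: pi^2/2 ~ {3.1416**2 / 2:.3f})")
```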
--------------------------------------------------------------------------------------------------------
Age of Information Analysis for Multi-Priority Queue and NOMA Enabled C-V2X in IoV
This paper addresses the growing need for real-time data in Internet-of-Vehicles (IoV) and Intelligent Transportation Systems (ITS). It introduces the concept of Age of Information (AoI) to analyze the performance of Connected Vehicle-to-Everything (C-V2X) communication systems enhanced with Non-Orthogonal Multiple Access (NOMA). The study proposes an AoI estimation method based on multi-priority data type queues and considers NOMA's influence under different Resource Reservation Interval conditions. This research could lead to improved performance in vehicular communication systems, enhancing road safety, traffic management, and the overall efficiency of intelligent transportation systems.
Authors: Zheng Zhang, Qiong Wu, Pingyi Fan, Ke Xiong
Link: https://arxiv.org/abs/2408.00223v1
Date: 2024-08-01
Summary:
As Internet-of-Vehicles (IoV) technology develops and demand for Intelligent Transportation Systems (ITS) increases, there is a growing need for real-time data and communication by vehicle users. Traditional request-based methods face challenges such as latency and bandwidth limitations. Mode 4 in Connected Vehicle-to-Everything (C-V2X) addresses latency and overhead issues through autonomous resource selection. However, Semi-Persistent Scheduling (SPS) based on distributed sensing may lead to increased collisions. Non-Orthogonal Multiple Access (NOMA) can alleviate the problem of reduced packet reception probability due to collisions. Moreover, the concept of Age of Information (AoI) is introduced as a comprehensive metric reflecting reliability and latency performance, and is used to analyze the impact of NOMA on the C-V2X communication system. AoI indicates the time a message spends in both local waiting and transmission processes. In C-V2X, the waiting process can be extended to a queuing process, influenced by the packet generation rate and the Resource Reservation Interval (RRI). The transmission process is mainly affected by transmission delay and success rate. In C-V2X, a smaller selection window (SW) limits the number of available resources for vehicles, resulting in higher collision rates as the number of vehicles increases. SW is generally equal to RRI, which affects AoI in both the queuing process and the transmission process. Therefore, this paper proposes an AoI estimation method based on multi-priority data type queues and considers the influence of NOMA on the AoI generated in both processes in the C-V2X system under different RRI conditions. This work aims to achieve better C-V2X system performance compared with some known algorithms.
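Age of Information itself is simple to compute once generation and delivery times are known: at any time t, AoI is t minus the generation time of the freshest update delivered so far. A toy calculation with made-up timestamps (the paper's queueing and NOMA modeling is far richer) is shown below.

```python
# Each update: (generation_time, delivery_time) at the receiver, in ms.
updates = [(0, 12), (20, 45), (40, 52), (60, 110), (80, 95)]

def age_of_information(t, updates):
    """AoI(t) = t minus the generation time of the freshest update delivered by time t."""
    delivered = [g for g, d in updates if d <= t]
    return t - max(delivered) if delivered else float("inf")

for t in (30, 60, 100, 120):
    print(f"t={t:4d} ms  AoI={age_of_information(t, updates):6.1f} ms")
```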
--------------------------------------------------------------------------------------------------------
Rolling in the deep of cognitive and AI biases
This paper addresses the critical issue of bias in AI systems, emphasizing the need to understand AI as a sociotechnical system influenced by human and societal factors. The authors propose a new methodology that incorporates human cognitive biases as core entities in AI fairness analysis. By mapping human heuristics to AI biases, they reveal hidden pathways and interdependencies. This approach could lead to more comprehensive and effective strategies for mitigating AI bias, potentially improving fairness and equity in AI applications across sensitive domains like healthcare, financial services, and law enforcement.
Authors: Athena Vakali, Nicoleta Tantalaki
Link: https://arxiv.org/abs/2407.21202v1
Date: 2024-07-30
Summary:
Nowadays, we delegate many of our decisions to Artificial Intelligence (AI), which acts either alone or as a human companion in decisions made to support several sensitive domains, like healthcare, financial services, and law enforcement. AI systems, even when carefully designed to be fair, are heavily criticized for delivering misjudged and discriminatory outcomes against individuals and groups. Numerous works on AI algorithmic fairness are devoted to Machine Learning pipelines that address biases and quantify fairness from a purely computational view. However, continued unfair and unjust AI outcomes indicate an urgent need to understand AI as a sociotechnical system, inseparable from the conditions in which it is designed, developed, and deployed. Although the synergy of humans and machines seems imperative to make AI work, the significant impact of human and societal factors on AI bias is currently overlooked. We address this critical issue by following a radical new methodology under which human cognitive biases become core entities in our AI fairness overview. Inspired by the cognitive science definition and taxonomy of human heuristics, we identify how harmful human actions influence the overall AI lifecycle and reveal hidden pathways from human biases to AI biases. We introduce a new mapping that relates human heuristics to their reflections as AI biases, and we detect the relevant fairness intensities and interdependencies. We envision that this approach will contribute to revisiting AI fairness through deeper human-centric case studies, revealing the hidden causes and effects of bias.
--------------------------------------------------------------------------------------------------------
TopicTag introduces an innovative method for automating topic labeling in documents clustered using Non-negative Matrix Factorization (NMF). By leveraging large language models and prompt engineering, the system generates accurate topic labels without requiring manual intervention from subject matter experts. This approach could significantly enhance knowledge management and document organization in various fields, streamlining the process of analyzing and categorizing large collections of scientific literature, patents, or other text-based datasets.
Authors: Selma Wanna, Ryan Barron, Nick Solovyev, Maksim E. Eren, Manish Bhattarai, Kim Rasmussen, Boian S. Alexandrov
Link: https://arxiv.org/abs/2407.19616v1
Date: 2024-07-29
Summary:
Topic modeling is a technique for organizing and extracting themes from large collections of unstructured text. Non-negative matrix factorization (NMF) is a common unsupervised approach that decomposes a term frequency-inverse document frequency (TF-IDF) matrix to uncover latent topics and segment the dataset accordingly. While useful for highlighting patterns and clustering documents, NMF does not provide explicit topic labels, necessitating subject matter experts (SMEs) to assign labels manually. We present a methodology for automating topic labeling in documents clustered via NMF with automatic model determination (NMFk). By leveraging the output of NMFk and employing prompt engineering, we utilize large language models (LLMs) to generate accurate topic labels. Our case study on over 34,000 scientific abstracts on Knowledge Graphs demonstrates the effectiveness of our method in enhancing knowledge management and document organization.
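The labeling pipeline can be sketched end to end with scikit-learn: factorize a TF-IDF matrix with NMF, pull each topic's top terms, and format them into a labeling prompt. NMFk's automatic selection of the topic count and the paper's prompt engineering are replaced by placeholders, and the LLM call itself is left abstract.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "knowledge graphs link entities and relations for reasoning",
    "graph embeddings encode nodes for link prediction",
    "transformers pretrain language models on large corpora",
    "prompt engineering steers large language models at inference",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                       # documents x terms

k = 2                                               # NMFk would choose k automatically
model = NMF(n_components=k, init="nndsvda", random_state=0)
W = model.fit_transform(X)                          # document-topic weights
H = model.components_                               # topic-term weights

terms = tfidf.get_feature_names_out()
for topic_id, row in enumerate(H):
    top_terms = [terms[i] for i in row.argsort()[::-1][:5]]
    prompt = (f"Given the keywords {', '.join(top_terms)}, "
              f"return a short human-readable topic label.")
    # label = call_llm(prompt)   # placeholder for the LLM labeling step
    print(topic_id, top_terms)
```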
--------------------------------------------------------------------------------------------------------
Future of Artificial Intelligence in Agile Software Development
This paper explores the potential of AI to transform agile software development processes. By leveraging large language models, generative AI, and AI agents, the proposed approach aims to assist software development teams in performing routine tasks, risk analysis, strategy recommendations, and decision-making. This integration of AI into agile methodologies could lead to increased efficiency, reduced risks, and higher project success rates. The paper also suggests that AI can help break down complex notions for stakeholders, potentially improving communication and decision-making in software development projects across various industries.
Authors: Mariyam Mahboob, Mohammed Rayyan Uddin Ahmed, Zoiba Zia, Mariam Shakeel Ali, Ayman Khaleel Ahmed
Link: https://arxiv.org/abs/2408.00703v1
Date: 2024-08-01
Summary:
The advent of Artificial Intelligence brings promising advantages that can be utilized to transform the landscape of software project development. The software process framework consists of activities that constantly require routine human interaction, leading to the possibility of errors and uncertainties. AI can assist software development managers, software testers, and other team members by leveraging LLMs, GenAI models, and AI agents to perform routine tasks, carry out risk analysis and prediction, recommend strategies, and support decision-making. AI has the potential to increase efficiency and reduce the risks encountered by the project management team while increasing project success rates. It can also break down complex notions and development processes so that stakeholders can make informed decisions. In this paper, we propose an approach in which AI tools and technologies can be utilized to provide maximum assistance for agile software projects, which have become increasingly favored in the industry in recent years.
--------------------------------------------------------------------------------------------------------