publications | Thang M. Pham

2024

arXiv

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

Thang Pham, Phat T Nguyen, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, and Trung Bui

arXiv preprint arXiv:2411.09944, 2024

While small language models (SLMs) show promise for mobile deployment, their real-world performance and applications on smartphones remain underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering, and suggestion tasks. Our smallest model demonstrates efficient performance on the S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings provide valuable insights and illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.

arXiv

Taipan: Efficient and Expressive State Space Language Models with Selective Attention

Chien Van Nguyen, Huy Huu Nguyen, Thang Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, and others

arXiv preprint arXiv:2410.18572, 2024

Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba’s efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan’s superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.

NAACL

PEEB: Part-based Image Classifiers with an Explainable and Editable Language Bottleneck

Thang Pham^*, Peijie Chen^*, Tin Nguyen^*, Seunghyun Yoon, Trung Bui, and Anh Nguyen

In Findings of the Association for Computational Linguistics: NAACL 2024, Jun 2024

CLIP-based classifiers rely on the prompt containing a class name that is known to the text encoder. Therefore, they perform poorly on new classes or on classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB, an explainable and editable classifier to (1) express the class name as a set of text descriptors that describe the visual parts of that class; and (2) match the embeddings of the detected parts to their textual descriptors in each class to compute a logit score for classification. In a zero-shot setting where the class names are unknown, PEEB outperforms CLIP by a huge margin (∼10× in top-1 accuracy). Compared to part-based classifiers, PEEB is not only the state-of-the-art (SOTA) in the supervised-learning setting (88.80% and 92.20% accuracy on CUB-200 and Stanford Dogs-120, respectively) but also the first to enable users to edit the text descriptors to form a new classifier without any re-training. Compared to concept bottleneck models, PEEB is also the SOTA in both zero-shot and supervised-learning settings.
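
The part-to-descriptor matching described in this abstract can be pictured with a small sketch. This is not the PEEB implementation: it assumes the logit for a class is the sum of dot products between each detected part's embedding and that class's descriptor embedding for the same part, and the part names, vectors, and function names below are all hypothetical toy values.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def class_logits(part_embeddings, class_descriptors):
    """Score each class by summing part-vs-descriptor similarities.

    part_embeddings:   {part_name: vector} for the parts detected in one image.
    class_descriptors: {class_name: {part_name: vector}} text-descriptor embeddings.
    The sum-of-dot-products scoring rule is an assumption made for this sketch.
    """
    return {
        cls: sum(dot(part_embeddings[p], descriptors[p]) for p in part_embeddings)
        for cls, descriptors in class_descriptors.items()
    }


def classify(part_embeddings, class_descriptors):
    """Return the class with the highest logit."""
    logits = class_logits(part_embeddings, class_descriptors)
    return max(logits, key=logits.get)


# Toy data (hypothetical): two parts, two classes, 2-d embeddings.
parts = {"beak": [1.0, 0.0], "wing": [0.0, 1.0]}
classes = {
    "cardinal": {"beak": [0.9, 0.1], "wing": [0.2, 0.8]},
    "crow": {"beak": [0.1, 0.9], "wing": [0.7, 0.3]},
}
prediction = classify(parts, classes)
```

In this picture, editing a class amounts to replacing entries in `class_descriptors`, which changes the logits without touching any trained weights.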

2023

EACL

PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search

Thang Pham, Seunghyun Yoon, Trung Bui, and Anh Nguyen

In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, May 2023

While contextualized word embeddings have been a de facto standard, learning contextualized phrase embeddings is less explored and is hindered by the lack of a human-annotated benchmark that tests machine understanding of phrase semantics given a context sentence or paragraph (instead of phrases alone). To fill this gap, we propose PiC, a dataset of ∼28K noun phrases accompanied by their contextual Wikipedia pages and a suite of three tasks for training and evaluating phrase embeddings. Training on PiC improves ranking models’ accuracy and remarkably pushes span selection (SS) models (i.e., predicting the start and end index of the target phrase) near human accuracy, which is 95% Exact Match (EM) on semantic search given a query phrase and a passage. Interestingly, we find evidence that such impressive performance is because the SS models learn to better capture the common meaning of a phrase regardless of its actual context. SotA models perform poorly in distinguishing two senses of the same phrase in two contexts (∼60% EM) and in estimating the similarity between two different phrases in the same context (∼70% EM).

arXiv

Semi-supervised Neural Machine Translation with Consistency Regularization for Low-Resource Languages

Viet H Pham, Thang Pham^*, Giang Nguyen^*, Long Nguyen, and Dien Dinh

arXiv preprint arXiv:2304.00557, May 2023

The advent of deep learning has led to significant gains in machine translation. However, most studies require a large parallel dataset, which is scarce, expensive to construct, and even unavailable for some languages. This paper presents a simple yet effective method to tackle this problem for low-resource languages by augmenting high-quality sentence pairs and training NMT models in a semi-supervised manner. Specifically, our approach combines the cross-entropy loss for supervised learning with a KL divergence loss for unsupervised learning, given pseudo and augmented target sentences derived from the model. We also introduce a SentenceBERT-based filter to enhance the quality of the augmented data by retaining semantically similar sentence pairs. Experimental results show that our approach significantly improves NMT baselines, especially on low-resource datasets, by 0.46–2.03 BLEU points. We also demonstrate that using unsupervised training for augmented data is more efficient than reusing the ground-truth target sentences for supervised learning.
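
As a rough illustration of combining a supervised cross-entropy term with a KL-divergence consistency term, here is a minimal per-token sketch. It is not the paper's training code: the direction of the KL term, the `weight` parameter, and the choice to compare the model's distributions on an original and an augmented input are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the reference token."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def semi_supervised_loss(sup_logits, target_index, orig_logits, aug_logits, weight=1.0):
    """Supervised CE plus a weighted consistency term (hypothetical weighting).

    sup_logits:  model logits for a token of a labeled sentence pair.
    orig_logits: model logits on an unlabeled (pseudo-target) example.
    aug_logits:  model logits on the augmented version of that example.
    """
    ce = cross_entropy(softmax(sup_logits), target_index)
    kl = kl_divergence(softmax(orig_logits), softmax(aug_logits))
    return ce + weight * kl


# Toy logits over a 3-token vocabulary (made-up numbers).
loss_consistent = semi_supervised_loss([2.0, 0.5, 0.1], 0, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
loss_inconsistent = semi_supervised_loss([2.0, 0.5, 0.1], 0, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

When the two unlabeled distributions agree, the KL term vanishes and only the supervised loss remains, so the consistency term penalizes only disagreement between the original and augmented predictions.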

2022

AACL

Double Trouble: How to not Explain a Text Classifier’s Decisions Using Counterfactuals Synthesized by Masked Language Models?

Thang Pham, Trung Bui, Long Mai, and Anh Nguyen

In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Nov 2022

A principle behind dozens of attribution methods is to take the difference in prediction before and after an input feature (here, a token) is removed as its attribution. A popular Input Marginalization (IM) method (Kim et al., 2020) uses BERT to replace a token, yielding more plausible counterfactuals. While Kim et al. (2020) reported that IM is effective, we find this conclusion unconvincing, as the Deletion-BERT metric used in their paper is biased towards IM. Importantly, this bias exists in Deletion-based metrics, including Insertion, Sufficiency, and Comprehensiveness. Furthermore, our rigorous evaluation using 6 metrics and 3 datasets finds no evidence that IM is better than a Leave-One-Out (LOO) baseline. We find two reasons why IM is not better than LOO: (1) deleting a single word from the input only marginally reduces a classifier’s accuracy; and (2) a highly predictable word is always given near-zero attribution, regardless of its true importance to the classifier. In contrast, making LIME samples more natural via BERT consistently improves LIME accuracy under several ROAR metrics.
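
The Leave-One-Out baseline mentioned in this abstract can be sketched in a few lines: a token's attribution is the classifier's score on the full input minus its score with that token deleted. The classifier below is a hypothetical keyword-based stand-in, not a model from the paper, and IM's BERT-based replacement step is not shown.

```python
def leave_one_out(tokens, predict):
    """Attribution of token i = predict(all tokens) - predict(tokens without i)."""
    base = predict(tokens)
    return [base - predict(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def toy_predict(tokens):
    """Hypothetical classifier: confidence that the text is positive."""
    positive_words = {"great", "good"}
    hits = sum(t in positive_words for t in tokens)
    return min(1.0, 0.5 + 0.25 * hits)


attributions = leave_one_out(["a", "great", "movie"], toy_predict)
```

With this toy classifier only "great" changes the score when removed, so it is the only token that receives a non-zero attribution.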

2021

ACL

Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks?

Thang Pham, Trung Bui, Long Mai, and Anh Nguyen

In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Aug 2021

Do state-of-the-art natural language understanding models care about word order, one of the most important characteristics of a sequence? Not always! We found that 75% to 90% of the correct predictions of BERT-based classifiers, trained on many GLUE tasks, remain constant after input words are randomly shuffled. Although BERT embeddings are famously contextual, the contribution of each individual word to downstream tasks is almost unchanged even after the word’s context is shuffled. BERT-based models are able to exploit superficial cues (e.g. the sentiment of keywords in sentiment analysis; or the word-wise similarity between sequence-pair inputs in natural language inference) to make correct decisions when tokens are arranged in random orders. Encouraging classifiers to capture word order information improves the performance on most GLUE tasks, SQuAD 2.0 and out-of-sample data. Our work suggests that many GLUE tasks are not challenging machines to understand the meaning of a sentence.
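
The shuffling test described in this abstract can be illustrated with a toy setup: shuffle the words of each input and count how often the prediction stays the same. The keyword-based classifier and the example sentences are hypothetical, and unlike the paper this sketch does not restrict the count to originally correct predictions.

```python
import random


def shuffle_words(sentence, rng):
    """Return the sentence with its whitespace-separated words in random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def toy_sentiment(sentence):
    """Hypothetical bag-of-words classifier that relies only on keyword cues."""
    positive = {"great", "wonderful"}
    negative = {"boring", "awful"}
    score = sum((w in positive) - (w in negative) for w in sentence.lower().split())
    return "positive" if score >= 0 else "negative"


def fraction_unchanged(sentences, classify, seed=0):
    """Share of inputs whose predicted label survives word shuffling."""
    rng = random.Random(seed)
    same = sum(classify(s) == classify(shuffle_words(s, rng)) for s in sentences)
    return same / len(sentences)


examples = ["a great and wonderful film", "the plot was boring and awful"]
unchanged = fraction_unchanged(examples, toy_sentiment)
```

Because the toy classifier ignores word order entirely, every prediction survives shuffling; a model that leans on similar keyword cues would show the same insensitivity.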

2019

EMNLP-BioNLP

A Neural Pipeline Approach for the PharmaCoNER Shared Task using Contextual Exhaustive Models

Mohammad Golam Sohrab, Thang Pham, Makoto Miwa, and Hiroya Takamura

In Proceedings of the 5th Workshop on BioNLP Open Shared Tasks, Nov 2019

We present a neural pipeline approach that performs named entity recognition (NER) and concept indexing (CI), which links the recognized entities to concept unique identifiers (CUIs) in a knowledge base, for the PharmaCoNER shared task on pharmaceutical drugs and chemical entities. We propose a neural NER model that captures the surrounding semantic information of a given sequence by capturing the forward and backward context of the bidirectional LSTM (Bi-LSTM) output of a target span using a contextual span representation-based exhaustive approach. The NER model enumerates all possible spans as potential entity mentions and classifies them into entity types or no entity with deep neural networks. To represent spans, we compare several different neural network architectures and their ensembling for the NER model. We then perform dictionary matching for CI and, if there is no match, we further compute similarity scores between a mention and CUIs using entity embeddings to assign the CUI with the highest score to the mention. We evaluate our approach on the two sub-tasks in the shared task. Among the five submitted runs, the best run for each sub-task achieved an F-score of 86.76% on Sub-task 1 (NER) and an F-score of 79.97% (strict) on Sub-task 2 (CI).

SEPLN-IberLEF

A Generic Neural Exhaustive Approach for Entity Recognition and Sensitive Span Detection

Mohammad Golam Sohrab, Thang Pham, and Makoto Miwa

In IberLEF@SEPLN, Nov 2019

In this work, we present a deep exhaustive framework for the MEDDOCAN shared task. The framework employs a generic named entity recognition (NER) model that captures the underlying semantic information of texts. The key idea of our model is to enumerate all possible spans as potential entity mentions and classify them with deep neural networks. We introduce different sets of learning algorithms, including base representation (BR) average (BR-Avg), BR with attention mechanism (BR-Attn), LSTM-Minus-based average (LM-Avg), and LSTM-Minus-based attention (LM-Attn), each used with or without context after the LSTM layer (Context or None), and an ensemble approach using maximum voting of all the approaches. We evaluate our exhaustive model on two sub-tasks in the MEDDOCAN shared task in the medical domain using the official evaluation script. Among the five submitted runs, the best run for each sub-task achieved an F-score of 93.12% on Sub-task 1 and F-scores of 93.52% (strict) and 94.92% (merged) on Sub-task 2 without any external knowledge resources.
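
The span enumeration at the heart of the exhaustive approach can be sketched as follows. This shows only the candidate-generation step, not the neural classifier; the `max_len` cap on span width is an assumption added to keep the candidate set small, and the example tokens are made up.

```python
def enumerate_spans(tokens, max_len):
    """List every contiguous span of up to max_len tokens.

    Returns (start, end, span_tokens) triples with an exclusive end index.
    Each triple is a candidate entity mention that a classifier would then
    label with an entity type or "no entity".
    """
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end, tokens[start:end]))
    return spans


candidates = enumerate_spans(["Hospital", "San", "Pedro"], max_len=2)
```

For three tokens and a cap of two, this yields five candidates: three single-token spans and two two-token spans.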