Abstract

In this article, we present a model for analyzing co-occurrence count data arising in practice, such as user–item or item–item data from online shopping platforms and co-occurring word–word pairs in text sequences. Such data contain important information for developing recommender systems or for studying the relevance of items or words from non-numerical sources. Unlike in traditional regression models, there are no observations for covariates. Additionally, the co-occurrence matrix is typically of such high dimension that it does not fit into a computer’s memory for modeling. We extract numerical data by defining co-occurrence windows and using weighted counts on a continuous scale. Positive probability mass is allowed for zero observations. We present the Shared Parameter Alternating Tweedie (SA-Tweedie) model and an algorithm to estimate its parameters. We introduce a learning rate adjustment, used along with the Fisher scoring method in the inner loop, to help the algorithm stay on track in the optimizing direction. Gradient descent with the Adam update was also considered as an alternative estimation method. Simulation studies showed that our algorithm with Fisher scoring and learning rate adjustment outperforms the other two methods. We applied SA-Tweedie to English-language Wikipedia dump data to obtain dense vector representations for WordPiece tokens. The resulting embeddings were then used in a Named Entity Recognition (NER) task. The SA-Tweedie embeddings significantly outperform GloVe, random, and BERT embeddings in the NER task. A notable strength of the SA-Tweedie embedding is that its number of parameters and training cost are only a tiny fraction of those for BERT.
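
The abstract's windowed, weighted co-occurrence counts can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `weighted_cooccurrence`, the window size, and the 1/distance weighting (borrowed from GloVe-style counting) are assumptions made for illustration only. Counts are kept in a sparse dictionary, so pairs that never co-occur are simply absent rather than stored as zeros.

```python
from collections import defaultdict


def weighted_cooccurrence(tokens, window=2):
    """Accumulate distance-weighted co-occurrence counts in a sparse dict.

    Assumption: a pair at distance d inside the window contributes 1/d.
    The weighting used in the article may differ.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            # Symmetric update: both orderings of the pair get the weight.
            counts[(center, tokens[j])] += 1.0 / d
            counts[(tokens[j], center)] += 1.0 / d
    return dict(counts)


# Hypothetical toy sequence; real input would be a tokenized corpus.
counts = weighted_cooccurrence(["the", "cat", "sat", "on", "the", "mat"], window=2)
```

The resulting values are non-negative and continuous, with most token pairs unobserved, which is the kind of zero-heavy data for which the abstract allows positive probability mass at zero.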

Details

Title
Global Dense Vector Representations for Words or Items Using Shared Parameter Alternating Tweedie Model
Author
Kim, Taejoon 1; Wang, Haiyan 2

1 Department of Statistics and Biostatistics, California State University East Bay, Hayward, CA 94542, USA
2 Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
Publication title
Volume
13
Issue
4
First page
612
Publication year
2025
Publication date
2025
Publisher
MDPI AG
Place of publication
Basel
Country of publication
Switzerland
Publication subject
e-ISSN
2227-7390
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-02-13
Milestone dates
2024-12-31 (Received); 2025-02-10 (Accepted)
First posting date
2025-02-13
ProQuest document ID
3171089675
Document URL
https://www.proquest.com/scholarly-journals/global-dense-vector-representations-words-items/docview/3171089675/se-2?accountid=208611
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-02-26
Database
ProQuest One Academic