Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective
Recently, significant advances have been witnessed in the area of distributed word representations based on neural networks, which are also known as word embeddings. Among the new word embedding models, skip-gram negative sampling (SGNS) in the word2vec toolbox has attracted much attention due to its simplicity and effectiveness. However, the principles of SGNS remain not well understood, except for a recent work that explains SGNS as an implicit matrix factorization of the pointwise mutual information (PMI) matrix. In this paper, we provide a new perspective for further understanding SGNS. We point out that SGNS is essentially a representation learning method, which learns to represent the co-occurrence vector for a word. Based on the representation learning view, SGNS is in fact an explicit matrix factorization (EMF) of the words' co-occurrence matrix. Furthermore, extended supervised word embedding can be established based on our proposed representation learning view.
This is a foundational study of the SGNS problem. The paper argues that SGNS is essentially a representation learning method, and that what it learns is a word's co-occurrence vector. In other words, does the distance between two learned vectors then represent the distance between two "co-occurrence" vectors?
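The matrix-factorization view in the abstract (SGNS as a factorization of a PMI or co-occurrence matrix) can be illustrated with a minimal sketch: count word-word co-occurrences, form a positive-PMI matrix, and factorize it with truncated SVD to obtain word vectors whose distances reflect co-occurrence similarity. This follows the Levy-Goldberg implicit-MF reading cited in the abstract, not the paper's own EMF algorithm; the toy corpus, window size, and rank are illustrative choices, not from the paper.

```python
# Sketch of the matrix-factorization view of word embeddings:
# co-occurrence counts -> positive PMI -> truncated SVD -> word vectors.
# Toy corpus, window size, and rank k are made-up illustrative choices.
import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/- 2-word window.
window = 2
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# Positive PMI: max(0, log(p(w,c) / (p(w) p(c)))); zero counts give
# log(0) = -inf, which the max() clips to 0.
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / (pw * pc))
ppmi = np.maximum(pmi, 0.0)

# Rank-k truncated SVD; embed each word as U_k * sqrt(S_k).
U, S, Vt = np.linalg.svd(ppmi)
k = 2
vecs = U[:, :k] * np.sqrt(S[:k])

def sim(a, b):
    """Cosine similarity between the embeddings of words a and b."""
    va, vb = vecs[idx[a]], vecs[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))
```

SGNS itself reaches a related factorization through stochastic updates rather than an explicit SVD; the point of the sketch is only that the learned vectors are low-rank representations of rows of a co-occurrence-derived matrix.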
Powered by ScienceNet.cn