In recent years, with the breakthrough of deep learning in natural language processing, a growing number of studies have shown that the word2vec model (Mikolov et al., 2013), which is based on deep learning, significantly outperforms the LDA model. Word2vec represents words as vectors, and the similarity among word vectors is used to measure the similarity among texts. It overcomes the deficiencies of the bag-of-words representation and mines the associations among words to obtain richer semantic information. Studies applying the word2vec model to topic analysis and clustering are emerging, and several have shown that word2vec is better suited than LDA to extracting topics from short texts such as microblog entries (Zhu, 2014; Li et al., 2015; Chen et al., 2015).
Automatic topic extraction from research projects using lexical approaches, such as Probabilistic Latent Semantic Analysis (pLSA) (Steyvers & Griffiths, 2007) and Latent Dirichlet Allocation (LDA) (Blei, Ng & Jordan, 2003), has been investigated. However, the relationships between projects cannot be measured directly. Moreover, research project descriptions are brief and lack attributes such as citations and references. Thus, intercitation and co-citation techniques cannot be applied, even though projects eventually yield articles among their research outputs.[i]
The Latent Dirichlet Allocation (LDA) model proposed by Blei et al. (2003) is applied to extract topics from the corpus. LDA is a three-level Bayesian model now widely used to discover latent topics in collections of documents. It represents each document as a probability distribution over topics, where each topic is represented as a probability distribution over words. For a detailed explanation of the algorithm, refer to, e.g., Blei (2012). The Gensim library (Rehurek & Sojka, 2010) is used to implement the LDA model, with parameters set to the default values provided by Gensim. Considering the size of the dataset and the experience of previous studies (Ding, 2011), the number of topics is set to five.
In our work, we use Latent Dirichlet Allocation (LDA) (Blei, Ng & Jordan, 2003), which assumes a generative model of text in which the topic proportions of documents and the word distributions of topics are drawn from Dirichlet priors.
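The generative process this assumes can be written compactly (notation as in Blei, Ng & Jordan, 2003: $\alpha$ and $\eta$ are Dirichlet hyperparameters, $\theta_d$ the topic proportions of document $d$, $\beta_k$ the word distribution of topic $k$):

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dir}(\eta) && \text{word distribution of topic } k \\
\theta_d &\sim \mathrm{Dir}(\alpha) && \text{topic proportions of document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Mult}(\theta_d) && \text{topic assignment of word position } n \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Mult}(\beta_{z_{d,n}}) && \text{observed word}
\end{align*}
```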
Topic modeling algorithms are powerful computational tools that help organize and summarize information at a scale that would be impossible by hand (D. Blei, Carin, & Dunson, 2011). Classic topic modeling algorithms include latent semantic analysis/indexing (LSI) (Deerwester, 1990), probabilistic latent semantic analysis/indexing (PLSI) (Hofmann, 1999), latent Dirichlet allocation (LDA) (D. M. Blei, Ng, & Jordan, 2003), and the hierarchical Dirichlet process (HDP) (Wang, Paisley, & Blei, 2011). Each algorithm has unique advantages. Specifically, LSI captures synonyms effectively, while PLSI builds on LSI and is well suited to distinguishing polysemy. Benefiting from its Bayesian framework, LDA significantly reduces overfitting. Unlike the three algorithms above, HDP can identify the number of topics automatically.
[i] Sugimoto, C. R., Li, D., Russell, T. G., Finlay, S. C., & Ding, Y. (2011). The shifting sands of disciplinary development: Analyzing North American Library and Information Science dissertations using Latent Dirichlet Allocation. Journal of the American Society for Information Science and Technology, 62(1), 185-204.