fasttext is a widely used open-source tool developed by Facebook.
It has two main uses: word representation learning and text classification.
Installation: pip install fasttext
It supports Python 2.6 or newer and requires Cython to build the C++ extension.
Feature 1: word2vec (word representation learning)
Input data format: a file containing UTF-8 encoded text
1. Word representation learning
model = fasttext.skipgram(input_file, output)
or: model = fasttext.cbow(input_file, output)
Model parameters (default values in square brackets):
input_file        training file path (required)
output            output file path (required)
lr                learning rate [0.05]
lr_update_rate    change the rate of updates for the learning rate [100]
dim               size of word vectors [100]
ws                size of the context window [5]
epoch             number of epochs [5]
min_count         minimal number of word occurrences [5]
neg               number of negatives sampled [5]
word_ngrams       max length of word ngram [1]
loss              loss function {ns, hs, softmax} [ns]
bucket            number of buckets [2000000]
minn              min length of char ngram [3]
maxn              max length of char ngram [6]
thread            number of threads [12]
t                 sampling threshold [0.0001]
silent            disable the log output from the C++ extension [1]
encoding          specify input_file encoding [utf-8]
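The defaults listed above can be kept in one place and overridden selectively. The following is a minimal sketch rather than part of the library: SKIPGRAM_DEFAULTS, build_params, and train_skipgram are hypothetical names, and the call inside train_skipgram assumes the legacy Python binding described in this post (a skipgram function taking input_file and output plus the keyword arguments listed above). Newer releases of the fasttext package expose a different API, so treat this as illustrative only.

```python
# Defaults copied from the parameter list above (legacy fasttext binding).
SKIPGRAM_DEFAULTS = {
    "lr": 0.05,
    "lr_update_rate": 100,
    "dim": 100,
    "ws": 5,
    "epoch": 5,
    "min_count": 5,
    "neg": 5,
    "word_ngrams": 1,
    "loss": "ns",
    "bucket": 2000000,
    "minn": 3,
    "maxn": 6,
    "thread": 12,
    "t": 0.0001,
    "silent": 1,
    "encoding": "utf-8",
}

ALLOWED_LOSSES = {"ns", "hs", "softmax"}


def build_params(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(SKIPGRAM_DEFAULTS)
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(sorted(unknown)))
    params = dict(SKIPGRAM_DEFAULTS)
    params.update(overrides)
    if params["loss"] not in ALLOWED_LOSSES:
        raise ValueError("loss must be one of ns, hs, softmax")
    return params


def train_skipgram(input_file, output, **overrides):
    """Hypothetical wrapper; assumes the legacy binding is installed."""
    import fasttext  # imported lazily so the helpers above work without it
    return fasttext.skipgram(input_file, output, **build_params(**overrides))
```

For example, build_params(dim=300, epoch=10) returns the full parameter set with only those two values changed, and a misspelled parameter name raises ValueError before any training starts.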
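2. Obtaining word vectors for out-of-vocabulary words
model[word]    # get the vector of the specified word, including words absent from the dictionary
model.words    # list of words in the dictionary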
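A vector can be returned for an out-of-vocabulary word because the model builds it from character n-grams, whose lengths are bounded by the minn and maxn parameters. Below is a simplified sketch of the n-gram extraction step, following the scheme in the subword-information paper, where the word is wrapped in < and > boundary markers. char_ngrams is a hypothetical helper, not a function of the fasttext package, and it omits the hashing of n-grams into bucket slots that the real implementation performs.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` with lengths minn..maxn."""
    wrapped = "<" + word + ">"  # boundary markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word that shares many of these n-grams with known words ends up with a vector close to theirs.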
3. Load pre-trained model
model = fasttext.load_model('model.bin')    # path to a previously saved model file
Model attributes:
model.model_name       # Model name
model.words            # List of words in the dictionary
model.dim              # Size of word vector
model.ws               # Size of context window
model.epoch            # Number of epochs
model.min_count        # Minimal number of word occurrences
model.neg              # Number of negatives sampled
model.word_ngrams      # Max length of word ngram
model.loss_name        # Loss function name
model.bucket           # Number of buckets
model.minn             # Min length of char ngram
model.maxn             # Max length of char ngram
model.lr_update_rate   # Rate of updates for the learning rate
model.t                # Value of sampling threshold
model.encoding         # Encoding of the model
model[word]            # Get the vector of specified word
Feature 2: Text classification
Input data format: a text file containing one training sentence per line along with its labels. By default, labels are assumed to be words prefixed by the string __label__.
For example:
'example very long text __label__0'
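Training files in this format are easy to produce and check with a few lines of plain Python. The sketch below assumes the default __label__ prefix and places the label at the end of the line, as in the example above. format_example and split_labels are hypothetical helpers, not part of fasttext.

```python
LABEL_PREFIX = "__label__"  # default prefix described above


def format_example(text, label, prefix=LABEL_PREFIX):
    """Build one training line: the text followed by its prefixed label."""
    # Each example must stay on a single line, so collapse all whitespace.
    clean = " ".join(text.split())
    return "{} {}{}".format(clean, prefix, label)


def split_labels(line, prefix=LABEL_PREFIX):
    """Separate a training line into (text, list of labels)."""
    tokens = line.split()
    labels = [t[len(prefix):] for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix)]
    return " ".join(words), labels


line = format_example("example very long text", 0)
print(line)                # example very long text __label__0
print(split_labels(line))  # ('example very long text', ['0'])
```

Running split_labels over a training file before calling the trainer is a cheap way to catch lines that have no label at all.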
1. Training
classifier = fasttext.supervised(input_file, output)
Model parameters (default values in square brackets):
input_file           training file path (required)
output               output file path (required)
label_prefix         label prefix ['__label__']
lr                   learning rate [0.1]
lr_update_rate       change the rate of updates for the learning rate [100]
dim                  size of word vectors [100]
ws                   size of the context window [5]
epoch                number of epochs [5]
min_count            minimal number of word occurrences [1]
neg                  number of negatives sampled [5]
word_ngrams          max length of word ngram [1]
loss                 loss function {ns, hs, softmax} [softmax]
bucket               number of buckets [0]
minn                 min length of char ngram [0]
maxn                 max length of char ngram [0]
thread               number of threads [12]
t                    sampling threshold [0.0001]
silent               disable the log output from the C++ extension [1]
encoding             specify input_file encoding [utf-8]
pretrained_vectors   pretrained word vectors (.vec file) for supervised learning []
Classifier attributes and methods:
classifier.labels                    # List of labels
classifier.label_prefix              # Prefix of the label
classifier.dim                       # Size of word vector
classifier.ws                        # Size of context window
classifier.epoch                     # Number of epochs
classifier.min_count                 # Minimal number of word occurrences
classifier.neg                       # Number of negatives sampled
classifier.word_ngrams               # Max length of word ngram
classifier.loss_name                 # Loss function name
classifier.bucket                    # Number of buckets
classifier.minn                      # Min length of char ngram
classifier.maxn                      # Max length of char ngram
classifier.lr_update_rate            # Rate of updates for the learning rate
classifier.t                         # Value of sampling threshold
classifier.encoding                  # Encoding used by the classifier
classifier.test(filename, k)         # Test the classifier
classifier.predict(texts, k)         # Predict the most likely labels
classifier.predict_proba(texts, k)   # Predict the most likely labels along with their probabilities
2. Testing
result = classifier.test(filename)
Properties of the result:
result.precision   # Precision at one
result.recall      # Recall at one
result.nexamples   # Number of test examples
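For intuition about what result.precision and result.recall report: precision at one is the fraction of top-ranked predictions that are correct, and recall at one is the fraction of true labels recovered by those top-ranked predictions. The sketch below computes both from plain Python lists. precision_recall_at_one is a hypothetical helper that illustrates the definitions; it is not the library's own code.

```python
def precision_recall_at_one(top_predictions, true_labels):
    """Compute (precision@1, recall@1).

    top_predictions: one predicted label per example.
    true_labels: a list of gold labels per example (one or more each).
    """
    if len(top_predictions) != len(true_labels):
        raise ValueError("inputs must have the same length")
    # A prediction counts as correct when it is among the gold labels.
    correct = sum(1 for pred, gold in zip(top_predictions, true_labels)
                  if pred in gold)
    n_predicted = len(top_predictions)
    n_gold = sum(len(gold) for gold in true_labels)
    precision = correct / float(n_predicted) if n_predicted else 0.0
    recall = correct / float(n_gold) if n_gold else 0.0
    return precision, recall


preds = ["0", "1", "1", "0"]
gold = [["0"], ["1"], ["0"], ["0", "1"]]
p, r = precision_recall_at_one(preds, gold)
print(p, r)
```

When every test example carries exactly one label, the two values coincide, since the number of predictions equals the number of gold labels.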
3. Prediction
labels = classifier.predict(texts)    # texts: a list of strings to classify
Reference:
https://pypi.python.org/pypi/fasttext#enriching-word-vectors-with-subword-information