linqy's personal blog: http://blog.sciencenet.cn/u/linqy


Viewing Lucene Tokenization Results

2531 views | 2017-10-17 18:01 | Category: Lucene | System category: Research notes

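To compare how well different Chinese word segmentation algorithms perform, you need to look at the tokens each one actually produces.

The code below prints the tokens that Lucene's SmartChineseAnalyzer produces for a sample Chinese string.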

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class test {

    /**
     * Runs the given analyzer over analyzeStr and returns the tokens it emits, in order.
     */
    public static List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> response = new ArrayList<String>();
        TokenStream tokenStream = null;
        try {
            // "content" is just a field name; any name works for a one-off analysis like this.
            tokenStream = analyzer.tokenStream("content", new StringReader(analyzeStr));
            CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
            // reset() must be called before the first incrementToken().
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                response.add(attr.toString());
            }
            // end() completes the TokenStream workflow before close().
            tokenStream.end();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return response;
    }

    public static void main(String[] args) {
        // Try getAnalyseResult on a sample address and print one token per line.
        // No try block is needed here: getAnalyseResult handles IOException itself.
        String str = "山东省潍坊市高新技术产业开发区樱前街10815号。";
        List<String> tokens = getAnalyseResult(str, new SmartChineseAnalyzer());
        for (String s : tokens) {
            System.out.println(s);
        }
    }
}
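Once you have token lists from two analyzers, it helps to see where they disagree rather than reading both lists by eye. The following is a minimal sketch in plain Java (no Lucene dependency) that reports the tokens one result has and the other lacks. The class name `TokenDiff`, the method `onlyIn`, and the two sample token lists are all hypothetical: the lists are hand-written for illustration and are not real output from any analyzer. In practice you would pass in two lists returned by `getAnalyseResult`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TokenDiff {

    /** Returns the distinct tokens that appear in a but not in b, in first-seen order. */
    public static List<String> onlyIn(List<String> a, List<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        // Hand-written sample data (an assumption for illustration, not real analyzer output).
        List<String> coarse = Arrays.asList("山东省", "潍坊市", "高新技术产业开发区");
        List<String> fine = Arrays.asList("山东省", "潍坊", "市", "高新技术", "产业", "开发区");

        System.out.println("coarse: " + String.join(" | ", coarse));
        System.out.println("fine:   " + String.join(" | ", fine));
        System.out.println("only in coarse: " + onlyIn(coarse, fine));
        System.out.println("only in fine:   " + onlyIn(fine, coarse));
    }
}
```

This compares tokens as a set, so it ignores position and repeated tokens; that is usually enough to spot where two segmenters split the same phrase differently.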




https://blog.sciencenet.cn/blog-3134052-1081246.html

