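mmseg segments Chinese text well, so using it as the analyzer in Lucene yields more reasonable search results.
The following example shows how to use mmseg segmentation in Lucene: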
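// Test mmseg4j segmentation with Lucene
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;

class LuceneTest {
    static String text = "CSDN.NET - 全球最大中文IT社区,为IT专业技术人员提供最全面的信息传播和服务平台";
    static String text1 = "京华时报1月23日报道 昨天,受一股来自中西伯利亚的强冷空气影响,本市出现大风降温天气,白天最高气温只有零下7摄氏度,同时伴有6到7级的偏北风。";

    public void mmesgtest() throws Exception {
        // ComplexAnalyzer is one of the mmseg analyzers; simple and max-word variants also exist
        Analyzer analyzer = new ComplexAnalyzer();
        // Directory on disk that holds the index
        Path indexPath = Paths.get("D:\\java_test\\lucene\\data_index_mmesg");
        try (Directory directory = FSDirectory.open(indexPath)) {
            IndexWriterConfig iwConfig = new IndexWriterConfig(analyzer);
            // CREATE_OR_APPEND: re-running this method appends the same documents again
            iwConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            // Index each sample string as one document; closing the writer commits the changes
            try (IndexWriter iwriter = new IndexWriter(directory, iwConfig)) {
                for (String item : Arrays.asList(text, text1)) {
                    Document doc = new Document();
                    doc.add(new TextField("text", item, Field.Store.YES));
                    iwriter.addDocument(doc);
                }
            }
            try (DirectoryReader ireader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(ireader);
                // A TermQuery is not analyzed: it matches only if the analyzer
                // emitted exactly this term when the text was indexed
                Query q = new TermQuery(new Term("text", "西伯利亚"));
                System.out.println(q);
                TopDocs tds = searcher.search(q, 10);
                System.out.println("======size:" + tds.totalHits + "========");
                // Print the score and the stored text of every hit
                for (ScoreDoc sd : tds.scoreDocs) {
                    System.out.println(sd.score);
                    System.out.println(searcher.doc(sd.doc).get("text"));
                }
            }
        }
    }
}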
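
Whether the TermQuery for "西伯利亚" finds anything depends entirely on how the analyzer cut the sentence into terms at index time. As a rough illustration of dictionary-based segmentation, here is a minimal sketch of forward maximum matching, the basic idea behind the "simple" mmseg mode. It is a toy written in plain Java and is not mmseg4j's implementation: the class name MaxMatchDemo, the four-word dictionary, and the maximum word length are all assumptions made up for this example, and the complex mode adds further disambiguation rules that are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy forward maximum matching. NOT mmseg4j's code; illustration only.
class MaxMatchDemo {

    // Hypothetical mini dictionary; a real segmenter loads a large word list.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "西伯利亚", "冷空气", "空气", "影响"));

    // Length (in characters) of the longest word in DICT.
    static final int MAX_WORD_LEN = 4;

    // At each position take the longest dictionary word that starts there;
    // if none matches, emit a single character and move on.
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + MAX_WORD_LEN);
            // Shrink the candidate until it is a dictionary word or one character.
            while (end > pos + 1 && !DICT.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [中, 西伯利亚, 的, 强, 冷空气, 影响]
        System.out.println(segment("中西伯利亚的强冷空气影响"));
    }
}
```

With this toy dictionary, "西伯利亚" comes out as its own token, which is the condition under which an exact term lookup would succeed. A segmenter that instead emitted "中西" followed by "伯利亚" would leave no such term in the index, and the same query would return no hits.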