
Tokenizing in Lucene and boosting a document or field according to specific words

2749 reads | 2017-10-24 15:04 | Personal category: Lucene | System category: Research notes | Tags: Lucene, Boost, tokenization

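Use Lucene to tokenize a string. If the tokenized string contains certain words, raise its boost while the index is being built, so that it ranks nearer the top in search results.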
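Here "contains" means that the tokenizer emits the word as a token of its own, which is stricter than a plain substring match. The following is a minimal sketch of that difference, not part of the original post: the class `KeywordCheck`, the method `missedKeywords`, and the hand-written token list are all assumptions made for illustration, and no real analyzer is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; it does not call Lucene.
class KeywordCheck {
    /**
     * Returns the keywords that occur in the raw text as substrings
     * but are absent from the token list, i.e. keywords the tokenizer
     * did not emit as standalone tokens.
     */
    static List<String> missedKeywords(String text, List<String> tokens, List<String> keywords) {
        List<String> missed = new ArrayList<String>();
        for (String keyword : keywords) {
            if (text.contains(keyword) && !tokens.contains(keyword)) {
                missed.add(keyword);
            }
        }
        return missed;
    }

    public static void main(String[] args) {
        String text = "亭湖新区官员名单公布";
        // Assumed tokenizer output, written by hand; a real analyzer may split differently.
        List<String> tokens = Arrays.asList("亭湖新区", "官员", "名单", "公布");
        List<String> keywords = Arrays.asList("新区", "官员", "不雅");
        System.out.println(missedKeywords(text, tokens, keywords));
    }
}
```

With this assumed token list, 新区 is reported as missed: it appears in the text only inside the longer token 亭湖新区, so a token-based rule would not count it.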

The example is as follows:

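import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.chenlb.mmseg4j.Dictionary;

/**
 * Environment: Lucene 4.1 / IKAnalyzer 2012 FF / mmseg4j 1.9
 * 1. Given an input text, get its Chinese word segmentation result;
 * 2. Given an input text, score it with a weight according to some rule,
 *    e.g. the more often the specified keywords occur in the text, the higher the score.
 */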
class AnalyzerTool {
    // Path to the mmseg4j dictionary
    private static final String MMSEG4J_DICT_PATH = "C:\\localwarehouse\\com\\chenlb\\mmseg4j\\mmseg4j-core\\1.9.0";
    private static Dictionary dictionary = Dictionary.getInstance(MMSEG4J_DICT_PATH);

    // Negative keywords: a text that contains any of these words gets a higher score.
    private static List<String> lstNegativeWord;
    static {
        lstNegativeWord = new ArrayList<String>();

        // The words below must exist in a dictionary, either the tokenizer's built-in one
        // or a user-defined one; otherwise the computed weight is inaccurate, because
        // the tokenizer does not emit the keyword as a token.
        lstNegativeWord.add("不雅");
        lstNegativeWord.add("新区");
        lstNegativeWord.add("官员");
    }

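    // Test how an analyzer tokenizes the given text (swap the analyzer to compare results)
    public static void testAnalyzer(String content) throws Exception {
        // Alternative: new IKAnalyzer(false).
        // Note: some Lucene 4.x releases only provide constructors that take a Version
        // argument, e.g. new SmartChineseAnalyzer(Version.LUCENE_41); adjust to the release in use.
        Analyzer analyzer = new SmartChineseAnalyzer();
        System.out.println("new SmartChineseAnalyzer() output: " + getAnalyseResult(content, analyzer));
    }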

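    // Compute the weight. Rule: look up each keyword in the tokens of the input string;
    // every distinct keyword found raises the weight (occurrences are not counted).
    public static float getBoost(String str) throws Exception {
        float result = 1.0F;

        // Default analyzer; can be replaced with another one
        Analyzer analyzer = new SmartChineseAnalyzer();
        List<String> list = getAnalyseResult(str, analyzer);
        for (String word : lstNegativeWord) {
            if (list.contains(word)) {
                result += 10F; // add 10 for each kind of negative keyword found, however often it occurs
            }
        }
        return result;
    }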

    // Get the tokenization result of a string: run the analyzer over the input,
    // add each token to a List, and return that List.
    public static List<String> getAnalyseResult(String analyzeStr, Analyzer analyzer) {
        List<String> response = new ArrayList<String>();
        TokenStream tokenStream = null;
        try {
            tokenStream = analyzer.tokenStream("content", new StringReader(analyzeStr));
            CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset(); // must be called before the first incrementToken()
            while (tokenStream.incrementToken()) {
                response.add(attr.toString());
            }
            tokenStream.end(); // signal end of stream before closing
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (tokenStream != null) {
                try {
                    tokenStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return response;
    }

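    // Trial run: print the tokens and the score of a sample text
    public void shiyan() throws Exception {
        // Note: the words 亭湖新区 and 亭湖 must exist in the user-defined dictionaries
        // of both IKAnalyzer and mmseg4j
        String content = "亭湖新区因不雅难过分视频被免官员国企老总名单公布";
        System.out.println("Original text: " + content);
        testAnalyzer(content);
        System.out.println("Score with the default analyzer: " + getBoost(content));
    }
}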
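The class comment describes a rule in which a higher keyword frequency gives a higher score, while `getBoost` adds 10 once per distinct keyword regardless of how often it occurs. A frequency-aware variant might look like the sketch below. It is an assumption rather than the author's method: the class `FrequencyBoost`, the method `boostByFrequency`, and the constants are hypothetical, and the token list stands in for the output of `getAnalyseResult`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical variant for illustration; it operates on an already tokenized text.
class FrequencyBoost {
    static final float BASE = 1.0F;    // same starting value as getBoost
    static final float PER_HIT = 10F;  // assumed increment per keyword occurrence

    /** Adds PER_HIT for every occurrence of every keyword in the token list. */
    static float boostByFrequency(List<String> tokens, List<String> keywords) {
        float result = BASE;
        for (String keyword : keywords) {
            result += PER_HIT * Collections.frequency(tokens, keyword);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed token list, written by hand; real analyzer output may differ.
        List<String> tokens = Arrays.asList("官员", "不雅", "视频", "官员");
        List<String> keywords = Arrays.asList("不雅", "新区", "官员");
        System.out.println(boostByFrequency(tokens, keywords)); // 1 + 10*1 + 10*0 + 10*2 = 31.0
    }
}
```

The example stops at computing the value. In Lucene 4.x it would typically be applied to an indexed field with `Field.setBoost(float)` before the document is added to the `IndexWriter`; that step is not shown in the original post.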

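To repost this article, please contact the original author for permission and state that it comes from the ScienceNet (科学网) blog of 林清莹.
Link: https://blog.sciencenet.cn/blog-3134052-1082267.html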
