Separating the individual sentences out of a paragraph is not an easy task: sentences do not begin and end in a regular way, and periods can appear inside a sentence. This makes it impossible to split sentences reliably with a single regular expression; sometimes you succeed, but most of the time you get it wrong. Here we first try regular expressions and then use the nltk module.
Part 1: Using regular expressions
import re
paragraph = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. I say. What's wrong with you? I am confused by your activity."
# Match the special whitespace that follows the end of a sentence, so that we can
# split on it afterwards. The two negative lookbehinds skip abbreviations such as
# "i.e." and "Mr."; the positive lookbehind requires ., ?, ! or " before the space.
rule = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|\")\s")
result = re.split(rule, paragraph)
for sentence in result:
    print sentence
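Run under Python 2, this prints one sentence per line; the two negative lookbehinds keep abbreviations such as "Mr.", "i.e." and "Jr." (and the bare decimal ".9") from triggering a split.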
# If the paragraph string itself contains double quotes, the code above raises an error, because the string literal is delimited with double quotes. In that case we should switch to triple double quotes or triple single quotes; I tested this and it works. Of course, the regular expression may also need to change. For example:
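A minimal sketch of the triple-quote variant (the sample text here is made up); note that the \" alternative in the lookbehind already lets a closing quote end a sentence:

import re
# A triple-quoted string can contain double quotes without any escaping.
paragraph = """He shouted "Stop!" Then he walked away."""
rule = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|\")\s")
for sentence in re.split(rule, paragraph):
    print sentence

Below is code that uses the regular expression to extract the sentences from a text file.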
import re
# Open the txt file, which must be in ANSI format.
# A txt file in Unicode format doesn't work; I don't know why
# (see the sketch after this code block for one possible workaround).
infile = open('test.txt')
input_result = infile.read()
rule = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|\")\s")
result = re.split(rule, input_result)
#for sentence in result:
#    print sentence
infile.close()
# This creates the output.txt file for you ("a+" appends, creating the file if needed).
output = open("output.txt", "a+")
for sentence in result:
    output.write(sentence)
    output.write("\n")
output.close()
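As for the Unicode problem noted above: one possibility (a guess, not verified against the original setup) is that reading the file as raw bytes confuses the byte-oriented regex. A minimal sketch that decodes the file explicitly with the standard codecs module, assuming the file is saved as UTF-8 (adjust the encoding to match your file):

import re
import codecs
# Read the file as decoded unicode text instead of raw bytes
# (assumed here to be UTF-8; change the encoding if your file differs).
infile = codecs.open('test.txt', encoding='utf-8')
text = infile.read()
infile.close()
# re.UNICODE makes \w and \s match their unicode-aware character classes.
rule = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|\")\s", re.UNICODE)
result = re.split(rule, text)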
Part 2: Extracting the sentences from a string
from nltk import tokenize
paragraph = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
print tokenize.sent_tokenize(paragraph)
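With the punkt sentence models installed, this should print something like ['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']; the tokenizer has learned that "Dr." is an abbreviation and does not split after it.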
Part 3: Extracting the sentences from a text file
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
fp.close()
print '\n-----\n'.join(tokenizer.tokenize(data))
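If nltk.data.load complains that the punkt data is missing, it can be fetched once with nltk's standard downloader:

import nltk
nltk.download('punkt')  # downloads the punkt sentence tokenizer models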
Note: for now I have not been able to install the nltk module successfully; it complains that some DLL file is missing!
References
http://stackoverflow.com/questions/9474395/how-to-break-up-a-paragraph-by-sentences-in-python
http://stackoverflow.com/questions/4576077/python-split-text-on-sentences