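While reading 《Shell 脚本学习指南》 (Classic Shell Scripting) [1] today, I was inspired to write a word-frequency counter, so I spent a few hours writing one in Perl; it runs to a little over 100 lines. word_freq is released here as free software [2] with its source code open. The source code is attached at the end of this post.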
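I. Runtime environment
1 perl
The program is written in Perl, so a perl interpreter [3] must be installed before running it. It can be downloaded from http://www.perl.org/get.html#win32
2 Command line
The program must be run from a command-line interface (CLI) [4], because it reads a text stream from standard input (STDIN) and prints its results to standard output (STDOUT). This makes I/O redirection and pipelines straightforward.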
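II. Input and output
1 Input
The input is plain text; Chinese text is not supported. The program reads data from standard input, so data can be supplied with the I/O redirection operator '<', through a pipe, or typed in by the user. For example: cat file | word_freq, or word_freq < file, or run word_freq, type some English words, and press Ctrl-D to finish. As the source code at the end of this post shows, file names may also be passed as arguments.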
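To make "plain text, no Chinese" concrete, here is a minimal sketch of how a line is split into words with the same pattern the source code uses, /[a-zA-Z]+/g. The helper name tokenize is made up for this illustration and is not part of word_freq. Anything that is not an ASCII letter (digits, punctuation, hyphens, Chinese characters) acts as a separator and is dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of word_freq): split one line into words,
# where a "word" is a run of ASCII letters, as in the original pattern.
sub tokenize {
    my ($line) = @_;
    my @words = $line =~ /[a-zA-Z]+/g;   # list context: all matches
    return @words;
}

my @words = tokenize("Don't panic: the B-genome has 7 chromosomes.");
print join(",", @words), "\n";
# prints: Don,t,panic,the,B,genome,has,chromosomes
```

Note that an apostrophe or hyphen splits a word in two, and numbers disappear entirely.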
2 Output
The output looks like this:
Rank  Word    Freq.  Sum
1     the     394    394
2     of      322    716
3     and     156    872
4     to      146    1018
5     in      123    1141
6     genome  98     1239
7     B       95     1334
8     a       84     1418
9     for     72     1490
10    were    69     1559
Total words in text: 7063
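The first column is the rank, the second the word, the third its number of occurrences, and the fourth the running total of occurrences; the last line gives the total number of words in the text.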
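The four columns can be sketched as follows. This is only an illustration of the rank and running-total logic, not the program itself; the helper name rank_rows is made up, and the alphabetical tie-break is my addition (the original sorts by frequency only, so the order of ties is unspecified).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a hash of counts into rows of
# [rank, word, frequency, running total], most frequent first.
sub rank_rows {
    my (%count) = @_;
    my @rows;
    my ($rank, $sum) = (0, 0);
    # Descending frequency; ties broken alphabetically (an assumption).
    for my $w (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
        $rank++;
        $sum += $count{$w};              # running total of occurrences
        push @rows, [$rank, $w, $count{$w}, $sum];
    }
    return @rows;
}

# The first three counts from the sample output above.
my @rows = rank_rows('the' => 394, 'of' => 322, 'and' => 156);
print join("\t", 'Rank', 'Word', 'Freq.', 'Sum'), "\n";
print join("\t", @$_), "\n" for @rows;
```

With these three counts the Sum column reads 394, 716, 872, matching the first rows of the sample output.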
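3 Options
-h Print the help page
-c Count characters instead of words
-m NUM Print only words that occur at least NUM times
-M NUM Print only words that occur at most NUM times
-w NUM Print only words at least NUM characters long
-W NUM Print only words at most NUM characters long
-i Ignore case
The options above can be combined.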
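When several filters are combined, a word is printed only if it passes all of them. A minimal sketch of that rule, with a made-up predicate keep_word and made-up option keys (min_freq, max_freq, min_length, max_length); the sample counts are illustrative only. As in the source code, a filter whose value is 0 or unset is treated as "not given".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical predicate: true if a word passes every filter that is set.
sub keep_word {
    my ($word, $freq, %opt) = @_;
    return 0 if $opt{min_freq}   && $freq         < $opt{min_freq};
    return 0 if $opt{max_freq}   && $freq         > $opt{max_freq};
    return 0 if $opt{min_length} && length($word) < $opt{min_length};
    return 0 if $opt{max_length} && length($word) > $opt{max_length};
    return 1;
}

# Illustrative counts, not real data.
my %count = ('the' => 394, 'genome' => 98, 'were' => 69, 'sequenced' => 12);

# Same idea as: word_freq -m 50 -w 5
for my $w (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$w\t$count{$w}\n"
        if keep_word($w, $count{$w}, min_freq => 50, min_length => 5);
}
# prints: genome	98
```

Here "the" and "were" are too short and "sequenced" is too rare, so only "genome" survives.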
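III. Uses
1 Text analysis: analysing the word frequencies of an article.
2 As an aid to reading English papers: I tested the program on one English paper and, ignoring case, it found 1577 distinct words. This suggests that a vocabulary of no more than 2000 words may be enough to read a scientific paper.
3 Computing the GC content of a DNA sequence.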
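For the third use, the idea is to count characters (as -c does, probably together with -i) and divide the number of G and C by the number of all bases. The sketch below shows that arithmetic on its own; gc_content is a made-up helper, and it assumes that only A, C, G and T count as bases, so ambiguity codes such as N are ignored. When running word_freq -c on a FASTA file, keep in mind that header lines would be counted too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: GC content as a fraction of the A/C/G/T characters.
# Returns undef if the sequence contains none of these bases.
sub gc_content {
    my ($seq) = @_;
    my %n;
    $n{ uc($_) }++ for $seq =~ /[ACGTacgt]/g;   # case-insensitive count
    my $total = 0;
    $total += $_ for values %n;
    return unless $total;
    return (($n{G} || 0) + ($n{C} || 0)) / $total;
}

printf "GC content: %.1f%%\n", 100 * gc_content("ATGCGCGTTA");
# prints: GC content: 50.0%
```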
References:
[1] http://book.douban.com/subject/3519360/
[2] http://www.gnu.org/gnu/the-gnu-project.html
[3] http://zh.wikipedia.org/wiki/Perl
[4] http://zh.wikipedia.org/wiki/命令行界面
Source code:
#!/usr/bin/perl
&parse_commands();
if($help){&help();}
#
# Parse input text
#
unless(@input_files){
    # No file arguments: read from standard input
    while(<STDIN>){
        if($character){@txt = $_ =~ /./g;}      # every character except the newline
        else{@txt = $_ =~ /[a-zA-Z]+/g;}        # runs of ASCII letters
        foreach(@txt){
            if($ignore_case){$_ = lc $_;}
            $word{$_}++;
        }
        $total += @txt;
    }
}else{
    # A named loop variable keeps while(<FILE>) from overwriting the file list via $_
    foreach my $file (@input_files){
        open FILE, '<', $file or die "Can't open file $file: $!\n";
        while(<FILE>){
            if($character){@txt = $_ =~ /./g;}
            else{@txt = $_ =~ /[a-zA-Z]+/g;}
            foreach(@txt){
                if($ignore_case){$_ = lc $_;}
                $word{$_}++;
            }
            $total += @txt;
        }
        close FILE;
    }
}
#
# Print title
#
print "Rank\t";
if($character){
print "Char.\t";
}else{
print "Word\t";
}
print "Freq.\t";
print "Sum\n";
#
# Print frequency
#
# Most frequent first; entries that fail any active filter are skipped
foreach(sort{$word{$b} <=> $word{$a}}(keys %word)){
    if($min_freq && $word{$_} < $min_freq){next;}
    if($max_freq && $word{$_} > $max_freq){next;}
    if($min_length && length($_) < $min_length){next;}
    if($max_length && length($_) > $max_length){next;}
    $count++;               # rank among the printed entries
    $sum += $word{$_};      # running total of the printed frequencies
    print "$count\t";
    print "$_\t";
    print "$word{$_}\t";
    print "$sum\n";
}
print "Total ",($character?"characters":"words")," in text: $total\n";
#
# Subroutines
#
sub parse_commands{
    while(@ARGV){
        $_ = shift @ARGV;
        if(-e $_){push @input_files,$_;}    # an existing file name is an input file
        elsif($_ eq '-h'){$help = 1;}
        elsif($_ eq '-c'){$character = 1;}
        elsif($_ eq '-m'){$min_freq = shift @ARGV;}
        elsif($_ eq '-M'){$max_freq = shift @ARGV;}
        elsif($_ eq '-w'){$min_length = shift @ARGV;}
        elsif($_ eq '-W'){$max_length = shift @ARGV;}
        elsif($_ eq '-i'){$ignore_case = 1;}
        else{
            print STDERR "Unrecognized flag: $_\n";
            print STDERR "$0 -h for help\n";
            exit 1;
        }
    }
}
sub help{
    system("clear");
    print "WORD_FREQ(1)        Word Frequency Analysis        WORD_FREQ(1)
NAME
    word_freq - word frequency analysis
SYNOPSIS
    word_freq [OPTION]... [FILE]...
DESCRIPTION
    Count words of text from FILE(s), or standard input, and print the
    frequency of each word or character.
OPTIONS
    -c      Print frequency of characters
    -m NUM  Print words with minimum frequency NUM
    -M NUM  Print words with maximum frequency NUM
    -w NUM  Print words with minimum length NUM
    -W NUM  Print words with maximum length NUM
    -i      Ignore case
    -h      Display this help and exit
    With no FILE, read standard input.
AUTHOR
    Written by Leiting Li <lileiting\@foxmail.com>
COPYRIGHT
    Copyright (c) 2012 Leiting Li. License GPLv3+: GNU
    GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
LEITING LI        February 2012        WORD_FREQ(1)
";
    exit;
}
(Leiting Li, Feb 26, 2012)