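For the past couple of days I have still been wrestling with the accuracy of my classification model. When I classify text excerpted at random from the web, the results are consistently disappointing, nowhere near as good as what cross-validation reports.
So I decided to evaluate on an independent test set (1402 instances). The training set has 9804 instances and 9302 features, and no feature selection was applied. Accuracy comes out at about 78%, and "History" and "Art" are hard to tell apart: 18 Art documents are labeled History and 19 History documents are labeled Art. The results are as follows: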
-------------------------------------------------------------------------
weka.filters.unsupervised.attribute.StringToWordVector in:9804
Number of instances: 9804
Number of attributes: 9302
loading test data in:test_segmented......
weka.filters.unsupervised.attribute.StringToWordVector in:1402
weka.filters.unsupervised.attribute.ReplaceMissingValues in:9804
weka.filters.unsupervised.attribute.Normalize in:9804
evaluating.........
=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.91     0.008    0.901      0.91    0.905      0.993     C11-Space
0.455    0.001    0.938      0.455   0.612      0.928     C15-Energy
0.464    0        1          0.464   0.634      0.974     C16-Electronics
0.556    0.001    0.938      0.556   0.698      0.989     C17-Communication
0.98     0.031    0.705      0.98    0.82       0.985     C19-Computer
0.588    0.003    0.833      0.588   0.69       0.96      C23-Mine
0.78     0.001    0.979      0.78    0.868      0.996     C29-Transport
0.81     0.035    0.638      0.81    0.714      0.974     C3-Art
0.95     0.006    0.922      0.95    0.936      0.994     C31-Enviornment
0.92     0.009    0.885      0.92    0.902      0.99      C32-Agriculture
0.96     0.034    0.686      0.96    0.8        0.979     C34-Economy
0.692    0.004    0.878      0.692   0.774      0.989     C35-Law
0.472    0        1          0.472   0.641      0.98      C36-Medical
0.526    0.002    0.952      0.526   0.678      0.992     C37-Military
0.91     0.048    0.591      0.91    0.717      0.965     C38-Politics
0.97     0.021    0.782      0.97    0.866      0.989     C39-Sports
0.235    0        1          0.235   0.381      0.852     C4-Literature
0.639    0.004    0.886      0.639   0.743      0.974     C5-Education
0.489    0.002    0.88       0.489   0.629      0.891     C6-Philosophy
0.75     0.026    0.688      0.75    0.718      0.963     C7-History
Correctly Classified Instances 1095 78.1027 %
Incorrectly Classified Instances 307 21.8973 %
Kappa statistic 0.7661
Mean absolute error 0.0904
Root mean squared error 0.2092
Relative absolute error 97.1367 %
Root relative squared error 94.8845 %
Total Number of Instances 1402
=== Confusion Matrix ===
  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t   <-- classified as
 91  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 | a = C11-Space
  0 15  0  0  4  4  0  0  2  1  3  0  0  0  2  2  0  0  0  0 | b = C15-Energy
  0  0 13  1  9  0  0  0  0  0  2  0  0  0  0  3  0  0  0  0 | c = C16-Electronics
  1  0  0 15  7  0  0  0  0  0  1  0  0  1  1  1  0  0  0  0 | d = C17-Communication
  2  0  0  0 98  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 | e = C19-Computer
  0  0  0  0  7 20  0  0  2  0  2  0  0  0  2  1  0  0  0  0 | f = C23-Mine
  0  1  0  0  1  0 46  0  0  0  5  2  0  0  3  1  0  0  0  0 | g = C29-Transport
  0  0  0  0  0  0  0 81  0  0  1  0  0  0  0  0  0  0  0 18 | h = C3-Art
  0  0  0  0  1  0  0  0 95  4  0  0  0  0  0  0  0  0  0  0 | i = C31-Enviornment
  0  0  0  0  0  0  0  0  0 92  7  0  0  0  0  0  0  0  0  1 | j = C32-Agriculture
  0  0  0  0  0  0  0  0  0  1 96  0  0  0  2  0  0  0  0  1 | k = C34-Economy
  0  0  0  0  0  0  1  0  0  1  5 36  0  1  8  0  0  0  0  0 | l = C35-Law
  0  0  0  0  0  0  0  2  0  4  8  1 25  0  7  4  0  2  0  0 | m = C36-Medical
  4  0  0  0  0  0  0  0  1  0  1  1  0 40 24  3  0  1  0  1 | n = C37-Military
  0  0  0  0  0  0  0  0  0  0  3  0  0  0 91  0  0  0  0  6 | o = C38-Politics
  0  0  0  0  0  0  0  1  1  0  0  0  0  0  0 97  0  0  0  1 | p = C39-Sports
  0  0  0  0  1  0  0 13  0  0  1  0  0  0  3  2  8  0  2  4 | q = C4-Literature
  0  0  0  0  0  0  0  3  1  1  1  1  0  0  6  9  0 39  0  0 | r = C5-Education
  3  0  0  0  2  0  0  8  1  0  1  0  0  0  4  0  0  2 22  2 | s = C6-Philosophy
  0  0  0  0  0  0  0 19  0  0  3  0  0  0  1  1  0  0  1 75 | t = C7-History
-------------------------------------------------------------------------
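As a sanity check, the headline figures can be recomputed from the confusion matrix alone. The following is a minimal plain-Java sketch, not Weka code: the class name `ConfusionStats` and its helper methods are invented for illustration, the matrix is copied by hand from the output above, and Kappa is computed with the usual Cohen's kappa formula, which is assumed to be what Weka reports.

```java
// Illustrative only: ConfusionStats and its methods are hypothetical helpers, not Weka API.
// Rows are actual classes, columns are predicted classes, in the order a..t printed above.
class ConfusionStats {
    static final int[][] MATRIX = {
        {91, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
        {0, 15, 0, 0, 4, 4, 0, 0, 2, 1, 3, 0, 0, 0, 2, 2, 0, 0, 0, 0},
        {0, 0, 13, 1, 9, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 0},
        {1, 0, 0, 15, 7, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0},
        {2, 0, 0, 0, 98, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
        {0, 0, 0, 0, 7, 20, 0, 0, 2, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0},
        {0, 1, 0, 0, 1, 0, 46, 0, 0, 0, 5, 2, 0, 0, 3, 1, 0, 0, 0, 0},
        {0, 0, 0, 0, 0, 0, 0, 81, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 18},
        {0, 0, 0, 0, 1, 0, 0, 0, 95, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
        {0, 0, 0, 0, 0, 0, 0, 0, 0, 92, 7, 0, 0, 0, 0, 0, 0, 0, 0, 1},
        {0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 96, 0, 0, 0, 2, 0, 0, 0, 0, 1},
        {0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 5, 36, 0, 1, 8, 0, 0, 0, 0, 0},
        {0, 0, 0, 0, 0, 0, 0, 2, 0, 4, 8, 1, 25, 0, 7, 4, 0, 2, 0, 0},
        {4, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 40, 24, 3, 0, 1, 0, 1},
        {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 91, 0, 0, 0, 0, 6},
        {0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 97, 0, 0, 0, 1},
        {0, 0, 0, 0, 1, 0, 0, 13, 0, 0, 1, 0, 0, 0, 3, 2, 8, 0, 2, 4},
        {0, 0, 0, 0, 0, 0, 0, 3, 1, 1, 1, 1, 0, 0, 6, 9, 0, 39, 0, 0},
        {3, 0, 0, 0, 2, 0, 0, 8, 1, 0, 1, 0, 0, 0, 4, 0, 0, 2, 22, 2},
        {0, 0, 0, 0, 0, 0, 0, 19, 0, 0, 3, 0, 0, 0, 1, 1, 0, 0, 1, 75},
    };
    static final int ART = 7;      // row/column h = C3-Art
    static final int HISTORY = 19; // row/column t = C7-History

    // Number of instances whose actual class is c.
    static int rowSum(int[][] m, int c) {
        int sum = 0;
        for (int j = 0; j < m[c].length; j++) sum += m[c][j];
        return sum;
    }

    // Number of instances predicted as class c.
    static int colSum(int[][] m, int c) {
        int sum = 0;
        for (int[] row : m) sum += row[c];
        return sum;
    }

    static int total(int[][] m) {
        int sum = 0;
        for (int c = 0; c < m.length; c++) sum += rowSum(m, c);
        return sum;
    }

    // Correctly classified instances are the diagonal entries.
    static int correct(int[][] m) {
        int sum = 0;
        for (int c = 0; c < m.length; c++) sum += m[c][c];
        return sum;
    }

    static double recall(int[][] m, int c) { return (double) m[c][c] / rowSum(m, c); }

    static double precision(int[][] m, int c) { return (double) m[c][c] / colSum(m, c); }

    // Cohen's kappa: observed agreement corrected for chance agreement.
    static double kappa(int[][] m) {
        double n = total(m);
        double observed = correct(m) / n;
        double expected = 0.0;
        for (int c = 0; c < m.length; c++) {
            expected += (rowSum(m, c) / n) * (colSum(m, c) / n);
        }
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        System.out.printf("correct / total  = %d / %d%n", correct(MATRIX), total(MATRIX));
        System.out.printf("kappa            = %.4f%n", kappa(MATRIX));
        System.out.printf("Art precision    = %.3f, recall = %.3f%n",
                precision(MATRIX, ART), recall(MATRIX, ART));
        System.out.printf("Art -> History   = %d, History -> Art = %d%n",
                MATRIX[ART][HISTORY], MATRIX[HISTORY][ART]);
    }
}
```

The low precision for C3-Art (81 of the 127 documents predicted as Art) comes mostly from the 19 History and 13 Literature documents that land in column h, which is the History/Art confusion mentioned above.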
Main code from the source file:
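import java.io.File;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.stemmers.NullStemmer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// The statements below belong inside a method that declares "throws Exception".

// Load the training documents; each subdirectory name becomes a class label.
String traindatadir = "train_segmented";
TextDirectoryLoader loader = new TextDirectoryLoader();
loader.setDirectory(new File(traindatadir));
Instances dataRaw = loader.getDataSet();

// Turn the raw text into word-vector attributes (no stemming).
StringToWordVector filter = new StringToWordVector();
filter.setStemmer(new NullStemmer());
filter.setInputFormat(dataRaw);
System.out.println("\n\nfiltering data in:" + traindatadir + "......\n\n");
Instances dataFiltered = Filter.useFilter(dataRaw, filter);
System.out.println("Number of instances: " + dataFiltered.numInstances());
System.out.println("Number of attributes: " + dataFiltered.numAttributes());

// Load the independent test documents with the same loader.
String testdatadir = "test_segmented";
System.out.println("\n\nloading test data in:" + testdatadir + "......\n\n");
loader.setDirectory(new File(testdatadir));
Instances testRaw = loader.getDataSet();
// The filter was just applied to the training set, so it filters testRaw
// using the training set's structure (same dictionary, same attributes).
Instances testFiltered = Filter.useFilter(testRaw, filter);

// Train an SMO classifier on the filtered training data.
SMO classifier = new SMO();
classifier.buildClassifier(dataFiltered);

System.out.println("evaluating.........");
Evaluation eval = new Evaluation(dataFiltered);
eval.evaluateModel(classifier, testFiltered); // evaluate on the independent test set
System.out.println(eval.toClassDetailsString());
System.out.println(eval.toSummaryString());
System.out.println(eval.toMatrixString());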
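What I would like to know now is whether the filter that was just fitted on the training set can be saved, so that next time it can be reused to filter and classify a single text.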
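One possible direction, sketched in plain Java rather than with Weka itself: Weka's filter classes implement `java.io.Serializable`, so a fitted filter can in principle be written to disk and read back later. The block below only illustrates that idea with a hypothetical stand-in, `ToyWordVectorizer`, which is not a Weka class; like the filter above, it learns a dictionary from segmented training text and then maps any new text onto that same dictionary.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for a fitted word-vector filter; NOT part of Weka.
class ToyWordVectorizer implements Serializable {
    private static final long serialVersionUID = 1L;

    // Dictionary learned from the training documents, kept in sorted order.
    private final ArrayList<String> vocabulary = new ArrayList<>();

    // Learns the dictionary from whitespace-segmented training documents.
    void fit(List<String> trainingDocs) {
        TreeSet<String> words = new TreeSet<>();
        for (String doc : trainingDocs) {
            for (String word : doc.trim().split("\\s+")) {
                if (!word.isEmpty()) words.add(word);
            }
        }
        vocabulary.clear();
        vocabulary.addAll(words);
    }

    // Counts dictionary words in one document; words unseen in training are ignored.
    int[] transform(String doc) {
        int[] counts = new int[vocabulary.size()];
        for (String word : doc.trim().split("\\s+")) {
            int index = vocabulary.indexOf(word);
            if (index >= 0) counts[index]++;
        }
        return counts;
    }

    // Writes the fitted object to disk with standard Java serialization.
    void save(File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a previously saved object back, dictionary included.
    static ToyWordVectorizer load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (ToyWordVectorizer) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}

class SaveFilterDemo {
    public static void main(String[] args) throws IOException {
        ToyWordVectorizer fitted = new ToyWordVectorizer();
        fitted.fit(Arrays.asList("space rocket launch", "art painting museum"));

        File file = File.createTempFile("toy-vectorizer", ".bin");
        file.deleteOnExit();
        fitted.save(file);

        // Later, possibly in another program run: restore and apply to one new text.
        ToyWordVectorizer restored = ToyWordVectorizer.load(file);
        // prints [1, 0, 0, 0, 2, 0] ("poem" is not in the dictionary and is dropped)
        System.out.println(Arrays.toString(restored.transform("rocket rocket poem art")));
    }
}
```

In Weka, the analogous step would presumably be to serialize the fitted `StringToWordVector` together with the trained `SMO` model, for example with `weka.core.SerializationHelper.write` and `read`, or to wrap both in a `weka.classifiers.meta.FilteredClassifier` and save that single object. This has not been tested against the setup above.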
https://blog.sciencenet.cn/blog-713110-574111.html