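All earlier experiments were run on small training sets. With a larger training set, MyEclipse kept failing with a Java heap space error. Editing the .ini file did not help, presumably because that file configures the JVM that runs the IDE itself rather than the separate JVM launched for the program.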
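What eventually worked was setting the VM arguments field (on the Arguments tab of the MyEclipse run configuration) to -Xms500m -Xmx1024m; the error no longer appears. The results of the run follow (the classifier is NaiveBayes). They are not very satisfactory (68.6% correctly classified) and need improvement. Also, the Detailed Accuracy By Class figures and the Confusion Matrix seem inconsistent with each other, by a wide margin. Why?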
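On the heap-space fix described above: to confirm that the VM arguments from the run configuration actually reach the program, a small check like the following can be run with the same launch configuration. This is an illustrative sketch and not part of the original experiment; the class name `HeapCheck` and the method `maxHeapMiB` are hypothetical. It relies only on the standard `Runtime.maxMemory()` and `RuntimeMXBean.getInputArguments()` calls.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class HeapCheck {
    // Maximum heap the JVM will try to use, in MiB (roughly the -Xmx value).
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM, e.g. -Xms500m -Xmx1024m
        // when they are set in the run configuration.
        List<String> vmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("VM arguments: " + vmArgs);
        System.out.println("Max heap (MiB): " + maxHeapMiB());
    }
}
```

If the printed arguments do not include the -Xmx setting, the program is probably being launched with a different configuration. The reported maximum may differ slightly from the -Xmx value depending on the JVM and garbage collector.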
weka.filters.unsupervised.attribute.StringToWordVector in:9804
Number of instances: 9804
Number of attributes: 9302

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.794    0.015    0.788      0.794   0.791      0.966     C11-Space
0.5      0.002    0.444      0.5     0.471      0.975     C15-Energy
0.704    0.005    0.264      0.704   0.384      0.981     C16-Electronics
0.68     0.003    0.395      0.68    0.5        0.99      C17-Communication
0.858    0.007    0.953      0.858   0.903      0.982     C19-Computer
0.545    0.006    0.234      0.545   0.327      0.945     C23-Mine
0.86     0.015    0.251      0.86    0.389      0.98      C29-Transport
0.726    0.015    0.8        0.726   0.761      0.951     C3-Art
0.72     0.026    0.799      0.72    0.757      0.946     C31-Enviornment
0.614    0.014    0.838      0.614   0.709      0.945     C32-Agriculture
0.762    0.044    0.773      0.762   0.767      0.924     C34-Economy
0.647    0.003    0.508      0.647   0.569      0.983     C35-Law
0.843    0.019    0.189      0.843   0.309      0.985     C36-Medical
0.892    0.064    0.096      0.892   0.173      0.969     C37-Military
0.432    0.016    0.754      0.432   0.549      0.832     C38-Politics
0.591    0.016    0.84       0.591   0.694      0.911     C39-Sports
0.636    0.008    0.212      0.636   0.318      0.971     C4-Literature
0.644    0.017    0.189      0.644   0.292      0.931     C5-Education
0.568    0.002    0.532      0.568   0.549      0.95      C6-Philosophy
0.584    0.038    0.436      0.584   0.499      0.908     C7-History
Correctly Classified Instances        6730    68.6455 %
Incorrectly Classified Instances      3074    31.3545 %
Kappa statistic                       0.6529
Mean absolute error                   0.0313
Root mean squared error               0.1759
Relative absolute error               35.2848 %
Root relative squared error           83.4657 %
Total Number of Instances             9804
=== Confusion Matrix ===
    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t   <-- classified as
  508    1    3    5   31    2   12    0    2    1   11    0   16   30    0   17    0    0    1    0 | a = C11-Space
    1   16    0    1    0    3    3    0    1    0    0    0    2    4    0    0    0    1    0    0 | b = C15-Energy
    2    0   19    1    0    1    2    0    0    0    0    0    1    0    0    1    0    0    0    0 | c = C16-Electronics
    0    0    3   17    0    0    2    0    0    0    0    0    1    1    0    0    0    1    0    0 | d = C17-Communication
   59    0   33    9 1164    0    7    3    4    0   10    3   16   33    0    8    3    3    0    2 | e = C19-Computer
    0    1    0    0    0   18    6    0    1    2    0    0    1    3    0    0    0    1    0    0 | f = C23-Mine
    1    1    0    0    0    1   49    0    0    1    0    0    1    0    1    1    0    1    0    0 | g = C29-Transport
    0    0    0    0    0    1    1  537    0    0    3    0   12   27   12   18   44   16    1   68 | h = C3-Art
   43   14    5    0   17    9   21    0  876   34   63    7   60   20    0   30    2   16    0    0 | i = C31-Enviornment
    2    0    2    2    0   14   25    0  165  627  112    4   22    4    1    9    1   20    0   11 | j = C32-Agriculture
    1    0    4    6    2   25   41    3    3   78 1219   12   13   30   74   29    0   15    1   44 | k = C34-Economy
    0    0    0    0    0    0    0    0    0    0    0   33    0   14    1    0    0    3    0    0 | l = C35-Law
    0    0    0    0    0    0    0    0    0    0    0    0   43    4    0    1    0    3    0    0 | m = C36-Medical
    0    1    0    0    0    0    0    0    0    0    0    0    0   66    0    1    0    5    0    1 | n = C37-Military
    0    1    1    0    0    1    1    4    0    0   74    5    1  299  442    7    4   16   14  154 | o = C38-Politics
   28    1    1    2    8    1   16   35   45    2   64    1   26  138   28  740    4   41    2   70 | p = C39-Sports
    0    0    1    0    0    0    1    2    0    0    0    0    0    2    0    1   21    3    0    2 | q = C4-Literature
    0    0    0    0    0    0    1    0    0    0    1    0    3    5    2    8    1   38    0    0 | r = C5-Education
    0    0    0    0    0    0    0    1    0    0    1    0    0    0    4    1    1   11   25    0 | s = C6-Philosophy
    0    0    0    0    0    1    7   86    0    3   19    0    9   11   21    9   18    7    3  272 | t = C7-History
Since the pasted rows and columns may not line up, a screenshot of the same output was attached to the post.
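On the question of whether Detailed Accuracy By Class and the Confusion Matrix agree, one way to check is to recompute the per-class figures directly from the matrix. The sketch below is not from the original experiment; it assumes the usual Weka reading in which each row is an actual class and each column a predicted class, and the names `ConfusionMatrixCheck`, `recall`, `precision`, `accuracy` and `kappa` are made up for illustration. Under that reading, class a gives 508/640 ≈ 0.794, which matches the reported TP rate for C11-Space, and the diagonal sums to 6730 of 9804, so the two blocks of output appear to agree. Reading a class by its column instead of its row gives precision rather than TP rate, which might account for the apparent gap.

```java
import java.util.Locale;

public class ConfusionMatrixCheck {
    // Confusion matrix copied from the Weka output.
    // Assumed convention: row = actual class, column = predicted class.
    public static final int[][] MATRIX = {
        {508, 1, 3, 5, 31, 2, 12, 0, 2, 1, 11, 0, 16, 30, 0, 17, 0, 0, 1, 0},
        {1, 16, 0, 1, 0, 3, 3, 0, 1, 0, 0, 0, 2, 4, 0, 0, 0, 1, 0, 0},
        {2, 0, 19, 1, 0, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0},
        {0, 0, 3, 17, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0},
        {59, 0, 33, 9, 1164, 0, 7, 3, 4, 0, 10, 3, 16, 33, 0, 8, 3, 3, 0, 2},
        {0, 1, 0, 0, 0, 18, 6, 0, 1, 2, 0, 0, 1, 3, 0, 0, 0, 1, 0, 0},
        {1, 1, 0, 0, 0, 1, 49, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0},
        {0, 0, 0, 0, 0, 1, 1, 537, 0, 0, 3, 0, 12, 27, 12, 18, 44, 16, 1, 68},
        {43, 14, 5, 0, 17, 9, 21, 0, 876, 34, 63, 7, 60, 20, 0, 30, 2, 16, 0, 0},
        {2, 0, 2, 2, 0, 14, 25, 0, 165, 627, 112, 4, 22, 4, 1, 9, 1, 20, 0, 11},
        {1, 0, 4, 6, 2, 25, 41, 3, 3, 78, 1219, 12, 13, 30, 74, 29, 0, 15, 1, 44},
        {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 33, 0, 14, 1, 0, 0, 3, 0, 0},
        {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 43, 4, 0, 1, 0, 3, 0, 0},
        {0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 66, 0, 1, 0, 5, 0, 1},
        {0, 1, 1, 0, 0, 1, 1, 4, 0, 0, 74, 5, 1, 299, 442, 7, 4, 16, 14, 154},
        {28, 1, 1, 2, 8, 1, 16, 35, 45, 2, 64, 1, 26, 138, 28, 740, 4, 41, 2, 70},
        {0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 2, 0, 1, 21, 3, 0, 2},
        {0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 3, 5, 2, 8, 1, 38, 0, 0},
        {0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 4, 1, 1, 11, 25, 0},
        {0, 0, 0, 0, 0, 1, 7, 86, 0, 3, 19, 0, 9, 11, 21, 9, 18, 7, 3, 272}
    };

    // Number of instances whose actual class is r.
    public static int rowSum(int[][] m, int r) {
        int s = 0;
        for (int v : m[r]) s += v;
        return s;
    }

    // Number of instances predicted as class c.
    public static int colSum(int[][] m, int c) {
        int s = 0;
        for (int[] row : m) s += row[c];
        return s;
    }

    public static int total(int[][] m) {
        int s = 0;
        for (int r = 0; r < m.length; r++) s += rowSum(m, r);
        return s;
    }

    // Correctly classified instances: the sum of the diagonal.
    public static int correct(int[][] m) {
        int s = 0;
        for (int i = 0; i < m.length; i++) s += m[i][i];
        return s;
    }

    // TP rate (recall) of class c: diagonal cell divided by its row sum.
    public static double recall(int[][] m, int c) {
        return (double) m[c][c] / rowSum(m, c);
    }

    // Precision of class c: diagonal cell divided by its column sum.
    public static double precision(int[][] m, int c) {
        return (double) m[c][c] / colSum(m, c);
    }

    public static double accuracy(int[][] m) {
        return (double) correct(m) / total(m);
    }

    // Cohen's kappa: observed agreement corrected for chance agreement.
    public static double kappa(int[][] m) {
        double n = total(m);
        double expected = 0.0;
        for (int i = 0; i < m.length; i++) {
            expected += (rowSum(m, i) / n) * (colSum(m, i) / n);
        }
        double observed = accuracy(m);
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        for (int c = 0; c < MATRIX.length; c++) {
            System.out.println(String.format(Locale.ROOT, "%c recall=%.3f precision=%.3f",
                    (char) ('a' + c), recall(MATRIX, c), precision(MATRIX, c)));
        }
        System.out.println(String.format(Locale.ROOT, "accuracy=%.4f kappa=%.4f",
                accuracy(MATRIX), kappa(MATRIX)));
    }
}
```

The same calculation also shows where the low precision values come from: column n (C37-Military) contains 691 predictions but only 66 of them are correct, which gives roughly the 0.096 listed in the table.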
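To repost this article, please contact the original author for permission and state that it comes from Li Xiangdong's ScienceNet blog. Link: https://blog.sciencenet.cn/blog-713110-571650.html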