Summary of all lexicons used in our Chinese system:
(1) segmentation-only dictionary: 184k;
(2) feature lexicon: 59k;
(3) location names: 3k;
(4) person names: 64k;
(5) product/brand names: 21k;
(6) company names: 1.7k;
(7) other names: 2k;
(8) idioms: 52k.
Total:
387k entries participate in segmentation, and
203k of them carry lexical features to support parsing and sentiment analysis.
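As a quick check on the arithmetic, the two totals can be reproduced from the per-lexicon counts. This is only a sketch: the dictionary keys are illustrative labels rather than names from the actual system, and it assumes that every lexicon except the segmentation-only dictionary carries lexical features, which is what the 203k figure implies.

```python
# Entry counts in thousands, copied from the list above.
# The keys are illustrative labels, not identifiers from the real system.
lexicon_sizes_k = {
    "segmentation_only": 184,
    "feature_lexicon": 59,
    "location_names": 3,
    "person_names": 64,
    "product_brand_names": 21,
    "company_names": 1.7,
    "other_names": 2,
    "idioms": 52,
}

# All eight lexicons take part in segmentation.
segmentation_total_k = sum(lexicon_sizes_k.values())

# Assumption: everything except the segmentation-only dictionary has features.
featured_total_k = segmentation_total_k - lexicon_sizes_k["segmentation_only"]

print(f"participating in segmentation: about {round(segmentation_total_k)}k")
print(f"with lexical features: about {round(featured_total_k)}k")
```

Both figures are rounded to the nearest thousand, since the company-name count (1.7k) is the only one given to a finer precision.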
In fact, we also have two more lexicons, for handling social media jargon and Cantonese-only vocabulary. These lexicons are used in the pre-processing stage, not in segmentation or parsing: