The third one is more to the point. Strictly speaking, it does not count as a myth; it is more like a universally applicable piece of "superfluous remarks" (a truism).
Frankly, the first two read like clickbait headlines to me. Most of the "supporting evidence" is wrong.
Well, I think I know what you were trying to say, but I believe it will come across as misleading to most people.
No, I was not being misleading; this is deliberate overcorrection, overstating the case on purpose.
At least I think you should explain a bit more, and pick your examples more carefully.
Take one example: tokenizing "People's Republic of China" is routinely done with regular expressions (rule-based), relying on capitalization, the apostrophe, and the preposition (symbolic evidence), NOT on a dictionary.
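For concreteness, a rough sketch of such a rule (the pattern below is illustrative only, not any particular production tokenizer): group capitalized words, possessive apostrophes, and a couple of function words into one token, with no dictionary involved.

```python
# Rule-based grouping of multi-word names by regular expression,
# using only capitalization, the apostrophe, and a few function words.
# Illustrative pattern only.
import re

NAME = re.compile(
    r"[A-Z][a-z]+(?:'s)?"              # a capitalized word, optionally possessive
    r"(?:\s+(?:(?:of|the|and)\s+)?"    # glue: whitespace plus an optional function word
    r"[A-Z][a-z]+(?:'s)?)*"            # more capitalized words
)

text = "The People's Republic of China opened an office in New York."
print(NAME.findall(text))
# ["The People's Republic of China", 'New York']
```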
That is not the point. Yes, perhaps I should have chosen a non-name example ("interest rate" / 利率 is a better example for both Chinese and English), but the point is that closed compounds can (and should) be looked up in a lexicon rather than handled by rules.
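A minimal sketch of what I mean by lexicon lookup (toy lexicon; the entries and the longest-match strategy are illustrative only): the closed compound comes out as one lexical unit with its features attached, with no rule involved.

```python
# Toy lexicon lookup for closed compounds such as "interest rate" / 利率:
# the longest matching entry wins, so the compound surfaces as one
# lexical unit carrying its features. Entries are illustrative only.
LEXICON = {
    "interest rate": {"cat": "N", "domain": "finance"},
    "interest":      {"cat": "N"},
    "rate":          {"cat": "N"},
    "利率":          {"cat": "N", "domain": "finance"},
}

def lookup_longest(words, i, max_len=3):
    """Return (unit, features, span) for the longest entry starting at i."""
    for n in range(min(max_len, len(words) - i), 0, -1):
        candidate = " ".join(words[i:i + n])
        if candidate in LEXICON:
            return candidate, LEXICON[candidate], n
    return words[i], {}, 1          # not in lexicon: pass through as-is

tokens = "the interest rate rose".split()
i = 0
while i < len(tokens):
    unit, feats, span = lookup_longest(tokens, i)
    print(unit, feats)
    i += span
# the {}
# interest rate {'cat': 'N', 'domain': 'finance'}
# rose {}
```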
What you are referring to, I guess, is named entity recognition. Even there, Chinese and English can differ significantly.
No, I was not talking about NE; that is a special topic in itself. I consider it a low-level, solved problem and do not plan to reinvent the wheel: I will simply pick an off-the-shelf API for NE and tolerate its imperfections.
I wouldn't be surprised if you skip tokenization, since it can well be folded into overall parsing. But for applications like Baidu search, tokenization is the end of text processing and is a must-have.
Chunking words into phrases (syntax) is by nature no different from chunking morphemes (characters) into words (morphology). Parsing without a separate "word segmentation" step is therefore possible.
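A toy illustration of that claim (assuming NLTK is installed; the grammar is hand-made for this example): one and the same grammar chunks characters into words and words into phrases, so the parser takes a raw character stream and no separate segmentation step exists.

```python
# One grammar covers morphology (characters -> words) and syntax
# (words -> phrases), so parsing starts from raw characters with no
# separate word-segmentation step. Assumes NLTK; toy grammar only.
import nltk

grammar = nltk.CFG.fromstring("""
    NP -> N N
    N  -> '银' '行'
    N  -> '利' '率'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(list("银行利率")):    # "bank" + "interest rate"
    print(tree)
# (NP (N 银 行) (N 利 率))
```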
In existing applications like search engines, no big players are using parsing and deep NLP yet (they will; it is only a matter of time), so lexical features from large lexicons may not be necessary, and they may prefer lightweight tokenization without lexicons. That is a different case from the one I am addressing here: the NLP discussed in my post series assumes the development of a parser as its core.
Your attack on tagging is also misleading. You basically say that if a word has two categories, you just tag it with both, without further processing. That is already tagging.
That is not (POS) tagging in the traditional sense: tagging in the traditional sense is deterministic and relies on context. Assigning lexical features by lexicon lookup is not tagging in that sense. If you want to change the definition, then that is off topic.
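To make the distinction concrete, a toy sketch (entries are illustrative only): lookup attaches every candidate category to the token and decides nothing by context, which is why I do not call it tagging.

```python
# Lexical feature assignment by lookup: all candidate categories stay on
# the token; no context-based decision is made at this step.
# Entries are illustrative only.
LEXICON = {
    "学习": {"cats": ["V", "N"], "logical_cat": "V"},   # study / learning
    "rate": {"cats": ["N", "V"]},
}

def assign_features(token):
    # Non-deterministic: return the full set of candidate categories.
    return LEXICON.get(token, {"cats": ["?"]})

for tok in ["学习", "rate", "foobar"]:
    print(tok, assign_features(tok))
# 学习 {'cats': ['V', 'N'], 'logical_cat': 'V'}
# rate {'cats': ['N', 'V']}
# foobar {'cats': ['?']}
```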
What others do is merely one step further, saying tag-a is correct 90% of the time while tag-b has a 10% chance. I built a rule-based parser before, and I found that really helpful (at least in terms of speed): I try the high-probability tag first; if it makes sense, I take it; if not, I come back and try the other. Let me know if you don't do something like that.
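Roughly, something like the following (the probabilities and the parse check below are placeholders for illustration): try the most likely tag first and fall back to the alternative only if the parse fails.

```python
# "High-probability tag first, backtrack on failure": the strategy
# described above. Probabilities and the parse check are placeholders.
TAG_PROBS = {"interest": [("N", 0.9), ("V", 0.1)]}   # hypothetical figures

def parse_succeeds(word, tag):
    # Stand-in for a real rule-based parse attempt with this tag.
    return tag == "N"

def choose_tag(word):
    # Try candidate tags in decreasing order of probability and keep the
    # first one under which the parse goes through.
    for tag, _prob in sorted(TAG_PROBS.get(word, []), key=lambda t: -t[1]):
        if parse_succeeds(word, tag):
            return tag
    return None                      # no tag led to a successful parse

print(choose_tag("interest"))        # N
```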
Parsing can go a long way without context-based POS tagging. But note that at the end I proposed a "one-and-a-half-step" approach, i.e. I can do limited, simple context-based tagging for convenience's sake. The later development is adaptive and in principle does not rely on tagging.
Note that here I am not talking about 兼语词 (pivot words), which are essentially another unique tag with their own properties. I know this is not 100% accurate, but I see it in Chinese as something like the gerund (动名词) in English.
In fact, I do not see it as a 兼语词 (pivot word) either, but for the sake of explaining the phenomenon I used that term (it is logically equivalent, though elaborating on that would take too much space). In my actual system, 学习 ("study") is a verb, and only a verb (or a logical verb).
Then this touches on grammar theory. While we may not really need a new theory, we do need a working theory that is internally consistent. You may have a good one in mind, but that is not the case for most people. For example, I can see you are deeply influenced by head words (中心词) and dependency grammar, but not everyone is even aware of these notions, let alone agrees with them. So far there is no serious competition, as there is really no large-scale success story yet. We will have to wait and see which school of thought eventually casts the bigger shadow.
Good to be criticized. But I had a point to make there.