2010-10-28

On the grammars of natural languages

Can a natural language ever be perfectly interpreted using a formal grammar?

Conventional wisdom in language analysis says it is not possible: natural language is simply too complex for a computer to handle. But such categorical statements have been refuted before - not long ago, experts said a computer could never learn to play chess well enough to challenge a human. There is clearly a quantitative difference in complexity, but is there a qualitative one?

When we try to analyse a text according to a formal grammar, several things can go wrong. One is that the grammar simply does not cover the particular sentence structure used. Another is that the syntax is itself ambiguous, and yet another is that the semantics is ambiguous.
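The syntactic kind of ambiguity can be made concrete with a toy example. The grammar, lexicon, and sentence below are all invented for this sketch (they are not from the post): a tiny grammar in Chomsky normal form where prepositional-phrase attachment gives one sentence two distinct parse trees, counted with a CKY-style chart.

```python
from collections import defaultdict

# Invented toy grammar in Chomsky normal form; PP attachment is ambiguous.
RULES = {
    ("NP", "VP"): "S",
    ("V", "NP"): "VP",
    ("VP", "PP"): "VP",   # "saw [the man] [with the telescope]"
    ("NP", "PP"): "NP",   # "[the man with the telescope]"
    ("P", "NP"): "PP",
    ("Det", "N"): "NP",
}
LEXICON = {
    "I": "NP", "saw": "V", "the": "Det",
    "man": "N", "telescope": "N", "with": "P",
}

def parse_counts(tokens):
    """CKY-style chart counting parse trees per nonterminal and span."""
    n = len(tokens)
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        chart[(i, i + 1)][LEXICON[word]] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left, lc in chart[(i, k)].items():
                    for right, rc in chart[(k, j)].items():
                        parent = RULES.get((left, right))
                        if parent:
                            chart[(i, j)][parent] += lc * rc
    return dict(chart[(0, n)])

# Two readings: the telescope is the instrument of seeing, or a
# property of the man - so the chart finds two parse trees for S.
print(parse_counts("I saw the man with the telescope".split()))
```

A formal grammar alone cannot choose between the two trees; that choice is exactly what the surrounding discussion is about.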

A formal grammar clearly cannot reliably interpret a sentence that is inherently ambiguous. This is exactly why it helps to simplify the language, removing both syntactic and semantic ambiguities. We could of course start with purely natural language and try to hunt down those ambiguities, but a much more straightforward method is to start from the other direction. We assume that the semantics is unambiguous, and create a grammar which is also unambiguous. Then we can see how much it resembles natural language, and we can try to gradually extend and modify the grammar to become more and more like a specific natural language. In principle, we will eventually approach the formal unambiguous language which is closest to a natural language - the natural language maximally clarified.

If we start with the assumption that we are using completely natural language, we are severely constrained in what we can do. A single word with ambiguous meaning would be enough, in principle, to derail the interpretation. We can reach some success percentage, but that is very unreliable. If we were able to interpret, for example, 90% of sentences, what would happen if the user starts writing sentences twice as long? At best, each half succeeds with the same probability as before, making the new success rate 0.9 × 0.9 = 81%. More likely, the error rate will be even higher, since each word can be linked to twice as many potential heads.
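The arithmetic above can be written out as a small hedged model. It assumes that parse failures in the parts of a sentence are independent, which, as noted, is optimistic:

```python
# Sketch of the argument above: if a sentence of baseline length parses
# correctly with probability p, a sentence length_factor times as long
# consists of that many parts, and under the (optimistic) independence
# assumption the whole sentence succeeds only if every part does.
def success_rate(p, length_factor):
    return p ** length_factor

print(success_rate(0.90, 2))  # doubling sentence length: about 0.81
print(success_rate(0.90, 4))  # quadrupling: about 0.66
```

The point is that the success rate decays exponentially with sentence length, so a percentage measured on short sentences says little about longer ones.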

If, on the other hand, we are allowed to limit the language ever so slightly, we can make reliable statements about how certain the interpretation will be. We can state exactly which constructions the interpreter can and cannot handle, and at least in some situations we can identify the unclear cases automatically.
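What such a guarantee might look like can be sketched with a deliberately limited invented language (the vocabulary and sentence form here are hypothetical, not from the post): the interpreter handles exactly one construction, and anything outside it is flagged as unclear rather than guessed at.

```python
# Hypothetical controlled language: only Subject-Verb-Object sentences
# over a fixed vocabulary are accepted; everything else is reported as
# unclear instead of being interpreted unreliably.
NOUNS = {"alice", "bob", "ball"}
VERBS = {"sees", "throws"}

def interpret(sentence):
    words = sentence.lower().rstrip(".").split()
    if (len(words) == 3
            and words[0] in NOUNS
            and words[1] in VERBS
            and words[2] in NOUNS):
        return ("svo", words[0], words[1], words[2])
    return None  # the unclear case, identified automatically

print(interpret("Alice sees Bob."))          # accepted
print(interpret("Alice sees Bob vaguely."))  # flagged as unclear
```

The coverage statement is exact: the interpreter handles precisely the Subject-Verb-Object sentences over its vocabulary, and every rejection is an automatically identified unclear case.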
