2010-10-09

The structure of the language

There have been many attempts to formulate rules for natural languages. Some are more successful than others, but none can describe the whole complexity of a language as it is really spoken. That, however, is the point here: I want to find rules that are nearly true for natural language, and then modify the language so that it follows the rules exactly. It has been necessary to use much simpler rules than the most advanced descriptive attempts, but the goal is in principle to make something that looks similar to natural language - not, however, to any particular natural language, but to natural language in general. Making it similar to a specific language is possibly a different task entirely. It is important to remember this, as I will be using English words in the examples, but the syntax and semantics need not be similar to English.

A typical such rule which almost but not quite works for a lot of natural language expressions is the defining rule of regular grammars. This seems like a good place to start. But a regular language is not necessarily unambiguous. The question of ambiguity is not trivial, because sometimes an ambiguity in the syntax does not lead to ambiguity in the interpretation. Whereas "the cat eats the mouse" is distinctly different from "the mouse eats the cat", one could argue that "the cat meets the mouse" is roughly the same thing as "the mouse meets the cat". So if we had a language which was ambiguous in that it did not distinguish between subject and object, we would get problems with the first verb, but not the second.
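The point about symmetric verbs can be sketched in a few lines of code. This is a hypothetical illustration, not part of any actual grammar: it models a verb's meaning as a relation between two actors, and shows that an ambiguous subject/object assignment yields two readings for "eats" but only one for "meets".

```python
# Hypothetical sketch: a verb's meaning as a relation between two actors.
# For a symmetric relation like "meets", swapping subject and object
# changes nothing; for "eats" it changes everything.

def eats(subject, obj):
    return (subject, "eats", obj)

def meets(subject, obj):
    # "meets" is symmetric: normalise the pair so order is irrelevant.
    return tuple(sorted([subject, obj])) + ("meets",)

# A parse that cannot tell subject from object yields both assignments:
readings_eat = {eats("cat", "mouse"), eats("mouse", "cat")}
readings_meet = {meets("cat", "mouse"), meets("mouse", "cat")}

print(len(readings_eat))   # two distinct interpretations survive
print(len(readings_meet))  # the two readings collapse into one
```

So syntactic ambiguity only becomes semantic ambiguity when the relation is not symmetric.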

I would expect the finished grammar to be regular, or at least context-free, and it may have to follow stricter rules than that in order to make it unambiguous. It is nowadays common to use relational grammar rather than phrase-structure grammar in analysing natural language; I will use a model which is similar to relational grammar, but the result will probably be a grammar which also conforms to the rules of regular grammar.

In trying to make something similar to natural languages, we should look not only at these rather aloof formal theories and structures of conceivable languages, but also at the common characteristics of existing languages. One strong tendency among natural languages is that they tend to be either head-initial or head-final. (Sometimes the words right-branching and left-branching are used, sometimes for the same thing and sometimes for something slightly different, but I find those terms misleading since they assume that we write from left to right.) What signifies a head-initial language is that the head word of a phrase tends to be the first word of that phrase. Three examples are often mentioned:

  • the verb precedes the direct object
  • there are prepositions (i.e. position words precede the things which the position is in relation to)
  • nouns and verbs precede the adjectives and adverbs which describe them
Head-initial languages are more common than head-final ones, and English is generally considered among them. But as you can see from the three examples, English does not completely conform to the rules; only two of the three are generally true in English. Most languages, just like English, do not conform to the rules completely. It is another example of a rule which is almost but not completely followed. Also, for cases other than those three, it is not always obvious which word is the head. We shall return to that shortly.

I decided to make the language strictly head-initial, with the added rule that there is always free word order for the dependants. Most natural languages have a default word order, regardless of which word orders are allowed (and whether they are distinct in meaning), so it might seem a strange choice for making things look natural. Furthermore, it would seem that it makes the analysis more complicated, and takes away one method of transferring information, making the language less efficient. My motivation here is partly that there is less risk of error if we allow the user to express things more freely. Particularly since the word order varies greatly in the world's languages, it is something a user might easily get wrong. Also, even though human languages as a rule have a default word order, they are also notorious offenders against it. It is exactly the sort of not-as-rare-as-you-thought exceptions which bring down ordinary parsers.

Let us start by looking at noun phrases. If we have a noun and an article, it is unclear which word is the head of the phrase. Traditionally the most common idea has been to treat the noun as the head, and see the article as something which describes it - it is the noun that makes it a noun phrase, and the article only serves to tell us minor details about the form. But it is also possible to think of the article as the thing that makes this a noun phrase. In fact, we can subtly shift the meaning of the two words, moving the concept of "object" from the noun to the article. In a way, the article is now the real noun - albeit a noun without any further meaning - and the old noun has been demoted to something suspiciously similar to an adjective.

It is no big leap to think that we can do the same thing with verb phrases. That would mean some sort of "article" for the verb. It may sound strange, but it is nothing more than a mandatory copula, something which several natural languages have, and English could be said to use in certain verb forms.

But then we do arrive at a big leap. What if we can treat nouns, verbs, adjectives and adverbs as the same class of words? This way we can conform very well to the rule of free word order after a head. The article or copula marks the phrase, and then the content words can come in any order. Since we have a mandatory article for all phrases, we can make use of that in a most convenient way: We can put all form markers on the article. That way, we can have numerous forms (if needed) but still need no actual inflections on the content words themselves.

This is a great gain. Now the parser does not have to know anything at all about the content words. It only needs to know a small number of function words, and can treat all the others the same. It does not need to know whether a word is a noun or a verb, and it also does not have to keep track of which word is the plural of another word. This is useful enough for one-time parsing, but even more so when continuously communicating with the parser, because it gives us the opportunity to use new words without (syntactical) interpretation problems. Were it not for this idea, we would have a world of trouble trying to explain to the computer that while a boy is a child and a girl is a child, the two of them put together are not two child, nor two childs, but two children.
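A parser of this kind can be sketched very compactly. The marker words below (two articles and a copula) are placeholders of my own choosing, not a proposed vocabulary; the point is only that the parser's entire lexicon is the set of function words, and every other token is an opaque content word attached to the phrase opened by the most recent marker.

```python
# A minimal sketch with hypothetical marker words: the parser knows only
# the function words that open a phrase; all other tokens are opaque
# content words belonging to the current phrase, in any order.

MARKERS = {"the", "a", "is"}  # articles and a copula, for illustration

def parse(tokens):
    phrases = []
    for tok in tokens:
        if tok in MARKERS:
            # A function word opens a new phrase.
            phrases.append({"marker": tok, "content": []})
        elif phrases:
            # Any other word, known or not, joins the open phrase.
            phrases[-1]["content"].append(tok)
        else:
            raise ValueError("sentence must start with a function word")
    return phrases

# A word the parser has never seen ("blorp") causes no trouble:
print(parse("the big blorp is sleeping".split()))
```

Nothing in the parser changes when new content words are introduced, which is exactly the property argued for above.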

Is this found in natural languages? A rather common phenomenon is that adjectives act like static verbs. This is found in many Asian languages, among others. Treating nouns and verbs as the same is not as common, but on the other hand there are usually ways to form one from the other, and in some languages (including English) this can sometimes be done with no morphological changes at all - there is a noun and a verb, but they look the same. All in all, this simplification is somewhat unnatural, but serves a purpose, and the use of markers like the mandatory copula makes it look a little more natural and avoids confusion between things like attributive and predicative use.

Now let us go up a level to look at the relation between the phrases. On the computer science side, the norm is to see the verb as the root of the sentence. This goes back to a long tradition of predicate logic and mathematics. On the linguistics side, it is not quite so obvious which part is the root. One might think that this is merely an arbitrary convention, but in some situations it does make a difference. In natural language analysis, the conventions do in general affect the result, although the specific effects should not be applicable for the type of analysis used here. Does the head assignment make a difference in the linguistic study of natural languages? It does, actually, if we consider the head-initial rule. As we have seen, the verb usually precedes the direct object in a head-initial language; therefore, it seems reasonable to think that the verb is the head of the direct object. But the subject is more complicated. Some head-initial languages let the subject also follow the verb, but it is more common to put it before the verb. It is not unthinkable to treat the subject as the head of the verb.

There is also another tendency among natural languages which should be mentioned: Topic fronting. It means that the topic of the sentence (that which we are talking about) precedes the comment (that which we are saying about the topic). Is this an effect of the head-initial rule? No, because it works the same way in head-final languages. But it can affect our analysis. Consider how topic fronting usually works in English:
What did the mouse do? The mouse ate the cheese.
What happened to the cheese? The cheese was eaten by the mouse.

Here we have essentially the same sentence in two different versions, the only difference (in meaning) being the change in topic. What we do when we want the object to be the topic is simply to make it the subject instead, using a passive verb form. But we can occasionally use simple reordering too:
Him, I like.
This is rare in English, but very common in other languages.

With all this in mind, my model for which phrase is the root of the sentence is somewhat novel: Neither. I choose to make all the constituent phrases equal, and connect them to an invisible sentence root.
It seems, when looking at natural languages, that the head-initial rule has little to say about the verb-subject connection, and is in some sense obscured by topic fronting, so we can argue that the most natural analysis is to see neither of them as the root. Even in programming languages we can see that the subject is sometimes the head of the verb; the method syntax in, for example, Java can be seen as a kind of SVO structure.
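The invisible-root model can be sketched as a flat, unordered tree. The phrase labels below are placeholders for whatever the actual markers turn out to be; the point is that since no phrase is the head of another, every permutation of the phrases maps to one and the same structure.

```python
# Sketch of the proposed analysis: all constituent phrases hang off an
# invisible sentence root as equals, so word order carries no structure.

from itertools import permutations

def sentence(*phrases):
    # frozenset: the root's children are unordered and all equal.
    return ("ROOT", frozenset(phrases))

phrases = ("subj:mouse", "verb:eat", "obj:cheese")
trees = {sentence(*p) for p in permutations(phrases)}
print(len(trees))  # every one of the six word orders yields the same tree
```

This is what makes the free word order of the previous sections carry over to the sentence level for free.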

But most importantly, this gives us the desired free word order, on sentence level just as on phrase level. Natural languages rarely have rigidly fixed word order, so it would be the most naturalistic and least error-prone way to handle things here. That is not to say that this analysis is useful anywhere else - in automatic natural language analysis or in descriptive linguistics - but it seems to be the most effective here.

In order to make this work, we need the nouns to be marked, so that we know which one is subject and which is object. The obvious way (obvious to an English speaker, anyway) would be to mark the subject and the object, in other words, the case markings known as nominative and accusative. But I want to avoid doing everything like in English, so we should not rush into this matter. Let us look at a few examples.
the mouse eats the cheese
the man sits

The first sentence is transitive (it has two arguments) and the second is intransitive (it has one argument). For languages which mark case, there are basically three ways to deal with these sentences.

  • A nominative-accusative language puts the active actor in the transitive sentence (the mouse) and the single actor in the intransitive sentence (the man) in the same case (the nominative), and the passive actor in the transitive sentence in a different case (the accusative).
  • An ergative-absolutive language puts the passive actor in the transitive sentence in the same case as the single actor in the intransitive sentence (the absolutive) and the active actor in the transitive sentence in a different case (the ergative).
  • A tripartite language uses a different case for all three.
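The three alignment systems in the list above can be tabulated directly. Following the usual linguistic shorthand (not introduced in the text, so stated here as an assumption): A is the active actor of a transitive verb, P the passive actor of a transitive verb, and S the single actor of an intransitive verb.

```python
# The three case-alignment systems, as a role-to-case table.
# A = transitive active actor, P = transitive passive actor,
# S = intransitive single actor.

ALIGNMENTS = {
    "nominative-accusative": {"A": "NOM", "S": "NOM", "P": "ACC"},
    "ergative-absolutive":   {"A": "ERG", "S": "ABS", "P": "ABS"},
    "tripartite":            {"A": "ERG", "S": "NOM", "P": "ACC"},
}

for system, cases in ALIGNMENTS.items():
    print(f"{system}: A={cases['A']} S={cases['S']} P={cases['P']}")
```

The table makes the difference plain: the systems disagree only on which of A and P gets grouped with S.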

But we don't want to distinguish between transitive and intransitive verbs. We want to treat all verbs the same and arrive at a semantic analysis. What happens if we treat the same word as both transitive and intransitive? In fact, English has several such verbs.
The man burns the paper.
The paper burns.

As we can see, English is clearly ambiguous here. If we leave out the object of the transitive sentence (the man burns) we get a sentence which looks like the intransitive, but means something completely different. Only our pragmatic knowledge allows us to, hopefully, figure out whether the man is burning something else, or he is himself on fire.

This becomes even more of an issue when we include adjectives as predicates. For the predicate "to be red", the intransitive meaning is obvious. But what should it mean as a transitive verb? Presumably "to make something red", or something similar. Then the same problem occurs.

When we analyse the semantic roles, unless we use predicate-specific rules, we will generally analyse the actors as agent or patient (sometimes with animacy variants, but those are unimportant here). In the majority of cases, the single actor of the intransitive sentence is a patient, as is the passive actor of the transitive sentence, whereas the active actor of the transitive sentence is the agent. This is exactly the same as in absolutive-ergative case marking.

It should be noted that there are exceptions. Consider the intransitive word "sit". What would that mean as a transitive word? It could mean "make something sit", as in "he sat down his cup". It could also mean "have the capacity to be sat on", as in "the sofa seats three". We will have to choose one, and make the other one a separate word, and the most sensible choice seems to be the former, as is usually the case in English.

In some situations, an absolutive-ergative structure appears in English too. The nominalising suffixes "-er" and "-ee" seem to follow this rule. If you start with a transitive verb, such as "employ", the subject is the "employer" and the object is the "employee", but for an intransitive predicate like "retire" or even "be absent", the subject is not a "retirer" or an "absenter", but a "retiree" or "absentee".

Thus I decide to make the language ergative-absolutive, basically making the syntactic role correspond directly to the semantic role. With the free word order, this also saves us the trouble of needing to mark voice on the verb. We can always include or exclude the agent and the patient at will.
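A sketch of the resulting semantic analysis, using hypothetical case tags of my own ("erg", "abs", "verb") rather than any actual marker words: because the cases map directly onto semantic roles and word order is free, the same interpreter handles "the man burns the paper" and "the paper burns", and dropping the ergative phrase can never be mistaken for an agent on fire.

```python
# Sketch: with ergative-absolutive marking and free word order, voice
# becomes unnecessary; we simply include or omit the agent phrase.

def interpret(phrases):
    roles = {"agent": None, "patient": None, "predicate": None}
    for case, word in phrases:
        if case == "erg":
            roles["agent"] = word       # ergative marks the agent
        elif case == "abs":
            roles["patient"] = word     # absolutive marks the patient
        elif case == "verb":
            roles["predicate"] = word
    return roles

# "the man burns the paper" vs. "the paper burns":
print(interpret([("erg", "man"), ("verb", "burn"), ("abs", "paper")]))
print(interpret([("verb", "burn"), ("abs", "paper")]))
```

The second call leaves the agent slot empty instead of reassigning roles, which is exactly what makes a voice distinction redundant.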

Now we have arrived at a rudimentary semi-natural language. It can only express some rather simple things, but in principle it does what it is supposed to do; it is similar to natural languages, but can be analysed with 100% accuracy.
