2010-10-28

A robot experiment and trying to speak naturally

An experiment conducted by Peters and Topp aimed to investigate how humans spontaneously interact with robots. The human guided a robot around a few rooms, pointing out the location of certain objects. The focus was on determining which methods the human uses to show objects - pointing at them, standing next to them, etc. But the experiment also revealed other things about how humans interact with robots.

At first, the humans treated the robot as highly intelligent, more or less as they would a person. Many started with greetings and general pleasantries. As they started guiding the robot around, they used reasonably complex grammar, or at least complete sentences. But as the experiment continued, things rapidly changed. The participants dropped all but the most rudimentary grammar. Any superfluous expressions disappeared, leaving what was essentially bare commands. Words like "please" were relatively common at first, but soon disappeared. Expressions which seemed to have no effect were dropped. Some stopped using expressions like "this is a table" in favour of only "table".

Analysing the full transcripts for some of the participants, I find that the vast majority of their utterances consist of a few short expressions:
  • 23% "Stop"
  • 23% "Turn right" or "Turn left"
  • 22% "Follow me"
  • 8% "Move forward"
  • 6% "Move back" or "Move backward"
In total, these few expressions make up 83% of what is said. Excluding the initial parts, or including variants like "please follow me" and "go forward", would push the figure even higher.

What can we conclude from this? We know that humans are much more adaptable than robots, and it seems that we are so adaptable that we are unable to speak normally to a robot even with a conscious effort. Unless the robot looks and talks exactly as a human, we will inevitably respond by speaking in a simpler and clearer manner. This means that for robot communication, it is not meaningful to try to interpret perfectly natural language, as that is not the language we speak when talking to robots.

What makes much more sense is to try to control the simplifications. It seems clear that we are quick to learn, without conscious effort, which expressions work and which do not. A relatively human-friendly grammar can most likely be learned to near-perfection in an hour's conversation, without the need for any further explanation. Unfortunately it is difficult to test on real machines, since a conversation requires semantic understanding as well as syntactic.

Ostensibly naturalistic formal languages like COBOL or AppleScript fail because they are naturalistic only on a very superficial level. Certain expressions look like natural language, but when the user subconsciously tries to form analogous expressions, they are not accepted, because the underlying structure is not like that of natural languages.

Languages based on predicate logic such as Lojban fail because they are based on a way of thinking which is different from the way humans normally think. In many cases there is also a link between the semantic and syntactic understanding, so it is impossible to separate the tasks of interpreting them.

On the grammars of natural languages

Can a natural language ever be perfectly interpreted using a formal grammar?

Conventional wisdom in language analysis says that it is not possible, that natural language is just too complex for a computer to handle. But such categorical statements have been refuted before - it was not long ago that experts said a computer could never learn to play chess well enough to challenge a human. There is clearly a quantitative difference in complexity, but is there a qualitative one?

When we try to analyse a text according to a formal grammar, there are several things which can go wrong. One thing is that the grammar actually does not cover the particular sentence structure used. Another possibility is that the syntax is in itself ambiguous, and yet another is that the semantics is ambiguous.

A formal grammar can clearly not reliably interpret a sentence which is inherently ambiguous. This is exactly why it is helpful to simplify the language, removing both syntactical and semantic ambiguities. We could of course start with purely natural language and try to hunt down those ambiguities, but a much more straightforward method is to start from the other direction. We assume that the semantics is unambiguous, and create a grammar which is also unambiguous. Then we can see how much it resembles natural language, and we can try to gradually extend and modify the grammar to become more and more like a specific natural language. In principle, we will eventually approach the formal unambiguous language which is the closest to a natural language - the natural language maximally clarified.

If we start with the assumption that we are using completely natural language, we are severely constrained in what we can do. A single word with an ambiguous meaning is enough, in principle, to overthrow the interpretation. We can interpret some percentage of sentences correctly, but never reliably. If we could interpret, say, 90% of sentences, what happens when the user starts using sentences twice as long? At best, each half has the same success probability as before, bringing the rate down to 81%. More likely, the error grows faster still, since each word can now be linked to twice as many potential heads.
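The arithmetic above can be sketched in a few lines (a toy calculation of my own, assuming the parts of a sentence are interpreted independently):

```python
def sentence_success(p_part: float, n_parts: int) -> float:
    """Probability that every part of a sentence is interpreted correctly,
    assuming each part succeeds independently with probability p_part.
    This is optimistic: a longer sentence also adds cross-part attachment
    ambiguities, so the real figure drops faster."""
    return p_part ** n_parts

print(round(sentence_success(0.9, 1), 2))  # 0.9
print(round(sentence_success(0.9, 2), 2))  # 0.81, the figure quoted above
```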

If on the other hand we are allowed to limit the language ever so slightly, we can make reliable statements as to how certain the interpretation will be. We can state exactly which things the interpreter can and can not handle, and at least in some situations we can identify the unclear cases automatically.

2010-10-09

A most simple formal grammar

This is the rudimentary version of the grammar, mentioned in the previous post.

sentence {
    phrase*
}

phrase {
    noun_phrase
    | verb_phrase
}

noun_phrase {
    noun_article attribute*
}

verb_phrase {
    copula attribute*
}

attribute {
    <property>
}

noun_article {
    ART.ABS
    | ART.ERG
}

copula {
    COP
}
("Property" here stands for the part of speech consisting of content words. It is the only element which does not have a finite number of possible words in the grammar.)
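Because every phrase begins with one of the three function words, this grammar can be recognised in a single left-to-right pass. A minimal sketch in Python (the tokenised input format, a flat list of tags and content words, is my own assumption):

```python
FUNCTION_WORDS = {"ART.ABS", "ART.ERG", "COP"}  # the grammar's phrase heads

def parse_sentence(tokens):
    """Group a flat token list into phrases of the form (head, [attributes]).
    Raises ValueError for input which the grammar does not cover."""
    phrases = []
    for tok in tokens:
        if tok in FUNCTION_WORDS:
            phrases.append((tok, []))   # a function word opens a new phrase
        elif phrases:
            phrases[-1][1].append(tok)  # a property attaches to the open phrase
        else:
            raise ValueError(f"property {tok!r} before any phrase head")
    return phrases

# Something like "the-ERG mouse COP eat the-ABS cheese":
print(parse_sentence(["ART.ERG", "mouse", "COP", "eat", "ART.ABS", "cheese"]))
```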

The structure of the language

There have been many attempts at making rules for natural languages. Some are more successful than others, but none can describe the whole complexity of a language as it is really spoken. But that's the point here; I want to find rules that are nearly true for natural language, and then modify the language so that it does follow the rules exactly. It has been necessary to use much simpler rules than the most advanced descriptive attempts, but the goal is in principle to make something that looks similar to natural language - not, however, to any particular natural language, but to natural language in general. Making it similar to a specific language is possibly a different task entirely. It is important to remember this, as I will be using English words for the examples, but the syntax and semantics need not be similar to English.

A typical such rule which almost but not quite works for a lot of natural language expressions is the defining rule of regular grammars. This seems like a good place to start. But a regular language is not necessarily unambiguous. The question of ambiguity is not trivial, because sometimes an ambiguity in the syntax does not lead to ambiguity in the interpretation. Whereas "the cat eats the mouse" is distinctly different from "the mouse eats the cat", one could argue that "the cat meets the mouse" is roughly the same thing as "the mouse meets the cat". So if we had a language which was ambiguous in that it did not distinguish between subject and object, we would get problems with the first verb, but not the second.

I would expect the finished grammar to be regular, or at least context-free, and it may have to follow stricter rules than that in order to be unambiguous. It is nowadays common to use relational grammar rather than phrase-structure grammar in analysing natural language; I will use a model which is similar to relational grammar, but the result will probably be a grammar which also conforms to the rules of regular grammar.

In trying to make something similar to natural languages, we should not only look at these rather aloof formal theories and structures of conceivable languages, but also at the common characteristics of existing languages. One strong tendency among natural languages is that they tend to be either head-initial or head-final. (Sometimes the words right-branching and left-branching are used, sometimes for the same thing and sometimes for something slightly different, but I find those terms misleading since they assume that we write from left to right.) What signifies a head-initial language is that the head word of a phrase tends to be the first word of that phrase. Three examples are often mentioned:

  • the verb precedes the direct object
  • there are prepositions (i.e. position words precede the things which the position is in relation to)
  • nouns and verbs precede the adjectives and adverbs which describe them
Head-initial languages are more common than head-final ones, and English is generally considered among them. But as you can see from the three examples, English does not completely conform to the rules; only two of the three are generally true in English. Most languages, just like English, do not conform to the rules completely. It is another example of a rule which is almost but not completely followed. Also, for cases other than those three, it is not always obvious which word is the head. We shall return to that shortly.

I decided to make the language strictly head-initial, with the added rule that there is always free word order for the dependants. Most natural languages have a default word order, regardless of which word orders are allowed (and whether they are distinct in meaning), so it might seem a strange choice for making things look natural. Furthermore, it would seem that it makes the analysis more complicated, and takes away one method of transferring information, making the language less efficient. My motivation here is partly that there is less risk of error, if we allow the user to express things more freely. Particularly since the word order varies greatly in the world's languages, it is something a user might easily do wrong. Also, even though human languages as a rule have a default word order, they are also notorious offenders against it. It is exactly the sort of not-as-rare-as-you-thought exceptions which bring down ordinary parsers.

Let us start by looking at noun phrases. If we have a noun and an article, it is unclear which word is the head of the phrase. Traditionally the most common idea has been to treat the noun as the head, and see the article as something which describes it - it is the noun that makes it a noun phrase, and the article only serves to tell us minor details about the form. But it is also possible to think of the article as the thing that makes this a noun phrase. In fact, we can subtly shift the meaning of the two words, moving the concept of "object" from the noun to the article. In a way, the article is now the real noun - albeit a noun without any further meaning - and the old noun has been demoted to something suspiciously similar to an adjective.

It is no big leap to think that we can do the same thing with verb phrases. That would mean some sort of "article" for the verb. It may sound strange, but it is nothing more than a mandatory copula, something which several natural languages have, and English could be said to use in certain verb forms.

But then we do arrive at a big leap. What if we can treat nouns, verbs, adjectives and adverbs as the same class of words? This way we can conform very well to the rule of free word order after a head. The article or copula marks the phrase, and then the content words can come in any order. Since we have a mandatory article for all phrases, we can make use of that in a most convenient way: We can put all form markers on the article. That way, we can have numerous forms (if needed) but still need no actual inflections on the content words themselves.

This is a great gain. Now the parser does not have to know anything at all about the content words. It only needs to know a small number of function words, and can treat all the others the same. It does not need to know whether a word is a noun or a verb, and it also does not have to keep track of which word is the plural of another word. This is useful enough for one-time parsing, but even more so when continuously communicating with the parser, because it gives us the opportunity to use new words without (syntactical) interpretation problems. Were it not for this idea, we would have a world of trouble trying to explain to the computer that while a boy is a child and a girl is a child, the two of them put together are not two child, nor two childs, but two children.
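A sketch of what this buys us: the tagger below knows only three function words and classifies everything else as a property, so brand-new content words cause no syntactic problems. (The surface forms "ta", "te" and "is" are invented for illustration; English articles do not mark case, so a real translator would need some other convention.)

```python
# The parser's entire lexicon: three function words. All form markers (case,
# and if needed number, tense, ...) would live on these, never on content words.
FUNCTION_WORDS = {
    "ta": "ART.ABS",  # invented absolutive article
    "te": "ART.ERG",  # invented ergative article
    "is": "COP",      # mandatory copula
}

def tag(word):
    """Classify a word with no knowledge of the content vocabulary."""
    return FUNCTION_WORDS.get(word, "PROP")

# A never-before-seen word like "snark" is tagged without trouble:
print([tag(w) for w in ["ta", "snark", "is", "red"]])
# ['ART.ABS', 'PROP', 'COP', 'PROP']
```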

Is this found in natural languages? A rather common phenomenon is that adjectives act like stative verbs. This is found in many Asian languages, among others. Treating nouns and verbs as the same is not as common, but on the other hand there are usually ways to form one from the other, and in some languages (including English) this can sometimes be done with no morphological changes at all - there is a noun and a verb, but they look the same. All in all, this simplification is somewhat unnatural, but it serves a purpose, and the use of markers like the mandatory copula makes it look a little more natural and avoids confusion between things like attributive and predicative use.

Now let us go up a level and look at the relation between the phrases. On the computer science side, the norm is to see the verb as the root of the sentence; this goes back to a long tradition of predicate logic and mathematics. On the linguistics side, it is not so obvious which part is the root. One might think that this is merely an arbitrary convention, but in some situations it does make a difference. In natural language analysis the conventions do in general affect the result, although the specific effects should not matter for the type of analysis used here. Does the head assignment make a difference in the linguistic study of natural languages? It does, if we consider the head-initial rule. As we have seen, the verb usually precedes the direct object in a head-initial language; it therefore seems reasonable to treat the verb as the head of the direct object. But the subject is more complicated. Some head-initial languages let the subject follow the verb as well, but it is more common to put it before the verb. It is not unthinkable to treat the subject as the head of the verb.

There is also another tendency among natural languages which should be mentioned: Topic fronting. It means that the topic of the sentence (that which we are talking about) precedes the comment (that which we are saying about the topic). Is this an effect of the head-initial rule? No, because it works the same way in head-final languages. But it can affect our analysis. Consider how topic fronting usually works in English:
What did the mouse do? The mouse ate the cheese.
What happened to the cheese? The cheese was eaten by the mouse.

Here we have essentially the same sentence in two different versions, the only difference (in meaning) being the change of topic. What we do when we want the object to be the topic is simply to make it the subject instead, using a passive verb form. But we can occasionally use simple reordering too:
Him, I like.
This is rare in English, but very common in other languages.

With all this in mind, my model for which phrase is the root of the sentence is somewhat novel: Neither. I choose to make all the constituent phrases equal, and connect them to an invisible sentence root.
It seems when looking at natural languages that the head-initial rule has little to say about the verb-subject connection, and is in some sense obscured by topic fronting, so we can argue that the most natural analysis is to see neither of them as the root. Even in programming languages we can see that the subject is sometimes the head of the verb; the method syntax in for example Java can be seen as a kind of SVO structure.

But most importantly, this gives us the desired free word order, on sentence level just as on phrase level. Natural languages rarely have rigidly fixed word order, so this seems to be the most naturalistic and least error-prone way to handle things here. That is not to say that this analysis is useful anywhere else - in automatic natural language analysis or in descriptive linguistics - but it seems to be the most effective here.

In order to make this work, we need the nouns to be marked, so that we know which one is subject and which is object. The obvious way (obvious to an English speaker, anyway) would be to mark the subject and the object, in other words, the case markings known as nominative and accusative. But I want to avoid doing everything like in English, so we should not rush into this matter. Let us look at a few examples.
the mouse eats the cheese
the man sits

The first sentence is transitive (it has two arguments) and the second is intransitive (it has one argument). For languages which mark case, there are basically three ways to deal with these sentences.

  • A nominative-accusative language puts the active actor in the transitive sentence (the mouse) and the single actor in the intransitive sentence (the man) in the same case (the nominative), and the passive actor in the transitive sentence in a different case (the accusative).
  • An ergative-absolutive language puts the passive actor in the transitive sentence in the same case as the single actor in the intransitive sentence (the absolutive) and the active actor in the transitive sentence in a different case (the ergative).
  • A tripartite language uses a different case for all three.

But we don't want to distinguish between transitive and intransitive verbs. We want to treat all verbs the same and arrive at a semantic analysis. What happens if we treat the same word as both transitive and intransitive? In fact, English has several such verbs.
The man burns the paper.
The paper burns.

As we can see, English is clearly ambiguous here. If we leave out the object of the transitive sentence (the man burns) we get a sentence which looks like the intransitive, but means something completely different. Only our pragmatic knowledge allows us to, hopefully, figure out whether the man is burning something else, or he is himself on fire.

This becomes even more of an issue when we include adjectives as predicates. For the predicate "to be red", the intransitive meaning is obvious. But what should it mean as a transitive verb? Presumably "to make something red", or something similar. Then the same problem occurs.

When we analyse the semantic roles, unless we use predicate-specific rules, we will generally analyse the actors as agent or patient (sometimes with animacy variants, but those are unimportant here). In the majority of cases, the single actor of the intransitive sentence is a patient, as is the passive actor of the transitive sentence, whereas the active actor of the transitive sentence is the agent. This is exactly the same as in absolutive-ergative case marking.

It should be noted that there are exceptions. Consider the intransitive word "sit". What would that mean as a transitive word? It could mean "make something sit", as in "he sat the guests at the table". It could also mean "have the capacity to be sat on", as in "the sofa seats three". We will have to choose one, and make the other one a separate word, and the most sensible choice seems to be the former, as is usually the case in English.

In some situations, an absolutive-ergative structure appears in English too. The nominalising suffixes "-er" and "-ee" seem to follow this rule. If you start with a transitive verb, such as "employ", the subject is the "employer" and the object is the "employee", but for an intransitive predicate like "retire" or even "be absent", the subject is not a "retirer" or an "absenter", but a "retiree" or "absentee".

Thus I decide to make the language ergative-absolutive, basically making the syntactic role correspond directly to the semantic role. With the free word order, this also saves us the trouble of needing to mark voice on the verb. We can always include or exclude the agent and the patient at will.
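With ergative-absolutive marking and free word order, the semantic roles are a direct read-off from the articles. A sketch (the phrase representation, pairs of head tag and property list, is my own):

```python
CASE_TO_ROLE = {"ART.ERG": "agent", "ART.ABS": "patient", "COP": "predicate"}

def roles(phrases):
    """Map parsed phrases [(head, [properties]), ...] to semantic roles.
    The roles come entirely from the case marking, not from word order."""
    return {CASE_TO_ROLE[head]: props for head, props in phrases}

# "the-ERG man burn the-ABS paper", in any order, gives the same analysis:
a = roles([("ART.ERG", ["man"]), ("COP", ["burn"]), ("ART.ABS", ["paper"])])
b = roles([("ART.ABS", ["paper"]), ("ART.ERG", ["man"]), ("COP", ["burn"])])
assert a == b == {"agent": ["man"], "predicate": ["burn"], "patient": ["paper"]}

# Leaving out the agent needs no passive voice: "the-ABS paper burn"
print(roles([("ART.ABS", ["paper"]), ("COP", ["burn"])]))
# {'patient': ['paper'], 'predicate': ['burn']}
```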

Now we have arrived at a rudimentary semi-natural language. It can only express some rather simple things, but in principle it does what it is supposed to do; it is similar to natural languages, but can be analysed with 100% accuracy.

The structure of the program

The visualiser shows the current state of the world, and displays an input box asking the user to say something. The input is written to a file. (Passing data through files lets us check for errors at each stage, and easily interfere at any point in the process while developing the system.)

The translator reads the file, and replaces the lexical words with the interlinear words. If the input is in the same language the visualiser uses (which happens to be English), it's a very simple process, translating the form words and marking the content words. By switching to a different translator, one can easily use a different language. The syntax analysis will work exactly the same, but if the final interpretation is to work there also has to be a translation of the relevant content words to the target language. This only applies to predefined words - it is possible to define new words within the system. Depending on the application, the number of predefined words could be very small. It is easy to imagine an application which is self-lexicalising regarding the content words; the computer has a preexisting notion of certain properties (objects, actions, etc.) but asks for the word to describe them, completely eliminating the need for predefined content words.
The results of the translation are written to a new file.

The parser reads that file, and attempts to identify the head and the function of each word. For natural language this is a very complicated process, and can only be done to a limited degree of precision. But here, all other steps of the process (including those in the user's brain) have adapted in order to make this one step trivial.
I have used two different approaches in constructing the parser. The first acts as a pushdown state machine, much like a regular compiler. Each state corresponds to a unit in the analysis. For each state there is a subroutine, which reads the next word. It determines whether to finish this current state and go up to the calling function, or call another function corresponding to the unit for which the word in question is the head.
The other method is to first note the part of speech and form of the words, and then go through them backwards. For each word, the parser simply steps backward until it finds a word which can act as the head of the current word. This method is simpler, but puts stricter limits on the structures used in the language.
The parser writes another file, in a table format similar to the CoNLL standard.
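The second, backward-stepping method can be sketched as follows (the head table is a simplification: properties attach to the nearest preceding article or copula, and function words attach to the invisible sentence root; the tagged-input format is my own assumption):

```python
POSSIBLE_HEADS = {"PROP": {"ART.ABS", "ART.ERG", "COP"}}

def find_heads(tagged):
    """tagged: list of (word, tag) pairs. Returns one head index per word,
    CoNLL-style: 0 is the invisible sentence root, otherwise 1-based."""
    heads = []
    for i, (_, tag) in enumerate(tagged):
        if tag not in POSSIBLE_HEADS:   # articles and copulas hang off the root
            heads.append(0)
            continue
        for j in range(i - 1, -1, -1):  # step backwards to the nearest head
            if tagged[j][1] in POSSIBLE_HEADS[tag]:
                heads.append(j + 1)
                break
        else:
            raise ValueError(f"no head found for word {i + 1}")
    return heads

tagged = [("the", "ART.ERG"), ("mouse", "PROP"), ("is", "COP"), ("eat", "PROP")]
print(find_heads(tagged))  # [0, 1, 0, 3]
```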

The treemaker turns the list into an object-oriented semantic tree. Each word is represented by a word object, with links to its head and dependants.

The semantic interpreter goes through the sentence tree, and performs the actions described by the predicates. It compares the properties in the tree to the properties of the objects in a collection.
Finally it calls the various methods of the visual representations of the objects. The visualiser automatically updates to view the changes, and then asks for new input.
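The matching step in the semantic interpreter can be sketched like this (the object representation, a set of properties per object, is my own assumption):

```python
# A tiny world: each object carries a set of properties.
world = [
    {"id": 1, "props": {"table", "red"}},
    {"id": 2, "props": {"table", "blue"}},
    {"id": 3, "props": {"chair", "red"}},
]

def referents(phrase_props, objects):
    """Return the objects whose properties include every property in the phrase."""
    wanted = set(phrase_props)
    return [obj for obj in objects if wanted <= obj["props"]]

# Adding properties narrows the reference, as in "the red table":
print([obj["id"] for obj in referents(["table"], world)])         # [1, 2]
print([obj["id"] for obj in referents(["table", "red"], world)])  # [1]
```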

3D environments

For the purpose of making a 3D visualisation of the interpreter, I tried several different ways of producing 3D graphics.

OpenGL

OpenGL is a low-level graphics library used on several platforms and with several programming languages. For a more serious 3D application, this might have been the library of choice. However, this implementation is not about efficiency on that level, so neither OpenGL nor C seems optimal. I did not need any complex 3D models, just the most basic functionality, so it seemed at first that maybe going down to this basic level would be a good idea. But there are simple and complex schemes on both high and low level, and it soon became apparent that this was not the best option.

Java3D

Java3D is a 3D API for Java, which can use OpenGL underneath. It seemed like an easier way into OpenGL than using it directly through C. It seemed like a good idea to make the semantic model in Java, so it would be rather convenient to be able to connect it to the graphical interface within Java. But although somewhat more high-level, Java3D is not the simple solution I was looking for.

Alice

Alice is a program designed to teach object-oriented programming to young students. It does this using 3D scenes, and has a collection of animated 3D models available. It was based on Java and Python. This seemed like a good way to get the needed functionality without the complexity of a full-blown Java API; there were already models, and there were simple functions for doing the basic things that needed doing, such as moving the models around or resizing them. But how to get the functions out of the rather limited Alice environment and connect them to the interpreter?

Writing the visualisation in Alice itself didn't seem like a viable idea; it was far too limited. Earlier versions of Alice had supported scripting with Jython, and it turned out to be possible to enable that functionality in version 2.2 as well. Unfortunately the scripting left a lot to be desired. It was unstable, undocumented, and generally unreliable. It seemed at the time like a lost cause for this purpose.

SketchUp

SketchUp is Google's easy-to-use CAD program, partly made for use with Google Earth. It's not made for animation or interaction, but it does have scripting capabilities, using Ruby. One advantage of this solution is that a large number of models are available. As a CAD program it is indeed fairly straightforward, and I considered writing the whole project in Ruby. But on further exploration, the scripting turned out to be rather slow and cumbersome.

Finally

I realised that there was a new version of Alice, still in beta but with a very important improvement: The projects could now be exported, to create a set of Java files. This made everything a great deal easier. I could now do all the real work on the visualisation in Java, as well as the semantic model. I was essentially only using the 3D engine and models from Alice, as I had hoped to do all along.

Simplifications

Natural language processing includes several steps, each with its own problems. There are problems with ambiguity in grammars and words, and these are the things which the seminatural language attempts to address. But there are other types of difficulties. Often one wants to start with spoken language, which brings several sources of error: the task of segmenting speech into sentences and words, the differences between dialects and individuals, and simple acoustic noise. Even if one starts with text, there is the possibility of spelling mistakes and other errors.

This project does not deal with speech, nor with any kind of automatic error correction. Can this be justified, and is there any other way to make sure that the input is reasonable? In some applications the user does supply the input in the form of text, and even when it is in spoken form, interpreting it as text is largely a separate problem. Assuming that all input is correct would be a severe limitation, but that does not mean that any sort of error correction in the form of "guessing" what the user meant is necessary. The important thing is not really to find the intended interpretation of every sentence; far more important is to identify which sentences are wrong. A seminatural language system can do this very well: pointing out which sentences do not follow the grammatical rules, and asking the user to correct them, is trivial.

When to use seminatural languages

In some situations, it is necessary that a computer reacts to completely natural language. It might be analysing text or speech which was not originally intended for computer analysis, such as newspapers. It might also deal with users who are not trained in or used to interacting with the machine. In other situations, it is perfectly acceptable to use formal language, and it might even be preferable from the user's perspective. The typical example here is programming languages. But there are also situations where we might want a compromise between the qualities of these two. It can be convenient when dealing with robots, whether industrial assembly robots, toy robots, or automatic lawnmowers. If these machines could respond to spoken or written language, as opposed to simple buttons and switches, they could be made to perform more complex actions without becoming too difficult to handle. Perfect understanding of natural language would of course be sufficient, but that is a level of technology we are far away from and not really willing to wait for, and unlike in some other applications it is not necessary here. Such machines have a limited set of uses, and therefore need only a limited vocabulary, and they can generally make do with a less advanced grammar.

The idea of seminatural language

Making machines understand human language has proven frustratingly difficult. All the methods used today are prone to misinterpretation; not only do they fail to understand many sentences, but they also fail to identify which sentences are problematic. Meanwhile, making humans understand machine language has been much more successful. Not surprisingly, perhaps, since humans are generally more adaptable than most machines. In fact, when humans interact with robots, they quickly tend to change their patterns of speech, and of communication in general, in order to make the machine understand as well as possible. So perhaps it would be possible to slightly adjust the human language, rather than trying in vain to adjust the models?

We have had models of human language since long before computers; it could be said to be far better studied than any sort of machine interaction. The problem is just that humans tend not to follow the rules set out by grammar all too strictly, and also that human grammars have some "flaws" which occasionally make it impossible to interpret a sentence unambiguously. For humans this isn't a problem, since we have a great deal of outside information and experience which helps us interpret what is being said, but for a computer even small holes in the interpretation can lead to a complete breakdown of understanding. The idea is therefore this: to set up a set of simple grammatical rules resembling human language as far as possible while remaining unambiguous, and to require that the user follow these rules exactly. The aim of this project is to investigate what such rules might look like, and to get some idea of how much such a language would differ from normal human language and what kind of difficulties humans might experience in using it.

Short description

Background

There are two completely different ways of analysing human language. In the early days of natural language parsing, the common way was to use a formal grammar. The language would be treated as a formal language, and analysed in largely the same way as a programming language. Nowadays it is more common to use statistical methods. One starts with a large text, a corpus, which has been analysed by hand, and uses machine learning techniques to get a program to learn the connections and be able to interpret a given text.

One application of language understanding is to give instructions to robots. Depending on the situation, natural or formal language may be more appropriate. Sometimes it is necessary to use formal language to avoid errors, but when possible natural language has some advantages. One is that the user already knows the language. But also, when robots develop and find use in everyday applications, it may be good to have a language which has many of the properties a natural language has, and even if it is different from the standard human language a naturalistic language is likely easier to learn than a formal language.

Problem

Using a formal grammar to analyse natural language turned out to be difficult. Natural languages do not follow the simple grammars which you see in grammar books; their rules are much more complex. They also include ambiguities which can only be resolved with extensive knowledge of the real world and the actual meaning of what is being said. Therefore, statistical methods turned out to be more effective. But it is in the nature of statistical methods that they are not always right, and for many sensitive applications the error frequency is unacceptable.

Idea

To get around the problem, one could use a language which has a formal grammar, but is otherwise similar to a natural language. Such a seminatural language would be more intuitive and therefore easier for humans to learn than a traditional formal language, and it would have different forms of expression. It would also be easier for machines to interpret than purely natural language, and perhaps in some situations it would help humans understand it as well.

My idea is thus to investigate which properties such a language could have, and make an application in the form of a program which reads text in a seminatural language, and responds by manipulating objects in a 3D world representing the environment a robot might interact with.

Specifically for communicating with robots, it could be useful to look at aspects of natural and formal languages other than the obvious difference that the formal languages have a formal grammar. For example, a natural language is capable of both excluding and repeating information, which can be useful in many situations. Even if a language is syntactically unambiguous there may then be semantic ambiguities, depending on the situation. Those ambiguities can be resolved using feedback. A graphical simulation gives the user visual feedback; the user sees the effect of what he has said, and can confirm or reject the result. Just like in a natural language this can be combined with verbal feedback.

On the other hand, there are properties of formal languages like programming languages which could be useful in a communication language, even more so if the communication is with machines. One example is the ability to express exact definitions of new words. Natural languages, including many of the constructed languages made for human communication, have a great number of concepts which are not strictly defined within the language; a large part of the lexicon can not be explained, only translated. One can also try to simplify grammatical rules to decrease the risk of misunderstanding, and add things like clearer logical expressions and more recursive structures. By adding such syntactical structures we pave the way for languages which are not only syntactically but also semantically partly or wholly deterministically analysable.