In 1950 Alan Turing, a noted mathematician and cryptographer, published a paper in which he described what is now called the "Turing Test". To picture it in its modern form, imagine texting with a stranger. You can send any kind of text message you want. The stranger can reply, or not, however he chooses. After some time (days, perhaps a week) you are asked to answer a simple question: Is the stranger a person or a machine? The Turing Test is designed to answer an important question: Is it possible to create an intelligent machine? If a machine can pass the Turing Test by successfully impersonating a human then it is possible to create an intelligent machine, according to Turing.
Since then Computer Scientists and many others have been fascinated by the Turing Test. Individuals and groups have set up actual Turing Tests but it turns out to be tougher than you would think to get right. What would happen, for instance, if the stranger was a human who tried to imitate a machine? Is this really fair? Also, certain limitations are usually necessary because machines that can look, sound, and act like humans only exist in the realm of fiction. Hence, the text message scenario.
But the concept behind the Turing Test has continued to fascinate people because Turing had a profound idea. If a machine can do things that intelligent humans do then it must be intelligent. This seems like a very intuitive and natural approach to figuring out what we mean by "intelligent". And we have just witnessed a very public event. The TV show "Jeopardy" hosted a three-day, two-game exhibition match between a computer (actually a network of IBM computers) and two champions (Ken Jennings, winner of 74 matches in a row, and Brad Rutter, the all-time money champ). We know a lot of people are fascinated because the match bumped Jeopardy's ratings up 30%, according to Nielsen.
Technically, it wasn't a Turing Test because we all knew that Watson (the name IBM gave to their computer system) was a machine. But the question lots of people, including yours truly, were asking themselves was whether Watson was able to play like a human. The answer we got, if we just go by results, was that Watson was better during these two games than his human competition. And, given the quality of the players he was up against, we can say that he was far better than the typical Jeopardy contestant and, therefore, far far better than the rest of us.
As I have pointed out, staging a Turing Test, even a fake one like the Jeopardy exhibition, is a lot tougher than it looks if one of your objectives is fairness. IBM staged a number of demonstration matches in order to convince the Jeopardy producers that the tournament would be a good idea. In these demonstration matches it was obvious that Watson could be led astray by constructing "questions" properly. (I know about the "response in the form of a question" gimmick. It works well on the show but I am not going to bother in this article.) On the other hand, electronics are far faster at reflex activities like pressing a button. So a lot of work was put into providing what both sides saw as a level playing field: "normal" Jeopardy questions on one side versus making Watson actually push a button on the other side. And both sides wanted to create an entertaining result, so both sides wanted the human participants to have a chance and for Watson to not look like a comedy punch line. And they succeeded. It was fun to watch.
So how did it all come out? Well, Watson showed some weak spots but generally won handily. The final score was Watson - $77,147, Jennings - $24,000, and Rutter - $21,600. Watson was also way ahead at the end of the first match. And Watson was pretty good at figuring out whether he knew the answer or not. We got to see Watson's top three possible answers for most questions. The top answer was coded green for confident, red for not confident, and yellow for somewhere in between. Most of the time Watson's answer was green and another player rang in first when Watson coded his best answer red. Watson also rang in first about two thirds of the time. But the tale is told by the answers Watson got wrong, especially the ones he got really wrong.
We can decide Watson is a machine by the sheer speed and breadth of his performance. But if the IBM people did not think he was quick and knew lots of stuff the exhibition would never have taken place. So that's not enough. We all know that modern computers can organize a lot of data. But, while computers are very good at dealing with "structured" data, say where you have a table with rows and columns, computers are poor at dealing with unstructured data. Put simply, computers can't read.
Oh, they can scan text. Then they can identify all the letters and assemble the letters into words by taking advantage of spaces and other punctuation. But it is very hard for computers to take the next step and understand what the words mean. Most of the truly massive amount of data that was loaded into Watson was in the form of long sequences of text. All of Wikipedia was loaded into Watson. Wikipedia consists of over 2 million articles. And each article consists mostly of standard text because that's what people are good at using. The Internet Movie Database was also loaded in, along with a truly astounding number of other references. A lot of the IMDB data is structured. You have the movie name at the top. Then you have a section that lists each actor and role, one to a line, and so on. If you take a hard look at IMDB you will find out that it is not that easy but, for IMDB it seems like you at least have a chance of sorting much of it out. But for Wikipedia and most of what was loaded into Watson it is a lot harder.
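To make the "easy" step concrete, here is a minimal sketch in Python of assembling letters into words using spaces and punctuation. The function name tokenize is my own invention for illustration, and this is certainly not what IBM ran. The point is how little you have at the end of it: a list of words and no idea what any of them mean.

```python
import re

def tokenize(text):
    """Split raw text into lowercase words, treating spaces and punctuation as separators."""
    # Keep runs of letters, digits, and apostrophes; everything else is a separator.
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Adam begat Cain. Sam is short for Samantha."))
```

The hard part, deciding what "begat" implies or whether "Sam" is a man or a dog, has not even been started.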
I might find a sentence like "Adam begat Cain". This tells a person that there is a parent-child relationship where Adam is the father and Cain is the son. But to completely nail it down, you have to know that Adam and Cain are both male names. And what about "A Boy Named Sue", the popular Johnny Cash song? While Sue is normally a female name, in this case Sue is a male. A friend had a dog named Sam, short for Samantha. Sam is usually the name of a male person. And I could come up with even tougher examples where it is hard for a machine to make sense of things. There is ambiguity. There are contradictions. People are pretty good at functioning in this kind of messy environment but computers aren't.
So how did Watson do? Watson came up with the correct answer in a truly astounding number of areas. So whatever the IBM people did to collect and organize Watson's data, it worked pretty well. Most of the time Watson came up with a green answer that was correct. In one case Watson came up with a yellow rated answer of "Serbia" when the correct answer was "Slovenia". I wouldn't have known which answer was the correct one. So I score Watson high for getting close and knowing when he wasn't sure. I don't know what process the IBM people used to assemble the database. The advantage they had was that this could all be done "offline" before the exhibition started. And it might be that they cheated by using statistical techniques like "this word is frequently found near that word". But if they were able to do the linguistic analysis necessary to get from "Adam begat Cain" to "Adam is the father of Cain", "Cain is the son of Adam", etc., in other words do linguistic and other analysis to turn strings of text into usable information, that would be a truly useful feat.
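Here is a toy sketch in Python of the two approaches I am contrasting. I have no idea what IBM actually did, so everything here (the pattern, the function names, the parent_of and child_of labels) is made up for illustration. The first function does the "linguistic" step for exactly one sentence shape. The second does the statistical trick of counting which words turn up near each other.

```python
import re
from collections import Counter

# Hypothetical pattern that only understands "<Name> begat <Name>".
BEGAT = re.compile(r"\b([A-Z][a-z]+) begat ([A-Z][a-z]+)\b")

def extract_facts(text):
    """Turn each 'X begat Y' into two usable facts."""
    facts = []
    for parent, child in BEGAT.findall(text):
        facts.append((parent, "parent_of", child))
        facts.append((child, "child_of", parent))
    return facts

def cooccurrence(text, window=2):
    """Count how often two words appear within `window` words of each other."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            counts[tuple(sorted((word, other)))] += 1
    return counts

print(extract_facts("Adam begat Cain"))
print(cooccurrence("Adam begat Cain"))
```

Notice that the extractor says "parent" and "child" rather than "father" and "son". Getting to father and son requires knowing that Adam and Cain are male names, which is exactly the kind of outside knowledge the pattern does not have. And the word counts tell you that "adam" and "cain" go together without telling you who begat whom.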
Jeopardy also poses quite a challenge in the structure of the clues. They are not standard English. There are frequently puns and other tricks. These are hard for people type contestants to deal with, especially with the time constraint. But they are much more difficult for a computer to deal with. And in this case, you can't use statistical tricks. It won't yield enough information because Jeopardy clues are very tightly packed and are frequently constructed so that the components are ambiguous and you have to combine all the components to narrow things down to one answer.
On several occasions Watson went astray by not figuring out an attribute that the correct answer needed to possess. For instance, in one case Watson gave a green rated answer of "Picasso", which was wrong. The clue was asking for a painting style not the name of a painter. The correct answer was "Modern Art". This would have been completely obvious to a human. In another case Watson was unable to correctly process the category. It was a tricky one, keys found on a computer keyboard. For instance, one answer that Watson did not ring in on was "F1". But in another case Watson rang in and supplied a green rated answer of "Chemise". There is no "Chemise" key on a computer keyboard but there is a "Shift" key. The clue had to do with clothing styles. I might or might not have come up with the correct answer but I would definitely have known that "Chemise" was wrong. Had Watson gotten the category, the questions would have been a piece of cake for him. Watson also answered "Dorothy Parker", an author, when what was required was "Elements of Style", the title of a book. I believe this was on a Daily Double and Watson did correctly rate his answer as a red.
So tricky categories and clues were a disadvantage to Watson. But one area where he should have had a decided advantage was ringing in. One would expect that Watson would let red answers go but would always ring in first when he had a green answer. But in 11 cases Watson had a green answer and was still beaten by one of the human players. I don't know what the story was here. Ken Jennings did say somewhere that it is possible to do an "anticipatory" ring in. You try to figure out when Alex is going to finish the "question" and ring in an instant after he should finish. If you ring in early your button is locked out for a while, so there is a high penalty for ringing in early. Successful contestants try to figure out the answer while Alex is still talking. Watson got the questions in text form as soon as they were displayed on the board and employed the same strategy. So, for green answers, Watson should always ring in the same way.
Without knowing a lot more of the details (see - I told you this is tricky) it is impossible to give Watson a grade on the Jeopardy version of the Turing Test. But, ignoring the obvious, like the presence of an Avatar rather than a real person, and assuming the whole undertaking was "legit", for the most part Watson did very well. He was able to make the proper sense of most of the clues most of the time. And, besides the Avatar, there was another dead giveaway. In one case Watson gave the same answer as another contestant who had gotten it wrong. Watson was deaf and they did not feed any of the audio to him. Again, ignoring the obvious (Avatar, deafness), the fact that Watson made at least one bonehead mistake means that technically he failed. But it's still a stunning achievement.
So, beyond the Turing Test aspect, what does this all mean? I was in school in the early '70s and it was one of those times when Artificial Intelligence (AI) research was on the rise. A number of "proof of concept" projects had generated a lot of buzz inside the Computer Science community. The story was "just let us crank these up a bit and see what we can really do". But none of these projects was able to progress past "proof of concept" into something more general and more powerful. After watching for a couple of years I decided that real progress in AI was a long way away and that AI was really hard to do. Unfortunately, my observations turned out to be spot on. There hasn't been a lot of headline progress on AI since. But Watson proves that real progress has been made.
The issue I discussed above of turning text into data was completely beyond the state of the art in the '70s for anything but toy environments. If, as it seems, the Watson project has been able to process astounding amounts of raw text data and turn it into real information that can be used by computers, that is tremendous and real progress. There is another aspect of the Watson project that I want to discuss next. That's machine learning.
The original approach to AI was to put in a bunch of rules. Things like "normal temperature for a human is 97 - 99 degrees Fahrenheit", that sort of stuff. You put in a bunch of rules and the computer used them to answer your question. The theory was that if you had enough rules and they were good rules, you could get a good result. That approach eventually petered out. The modern approach to the same problem is to put some kind of general structure and analytic capability into the system. Then you give the system a bunch of examples, each one labeled as right or wrong. This is called a "training set". In our case they fed in a bunch of right and wrong "answers and questions". Then the idea is that the system does some kind of analysis of the "training set" and figures out its own rules. Computer Science people have been playing around with this approach for many years now. And it's the approach the IBM people used. Based on their results, I would have to say that the state of the art in machine learning is now pretty good. And that's good news too.
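A tiny Python sketch may help show the difference between the two approaches. The first function is the old style: a person types in the rule. The second is the new style in miniature: the program is handed labeled examples and works out the range for itself. The temperature readings are made up by me for illustration, and a real system like Watson would use something far more sophisticated than taking the lowest and highest "normal" example.

```python
def rule_based_is_normal(temp_f):
    """Old style: a human supplies the rule (97 to 99 degrees Fahrenheit)."""
    return 97.0 <= temp_f <= 99.0

def learn_range(training_set):
    """New style: derive the rule from examples labeled True (normal) or False."""
    normal = [temp for temp, is_normal in training_set if is_normal]
    return min(normal), max(normal)

# Made-up training set: (temperature in Fahrenheit, labeled normal or not).
training_set = [
    (96.0, False),
    (97.2, True),
    (98.6, True),
    (98.9, True),
    (100.4, False),
]

low, high = learn_range(training_set)

def learned_is_normal(temp_f):
    """Apply the range the program worked out for itself."""
    return low <= temp_f <= high

print(low, high)
print(rule_based_is_normal(98.0), learned_is_normal(98.0))
```

Nobody told the second version where the cutoffs are. Give it more and better examples and its rule gets better without anyone rewriting the program.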
So where do we go from here? Years ago a "physician assistant" was developed. Modern medicine is (and was) unbelievably complicated. The idea was to provide some computer assistance to help medical people diagnose tough cases. The program ultimately went nowhere. But this project seems like a perfect fit for Watson's capabilities. Pump in a lot of medical data, most typically found in text form. This medical data will be chock full of ambiguities and contradictions, just as the Jeopardy database was. Watson has tremendous English language capabilities and "medical language" should be no harder to master than "Jeopardy language". Finally this "training set" approach to machine learning should work as well for medicine as it did for Jeopardy. So it sounds like this would be a good application for Watson technology.
This has already occurred to IBM. They are starting work on a medicine version of Watson. Another area they have identified is the law. Again you have vast amounts of text, in this case legalese, that is ambiguous and contradictory. But again the same types of abilities that made Jeopardy Watson successful look like a good fit. Certainly if one or both of these projects are successful we can expect IBM (and eventually others) to come up with other applications.
Finally, can we look forward to a rematch? I don't think so. I have seen some of the test runs IBM did before the Jeopardy producers decided Watson was ready for prime time. The goofs were much more frequent and much more embarrassing. Yet the progress from there to what we saw in prime time only a few months later was truly astounding. Given even a few more months I am sure the Watson team could fix the few goofs we saw, and many more, to the point that human contestants would go from having little chance to having no chance at all. The only reason I can think of that I might have misjudged the situation would be if the books were cooked in some non-obvious way. Since I don't think the books were cooked, I think this Jeopardy challenge will be a one-time event.