Big Data is not the answer. #5 #cong16

By Syed Ghazi

The movement of science and technology is (and perhaps always has been in some sense) towards the shores of Artificial Intelligence (AI). The idea that we will demystify the inner workings of the human brain, then codify it and finally replicate it in an intelligence that is not limited by the human condition is an ancient one which continues to fascinate our collective imagination.

There is the controversial argument that theology professed faith in AI before science even acknowledged the hypothesis, but that is for another time. The AI aspiration comes coupled with the hope that it might hold the key to unlocking the biggest mysteries of our world. 

The underlying assumption, of course, is that the main objective of science and technology is to help solve the biggest challenges faced by humans: climate change, cancer, incompetent rulers, to name a few. The objective is not necessarily the creation of maximum economic value; however, in theory the two are not mutually exclusive.

We all agree that there is a lot of hype about AI, but that is what it is – hype! The general consensus in ‘The Community’ – AI experts and researchers – is that not only are we quite far from general AI, but that we are headed in the wrong direction. Andrew Ng thinks we might reach Alpha Centauri before we get AI. Yes, one would assume he knows that the trip takes more than four years even at the speed of light.

To some extent the excitement stems from big data and machine learning. Big data’s popularity is owed to the idea of commoditisation of data at a previously unimaginable scale. Yet, Gary Marcus for one is convinced that big data does not help us answer any of the critical questions in artificial intelligence. He argues that the important questions have more to do with the human ability to comprehend language. As such, the important questions are not about making perfect recommendation engines or building machines that are very good at chess. 

A lot of people mistake a Siri update for progress in AI (I hope not). To be fair to Siri, speech recognition has made remarkable progress and it gets the majority of what we say right in the majority of cases. However, it does not understand what we say, and without meaningful comprehension there is no AI. To reiterate: speech recognition does not equal language comprehension.

That said, getting the words right, even if it is just speech recognition, is hard, and the progress is impressive nonetheless. This raises the question: how is speech recognition improving at such a fast pace? The short answer is that it has definitely improved, but not quite in the way it is generally perceived.

For example, it works best for Caucasian males (native speakers) in quiet surroundings. If there is noise in the background, or the speaker is non-native, it does not work nearly as well. There is a very good reason for this, and it is the methodology behind speech recognition: brute force. The power, and the fundamental limitation, of brute force is that it requires copious amounts of data to work; ideally all possible permutations and combinations. Where there is not much data, brute force fails.

Children, on the other hand, do not need a fraction of that amount of data to reach logical conclusions about the world around them. Take human language. Professor Chomsky talks about the ‘poverty of the stimulus’: children acquire language from remarkably sparse input, and yet we as human beings are capable of making infinite sentences with a finite number of words. Machines cannot do this – not yet, and not anytime soon by the looks of it. This is a big problem, and big data is clearly not the answer.
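
To make the point about finite means and infinite output concrete, here is a toy sketch (the grammar and vocabulary are invented for illustration, not Chomsky's actual formalism): a handful of words plus one recursive rule already generates an unbounded set of sentences, which is exactly what a finite lookup table of examples can never cover.

```python
# Toy illustration only (invented vocabulary): a finite word list plus one
# recursive embedding rule produces sentences of unbounded length, so no
# finite database of example sentences can ever cover the language.

SUBJECTS = ["the dog", "the cat", "the child"]
VERBS = ["saw", "chased", "heard"]

def sentence(depth):
    """Build one grammatical sentence with `depth` levels of embedding."""
    s = f"{SUBJECTS[0]} {VERBS[0]} {SUBJECTS[1]}"   # "the dog saw the cat"
    for _ in range(depth):
        s = f"{SUBJECTS[2]} {VERBS[2]} that {s}"    # "the child heard that ..."
    return s

for d in range(4):
    print(sentence(d))
# `depth` can grow without bound, so the set of well-formed sentences is infinite.
```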

If you're thinking that more data will solve the problem, take a moment to consider that data is not always cheap. But for the sake of argument, let us suppose that cost is not a factor. It still won't work.

One simple problem is that it is impossible to create a database of all the sentences that could exist (there are infinitely many), together with what each means in a particular context, so that we could use deep learning to give the machine the ability to comprehend language. It cannot be done.

The idea that a machine can listen to cricket commentary and understand what is going on is pure fantasy right now. Similarly, take medical science: more than 10,000 papers a month come out of cancer research, and it is impossible for individuals to read and understand all of them and connect the dots. Now imagine that machines had the ability to read, synthesize that information and draw logical conclusions. That would be a game changer. This is one of the reasons the obsession with data has to go: the stakes are too high. We need to move beyond data. It is obvious that brute force data processing cannot lead to scientific understanding.

That is also why driverless cars are not going to take over your streets anytime soon. They might work fine in San Francisco on a sunny day, but if the weather is not clear, or you put a car that learnt to drive on the streets of San Francisco onto the streets of India, it won't work the way you expect it to.

This is because it does not have enough data, and the data that it does have is for a completely different style of driving. You cannot compute an algorithm for road rage and then run multiple test repetitions so that the car can learn. This is a major drawback of learning by memorisation – there is no ability to reason independently in a new situation.

Think about it: driverless cars obey all traffic laws at all times, and humans do not. The very idea that ‘abiding strictly by all rules and regulations, without any independent thinking, is not safe’ is impossible for a machine to comprehend, because even if you could somehow encode that inherent contradiction in a logical function, you would still need to train and test the machine on it. This points to a structural flaw within machine learning as a whole. There is a dependence on a training set and a test set, where the test set is designed to test the training set, so the two are more or less similar. Training is controlled, and the objective is memorisation of data; the real test is the real world.

However, you cannot guarantee that a model which works well on the training and test sets will continue to work well in the real world, because the variables there are infinite. Google celebrated when its driverless cars learnt to recognise ‘leaves’. Think about this ‘achievement’, and then think about all the other things like ‘leaves’ that a car would still need to learn.
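
To see that training/test dependence in a runnable form, here is a minimal sketch (assuming NumPy and scikit-learn are available; the data is synthetic and the "city" framing is only an illustrative label): a model that scores well on a held-out test set drawn from the same distribution as its training set can still fail badly once the distribution shifts, which is the point about a car trained in San Francisco being dropped onto unfamiliar streets.

```python
# Synthetic illustration: good test-set accuracy says nothing about a shifted
# real-world distribution, because the test set mirrors the training set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# "Training city" data: two well-separated clusters (an easy environment).
X = np.vstack([rng.normal(-2.0, 1.0, (500, 2)), rng.normal(2.0, 1.0, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("same-distribution test accuracy:", model.score(X_test, y_test))      # close to 1.0

# "New city": the same task, but conditions have shifted (features mirrored).
X_shifted, y_shifted = -X, y
print("shifted-distribution accuracy:", model.score(X_shifted, y_shifted))  # far below chance
```

The test set looks reassuring only because it was drawn from the same world as the training set; change the world and the learned boundary is systematically wrong.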

Another misconception is that big data has given us new algorithms, and that this surely makes things so much better. The truth is that we don't really have new algorithms. For the most part they are the same ones we have been using since the 80s, and the harsher truth is that those algorithms are themselves just variations of ones from six decades ago.

As mentioned earlier, the real difference is the commoditisation of data at scale; hence the obsession with big data and bigger machines. It is now profitable to churn through that data with brute force, but the resulting systems are not even close to being comparable to human intelligence.

Yes, Deep Blue can beat Kasparov and DeepMind's AlphaGo can beat Lee Sedol, because they can search or play against themselves millions of times within a strictly defined set of parameters, but that is meaningless for real-world applications.

Anyone who has worked on recommendation engines will tell you that they, too, are right most of the time for most people, and no one is bothered if they make a mistake every now and then. Who cares if Amazon suggests a venison cookbook to a vegetarian? Probably no one. However, a driverless Uber on the roads of Mogadishu or Dublin does not afford us the luxury of being wrong even once. Tesla realised this well before the recent crash: they scaled back their ambitions and put in restrictions that did not allow the cars to be used on particular kinds of residential streets.

These are all problems that will be solved at some point, but that will take a decade or two, or even three. Some believe it might take half a century, which is still ambitious given the complexity and scale of the problem. There is an inverse correlation between the enthusiasm to solve a problem and the estimated time required to solve it. Yann LeCun fears that this optimism about AI might turn into disappointment – another AI winter. The hype has come and gone before, and people might lose interest again because the problem is too hard.

So the next time you come across something that promises driverless cars or robots in your house – think about Alpha Centauri. 

 Syed Ghazi 

