6 Oct 2017

How do computers see the world?

Vision is one of the most important human senses. We understand the world largely by looking at it. Visual perception starts in the eyes, which capture the surrounding scene and pass it to the brain, where it is processed and interpreted.
But our ability to perceive the world through eyesight is limited, because our eyes capture only visible light, which is just one part of the electromagnetic spectrum. A typical human eye responds to wavelengths from about 390 to 700 nm.
Role of computers in our daily lives: Computers play an important role in our daily lives. Their analytical capability is what makes them so vital and useful to us. But a computer is just a machine, and a machine does whatever it is programmed to do. Over the last few decades, researchers have tried to make computers intelligent, and a new branch of computer science called “artificial intelligence” emerged. If a device can make an appropriate decision based on its inputs without being given explicit instructions, we can say it is intelligent. The inputs carry information, and the decisions are outcomes learned from those inputs. So we focus on the inputs, which contain the information.
What kind of information? It can be text, images, speech and so on. Here we focus on vision, so the input must reflect what we see. In simple words, we start with images: cameras act as eyes for the computer.
The computer sees things as images, which are composed of basic units called “pixels” arranged in a matrix. Each pixel holds an intensity value. The “resolution” of an image tells us how many pixels it contains, usually given as width by height.
For example, an image that is 1280 pixels wide and 720 pixels high (1280x720) contains 921,600 pixels. More pixels mean more detail. All else being equal, a 5-megapixel camera can capture a more detailed image than a 3-megapixel camera.
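
To make the idea of "pixels in a matrix" concrete, here is a minimal Python sketch. It assumes an image is stored as a plain list of rows, each row being a list of intensity values; the helper names make_image and pixel_count are made up for this illustration and do not come from any library.

```python
def make_image(width, height, value=0):
    """Build a height x width matrix (a list of rows) filled with one intensity value."""
    return [[value for _ in range(width)] for _ in range(height)]


def pixel_count(image):
    """Total number of pixels = number of rows * number of columns."""
    return len(image) * len(image[0])


image = make_image(1280, 720)   # 1280 pixels wide, 720 pixels high, all black (0)
image[0][0] = 255               # set the top-left pixel to white
print(pixel_count(image))       # 1280 * 720
```

Real image libraries use more compact storage than nested lists, but the layout (rows and columns of numbers) is the same idea.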

Now we know that computers see images as sets of numbers. It is very hard to interpret this data in its raw form. The branch of science that focuses on processing image data and generating insights from it is called “computer vision”. Most of us have heard of color images and black-and-white images. Color images carry color information for each pixel, and each color is composed of blue, green and red values. Black-and-white, or monochrome, images are grayscale images composed of shades of grey varying from black (0) to white (255 in an 8-bit image), which carry only intensity (brightness) information. There are many other color spaces in which images can be represented. A color image can be converted to grayscale by applying this formula to every pixel: Grayscale = (0.3 * Red) + (0.59 * Green) + (0.11 * Blue)
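
The grayscale formula above can be sketched in a few lines of Python. This is only an illustration: it assumes each pixel is stored as a (red, green, blue) tuple of values from 0 to 255, and the function names to_gray and to_grayscale are hypothetical. The result is rounded to the nearest whole number so that it stays a valid 8-bit intensity.

```python
def to_gray(red, green, blue):
    """Apply the weighted sum from the text to one pixel and round to an integer."""
    return round(0.3 * red + 0.59 * green + 0.11 * blue)


def to_grayscale(image):
    """Convert every (red, green, blue) pixel in a list-of-rows image to one intensity."""
    return [[to_gray(r, g, b) for (r, g, b) in row] for row in image]


# A tiny 2x2 color image: white, black, pure green, pure blue.
color_image = [
    [(255, 255, 255), (0, 0, 0)],
    [(0, 255, 0), (0, 0, 255)],
]
gray_image = to_grayscale(color_image)
print(gray_image)
```

Green gets the largest weight and blue the smallest, so pure green comes out much brighter than pure blue in the grayscale result.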

Now we are going to see how to understand image data.

Suppose we have a picture of a cat or a dog. We can tell at once which one it shows. How is that possible? Because we already know what a cat and a dog look like. The same logic applies to a computer. First, we have to tell the computer what a cat and a dog look like. That is not easy, because images may differ in orientation, scale and illumination, and the same object can come in different varieties (different breeds of cats and dogs). We need to define features that can represent the object in all of the above scenarios. Many feature extraction techniques exist, such as Histogram of Oriented Gradients (HOG), Haar features (used to detect faces) and Local Binary Patterns. Feature selection is an important step because it defines the object of interest.
The features from one image of an object will differ from those of another image of the same object, but the two sets will be related in ways we can exploit. Machine learning algorithms help us learn such relationships in the data. To extract meaningful relationships we need more data. The process of classification can be summarized like this: we use many images of the object to extract meaningful features, and then these features, together with their class labels (which say whether a set of features came from a cat image or a dog image), are fed to a learning algorithm.
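
The pipeline described above (extract features, attach labels, feed both to a learning algorithm) can be sketched with a deliberately tiny example. Everything here is an assumption made for illustration: the two hand-made features (mean intensity and mean horizontal gradient) stand in for real descriptors such as HOG, the "learning algorithm" is a simple nearest-centroid rule, and the 2x3 images labelled "smooth" and "striped" are toy stand-ins for cat and dog photos.

```python
from math import dist


def extract_features(image):
    """Toy feature vector: (mean intensity, mean absolute horizontal gradient)."""
    pixels = [p for row in image for p in row]
    diffs = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    return (sum(pixels) / len(pixels), sum(diffs) / len(diffs))


def train(images, labels):
    """Learn one centroid (average feature vector) per class label."""
    grouped = {}
    for image, label in zip(images, labels):
        grouped.setdefault(label, []).append(extract_features(image))
    return {
        label: tuple(sum(values) / len(values) for values in zip(*features))
        for label, features in grouped.items()
    }


def classify(model, image):
    """Return the label whose centroid is closest to the image's features."""
    features = extract_features(image)
    return min(model, key=lambda label: dist(model[label], features))


# Toy grayscale training images with their class labels.
training_images = [
    [[100, 100, 100], [100, 100, 100]],
    [[120, 121, 120], [119, 120, 121]],
    [[0, 255, 0], [0, 255, 0]],
    [[255, 0, 255], [255, 0, 255]],
]
training_labels = ["smooth", "smooth", "striped", "striped"]

model = train(training_images, training_labels)
print(classify(model, [[10, 240, 10], [10, 240, 10]]))
print(classify(model, [[90, 91, 90], [90, 90, 90]]))
```

With real photos the features would need to be far more robust, which is exactly the difficulty the next paragraph discusses.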

The algorithm learns the relationships in the features and can then classify a new image of a cat or a dog. One of the main difficulties in classification is feature selection, because hand-crafting the features is hard. Haar features are good for detecting faces but not vehicles. For each category of object, we need a specific set of features. Can we automate this process? Can the computer decide which features to extract? Recent innovations in computer vision and machine learning say yes: neural networks help us do this. A neural network is a type of machine learning algorithm that can learn the features by itself and classify with them. A neural network consists of an input layer where the input data (images) is fed in, a hidden layer, and an output layer where classification takes place. If a neural net contains more than one hidden layer, we say it is a deep neural network (deep learning). Deep neural nets usually contain many hidden layers, so they can learn more significant features. Learning with neural nets is computationally expensive and needs a huge amount of data. We live in the era of big data and GPUs, which makes this possible.
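
As a rough sketch of the layer structure described above, the following Python code passes a four-pixel "image" through one hidden layer and an output layer. The weights in params are arbitrary placeholder numbers chosen for this example, not learned values; in a real network, training would adjust them from data. The ReLU and softmax functions are common choices assumed here, not something the text prescribes.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Keep positive values, replace negative ones with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(pixels, params):
    """Input layer -> hidden layer -> output layer."""
    hidden = relu(dense(pixels, params["w_hidden"], params["b_hidden"]))
    return softmax(dense(hidden, params["w_out"], params["b_out"]))


# Placeholder (untrained) weights: 4 inputs -> 3 hidden units -> 2 output classes.
params = {
    "w_hidden": [[0.5, -0.2, 0.1, 0.0], [-0.3, 0.8, 0.0, 0.2], [0.1, 0.1, 0.1, 0.1]],
    "b_hidden": [0.0, 0.1, -0.1],
    "w_out": [[1.0, -1.0, 0.5], [-0.5, 1.0, 0.2]],
    "b_out": [0.0, 0.0],
}

pixels = [0.0, 0.5, 1.0, 0.25]  # a tiny "image" with intensities scaled to 0..1
probabilities = forward(pixels, params)
print(probabilities)  # one probability per class, e.g. [P(cat), P(dog)]
```

A deep network would simply stack more hidden layers between the input and the output, each one built from the same weighted-sum step.
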
There is a lot of research and innovation happening in computer vision. Computers can now generate images, generate text that describes a scene, classify and recognize images and actions, and even drive a car without human intervention.