The eye is one of the human sensors, like our ears, nose, tongue, and skin; it is our sensor for vision. Most real-world inputs to these sensors arrive as images, text, and audio. Some optical illusions make the eye perceive an image as coloured when it is actually black and white.
Digital photographs are stored in 3 channels (RGB). The same image printed in a newspaper uses 4 colours (C, M, Y, K), and in a magazine it may be printed with 7 inks. A portrait can contain millions of distinct colours. DNNs typically work on 3-channel inputs (RGB or LAB colour channels), but a DNN does not fundamentally depend on colour: training on black-and-white images can give results comparable to training on colour images, because the network learns from edges and gradients rather than from colour itself.

It is useful to decompose an image into as many channels as needed, and this is effectively the first step in a DNN: a single input is converted into multiple channels, and more channels mean more extracted features.

Why only 3x3 kernels? 3x3 convolutions are heavily accelerated in hardware and software (notably by NVIDIA and the research community), while larger kernels put much more pressure on the hardware; a larger receptive field can instead be built by stacking 3x3 kernels. The terms channel, kernel, and feature extractor are often used interchangeably here.
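To make the channel idea above concrete, here is a minimal NumPy sketch that splits an RGB image into its three channel planes (the image here is synthetic; a real one would be loaded with a library such as Pillow):

```python
import numpy as np

# A synthetic 4x4 RGB image with values in 0-255 (stand-in for a real photo).
img = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Split the H x W x 3 array into its three colour channels.
r, g, b = img[..., 0], img[..., 1], img[..., 2]

print(img.shape)  # (4, 4, 3): height, width, channels
print(r.shape)    # (4, 4): each channel is a single-plane image
```

Each plane is itself a grayscale image; a DNN treats every such plane as one input channel.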
For example, a feature extractor for English text could use 26 channels, one per letter of the alphabet, with each letter treated as a feature. Extracting a feature such as the letter 'e' needs a filter that responds only to 'e' and mutes all other features. Similarly, a music orchestra can be recorded with as many channels as there are instruments, with each instrument's sound routed to its own channel.
How many kernels do we need? Using many kernels (more output channels) from the beginning can give superior performance, but the computation grows accordingly.
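The cost trade-off can be seen by counting weights. A sketch, assuming a standard convolutional layer where each output channel has one k x k kernel per input channel (the function name is illustrative):

```python
def conv_params(in_channels, out_channels, k=3):
    """Weight count of one conv layer: a k*k kernel per (input, output) channel pair."""
    return k * k * in_channels * out_channels

# Doubling the number of kernels doubles the parameters (and compute) of the layer.
print(conv_params(3, 32))  # 3 * 32 * 9 = 864
print(conv_params(3, 64))  # 3 * 64 * 9 = 1728
```

So picking the number of kernels is a direct accuracy-versus-compute decision.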
[[0 0 0] [0 1 0] [0 0 0]] ==> identity kernel
[[0 0 0] [1 1 1] [0 0 0]] ==> horizontal kernel
[[0 1 0] [0 1 0] [0 1 0]] ==> vertical kernel
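The three kernels above can be tried directly with a small hand-rolled convolution; a minimal NumPy sketch (stride 1, no padding, so a 5x5 input gives a 3x3 output):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (cross-correlation): slide the kernel, sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

identity   = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
horizontal = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])

img = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
print(conv2d(img, identity))  # reproduces the 3x3 centre of img unchanged
```

The identity kernel copies each centre pixel through, while the horizontal and vertical kernels sum along a row or column, responding strongly where such lines exist.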
Channels Vs Features:
Features (edges and gradients) matter more than colour. For text, a channel would hold every occurrence of the letter 'e' in the input, while the feature itself is the single letter 'e'. Kernel values need to be learned to match what the network is required to detect; convolving a kernel over the image makes its feature stand out (brighter) wherever there is a match.
Receptive field (RF): consider convolving a 5x5 image with a 3x3 kernel. Every 3x3 convolution increases the receptive field by 2 pixels, so roughly 200 such layers are needed to take a 400x400 image down to 1x1 (a receptive field of 400). After two 3x3 convolutions on a 5x5 image, the global receptive field of the output is the full 5x5, while the local receptive field (with respect to the layer just before) is 3x3.
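The receptive-field arithmetic above can be sketched as a short loop (the function name is illustrative), assuming stride-1 3x3 convolutions where each layer grows the receptive field by k - 1 = 2:

```python
def layers_to_cover(image_size, k=3):
    """Count stride-1 k x k conv layers until the receptive field spans the image."""
    rf, layers = 1, 0          # a pixel starts with a receptive field of 1
    while rf < image_size:
        rf += k - 1            # each 3x3 conv adds 2 pixels of receptive field
        layers += 1
    return layers

print(layers_to_cover(5))    # 2 layers: 5x5 -> 3x3 -> 1x1
print(layers_to_cover(400))  # 200 layers to reach a 400-pixel receptive field
```

This is why practical networks insert pooling or strided convolutions instead of stacking hundreds of plain 3x3 layers.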
Our brain's visual system works in four stages, and a neural network follows much the same hierarchy:
edges and gradients
textures and patterns
parts of objects
complete objects
We will learn more about CNN in the next section.