Artificial Intelligence – Deep Learning Systems

This tutorial is intended to provide an intuitive understanding of Deep Learning to non-technical business managers who are trying to make sense of Artificial Intelligence and learn enough to be able to have a conversation about the potential benefits and the challenges of AI technology in their business.

What is Deep Learning and how it works

Deep Learning is a subset of Machine Learning, which is itself a branch of AI technology. It loosely emulates the human brain in order to process data and extract patterns. A human brain contains neurons, shown schematically in the diagram below. Dendrites receive signals from other neurons. The cell body sums the received signals and sends a signal to the Axon if certain conditions are met, for example if the total signal strength is high enough. The Axon tips distribute the Axon signal by releasing neurotransmitters such as Dopamine, Glutamate and Serotonin (there are approximately 40 types), which are picked up by the dendrites of other neurons.

The network of connections is shown schematically in the diagram below. Here electrical signals from the eye feed into the dendrites of the brain’s visual cortex. Each neuron processes the signals it receives and distributes them to other neurons in the brain.

Neuron connections in a brain

A human brain contains approximately 86 billion neurons, each connected with up to 10,000 other neurons, creating a network with hundreds of trillions of connections.

A Deep Learning system similarly consists of logical nodes and connections. It is often referred to as a Neural Network and is shown below.

Deep Learning Neural Network

This diagram shows a Neural Network in the context of image analysis. The Input Nodes shown at the top of the diagram are like the Dendrites of the human visual cortex neurons.

One thing that should be noted is the box labeled the Convolution Layer; a network containing it is called a Convolution Neural Network. Convolution Networks are what most Deep Learning systems for image analysis employ. The box called the Convolution Layer represents a complex arrangement of what one might call filters. This too is analogous to how a human brain functions. Scientists discovered that when an eye is looking at a vertical line only certain neurons fire (get activated). When it sees a horizontal line another set of neurons fires. It turns out that the visual cortex neurons specialize in recognizing different shapes. Deep Learning scientists have emulated this by creating logical mini-algorithms which function as filters whose objective is to detect various image features such as lines, shapes, textures, edges and colors. There may be over a hundred different filters employed to pre-process the image signals before they are fed to the rest of the neural network. We will discuss this in more detail in subsequent tutorials.

Although Deep Learning Convolution Neural Networks have far fewer nodes than a human brain and typically contain a couple of million connections, they can perform amazing tasks. They can recognize images and sounds, understand languages and speak without human involvement.

There are two categories of Deep Learning systems: Supervised and Unsupervised. Supervised learning requires training the system using data that is labeled, such as cat, bird, tumor, jazz music, snow or cloud. This is similar to parents teaching their children. Unsupervised networks do not require labeled data; they take unlabeled data and discover patterns, many of which may not be obvious even to a human.

Supervised Deep Learning

The following discussion illustrates the challenge a Deep Learning system faces when recognizing images. Let’s look at the four images below.

A human can easily identify dogs and a cat in this image

A human being can easily recognize that these are images of 3 dogs and a cat. A computer doesn’t look at the images the way a human does. It sees them as a series of ones and zeros, as shown in the four pictures below. In fact a computer would not see the images arranged in rectangles but as a series of 8-bit numbers, one for each of the cells (pixels) in each picture. For example, the first 6 cells in the first row of the top left picture would look as follows to the computer:

00000000, 00000000, 00000000, 00000000, 00000001, 00000000.

The entire picture shown here, which contains 1,200 elements/pixels, would be represented by a string of 1,200 8-bit sequences, one 8-bit string (byte) for each pixel. If this were a real picture and not a line drawing, each pixel would have a value between 0 and 1 representing the various levels of gray. If this were a color picture there would be 1,200 values for each of the 3 primary colors: Red, Green and Blue. A high definition picture would have millions of values.
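To make the byte picture concrete, here is a minimal Python sketch. The six pixel values, the 30 x 40 layout and the 4000 x 3000 photo size are illustrative assumptions; the text only says the drawing has 1,200 pixels and that a high definition picture has millions of values.

```python
# One row of a black-and-white line drawing: 0 = white, 1 = black (illustrative values).
row = [0, 0, 0, 0, 1, 0]

# Each pixel is stored as one 8-bit number (a byte), written out here in binary.
as_bytes = [format(value, "08b") for value in row]
print(", ".join(as_bytes))

def values_needed(width, height, channels=1):
    """How many numbers describe an image: one per pixel per color channel."""
    return width * height * channels

gray_small = values_needed(30, 40)          # assumed 30 x 40 layout = 1,200 pixels
color_small = values_needed(30, 40, 3)      # one value each for Red, Green and Blue
color_12mp = values_needed(4000, 3000, 3)   # assumed 4000 x 3000 photo (12 megapixels)
print(gray_small, color_small, color_12mp)
```

The same drawing needs three times as many numbers in color, and a phone photo needs tens of millions.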

Now imagine writing instructions that tell a computer, which only understands bytes, how to decide if a string of millions of bytes is a dog or a cat. The task is practically impossible. Even if one did write such a program, with instructions about the number of legs, the length of the tail and ears, the colors, etc., the program would not work for other pictures of dogs. For example, a dog with a missing leg, a cat lying down or a dog rolling upside down and playing would not be recognized. A computer program would have to anticipate millions of such variations to work, a task impossible to realize. So can a computer tell a cat from a dog, a human face from a tree or one human from another? In a word: yes.

Digitized images of dogs and a cat


Neural Networks.

The way a computer can recognize images and other content is by emulating a human brain. Each neuron in a human brain receives signals from thousands of other neurons, processes them (typically adding their strengths plus adding or subtracting a small amount depending on the input strength) and sends signals to thousands of other neurons. Computer scientists decided to emulate the biological brain and create something similar called a Neural Network, shown below.

This neural network consists of nodes, shown as circles, which perform simple mathematical functions like addition. In a typical Neural Network there are usually several layers of such nodes, shown as columns here. The first layer is called the Input Layer. The last layer, consisting of 2 nodes here, is the Output Layer and is the one that provides an answer: one node for cat and one for dog in this example. If the system is to identify digits, the Output Layer will have 10 nodes, one for each digit. The other layers (columns here) are called the Hidden Layers. There are typically 1 – 3 Hidden Layers.

Neural Network architecture

Neural Network Operation

The neural network operates as follows. Every pixel in a picture being analyzed “sends” a value between 0 and 1, depending on the intensity of the image in that pixel, to one of the input nodes as shown. In this case, since the image is very simple, the pixels are either black or white, so the values sent to the Input Nodes are 1 or 0. In the real world each pixel will consist of 3 color values (Red, Green and Blue) and each color value will have an intensity between 0 and 1.

Each Input Node is connected to each node in the first Hidden Layer as shown. It takes the value it received from the picture and multiplies it by a number called a weight, shown here as w1, w2, w3, etc. If there are 100 nodes in the first Hidden Layer, each node in the Input Layer will have 100 different weight values, one for each node in the Hidden Layer to which it is connected, and it typically is connected to all of the nodes in the first Hidden Layer. Every Input Node does the same and each has its own weight values. The weight values can be random at first, but they are eventually adjusted as explained below.

Each node in the first Hidden Layer receives the various numbers from the Input Layer nodes and sums them. It then applies a mathematical operation called an activation function, whose role is to compress the value of the sum to something manageable, such as a value between 0 and 1. As a simplified illustration, if the sums in the hidden nodes produce values of 1, 2, 5 and 10, the activation function might convert them to 0.1, 0.2, 0.5 and 1. Each hidden node also applies a non-linear function to the sum of its inputs. For example, if the sum happens to be negative it replaces it with a zero.
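The two behaviors described above (replacing negative sums with zero and squeezing a sum into the range 0 to 1) can be sketched in Python. The text does not name specific functions; the two shown here, commonly called ReLU and sigmoid, are assumed as typical examples.

```python
import math

def relu(total):
    """Replace a negative sum with zero; leave positive sums unchanged."""
    return max(0.0, total)

def sigmoid(total):
    """Squash any sum into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-total))

sums = [-3.0, 0.0, 1.0, 10.0]   # illustrative node sums
print([relu(s) for s in sums])
print([round(sigmoid(s), 3) for s in sums])
```

Large positive sums end up close to 1 and large negative sums close to 0, which keeps the numbers passed between layers manageable.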

The resulting value is then multiplied by weights v1, v2, v3, etc. as shown and forwarded to the next layer of nodes. Each of the nodes in the next layer again sums the received values, applies an activation function, multiplies the result by weights p1, p2, p3, etc. and forwards the outputs to the Output Layer nodes, which here are called Cat and Dog.

Suppose we present a picture of a Cat at the input to the system. The system performs the actions just described and may generate a value of 0.3 in the Cat node and 0.7 in the Dog node. This implies that the system decided the picture has a 70% probability (0.7) of being a Dog. Clearly this is an error. The node called Cat should have had a value of 1 and the node called Dog a value of 0.
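The flow from pixels to the Cat and Dog nodes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: it uses four pixels and a single hidden layer of three nodes instead of the two hidden layers in the diagram, every weight value is made up, and the final step (often called softmax) is one common way, assumed here, of turning output sums into values that add up to 1.

```python
import math

def relu(x):
    """Activation: a negative sum becomes zero."""
    return max(0.0, x)

def layer(inputs, weights):
    """Each node sums its weighted inputs; weights[j][i] links input i to node j."""
    return [sum(w * x for w, x in zip(node_weights, inputs)) for node_weights in weights]

def softmax(values):
    """Turn raw output sums into probabilities that add up to 1."""
    exps = [math.exp(v - max(values)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(pixels, hidden_weights, output_weights):
    hidden = [relu(s) for s in layer(pixels, hidden_weights)]
    return softmax(layer(hidden, output_weights))

pixels = [0.0, 1.0, 1.0, 0.0]                     # four black-or-white pixels
hidden_weights = [[0.2, -0.5, 0.1, 0.4],          # made-up weights, one row per hidden node
                  [0.7, 0.3, -0.2, 0.1],
                  [-0.3, 0.8, 0.5, -0.6]]
output_weights = [[0.5, -0.4, 0.3],               # row for the Cat node
                  [-0.2, 0.6, 0.4]]               # row for the Dog node
cat, dog = forward(pixels, hidden_weights, output_weights)
print(f"cat={cat:.2f} dog={dog:.2f}")
```

With untrained, arbitrary weights the answer is essentially arbitrary too, which is why the weights have to be learned.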

Learning Process

The way to make a DL system perform correctly in the above example is to teach it what a dog and a cat look like. Parents teach their children by repetition and after a while a child learns how to distinguish objects. DL systems also require repetitive teaching (training). The first step in the training is calculating the error the system produced. The error is the difference between what the Output Node value should be and what it is. The Cat and Dog errors are added up to derive a total error for this picture of a cat. In actual DL the differences between the right value and the actual value are squared and then added to create a total error. In this case the errors would be (1-0.3) squared or 0.49 for the Cat Node and (0-0.7) squared or 0.49 for the Dog Node. The total error for this image is the sum, or 0.98.
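The error calculation for the cat picture can be checked with a short Python sketch. The function name is illustrative; the target and output values are the ones used in the example above.

```python
def total_squared_error(targets, outputs):
    """Sum of squared differences between the desired and the actual output values."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

# A cat picture: the Cat node should be 1 and the Dog node 0.
targets = [1.0, 0.0]
outputs = [0.3, 0.7]   # values the network actually produced
error = total_squared_error(targets, outputs)
print(round(error, 2))
```

A perfect answer (Cat 1, Dog 0) would give a total error of zero.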

Now the fun begins. We have to find the values of the various weights (w1, w2,…, v1, v2,… and p1 and p2 shown in the diagram above) that make the error equal zero, as that would mean that the Cat Node would have a value of 1 and the Dog Node a value of 0. In reality we don’t get to zero but close, typically to 0.01 – 0.05. The beauty of DL is that we don’t have to find the weights that minimize the error ourselves; the DL system does it automatically, following algorithms designed by computer programmers.

The Neural Network performs what is called a Back Propagation algorithm (illustrated above), automatically adjusting the values of all the weights in order to minimize the error, so that the value for the Cat node is as close to 1 and the value for the Dog node as close to 0 as possible. This is done by varying the p weights (p1, p2) first, then the v1, v2, v3,… weights and then the w1, w2, w3,… weights (i.e. going backwards, hence Backpropagation) and calculating the error each time a weight value is changed.

A simple way to picture this is as follows. The system can only increase or decrease the value of a weight, but it does not know which it should do, so it tries both. It increases the value and calculates the error. It also decreases the value and calculates the error. The action that decreases the error the most is the one it accepts as the new value of that weight. If the error increases or stays the same, the value of that weight does not change. (Real systems reach the same decision with calculus, computing for each weight the direction that lowers the error instead of literally trying both.) This is done for every weight in the system, going back all the way to the beginning. Once this is done the process is repeated, starting again at the end (weights p1, p2) and going backwards. This is because once the values of the v and w weights have changed, there may be better values for the p weights, and those have to be found. Once the system has gone back and forth many times (50 – 100 is typical), the Cat and Dog Node values will be closer to what they should be.

Then a second, different image of a cat or an image of a dog is presented and the process is repeated. Even if the first time around the system adjusted the weights so well that the error was zero, it is likely that the next picture of a cat will not result in a zero error. Although the system learned the best combination of weights for the first cat, those weights are not likely to work well for the second, different image of a cat. The error is likely to be even worse for a dog picture.

This is why many thousands of cat and dog pictures need to be presented until the error becomes acceptable. It is not likely to be zero though. This is called training of the Deep Learning Neural Network. As you can see, this is a lot of calculations, but this is what computers are good at doing. The most remarkable thing is that the system can learn and from then on be able to recognize cats and dogs no matter if they have a leg missing, are lying down or are sitting in a tree.
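The trial-and-error weight adjustment described above can be sketched in Python. This is a simplified sketch under stated assumptions, not how production systems are built: the network has no hidden layers, the two 3-pixel "pictures" and the step size are made up, and each weight is nudged up and down by a fixed step. Real backpropagation computes the direction of each adjustment mathematically instead of trying both.

```python
import math

def predict(weights, pixels):
    """Two output nodes (cat, dog); each is a squashed weighted sum of the pixels."""
    n = len(pixels)
    sums = [sum(weights[node * n + i] * pixels[i] for i in range(n)) for node in range(2)]
    return [1.0 / (1.0 + math.exp(-s)) for s in sums]

def total_error(weights, examples):
    """Squared error summed over every labeled example."""
    return sum(
        (t - o) ** 2
        for pixels, target in examples
        for t, o in zip(target, predict(weights, pixels))
    )

def train(weights, examples, step=0.1, passes=100):
    weights = list(weights)
    for _ in range(passes):
        # Visit the weights from last to first, echoing the "backwards" order in the text.
        for k in reversed(range(len(weights))):
            best_error = total_error(weights, examples)
            best_value = weights[k]
            for candidate in (weights[k] + step, weights[k] - step):
                trial = weights[:k] + [candidate] + weights[k + 1:]
                trial_error = total_error(trial, examples)
                if trial_error < best_error:   # keep a change only if the error drops
                    best_error, best_value = trial_error, candidate
            weights[k] = best_value
    return weights

# Made-up 3-pixel "pictures": label [1, 0] means cat, [0, 1] means dog.
examples = [([1.0, 0.0, 1.0], [1.0, 0.0]), ([0.0, 1.0, 1.0], [0.0, 1.0])]
start = [0.0] * 6
trained = train(start, examples)
print(total_error(start, examples), total_error(trained, examples))
```

Because a change is kept only when it lowers the error, the total error can never go up from one pass to the next.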

Real World applications need Convolutional Neural Networks

The self-learning described above suffers from a problem, however. Let’s do a quick estimate of what the above process may look like in the real world. Suppose we’re going to build a DL system to analyze a picture from a smartphone with 12 Megapixel resolution. Each of the 12 million pixels would have to be sent to a separate Input Node, as each pixel carries some information about the picture. This means 12 million Input Nodes. Now let’s assume the first Hidden Layer contains about 50% as many nodes as the Input Layer, which is a reasonable assumption. This means there are 6,000,000 nodes in the first Hidden Layer. Suppose the second Hidden Layer contains 5% of the number of nodes in the first Hidden Layer, which results in 300,000 nodes. Every node in one layer is connected to every node in the next, so the total number of connections is 12,000,000 x 6,000,000 + 6,000,000 x 300,000, or roughly 74 trillion. This approaches the number of connections in a human brain! This is also the number of weights which need to be manipulated to get a value close to 1 for a cat or for a dog. This is a non-starter. The amount of computing power needed to process this many parameters, even if possible, is huge and would make recognizing images ridiculously uneconomical.

The first step in reducing the numbers is to lower the number of input parameters, pixels in the case of image recognition. This is why Deep Learning networks work with image sizes of 256 x 256 pixels or lower, which reduces the number of pixels from 12 million to 256×256=65,536. But even that would require about 65,536 x 32,000 + 32,000 x 1,600 = 2.1 billion weights if we assume 2 hidden layers sized as in the previous example. Still a huge number.
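A quick way to sanity-check weight estimates like these is to count one weight per connection between each pair of adjacent, fully connected layers. The Python sketch below does that; the function name is illustrative, and the 2 output nodes (cat and dog) are included for completeness.

```python
def count_weights(layer_sizes):
    """One weight per connection between each pair of adjacent, fully connected layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# 12-megapixel photo: input layer, two hidden layers (50%, then 5%), 2 output nodes.
full_photo = count_weights([12_000_000, 6_000_000, 300_000, 2])

# 256 x 256 image with hidden layers of 32,000 and 1,600 nodes.
small_image = count_weights([256 * 256, 32_000, 1_600, 2])

print(f"{full_photo:,}")
print(f"{small_image:,}")
```

Almost all of the weights sit between the input layer and the first hidden layer, which is why shrinking the input matters so much.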

Convolutional Neural Network

The large number of weights problem is dealt with using an invention called the Convolutional Neural Network (CNN). The idea was borrowed from neuroscience. Neuroscientists studying animals (cats, actually) observed that different groups of a cat’s brain neurons fire quickly when the cat is shown a line, a square or a circle. They concluded that an animal’s brain contains zones of neurons which react to specific characteristics of an image, i.e. animals perceive the environment by detecting different shapes and aggregating this information to recognize an image.

Designers of Neural Networks decided to emulate this and create special filters that could recognize specific shapes in an image and feed this information to the Input Nodes of a Neural Network instead of information from every pixel. There are certainly fewer shapes in a picture than there are pixels. This is how Convolutional Neural Networks (CNN) were born. A CNN, shown in the diagram above, processes the pixels before they are presented to the Input Nodes of the Neural Network by passing them through a series of digital filters which detect shapes. Up to 100 different filters are often used, each one designed to detect features such as vertical, horizontal or slanted lines, curves, circles, ellipses, high contrast transitions, texture, etc. By detecting shapes, the amount of information presented to the Input Nodes of a Neural Network is reduced by a factor of 50 – 100. This means that the 65,536 pixels in our example of a 256×256 pixel image would result in about 655 Input Nodes, 325 first Hidden Layer nodes and 16 second Hidden Layer nodes (1/20 of the number in the first Hidden Layer). The total number of weights becomes about 655 x 325 + 325 x 16 = 218,000, which is a manageable level with today’s computer hardware. Most image recognition systems use images with lower resolution than 256 x 256 pixels in order to keep the number of weights around a couple of million.
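To illustrate what a shape-detecting filter does, here is a minimal Python sketch of a single 3 x 3 filter sliding over a tiny image. The image and the filter values are hand-picked assumptions for illustration; in trained CNNs the filter values are typically learned during training instead of being written by hand.

```python
def apply_filter(image, kernel):
    """Slide a small filter over the image; each output is a weighted sum of one patch."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# 5 x 5 image: dark (0) on the left, bright (1) on the right, so it has a vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Filter that responds to dark-to-bright transitions going from left to right.
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

# Filter that responds to dark-to-bright transitions going from top to bottom.
horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

for row in apply_filter(image, vertical_edge):
    print(row)
```

The vertical-edge filter produces large values where the edge is and zero where the image is uniform, while the horizontal-edge filter produces only zeros on this image. Each filter reports "my shape is here" or "not here", which is far less information than the raw pixels.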

Transfer Learning

The other challenge of a Deep Learning system, including the CNN, is the amount of training data (pictures in this example) that is needed to optimize all the weights in order to train the system. Typically it will take 10,000 images of cats and dogs, or of any other objects we want the system to recognize, to get an accuracy of 60% (40% error). With 25,000 images, half of cats and half of dogs, the best achieved accuracy can reach 98% (see source 10 below).

One area that has received a lot of attention recently is called Transfer Learning, which aims to reduce the amount of training data required. The idea relies on the fact that CNN-based Deep Learning systems, which are the norm today, use the filters we mentioned earlier to distinguish shapes and patterns. Before an image or other data to be analyzed is presented to the system, it passes through up to 100 filters, each one designed to detect certain characteristics in the content, as described above. In the case of an image, the filters are designed to detect lines (at different angles), curves (to the right, to the left, vertical, horizontal), circles, texture, colors, different types of contrast transitions, certain feature combinations, etc.

Images of cats, dogs, cars, humans, airplanes, etc. are nothing but combinations of these features. A new set of objects such as pictures of African animals, human faces, X-ray images or traffic jams is also a collection of shapes, many of which are the same or very similar. What this means is that if someone trained a DL system to recognize cats and dogs with high accuracy, the system essentially learned how to recognize a subset of these shapes. This also means that the optimal weights calculated for cat/dog recognition may be similar to the weights needed for a new category of pictures such as X-rays. The system is not going to look at an X-ray and determine if it shows the presence of a tumor on the lungs, but because a tumor is also a collection of shapes, albeit different from those of a dog, the system is closer to being trained than if one were starting from scratch.

This means that if you can get a set of weight values derived for one category of images, you can use them as a starting point for training on images in a different category. The closer the images used to train the original system are to the new category, the better the starting point. A system trained to recognize buildings is not going to be a good start for X-rays because buildings look nothing like X-rays, but it may be a good starting point for analyzing images of machines. There are over 100 publicly available training data sets, and the list is growing, that can be used to reduce the amount of training data using the Transfer Learning process.
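The benefit of a good starting point can be shown with a deliberately tiny Python sketch. Everything here is an illustrative assumption: a single weight stands in for the millions in a real network, the data is made up, and the "pretrained" value of 1.8 is simply assumed to have come from a similar task.

```python
def train(weight, samples, learning_rate=0.05, steps=5):
    """Fit y = weight * x by repeatedly nudging the weight to reduce the squared error."""
    for _ in range(steps):
        gradient = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= learning_rate * gradient
    return weight

def error(weight, samples):
    """Average squared error of the model y = weight * x on the samples."""
    return sum((weight * x - y) ** 2 for x, y in samples) / len(samples)

# New task: the ideal weight is 2.0 (made-up data for illustration).
new_task = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

from_scratch = train(0.0, new_task)   # start with no prior knowledge
pretrained = 1.8                      # weight assumed to be learned on a similar task
transferred = train(pretrained, new_task)

print(error(from_scratch, new_task), error(transferred, new_task))
```

After the same small number of training steps, the run that started from the borrowed weight ends up with the lower error, which is the intuition behind needing less training data.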

(See source 8 below).

Data Annotation

Having data samples is not enough, however. The samples have to be labeled or annotated. In our simple Cat/Dog example each training picture presented to the system has to be labeled as a Cat or a Dog. This is how the system knows how to calculate the error. This is a very simple labeling task. In real business applications the annotation is much more challenging. For example, in one case where Deep Learning Neural Networks are used to detect the presence or absence of cardiac arrhythmia and classify it into one of 16 groups, the labeling is far more involved. Patients’ ECGs are labeled with 279 attributes ranging from the patient’s age to specific parameters of the patient’s ECG scan. (See source 10 below). Typical business deployments are likely to have far fewer attributes but the task is still serious. Data annotation/labeling typically represents 25% of the cost of implementing a Deep Learning system in a business. What is interesting is that 80% of the cost involves various aspects of business data preparation, as shown in the chart below per the Cloudfactory study. (See source 11 below).

Time allocation in Deep Learning projects

The cost of data handling is high because it requires a lot of data samples which have to be labeled by humans. There are a number of companies providing data annotation services. Amazon offers a service called Mechanical Turk which lets you outsource (crowdsource) labeling to people who do this for a fee. There are over 10 other similar services. (See source 12 below). Also, new tools which are themselves Deep Learning systems have recently emerged to help in data labeling/annotation.

Examples of Deep Learning Usage

Deep Learning usage affects just about every industry, from Agriculture to Healthcare and from Banking to Movie Making. We will cover the most important use examples in other tutorials. Here we present a sample list with a few usage and result statistics.

Financial Services

The list of uses is long and includes: risk management, fraud detection, credit scoring, workflow automation, customer service chatbots, algorithmic trading, identity verification, asset management and personalized banking.

Below are some noteworthy numbers on actual deployments and results of AI systems in the Finance Industry.

  • PayPal has reduced its fraud related losses from 1.2% of transactions to 0.32% (14)
  • Only 5.5% of Financial Institutions have deployed AI in risk assessment and fraud detection according to a 2019 study, but 81.8% are interested in deploying AI fraud detection systems. (See source 15 below)
  • An auto lender and a subprime auto lender cut their annual losses (related to risk) by 23% and 25% respectively by using AI-based risk assessment systems. (See source 13 below).
  • Chatbots can automate 30% of the tasks done by a typical customer contact center staff. (16)
  • 50% of financial institutions surveyed see tech companies leveraging AI to enter financial services and view this as a threat. (17)
  • 85% of financial institutions surveyed use AI in some form already. (17)


Healthcare

Some of the uses of Deep Learning in healthcare include image analysis in radiology, diagnostics and process automation.


  • The convolutional neural network achieved 93.4% case-detection accuracy, with a false-negative rate of 2.4%, and automatically learned microlevel features in duodenal tissue, such as alterations in secretory cell populations. (18)
  • When diagnosing asthma, the AI system was more than 90% accurate, while physicians were between 80% and 94% accurate. (19)
  • AI was 87% accurate at diagnosing gastrointestinal disease, while physicians were between 82% and 90% accurate. (19)
  • Use of AI in joint replacement surgery led to cost reductions including a 25% drop in hospital length of stay and 91% reduction in discharges to nursing facilities. (20)

Sales & Marketing

Scoring leads and forecasts, next best sales action suggestions, automating inquiry & leads handling, on-line recommender systems, customer segmentation/targeting, chatbots and even content creation are some examples of DL use.

  • 29% of marketers use AI in some form. (22)
  • 56.6% of AI usage in Marketing is for content personalization, 56.5% for predictive analytics, 49.6% for targeting decisions and 40.9% for customer segmentation. (21)
  • Companies that have pioneered the use of AI in sales claim an increase in leads and appointments of more than 50%, cost reductions of 40%–60%, and call time reductions of 60%–70%. (24)


Business Operations

Employee screening, cyber threat detection, process automation, customer service, data analytics, preventive maintenance, supply chain optimization and material planning are some of the uses of AI in business operations.

  • Despite high expectations for AI, only 23% of respondents have incorporated it into processes and product and service offerings today. An additional 23% have one or more pilots in progress, and 54% have no adoption plans in progress. (23)
  • 25% of 1,200 business executives world-wide are using AI to reduce operating costs, improve productivity and tighten security. (25)
  • The same survey reports a planned 68% increase in AI spending over the next 24 months. (20)

Deep Learning Systems by the numbers. Training Data is the biggest challenge

  • 10,000 images are needed to get 60%+ image recognition accuracy. (1)
  • “Average” image recognition problems take 10,000 – 100,000 images to train. (1)
  • “Hard” problems require 100,000 – 1,000,000 images. (1)
  • A rule of thumb to estimate the number of training samples is 10 times the number of dimensions (variables) contained in an image. A dimension can be a pixel, so a 256×256 pixel image will require about 655,000 images to train. This applies not just to cat and dog recognition but to a general purpose image classifier. A system with 100 variables would take 1,000 training examples. (2)
  • $1.50 for tagging 1,000 images – Google Vision AI image classifier and tagger.
  • ImageNet contains 15 million images in 22,000 categories that can be used for training of DL systems. (4)
  • The MNIST dataset of handwritten digits contains 70,000 images (60,000 for training and 10,000 for testing). (5)
  • 96% of 227 companies surveyed encounter data quality and labeling challenges (6)
  • 63% of companies surveyed have tried to build DL Networks themselves (6)
  • 71% eventually outsourced their projects (6)
  • It costs about $70,000 to label/annotate 100,000 data samples using Amazon’s Mechanical Turk service (7)
  • There are >100 publicly available Transfer Learning data sets. (8)
  • 190 training data sets are available for a wide range of DL projects (9)
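The rule of thumb and the labeling cost quoted in the list above can be combined into a rough budget estimate. The Python sketch below is only back-of-the-envelope arithmetic: the function names are illustrative, and it assumes the quoted cost of about $70,000 per 100,000 samples scales linearly with the number of samples.

```python
def samples_needed(dimensions, multiplier=10):
    """Rule of thumb from the list above: about 10 training samples per dimension."""
    return dimensions * multiplier

def labeling_cost(samples, cost_per_100k=70_000):
    """Scale the quoted figure (about $70,000 per 100,000 samples), assuming linear cost."""
    return samples * cost_per_100k / 100_000

pixels = 256 * 256
print(samples_needed(pixels))                  # samples for a 256 x 256 image
print(samples_needed(100))                     # samples for a system with 100 variables
print(labeling_cost(samples_needed(pixels)))   # rough labeling budget in dollars
```

Estimates like these are a starting point for a conversation with a vendor, not a quote.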

Key considerations for deploying a Deep Learning system in your business.

  • Do you have enough data and is it usable for DL deployment? Most likely your data will need pre-processing (labeling and annotation) which costs money as indicated above.
  • You will need to dedicate a data science specialist to manage the labeling of current and new data as well as to interpret results and adjust the DL system.

Key questions you should be asking of vendors.

  • Is my data useful and if not what’s missing? Good data is more important than the best DL algorithms. AI is just software so garbage in – garbage out rule applies.
  • Have you deployed your Recommendation System in my type of (or similar) business?
  • If the answer to the above is yes then ask: how much data is needed, how many parameters per product/customer and what format for a meaningful result?
  • What is the cost of converting existing data to the format required by a recommender system? This conversion can be very expensive.
  • How much time and cost to deploy a Pilot? Describe a typical deployment timeline and expected results.
  • What are you providing and what are we responsible for? Buying a bunch of AI tools/software is not likely to do much unless you have access to people with AI experience. AI specialists are expensive and the best make between $300,000 – $500,000 in Silicon Valley because there is a shortage of them.
  • How much human help is needed for the system to operate? For example, Netflix’s movie recommender system relies on 40 human experts to categorize each new show. It implements very advanced algorithms but they cannot function without human help.
  • How do you measure the accuracy of results delivered by your system? Can you provide examples?
  • How will I know it is working?
  • What specific business improvement can I expect and when?
  • Will the system require re-training and if so how often? Complex, deep learning AI systems rely on training which can get out of date.