Q&A on Deep Learning concepts

1.       What is Deep Learning?
       Deep learning is a subset of machine learning concerned with neural networks. Its algorithms learn representations of data through the use of neural nets, typically with many layers.
2.       What is Mean?
       Mean: The average of all the numbers, i.e. their sum divided by how many there are.

3.       What is Variance and Standard Deviation?
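       Variance: The variance (σ²) measures how far the values in the data set are from the mean. It is the average of the squared differences from the mean.

       Standard Deviation: The standard deviation (σ) is the square root of the variance. It shows how spread out the numbers are, in the same units as the data.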
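
The definitions of mean, variance and standard deviation can be checked with a few lines of Python. This is a minimal sketch using the population formulas (dividing by the number of values); the function names and sample numbers are illustrative assumptions, not part of the original post.

```python
import math

def mean(values):
    """Average: sum of the values divided by how many there are."""
    return sum(values) / len(values)

def variance(values):
    """Population variance: average squared difference from the mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative numbers, not from the post
print(mean(data), variance(data), std_dev(data))  # 5.0 4.0 2.0
```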

4.       What is Perceptron?
       A perceptron (a type of artificial neuron) takes several binary inputs x1, x2, … and produces a single binary output.
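
As a rough illustration, a perceptron can be sketched as a weighted sum followed by a threshold. The sketch assumes the usual rule (output 1 when w·x + b is positive, otherwise 0); the NAND weights and bias are hypothetical values chosen for the example.

```python
def perceptron(inputs, weights, bias):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

# Hypothetical weights and bias that make the perceptron act as a NAND gate.
nand_weights, nand_bias = [-2, -2], 3
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron([x1, x2], nand_weights, nand_bias))
```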

5.       What is Sigmoid?
       Just like a perceptron, a sigmoid neuron has inputs x1, x2, …, but instead of being just 0 or 1, these inputs can take any value between 0 and 1. Also like a perceptron, the sigmoid neuron has a weight for each input (w1, w2, …) and an overall bias b. Its output is not 0 or 1 but σ(w·x + b), where σ(z) = 1/(1 + e^(−z)) is the sigmoid function, sometimes called the logistic function.
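
A sigmoid neuron differs from a perceptron only in the final step, as this small sketch suggests. It assumes the standard logistic function σ(z) = 1/(1 + e^(−z)); the input, weight and bias values are made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(inputs, weights, bias):
    """Output of a sigmoid neuron: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

print(sigmoid(0.0))  # 0.5
print(sigmoid_neuron([0.5, 0.2], [1.0, -1.0], 0.1))  # a value between 0 and 1
```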

6.       What is Gradient?
       Gradient is another word for "slope". The larger the gradient of a graph at a point, the steeper the curve is at that point. A negative gradient means that the curve slopes downwards. For a function of several variables, the gradient is the vector of its partial derivatives.
7.       Why “Gradient Descent”?
       A sigmoid neuron computes its output from the inputs x (ranging from 0 to 1), the weights w and the bias b. The weights and bias are not known in advance, so we need a method to find good values for them. Gradient descent is one such method: it repeatedly adjusts w and b in the direction that reduces the cost.

8.       Explain “Gradient Descent” and “Stochastic Gradient Descent (SGD)”?
       Both algorithms are methods for finding a set of parameters that minimizes a cost/loss function by evaluating the parameters against data and then making adjustments.

       In “Gradient Descent”: you evaluate all the training samples for each update of the parameters.
       In “SGD”: you evaluate one training sample (or, in the mini-batch variant, a small random batch) before updating the parameters.

       Both help find the “Weights” and “Bias” values that minimize the cost function:

       C(w, b) = (1/(2n)) ∑_x ||y(x) − a||²

       w    -> weights, b -> biases
       n    -> number of training inputs
       y(x) -> the desired (reference) output for input x
       a    -> the output given by the network for a given x, w and b
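
The difference between the two methods can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the "network" is a single linear neuron a = w*x + b, the data come from y = 2x + 1, and the learning rates, step counts and function names are hypothetical choices rather than anything prescribed by the post.

```python
import random

# Toy data from y = 2x + 1 (illustrative, not from the post).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def cost(w, b):
    """Quadratic cost C(w, b) = 1/(2n) * sum((y - a)^2) with a = w*x + b."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / (2 * n)

def gradient(w, b, samples):
    """Gradient of the quadratic cost, averaged over the given samples."""
    n = len(samples)
    dw = sum(((w * x + b) - y) * x for x, y in samples) / n
    db = sum(((w * x + b) - y) for x, y in samples) / n
    return dw, db

def gradient_descent(lr=0.05, steps=2000):
    """Each update uses the gradient over ALL training samples."""
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(steps):
        dw, db = gradient(w, b, data)
        w, b = w - lr * dw, b - lr * db
    return w, b

def sgd(lr=0.02, epochs=2000, seed=0):
    """Each update uses the gradient of ONE training sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)                 # visit samples in random order
        for sample in data:
            dw, db = gradient(w, b, [sample])
            w, b = w - lr * dw, b - lr * db
    return w, b

print(gradient_descent())  # should end up close to w = 2, b = 1
print(sgd())               # should end up close to w = 2, b = 1
```

On this noise-free toy data both methods should approach the same answer; on real data SGD trades noisier steps for much cheaper updates.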

9.       Explain “Training” a CNN model?
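       Training means adjusting the model's weights and biases on a data set so that the cost function is minimized.
       Supervised learning -> the data the model is trained on is labeled with the expected outputs.
       Unsupervised learning -> the model is trained on data without labels.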

10.   What is an epoch?
       Epoch -> a single pass through the entire training data set.

11.   Explain “Back propagation”?
       Backpropagation gives an expression for the partial derivative of the cost function C with respect to any weight w (or bias b) in the network. The expression tells us how quickly the cost changes when we change the weights and biases.

       The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the network.

       Four Fundamental Equations behind BP
       Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives ∂C/∂w^l_jk and ∂C/∂b^l_j.

       We first introduce an intermediate quantity, δ^l_j, which we call the error in the j-th neuron in the l-th layer.

       Backpropagation gives us a procedure to compute the error δ^l_j, and then relates δ^l_j to ∂C/∂w^l_jk and ∂C/∂b^l_j.


       The gradients computed by “BP” are what Gradient Descent/SGD use to update the weights and biases: the method calculates the gradient of the loss function with respect to all the weights (and biases) in the network.
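
The four fundamental equations are named above but not written out. In the standard textbook formulation they read as follows; the extra symbols are assumptions beyond what the post defines: z^l is the weighted input to layer l, a^l its activation, L the output layer, σ' the derivative of the activation function, and ⊙ the elementwise product.

```latex
\begin{aligned}
\delta^L &= \nabla_a C \odot \sigma'(z^L) \\
\delta^l &= \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \\
\frac{\partial C}{\partial b^l_j} &= \delta^l_j \\
\frac{\partial C}{\partial w^l_{jk}} &= a^{l-1}_k \, \delta^l_j
\end{aligned}
```

The first two equations compute the error, starting at the output layer and moving backwards; the last two turn that error into the partial derivatives that gradient descent needs.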

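12.   Examples of “Cost Functions”?
       Common cost functions include the quadratic cost (mean squared error) and the cross-entropy cost. “Gradient Descent” and “Stochastic Gradient Descent (SGD)” are not cost functions but methods for minimizing one, and L1 and L2 Regularization are penalty terms added to a cost function.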

13.   What are “Activation Functions”?
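       An activation function computes a neuron's output from its weighted input z = w·x + b. The sigmoid is one example.
       Softmax: similar to the sigmoid in that its outputs lie between 0 and 1, but it is applied to a whole layer at once: softmax(z)_j = e^(z_j) / ∑_k e^(z_k), so the outputs sum to 1 and can be read as probabilities.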
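
A short sketch of softmax in plain Python. Subtracting the maximum score before exponentiating is a common numerical-stability trick and an addition of this example, not something the post describes; it does not change the result.

```python
import math

def softmax(zs):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(zs)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])          # illustrative scores
print(probs, sum(probs))                  # larger scores get larger probabilities
```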

14.   What are hyper parameters and How to select “Hyper Parameters”?
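       Hyperparameters are settings chosen before training rather than learned from the data, for example: learning rate, number of epochs, batch size (how many samples to process per training iteration), and the regularization parameter.
       They are usually selected by trying several values and comparing the results on validation data held out from training.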

15.   How to initialize “weights” and “biases” for given network/model? why?
       Generally “Weights” and “Biases” are initialized randomly using Gaussian distributions with Mean=0 and Standard Deviation=1.

       Why does the choice of distribution matter? Ex: suppose a neuron has 1000 inputs, of which 500 are 0 and the remaining 500 are 1. Then z = w·x + b is the sum of 501 independent Gaussian terms (500 weights + 1 bias), so its standard deviation is √501 ≈ 22.4. That is, z has a very broad Gaussian distribution, not a sharply peaked one, so |z| is often large, the sigmoid output saturates near 0 or 1, and learning slows down.
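
The arithmetic in this example can be reproduced directly. The sketch below assumes independent zero-mean Gaussian weights and bias; the second call uses a weight standard deviation of 1/√(number of inputs), a commonly used alternative that is not mentioned in the post and is shown only for comparison.

```python
import math

def std_of_z(n_active_inputs, weight_std, bias_std=1.0):
    """Std of z = sum(w_i * x_i) + b when n_active_inputs inputs equal 1,
    the rest equal 0, and weights/bias are independent zero-mean Gaussians."""
    variance = n_active_inputs * weight_std ** 2 + bias_std ** 2
    return math.sqrt(variance)

print(std_of_z(500, weight_std=1.0))                    # about 22.4
print(std_of_z(500, weight_std=1.0 / math.sqrt(1000)))  # about 1.22
```

With the smaller weight spread, z stays in the range where the sigmoid is not saturated.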


16.   When do you stop “training”?
       Negative case: the cost is no longer going down, (or) the test accuracy is no longer improving, (or) the network is overfitting (the cost decreases but the accuracy doesn't improve).
       Positive case: the desired accuracy is reached.
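
One possible way to turn these two cases into a stopping rule is sketched below. The function, the `target` and `patience` parameters, and the choice of three epochs are all hypothetical; real training loops vary.

```python
def should_stop(accuracies, target=None, patience=3):
    """Decide whether to stop, given the held-out accuracy after each epoch.

    Positive case: the latest accuracy has reached `target`.
    Negative case: the last `patience` epochs brought no improvement
    over the best accuracy seen before them.
    """
    if not accuracies:
        return False
    if target is not None and accuracies[-1] >= target:
        return True
    if len(accuracies) <= patience:
        return False                      # not enough history yet
    best_before = max(accuracies[:-patience])
    return max(accuracies[-patience:]) <= best_before

print(should_stop([0.50, 0.60, 0.70, 0.80]))   # False: still improving
print(should_stop([0.70, 0.69, 0.68, 0.70]))   # True: no improvement in 3 epochs
print(should_stop([0.95], target=0.90))        # True: target reached
```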

17.   What is Overfitting? How to avoid it?
       When the network shows a decrease in cost on the training data but the accuracy does not improve as expected with that decrease, we say the network is overfitting or overtraining.

       We need to check how the cost on the training and test data varies as the network learns. If the two move in opposite directions (the cost on the training set goes down while the cost on the test set goes up), it is an indication of overfitting: the network is no longer learning anything that generalizes beyond the training set.

       Another sign is the classification accuracy on the training data. If accuracy on the training set is 100% while accuracy on the test set is ~80%, the network is memorizing the training samples rather than learning their features.

       The obvious way to detect overfitting is to use the approach above, keeping track of accuracy on the test data as the network trains. If we see that the accuracy on the test data is no longer improving, then we should stop training. It might be that accuracy on the test data and the training data both stop improving at the same time; even so, adopting this strategy will prevent overfitting.
  •     Increasing the amount of training data is one way of reducing overfitting, though it is not always possible.
  •     Using regularization techniques such as weight decay (L2 regularization). The idea of L2 regularization is to add an extra term, called the regularization term, to the cost function.
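
For reference, the L2-regularized cost is usually written as below. The symbols are the conventional ones and are assumptions here, since the post does not define them: C_0 is the original (unregularized) cost, λ > 0 is the regularization parameter, n is the size of the training set, and the sum runs over all weights in the network (not the biases).

```latex
C = C_0 + \frac{\lambda}{2n} \sum_w w^2
```

A larger λ favors small weights; a smaller λ favors minimizing the original cost.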


18.   What is “dropout”?
       “Dropout” is another technique to reduce “Overfitting”. It does not change the cost function as L1 and L2 regularization do; instead it changes the network itself.

       Normally, during training we forward-propagate the input x through the network and then backpropagate to compute the gradient. With dropout, this process is modified. We start by randomly (and temporarily) removing some of the hidden neurons in the network, while leaving the input and output neurons untouched. We forward-propagate and backpropagate a mini-batch through the thinned network and update the weights and biases. We then restore the removed neurons, randomly remove a different set of hidden neurons, and repeat the process for the next mini-batch.
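
The "randomly remove, train, restore, repeat" cycle can be sketched with a 0/1 mask over the hidden layer. The names, the keep probability of 0.5 and the activation values are illustrative assumptions; the sketch also omits the rescaling that practical implementations apply to compensate for the removed neurons.

```python
import random

def dropout_mask(n_hidden, keep_prob=0.5, rng=random):
    """Return a 0/1 mask choosing which hidden neurons stay for this mini-batch."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(n_hidden)]

def apply_mask(hidden_activations, mask):
    """Zero out the activations of the temporarily removed neurons."""
    return [a * m for a, m in zip(hidden_activations, mask)]

rng = random.Random(0)
activations = [0.2, 0.9, 0.4, 0.7]        # illustrative hidden-layer outputs
for batch in range(3):
    mask = dropout_mask(len(activations), rng=rng)   # fresh mask per mini-batch
    print(mask, apply_mask(activations, mask))
```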

19.   What are the L1 and L2 norms?
       L1-norm is also known as least absolute deviations (LAD) or least absolute errors (LAE). It minimizes the sum S of the absolute differences between the target values (Yi) and the estimated values (f(xi)):

              S = ∑_i |Yi − f(xi)|

       L2-norm is also known as least squares. It minimizes the sum S of the squares of the differences between the target values (Yi) and the estimated values (f(xi)):

              S = ∑_i (Yi − f(xi))²
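
Both sums are easy to compute side by side. This is a small sketch with made-up target and estimate values; the function names are illustrative.

```python
def l1_loss(targets, predictions):
    """Sum of absolute differences (least absolute deviations)."""
    return sum(abs(y - f) for y, f in zip(targets, predictions))

def l2_loss(targets, predictions):
    """Sum of squared differences (least squares)."""
    return sum((y - f) ** 2 for y, f in zip(targets, predictions))

y_true = [3.0, 5.0, 2.0]     # illustrative targets
y_pred = [2.0, 5.0, 6.0]     # illustrative estimates
print(l1_loss(y_true, y_pred))  # 5.0
print(l2_loss(y_true, y_pred))  # 17.0
```

The single large error (4) contributes 4 to the L1 sum but 16 to the L2 sum, which is why least squares is more sensitive to outliers.
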
20.   Difference between L1 and L2 Regularization?
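       Both shrink the weights, but differently. L1 regularization shrinks each weight by a constant amount toward 0, while L2 regularization shrinks it by an amount proportional to the weight. So when a weight is large, L1 regularization shrinks it much less than L2 regularization does; when a weight is small, L1 shrinks it much more, which tends to drive unimportant weights toward 0.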
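
The difference shows up if we look only at the shrinkage part of one gradient step. The sketch below assumes penalties of the form λ|w| (L1) and (λ/2)w² (L2) and ignores the data-dependent part of the gradient; the learning rate and λ values are arbitrary.

```python
def sign(w):
    """Return 1, -1 or 0 according to the sign of w."""
    return (w > 0) - (w < 0)

def l1_step(w, lr, lam):
    """Gradient step on the penalty lam * |w|: shrink by a constant amount."""
    return w - lr * lam * sign(w)

def l2_step(w, lr, lam):
    """Gradient step on the penalty (lam / 2) * w**2: shrink in proportion to w."""
    return w - lr * lam * w

for w in (10.0, 0.01):                     # one large and one small weight
    print(w, l1_step(w, 0.1, 0.1), l2_step(w, 0.1, 0.1))
```

For the large weight the L1 step removes less than the L2 step; for the small weight it removes more.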

21.   Explain “Precision” and “Recall”? (Ex: Face Detection)
       TP -> faces correctly detected
       FP -> non-faces detected as faces
       FN -> faces missed

       Precision -> fraction of detections that are actually faces: TP / (TP + FP)
       Recall -> fraction of all faces that were detected: TP / (TP + FN)
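
A worked example with hypothetical face-detection counts (the numbers are invented for illustration):

```python
def precision(tp, fp):
    """Fraction of detections that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual faces that were detected."""
    return tp / (tp + fn)

# Hypothetical counts: 80 faces found correctly,
# 20 non-faces reported as faces, 10 faces missed.
tp, fp, fn = 80, 20, 10
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 80 / 90, roughly 0.89
```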

22.   Formula to compute size of Layers?
       Output size = floor((Width − Kernel Size + 2*Padding) / Stride) + 1, applied separately to the width and the height of a convolution layer's input.
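
The formula translates directly into code. The layer sizes in the two calls are illustrative choices, not taken from the post.

```python
def conv_output_size(width, kernel_size, padding=0, stride=1):
    """Output width (or height) of a convolution layer, rounding down."""
    return (width - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(224, 7, padding=3, stride=2))  # 112
print(conv_output_size(32, 5))                        # 28
```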

23.   How to calculate no.of.operations per layer (FLOPS)?
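       Kernel W × Kernel H × No. of input channels × No. of output features × Output W × Output H
       This counts the multiply-accumulate operations of one convolution layer; the output width and height equal the image width and height when the stride is 1 and padding preserves the size.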
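
As a quick sanity check, the product can be evaluated for a hypothetical layer: a 3×3 kernel, 3 input channels, 64 output features and a 224×224 output map. The function name and layer shape are assumptions made for the example.

```python
def conv_macs(kernel_w, kernel_h, in_channels, out_features, out_w, out_h):
    """Multiply-accumulate operations for one convolution layer (bias ignored)."""
    return kernel_w * kernel_h * in_channels * out_features * out_w * out_h

print(conv_macs(3, 3, 3, 64, 224, 224))  # 86704128
```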

24.   How would you handle an imbalanced data set?
  •     Try collecting more data to even out the imbalance
  •     Resample the dataset to correct the imbalance
  •     Try a different algorithm altogether on your dataset
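
Resampling can take several forms. The sketch below shows one of them, random oversampling, which duplicates minority-class samples until every class is as large as the biggest one. The function name, the fixed seed and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

samples = list(range(10))
labels = ["a"] * 8 + ["b"] * 2            # 8 vs 2: an imbalanced toy data set
_, balanced = oversample(samples, labels)
print(Counter(balanced))                  # Counter({'a': 8, 'b': 8})
```

Oversampling should be applied to the training split only, otherwise duplicated samples can leak into the test data.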

25.   When should you use “classification” over “regression”?

       Classification produces discrete values and assigns data points to strict categories, while regression gives continuous results that let you better distinguish differences between individual points. You would use classification over regression if you wanted your results to reflect the membership of data points in certain explicit categories (ex: if you wanted to know whether a name was male or female, rather than just how correlated it was with male and female names).

Please leave a comment if you have any questions on the above Q&A, or if you want to ask a new question on the topic of Deep Learning.

Happy Reading.
