1. What is Deep Learning?
Deep learning is a subset of machine learning concerned with neural networks. A deep learning algorithm learns representations of data through the use of multi-layer neural nets.
2. What is Mean?
Mean: The average of all the numbers in a data set.
3. What is Variance and Standard Deviation?
Variance: The variance (σ²) is a measure of how far each value in the data set is from the mean; it is the average of the squared differences from the mean.
Standard Deviation: The square root of the variance (σ); it gives how spread out the numbers are, in the same units as the data.
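A minimal sketch of these definitions in plain Python (the data values are made up):

```python
# Mean, variance, and standard deviation of a small sample (population formulas).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(data) / len(data)                                # average of all numbers -> 5.0
variance = sum((x - mean) ** 2 for x in data) / len(data)   # average squared distance from the mean -> 4.0
std_dev = variance ** 0.5                                   # square root of the variance -> 2.0

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```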
4. What is Perceptron?
A perceptron (a type of neuron) takes several binary inputs x1, x2, … and produces a single binary output.
5. What is Sigmoid?
Just like a perceptron, a sigmoid neuron has inputs x1, x2, …, but instead of being just 0 or 1, these inputs can also take values between 0 and 1. Also, just like a perceptron, the sigmoid neuron has a weight for each input (w1, w2, …) and an overall bias b. But the output is not 0 or 1; instead it is σ(w·x + b), where σ is sometimes called the logistic function.
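A minimal sketch of a single sigmoid neuron (the inputs, weights, and bias below are made-up values):

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A sigmoid neuron with illustrative (made-up) weights and bias.
x = [0.5, 0.9, 0.2]        # inputs, each in [0, 1]
w = [0.4, -0.6, 0.1]       # one weight per input
b = 0.05                   # overall bias

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # w.x + b
output = sigmoid(z)                            # smooth value in (0, 1), not just 0 or 1
print(output)
```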
6. What is Gradient?
Gradient is another word for
"slope". The higher the gradient of a graph at a point, the
steeper the line is at that point. A negative gradient means that the
line slopes downwards.
7. Why “Gradient Descent”?
We learned that a sigmoid neuron takes inputs x (each ranging from 0 to 1), weights w, and a bias b to compute its output. But we need a procedure to find good values of w and b. Gradient descent is one of the methods used to compute the w and b values that minimize the cost.
8. Explain “Gradient Descent” and “Stochastic
Gradient Descent (SGD)”?
Both algorithms are methods for finding a set of parameters that minimize a cost/loss function by evaluating the parameters against data and then making adjustments.
In Gradient Descent: you evaluate all the training samples for each update of the parameters.
In SGD: you evaluate one training sample before updating the parameters.
Both help find the weights and biases that minimize the cost function
C(w, b) = (1/2n) Σ_x ||y(x) − a||²
w → weights, b → biases
y(x) → what the output (reference output) should be for input x
a → the output given by the network for the given x, w, and b
A runnable comparison of the two update rules is sketched below.
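A minimal sketch comparing the two update rules on a toy linear model (the data, learning rate, and step counts are made-up values):

```python
import random

# Toy data: we want the model a = w*x + b to reproduce y(x) = 2x + 1.
data = [(x / 10.0, 2 * (x / 10.0) + 1.0) for x in range(10)]

def gradients(w, b, samples):
    """Gradient of C = 1/(2n) * sum ||y(x) - a||^2 over the given samples."""
    n = len(samples)
    dw = sum((w * x + b - y) * x for x, y in samples) / n
    db = sum((w * x + b - y) for x, y in samples) / n
    return dw, db

eta = 0.5                      # learning rate (a hyperparameter)

# Gradient descent: every update uses ALL training samples.
w, b = 0.0, 0.0
for step in range(1000):
    dw, db = gradients(w, b, data)
    w, b = w - eta * dw, b - eta * db

# SGD: every update uses a single randomly chosen training sample.
w_sgd, b_sgd = 0.0, 0.0
for step in range(1000):
    sample = [random.choice(data)]
    dw, db = gradients(w_sgd, b_sgd, sample)
    w_sgd, b_sgd = w_sgd - eta * dw, b_sgd - eta * db

print(w, b)          # close to 2.0, 1.0
print(w_sgd, b_sgd)  # also close, but the path there is noisier
```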
9. Explain “Training” a CNN model?
Supervised learning → training the model on data that has been labeled.
Unsupervised learning → training without labels.
10. What is an epoch?
Epoch → a single pass through the entire training data set.
11. Explain “Back propagation”?
Backpropagation gives an expression for the partial derivative of the cost function C with respect to any weight (or bias) in the network. The expression tells us how quickly the cost changes when we change the weights and biases.
The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the network.
Four fundamental equations behind BP:
Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives ∂C/∂w^l_{jk} and ∂C/∂b^l_j.
We first introduce an intermediate quantity, δ^l_j, which we call the error in the j-th neuron in the l-th layer. Backpropagation gives us a procedure to compute the error δ^l_j and then relates δ^l_j to ∂C/∂w^l_{jk} and ∂C/∂b^l_j.
In short, the method calculates the gradient of the loss function with respect to all the weights (and biases) in the network; an optimizer such as gradient descent/SGD then uses that gradient to update them. A small worked example is sketched below.
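A minimal sketch of these equations on a tiny 1-1-1 network with a quadratic cost (the weights, biases, input, and target are made-up values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# A 1-1-1 network (one input, one hidden neuron, one output neuron).
x, y = 0.5, 1.0                   # single training example: input and desired output
w1, b1 = 0.8, 0.1                 # hidden layer
w2, b2 = -0.4, 0.3                # output layer

# Forward pass: store weighted inputs z and activations a for each layer.
z1 = w1 * x + b1;  a1 = sigmoid(z1)
z2 = w2 * a1 + b2; a2 = sigmoid(z2)

# Backward pass for the quadratic cost C = 1/2 * (a2 - y)^2.
delta2 = (a2 - y) * sigmoid_prime(z2)      # BP1: error in the output layer
delta1 = w2 * delta2 * sigmoid_prime(z1)   # BP2: propagate the error backwards

# BP3 / BP4: gradients with respect to biases and weights.
dC_db2, dC_dw2 = delta2, delta2 * a1
dC_db1, dC_dw1 = delta1, delta1 * x
print(dC_dw1, dC_db1, dC_dw2, dC_db2)
```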
12. Examples of “Cost Functions”?
Common cost (loss) functions include the quadratic cost (mean squared error) and the cross-entropy cost. (Gradient descent and SGD are optimizers that minimize a cost function, and L1/L2 regularization are extra terms added to a cost function, rather than cost functions themselves.)
13. What are “Activation Functions”?
Activation functions introduce non-linearity into the network; common examples are sigmoid, tanh, ReLU, and softmax.
Softmax: similar in spirit to sigmoid (it squashes values), but it operates on a whole vector of scores and turns them into probabilities that sum to 1.
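A minimal sketch of a few common activation functions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))        # output in (0, 1)

def relu(z):
    return max(0.0, z)                        # 0 for negative z, identity otherwise

def softmax(zs):
    """Turns a vector of scores into probabilities that sum to 1."""
    exps = [math.exp(z - max(zs)) for z in zs]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))               # 0.5
print(relu(-2.0), relu(3.0))      # 0.0 3.0
print(softmax([1.0, 2.0, 3.0]))   # ~[0.09, 0.24, 0.67]
```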
14. What are hyperparameters, and how do you select them?
Examples: learning rate, number of epochs, batch size (how many samples to process per weight update), and the regularization parameter (λ). They are typically selected by monitoring validation accuracy while varying one hyperparameter at a time (or via grid/random search).
15. How do you initialize the “weights” and “biases” for a given network/model? Why?
Generally, weights and biases are initialized randomly from a Gaussian distribution with mean 0 and standard deviation 1.
Why can this be a problem? Example: suppose a neuron has 1000 inputs and, for a particular training example, 500 of those inputs are 1 and the rest are 0. Then z = w·x + b is a sum of 501 standard Gaussians (500 active weights + 1 bias), so its standard deviation is √501 ≈ 22.4. That means z follows a very broad Gaussian distribution that is not sharply peaked around 0, so |z| is often large and the sigmoid neuron saturates and learns slowly. Initializing the weights with standard deviation 1/√(number of inputs) keeps z sharply peaked, as in the sketch below.
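A minimal sketch of the effect (the half-on/half-off input pattern and sample counts are illustrative assumptions, and here the bias is scaled along with the weights):

```python
import random
import math

random.seed(0)
n_in = 1000
inputs = [1.0] * 500 + [0.0] * 500   # 1000 inputs: half are 1, half are 0

def weighted_input(scale):
    """z = w.x + b with weights and bias drawn from N(0, scale^2)."""
    w = [random.gauss(0.0, scale) for _ in range(n_in)]
    b = random.gauss(0.0, scale)
    return sum(wi * xi for wi, xi in zip(w, inputs)) + b

# Naive initialization N(0, 1): z is a sum of 501 unit Gaussians, std ~ sqrt(501) ~ 22.4,
# so |z| is often huge and the sigmoid saturates (learns slowly).
naive = [weighted_input(1.0) for _ in range(500)]

# Scaled initialization N(0, 1/sqrt(n_in)): z stays sharply peaked around 0.
scaled = [weighted_input(1.0 / math.sqrt(n_in)) for _ in range(500)]

def sample_std(samples):
    m = sum(samples) / len(samples)
    return (sum((s - m) ** 2 for s in samples) / len(samples)) ** 0.5

print(sample_std(naive))   # roughly 22
print(sample_std(scaled))  # roughly 0.7
```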
16. When do you stop “training”?
Negative case: stop if the cost is not going down, or test accuracy is not improving, or the network is overfitting (cost decreases but accuracy doesn't improve).
Positive case: stop when the desired accuracy is reached.
A minimal early-stopping loop is sketched below.
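A minimal early-stopping sketch (the per-epoch accuracy numbers are made up):

```python
# Early stopping on a recorded test-accuracy curve.
test_accuracy_per_epoch = [0.62, 0.71, 0.78, 0.81, 0.83, 0.83, 0.82, 0.83, 0.82, 0.81]

patience = 3                      # how many epochs we tolerate without improvement
best, since_best = 0.0, 0
for epoch, acc in enumerate(test_accuracy_per_epoch):
    if acc > best:
        best, since_best = acc, 0
    else:
        since_best += 1
    if since_best >= patience:    # accuracy no longer improving -> stop training
        print(f"stop at epoch {epoch}, best test accuracy {best:.2f}")
        break
```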
17. What is Overfitting? How to avoid it?
When the network's cost keeps decreasing but its accuracy does not improve as expected, we say the network is overfitting or overtraining.
We need to check how the cost on the training and test data varies as the network learns. If the training cost keeps going down while the test cost goes up, that is an indication of overfitting: the network is fitting peculiarities of the training set rather than generalizing.
Another sign is classification accuracy on the training set. If accuracy on the training data is 100% while accuracy on the test data is only ~80%, the network is memorizing the training samples rather than actually learning their features.
The obvious way to detect overfitting is the approach above: keep track of accuracy on the test data as the network trains. If we see that the accuracy on the test data is no longer improving, then we should stop training. Of course, it might be that accuracy on the test data and the training data both stop improving at the same time, but adopting this strategy will still prevent overfitting.
- Increasing the amount of training data is one way of reducing overfitting, but it is not always possible.
- Using regularization techniques such as weight decay (L2 regularization). The idea of L2 regularization is to add an extra term to the cost function, called the regularization term (see the sketch below).
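A minimal sketch of a quadratic cost with an added L2 regularization term (the error and weight values are made up):

```python
# C = 1/(2n) * sum ||y(x) - a||^2  +  (lambda / (2n)) * sum w^2
def regularized_cost(errors, weights, lam, n):
    """errors: list of (y(x) - a) values; weights: all weights in the network."""
    data_term = sum(e ** 2 for e in errors) / (2 * n)
    reg_term = (lam / (2 * n)) * sum(w ** 2 for w in weights)
    return data_term + reg_term

# A larger lambda pushes the network toward smaller weights, which tends to reduce overfitting.
print(regularized_cost([0.1, -0.2, 0.05], [0.5, -1.2, 0.8, 0.3], lam=0.1, n=3))
```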
18. What is “dropout”?
“Dropout” is another technique for reducing overfitting. It does not change the cost function as L1 and L2 regularization do; instead it changes the network itself.
Normally, during training we forward-propagate an input x through the network and then back-propagate to compute the gradient. With dropout, this process is modified: we start by randomly removing a fraction of the hidden neurons (leaving the input and output neurons untouched), train on a mini-batch, then restore the removed neurons and randomly remove a different set of hidden neurons for the next mini-batch. We keep repeating this process, computing gradients and updating the weights & biases of the thinned networks.
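A minimal sketch of the dropout idea (the drop probability and activation values are made up):

```python
import random

def dropout_mask(layer_size, drop_prob=0.5):
    """Randomly 'removes' hidden neurons for one mini-batch by zeroing their activations."""
    return [0.0 if random.random() < drop_prob else 1.0 for _ in range(layer_size)]

hidden_activations = [0.8, 0.1, 0.6, 0.9, 0.3, 0.7]
mask = dropout_mask(len(hidden_activations))

# Apply the mask during training; a fresh mask is drawn for every mini-batch.
dropped = [a * m for a, m in zip(hidden_activations, mask)]
print(dropped)

# At test time the full network is used, with hidden activations scaled by (1 - drop_prob)
# (or equivalently, activations are scaled up by 1/(1 - drop_prob) during training).
```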
19. What are the L1 and L2 norms?
The L1 norm is also known as least absolute deviations (LAD) or least absolute errors (LAE). It is basically minimizing the sum of the absolute differences S between the target values Yi and the estimated values f(xi): S = Σ |Yi − f(xi)|.
The L2 norm is also known as least squares. It is basically minimizing the sum of the squared differences S between the target values Yi and the estimated values f(xi): S = Σ (Yi − f(xi))².
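A minimal sketch of both sums (the target and estimated values are made up):

```python
targets = [3.0, -0.5, 2.0, 7.0]     # Yi
estimates = [2.5, 0.0, 2.0, 8.0]    # f(xi)

l1_loss = sum(abs(y - f) for y, f in zip(targets, estimates))    # sum of absolute differences
l2_loss = sum((y - f) ** 2 for y, f in zip(targets, estimates))  # sum of squared differences

print(l1_loss)  # 0.5 + 0.5 + 0.0 + 1.0 = 2.0
print(l2_loss)  # 0.25 + 0.25 + 0.0 + 1.0 = 1.5
```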
20. Difference between L1 and L2 Regularization?
In L1 regularization the weights shrink by a constant amount towards 0, while in L2 regularization they shrink by an amount proportional to the weight itself. So when a weight is large, L1 regularization shrinks it much less than L2 regularization does; when a weight is small, L1 shrinks it much more. The net effect is that L1 tends to concentrate the network's weights in a few important connections (sparse weights), driving the rest toward zero.
21. Explain “Precision” and “Recall”? (Ex: Face Detection)
TP → correctly detected faces
FP → non-faces detected as faces
FN → number of faces missed
Precision → the percentage of identified faces that are correct: TP / (TP + FP)
Recall → the percentage of actual faces that were correctly detected: TP / (TP + FN)
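A minimal sketch with made-up detection counts:

```python
# Face-detection example: the counts below are made up for illustration.
true_positives = 90    # correctly detected faces
false_positives = 10   # non-faces reported as faces
false_negatives = 30   # faces that were missed

precision = true_positives / (true_positives + false_positives)  # 90 / 100 = 0.90
recall = true_positives / (true_positives + false_negatives)     # 90 / 120 = 0.75

print(f"precision={precision:.2f}, recall={recall:.2f}")
```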
22. Formula to compute size of Layers?
Output size = ((Width − Kernel size + 2 × Padding) / Stride) + 1
23. How do you calculate the number of operations (FLOPs) per layer?
Kernel width × Kernel height × number of input channels × number of output feature maps × output feature-map width × output feature-map height. (This counts the multiply-accumulates of a convolutional layer; multiply by 2 to get FLOPs.) See the sketch below.
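A minimal sketch of both formulas in Python (the 224×224 input, 3×3 kernel, and 64 output feature maps are made-up example values):

```python
def conv_output_size(width, kernel_size, padding, stride):
    """Output size = ((W - K + 2P) / S) + 1."""
    return (width - kernel_size + 2 * padding) // stride + 1

def conv_ops(kernel_w, kernel_h, in_channels, out_channels, out_w, out_h):
    """Approximate multiply-accumulate count for one convolutional layer."""
    return kernel_w * kernel_h * in_channels * out_channels * out_w * out_h

# Example: 224x224 RGB input, 3x3 kernel, padding 1, stride 1, 64 output feature maps.
out = conv_output_size(224, 3, 1, 1)       # 224
print(out)
print(conv_ops(3, 3, 3, 64, out, out))     # ~86.7 million multiply-accumulates
```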
24. How would you handle an imbalanced data set?
Try collecting more data to even out the imbalance.
Resample the data set to correct the imbalance, e.g. by oversampling the minority class (see the sketch below).
Try a different algorithm altogether on your data set.
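A minimal sketch of the resampling idea via random oversampling (the 95/5 class split is made up):

```python
import random

# Toy imbalanced data set: 95 negatives, 5 positives.
dataset = [("sample", 0)] * 95 + [("sample", 1)] * 5

minority = [s for s in dataset if s[1] == 1]
majority = [s for s in dataset if s[1] == 0]

# Random oversampling: duplicate minority samples until the classes are balanced.
oversampled = majority + [random.choice(minority) for _ in range(len(majority))]
print(len(oversampled), sum(1 for _, label in oversampled if label == 1))  # 190, 95
```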
25. When should you use “classification” over
“regression”?
Classification produces discrete values and assigns the data to strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points. You would use classification over regression if you wanted your results to reflect the membership of data points in explicit categories (e.g. if you wanted to know whether a name is male or female, rather than how strongly it correlates with male and female names).
Please leave a comment if you have any questions on the Q&A above, or if you want to ask a new question on this topic of Deep Learning.
Happy Reading.