A Machine Learning and Data Science Blog: Q&A on Deep Learning concepts

1. What is Deep Learning?

Deep learning is a subset of machine learning that is concerned with neural networks.

A deep learning algorithm learns representations of the data through the use of neural networks, typically ones with many layers.

2. What is the Mean?

Mean: The average of all the numbers, that is, their sum divided by how many numbers there are.

3. What are Variance and Standard Deviation?

Variance: The variance (σ²) measures how far the values in the data set are from the mean. It is the average of the squared differences from the mean.

Standard Deviation: The square root of the variance (σ). It tells how spread out the numbers are, in the same units as the data.
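
As a quick check of these three definitions, here is a minimal sketch in plain Python. The function names and the sample numbers are made up for illustration, and the variance here is the population variance (dividing by n), matching the "average of squared differences" wording above.

```python
import math

def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def variance(values):
    """Population variance: average squared difference from the mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative numbers only
print(mean(data), variance(data), std_dev(data))  # prints 5.0 4.0 2.0
```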

4. What is a Perceptron?

A perceptron (a type of artificial neuron) takes several binary inputs x1, x2, … and produces a single binary output.

5. What is Sigmoid?

Just like a perceptron, a sigmoid neuron has inputs x1, x2, …. But instead of being just 0 or 1, these inputs can take any value between 0 and 1. Also just like a perceptron, the sigmoid neuron has a weight for each input, w1, w2, …, and an overall bias b. But the output is not 0 or 1; instead it is σ(w·x + b), where σ(z) = 1/(1 + e^(−z)) is the sigmoid function, sometimes called the logistic function.
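
To make the difference between the two neurons concrete, here is a small sketch. The weights and bias are hand-picked for illustration (they happen to make the perceptron behave like a NAND gate), and the threshold rule "output 1 when w·x + b > 0" is the usual convention, assumed here.

```python
import math

def weighted_input(x, w, b):
    """z = w.x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def perceptron(x, w, b):
    """Binary output: 1 if the weighted input is positive, else 0."""
    return 1 if weighted_input(x, w, b) > 0 else 0

def sigmoid(z):
    """Logistic function, squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(x, w, b):
    """Smooth output between 0 and 1 instead of a hard 0/1."""
    return sigmoid(weighted_input(x, w, b))

w, b = [-2.0, -2.0], 3.0  # hand-picked example values
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron(x, w, b), round(sigmoid_neuron(x, w, b), 3))
```

The perceptron jumps between 0 and 1, while the sigmoid neuron changes smoothly as the weighted input changes, which is what makes gradient-based learning possible.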

6. What is Gradient?

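Gradient is another word for "slope". The higher the gradient of a graph at a point, the steeper the line is at that point. A negative gradient means that the line slopes downwards. For a function of several variables (such as a cost that depends on many weights), the gradient is the collection of slopes, one partial derivative per variable.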

7. Why “Gradient Descent”?

We learned that a sigmoid neuron takes inputs X (ranging from 0 to 1), weights W and a bias B to compute its output. But W and B are not known in advance; we need a method to find values that make the network's output match the desired output. Gradient Descent is one such method for finding the W and B values.

8. Explain “Gradient Descent” and “Stochastic Gradient Descent (SGD)”?

Both algorithms are methods for finding a set of parameters that minimize a cost/loss function by evaluating parameters against data and then making adjustments.

In “Gradient Descent”: You evaluate all the training samples for each set of parameters before updating them.

In “SGD”: You evaluate 1 training sample for the set of parameters before updating them.

Both help find which “Weights” and “Bias” values minimize the cost function, for example the quadratic cost:

C(w,b) = (1/(2n)) ∑ₓ ||y(x) – a||²

w -> Weights, b -> Biases

n -> Number of training inputs

y(x) -> What the output (reference output) should be for input x

a -> Output given by the network for a given ‘x’, ‘w’ and ‘b’
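
Here is a rough sketch of the two update schemes on a toy problem. The model a = w*x + b, the data (generated from y = 2x + 1), the learning rates and the epoch counts are all assumptions chosen for illustration, not values from any real network.

```python
import random

def cost(w, b, data):
    """Quadratic cost C(w, b) = 1/(2n) * sum of (y - a)^2 with a = w*x + b."""
    n = len(data)
    return sum((y - (w * x + b)) ** 2 for x, y in data) / (2 * n)

def gradients(w, b, samples):
    """Partial derivatives of the cost w.r.t. w and b over `samples`."""
    n = len(samples)
    dw = sum(-(y - (w * x + b)) * x for x, y in samples) / n
    db = sum(-(y - (w * x + b)) for x, y in samples) / n
    return dw, db

def gradient_descent(data, lr=0.5, epochs=1000):
    """One update per epoch, using ALL training samples."""
    w = b = 0.0
    for _ in range(epochs):
        dw, db = gradients(w, b, data)
        w, b = w - lr * dw, b - lr * db
    return w, b

def sgd(data, lr=0.5, epochs=200, seed=0):
    """One update per training sample, visiting samples in random order."""
    rng = random.Random(seed)
    w = b = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for sample in shuffled:
            dw, db = gradients(w, b, [sample])
            w, b = w - lr * dw, b - lr * db
    return w, b

# Toy data: inputs in [0, 1], reference outputs from y = 2x + 1.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
w_gd, b_gd = gradient_descent(data)
w_sgd, b_sgd = sgd(data)
print("GD :", round(w_gd, 3), round(b_gd, 3), cost(w_gd, b_gd, data))
print("SGD:", round(w_sgd, 3), round(b_sgd, 3), cost(w_sgd, b_sgd, data))
```

Both should end up close to w = 2 and b = 1. The difference is in how much data each update looks at: Gradient Descent makes one careful step per pass, while SGD makes many cheap, noisier steps.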

9. Explain “Training” a CNN model?

Supervised Learning -> Training on labeled data: each training sample comes with the output the model is expected to produce

Unsupervised Learning -> Training without labels

10. What is an epoch?

Epoch -> A single pass through the entire training data set.

11. Explain “Back propagation”?

At the heart of backpropagation is an expression for the partial derivative of the cost function (C) w.r.t. any weight (or bias).

The expression tells us how quickly the cost changes when we change the weights and biases.

The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the network.

Four Fundamental Equations behind BP

Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives ∂C/∂w^l_jk and ∂C/∂b^l_j, where the superscript l is the layer and the subscripts j, k pick out the neurons.

We first introduce an intermediate quantity, δ^l_j, which we call the error in the jth neuron in the lth layer.

Backpropagation will give us a procedure to compute the error δ^l_j, and then will relate δ^l_j to ∂C/∂w^l_jk and ∂C/∂b^l_j.

In short, “BP” is about understanding how changing the weights and biases in a network changes the cost function, which is exactly the information Gradient Descent/SGD needs for its updates. Ultimately this means computing the partial derivatives of the cost function w.r.t. each weight and bias.

The method calculates the gradient of the loss function with respect to all the weights in the network.
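
To see the idea in action, here is a minimal sketch for the smallest possible network: one input, one hidden sigmoid neuron and one output sigmoid neuron, with the quadratic cost for a single sample. The network shape, the parameter values and the function names are assumptions for illustration. The errors are computed in the usual way (output error first, then propagated backwards) and the result is compared against a numerical estimate of the same derivatives.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, params):
    """Forward pass through a 1-1-1 network; returns z and a for both layers."""
    w1, b1, w2, b2 = params
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def cost(x, y, params):
    """Quadratic cost for one sample: C = 1/2 * (y - a)^2."""
    a2 = forward(x, params)[3]
    return 0.5 * (y - a2) ** 2

def backprop(x, y, params):
    """Return (dC/dw1, dC/db1, dC/dw2, dC/db2)."""
    w1, b1, w2, b2 = params
    z1, a1, z2, a2 = forward(x, params)
    delta2 = (a2 - y) * sigmoid_prime(z2)      # error in the output neuron
    delta1 = w2 * delta2 * sigmoid_prime(z1)   # error propagated back one layer
    # dC/db = delta, dC/dw = (activation feeding the weight) * delta
    return (x * delta1, delta1, a1 * delta2, delta2)

def numerical_gradient(x, y, params, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((cost(x, y, up) - cost(x, y, down)) / (2 * eps))
    return tuple(grads)

params = (0.5, -0.3, 0.8, 0.1)  # arbitrary w1, b1, w2, b2
analytic = backprop(0.7, 1.0, params)
numeric = numerical_gradient(0.7, 1.0, params)
for name, a, n in zip(("dC/dw1", "dC/db1", "dC/dw2", "dC/db2"), analytic, numeric):
    print(name, a, n)
```

The two columns should agree to many decimal places. Backpropagation gets all four derivatives from one forward and one backward pass, while the numerical check needs two extra forward passes per parameter.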

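12. Examples of “Cost Functions”?

Common cost functions include the quadratic cost (mean squared error, as in question 8) and the cross-entropy cost. Note that “Gradient Descent” and “Stochastic Gradient Descent (SGD)” are not cost functions themselves but the methods used to minimize one, while L1 Regularization and L2 Regularization add an extra penalty term to the cost function.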

13. What are “Activation Functions”?

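An activation function is the function a neuron applies to its weighted input (w.x + b) to produce its output; sigmoid is one example.

Softmax: It is similar to sigmoid, but with a different function. It takes the weighted inputs of a whole layer and turns them into outputs that all lie between 0 and 1 and sum to 1, so they can be read as probabilities.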
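
A minimal softmax sketch, assuming the standard definition exp(z_j) / sum of exp(z_k). Subtracting the maximum score first is a common numerical-stability trick and does not change the result. The input scores are made up.

```python
import math

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # illustrative scores
print([round(p, 3) for p in probs], sum(probs))
```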

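14. What are hyper parameters and how to select “Hyper Parameters”?

Hyper parameters are settings chosen before training rather than learned from the data, for example:

Learning rate, number of epochs, batch size (how many samples to process per iteration of training)

Regularization parameter

They are usually selected by trying several values and comparing the results on data held out from training.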

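15. How to initialize “weights” and “biases” for a given network/model? Why?

Generally “Weights” and “Biases” are initialized randomly using Gaussian distributions with Mean=0 and Standard Deviation=1.

Why does the choice matter? Ex: suppose a neuron has 1000 inputs, of which 500 are 0 and the remaining 500 are 1. Then z = w.x + b is a sum of 501 Gaussian random variables (500 weights + 1 bias), so its standard deviation is √501 ≈ 22.4. That is, z has a very broad Gaussian distribution, not sharply peaked, so |z| is often large and the sigmoid neuron starts out saturated and learns slowly. For this reason the weights are often drawn with a smaller standard deviation, such as 1/√(number of inputs).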
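
The √501 figure can be checked with a small simulation. This is only a sketch: the number of trials and the random seed are arbitrary, and the setup (500 inputs equal to 1, the rest 0, weights and bias drawn from a Gaussian with mean 0 and standard deviation 1) follows the example above.

```python
import math
import random
import statistics

N_ACTIVE = 500  # inputs equal to 1; inputs equal to 0 contribute nothing to z
TRIALS = 1000   # number of random initializations to sample (arbitrary)

def sample_z(rng):
    """z = w.x + b with the active weights and the bias drawn from N(0, 1)."""
    weighted_sum = sum(rng.gauss(0.0, 1.0) for _ in range(N_ACTIVE))
    bias = rng.gauss(0.0, 1.0)
    return weighted_sum + bias

rng = random.Random(0)
zs = [sample_z(rng) for _ in range(TRIALS)]
theoretical_std = math.sqrt(N_ACTIVE + 1)
empirical_std = statistics.pstdev(zs)
print(f"theoretical std of z: {theoretical_std:.1f}")
print(f"empirical std of z:   {empirical_std:.1f}")
```

The empirical value should land near the theoretical one, confirming that z is typically far from 0 with this initialization.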

16. When do you stop “training”?

Negative case: If the cost is not going down, or test accuracy is not improving, or the network is overfitting (cost decreases but accuracy doesn’t improve)

Positive case: If the desired accuracy is reached
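
Both cases can be turned into a simple stopping rule. The sketch below is one possible way to do it, not a standard API: the function name, the `patience` idea (how many epochs without improvement to tolerate) and the accuracy history are all hypothetical.

```python
def stopping_epoch(accuracies, patience=3, target=None):
    """Return the epoch index at which training would stop, or None.

    accuracies: accuracy on held-out data after each epoch.
    patience:   stop after this many epochs in a row without a new best.
    target:     optional desired accuracy; stop as soon as it is reached.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(accuracies):
        if target is not None and acc >= target:
            return epoch  # positive case: desired accuracy reached
        if acc > best:
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # negative case: accuracy stopped improving
    return None

history = [0.70, 0.78, 0.82, 0.85, 0.85, 0.84, 0.85, 0.83]  # made-up values
print(stopping_epoch(history))               # prints 6
print(stopping_epoch(history, target=0.80))  # prints 2
```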

17. What is Overfitting? How to avoid it?

When the network shows a decrease in cost on the training data, but accuracy does not improve as expected for that decrease, we say the network is overfitting or overtraining.

We need to check how the cost on the training and test data varies as the network learns. If they move in opposite directions (cost on the training set going down and cost on the test set going up), it’s an indication of overfitting: the network is no longer learning anything that generalizes beyond the training set.

Another sign is classification accuracy. If accuracy on the training data set is 100% and on the test data set is ~80%, it means the network is memorizing training samples (not actually learning their features).

The obvious way to detect overfitting is to use the approach above, keeping track of accuracy on the test data as our network trains. If we see that the accuracy on the test data is no longer improving, then we should stop training. It might be that accuracy on the test data and the training data both stop improving at the same time. Still, adopting this strategy will prevent overfitting.

Increasing the amount of training data is one way of reducing overfitting, but it is not always possible.

Another way is to use regularization techniques such as weight decay, also known as L2 regularization. The idea of L2 regularization is to add an extra term to the cost function, a term called the regularization term.
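
As a sketch of what "adding an extra term" looks like, the function below adds an L2 penalty to an already-computed cost. The scaling λ/(2n) is one common convention and is an assumption here, as are the function name and the numbers.

```python
def l2_regularized_cost(base_cost, weights, lam, n):
    """Cost plus the L2 regularization term (lam / 2n) * sum of w^2.

    base_cost: the unregularized cost (e.g. the quadratic cost).
    weights:   all weights in the network (biases are usually left out).
    lam:       regularization parameter (a hyper parameter).
    n:         number of training samples.
    """
    penalty = (lam / (2 * n)) * sum(w * w for w in weights)
    return base_cost + penalty

print(l2_regularized_cost(0.5, [1.0, -2.0], lam=0.1, n=10))
```

Large weights make the penalty large, so minimizing the regularized cost pushes the network towards smaller weights.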

18. What is “dropout”?

“Dropout” is another technique to reduce “Overfitting”. It does not change the cost function as L1 and L2 regularization do; instead it changes the network itself.

During training we forward-propagate input ‘x’ through the network and then back propagate to compute the “gradient”. With dropout, this process is modified. We start by randomly (and temporarily) removing some of the hidden neurons in the network, while leaving the input and output neurons untouched. We forward-propagate and back propagate through this thinned network for a mini-batch of samples and update the weights & biases. Then we add back the previously removed hidden neurons, randomly remove a different set of hidden neurons, and do the same for the next mini-batch. We keep repeating this process of calculating the gradient and updating weights & biases in the network.
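
The "randomly remove, then restore" step can be sketched with a 0/1 mask over the hidden neurons. This shows only the masking part, with made-up activations and a drop probability of 0.5 chosen for illustration; a full implementation would also rescale activations so that training and test time match, which is left out here.

```python
import random

def dropout_mask(n_hidden, p_drop, rng):
    """0 means the hidden neuron is removed for this mini-batch, 1 means kept."""
    return [0 if rng.random() < p_drop else 1 for _ in range(n_hidden)]

def apply_mask(activations, mask):
    """Removed neurons output 0, so they contribute nothing downstream."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]  # made-up hidden activations
for step in range(3):
    # A fresh mask per mini-batch: earlier neurons come back, others drop out.
    mask = dropout_mask(len(hidden), 0.5, rng)
    print(step, mask, apply_mask(hidden, mask))
```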

19. What are the L1 and L2 norms?

L1-norm is also known as least absolute deviations (LAD) or least absolute errors (LAE). It is basically minimizing the sum of the absolute differences (S) between the target values (Yi) and the estimated values (f(xi)): S = ∑ |Yi – f(xi)|

L2-norm is also known as least squares. It is basically minimizing the sum of the squares of the differences (S) between the target values (Yi) and the estimated values (f(xi)): S = ∑ (Yi – f(xi))²
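
The two sums are easy to compare side by side. A small sketch with made-up targets and estimates:

```python
def l1_loss(targets, estimates):
    """Sum of absolute differences between targets and estimates."""
    return sum(abs(y - f) for y, f in zip(targets, estimates))

def l2_loss(targets, estimates):
    """Sum of squared differences between targets and estimates."""
    return sum((y - f) ** 2 for y, f in zip(targets, estimates))

targets = [3.0, 5.0, 2.0]    # illustrative values
estimates = [2.5, 5.0, 4.0]
print(l1_loss(targets, estimates))  # prints 2.5
print(l2_loss(targets, estimates))  # prints 4.25
```

The single large error (2.0) dominates the L2 sum because it is squared, which is why least squares is more sensitive to outliers than least absolute deviations.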

20. Difference between L1 and L2 Regularization?

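In L1 regularization the weights shrink by a constant amount towards 0, while in L2 regularization they shrink by an amount proportional to the weight. So when a weight is large, L1 regularization shrinks the weight much less than L2 regularization does; when a weight is small, L1 shrinks it much more than L2.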
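
A rough numerical sketch of that difference, looking only at the shrinking part of one update step. The `step` value stands in for (learning rate × regularization parameter / n) and is an arbitrary illustrative number, as are the function names.

```python
def l1_shrink(w, step):
    """L1: move the weight a constant amount towards 0."""
    sign = (w > 0) - (w < 0)
    return w - step * sign

def l2_shrink(w, step):
    """L2: rescale the weight, so the change is proportional to w."""
    return w * (1 - step)

step = 0.01
for w in (10.0, 0.5, 0.05):
    print(w, "L1 ->", l1_shrink(w, step), "L2 ->", l2_shrink(w, step))
```

For w = 10 the L2 step removes far more than the L1 step, while for w = 0.05 it is the other way round.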

21. Explain “Precision” and “Recall”? (Ex: Face Detection)

TP (true positives) -> Correctly detected faces

FP (false positives) -> Non-faces detected as faces

FN (false negatives) -> Faces that were missed

Precision -> Percentage of detected faces that are actually faces: TP / (TP + FP)

Recall -> Percentage of all actual faces that were detected: TP / (TP + FN)
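
A small worked example, with made-up counts for a face detector:

```python
def precision(tp, fp):
    """Fraction of detections that are real faces."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of real faces that were detected."""
    return tp / (tp + fn)

# Hypothetical counts: 90 faces found correctly, 10 false alarms, 30 faces missed.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp))  # prints 0.9
print(recall(tp, fn))     # prints 0.75
```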

22. Formula to compute the size of Layers?

Output size = ((Width – Kernel Size + 2*Padding) / Stride) + 1, with the division rounded down when it is not exact
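
The formula as a small helper, assuming the division is rounded down (integer division). The function name and the example layer sizes are chosen for illustration.

```python
def conv_output_size(width, kernel_size, padding=0, stride=1):
    """Output width of a convolution (or pooling) layer along one dimension."""
    return (width - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(32, 5))                        # prints 28
print(conv_output_size(28, 3, padding=1))             # prints 28
print(conv_output_size(224, 7, padding=3, stride=2))  # prints 112
```

The same formula applies to the height, using the kernel height and the vertical padding and stride.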

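23. How to calculate the number of operations per layer (FLOPS)?

Number of multiply-accumulate operations = Kernel W × Kernel H × No. of channels of input image × No. of output features × Output W × Output H

Here Output W × Output H is the size of the layer's output feature map (the same as the input image W × H when the stride is 1 and the padding preserves the size).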
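
As a sketch, the count for one convolution layer can be computed like this. It counts multiply-accumulate operations and ignores biases and activation functions; the function name and the example layer (3×3 kernels, 3 input channels, 64 output features, 224×224 output) are assumptions for illustration.

```python
def conv_layer_macs(kernel_w, kernel_h, in_channels, out_features, out_w, out_h):
    """Multiply-accumulate operations for one convolution layer."""
    per_output_value = kernel_w * kernel_h * in_channels
    n_output_values = out_features * out_w * out_h
    return per_output_value * n_output_values

print(conv_layer_macs(3, 3, 3, 64, 224, 224))  # prints 86704128
```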

24. How would you handle an imbalanced data set?

Try collecting more data to even out the imbalance

Resample the dataset to correct the imbalance

Try a different algorithm altogether on your dataset
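
One simple form of resampling is random oversampling: duplicate randomly chosen samples of the smaller classes until every class is as large as the biggest one. The sketch below is just one possible approach, with a hypothetical function name and toy data.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

samples = ["a", "b", "c", "d", "e"]  # toy data
labels = [0, 0, 0, 0, 1]             # class 1 is heavily under-represented
new_samples, new_labels = random_oversample(samples, labels)
print(Counter(labels), "->", Counter(new_labels))
```

Resampling should be applied to the training set only, so that the test set still reflects the real class balance.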

25. When should you use “classification” over “regression”?

Classification produces discrete values and assigns data points to strict categories, while regression gives you continuous results that allow you to better distinguish differences between individual points. You would use classification over regression if you wanted your results to reflect the membership of data points in certain explicit categories (ex: if you wanted to know whether a name was male or female rather than just how correlated it was with male and female names).

Please leave a comment if you have any questions on the above Q&A, or want to ask a new question on this topic of Deep Learning.

Happy Reading.
