diff --git a/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md b/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md new file mode 100644 index 0000000..1fc4d5f --- /dev/null +++ b/English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md @@ -0,0 +1,1127 @@

[TOC]

# Chapter 3 Foundation of Deep Learning

## 3.1 Basic Concepts

### 3.1.1 What are the components of a neural network?

There are many types of neural networks, the most important of which is the multilayer perceptron. To describe a neural network in detail, let's start with the simplest one.

**Perceptron**

The basic neuron model in the multilayer perceptron is called the perceptron, invented by *Frank Rosenblatt* in 1957.

The simplest perceptron is shown below:

![](./img/ch3/3-1.png)

Here $x_1$, $x_2$, $x_3$ are the inputs to the perceptron, and its output is:

$$
Output = \left\{
\begin{aligned}
0, \quad if \ \ \sum_i w_i x_i \leqslant threshold \\
1, \quad if \ \ \sum_i w_i x_i > threshold
\end{aligned}
\right.
$$

Imagine the perceptron as a weighted voting mechanism. For example, three judges rate a singer with scores of $4$, $1$, $-3$, and the weights of the three scores are $1, 3, 2$ respectively, so the singer's final score is $4 * 1 + 1 * 3 + (-3) * 2 = 1$. According to the rules of the game, a $threshold$ of $3$ is chosen, meaning that the singer advances only if the overall score is greater than $3$. In perceptron terms, this singer is eliminated, because

$$
\sum_i w_i x_i < threshold=3, output = 0
$$

Replacing $threshold$ with $-b$, the output becomes:

$$
Output = \left\{
\begin{aligned}
0, \quad if \ \ w \cdot x + b \le 0 \\
1, \quad if \ \ w \cdot x + b > 0
\end{aligned}
\right.
$$

With appropriate $w$ and $b$, a simple perceptron unit can implement a NAND gate, as follows:

![](./img/ch3/3-2.png)

When the inputs are $0$ and $1$, the weighted sum is $ 0 * (-2) + 1 * (-2) + 3 = 1 > 0$, so the perceptron outputs $1$.

More complex perceptrons are composed of simple perceptron units:

![](./img/ch3/3-3.png)

**Multilayer Perceptron**

The multilayer perceptron is a generalization of the perceptron. Its most important feature is that it has multiple neuron layers, which is why it is also called a deep neural network. Compared with a single perceptron, every neuron in layer $i$ of a multilayer perceptron is connected to every neuron in layer $i-1$.

![](./img/ch3/3.1.1.5.png)

The output layer can have more than $1$ neuron. The hidden part can have only $1$ layer or several layers. A network whose output layer has multiple neurons looks like the following:

![](./img/ch3/3.1.1.6.png)


### 3.1.2 What are the common model structures of neural networks?

The figure below contains most of the commonly used models:

![](./img/ch3/3-7.jpg)

### 3.1.3 How to choose a deep learning development platform?

The existing open-source deep learning platforms mainly include Caffe, PyTorch, MXNet, CNTK, Theano, TensorFlow, Keras, fastai and so on. So how do you choose a platform that suits you? Here are some criteria for reference.

**Reference 1: How easily it integrates with your existing programming platform and skills**

This mainly concerns the development experience and resources accumulated earlier, such as the programming language and the storage format of existing datasets.
+ +**Reference 2: Closeness of ecological integration with related machine learning and data processing** + +Deep learning research is inseparable from various software packages such as data processing, visualization, and statistical inference. Is there a convenient data preprocessing tool before considering modeling? After modeling, is there a convenient tool for visualization, statistical inference, and data analysis? + +**Reference 3: Requirements and support for data volume and hardware** + +Deep learning is not the same amount of data in different application scenarios, which leads us to consider the issues of distributed computing and multi-GPU computing. For example, people working on computer image processing often need to segment image files and computing tasks onto multiple computer nodes for execution. At present, each deep learning platform is developing rapidly, and each platform's support for distributed computing and other scenarios is also evolving. + +**Reference 4: The maturity of the deep learning platform** + +The maturity consideration is a more subjective consideration. These factors can include: the level of activity of the community; whether it is easy to communicate with developers; the momentum of current applications. + +**Reference 5: Is the diversity of platform utilization? ** + +Some platforms are specifically developed for deep learning research and applications. Some platforms have powerful optimizations for distributed computing, GPU and other architectures. Can you use these platforms/software to do other things? For example, some deep learning software can be used to solve quadratic optimization; some deep learning platforms are easily extended and used in reinforcement learning applications. + +### 3.1.4 Why use deep representation? + +1. Deep neural network is a feature-oriented learning algorithm. Shallow neurons learn some low-level simple features, such as edges and textures, directly from the input data. The deep features continue to learn more advanced features based on the shallow features that have been learned, and learn deep semantic information from a computer perspective. +2. The number of hidden cells in the deep network is relatively small, and the number of hidden layers is large. If the shallow network wants to achieve the same calculation result, the number of cells requiring exponential growth can be achieved. + +### 3.1.5 Why is deep neural network difficult to train? + + +1. Gradient Gradient + The disappearance of the gradient means that the gradient will become smaller and smaller as seen from the back and the front through the hidden layer, indicating that the learning of the front layer will be significantly slower than the learning of the latter layer, so the learning will get stuck unless the gradient becomes larger. + + The reason for the disappearance of the gradient is affected by many factors, such as the size of the learning rate, the initialization of the network parameters, and the edge effect of the activation function. In the deep neural network, the gradient calculated by each neuron is passed to the previous layer, and the gradient received by the shallower neurons is affected by all previous layer gradients. If the calculated gradient value is very small, as the number of layers increases, the obtained gradient update information will decay exponentially, and the gradient disappears. The figure below shows the learning rate of different hidden layers: + +![](./img/ch3/3-8.png) + +2. 
Exploding Gradient + In a network structure such as a deep network or a Recurrent Neural Network (RNN), gradients can accumulate in the process of network update, becoming a very large gradient, resulting in a large update of the network weight value, making the network unstable; In extreme cases, the weight value will even overflow and become a $NaN$ value, which cannot be updated anymore. + +3. Degeneration of the weight matrix results in a reduction in the effective degrees of freedom of the model. The degradation rate of learning in the parameter space is slowed down, which leads to the reduction of the effective dimension of the model. The available degrees of freedom of the network contribute to the gradient norm in learning. As the number of multiplication matrices (ie, network depth) increases, The product of the matrix becomes more and more degraded. In nonlinear networks with hard saturated boundaries (such as ReLU networks), as the depth increases, the degradation process becomes faster and faster. The visualization of this degradation process is shown in a 2014 paper by Duvenaud et al: + +![](./img/ch3/3-9.jpg) + +As the depth increases, the input space (shown in the upper left corner) is twisted into thinner and thinner filaments at each point in the input space, and only one direction orthogonal to the filament affects the response of the network. In this direction, the network is actually very sensitive to change. + +### 3.1.6 What is the difference between deep learning and machine learning? + +Machine learning: use computer, probability theory, statistics and other knowledge to input data and let the computer learn new knowledge. The process of machine learning is to train the data to optimize the objective function. + +Deep learning: It is a special machine learning with powerful capabilities and flexibility. It learns to represent the world as a nested hierarchy, each representation is associated with a simpler feature, and the abstract representation is used to compute a more abstract representation. + +Traditional machine learning needs to define some manual features to purposefully extract target information, relying heavily on task specificity and expert experience in designing features. Deep learning can learn simple features from big data, and gradually learn from the deeper features of more complex abstraction, independent of artificial feature engineering, which is also a major reason for deep learning in the era of big data. + + + +![](./img/ch3/3.1.6.1.png) + +![](./img/ch3/3-11.jpg) + +## 3.2 Network Operations and Calculations + +### 3.2.1 Forward Propagation and Back Propagation? + +There are two main types of neural network calculations: foward propagation (FP) acts on the input of each layer, and the output is obtained by layer-by-layer calculation; backward propagation (BP) acts on the output of the network. Calculate the gradient from deep to shallow to update the network parameters. + +** Forward Propagation** + +![](./img/ch3/3.2.1.1.png) + +Suppose the upper node $ i, j, k, ... $ and so on are connected to the node $ w $ of this layer, so what is the value of the node $ w $? That is, the weighting operation is performed by the nodes of $i, j, k, ... $ above and the corresponding connection weights, and the final result is added with an offset term (for simplicity in the figure) Finally, through a non-linear function (ie activation function), such as $ReLu $, $ sigmoid $ and other functions, the final result is the output of this layer node $ w $. 
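To make the computation of a single node concrete, here is a minimal NumPy sketch; the input values, weights, and bias below are made-up placeholders, not values from the figure.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Placeholder numbers: outputs of the upper-layer nodes i, j, k ...
inputs = np.array([0.5, -1.2, 3.0])
# ... and the connection weights and bias term of node w.
weights = np.array([0.4, 0.3, -0.1])
bias = 0.2

z = np.dot(weights, inputs) + bias   # weighted sum plus the bias term
output_w = relu(z)                   # nonlinear activation gives the output of node w
print(output_w)
```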
Finally, through this layer-by-layer computation, the results of the output layer are obtained.

**Backpropagation**

![](./img/ch3/3.2.1.2.png)

The final result of forward propagation always carries some error; taking classification as an example, how do we reduce that error? One widely used algorithm is gradient descent, which requires the partial derivatives of the error. Let us explain with symbols:

Let the final error be $ E $ and the activation function of the output layer be a linear activation function. Then the partial derivative of $ E $ with respect to the output node $ y_l $ is $ y_l - t_l $, where $ t_l $ is the true value; $ \frac{\partial y_l}{\partial z_l} $ is the derivative of the activation function mentioned above, and $ z_l $ is the weighted sum mentioned above. The partial derivative of $ E $ with respect to $ z_l $ in this layer is therefore $ \frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l} $. The next layer is calculated in the same way, except that the way $ \frac{\partial E}{\partial y_k} $ is computed changes; propagating back to the input layer, we finally get $ \frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i} $, with $ \frac{\partial z_j}{\partial x_i} = w_{ij} $. The weights are then adjusted using these derivatives, and the process of forward propagation and backpropagation is repeated until a good result is finally obtained.

### 3.2.2 How to calculate the output of the neural network?

![](./img/ch3/3.2.2.1.png)

As shown in the figure above, the input layer has three nodes, which we number 1, 2, and 3; the four nodes of the hidden layer are numbered 4, 5, 6, and 7; and the two nodes of the output layer are numbered 8 and 9. For example, node 4 of the hidden layer is connected to the three nodes 1, 2, and 3 of the input layer, and the weights on the connections are $ w_{41}, w_{42}, w_{43} $.

In order to calculate the output value of node 4, we must first obtain the output values of all its upstream nodes (ie nodes 1, 2, 3). Nodes 1, 2, and 3 are nodes of the input layer, so their output values are the input vector itself. According to the correspondence in the picture above, the output values of nodes 1, 2, and 3 are $ x_1, x_2, x_3 $, respectively.

$$
a_4 = \sigma(w^T \cdot a) = \sigma(w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + w_{4b})
$$

Where $ w_{4b} $ is the bias term of node 4.

Similarly, we can continue to calculate the output values of nodes 5, 6, and 7: $ a_5, a_6, a_7 $.

Calculate the output value $ y_1 $ of node 8 of the output layer:

$$
y_1 = \sigma(w^T \cdot a) = \sigma(w_{84}a_4 + w_{85}a_5 + w_{86}a_6 + w_{87}a_7 + w_{8b})
$$

Where $ w_{8b} $ is the bias term of node 8.

For the same reason, we can also calculate $ y_2 $. Now the output values of all the nodes in the output layer are calculated, and we obtain the output vector $ y_1, y_2 $ of the neural network for the input vector $ x_1, x_2, x_3 $. Here we also see that the output vector has the same number of dimensions as the output layer has neurons.

### 3.2.3 How to calculate the output value of a convolutional neural network?

Suppose there is a 5\*5 image, convolved with a 3\*3 filter, and we want a 3\*3 Feature Map, as shown below:

![](./img/ch3/3.2.3.1.png)

$ x_{i,j} $ represents the element in row $ i $ and column $ j $ of the image.
$ w_{m,n} $ represents the weight in row $ m $ and column $ n $ of the filter. $ w_b $ represents the bias term of the $filter$. $ a_{i,j} $ represents the element in row $ i $ and column $ j $ of the Feature Map. $f$ represents the activation function; here the $ReLU$ function is used as an example.

The convolution calculation formula is as follows:

$$
a_{i,j} = f(\sum_{m=0}^2 \sum_{n=0}^2 w_{m,n} x_{i+m, j+n} + w_b )
$$

When the stride is $1$, the Feature Map element $ a_{0,0} $ is calculated as follows:

$$
a_{0,0} = f(\sum_{m=0}^2 \sum_{n=0}^2 w_{m,n} x_{0+m, 0+n} + w_b ) \\
= relu(w_{0,0} x_{0,0} + w_{0,1} x_{0,1} + w_{0,2} x_{0,2} + w_{1,0} x_{1,0} + \\ w_{1,1} x_{1,1} + w_{1,2} x_{1,2} + w_{2,0} x_{2,0} + w_{2,1} x_{2,1} + w_{2,2} x_{2,2}) \\
= 1 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 \\
= 4
$$

The calculation process is illustrated as follows:

![](./img/ch3/3.2.3.2.png)

By analogy, all elements of the Feature Map are calculated.

![](./img/ch3/3.2.3.4.png)

When the stride is 2, the Feature Map is calculated as follows:

![](./img/ch3/3.2.3.5.png)

Note: the image size, the stride, and the size of the Feature Map after convolution are related. They satisfy the following relationship:

$$
W_2 = (W_1 - F + 2P)/S + 1\\
H_2 = (H_1 - F + 2P)/S + 1
$$

Where $ W_2 $ is the width of the Feature Map after convolution; $ W_1 $ is the width of the image before convolution; $ F $ is the width of the filter; $ P $ is the amount of Zero Padding, i.e. the number of rings of $0$s added around the original image (if $P$ is $1$, one ring of $0$s is added); $ S $ is the stride; $ H_2 $ is the height of the Feature Map after convolution; $ H_1 $ is the height of the image before convolution.

Example: suppose the image width is $ W_1 = 5 $, the filter width is $ F=3 $, Zero Padding is $ P=0 $, and the stride is $ S=2 $, then

$$
W_2 = (W_1 - F + 2P)/S + 1 = (5-3+0)/2 + 1 = 2
$$

The Feature Map width is 2. Similarly, we can also calculate that the Feature Map height is 2.

If the image depth before convolution is $ D $, then the corresponding filter depth must also be $ D $. The convolution formula for depth greater than 1 is:

$$
a_{i,j} = f(\sum_{d=0}^{D-1} \sum_{m=0}^{F-1} \sum_{n=0}^{F-1} w_{d,m,n} x_{d,i+m,j+n} + w_b)
$$

Where $D$ is the depth; $F$ is the size of the filter; $w_{d,m,n}$ represents the weight in layer $d$, row $m$, column $n$ of the filter; $ a_{d,i,j} $ represents the element in layer $d$, row $i$, column $j$ of the Feature Map; the other symbols have the same meanings as before and are not described again.

Each convolutional layer can have multiple filters. After each filter is convolved with the original image, you get one Feature Map. The depth (number) of the Feature Maps after convolution is the same as the number of filters in the convolutional layer. The following illustration shows the calculation of a convolutional layer with two filters: a $7*7*3$ input, convolved with two $3*3*3$ filters (with a stride of $2$), gives an output of $3*3*2$. The Zero Padding in the figure is $1$, that is, one ring of $0$s is added around the input.

![](./img/ch3/3.2.3.6.png)

The above is the calculation method of the convolutional layer. It embodies local connectivity and weight sharing: each neuron is connected only to a local region of the previous layer (according to the convolution calculation rule), and the same filter weights are used at every position of the previous layer.
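The convolution formula above can be written out directly in a few lines of NumPy. This is only a sketch for a single-channel input and a single filter; the image and filter values are random placeholders rather than the ones in the figures.

```python
import numpy as np

def conv2d(x, w, w_b, stride=1):
    """Apply one filter w (with bias w_b) to a single-channel image x, then ReLU."""
    F = w.shape[0]                            # filter size
    H2 = (x.shape[0] - F) // stride + 1       # output height: (H1 - F)/S + 1, with P = 0
    W2 = (x.shape[1] - F) // stride + 1       # output width:  (W1 - F)/S + 1, with P = 0
    out = np.zeros((H2, W2))
    for i in range(H2):
        for j in range(W2):
            patch = x[i*stride:i*stride+F, j*stride:j*stride+F]
            out[i, j] = np.sum(w * patch) + w_b
    return np.maximum(out, 0)                 # f = ReLU

image = np.random.randint(0, 2, size=(5, 5))   # a 5*5 image with 0/1 entries
kernel = np.random.randint(0, 2, size=(3, 3))  # a 3*3 filter
feature_map = conv2d(image, kernel, w_b=0, stride=1)
print(feature_map.shape)                       # (3, 3), matching (5 - 3 + 0)/1 + 1 = 3
```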
For a convolutional layer containing two $3 * 3 * 3 $ filters, the number of parameters is only $ (3 * 3 * 3+1) * 2 = 56 $, and this number is independent of the number of neurons in the previous layer. Compared to a fully connected neural network, the number of parameters is greatly reduced.

### 3.2.4 How to calculate the output value of the Pooling layer?

The main role of the Pooling layer is downsampling: it further reduces the number of parameters by removing unimportant samples from the Feature Map. There are many ways of pooling; the most common one is Max Pooling. Max Pooling takes the maximum value within each n\*n window as the sampled value. The figure below shows 2\*2 max pooling:

![](./img/ch3/3.2.4.1.png)

In addition to Max Pooling, Average Pooling is also commonly used - taking the average of each window.
For a Feature Map with a depth of $ D $, each layer is pooled independently, so the depth after Pooling is still $ D $.

### 3.2.5 Understanding Back Propagation with an example

A typical three-layer neural network is as follows:

![](./img/ch3/3.2.5.1.png)

Where Layer $ L_1 $ is the input layer, Layer $ L_2 $ is the hidden layer, and Layer $ L_3 $ is the output layer.

Assume the input dataset is $ D={x_1, x_2, ..., x_n} $ and the output dataset is $ y_1, y_2, ..., y_n $.

If the input and the output are the same, it is an autoencoder model. If the raw data is mapped to something else, an output different from the input is obtained.

Suppose we have the following network:

![](./img/ch3/3.2.5.2.png)

The input layer contains neurons $ i_1, i_2 $ and bias $ b_1 $; the hidden layer contains neurons $ h_1, h_2 $ and bias $ b_2 $; the output layer contains $ o_1, o_2 $. $ w_i $ is the weight of the connection between the layers, and the activation function is the $sigmoid$ function. The initial values of the above parameters are shown below:

![](./img/ch3/3.2.5.3.png)

where:

- Input data: $ i1=0.05, i2 = 0.10 $
- Target output data: $ o1=0.01, o2=0.99 $;
- Initial weights: $ w1=0.15, w2=0.20, w3=0.25, w4=0.30, w5=0.40, w6=0.45, w7=0.50, w8=0.55 $
- Goal: given the input data $ i1,i2 $ ($0.05$ and $0.10$), make the output as close as possible to the target output $ o1,o2 $ ($0.01$ and $0.99$).

**Forward Propagation**

1. Input layer --> hidden layer:

Calculate the weighted input of neuron $ h1 $:

$$
net_{h1} = w_1 * i_1 + w_2 * i_2 + b_1 * 1\\
net_{h1} = 0.15 * 0.05 + 0.2 * 0.1 + 0.35 * 1 = 0.3775
$$

The output of neuron $ h1 $ (the activation function used here is the sigmoid function):

$$
out_{h1} = \frac{1}{1 + e^{-net_{h1}}} = \frac{1}{1 + e^{-0.3775}} = 0.593269992
$$

Similarly, the output of neuron $ h2 $ can be calculated:

$$
out_{h2} = 0.596884378
$$


2. Hidden layer --> output layer:

Calculate the values of the output layer neurons $ o1 $ and $ o2 $:

$$
net_{o1} = w_5 * out_{h1} + w_6 * out_{h2} + b_2 * 1
$$

$$
net_{o1} = 0.4 * 0.593269992 + 0.45 * 0.596884378 + 0.6 * 1 = 1.105905967
$$

$$
out_{o1} = \frac{1}{1 + e^{-net_{o1}}} = \frac{1}{1 + e^{-1.105905967}} = 0.75136079
$$

The process of forward propagation is now over. We get an output of $ [0.75136079, 0.772928465] $, which is far from the target value of $ [0.01, 0.99] $. Now we propagate the error backward, update the weights, and recalculate the output.
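Before moving on to backpropagation, here is a short NumPy sketch that reproduces the forward pass above; the biases $b_1 = 0.35$ and $b_2 = 0.6$ are the values used in the calculations (read off the figure).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Initial values from the example above.
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Input layer --> hidden layer
net_h1 = w1 * i1 + w2 * i2 + b1                      # 0.3775
net_h2 = w3 * i1 + w4 * i2 + b1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)    # ~0.5933, ~0.5969

# Hidden layer --> output layer
net_o1 = w5 * out_h1 + w6 * out_h2 + b2              # ~1.1059
net_o2 = w7 * out_h1 + w8 * out_h2 + b2
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)    # ~0.7514, ~0.7729

print(out_o1, out_o2)   # compare with the target outputs 0.01 and 0.99
```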
+ +**Backpropagation ** + +Calculate the total error + +Total error: (Use Square Error here) + +$$ +E_{total} = \sum \frac{1}{2}(target - output)^2 +$$ + +But there are two outputs, so calculate the error of $ o1 $ and $ o2 $ respectively, the total error is the sum of the two: + +$E_{o1} = \frac{1}{2}(target_{o1} - out_{o1})^2 += \frac{1}{2}(0.01 - 0.75136507)^2 = 0.274811083$. + +$E_{o2} = 0.023560026$. + +$E_{total} = E_{o1} + E_{o2} = 0.274811083 + 0.023560026 = 0.298371109$. + + +2. Implicit layer --> Output layer weight update: + +Taking the weight parameter $ w5 $ as an example, if we want to know how much influence $ w5 $ has on the overall error, we can use the overall error to find the partial derivative of $ w5 $: (chain rule) + +$$ +\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial out_{o1}} * \frac{\partial out_{o1}}{\partial net_{ O1}} * \frac{\partial net_{o1}}{\partial w5} +$$ + +The following diagram can be more intuitive to see how the error propagates back: + +![](./img/ch3/3.2.5.4.png) + +### 3.2.6 What is the meaning of the neural network more "deep"? + +Premise: within a certain range. + +- In the case of the same number of neurons, the deep network structure has a larger capacity, and the layered combination brings an exponential expression space, which can be combined into more different types of substructures, which makes learning and representation easier. Various features. +- An increase in the hidden layer means that the number of nesting layers of the nonlinear transformation brought by the activation function is more, you can construct more complex mappings. + +## 3.3 Hyperparameters + +### 3.3.1 What is a hyperparameter? + +**Super-parameter**: For example, the learning rate in the algorithm, the iterations of the gradient descent method, the hidden layers, the number of hidden layer units, and the activation function are all required. The actual situation is set, these numbers actually control the last parameter and the value, so they are called hyperparameters. + +### 3.3.2 How to find the optimal value of the hyperparameter? + +There are always some difficult parameters to adjust when using machine learning algorithms. For example, weight attenuation size, Gaussian kernel width, and so on. These parameters require manual settings, and the set values ​​have a large impact on the results. Common methods for setting hyperparameters are: + +1. Guess and check: Select parameters based on experience or intuition, and iterate over. + +2. Grid Search: Let the computer try to evenly distribute a set of values ​​within a certain range. + +3. Random search: Let the computer randomly pick a set of values. + +4. Bayesian optimization: Using Bayesian optimization of hyperparameters, it is difficult to meet the Bayesian optimization algorithm itself. + +5. The MITIE method performs local optimization under the premise of good initial guessing. It uses the BOBYQA algorithm and has a carefully chosen starting point. Since BOBYQA only looks for the nearest local optimal solution, the success of this method depends largely on whether there is a good starting point. In the case of MITIE, we know a good starting point, but this is not a universal solution, because usually you won't know where the good starting point is. On the plus side, this approach is well suited to finding local optimal solutions. I will discuss this later. + +6. The latest proposed global optimization method for LIPO. 
This method has no parameters and is proven to be better than a random search method. + +### 3.3.3 Superparameter search general process? + +The general process of hyperparameter search: +1. Divide the data set into a training set, a validation set, and a test set. +2. Optimize the model parameters based on the performance indicators of the model on the training set. +3. Search the model's hyperparameters based on the model's performance metrics on the validation set. +4. Steps 2 and 3 alternately iteratively, finalizing the parameters and hyperparameters of the model, and verifying the pros and cons of the evaluation model in the test set. + +Among them, the search process requires a search algorithm, generally: grid search, random search, heuristic intelligent search, Bayesian search. + +## 3.4 Activation function + +### 3.4.1 Why do I need a nonlinear activation function? + +**Why do I need to activate the function? ** + +1. The activation function plays an important role in model learning and understanding very complex and nonlinear functions. +2. The activation function can introduce nonlinear factors. If the activation function is not used, the output signal is only a simple linear function. The linear function is a first-order polynomial. The complexity of the linear equation is limited, and the ability to learn complex function mapping from the data is small. Without an activation function, the neural network will not be able to learn and simulate other complex types of data, such as images, video, audio, speech, and so on. +3. The activation function can convert the current feature space to another space through a certain linear mapping, so that the data can be better classified. + +**Why does the activation function require a nonlinear function? ** + +1. If the network is all linear, the linear combination is linear, just like a single linear classifier. This makes it impossible to approximate arbitrary functions with nonlinearities. +2. Use a nonlinear activation function to make the network more powerful, increasing its ability to learn complex things, complex form data, and complex arbitrary function mappings that represent nonlinearities between input and output. A nonlinear activation function can be used to generate a nonlinear mapping from input to output. + +### 3.4.2 Common activation functions and images + +1. sigmoid activation function + + The function is defined as: $ f(x) = \frac{1}{1 + e^{-x}} $, whose value is $ (0,1) $. + + The function image is as follows: + +![](./img/ch3/3-26.png) + +2. tanh activation function + + The function is defined as: $ f(x) = tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $, the value range is $ (- 1,1) $. + + The function image is as follows: + +![](./img/ch3/3-27.png) + +3. Relu activation function + + The function is defined as: $ f(x) = max(0, x) $ , and the value field is $ [0,+∞) $; + + The function image is as follows: + +![](./img/ch3/3-28.png) + +4. Leak Relu activation function + + The function is defined as: $ f(x) = \left\{ + \begin{aligned} + Ax, \quad x<0 \\ + x, \quad x>0 + \end{aligned} + \right. $, the value field is $ (-∞, +∞) $. + + The image is as follows ($ a = 0.5 $): + +![](./img/ch3/3-29.png) + +5. SoftPlus activation function + + The function is defined as: $ f(x) = ln( 1 + e^x) $, and the value range is $ (0, +∞) $. + + The function image is as follows: + +![](./img/ch3/3-30.png) + +6. softmax function + + The function is defined as: $ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} $. 
Softmax is mostly used for the output of multi-class neural networks.

### 3.4.3 Derivatives of common activation functions

For common activation functions, the derivatives are as follows:

![](./img/ch3/3-31.png)

### 3.4.4 What are the properties of the activation function?

1. Nonlinearity: when the activation function is nonlinear, a two-layer neural network can approximate essentially all functions. But if the activation function is the identity activation function, ie $ f(x)=x $, this property is not satisfied: if an MLP uses the identity activation function, the entire network is equivalent to a single-layer neural network;
2. Differentiability: this property matters when the optimization method is gradient-based;
3. Monotonicity: when the activation function is monotonic, a single-layer network is guaranteed to be a convex function;
4. $ f(x)≈x $: when the activation function satisfies this property, if the parameters are initialized to random, small values, the training of the neural network will be very efficient; if this property is not met, the initial values need to be set carefully;
5. Range of output values: gradient-based optimization methods are more stable when the output values of the activation function are finite, because the representation of the features is affected more significantly by the finite weights; when the output of the activation function is infinite, model training can be more efficient, but in this case a smaller learning rate is generally required.

### 3.4.5 How to choose an activation function?

Choosing a suitable activation function is not easy. There are many factors to consider. Usually, if you are not sure which activation function works better, you can try them all and evaluate them on the validation set or test set, then use whichever one performs better.

The following are common choices:

1. If the output is a 0/1 value (binary classification), the output layer uses the sigmoid function, and all other units use the ReLU function.
2. If you are not sure which activation function to use in the hidden layers, the ReLU activation function is usually used. Sometimes the tanh activation function is also used; one advantage of ReLU is that its derivative is equal to 0 when the input is negative.
3. sigmoid activation function: basically not used, except when the output layer handles a binary classification problem.
4. tanh activation function: tanh is very good and suitable for almost all occasions.
5. ReLU activation function: the most commonly used default function. If you are not sure which activation function to use, use ReLU or Leaky ReLU and then try other activation functions.
6. If we encounter dead neurons, we can use the Leaky ReLU function.

### 3.4.6 What are the advantages of using the ReLU activation function?

1. Over a widely varying input range, the derivative (slope) of the ReLU activation function stays well above 0. In a program implementation it is just an if-else statement, whereas the sigmoid function requires floating-point exponentiation. In practice, neural networks using the ReLU activation function are usually faster to train than those using the sigmoid or tanh activation function.
2.
The derivatives of the sigmoid and tanh functions will have a gradient close to 0 in the positive and negative saturation regions, which will cause the gradient to diffuse, while the Relu and Leaky ReLu functions are more constant than the 0 part, and will not produce gradient dispersion. +3. It should be noted that when Relu enters the negative half, the gradient is 0, and the neurons are not trained at this time, resulting in so-called sparsity, which Leaky ReLu does not cause. + +### 3.4.7 When can I use the linear activation function? + +1. The output layer mostly uses a linear activation function. +2. Some linear activation functions may be used at the hidden layer. +3. There are very few linear activation functions commonly used. + +### 3.4.8 How to understand Relu (< 0) is a nonlinear activation function? + +The Relu activation function image is as follows: + +![](./img/ch3/3-32.png) + +According to the image, it can be seen that it has the following characteristics: + +Unilateral inhibition +2. A relatively broad excitement boundary; +3. Sparse activation; + +From the image, the ReLU function is a piecewise linear function that changes all negative values ​​to 0, while the positive values ​​are unchanged, thus becoming a one-sided suppression. + +Because of this unilateral inhibition, the neurons in the neural network also have sparse activation. + +**Sparse activation**: From the signal point of view, the neurons only selectively respond to a small part of the input signal at the same time, a large number of signals are deliberately shielded, which can improve the learning accuracy and extract better and faster. Sparse features. When $ x<0 $, ReLU is hard saturated, and when $ x>0 $, there is no saturation problem. ReLU is able to keep the gradient from decaying when $ x>0 $, thus alleviating the gradient disappearance problem. + +### 3.4.9 How does the Softmax function be applied to multiple classifications? + +Softmax is used in the multi-classification process. It maps the output of multiple neurons to the $ (0,1) $ interval, which can be understood as a probability to be multi-classified! + +Suppose we have an array, $ V_i $ represents the $ i $ element in $ V $ , then the softmax value of this element is + +$$ +S_i = \frac{e^{V_i}}{\sum_j e^{V_j}} +$$ + +From the following figure, the neural network contains the input layer, and then processed by two feature layers. Finally, the softmax analyzer can get the probability under different conditions. Here, it needs to be divided into three categories, and finally get $ y=0. , y=1, y=2 probability value of $. + +![](./img/ch3/3.4.9.1.png) + +Continuing with the picture below, the three inputs pass through softmax to get an array of $[0.05, 0.10, 0.85] $, which is the function of soft. + +![](./img/ch3/3.4.9.2.png) + +The more visual mapping process is shown below: + +![****](./img/ch3/3.4.9.3.png) + +In the case of softmax, the original output is $3,1,-3$, which is mapped to the value of $(0,1)$ by the softmax function, and the sum of these values ​​is $1 $( Satisfy the nature of the probability), then we can understand it as a probability, when we finally select the output node, we can select the node with the highest probability (that is, the value corresponds to the largest) as our prediction target! 
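A minimal NumPy sketch of the softmax mapping described above, applied to the example outputs $3, 1, -3$ (subtracting the maximum is just a common numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([3.0, 1.0, -3.0])
probs = softmax(scores)
print(probs)             # approximately [0.879, 0.119, 0.002]
print(probs.sum())       # 1.0 -- the outputs behave like probabilities
print(np.argmax(probs))  # 0 -- the node with the highest probability is the prediction
```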
### 3.4.10 Definition of the cross-entropy cost function and derivation of its derivative

(**Contributors: Huang Qinjian - South China University of Technology**)


The output of the neuron is $a = \sigma(z)$, where $z=\sum w_{j}i_{j}+b $ is the weighted sum of the inputs.

$C=-\frac{1}{n}\sum[y\ln a+(1-y)\ln(1-a)]$

Where n is the total number of training data, the summation is performed over all training inputs x, and y is the corresponding target output.

Whether this expression solves the problem of slow learning is not obvious. In fact, it is not even obvious that this definition can be regarded as a cost function! Before addressing slow learning, let's see why the cross entropy can be interpreted as a cost function.

There are two reasons for considering the cross entropy as a cost function.

First, it is non-negative, C > 0. All the individual terms in the summation are negative, because the arguments of the logarithms lie in the range (0,1), and there is a negative sign in front of the sum, so the result is non-negative.

Second, if the actual output of the neuron is close to the target value for all training inputs x, then the cross entropy will be close to zero.

Suppose, for example, y = 0 and a ≈ 0. This is the result we expect. We see that the first term in the formula vanishes because y = 0, and the second term is just − ln(1 −a) ≈ 0. The same holds for y = 1 and a ≈ 1. So the smaller the difference between the actual output and the target output, the lower the value of the cross entropy. (This assumes that the target output is either 0 or 1, as is the case in actual classification.)

In summary, the cross entropy is non-negative and approaches zero when the neuron achieves good accuracy. These are exactly the characteristics we want in a cost function. In fact, the quadratic cost function also has these characteristics. Therefore, the cross entropy is a good choice. But the cross-entropy cost function has one feature that is better than the quadratic cost function: it avoids the problem of slow learning. To make this clear, we calculate the partial derivative of the cross-entropy function with respect to the weights. We substitute $a={\varsigma}(z)$ into the formula and apply the chain rule twice to get:

$\begin{eqnarray}\frac{\partial C}{\partial w_{j}}&=&-\frac{1}{n}\sum \frac{\partial }{\partial w_{j}}[y\ln a+(1-y)\ln(1-a)]\\&=&-\frac{1}{n}\sum \frac{\partial }{\partial a}[y\ln a+(1-y)\ln(1-a)]*\frac{\partial a}{\partial w_{j}}\\&=&-\frac{1}{n}\sum (\frac{y}{a}-\frac{1-y}{1-a})*\frac{\partial a}{\partial w_{j}}\\&=&-\frac{1}{n}\sum (\frac{y}{\varsigma(z)}-\frac{1-y}{1-\varsigma(z)})\frac{\partial \varsigma(z)}{\partial w_{j}}\\&=&-\frac{1}{n}\sum (\frac{y}{\varsigma(z)}-\frac{1-y}{1-\varsigma(z)}){\varsigma}'(z)x_{j}\end{eqnarray}$

According to the definition $\varsigma(z)=\frac{1}{1+e^{-z}}$, after some algebra we get ${\varsigma}'(z)=\varsigma(z)(1-\varsigma(z))$. After simplification, we obtain:

$\frac{\partial C}{\partial w_{j}}=\frac{1}{n}\sum x_{j}({\varsigma}(z)-y)$

This is a beautiful formula. It tells us that the speed of weight learning is controlled by $\varsigma(z)-y$, that is, by the error in the output. The greater the error, the faster the learning. This is exactly what we intuitively expect. In particular, this cost function also avoids the slow learning caused by the ${\varsigma}'(z)$ term in the analogous formula for the quadratic cost function.
When we use cross entropy, ${\varsigma}'(z)$ is dropped, so we no longer need to care if it gets small. This addition is the special effect of cross entropy. In fact, this is not a very miraculous thing. As we will see later, cross entropy is actually just a choice to satisfy this characteristic. + +According to a similar approach, we can calculate the partial derivative of the bias. I will not give a detailed process here, you can easily verify: + +$\frac{\partial C}{\partial b}=\frac{1}{n}\sum ({\varsigma}(z)-y) $ + + +Again, this avoids slow learning caused by similar ${\varsigma}'(z)$ items in the quadratic cost function. + +### 3.4.11 Why is Tanh faster than Sigmoid? + +** (Contributor: Huang Qinjian - South China University of Technology)** + +$tanh^{,}(x)=1-tanh(x)^{2}\in (0,1) $ + +$s^{,}(x)=s(x)*(1-s(x))\in (0,\frac{1}{4}]$ + +It can be seen from the above two formulas that the problem of disappearing the tanh(x) gradient is lighter than sigmoid, so Tanh converges faster than Sigmoid. + +## 3.5 Batch_Size + +### 3.5.1 Why do I need Batch_Size? + +The choice of Batch, the first decision is the direction of the decline. + +If the data set is small, it can take the form of a full data set. The benefits are: + +1. The direction determined by the full data set better represents the sample population and is more accurately oriented in the direction of the extreme value. +2. Since the gradient values ​​of different weights are very different, it is difficult to select a global learning rate. Full Batch Learning can use Rprop to update each weight individually based on gradient symbols only. + +For larger data sets, if you use a full data set, the downside is: +1. With the massive growth of data sets and memory limitations, it is becoming increasingly infeasible to load all of the data at once. +2. Iteratively in the Rprop manner, due to the sampling difference between the batches, the gradient correction values ​​cancel each other and cannot be corrected. This was followed by a compromise with RMSProp. + +### 3.5.2 Selection of Batch_Size value + +If only one sample is trained at a time, Batch_Size = 1. The error surface of a linear neuron in the mean square error cost function is a paraboloid with an ellipse in cross section. For multi-layered neurons and nonlinear networks, the local approximation is still a paraboloid. At this time, each correction direction is corrected by the gradient direction of each sample, and the traverse is directly inconsistent, and it is difficult to achieve convergence. + +Since Batch_Size is a full data set or Batch_Size = 1 has its own shortcomings, can you choose a moderate Batch_Size value? + +At this time, a batch-grading learning method (Mini-batches Learning) can be employed. Because if the data set is sufficient, then the gradient calculated using half (or even much less) data training is almost the same as the gradient trained with all the data. + +### 3.5.3 What are the benefits of increasing Batch_Size within a reasonable range? + +1. The memory utilization is improved, and the parallelization efficiency of large matrix multiplication is improved. +2. The number of iterations required to complete an epoch (full data set) is reduced, and the processing speed for the same amount of data is further accelerated. +3. Within a certain range, generally, the larger the Batch_Size, the more accurate the determined direction of decline, resulting in less training shock. + +### 3.5.4 What is the disadvantage of blindly increasing Batch_Size? + +1. 
Memory utilization has increased, but memory capacity may not hold up. +2. The number of iterations required to complete an epoch (full data set) is reduced. To achieve the same accuracy, the time it takes is greatly increased, and the correction of the parameters becomes slower. +3. Batch_Size is increased to a certain extent, and its determined downward direction has not changed substantially. + +### 3.5.5 What is the effect of adjusting Batch_Size on the training effect? + +1. Batch_Size is too small, and the model performs extremely badly (error is soaring). +2. As Batch_Size increases, the faster the same amount of data is processed. +3. As Batch_Size increases, the number of epochs required to achieve the same accuracy is increasing. +4. Due to the contradiction between the above two factors, Batch_Size is increased to a certain time, and the time is optimal. +5. Since the final convergence accuracy will fall into different local extrema, Batch_Size will increase to some point and achieve the best convergence accuracy. + +## 3.6 Normalization + +### 3.6.1 What is the meaning of normalization? + +1. Inductive statistical distribution of unified samples. The normalized between $ 0-1 $ is the statistical probability distribution, normalized between $ -1--+1 $ is the statistical coordinate distribution. + +2. Whether for modeling or calculation, the basic unit of measurement is the same, the neural network is trained (probability calculation) and prediction based on the statistical probability of the sample in the event, and the value of the sigmoid function is 0 to 1. Between, the output of the last node of the network is also the same, so it is often necessary to normalize the output of the sample. + +3. Normalization is a statistical probability distribution unified between $ 0-1 $. When the input signals of all samples are positive, the weights connected to the first hidden layer neurons can only increase or decrease simultaneously. Small, which leads to slow learning. + +4. In addition, singular sample data often exists in the data. The network training time caused by the existence of singular sample data increases, and may cause the network to fail to converge. In order to avoid this situation and the convenience of data processing and speed up the network learning speed, the input signal can be normalized so that the input signals of all samples have an average value close to 0 or small compared with their mean square error. + +### 3.6.2 Why do you want to normalize? + +1. For the convenience of subsequent data processing, normalization can avoid some unnecessary numerical problems. +2. The convergence speeds up for the program to run. +3. The same dimension. The evaluation criteria of the sample data are different, and it is necessary to standardize and standardize the evaluation criteria. This is an application level requirement. +4. Avoid neuron saturation. What do you mean? That is, when the activation of a neuron is saturated near 0 or 1, in these regions, the gradient is almost zero, so that during the backpropagation, the local gradient will approach 0, which effectively "kills" the gradient. +5. Ensure that the value in the output data is small and not swallowed. + +### 3.6.3 Why can normalization improve the solution speed? + +![](./img/ch3/3.6.3.1.png) + +The above figure is the optimal solution finding process that represents whether the data is uniform (the circle can be understood as a contour). 
The left image shows the search process without normalization, and the right image shows the search process with normalization.

When the gradient descent method is used to find the optimal solution on un-normalized data, it is very likely to take a "Zigzag" route (perpendicular to the contour lines), which means many iterations are needed before convergence; in the right picture the two original features are normalized, the corresponding contour lines become very round, and gradient descent converges much faster.

Therefore, if a machine learning model uses the gradient descent method to find the optimal solution, normalization is often necessary, otherwise it is difficult to converge, or convergence may even fail.

### 3.6.4 3D illustration of not normalizing

Example:

Suppose $w_1$ ranges over $[-10, 10]$, while $w_2$ ranges over $[-100, 100]$, and the gradient advances by 1 unit each time. Then each step covers $1/20$ of the range in the $w_1$ direction, but only $1/200$ of the range in the $w_2$ direction! In a sense, the step taken along $w_2$ is relatively smaller, and $w_1$ "walks" faster than $w_2$ during the search.

This biases the search path toward the $w_1$ direction, producing an "L"-shaped or "Zigzag" path.

![](./img/ch3/3-37.png)

### 3.6.5 What types of normalization are there?

1. Linear normalization

$$
x^{\prime} = \frac{x-min(x)}{max(x) - min(x)}
$$

Scope of application: more suitable for cases where the values are relatively concentrated.

Disadvantage: if max and min are unstable, the normalization result is easily made unstable, which makes the subsequent use unstable as well.

2. Standard deviation standardization

$$
x^{\prime} = \frac{x-\mu}{\sigma}
$$

Meaning: the processed data conforms to the standard normal distribution, ie the mean is 0 and the standard deviation is 1, where $ \mu $ is the mean of all sample data and $ \sigma $ is the standard deviation of all sample data.

3. Nonlinear normalization

Scope of application: often used in scenarios where the data are highly spread out, with some values very large and some very small. The original values are mapped by some mathematical function, such as $ log $, exponential, tangent, and so on.

### 3.6.6 Local response normalization

LRN (Local Response Normalization) is a technique for improving accuracy in deep learning. It is generally applied after the activation and pooling layers.

The LRN layer was proposed in AlexNet. It creates a competition mechanism for the activity of local neurons, making relatively large responses even larger while suppressing neurons with smaller feedback, which enhances the generalization ability of the model.

### 3.6.7 Understanding local response normalization

Local response normalization mimics the inhibition that biologically active neurons exert on neighboring neurons (lateral inhibition). The formula is as follows:

$$
b_{x,y}^i = a_{x,y}^i / (k + \alpha \sum_{j=max(0, i-n/2)}^{min(N-1, i+n/2)}(a_{x,y}^j)^2 )^\beta
$$

where:
1) $ a $: the output of the convolutional layer (after the convolution operation and the pooling operation), which is a four-dimensional array [batch, height, width, channel].

- batch: number of batches (one image per batch).
- height: the height of the image.
- width: the width of the image.
- channel: number of channels.
It can be understood as the number of neurons (channels) output after a batch of images passes through the convolution operation, i.e. the depth of the processed image.

2) $ a_{x,y}^i $ refers to one position in this output structure, $ [a,b,c,d] $: a point at a certain height and width under a certain channel of a certain image, namely the point at height $ b $ and width $ c $ under the $ d $-th channel of the $ a $-th image.

3) $ N $: $ N $ in the paper's formula indicates the number of channels.

4) $ a $, $ n/2 $, $ k $ respectively correspond to the input, depth_radius, and bias arguments of the function. The parameters $ k, n, \alpha, \beta $ are all hyperparameters, generally set as $ k=2, n=5, \alpha=1e{-4}, \beta=0.75 $.

5) $ \sum $: the summation runs along the channel direction, that is, the sum of squares of the values is taken along the 3rd dimension (the channel dimension) of $ a $: for a given point, we sum the squares of the values at the same position in the preceding $ n/2 $ channels (down to channel $0$ at the minimum) and the following $ n/2 $ channels (up to channel $ d-1 $ at the maximum), a total of up to $ n+1 $ points. The English annotation of the function also states that the input is treated as $ d $ 3-dimensional matrices; to put it plainly, the number of channels of the input is taken as the number of 3D matrices, and the summation direction is also along the channel direction.

A simple diagram is as follows:

![](./img/ch3/3.6.7.1.png)

### 3.6.8 What is Batch Normalization?

In the past, during neural network training only the input layer data was normalized; the intermediate layers were not. Note that although we normalize the input data, after the input data goes through matrix multiplications such as $ \sigma(WX+b) $ and nonlinear operations, its distribution is likely to change, and after the many layers of a deep network, the distribution changes more and more. If we could also normalize in the middle of the network, would it improve the training of the network? The answer is yes.

This method of normalizing the intermediate layers of a neural network to make training work better is Batch Normalization (BN).

### 3.6.9 Advantages of the Batch Normalization (BN) algorithm

The advantages of the BN algorithm:
1. It reduces the need for manually chosen parameters. In some cases, dropout and the L2 regularization parameter can be removed, or a smaller L2 regularization constraint can be used;
2. It reduces the requirements on the learning rate. We can use a large initial learning rate or choose a smaller one, and the algorithm still converges quickly;
3. Local response normalization is no longer needed. BN itself is a normalization layer (local response normalization was used in the AlexNet network);
4. It perturbs the original data distribution and alleviates over-fitting to some extent (preventing particular samples from being frequently selected in each batch of training; the literature reports that this can improve accuracy by 1%).
5. It reduces gradient vanishing, speeds up convergence, and improves training accuracy.

### 3.6.10 Batch normalization (BN) algorithm flow

The BN algorithm during training proceeds as follows.

Input: the output of the previous layer $ X = {x_1, x_2, ..., x_m} $, learnable parameters $ \gamma, \beta $

Algorithm flow:

1.
Calculate the mean of the output data of the previous layer:

$$
\mu_{\beta} = \frac{1}{m} \sum_{i=1}^m(x_i)
$$

Where $ m $ is the size of the training batch.

2. Calculate the variance of the output data of the previous layer:

$$
\sigma_{\beta}^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_{\beta})^2
$$

3. Normalize to get

$$
\hat x_i = \frac{x_i - \mu_{\beta}}{\sqrt{\sigma_{\beta}^2 + \epsilon}}
$$

Where $ \epsilon $ is a small value close to 0, added to avoid a zero denominator.

4. Reconstruction: scale and shift the normalized data to obtain

$$
y_i = \gamma \hat x_i + \beta
$$

where $ \gamma, \beta $ are learnable parameters.

Note: the above is the BN procedure during training. At inference time, however, there is often just a single sample, so there is no batch mean $ \mu_{\beta} $ or variance $ \sigma_{\beta}^2 $. In that case, the mean $ \mu_{\beta} $ is obtained by averaging the $ \mu_{\beta} $ values of all training batches, and the variance $ \sigma_{\beta}^2 $ is obtained as the unbiased estimate from the $ \sigma_{\beta}^2 $ of each training batch.

### 3.6.11 Comparison of batch normalization and group normalization

| Name | Features |
| ---- | :---- |
| Batch Normalization (BN) | BN makes it possible to train various networks in parallel. However, normalizing along the batch dimension can cause problems: when the batch becomes small, the batch statistics are estimated inaccurately and the BN error increases rapidly. When training large networks and transferring features to computer vision tasks (including detection, segmentation, and video), memory consumption forces the use of small batches, which limits the use of BN. |
| Group Normalization (GN) | GN divides the channels into groups and computes the mean and variance for normalization within each group. GN's computation is independent of the batch size, and its accuracy is stable over a wide range of batch sizes. |
| Comparison | On ResNet-50 trained on ImageNet, GN with a batch size of 2 has an error rate 10.6% lower than that of BN; with typical batch sizes, GN is comparable to BN and superior to other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. In object detection and segmentation on COCO and video classification on Kinetics, GN can outperform its competitors, showing that GN can effectively replace the powerful BN in a variety of tasks. |

### 3.6.12 Comparison of Weight Normalization and Batch Normalization

Both Weight Normalization and Batch Normalization are methods of parameter rewriting, but they are applied differently.

Weight Normalization normalizes the network weights $ W $, hence the name Weight Normalization;

Batch Normalization normalizes the input data of a layer of the network.

Weight Normalization has the following three advantages over Batch Normalization:

1. Weight Normalization accelerates the convergence of deep network parameters by rewriting the network weights $W$, without introducing a dependency on the mini-batch, so it is applicable to RNN (LSTM) networks (Batch Normalization cannot be directly applied to RNNs, for two reasons: 1) the sequences processed by an RNN are of variable length; 2) RNN computation is based on time steps.
If you use Batch Normalization directly, you need to save the mean and variance of mini btach under each time step, which is inefficient and takes up memory. . + +2. Batch Normalization calculates the mean and variance based on a mini batch's data, rather than based on the entire training set, which is equivalent to introducing a gradient calculation to introduce noise. Therefore, Batch Normalization is not suitable for noise-sensitive reinforcement learning and generation models (Generative model: GAN, VAE). In contrast, Weight Normalization rewrites the weight $W$ by scalar $g$ and vector $v$, and the rewrite vector $v$ is fixed, so Normalization based on Weight Normalization can be seen as less introduction than Batch Normalization The noise. + +3. No additional storage space is required to preserve the mean and variance of the mini batch. At the same time, when Weight Normalization is implemented, the additional computational overhead caused by the forward signal propagation and the inverse gradient calculation of the deep learning network is also small. Therefore, it is faster than normalization with Batch Normalization. However, Weight Normalization does not have Batch Normalization to fix the output Y of each layer of the network in a range of variation. Therefore, special attention should be paid to the initial value selection of parameters when using Normal Normalization. + +### 3.6.13 When is Batch Normalization suitable? + +** (Contributor: Huang Qinjian - South China University of Technology)** + +In CNN, BN should act before nonlinear mapping. BN can be tried when the neural network training encounters a slow convergence rate, or a situation such as a gradient explosion that cannot be trained. In addition, in general use, BN can also be added to speed up the training and improve the accuracy of the model. + +The applicable scenario of BN is: each mini-batch is relatively large, and the data distribution is relatively close. Before doing the training, you should do a good shuffle, otherwise the effect will be much worse. In addition, since BN needs to count the first-order statistics and second-order statistics of each mini-batch during operation, it is not applicable to dynamic network structures and RNN networks. + + +## 3.7 Pre-training and fine tuning + +### 3.7.1 Why can unsupervised pre-training help deep learning? +There is a problem with the deep network: + +1. The deeper the network, the more training samples are needed. If supervision is used, a large number of samples need to be labeled, otherwise small-scale samples are likely to cause over-fitting. There are many characteristics of deep network, and there are many multi-feature problems, such as multi-sample problem, regularization problem and feature selection problem. + +2. Multi-layer neural network parameter optimization is a high-order non-convex optimization problem, and often obtains a local solution with poor convergence; + +3. Gradient diffusion problem, the gradient calculated by the BP algorithm drops significantly with the depth forward, resulting in little contribution to the previous network parameters and slow update speed. + +**Solution:** + +Layer-by-layer greedy training, unsupervised pre-training is the first hidden layer of the training network, and then the second one is trained... Finally, these trained network parameter values ​​are used as the initial values ​​of the overall network parameters. + +After pre-training, a better local optimal solution can be obtained. 
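Below is a minimal PyTorch-style sketch of this layer-by-layer greedy pretraining idea. The layer sizes, epoch count, and random data are placeholder assumptions; each layer is pretrained as a simple autoencoder and its weights are then reused as the initial values of the full network.

```python
import torch
import torch.nn as nn

def pretrain_layer(encoder, data, epochs=10, lr=1e-3):
    """Unsupervised pretraining of one layer as a simple autoencoder (illustrative)."""
    decoder = nn.Linear(encoder.out_features, encoder.in_features)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        recon = decoder(torch.relu(encoder(data)))
        loss = ((recon - data) ** 2).mean()   # reconstruction error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder

x = torch.randn(256, 784)                       # stand-in for unlabeled training data
enc1 = pretrain_layer(nn.Linear(784, 256), x)   # pretrain the first hidden layer
h1 = torch.relu(enc1(x)).detach()
enc2 = pretrain_layer(nn.Linear(256, 64), h1)   # then the second, on layer-1 features

# Use the pretrained weights as the initial values of the whole network,
# then continue supervised training (fine-tuning) on labeled data.
model = nn.Sequential(enc1, nn.ReLU(), enc2, nn.ReLU(), nn.Linear(64, 10))
```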
### 3.7.2 What is model fine-tuning?

Taking a network trained by someone else (its pre-trained parameters), adapting the network to your own data, and continuing training so that the parameters fit your data: this process is usually called fine-tuning.

**Example of fine-tuning a model:**

CNNs have made great progress in image recognition. If we want to apply a CNN to our own dataset, we usually face a problem: our dataset is not particularly large, generally no more than ten thousand images and often far fewer, with only a few dozen images per class. Directly training a network from scratch on such data is not feasible, because a key factor in the success of deep learning is a training set consisting of a large amount of labeled data. With only the data at hand, even a very good network structure cannot reach high performance. Fine-tuning solves this problem well: we take a model trained on ImageNet (such as CaffeNet, VGGNet or ResNet), fine-tune it, and apply it to our own dataset.

### 3.7.3 Are the network parameters updated during fine-tuning?

Answer: yes, they are updated.

1. The finetune process is equivalent to continuing training; the difference from training from scratch lies in the initialization.
2. Training from scratch is initialized in the way specified by the network definition.
3. Finetune is initialized with a parameter file you already have.

### 3.7.4 Three states of the fine-tuned model

1. State 1: only predict, do not train.
Features: relatively fast and simple; efficient for projects where the model has already been trained and is now used to label unknown data.

2. State 2: train, but only train the final classification layer.
Features: the final classification layer is adapted to the required categories and only that layer is retrained on top of the pre-trained features (a sketch of this state is shown after this list).

3. State 3: full training, classification layer plus the preceding convolution layers.
Features: the difference from state 2 is small. State 3 is more time-consuming and requires GPU resources, but it is well suited for adapting the model to your own task, and the prediction accuracy improves compared with state 2.
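As a concrete illustration of state 2, here is a small sketch assuming PyTorch and torchvision are available; the choice of ResNet-18, the number of classes and the optimizer settings are illustrative assumptions.

```python
import torch.nn as nn
import torch.optim as optim
import torchvision

# Load a model pre-trained on ImageNet and freeze all of its parameters.
model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer with one matching our own classes;
# a freshly constructed layer has requires_grad=True by default.
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new classifier's parameters are given to the optimizer (state 2).
optimizer = optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Training loop sketch: for images, labels in loader:
#     loss = criterion(model(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
# For state 3, unfreeze some or all convolutional layers (usually with a
# smaller learning rate) and include their parameters in the optimizer.
```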
## 3.8 Weight and Bias Initialization

### 3.8.1 All initialized to 0

**Initialization trap**: initializing all weights to 0.

**Why it is a trap**: we do not know the final value of each weight in the trained network, but if the data is properly normalized we can reasonably assume that roughly half of the weights will end up positive and half negative. Initializing all weights to 0 makes every neuron compute the same output, so in back-propagation every neuron computes the same gradient and receives the same parameter update. More generally, if the weights are all initialized to the same value, the network is symmetric.

**A visual way to understand it**: think of gradient descent as hiking, but inside a straight valley with symmetric peaks on both sides. Because of the symmetry, the gradient at your position can only point along the valley, never toward the peaks; after you take a step, the situation is unchanged. The result is that you can only converge to an extremum inside the valley and never reach the peaks on either side.

### 3.8.2 All initialized to the same value

Initialization trap: all weights initialized to the same value.
Take a three-layer network as an example.
First look at the structure:

![](./img/ch3/3.8.2.1.png)

Its expressions are:

$$
a_1^{(2)} = f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})
$$

$$
a_2^{(2)} = f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})
$$

$$
a_3^{(2)} = f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})
$$

$$
h_{W,b}(x) = a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})
$$

If all the weights are equal, then in a multi-layer network the activations within each layer are identical from the second layer onward, that is $a_1 = a_2 = a_3 = \dots$; since they are all the same, the layer effectively behaves as if it had a single input.

In back-propagation, the partial derivatives with respect to the weights and the bias terms in each iteration are

$$
\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) = a_j^{(l)} \delta_i^{(l+1)}
$$

$$
\frac{\partial}{\partial b_{i}^{(l)}} J(W,b;x,y) = \delta_i^{(l+1)}
$$

where $\delta$ is computed as

$$
\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} \right) f^{\prime}(z_i^{(l)})
$$

If the sigmoid function is used,

$$
f^{\prime}(z_i^{(l)}) = a_i^{(l)}(1-a_i^{(l)})
$$

Substituting the last two formulas into the first two, we see that all the partial derivatives are the same; every gradient-descent iteration therefore keeps the weights equal to one another, and in the end all weights (and biases) converge to the same values.

### 3.8.3 Initializing to small random numbers

Initializing the weights to very small random numbers is a common way to break network symmetry. The idea is that the neurons are random and unique at the start, so they compute different updates and integrate themselves into different parts of the network. An implementation of a weight matrix might look like $W = 0.01 * np.random.randn(D,H)$, where randn samples from a zero-mean, unit-standard-deviation Gaussian. With this formula, the weight vector of each neuron is initialized as a random vector sampled from a multi-dimensional Gaussian, so each neuron initially points in a random direction in the input space. Small numbers drawn from a uniform distribution can also be used; in practice this seems to have little effect on the final performance.

Note: smaller is not always better. For example, if the weights of a layer are very small, back-propagation will compute very small gradients through it (because the gradient is proportional to the weight values). The "gradient signal" is then greatly attenuated as it propagates backward through the network, which can become a real concern in deep networks.
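The symmetry problem of Sections 3.8.1 and 3.8.2, and how small random initialization breaks it, can be seen in a few lines of NumPy. The toy one-hidden-layer network, its sizes and its squared loss are illustrative assumptions, not part of the text above.

```python
import numpy as np

def hidden_grads(W1, x, y):
    """Gradient of a squared loss w.r.t. the first-layer weights of a tiny
    one-hidden-layer network with sigmoid units and a summing output layer."""
    h = 1.0 / (1.0 + np.exp(-x @ W1))        # hidden activations, shape (1, H)
    y_hat = h.sum()                          # output weights fixed to 1 for simplicity
    dh = (y_hat - y) * h * (1 - h)           # back-prop through the sigmoid
    return x.T @ dh                          # gradient w.r.t. W1, shape (D, H)

D, H = 3, 4
x = np.random.randn(1, D)
y = 1.0

W_same = np.full((D, H), 0.5)                # every weight identical
W_rand = 0.01 * np.random.randn(D, H)        # small random initialization

g_same = hidden_grads(W_same, x, y)
g_rand = hidden_grads(W_rand, x, y)

print(np.allclose(g_same[:, 0:1], g_same))   # True: every hidden unit gets the same gradient
print(np.allclose(g_rand[:, 0:1], g_rand))   # False: symmetry is broken
```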
### 3.8.4 Calibrating the variance with $1/\sqrt n$

One problem with the suggestion above is that the variance of a randomly initialized neuron's output grows with the number of inputs. It turns out that we can normalize the output variance of each neuron to 1 by dividing its weight vector by the square root of its fan-in (i.e., the number of inputs). The recommended heuristic is therefore to initialize the weight vector of each neuron as $w = np.random.randn(n)/\sqrt n$, where $n$ is the number of inputs. This ensures that all neurons in the network initially have roughly the same output distribution, and empirically it improves the rate of convergence.

### 3.8.5 Sparse Initialization

Another way to deal with the uncalibrated-variance problem is to set all weight matrices to zero but, in order to break the symmetry, randomly connect every neuron to a fixed number of neurons in the layer below, with weights drawn from a small Gaussian as described above. A typical number of connections per neuron can be as small as ten.

### 3.8.6 Initializing the biases

It is possible, and very common, to initialize the biases to zero, since symmetry breaking is already provided by the small random weights. Because of the ReLU nonlinearity, some people like to set all biases to a small constant such as 0.01, which ensures that every ReLU unit fires at the very beginning and therefore receives and propagates some gradient. However, it is not clear that this provides a consistent improvement (some results indicate that it actually hurts performance), so it is more common to simply initialize the biases to 0.

## 3.9 Softmax

### 3.9.1 Softmax definition and role

Softmax is a function of the form

$$
P(i) = \frac{\exp(\theta_i^T x)}{\sum_{k=1}^{K} \exp(\theta_k^T x)}
$$

where $\theta_i$ and $x$ are column vectors, and $\theta_i^T x$ may be replaced by a more general function of $x$, $f_i(x)$.

The softmax function makes $P(i)$ fall in the range $[0,1]$. In regression and classification problems, $\theta$ is usually the parameter to be estimated, and the $\theta_i$ that maximizes $P(i)$ is taken as the best parameter.

There are many other ways to map values into $[0,1]$, so why put the exponential function of $e$ in front? Consider the logistic function for reference:

$$
P(i) = \frac{1}{1+\exp(-\theta_i^T x)}
$$

This function pushes $P(i)$ toward 0 on the interval from negative infinity to 0, and toward 1 on the interval from 0 to positive infinity. Similarly, the exponential in the softmax function serves to polarize the outputs: the result for the positive class approaches 1 and the results for the negative classes approach 0. This is convenient for multi-class problems ($P(i)$ can be read as the probability that a sample belongs to class $i$). The softmax function can be seen as a generalization of the logistic function.

The softmax function takes its inputs, commonly referred to as logits or logit scores, squashes them to values between 0 and 1, and normalizes the outputs so that they sum to 1. The softmax output is therefore equivalent to a categorical probability distribution, which makes it the natural output activation function for multi-class prediction.
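A short NumPy sketch of the function is shown below, with the usual max-subtraction trick for numerical stability (which does not change the result); the example logits are made up.

```python
import numpy as np

def softmax(logits):
    """Map a vector of logits to a probability distribution."""
    z = logits - np.max(logits)    # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())          # approximately [0.659 0.242 0.099], sums to 1.0
```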
### 3.9.2 Softmax Derivation

## 3.10 Understanding the principle and function of One-Hot Encoding?

The origin of the problem:

In many **machine learning** tasks, features are not always continuous values; they may be categorical values.

For example, consider the following three features:

```
["male", "female"]
["from Europe", "from US", "from Asia"]
["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]
```

If these features are represented by numbers, processing becomes much more efficient. For example:

```
["male", "from US", "uses Internet Explorer"] is expressed as [0, 1, 3]
["female", "from Asia", "uses Chrome"] is expressed as [1, 2, 1]
```

However, even after conversion to numbers, this data cannot be fed directly to a classifier, because classifiers tend to assume that data is continuous (so that distances can be computed) and ordered. According to the representation above, the numbers are not ordered; they are just randomly assigned labels.

**One-Hot encoding**

To solve this problem, one possible solution is One-Hot Encoding. One-Hot encoding, also known as one-bit effective encoding, uses an N-bit status register to encode N states; each state has its own register bit, and at any time only one of them is active.

For example:

```
The natural status code is: 000, 001, 010, 011, 100, 101
The one-hot code is:        000001, 000010, 000100, 001000, 010000, 100000
```

In other words, for each feature with m possible values, one-hot encoding turns it into m binary features (for example, a score feature with values good, medium and poor becomes 100, 010, 001). These features are mutually exclusive, with only one active at a time, so the data becomes sparse.

The main benefits of this are (a small example follows the list):

1. It solves the problem that classifiers do not handle categorical attributes well;
2. To a certain extent, it also expands the feature space.
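A minimal NumPy illustration of the idea, using made-up integer category labels; libraries such as scikit-learn's OneHotEncoder or pandas' get_dummies provide the same functionality out of the box.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer category labels into one-hot row vectors."""
    return np.eye(num_classes, dtype=int)[labels]

# Browser feature with 4 possible values:
# 0=Firefox, 1=Chrome, 2=Safari, 3=Internet Explorer
browsers = np.array([3, 1])    # "uses Internet Explorer", "uses Chrome"
print(one_hot(browsers, 4))
# [[0 0 0 1]
#  [0 1 0 0]]
```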
## 3.11 What are the commonly used optimizers?

Listed separately:

```
Optimizer:
tf.train.GradientDescentOptimizer
tf.train.AdadeltaOptimizer
tf.train.AdagradOptimizer
tf.train.AdagradDAOptimizer
tf.train.MomentumOptimizer
tf.train.AdamOptimizer
tf.train.FtrlOptimizer
tf.train.ProximalGradientDescentOptimizer
tf.train.ProximalAdagradOptimizer
tf.train.RMSPropOptimizer
```

## 3.12 Dropout series questions

### 3.12.1 Why do you want to regularize?

Deep learning may suffer from over-fitting, i.e. high variance. There are two solutions: one is regularization, the other is to prepare more data. More data is a very reliable method, but you may not always be able to prepare enough training data, or the cost of getting more data may be high. Regularization, in contrast, usually helps to avoid over-fitting or to reduce network error, so if you suspect that the network is over-fitting the data, regularization is often the first method to try.

### 3.12.2 Why is regularization helpful in preventing overfitting?

![](./img/ch3/3.12.2.1.png)
![](./img/ch3/3.12.2.2.png)

The left picture shows high bias, the right picture shows high variance, and the middle one is just right, as we saw earlier.

### 3.12.3 Understanding dropout regularization

Dropout randomly removes neural units from the network. Why can it have such a large regularizing effect?

Intuitive understanding: a unit cannot rely on any single feature, because any of its inputs may be dropped at any time, so it is pushed to spread a little weight over each of its inputs rather than betting heavily on one. Spreading the weights in this way has the effect of shrinking the squared norm of the weights, similar to the L2 regularization mentioned earlier; implementing dropout therefore compresses the weights and performs a kind of outer regularization that helps prevent over-fitting. (L2 decays different weights differently, depending on the size of the activations being multiplied.)

### 3.12.4 Choice of the dropout rate

1. Cross-validation shows that a dropout rate of 0.5 for the hidden nodes works best, because a rate of 0.5 generates the largest number of random network structures.
2. Dropout can also be used as a way of adding noise directly on the input layer; there the keep probability is set closer to 1 (for example 0.8) so that the input does not change too much.
3. Max-norm regularization of the weights $w$ during training is very useful for dropout training.
4. The spherical radius $c$ of the max-norm constraint is a hyperparameter that can be tuned on a validation set.
5. Dropout by itself works well, but combining dropout with max-norm regularization, a large decaying learning rate and high momentum works even better; for example, max-norm regularization prevents the parameter blow-up that a large learning rate can cause.
6. Pre-training can also help dropout training; when using dropout, multiply all pre-trained parameters by $1/p$.

### 3.12.5 What are the disadvantages of dropout?

One major drawback of dropout is that the cost function J is no longer well defined. Since some nodes are randomly removed in every iteration, it becomes hard to check the behaviour of gradient descent: the well-defined cost J that should drop after every iteration is no longer really defined, or at least is hard to compute, so we lose the debugging tool of plotting its curve. A common practice is to turn dropout off first, set keep-prob to 1, and run the code to make sure that J decreases monotonically; then turn dropout back on, trusting that no bug was introduced while dropout is active. Other checking methods can also be tried alongside dropout, although there are no statistics on how well they perform.
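The common "inverted dropout" implementation can be sketched in a few lines of NumPy; keep_prob, the layer shape and the scaling by 1/keep_prob are the standard ingredients, while the concrete numbers are made up.

```python
import numpy as np

def dropout_forward(a, keep_prob=0.5, train=True):
    """Inverted dropout applied to an activation matrix a."""
    if not train:
        return a                                      # at test time the layer is unchanged
    mask = (np.random.rand(*a.shape) < keep_prob)     # keep each unit with probability keep_prob
    return a * mask / keep_prob                       # scale up so the expected value is preserved

a = np.random.randn(4, 5)
print(dropout_forward(a, keep_prob=0.8))              # about 20% of entries zeroed, rest scaled by 1/0.8
```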
## 3.13 Data Augmentation commonly used in deep learning?

**(Contributor: Huang Qinjian - South China University of Technology)**

- Color Jittering: color-space augmentation: changes of image brightness, saturation and contrast (whether this interpretation of color jitter is exact is uncertain);

- PCA Jittering: first compute the mean and standard deviation of the RGB channels, then compute the covariance matrix over the whole training set and perform an eigendecomposition; the resulting eigenvectors and eigenvalues are used for PCA jittering;

- Random Scale: scale transformation;

- Random Crop: crop and rescale the image using random interpolation; this includes the scale jittering method (used by the VGG and ResNet models) and scale/aspect-ratio augmentation;

- Horizontal/Vertical Flip: horizontal/vertical flipping;

- Shift: translation;

- Rotation/Reflection: rotation/affine transformation;

- Noise: Gaussian noise, blurring;

- Label Shuffle: augmentation for class-imbalanced data.

## 3.14 How to understand Internal Covariate Shift?

**(Contributor: Huang Qinjian - South China University of Technology)**

Why is training deep neural network models difficult? One important reason is that a deep network stacks many layers, and the parameter update of each layer changes the distribution of the inputs to the layer above it. As these changes accumulate layer by layer, the input distribution of the upper layers shifts very sharply, forcing the upper layers to constantly re-adapt to the parameter updates of the lower layers. To train the model, we have to set the learning rate and the initial weights very carefully and use elaborate parameter-update strategies.

Google summarizes this phenomenon as Internal Covariate Shift, or ICS for short. What is ICS?

A classic assumption in statistical machine learning is that the data distributions of the source domain and the target domain are consistent. If they are inconsistent, new machine learning problems arise, such as transfer learning / domain adaptation. Covariate shift is a sub-problem under the inconsistent-distribution setting: the conditional probabilities of the source space and the target space are the same, but their marginal probabilities differ.

Indeed, for the output of each layer of a neural network, after the in-layer operations its distribution is clearly different from the distribution of that layer's input signal, and the difference grows as the network gets deeper, yet the sample labels they "describe" remain unchanged, which matches the definition of covariate shift. Because the analysis concerns the signals between layers, it is the "internal" part of the name.

**What problems does ICS cause?**

In short, the input data of each neuron is no longer "independently and identically distributed".

First, the upper-layer parameters must constantly adapt to the new input data distribution, which slows down learning.

Second, the inputs of the lower layers may drift toward very large or very small values, pushing the upper layers into the saturation region of the activation function so that learning stops prematurely.

Third, the update of every layer affects all the other layers, so the parameter-update strategy of each layer has to be as cautious as possible.

## 3.15 When do you use local-conv? When do you use full convolution?
+ +**(Contributor: Liang Zhicheng - Meizu Technology)** + +1. When the data set has a global local feature distribution, that is to say, there is a strong correlation between the local features, suitable for full convolution. + +2. When there are different feature distributions in different areas, it is suitable to use local-Conv. diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-1.png new file mode 100644 index 0000000..5b065bc Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-10.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-10.jpg new file mode 100644 index 0000000..526dfa3 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-10.jpg differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-11.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-11.jpg new file mode 100644 index 0000000..d671beb Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-11.jpg differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-12.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-12.png new file mode 100644 index 0000000..af6b11a Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-12.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-13.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-13.png new file mode 100644 index 0000000..6d8799c Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-13.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-14.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-14.png new file mode 100644 index 0000000..124af84 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-14.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-15.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-15.png new file mode 100644 index 0000000..a32940e Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-15.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-16.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-16.png new file mode 100644 index 0000000..856802f Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-16.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-17.gif b/English version/ch03_DeepLearningFoundation/img/ch3/3-17.gif new file mode 100644 index 0000000..ba665c6 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-17.gif differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-18.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-18.png new file mode 100644 index 0000000..c17ecad Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-18.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-19.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-19.png new file mode 100644 index 0000000..cd20e85 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-19.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-2.png new file mode 100644 index 
0000000..3dbddcb Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-2.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-20.gif b/English version/ch03_DeepLearningFoundation/img/ch3/3-20.gif new file mode 100644 index 0000000..308e93d Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-20.gif differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-21.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-21.png new file mode 100644 index 0000000..c4403e7 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-21.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-22.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-22.png new file mode 100644 index 0000000..05be668 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-22.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-23.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-23.png new file mode 100644 index 0000000..2e4c9b2 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-23.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-24.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-24.png new file mode 100644 index 0000000..fbec6c0 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-24.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-25.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-25.png new file mode 100644 index 0000000..cd2f64c Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-25.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-26.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-26.png new file mode 100644 index 0000000..9d67734 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-26.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-27.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-27.png new file mode 100644 index 0000000..9678e96 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-27.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-28.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-28.png new file mode 100644 index 0000000..ea2a6f9 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-28.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-29.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-29.png new file mode 100644 index 0000000..1d589b6 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-29.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-3.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-3.png new file mode 100644 index 0000000..c7178b5 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-3.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-30.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-30.png new file mode 100644 index 0000000..0296411 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-30.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-31.png b/English 
version/ch03_DeepLearningFoundation/img/ch3/3-31.png new file mode 100644 index 0000000..4c757ad Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-31.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-32.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-32.png new file mode 100644 index 0000000..ea2a6f9 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-32.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-33.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-33.png new file mode 100644 index 0000000..9798bf8 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-33.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-34.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-34.png new file mode 100644 index 0000000..71edbcc Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-34.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-35.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-35.jpg new file mode 100644 index 0000000..5bb87c9 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-35.jpg differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-36.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-36.png new file mode 100644 index 0000000..5b90a4e Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-36.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-37.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-37.png new file mode 100644 index 0000000..087ae7b Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-37.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-38.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-38.png new file mode 100644 index 0000000..731c159 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-38.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-39.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-39.jpg new file mode 100644 index 0000000..50fe309 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-39.jpg differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-4.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-4.png new file mode 100644 index 0000000..a6f322c Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-4.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-40.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-40.png new file mode 100644 index 0000000..2303937 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-40.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-41.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-41.png new file mode 100644 index 0000000..a16e265 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-41.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-5.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-5.png new file mode 100644 index 0000000..d094c2c Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-5.png differ diff --git a/English 
version/ch03_DeepLearningFoundation/img/ch3/3-6.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-6.png new file mode 100644 index 0000000..10ac6d4 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-6.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-7.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-7.jpg new file mode 100644 index 0000000..3cdfc2d Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-7.jpg differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-8.png b/English version/ch03_DeepLearningFoundation/img/ch3/3-8.png new file mode 100644 index 0000000..dd13a29 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-8.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3-9.jpg b/English version/ch03_DeepLearningFoundation/img/ch3/3-9.jpg new file mode 100644 index 0000000..49d4e00 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3-9.jpg differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.5.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.5.png new file mode 100644 index 0000000..dde211c Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.5.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.6.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.6.png new file mode 100644 index 0000000..1bd7697 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.1.6.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.1.6.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.6.1.png new file mode 100644 index 0000000..e0ce77d Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.1.6.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.1.png new file mode 100644 index 0000000..34f2d8b Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.2.png new file mode 100644 index 0000000..ad3d4bd Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.12.2.2.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.1.png new file mode 100644 index 0000000..160cf46 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.2.png new file mode 100644 index 0000000..fec7b5b Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.1.2.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.2.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.2.1.png new file mode 100644 index 0000000..02dbca0 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.2.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.1.png new file mode 100644 index 
0000000..1c4dacd Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.2.png new file mode 100644 index 0000000..09f4d9e Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.2.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.4.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.4.png new file mode 100644 index 0000000..03f1c6a Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.4.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.5.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.5.png new file mode 100644 index 0000000..56443fd Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.5.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.6.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.6.png new file mode 100644 index 0000000..bda9c5c Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.3.6.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.4.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.4.1.png new file mode 100644 index 0000000..f91e8ab Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.4.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.1.png new file mode 100644 index 0000000..1236163 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.2.png new file mode 100644 index 0000000..c36f3bf Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.2.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.3.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.3.png new file mode 100644 index 0000000..d53c8ea Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.3.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.4.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.4.png new file mode 100644 index 0000000..f00cca4 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.2.5.4.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.1.png new file mode 100644 index 0000000..3e126ba Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.2.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.2.png new file mode 100644 index 0000000..ea9b0c3 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.2.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.3.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.3.png new file mode 100644 index 0000000..fe1d777 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.4.9.3.png 
differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.6.3.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.6.3.1.png new file mode 100644 index 0000000..d091389 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.6.3.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.6.7.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.6.7.1.png new file mode 100644 index 0000000..4b98c00 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.6.7.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/img/ch3/3.8.2.1.png b/English version/ch03_DeepLearningFoundation/img/ch3/3.8.2.1.png new file mode 100644 index 0000000..1236163 Binary files /dev/null and b/English version/ch03_DeepLearningFoundation/img/ch3/3.8.2.1.png differ diff --git a/English version/ch03_DeepLearningFoundation/readme.md b/English version/ch03_DeepLearningFoundation/readme.md new file mode 100644 index 0000000..26ad0b9 --- /dev/null +++ b/English version/ch03_DeepLearningFoundation/readme.md @@ -0,0 +1,14 @@ +########################################################### + +\### Deep Learning 500 Questions - First * Chapter xxx + +**Responsible person (in no particular order): ** +Xxx graduate student-xxx(xxx) +Xxx doctoral student-xxx +Xxx-xxx + + +**Contributors (in no particular order): ** +Content contributors can add information + +########################################################### \ No newline at end of file diff --git a/English version/ch04_ClassicNetwork/ChapterIV_ClassicNetwork.md b/English version/ch04_ClassicNetwork/ChapterIV_ClassicNetwork.md new file mode 100644 index 0000000..f5c39d4 --- /dev/null +++ b/English version/ch04_ClassicNetwork/ChapterIV_ClassicNetwork.md @@ -0,0 +1,293 @@ +[TOC] + +# Chapter 4 Classic Network +## 4.1 LeNet-5 + +### 4.1.1 Introduction to the model + +LeNet-5 is a Convolutional Neural Network (CNN) $^{[1]}$ proposed by $LeCun$ for recognizing handwritten digits and machine-printed characters. The name is derived from the author $LeCun$ The name, 5 is the code name for its research, and LeNet-4 and LeNet-1 were little known before LeNet-5. LeNet-5 illustrates that the correlation between pixel features in an image can be extracted by a convolution operation shared by parameters, and a combination of convolution, downsampling (pooling), and nonlinear mapping is currently popular. The basis of most depth image recognition networks. + +### 4.1.2 Model structure + +![](./img/ch4/LeNet-5.jpg) + +Figure 4.1 LeNet-5 network structure + +As shown in Figure 4.1, LeNet-5 consists of 7 layers (the input layer is not used as the network structure), which consists of 2 convolution layers, 2 downsampling layers and 3 connection layers. The parameter configuration of the network is shown in Table 4.1. The kernel size of the downsampling layer and the fully connected layer respectively represents the sampling range and the size of the connection matrix (eg, "5\times5\times1/1,6" in the convolution kernel size indicates that the kernel size is $5\times5 \times1$, convolution kernel with a step size of $1$ and a core number of 6.) 
+ +Table 4.1 LeNet-5 Network Parameter Configuration + +| Network Layer | Input Size | Core Size | Output Size | Trainable Parameter Quantity | +| :-------------: | :------------------: | :----------- -----------: | :------------------: | :--------------- --------------: | +| Convolutional Layer $C_1$ | $32\times32\times1$ | $5\times5\times1/1,6$ | $28\times28\times6$ | $(5\times5\times1+1)\times6$ | +| Downsampling layer $S_2$ | $28\times28\times6$ | $2\times2/2$ | $14\times14\times6$ | $(1+1)\times6$ $^*$ | +| Convolutional Layer $C_3$ | $14\times14\times6$ | $5\times5\times6/1,16$ | $10\times10\times16$ | $1516^*$ | +| Downsampling layer $S_4$ | $10\times10\times16$ | $2\times2/2$ | $5\times5\times16$ | $(1+1)\times16$ | +| Convolutional Layer $C_5$$^*$ | $5\times5\times16$ | $5\times5\times16/1,120$ | $1\times1\times120$ | $(5\times5\times16+1)\times120$ | +| Full Connect Layer $F_6$ | $1\times1\times120$ | $120\times84$ | $1\times1\times84$ | $(120+1)\times84$ | +| Output Layer | $1\times1\times84$ | $84\times10$ | $1\times1\times10$ | $(84+1)\times10$ | + +> $^*$ In LeNet, the downsampling operation is similar to the pooling operation, but after multiplying the sampled result by a coefficient and adding an offset term, the number of parameters sampled below is $(1+1) )\times6$ instead of zero. +> +> $^*$ $C_3$ Convolutional layer training parameters are not directly connected to all feature maps in $S_2$, but are connected using the sampling feature as shown in Figure 4.2 (sparse connection). The generated 16 channel feature maps are mapped according to the adjacent three feature maps, the adjacent four feature maps, the non-adjacent four feature maps, and all six feature maps. The calculated number of parameters is calculated as $6\ Times(25\times3+1)+6\times(25\times4+1)+3\times(25\times4+1)+1\times(25\times6+1)=1516$, explained in the original paper There are two reasons for using this sampling method: the number of connections is not too large (the computing power of the current year is weak); forcing the combination of different feature maps can make the mapped feature maps learn different feature patterns. + +![FeatureMap](./img/ch4/featureMap.jpg) + +Figure 4.2 Sparse connection between feature maps between $S_2$ and $C_3$ + +> $^*$ $C_5$ The convolutional layer is shown in Figure 4.1 as a fully connected layer. The original paper explains that the convolution operation is actually used here, but the size is compressed to $1 just after the $5\times5$ convolution. Times1$, the output looks very similar to a full connection. + +### 4.1.3 Model Features +- The convolutional network uses a three-layer sequence combination: convolution, downsampling (pooling), and non-linear mapping (the most important feature of LeNet-5, which forms the basis of the current deep convolutional network) +- Extract spatial features using convolution +- Downsampling using the mapped spatial mean +- Non-linear mapping using $tanh$ or $sigmoid$ +- Multilayer Neural Network (MLP) as the final classifier +- Sparse connection matrix between layers to avoid huge computational overhead + +## 4.2 AlexNet + +### 4.2.1 Introduction to the model + +AlexNet is the first deep convolutional neural network applied to image classification by $Alex$ $Krizhevsky $, which was 15.3% top-5 in the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Competition) image classification competition. The test error rate won the first place $^{[2]}$. 
AlexNet uses GPU instead of CPU to make the model structure more complex in an acceptable time range. Its appearance proves the effectiveness of deep convolutional neural networks in complex models, making CNN popular in computer vision. Directly or indirectly triggered a wave of deep learning. + +### 4.2.2 Model structure + +![](./img/ch4/alexnet.png) + +Figure 4.3 AlexNet network structure + +As shown in Figure 4.3, except for downsampling (pooling layer) and Local Responsible Normalization (LRN), AlexNet consists of 8 layers. The first 5 layers are composed of convolution layers, while the remaining 3 layers are all. Connection layer. The network structure is divided into upper and lower layers, corresponding to the operation process of the two GPUs, except for some layers in the middle ($C_3$ convolutional layer and $F_{6-8}$ full connection layer will have GPU interaction), other Layer two GPUs calculate the results separately. The output of the last layer of the fully connected layer is input as $softmax$, and the probability values ​​corresponding to 1000 image classification labels are obtained. Excluding the design of the GPU parallel structure, the AlexNet network structure is very similar to that of LeNet. The parameter configuration of the network is shown in Table 4.2. + +Table 4.2 AlexNet Network Parameter Configuration + +| Network Layer | Input Size | Core Size | Output Size | Trainable Parameter Quantity | +| :-------------------: | :-------------------------- --------: | :--------------------------------------: | :----------------------------------: | :----------- --------------------------: | +| Convolutional Layer $C_1$ $^*$ | $224\times224\times3$ | $11\times11\times3/4,48(\times2_{GPU})$ | $55\times55\times48(\times2_{GPU})$ | $(11\times11\times3+1)\times48\times2$ | +| Downsampling layer $S_{max}$$^*$ | $55\times55\times48(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $27\times27\times48(\ Times2_{GPU})$ | 0 | +| Convolutional Layer $C_2$ | $27\times27\times48(\times2_{GPU})$ | $5\times5\times48/1,128(\times2_{GPU})$ | $27\times27\times128(\times2_{GPU}) $ | $(5\times5\times48+1)\times128\times2$ | +| Downsampling layer $S_{max}$ | $27\times27\times128(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $13\times13\times128(\times2_{GPU} )$ | 0 | +| Convolutional layer $C_3$ $^*$ | $13\times13\times128\times2_{GPU}$ | $3\times3\times256/1,192(\times2_{GPU})$ | $13\times13\times192(\times2_{GPU })$ | $(3\times3\times256+1)\times192\times2$ | +| Convolutional layer $C_4$ | $13\times13\times192(\times2_{GPU})$ | $3\times3\times192/1,192(\times2_{GPU})$ | $13\times13\times192(\times2_{GPU}) $ | $(3\times3\times192+1)\times192\times2$ | +| Convolutional layer $C_5$ | $13\times13\times192(\times2_{GPU})$ | $3\times3\times192/1,128(\times2_{GPU})$ | $13\times13\times128(\times2_{GPU}) $ | $(3\times3\times192+1)\timEs128\times2$ | +| Downsampling layer $S_{max}$ | $13\times13\times128(\times2_{GPU})$ | $3\times3/2(\times2_{GPU})$ | $6\times6\times128(\times2_{GPU} )$ | 0 | +| Fully connected layer $F_6$ $^*$ | $6\times6\times128\times2_{GPU}$ | $9216\times2048(\times2_{GPU})$ | $1\times1\times2048(\times2_{GPU})$ | $(9216+1)\times2048\times2$ | +| Fully connected layer $F_7$ | $1\times1\times2048\times2_{GPU}$ | $4096\times2048(\times2_{GPU})$ | $1\times1\times2048(\times2_{GPU})$ | $(4096+ 1)\times2048\times2$ | +| Full Connect Layer $F_8$ | $1\times1\times2048\times2_{GPU}$ | $4096\times1000$ | $1\times1\times1000$ | $(4096+1)\times1000\times2$ | + +>Convolution layer $C_1$ 
input image data of $224\times224\times3$, respectively, after convolution convolution of $11\times11\times3$ and stride of 4 in two GPUs. Get two separate output data for $55\times55\times48$. +> +The downsampling layer $S_{max}$ is actually the largest pooling operation nested in the convolution, but is separately listed to distinguish the convolutional layers that do not have the largest pooling. After the pooling operation in the $C_{1-2}$ convolutional layer (before the ReLU activation operation), there is also an LRN operation for normalization of adjacent feature points. +> +The input of the convolutional layer $C_3$ is different from other convolutional layers. $13\times13\times192\times2_{GPU}$ means that the output of the upper layer network on the two GPUs is collected as input, so the convolution is performed. The convolution kernel dimension on the channel during operation is 384. +> +The input data size of the fully connected layer $F_{6-8}$ is similar to $C_3$, which is an input that combines the output of two GPU flows. + + +### 4.2.3 Model Features +- All convolutional layers use ReLU as a nonlinear mapping function to make the model converge faster +- Training of models on multiple GPUs not only improves the training speed of the model, but also increases the scale of data usage. +- Normalize the local features using LRN, and as a result the input to the ReLU activation function can effectively reduce the error rate +- Overlapping max pooling, ie the pooling range z has a relationship with the step size s $z>s$ (eg $S\{max}$ in the kernel scale is $3\times3/2$), avoiding average pooling Average effect of (average pooling) +- Use random dropout to selectively ignore individual neurons in training to avoid overfitting of the model + + + +## 4.3 ZFNet +### 4.3.1 Introduction to the model + +ZFNet is a large convolutional network based on AlexNet from $Matthew$ $D. Zeiler$ and $Rob$ $Fergus$. In the 2013 ILSVRC image classification competition, it won the championship with an error rate of 11.19% (actually the original ZFNet The team is not a real champion. The original ZFNet ranked 8th with a 13.51% error rate. The real champion is the $Clarifai$ team, and the CEO of a startup company corresponding to $Clarifai$ is $Zeiler$. And $Clarifai$ has a relatively small change to ZFNet, so it is generally considered that ZFNet won the championship) $^{[3-4]}$. ZFNet is actually fine-tuning AlexNet, and visualizes the output features of each layer by means of deconvolution, further explaining why convolution operations are significant in large networks. + +### 4.3.2 Model structure + +![](./img/ch4/image21.jpeg) + +Figure 4.4 ZFNet network structure diagram (original structure diagram and AlexNet style structure diagram) + +As shown in Figure 4.4, ZFNet is similar to AlexNet. It is a convolutional neural network consisting of 8 layers of networks, including 5 layers of convolutional layers and 3 layers of fully connected layers. The biggest difference between the two network architectures is that the ZFNet first-layer convolution replaces the convolution of the first-order convolution kernel $11\times11\times3/4$ in AlexNet with a convolution kernel of $7\times7\times3/2$. nuclear. 
In Figure 4.5, ZFNet contains more intermediate frequency information in the feature map of the first layer output than AlexNet, while the characteristic map of the first layer output of AlexNet is mostly low frequency or high frequency information, and the lack of intermediate frequency features leads to The characteristics of the subsequent network level as shown in Figure 4.5(c) are not detailed enough, and the root cause of this problem is that the convolution kernel and step size adopted by AlexNet in the first layer are too large. + +![](./img/ch4/zfnet-layer1.png) + +![](./img/ch4/zfnet-layer2.png) + +Figure 4.5 (a) Characteristic map of the first layer output of ZFNet (b) Characteristic map of the first layer output of AlexNet (c) Characteristic map of the output of the second layer of AlexNet (d) Characteristic map of the output of the second layer of ZFNet + +Table 4.3 ZFNet Network Parameter Configuration +| Network Layer | Input Size | Core Size | Output Size | Trainable Parameter Quantity | +| :-------------------: | :-------------------------- --------: | :--------------------------------------: | :----------------------------------: | :----------- --------------------------: | +| Convolutional Layer $C_1$ $^*$ | $224\times224\times3$ | $7\times7\times3/2,96$ | $110\times110\times96$ | $(7\times7\times3+1)\times96$ | +| Downsampling layer $S_{max}$ | $110\times110\times96$ | $3\times3/2$ | $55\times55\times96$ | 0 | +| Convolutional Layer $C_2$ $^*$ | $55\times55\times96$ | $5\times5\times96/2,256$ | $26\times26\times256$ | $(5\times5\times96+1)\times256$ | +| Downsampling layer $S_{max}$ | $26\times26\times256$ | $3\times3/2$ | $13\times13\times256$ | 0 | +| Convolutional Layer $C_3$ | $13\times13\times256$ | $3\times3\times256/1,384$ | $13\times13\times384$ | $(3\times3\times256+1)\times384$ | +| Convolutional layer $C_4$ | $13\times13\times384$ | $3\times3\times384/1,384$ | $13\times13\times384$ | $(3\times3\times384+1)\times384$ | +| Convolutional layer $C_5$ | $13\times13\times384$ | $3\times3\times384/1,256$ | $13\times13\times256$ | $(3\times3\times384+1)\times256$ | +| Downsampling layer $S_{max}$ | $13\times13\times256$ | $3\times3/2$ | $6\times6\times256$ | 0 | +| Full Connect Layer $F_6$ | $6\times6\times256$ | $9216\times4096$ | $1\times1\times4096$ | $(9216+1)\times4096$ | +| Full Connect Layer $F_7$ | $1\times1\times4096$ | $4096\times4096$ | $1\times1\times4096$ | $(4096+1)\times4096$ | +| Full Connect Layer $F_8$ | $1\times1\times4096$ | $4096\times1000$ | $1\times1\times1000$ | $(4096+1)\times1000$ | +> Convolutional layer $C_1$ is different from $C_1$ in AlexNet, using $7\times7\times3/2$ convolution kernel instead of $11\times11\times3/4 $ to make the first layer convolution output The result can include more medium frequency features, providing more choices for a diverse set of features in subsequent network layers, facilitating the capture of more detailed features. +> +> The convolutional layer $C_2$ uses a convolution kernel of step size 2, which is different from the convolution kernel step size of $C_2$ in AlexNet, so the output dimensions are different. + +### 4.3.3 Model Features + +ZFNet and AlexNet are almost identical in structure. Although this part belongs to the model characteristics, it should be accurately the contribution of visualization technology in the original ZFNet paper. + +- Visualization techniques reveal individual feature maps for each layer in the excitation model. 
+- Visualization techniques allow observation of the evolution of features during the training phase and the diagnosis of potential problems with the model. +- Visualization technology uses a multi-layer deconvolution network that returns to the input pixel space by feature activation. +- Visualization techniques perform sensitivity analysis of the classifier output by revealing that part of the input image reveals which part is important for classification. +- The visualization technique provides a non-parametric invariance to show which piece of the training set activates which feature map, not only the cropped input image, but also a top-down projection to expose a feature map from each block. +- Visualization techniques rely on deconvolution operations, the inverse of convolution operations, to map features onto pixels. + +## 4.4 Network in Network + +### 4.4.1 Introduction to the model +NetwThe ork In Network (NIN) was proposed by $Min Lin$ et al. to achieve the best level at the time of the CIFAR-10 and CIFAR-100 classification tasks, as its network structure was made up of three multi-layered perceptron stacks. NIN$^{[5]}$. NIN examines the convolution kernel design in convolutional neural networks from a new perspective, and replaces the linear mapping part in pure convolution by introducing subnetwork structures. This form of network structure stimulates more complex convolutional neural networks. The structural design of GoogLeNet's Inception structure introduced in the next section is derived from this idea. + +### 4.4.2 Model Structure + +![](./img/ch4/image23.jpeg) +Figure 4.6 NIN network structure + +NIN consists of three layers of multi-layer perceptual convolutional layer (MLPConv Layer). Each layer of multi-layer perceptual convolutional layer is composed of several layers of local fully connected layers and nonlinear activation functions instead of traditional convolutional layers. Linear convolution kernel used. In network inference, the multi-layer perceptron calculates the local features of the input feature map, and the weights of the products corresponding to the local feature maps of each window are shared. The convolution operation is completely consistent, the biggest difference is that the multilayer perceptron performs a nonlinear mapping of local features, while the traditional convolution method is linear. NIN's network parameter configuration table 4.4 is shown (the original paper does not give the network parameters, the parameters in the table are the compiler combined network structure diagram and CIFAR-100 data set with $3\times3$ convolution as an example). 
Table 4.4 NIN network parameter configuration (compiled from the NIN structure in the original paper and the CIFAR-100 dataset)

| Network Layer | Input Size | Kernel Size | Output Size | Number of Parameters |
| :---: | :---: | :---: | :---: | :---: |
| Local fully connected layer $L_{11}$ $^*$ | $32\times32\times3$ | $(3\times3)\times16/1$ | $30\times30\times16$ | $(3\times3\times3+1)\times16$ |
| Fully connected layer $L_{12}$ $^*$ | $30\times30\times16$ | $16\times16$ | $30\times30\times16$ | $(16+1)\times16$ |
| Local fully connected layer $L_{21}$ | $30\times30\times16$ | $(3\times3)\times64/1$ | $28\times28\times64$ | $(3\times3\times16+1)\times64$ |
| Fully connected layer $L_{22}$ | $28\times28\times64$ | $64\times64$ | $28\times28\times64$ | $(64+1)\times64$ |
| Local fully connected layer $L_{31}$ | $28\times28\times64$ | $(3\times3)\times100/1$ | $26\times26\times100$ | $(3\times3\times64+1)\times100$ |
| Fully connected layer $L_{32}$ | $26\times26\times100$ | $100\times100$ | $26\times26\times100$ | $(100+1)\times100$ |
| Global average pooling $GAP$ $^*$ | $26\times26\times100$ | $26\times26\times100/1$ | $1\times1\times100$ | $0$ |

> The local fully connected layer $L_{11}$ is in effect a windowed fully connected operation on the original input image, so the windowed output feature size is $30\times30$ ($\frac{32-3}{1}+1=30$).
>
> The fully connected layer $L_{12}$ is a fully connected operation that immediately follows $L_{11}$; its input is the activated local response features produced by the windowing step, so it only needs to connect the nodes of $L_{11}$ to those of $L_{12}$. Each local fully connected layer together with its following fully connected layer forms one multi-layer perceptron convolutional layer (MLPConv), which takes the place of a convolution operation.
>
> The global average pooling layer $GAP$ (Global Average Pooling) applies a global average pooling operation to each feature map output by $L_{32}$, directly producing the final per-category scores, which effectively reduces the number of parameters.

### 4.4.3 Model Features

- Using a multi-layer perceptron structure instead of plain convolutional filtering not only reduces the over-parameterization caused by a large number of convolution kernels, but also improves the abstraction ability of the model by introducing a nonlinear mapping.
- Using global average pooling instead of the final fully connected layer effectively reduces the number of parameters (the pooling layer has no trainable parameters). Because pooling uses the information of the entire feature map, it is also more robust to spatial transformations, and each resulting output can be used directly as the confidence score for the corresponding category.

## 4.5 VGGNet

### 4.5.1 Introduction to the model

VGGNet is a deep convolutional network structure proposed by the Visual Geometry Group (VGG) of Oxford University. It finished as runner-up in the 2014 ILSVRC classification task with a 7.32% error rate (the champion, GoogLeNet, reached 6.65%) and took first place in the localization task with a 25.32% error rate (GoogLeNet's error rate was 26.44%)$^{[6]}$; the network name VGGNet is taken from the abbreviation of the group name.
VGGNet was the first model to reduce the image classification error rate below 10%, and its idea of building the network from $3\times3$ convolution kernels became the basis of many later models. The paper, published at the 2015 International Conference on Learning Representations (ICLR), has since been cited more than 14,000 times.

### 4.5.2 Model Structure

![](./img/ch4/vgg16.png)

Figure 4.7 VGG16 network structure

The original paper presents six variants of VGGNet: VGG11, VGG11-LRN, VGG13, VGG16-1, VGG16-3, and VGG19. The numeric suffix indicates the number of weight layers (VGG11-LRN denotes the VGG11 variant that applies LRN in the first layer; VGG16-1 indicates that the last convolutional layer of each of the last three convolutional blocks uses a $1\times1$ kernel, while the corresponding VGG16-3 uses $3\times3$ kernels). The VGG16 described in this section is VGG16-3. The VGG16 in Figure 4.7 embodies the core idea of VGGNet: replacing large convolutions with combinations of $3\times3$ convolutions (two stacked $3\times3$ convolutions have the same receptive field as one $5\times5$ convolution). The network parameter settings are shown in Table 4.5.

Table 4.5 VGG16 network parameter configuration

| Network Layer | Input Size | Kernel Size | Output Size | Number of Parameters |
| :---: | :---: | :---: | :---: | :---: |
| Convolutional layer $C_{11}$ | $224\times224\times3$ | $3\times3\times64/1$ | $224\times224\times64$ | $(3\times3\times3+1)\times64$ |
| Convolutional layer $C_{12}$ | $224\times224\times64$ | $3\times3\times64/1$ | $224\times224\times64$ | $(3\times3\times64+1)\times64$ |
| Downsampling layer $S_{max1}$ | $224\times224\times64$ | $2\times2/2$ | $112\times112\times64$ | $0$ |
| Convolutional layer $C_{21}$ | $112\times112\times64$ | $3\times3\times128/1$ | $112\times112\times128$ | $(3\times3\times64+1)\times128$ |
| Convolutional layer $C_{22}$ | $112\times112\times128$ | $3\times3\times128/1$ | $112\times112\times128$ | $(3\times3\times128+1)\times128$ |
| Downsampling layer $S_{max2}$ | $112\times112\times128$ | $2\times2/2$ | $56\times56\times128$ | $0$ |
| Convolutional layer $C_{31}$ | $56\times56\times128$ | $3\times3\times256/1$ | $56\times56\times256$ | $(3\times3\times128+1)\times256$ |
| Convolutional layer $C_{32}$ | $56\times56\times256$ | $3\times3\times256/1$ | $56\times56\times256$ | $(3\times3\times256+1)\times256$ |
| Convolutional layer $C_{33}$ | $56\times56\times256$ | $3\times3\times256/1$ | $56\times56\times256$ | $(3\times3\times256+1)\times256$ |
| Downsampling layer $S_{max3}$ | $56\times56\times256$ | $2\times2/2$ | $28\times28\times256$ | $0$ |
| Convolutional layer $C_{41}$ | $28\times28\times256$ | $3\times3\times512/1$ | $28\times28\times512$ | $(3\times3\times256+1)\times512$ |
| Convolutional layer $C_{42}$ | $28\times28\times512$ | $3\times3\times512/1$ | $28\times28\times512$ | $(3\times3\times512+1)\times512$ |
| Convolutional layer $C_{43}$ | $28\times28\times512$ | $3\times3\times512/1$ | $28\times28\times512$ | $(3\times3\times512+1)\times512$ |
| Downsampling layer $S_{max4}$ | $28\times28\times512$ | $2\times2/2$ | $14\times14\times512$ | $0$ |
| Convolutional layer $C_{51}$ | $14\times14\times512$ | $3\times3\times512/1$ | $14\times14\times512$ | $(3\times3\times512+1)\times512$ |
| Convolutional layer $C_{52}$ | $14\times14\times512$ | $3\times3\times512/1$ | $14\times14\times512$ | $(3\times3\times512+1)\times512$ |
| Convolutional layer $C_{53}$ | $14\times14\times512$ | $3\times3\times512/1$ | $14\times14\times512$ | $(3\times3\times512+1)\times512$ |
| Downsampling layer $S_{max5}$ | $14\times14\times512$ | $2\times2/2$ | $7\times7\times512$ | $0$ |
| Fully connected layer $FC_{1}$ | $7\times7\times512$ | $(7\times7\times512)\times4096$ | $1\times4096$ | $(7\times7\times512+1)\times4096$ |
| Fully connected layer $FC_{2}$ | $1\times4096$ | $4096\times4096$ | $1\times4096$ | $(4096+1)\times4096$ |
| Fully connected layer $FC_{3}$ | $1\times4096$ | $4096\times1000$ | $1\times1000$ | $(4096+1)\times1000$ |

### 4.5.3 Model Features

- The entire network uses convolution kernels of the same size ($3\times3$) and max pooling windows of the same size ($2\times2$).
- The role of the $1\times1$ convolution is mainly a linear transformation across channels; the number of input channels and output channels stays the same, so no dimensionality reduction occurs.
- Two stacked $3\times3$ convolutional layers have a receptive field of $5\times5$ and can therefore replace one $5\times5$ convolutional layer; similarly, three stacked $3\times3$ convolutions are equivalent in receptive field to one $7\times7$ convolutional layer. This kind of stacking uses fewer parameters (for $C$ input and output channels, $2\times(3\times3\times C\times C)=18C^2$ weights versus $5\times5\times C\times C=25C^2$), and the additional activation functions between layers give the network a stronger ability to learn features.
- VGGNet uses a training trick: it first trains the shallow, simple VGG11, then reuses the VGG11 weights to initialize VGG13, and repeats this train-and-initialize procedure up to VGG19, which makes training converge faster.
- Multi-scale transformations are used to augment the original training data, making the model harder to overfit.

## 4.6 GoogLeNet

### 4.6.1 Introduction to the model

As the winner of the 2014 ILSVRC classification task, GoogLeNet edged out VGGNet and the other entries with an error rate of 6.65%. Compared with the previous two champions, ZFNet and AlexNet, its classification accuracy is greatly improved. As the name **GoogLe**Net suggests, it is a network structure designed by Google engineers, and the spelling Goog**LeNet** is a tribute to LeNet$^{[1]}$. The core of GoogLeNet is its internal sub-network structure, Inception, which was inspired by NIN and has gone through four iterations (Inception$_{v1-4}$).

![](./img/ch4/img_inception_01.png)
Figure 4.8 Inception performance comparison chart

### 4.6.2 Model Structure

![](./img/ch4/image25.jpeg)
Figure 4.9 GoogLeNet network structure

As shown in Figure 4.9, GoogLeNet extends the network not only in depth, as previous convolutional network structures did, but also in width. The whole network is composed of many block sub-networks, and these sub-networks form the Inception structure. Figure 4.10 shows the four versions of Inception: Inception$_{v1}$ applies convolution kernels of different sizes in the same layer and merges the convolution results; Inception$_{v2}$ combines stacks of different convolution kernels and merges their results; Inception$_{v3}$ experiments with deeper combinations on the basis of $v2$; and Inception$_{v4}$ is more complex than the earlier versions, nesting sub-networks within the sub-network. A minimal code sketch of an Inception$_{v1}$-style module is given below, before the structure diagrams in Figure 4.10.
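To make the Inception structure concrete, here is a minimal sketch of an Inception$_{v1}$-style module, assuming PyTorch; the branch channel counts passed in at the bottom are illustrative placeholders rather than the exact GoogLeNet configuration. Each branch sees the same input, the $1\times1$ convolutions reduce the channel dimension before the more expensive $3\times3$ and $5\times5$ convolutions, and the branch outputs are concatenated along the channel axis.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception_v1-style block: parallel 1x1, 3x3, 5x5 and pooling branches
    whose outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_reduce, kernel_size=1),  # 1x1 reduces channels first
            nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_reduce, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # padding keeps every branch at the same spatial size, so the
        # results can be concatenated along the channel axis
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(outs, dim=1)

block = InceptionModule(192, c1=64, c3_reduce=96, c3=128, c5_reduce=16, c5=32, pool_proj=32)
y = block(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 256, 28, 28]) -- 64 + 128 + 32 + 32 channels
```

The padding values 0, 1, and 2 for the $1\times1$, $3\times3$, and $5\times5$ branches keep all outputs at the same spatial size, which is what makes the final concatenation possible; this echoes the alignment argument in the model features of Section 4.6.3.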
![](./img/ch4/img_inception_02.png)
![](./img/ch4/img_inception_05.png)
![](./img/ch4/img_inception_03.png)
![](./img/ch4/img_inception_04.png)
Figure 4.10 Inception$_{v1-4}$ structure diagrams

Table 4.6 Parameter configuration of Inception$_{v1}$ in GoogLeNet

| Network Layer | Input Size | Kernel Size | Output Size | Number of Parameters |
| :---: | :---: | :---: | :---: | :---: |
| Convolutional layer $C_{11}$ | $H\times{W}\times{C_1}$ | $1\times1\times{C_2}/2$ | $\frac{H}{2}\times\frac{W}{2}\times{C_2}$ | $(1\times1\times{C_1}+1)\times{C_2}$ |
| Convolutional layer $C_{21}$ | $H\times{W}\times{C_2}$ | $1\times1\times{C_2}/2$ | $\frac{H}{2}\times\frac{W}{2}\times{C_2}$ | $(1\times1\times{C_2}+1)\times{C_2}$ |
| Convolutional layer $C_{22}$ | $H\times{W}\times{C_2}$ | $3\times3\times{C_2}/1$ | $H\times{W}\times{C_2}$ | $(3\times3\times{C_2}+1)\times{C_2}$ |
| Convolutional layer $C_{31}$ | $H\times{W}\times{C_1}$ | $1\times1\times{C_2}/2$ | $\frac{H}{2}\times\frac{W}{2}\times{C_2}$ | $(1\times1\times{C_1}+1)\times{C_2}$ |
| Convolutional layer $C_{32}$ | $H\times{W}\times{C_2}$ | $5\times5\times{C_2}/1$ | $H\times{W}\times{C_2}$ | $(5\times5\times{C_2}+1)\times{C_2}$ |
| Downsampling layer $S_{41}$ | $H\times{W}\times{C_1}$ | $3\times3/2$ | $\frac{H}{2}\times\frac{W}{2}\times{C_2}$ | $0$ |
| Convolutional layer $C_{42}$ | $\frac{H}{2}\times\frac{W}{2}\times{C_2}$ | $1\times1\times{C_2}/1$ | $\frac{H}{2}\times\frac{W}{2}\times{C_2}$ | $(1\times1\times{C_2}+1)\times{C_2}$ |
| Merge layer $M$ | $\frac{H}{2}\times\frac{W}{2}\times{C_2}(\times4)$ | Concatenation | $\frac{H}{2}\times\frac{W}{2}\times({C_2}\times4)$ | $0$ |

### 4.6.3 Model Features

- Using convolution kernels of different sizes means different receptive field sizes, and the final concatenation means a fusion of features at different scales.
- The kernel sizes 1, 3, and 5 are chosen mainly for ease of alignment: with the stride set to 1 and the padding set to 0, 1, and 2 respectively, the convolutions produce features of the same spatial dimensions, which can then be concatenated directly.
- The deeper the network, the more abstract the features and the larger the receptive field involved in each feature, so the proportion of $3\times3$ and $5\times5$ convolutions increases with depth. However, $5\times5$ convolution kernels still carry a large computational cost; for this reason, the paper draws on NIN and uses $1\times1$ convolution kernels for dimensionality reduction.

## 4.7 Why are current CNN models adjusted from GoogLeNet, VGGNet, or AlexNet?

- Evaluation comparison: to make results more convincing, published work is usually compared against a standard baseline and improved upon that baseline; common tasks such as detection and segmentation are typically built on a backbone network such as VGG or ResNet-101.
- Limited time and effort: under research and work pressure, time and energy only allow exploration within a limited range.
- Model innovation is difficult: improving a basic model requires extensive experimentation, a large accumulation of experimental experience, and strong inspiration, so the return on effort is often small.
- Resource limitations: building a new model from scratch requires a great deal of time and computing resources, which is often not feasible for university labs or small company teams.
- In actual application scenarios, a large number of non-standard model configurations are in fact used.

## Related Literature

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, November 1998.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. *Advances in Neural Information Processing Systems 25*, Curran Associates, Inc., 1097–1105, 2012.

[3] LSVRC-2013. http://www.image-net.org/challenges/LSVRC/2013/results.php

[4] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. *European Conference on Computer Vision*, 2014.

[5] M. Lin, Q. Chen, and S. Yan. Network in network. *Computing Research Repository*, abs/1312.4400, 2013.

[6] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. *International Conference on Learning Representations*, 2015.

[7] Bharath Raj. [a-simple-guide-to-the-versions-of-the-inception-network](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202), 2018.

[8] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. [Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/pdf/1602.07261.pdf), 2016.

[9] Sik-Ho Tsang. [review-inception-v4-evolved-from-googlenet-merged-with-resnet-idea-image-classification](https://towardsdatascience.com/review-inception-v4-evolved-from-googlenet-merged-with-resnet-idea-image-classification-5e8c339d18bc), 2018.

[10] Z. Wojna, C. Szegedy, V. Vanhoucke, S. Ioffe, and J. Shlens. [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/pdf/1512.00567v3.pdf), 2015.

[11] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. [Going deeper with convolutions](https://arxiv.org/pdf/1409.4842v1.pdf), 2014.
diff --git a/English version/ch04_ClassicNetwork/img/ch4/LeNet-5.jpg b/English version/ch04_ClassicNetwork/img/ch4/LeNet-5.jpg new file mode 100644 index 0000000..fb65aed Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/LeNet-5.jpg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/alexnet.png b/English version/ch04_ClassicNetwork/img/ch4/alexnet.png new file mode 100644 index 0000000..1a954ca Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/alexnet.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/featureMap.jpg b/English version/ch04_ClassicNetwork/img/ch4/featureMap.jpg new file mode 100644 index 0000000..2e077ef Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/featureMap.jpg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image10.png b/English version/ch04_ClassicNetwork/img/ch4/image10.png new file mode 100644 index 0000000..ef7ed01 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image10.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image11.GIF b/English version/ch04_ClassicNetwork/img/ch4/image11.GIF new file mode 100644 index 0000000..ab72b16 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image11.GIF differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image12.png b/English version/ch04_ClassicNetwork/img/ch4/image12.png new file mode 100644 index 0000000..a842820 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image12.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image13.png b/English version/ch04_ClassicNetwork/img/ch4/image13.png new file mode 100644 index 0000000..56c7a12 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image13.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image14.png b/English version/ch04_ClassicNetwork/img/ch4/image14.png new file mode 100644 index 0000000..0f52f55 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image14.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image15.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image15.jpeg new file mode 100644 index 0000000..2ea99b4 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image15.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image18.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image18.jpeg new file mode 100644 index 0000000..ba94f6c Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image18.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image19.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image19.jpeg new file mode 100644 index 0000000..920f056 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image19.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image2.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image2.jpeg new file mode 100644 index 0000000..284e36e Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image2.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image20.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image20.jpeg new file mode 100644 index 0000000..48f8eb4 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image20.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image21.jpeg b/English 
version/ch04_ClassicNetwork/img/ch4/image21.jpeg new file mode 100644 index 0000000..6afa65c Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image21.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image22.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image22.jpeg new file mode 100644 index 0000000..4a46634 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image22.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image23.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image23.jpeg new file mode 100644 index 0000000..5aa4af0 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image23.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image24.png b/English version/ch04_ClassicNetwork/img/ch4/image24.png new file mode 100644 index 0000000..538640e Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image24.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image25.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image25.jpeg new file mode 100644 index 0000000..536e655 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image25.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image26.png b/English version/ch04_ClassicNetwork/img/ch4/image26.png new file mode 100644 index 0000000..51c7d18 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image26.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image27.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image27.jpeg new file mode 100644 index 0000000..6e526ee Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image27.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image28.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image28.jpeg new file mode 100644 index 0000000..931a1d6 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image28.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image3.png b/English version/ch04_ClassicNetwork/img/ch4/image3.png new file mode 100644 index 0000000..aeedc2d Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image3.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image31.png b/English version/ch04_ClassicNetwork/img/ch4/image31.png new file mode 100644 index 0000000..5d23886 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image31.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image32.png b/English version/ch04_ClassicNetwork/img/ch4/image32.png new file mode 100644 index 0000000..7b463da Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image32.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image33.png b/English version/ch04_ClassicNetwork/img/ch4/image33.png new file mode 100644 index 0000000..43d7908 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image33.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image34.png b/English version/ch04_ClassicNetwork/img/ch4/image34.png new file mode 100644 index 0000000..9501608 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image34.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image35.png b/English version/ch04_ClassicNetwork/img/ch4/image35.png new file mode 100644 index 0000000..083675c Binary files /dev/null and 
b/English version/ch04_ClassicNetwork/img/ch4/image35.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image36.png b/English version/ch04_ClassicNetwork/img/ch4/image36.png new file mode 100644 index 0000000..404390c Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image36.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image37.png b/English version/ch04_ClassicNetwork/img/ch4/image37.png new file mode 100644 index 0000000..6f16300 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image37.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image38.png b/English version/ch04_ClassicNetwork/img/ch4/image38.png new file mode 100644 index 0000000..2bed01b Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image38.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image39.png b/English version/ch04_ClassicNetwork/img/ch4/image39.png new file mode 100644 index 0000000..66c49a2 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image39.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image4.png b/English version/ch04_ClassicNetwork/img/ch4/image4.png new file mode 100644 index 0000000..3719377 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image4.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image40.png b/English version/ch04_ClassicNetwork/img/ch4/image40.png new file mode 100644 index 0000000..8643e43 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image40.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image41.png b/English version/ch04_ClassicNetwork/img/ch4/image41.png new file mode 100644 index 0000000..32bf30d Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image41.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image42.png b/English version/ch04_ClassicNetwork/img/ch4/image42.png new file mode 100644 index 0000000..233128e Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image42.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image43.png b/English version/ch04_ClassicNetwork/img/ch4/image43.png new file mode 100644 index 0000000..4371335 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image43.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image44.png b/English version/ch04_ClassicNetwork/img/ch4/image44.png new file mode 100644 index 0000000..c1da265 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image44.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image45.png b/English version/ch04_ClassicNetwork/img/ch4/image45.png new file mode 100644 index 0000000..f5257d0 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image45.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image46.png b/English version/ch04_ClassicNetwork/img/ch4/image46.png new file mode 100644 index 0000000..3dbb511 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image46.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image47.png b/English version/ch04_ClassicNetwork/img/ch4/image47.png new file mode 100644 index 0000000..adcebab Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image47.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image48.png 
b/English version/ch04_ClassicNetwork/img/ch4/image48.png new file mode 100644 index 0000000..97801f0 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image48.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image49.png b/English version/ch04_ClassicNetwork/img/ch4/image49.png new file mode 100644 index 0000000..883bc7b Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image49.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image5.png b/English version/ch04_ClassicNetwork/img/ch4/image5.png new file mode 100644 index 0000000..6c518d6 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image5.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image50.png b/English version/ch04_ClassicNetwork/img/ch4/image50.png new file mode 100644 index 0000000..e265b45 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image50.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image51.png b/English version/ch04_ClassicNetwork/img/ch4/image51.png new file mode 100644 index 0000000..34133a0 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image51.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image52.png b/English version/ch04_ClassicNetwork/img/ch4/image52.png new file mode 100644 index 0000000..a6ed65b Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image52.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image53.png b/English version/ch04_ClassicNetwork/img/ch4/image53.png new file mode 100644 index 0000000..ec92de4 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image53.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image54.png b/English version/ch04_ClassicNetwork/img/ch4/image54.png new file mode 100644 index 0000000..3558da8 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image54.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image55.png b/English version/ch04_ClassicNetwork/img/ch4/image55.png new file mode 100644 index 0000000..963436a Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image55.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image56.png b/English version/ch04_ClassicNetwork/img/ch4/image56.png new file mode 100644 index 0000000..c07aed2 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image56.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image6.png b/English version/ch04_ClassicNetwork/img/ch4/image6.png new file mode 100644 index 0000000..f1509b8 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image6.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image60.jpeg b/English version/ch04_ClassicNetwork/img/ch4/image60.jpeg new file mode 100644 index 0000000..5f45a37 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image60.jpeg differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image61.png b/English version/ch04_ClassicNetwork/img/ch4/image61.png new file mode 100644 index 0000000..dd7b128 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image61.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image62.png b/English version/ch04_ClassicNetwork/img/ch4/image62.png new file mode 100644 index 0000000..221fb4b Binary files /dev/null and b/English 
version/ch04_ClassicNetwork/img/ch4/image62.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image63.png b/English version/ch04_ClassicNetwork/img/ch4/image63.png new file mode 100644 index 0000000..38167b0 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image63.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image7.png b/English version/ch04_ClassicNetwork/img/ch4/image7.png new file mode 100644 index 0000000..0c5fe01 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image7.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image8.png b/English version/ch04_ClassicNetwork/img/ch4/image8.png new file mode 100644 index 0000000..20a6a5d Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image8.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/image9.png b/English version/ch04_ClassicNetwork/img/ch4/image9.png new file mode 100644 index 0000000..b2fc2e7 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/image9.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/vgg16.png b/English version/ch04_ClassicNetwork/img/ch4/vgg16.png new file mode 100644 index 0000000..3dca57d Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/vgg16.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer1.png b/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer1.png new file mode 100644 index 0000000..7fb93f7 Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer1.png differ diff --git a/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer2.png b/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer2.png new file mode 100644 index 0000000..c2e2f6c Binary files /dev/null and b/English version/ch04_ClassicNetwork/img/ch4/zfnet-layer2.png differ diff --git a/English version/ch04_ClassicNetwork/readme.md b/English version/ch04_ClassicNetwork/readme.md new file mode 100644 index 0000000..5072a51 --- /dev/null +++ b/English version/ch04_ClassicNetwork/readme.md @@ -0,0 +1,12 @@ +########################################################### + +### Deep Learning 500 Questions - Chapter 4 Classic Network + +**Responsible person (in no particular order):** +South China Institute of Technology graduate student - Huang Qinjian (wechat: HQJ199508212176, email: csqjhuang@mail.scut.edu.cn) + + +**Contributors (in no particular order):** +Content contributors can add information + +###########################################################