Update Chapter 2_TheBasisOfMachineLearning.md

scutan90 2019-04-05 10:38:49 +08:00
parent 0d01cbe2b4
commit a36e060efe
1 changed file with 26 additions and 26 deletions


@@ -40,7 +40,8 @@ There are different ways to model a problem, depending on the type of data. Acco
2. Common application scenarios include learning of association rules and clustering.
3. Common algorithms include the Apriori algorithm and the k-Means algorithm.
**Semi-supervised learning**:
1. In this learning mode, part of the input data is labeled and part is unlabeled. The resulting model can be used for prediction.
2. Application scenarios include classification and regression. The algorithms include extensions of commonly used supervised learning algorithms: they first model the labeled data and, on that basis, predict the unlabeled data.
3. Common algorithms include Graph Inference and Laplacian SVM (see the sketch below).
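As a concrete illustration, here is a minimal sketch of graph-based semi-supervised learning, assuming scikit-learn's `LabelSpreading` (a graph-inference style method, standing in for the algorithms named above) and the toy iris data; the 70% unlabeled split is an arbitrary choice for demonstration.

```python
# A minimal sketch of graph-based semi-supervised learning with scikit-learn.
# By sklearn's convention, unlabeled samples are marked with the label -1.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.7] = -1        # hide ~70% of the labels

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)                       # model the labeled part, propagate to the rest
print("accuracy on all samples:", model.score(X, y))
```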
@@ -130,7 +131,7 @@ Therefore, when using the classification model or the regression model in the ac
### 2.8.2 How to evaluate the classification algorithm?
- **Several common terms**
Here are a few common model evaluation terms. Now suppose that our classification target has only two categories, which are considered positive and negative:
1) True positives (TP): the number of positive cases that are correctly divided into positive examples, that is, the number of instances that are actually positive and are classified as positive by the classifier;
2) False positives (FP): the number of negative cases that are incorrectly classified as positive examples, that is, the number of instances that are actually negative but are classified as positive by the classifier;
@@ -165,7 +166,7 @@ The figure above is the confusion matrix of these four terms, and the following
Robustness: The ability to handle missing values and outliers;
Scalability: The ability to handle large data sets;
Interpretability: the comprehensibility of the classifier's prediction criteria. For example, the rules generated by a decision tree are easy to understand, while a neural network's parameters are not, so we have to treat it as a black box.
8) Precision and recall reflect two different aspects of a classifier's performance. Considering precision and recall together yields a new evaluation index, the F1-score, also known as the comprehensive classification rate: $F1=\frac{2 \times precision \times recall}{precision + recall}$.
In order to aggregate the classification of multiple categories and evaluate the overall performance of the system, micro-averaged F1 (micro-averaging) and macro-averaged F1 (macro-averaging) are often used, as in the sketch below.
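The following minimal sketch computes precision, recall, and F1 from confusion-matrix counts, then contrasts macro- and micro-averaged F1; the per-class counts are hypothetical.

```python
# A minimal sketch: precision, recall, and F1 from confusion-matrix counts,
# then macro- and micro-averaged F1 over several classes.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

# hypothetical per-class (tp, fp, fn) counts for a 3-class problem
counts = [(30, 10, 5), (20, 5, 10), (5, 2, 8)]

macro_f1 = sum(prf1(*c)[2] for c in counts) / len(counts)  # average the per-class F1 scores
tp, fp, fn = (sum(col) for col in zip(*counts))            # micro: pool the counts first
micro_f1 = prf1(tp, fp, fn)[2]
print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```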
@@ -285,13 +286,12 @@ As shown below:
![](./img/ch2/2.16/1.jpg)
To fit the discrete points in the graph, we need to find the best possible $A$ and $B$ so that the line represents all the data as well as possible. How do we find the optimal solution? This is done with a cost function. Taking the squared-error cost function as an example, assume the hypothesis function is $h(x)=\theta_0x$.
The main idea of the **squared-error cost function** is to take the difference between the value given by the actual data and the corresponding value on the fitted line, which measures how far the fitted line is from the actual data. In practical applications, to reduce the impact of individual extreme data points, a variance-like quantity is used and one-half of it is taken. The cost function is therefore:
$$
J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})^2
$$
**The optimal solution is the minimum of the cost function**, $\min J(\theta_0, \theta_1)$. With one parameter, the cost function is usually visualized as a two-dimensional curve; with two parameters, it can be viewed as a 3D surface. The more parameters, the harder it is to visualize.
When there are two parameters, the cost function is a three-dimensional surface.
![](./img/ch2/2.16/2.jpg)
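For concreteness, here is a minimal sketch of evaluating this cost function on a few hypothetical points, using a line $h(x)=\theta_0+\theta_1x$ with both parameters, matching $J(\theta_0, \theta_1)$ above; the data values are assumptions for illustration.

```python
# A minimal sketch of the squared-error cost J(theta0, theta1) for a line
# h(x) = theta0 + theta1 * x over hypothetical toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.2, 3.9])

def cost(theta0, theta1):
    m = len(x)
    residuals = theta0 + theta1 * x - y    # h(x^(i)) - y^(i)
    return np.sum(residuals ** 2) / (2 * m)

print(cost(0.0, 1.0))   # cost of the line y = x
print(cost(0.5, 0.8))   # a different candidate line
```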
@@ -315,7 +315,7 @@ If Gradient descent is used to adjust the size of the weight parameter, the grad
$$
\frac{\partial J}{\partial b}=(a-y)\sigma'(z)
$$
Where $z$ represents the input of the neuron and $\sigma$ represents the activation function. The gradients of the weight $w$ and the bias $b$ are proportional to the gradient of the activation function: the larger the activation function's gradient, the faster the weight $w$ and bias $b$ are adjusted, and the faster training converges.
*Note*: The activation function commonly used in neural networks is the sigmoid function. The curve of this function is as follows:
@@ -326,7 +326,7 @@ Assume that the target is to converge to 1.0. 0.88 is farther from the target 1.
If the target is instead 0: 0.88 is closer to the target 0 but its gradient is larger, so its weight adjustment is larger; 0.98 is farther from the target 0 but its gradient is smaller, so its weight adjustment is smaller. This adjustment scheme is unreasonable.
Cause: In the case of using the sigmoid function, the larger the initial cost (error), the slower the training.
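A quick numeric check of this, as a sketch: for a sigmoid output $a$, the derivative is $\sigma'(z)=a(1-a)$, which shrinks as the output saturates toward 0 or 1.

```python
# A minimal sketch showing sigmoid saturation: sigma'(z) = a(1 - a) is
# largest near output 0.5 and tiny when the output approaches 0 or 1,
# which is why the quadratic cost can learn slowly.
import numpy as np

for a in [0.5, 0.88, 0.98]:
    z = np.log(a / (1 - a))     # the input z whose sigmoid output equals a
    grad = a * (1 - a)          # sigma'(z) evaluated at that point
    print(f"output {a:.2f}: sigma'(z) = {grad:.4f}")
```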
2. **Cross-entropy cost function (cross-entropy)**:
$$
J = -\frac{1}{n}\sum_x[y\ln a + (1-y)\ln{(1-a)}]
$$
@@ -353,10 +353,10 @@ In pytorch:
The cross entropy function used with softmax: `torch.nn.CrossEntropyLoss()`.
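A short usage sketch (tensor shapes and values are arbitrary): `torch.nn.CrossEntropyLoss` expects raw, unnormalized scores, since it applies log-softmax internally.

```python
# A minimal usage sketch of torch.nn.CrossEntropyLoss, which combines
# log-softmax with the negative log-likelihood loss.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(3, 5, requires_grad=True)  # 3 samples, 5 classes, raw scores
targets = torch.tensor([1, 0, 4])               # ground-truth class indices
loss = criterion(logits, targets)
loss.backward()                                 # gradients flow back through logits
print(loss.item())
```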
### 2.10.5 Why use cross entropy instead of the quadratic cost function?
1. **Why not use the quadratic cost function**
As seen in the previous section, the partial derivatives with respect to the weight $w$ and the bias $b$ are $\frac{\partial J}{\partial w}=(a-y)\sigma'(z)x$ and $\frac{\partial J}{\partial b}=(a-y)\sigma'(z)$. The partial derivatives are scaled by the derivative of the activation function, and the derivative of the sigmoid function is very small when its output is close to 0 or 1, which causes some instances to learn very slowly at the start of training.
2. **Why use cross entropy**
The gradients of the cross-entropy function with respect to the weights $w$ and the bias $b$ are derived as:
$$
@@ -407,7 +407,7 @@ $$
$$
L(Y, f(x)) = |Y-f(x)|
$$
3. **Squared loss function**
$$
L(Y, f(x)) = \sum_N{(Y-f(x))}^2
$$
@@ -423,7 +423,7 @@ $$
Logistic regression commonly uses the logarithmic loss function. Many people think the loss function of logistic regression is the squared loss, but it is not. Logistic regression assumes that the samples obey a Bernoulli distribution, constructs the likelihood function for that distribution, and then takes the logarithm and finds the extremum. The empirical risk function derived for logistic regression minimizes the negative likelihood function, which, from the perspective of loss functions, is exactly the log loss function.
5. **Exponential loss function**
The standard form of the exponential loss function is:
$$
@@ -643,9 +643,9 @@ Out, the gradient direction of the current position is determined by all samples
### 2.12.5 How to tune the gradient descent method?
When the gradient descent method is used in practice, the parameters cannot reach the ideal state in a single step, and tuning of the gradient descent method is mainly reflected in the following aspects:
1. **Selection of the algorithm's iteration step size $\alpha$.**
When the algorithm parameters are initialized, the step size is sometimes set to 1 based on experience; the actual value depends on the data. You can try several values from large to small and run the algorithm to see the iterative effect: if the loss function keeps getting smaller, the chosen value is effective; otherwise, adjust the step size. A step size that is too large can make the iterations move too fast and miss the optimal solution; a step size that is too small makes the iterations slow and the algorithm run for a long time.
2. **Selection of the parameters' initial values.**
Different initial values can lead to different minima, since gradient descent may only find a local minimum; if the loss function is convex, the result is guaranteed to be the optimal solution. Because of the risk of local optima, run the algorithm several times with different initial values, compare the minimum of the loss function across runs, and choose the initial value that minimizes the loss function.
3. **Standardization process.**
Because samples differ, the ranges of feature values differ, which slows the iterations. To reduce the influence of feature scales, the feature data can be standardized so that each feature has mean 0 and variance 1; this saves algorithm running time (see the sketch below).
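A minimal sketch of this standardization step, on hypothetical data with features of very different scales:

```python
# A minimal sketch of standardization: rescale each feature to zero mean
# and unit variance before running gradient descent.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])   # hypothetical samples; features on different scales

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))      # ~0 for each feature
print(X_std.std(axis=0))       # 1 for each feature
```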
@@ -665,9 +665,9 @@ J(\theta_0, \theta_1, ... , \theta_n) =
$$
Among them, $m$ is the number of samples, and $n$ is the number of parameters.
1. **The batch gradient descent solution is as follows:**
a) Get the gradient corresponding to each $\theta_i$:
$$
\frac{\partial}{\partial \theta_i}J({\theta}_0,{\theta}_1,...,{\theta}_n)=\frac{1}{m}\sum^{m}_{j=0}\left(h_\theta (x^{(j)}_0,x^{(j)}_1,...,x^{(j)}_n)-y_j\right)x^{(j)}_i
$$
@@ -680,7 +680,7 @@ $$
c) It can be noticed from the above equation that although batch gradient descent obtains a global optimal solution, every iteration uses all the data in the training set, so if the sample size is large, the iteration speed of this method is very slow.
In contrast, stochastic gradient descent can avoid this problem; a sketch of the batch method follows.
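The sketch below implements the batch update above for linear regression on hypothetical data (the step size and iteration count are arbitrary choices):

```python
# A minimal sketch of batch gradient descent for linear regression:
# every iteration computes the gradient over all m samples.
import numpy as np

X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0]]  # first column of ones for theta_0
y = np.array([1.2, 1.9, 3.2, 3.9])
theta = np.zeros(2)
alpha, m = 0.05, len(y)

for _ in range(2000):
    grad = X.T @ (X @ theta - y) / m         # full-batch gradient
    theta -= alpha * grad

print(theta)  # approaches the least-squares solution
```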
2. **The solution to stochastic gradient descent is as follows:**
a) Whereas batch gradient descent uses all the training samples, the loss function in the stochastic gradient descent method is written at the granularity of a single sample of the training set.
The loss function can be written in the following form:
$$
@@ -708,8 +708,8 @@ d) In terms of convergence speed, the stochastic gradient descent method iterate
The following describes a small batch gradient descent method that combines the advantages of both methods.
3. **The small-batch (mini-batch) gradient descent solution is as follows:**
For data with a total of $m$ samples, select $n$ ($1< n< m$) subsamples to iterate on. The parameter $\theta$ is updated along the gradient direction by the $\theta_i$ formula below:
$$
\theta_i = \theta_i - \alpha \sum^{t+n-1}_{j=t}\left( h_\theta (x^{(j)}_{0}, x^{(j)}_{1}, ... , x^{(j)}_{n} ) - y_j \right) x^{(j)}_{i}
$$
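As a sketch, the mini-batch update might look like the following on hypothetical data; with $n=1$ it reduces to pure stochastic gradient descent (the batch size, step size, and data are assumptions):

```python
# A minimal sketch of mini-batch gradient descent: each update uses a
# random slice of n (1 < n < m) samples.
import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(100), rng.rand(100)]   # hypothetical design matrix
true_theta = np.array([0.5, 2.0])
y = X @ true_theta                       # noiseless targets, for illustration
theta = np.zeros(2)
alpha, n = 0.1, 10

for _ in range(2000):
    idx = rng.choice(len(y), size=n, replace=False)  # draw one mini-batch
    Xb, yb = X[idx], y[idx]
    theta -= alpha * Xb.T @ (Xb @ theta - yb) / n

print(theta)  # close to [0.5, 2.0]
```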
@@ -807,7 +807,7 @@ $$
$$
u_j = \frac{1}{N_j} \sum_{x \in X_j}x \quad (j=0,1), \qquad
\sum_j = \sum_{x \in X_j}(x-u_j)(x-u_j)^T \quad (j=0,1)
$$
Suppose the projection line is the vector $w$. For any sample $x_i$, its projection onto the line $w$ is $w^Tx_i$. The projections of the two class centers $u_0$ and $u_1$ onto the line $w$ are $w^Tu_0$ and $w^Tu_1$ respectively.
The goal of LDA is to maximize the distance between the two classes' projected centers, $\| w^Tu_0 - w^Tu_1 \|^2_2$, while at the same time making the covariances of the projected points within each class, $w^T \sum_0 w$ and $w^T \sum_1 w$, as small as possible, i.e., minimizing $w^T \sum_0 w + w^T \sum_1 w$.
Define:
@@ -825,14 +825,14 @@ $$
$$
J(w) = \frac{w^T(u_0-u_1)(u_0-u_1)^Tw}{w^T(\sum_0 + \sum_1)w} =
\frac{w^TS_bw}{w^TS_ww}
$$
According to the properties of the generalized Rayleigh quotient, the maximum value of $J(w)$ is the largest eigenvalue of the matrix $S^{-1}_{w} S_b$, and $w$ is the eigenvector corresponding to that largest eigenvalue.
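For the two-class case this gives a closed form: since $S_b w = (u_0-u_1)(u_0-u_1)^Tw$ is always parallel to $u_0-u_1$, one may take $w = S_w^{-1}(u_0-u_1)$ up to scale. A minimal sketch on hypothetical Gaussian data:

```python
# A minimal sketch of two-class LDA: for K = 2 the optimal direction can be
# taken as w = S_w^{-1} (u_0 - u_1), up to scale.
import numpy as np

rng = np.random.RandomState(0)
X0 = rng.randn(50, 2)                         # class 0, hypothetical samples
X1 = rng.randn(50, 2) + np.array([3.0, 2.0])  # class 1, shifted mean

u0, u1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = (X0 - u0).T @ (X0 - u0) + (X1 - u1).T @ (X1 - u1)  # within-class scatter

w = np.linalg.solve(Sw, u0 - u1)  # solves Sw w = (u0 - u1)
w /= np.linalg.norm(w)
print("projection direction w:", w)
```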
### 2.14.4 Summary of LDA algorithm flow?
The LDA algorithm dimension reduction process is as follows:
Input: dataset $D = \{ (x_1,y_1), (x_2,y_2), ... ,(x_m,y_m) \}$, where each sample $x_i$ is an n-dimensional vector, $y_i \in \{C_1, C_2, ..., C_k\}$, and the target dimension after reduction is $d$.
Output: the dimension-reduced dataset $\overline{D}$.
Steps:
1. Calculate the intra-class divergence matrix $S_w$.
@@ -1708,11 +1708,11 @@ J(\theta)=-\frac{1}{m}\left[\sum^m_{i=1}y^{(i)}logh_{\theta}(x^{(i)})+ ( 1-y^{(i
$$
The objective function of the support vector machine:
$$
L(w,b,\alpha)=\frac{1}{2}||w||^2-\sum^n_{i=1}\alpha_i \left(y_i(w^Tx_i+b)-1 \right)
$$
The logistic regression method is based on probability theory: the probability that a sample is 1 can be represented by the sigmoid function, and the parameter values are then estimated by **maximum likelihood estimation**.
The support vector machine is based on the principle of **geometric margin maximization**, and it considers the classification plane with the largest geometric margin to be the optimal classification plane.
2. **LR is sensitive to outliers and SVM is not sensitive to outliers**.
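To make this contrast concrete, here is a minimal sketch comparing the two losses as functions of the margin value $t = y \cdot f(x)$ (the grid of values is arbitrary): the hinge loss used by SVM is exactly zero for confidently correct points ($t \ge 1$), so such points stop influencing the solution, while the logistic loss is positive everywhere, so every point keeps influencing logistic regression.

```python
# A minimal sketch contrasting the logistic (log) loss and the SVM hinge
# loss as functions of the margin value t = y * f(x).
import numpy as np

t = np.linspace(-2.0, 3.0, 6)           # margin values; t >= 1 means confidently correct
log_loss = np.log(1 + np.exp(-t))       # logistic regression loss, never exactly zero
hinge_loss = np.maximum(0.0, 1.0 - t)   # SVM hinge loss, zero for t >= 1

for ti, ll, hl in zip(t, log_loss, hinge_loss):
    print(f"t = {ti:+.1f}   log: {ll:.3f}   hinge: {hl:.3f}")
```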