Update Chapter 2_TheBasisOfMachineLearning.md

scutan90 2019-04-05 10:38:49 +08:00
parent 0d01cbe2b4
commit a36e060efe
1 changed file with 26 additions and 26 deletions


@@ -41,6 +41,7 @@ There are different ways to model a problem, depending on the type of data. Acco
3. Common algorithms include the Apriori algorithm and the k-Means algorithm.
**Semi-supervised learning**:
1. In this learning mode, part of the input data is labeled and the rest is unlabeled. This learning model can be used for prediction.
2. Application scenarios include classification and regression. The algorithms are usually extensions of common supervised learning algorithms: a model is first built on the labeled data and is then used to predict the unlabeled data.
3. Common algorithms include Graph Inference and Laplacian SVM; a minimal sketch of a graph-based approach follows below.
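As an added illustration (not part of the original text), here is a minimal sketch of graph-based semi-supervised learning using scikit-learn's `LabelPropagation`, which is assumed to be available; unlabeled samples are marked with `-1` per the scikit-learn convention:

```python
# Illustrative sketch: semi-supervised learning where only part of the labels are known.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Pretend only ~20% of the labels are known; unlabeled samples are marked with -1.
rng = np.random.RandomState(0)
y_partial = np.copy(y)
unlabeled_mask = rng.rand(len(y)) < 0.8
y_partial[unlabeled_mask] = -1

model = LabelPropagation()
model.fit(X, y_partial)                  # learns from labeled + unlabeled points together
print(model.score(X[unlabeled_mask], y[unlabeled_mask]))   # accuracy on the unlabeled part
```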
@@ -165,7 +166,7 @@ The figure above is the confusion matrix of these four terms, and the following
Robustness: The ability to handle missing values and outliers;
Scalability: The ability to handle large data sets;
Interpretability: The comprehensibility of the classifier's prediction criteria. For example, the rules generated by a decision tree are easy to understand, whereas the parameters of a neural network are not, so we have to treat it as a black box.
8) Precision and recall reflect two different aspects of classifier performance. Considering precision and recall together yields a new evaluation metric, the F1-score, also known as the comprehensive classification rate: $F1=\frac{2 \times precision \times recall}{precision + recall}$.
To aggregate classification results over multiple categories and evaluate the overall performance of the system, the micro-averaged F1 (micro-averaging) and the macro-averaged F1 (macro-averaging) are often used.
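For illustration (not from the original text, and assuming scikit-learn is available), a short sketch of the F1-score and its micro/macro averages on a small made-up multi-class example:

```python
# Illustrative sketch: F1 = 2 * precision * recall / (precision + recall),
# aggregated over classes with micro- and macro-averaging.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 1, 2]
y_pred = [0, 2, 2, 2, 1, 0, 0, 2]

print(f1_score(y_true, y_pred, average='micro'))   # micro-F1: computed from global TP/FP/FN counts
print(f1_score(y_true, y_pred, average='macro'))   # macro-F1: unweighted mean of per-class F1 scores
```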
@@ -285,13 +286,12 @@ As shown below:
![](./img/ch2/2.16/1.jpg)
To fit the discrete points in the figure, we need to find the best possible $A$ and $B$ so that the line represents all of the data as well as possible. Finding this optimal solution requires a cost function. Taking the squared-error cost function as an example, assume the hypothesis is $h(x)=\theta_0x$.
The main idea of the **squared-error cost function** is to take the difference between the value given by the actual data and the corresponding value on the fitted line, which measures how far the fitted line deviates from the actual data. In practical applications, in order to reduce the impact of individual extreme data points, a variance-like quantity (often taken as one half of the variance) is used. This leads to the cost function:
$$
J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^m(h(x^{(i)})-y^{(i)})^2
$$
**The optimal solution is the minimum of the cost function**, $\min J(\theta_0, \theta_1)$. With a single parameter, the cost function is usually visualized as a two-dimensional curve; with two parameters, it can be visualized as a 3D surface. The more parameters there are, the more complicated the visualization becomes.
When there are two parameters, the cost function is a three-dimensional surface, as shown below.
![](./img/ch2/2.16/2.jpg)
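A minimal NumPy sketch (added for illustration, with made-up data) of this cost function for the one-parameter hypothesis $h(x)=\theta_0x$:

```python
# Illustrative sketch: J(theta_0) = (1/m) * sum((h(x) - y)^2) for h(x) = theta_0 * x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])           # made-up data, roughly y = 2x with noise

def cost(theta0):
    predictions = theta0 * x                  # h(x) = theta_0 * x
    return np.mean((predictions - y) ** 2)    # (1/m) * sum of squared errors

for theta0 in [1.0, 1.5, 2.0, 2.5]:
    print(theta0, cost(theta0))               # the cost is smallest near theta_0 = 2
```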
@@ -315,7 +315,7 @@ If Gradient descent is used to adjust the size of the weight parameter, the grad
$$
\frac{\partial J}{\partial b}=(a-y)\sigma'(z)
$$
where $z$ represents the input of the neuron and $\sigma$ represents the activation function. The gradients of the weight $w$ and the bias $b$ are proportional to the gradient of the activation function: the larger the gradient of the activation function, the faster $w$ and $b$ are adjusted, and the faster training converges.
*Note*: The activation function commonly used in neural networks is the sigmoid function. The curve of this function is as follows:
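As an added illustration (not part of the original), a short NumPy sketch that evaluates the sigmoid and its derivative; note that $\sigma'(z)$ peaks at $z=0$ and approaches zero for large $|z|$, which is why the gradients above can become very small:

```python
# Illustrative sketch: sigma(z) = 1 / (1 + exp(-z)) and sigma'(z) = sigma(z) * (1 - sigma(z)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-8, 8, 9)
s = sigmoid(z)
ds = s * (1 - s)                              # derivative of the sigmoid

for zi, si, dsi in zip(z, s, ds):
    print(f"z={zi:+.1f}  sigmoid={si:.4f}  sigmoid'={dsi:.4f}")
```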
@@ -667,7 +667,7 @@ Among them, $m $ is the number of samples, and $j $ is the number of parameters.
1. **The batch gradient descent solution idea is as follows:**
a) Compute the gradient corresponding to each $\theta_i$:
$$
\frac{\partial}{\partial \theta_i}J(\theta_0,\theta_1,...,\theta_n)=\frac{1}{m}\sum^{m}_{j=0}\left(h_\theta (x^{(j)}_0,x^{(j)}_1,...,x^{(j)}_n)-y_j\right)x^{(j)}_i
@@ -680,7 +680,7 @@
c) It can be seen from the above equation that although batch gradient descent obtains a globally optimal solution, every iteration uses all of the data in the training set. If the sample size is large, each iteration of this method is very slow.
In contrast, stochastic gradient descent can avoid this problem.
2. **The stochastic gradient descent solution idea is as follows:**
a) Whereas batch gradient descent uses all of the training samples, the loss function in the stochastic gradient descent method is defined at the granularity of a single sample in the training set.
The loss function can be written in the following form:
$$
@@ -709,7 +709,7 @@ d) In terms of convergence speed, the stochastic gradient descent method iterate
The following describes the mini-batch gradient descent method, which combines the advantages of both methods.
3. **The mini-batch gradient descent solution idea is as follows** (a code sketch follows the update formula below):
For data with a total of $m$ samples, select $n$ ($1 < n < m$) subsamples for each iteration. The parameter $\theta_i$ is updated along the gradient direction as follows:
$$
\theta_i = \theta_i - \alpha \sum^{t+n-1}_{j=t}\left( h_\theta (x^{(j)}_{0}, x^{(j)}_{1}, ... , x^{(j)}_{n}) - y_j \right) x^{(j)}_{i}
$$
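A minimal NumPy sketch (added for illustration; the data, step size, and batch size are made up) of mini-batch gradient descent for linear regression. Unlike the plain sum in the formula above, the gradient here is averaged over each batch, a common practical variant:

```python
# Illustrative sketch: mini-batch gradient descent for h(x) = X @ theta.
import numpy as np

rng = np.random.RandomState(0)
m, n_features = 200, 3
X = np.hstack([np.ones((m, 1)), rng.randn(m, n_features)])   # x_0 = 1 bias column
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_theta + 0.1 * rng.randn(m)

theta = np.zeros(X.shape[1])
alpha, batch_size, epochs = 0.05, 20, 50                      # n = 20 subsamples per update

for _ in range(epochs):
    idx = rng.permutation(m)
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]
        grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
        theta -= alpha * grad                                  # step along the negative gradient

print(theta)   # should be close to true_theta
```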
@@ -807,7 +807,7 @@
u_j = \frac{1}{N_j} \sum_{x \in X_j}x \quad (j=0,1),
\sum_j = \sum_{x \in X_j}(x-u_j)(x-u_j)^T \quad (j=0,1)
$$
Suppose the projection line is the vector $w$. For any sample $x_i$, its projection onto the line $w$ is $w^Tx_i$. The center points of the two categories, $u_0$ and $u_1$, project onto the line $w$ as $w^Tu_0$ and $w^Tu_1$ respectively.
The goal of LDA is to maximize the distance between the projected centers of the two categories, $\| w^Tu_0 - w^Tu_1 \|^2_2$, while at the same time keeping the covariances of the projected points within each class, $w^T \sum_0 w$ and $w^T \sum_1 w$, as small as possible, i.e., minimizing $w^T \sum_0 w + w^T \sum_1 w$.
Define:
@@ -825,14 +825,14 @@
\frac{w^T(u_0-u_1)(u_0-u_1)^Tw}{w^T(\sum_0 + \sum_1)w} =
\frac{w^TS_bw}{w^TS_ww}
$$
According to the properties of the generalized Rayleigh quotient, the maximum value of $J(w)$ is the largest eigenvalue of the matrix $S^{-1}_{w} S_b$, and $w$ is the eigenvector corresponding to that largest eigenvalue.
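A minimal NumPy sketch (added for illustration, with synthetic two-class data) of this result, computing $w$ as the top eigenvector of $S^{-1}_w S_b$ and comparing it to the two-class closed form $w \propto S^{-1}_w(u_0-u_1)$:

```python
# Illustrative sketch: two-class LDA projection direction.
import numpy as np

rng = np.random.RandomState(0)
X0 = rng.randn(50, 2)                            # class 0 samples around (0, 0)
X1 = rng.randn(50, 2) + np.array([3.0, 2.0])     # class 1 samples around (3, 2)

u0, u1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = (X0 - u0).T @ (X0 - u0) + (X1 - u1).T @ (X1 - u1)   # within-class scatter matrix
Sb = np.outer(u0 - u1, u0 - u1)                           # between-class scatter matrix

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = eigvecs[:, np.argmax(eigvals.real)].real              # eigenvector of the largest eigenvalue
print(w / np.linalg.norm(w))

# For two classes this coincides (up to sign and scale) with the closed form S_w^{-1}(u0 - u1):
w_closed = np.linalg.inv(Sw) @ (u0 - u1)
print(w_closed / np.linalg.norm(w_closed))
```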
### 2.14.4 Summary of LDA algorithm flow?
The LDA algorithm dimension reduction process is as follows:
Input: Dataset $D = \{ (x_1,y_1), (x_2,y_2), ... ,(x_m,y_m) \}$, where each sample $x_i$ is an n-dimensional vector, $y_i \in \{C_1, C_2, ..., C_k\}$, and $d$ is the target dimension after reduction.
Output: the dimension-reduced dataset $\overline{D}$.
Steps:
1. Calculate the within-class scatter matrix $S_w$.
@@ -1708,11 +1708,11 @@ J(\theta)=-\frac{1}{m}\left[\sum^m_{i=1}y^{(i)}logh_{\theta}(x^{(i)})+ ( 1-y^{(i
$$
The objective function of the support vector machine:
$$
L(w,b,\alpha)=\frac{1}{2}||w||^2-\sum^n_{i=1}\alpha_i \left( y_i(w^Tx_i+b)-1 \right)
$$
The logistic regression method is based on probability theory: the probability that a sample belongs to class 1 is represented by the sigmoid function, and the values of the parameters are then estimated by **maximum likelihood estimation**.
The support vector machine is based on the principle of **geometric margin maximization**: the classification hyperplane with the largest geometric margin is considered the optimal classification hyperplane.
2. **LR is sensitive to outliers and SVM is not sensitive to outliers**.
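As a hedged illustration of the contrast in points 1 and 2 (not from the original; assumes scikit-learn), a brief sketch fitting both models on the same synthetic data. Logistic regression weighs every sample through the log-likelihood, while the SVM decision boundary is determined only by its support vectors:

```python
# Illustrative sketch: logistic regression vs. a linear SVM on the same data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

lr = LogisticRegression().fit(X, y)
svm = SVC(kernel='linear', C=1.0).fit(X, y)

print("LR coefficients: ", lr.coef_, lr.intercept_)
print("SVM coefficients:", svm.coef_, svm.intercept_)
print("Support vectors per class:", svm.n_support_)   # only these points define the SVM boundary
```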