본문으로 건너뛰기

FDA +010

· 약 4분

SVM

a supervised learning framework for finding a boundary between data points belonging to different classes.

Convex Hull

Convex Hull

  • the lines surrounding the outermost points of each class.
  • since the classes are linearly separable, convex hulls do not intersect.

Margins

  • the distance between data points of different classes seprarated by a hyperplane.
  • multiple possible hyperplanes can separate the classes, but the optimal hyperplane maximizes the margin.
  • Maximum Margin Hyperplane
    • a separation line/plane orthogonal to the shortest line connecting the convex hull.
    • the line that is farthest apart from each convex hull.
    • it separates the data points with the widest margin.

Kernels

  • kernel is used to represent kernel functions
    • which are used to convert low-dimensional space to high-dimensional space
    • by applying a function on low-dimensional data points to come up with higher dimensions
      • which can be used to linearly separate the data points of classes using a hyperplane.
  • the function of a kernel is to take data as input and transform it into the required form.
    • different algorithms use different types of kernel functions.
    • linear, non-linear, polynomial, radial basis function (RBF), and sigmoid.

Kernel trick

  • Basic idea of SVM and Kernel trick is to find the plane which can separate, classify or split the data with maximum margin as possible.
  • The distance from a point (x0,y0)(x_0, y_0) to a line Ax+By+C=0Ax + By + C = 0 is given by the formula: Distance=Ax0+By0+CA2+B2\text{Distance} = \frac{|Ax_0 + By_0 + C|}{\sqrt{A^2 + B^2}}
  • The distance between H0H_0, H1H_1 is then: wx+b/w=1/w|w \cdot x + b| / ||w|| = 1/||w||.
    • The total distance between H1H_1 and H2H_2 is 2/w2/||w||.
    • H1:wx+b=1H_1: w \cdot x + b = 1
    • H2:wx+b=1H_2: w \cdot x + b = -1.
    Distance=(+1)(1)w=2w \text{Distance} = \frac{|(+1) - (-1)|}{||w||} = \frac{2}{||w||}
  • to maximiza the margin, we need to minimize w||w||
  • this can be solved through Lagrangian formula or Lagrangian multipliers. L=12w2i=1nαi[yi(wxi+b)1] L = \frac{1}{2} ||w||^2 - \sum_{i=1}^{n} \alpha_i [y_i (w \cdot x_i + b) - 1]
  • Setting the gradient of LL Lw=0w=i=1nαiyixi \frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n} \alpha_i y_i x_i Lb=0i=1nαiyi=0 \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0
  • The dual form of the optimization problem is: Maximize W(α)=i=1nαi12i=1nj=1nαiαjyiyj(xixj) \text{Maximize } W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
  • Capital letters such as A, X, Y denote matrices;
  • Greek letters such as Φ\Phi, κ\kappa denote functions;
  • lower-case bold letters such as aa, bb denote vectors;
  • script letters such as A\mathcal{A}, B\mathcal{B}, V\mathcal{V}, E\mathcal{E} denote sets or spaces;
  • a||a|| denotes the L2L^2 norm of vector;
  • xF||x||_F denotes the Frobenius norm of matrix, and is given by xF=(i,jxij2)1/2||x||_F = \big(\sum_{i,j} |x_{ij}|^2\big)^{1/2};

Linear Kernel

K(xi,xj)=xixjK(x_i, x_j) = x_i \cdot x_j

Gaussian Kernel / RBF Kernel

K(xi,xj)=exp(xixj22σ2)K(x_i, x_j) = \exp\Big(-\frac{||x_i - x_j||^2}{2\sigma^2}\Big)

Polynomial Kernel

K(xi,xj)=(xixj+1)hK(x_i, x_j) = (x_i \cdot x_j + 1)^h

Sequential Minimal Optimization (SMO)

  • an algorithm for solving the quadratic programming (QP) problem that arises during the training of support vector machines (SVMs).

Advantages

  • Overfitting is unlikely
    • SVM uses the maximum margin hyperplane, which is relatively stable.
    • the maximum margin hyperplane is only sensitive to the changes in the support vectors.
    • the variance in the data may have a relatively low effect on the performance
  • Computational complexity
    • since every time we need to classify a new sample we need to examine that sample with all the support vectors.
    • Using the kernel trick can help to alleviate this by calculating the dot product before doing the nonlinear mapping.

Disadvantages

  • not perform very well when the data set has more noise
  • where the number of features for each data point exceeds the number of training data samples, the SVM will underperform.

Applications