1. Explain the SVM machine learning algorithm in detail.
SVM stands for support vector machine. It is a supervised machine learning algorithm that can be used for both regression and classification. If your training data set has n features, SVM represents each example as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. SVM then uses hyperplanes to separate the different classes, working in the feature space defined by the provided kernel function.
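As a minimal sketch of separating classes with a hyperplane, the snippet below classifies a point by the sign of w·x + b. The weights `w` and bias `b` are made-up values for illustration, not the result of training an SVM.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical hyperplane x1 - x2 = 0 in a 2-feature space (assumed, not learned).
w = [1.0, -1.0]
b = 0.0

print(predict(w, b, [3.0, 1.0]))  # point below the line x1 = x2
print(predict(w, b, [1.0, 3.0]))  # point above the line x1 = x2
```

With n features, `w` and `x` simply have n entries each; the rule stays the same.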
2. What are support vectors in SVM?
In the above diagram, the thinner lines mark the distance from the classifier to the closest data points, which are called the support vectors (the darkened data points). The distance between the two thin lines is called the margin.
3. What are the different kernel functions in SVM?
Four kernels are commonly used in SVM:
- Linear kernel
- Polynomial kernel
- Radial basis function (RBF) kernel
- Sigmoid kernel
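The four kernels listed above can be sketched in plain Python as follows. The parameter names `gamma`, `coef0` and `degree` and their default values are assumptions chosen for illustration; libraries may use different names and defaults.

```python
import math


def dot(x, z):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    # Depends only on the squared Euclidean distance between x and z.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)


x = [1.0, 2.0]
z = [3.0, 0.5]
print(linear_kernel(x, z))
print(polynomial_kernel(x, z, degree=2))
print(rbf_kernel(x, z))
print(sigmoid_kernel(x, z))
```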
4. Give some situations where you would use an SVM over a Random Forest machine learning algorithm, and vice versa.
The performance depends on many factors:
- the number of training instances
- the distribution of the data
- linear vs. non-linear problems
- input scale of the features
- the chosen hyperparameters
- how you validate/evaluate your model
In general, it is easier to train a well-performing Random Forest classifier since you have to worry less about hyperparameter optimization. Due to the nature of Random Forests, you are less likely to overfit. You simply grow n trees on n bootstrap samples of the training set, each on random feature subspaces; using the majority vote, the estimate will be pretty robust.
Using Support Vector Machines, you have “more things” to “worry” about such as choosing an appropriate kernel (poly, RBF, linear …), the regularization penalty, the regularization strength, kernel parameters such as the poly degree or gamma, and so forth.
So, in sum, we can say that Random Forests are much more automated and thus “easier” to train compared to SVMs, but there are many examples in the literature where SVMs outperform Random Forests and vice versa on different datasets. So, if you would like to compare the two, make sure that you run a large enough grid search for the SVM and use nested cross-validation to reduce the performance estimation bias.
5. Why is SVM an example of a large margin classifier?
- SVM is a type of classifier which classifies positive and negative examples, here blue and red data points.
- As shown in the image, the largest margin is found in order to avoid overfitting, i.e., the optimal hyperplane is at the maximum distance from the positive and negative examples (equidistant from the two boundary lines).
- To satisfy this constraint, and also to classify the data points accurately, the margin is maximised; that is why this is called the large margin classifier.
6. What is the role of C in SVM?
The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable.
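The trade-off can be made concrete with the usual soft-margin objective, 0.5·||w||² + C·(sum of hinge losses). The sketch below evaluates that objective for two hand-picked candidate hyperplanes on a toy 1-D data set; the data and both candidates are assumptions made for illustration, and neither is claimed to be the true optimum.

```python
def hinge(y, score):
    """Hinge loss: zero when the point is on the correct side, outside the margin."""
    return max(0.0, 1.0 - y * score)


def objective(w, b, data, C):
    """Soft-margin objective for a 1-D linear classifier f(x) = w * x + b."""
    regularizer = 0.5 * w * w
    slack = sum(hinge(y, w * x + b) for x, y in data)
    return regularizer + C * slack


# Toy data: the positive point at x = -1 sits close to the negative class.
data = [(-3.0, -1), (-2.0, -1), (-1.0, 1), (2.0, 1), (3.0, 1)]

wide = (0.5, 0.0)    # margin 1/|w| = 2.0, but misclassifies the point at x = -1
narrow = (2.0, 3.0)  # margin 1/|w| = 0.5, classifies every training point


def preferred(C):
    """Name of the candidate with the lower objective for this C."""
    return "wide" if objective(*wide, data, C) < objective(*narrow, data, C) else "narrow"


for C in (0.1, 100.0):
    print(C, preferred(C))
```

With the small C the wide-margin candidate scores better despite its training error; with the large C the narrow-margin candidate that fits every point wins.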
7. What is the intuition of a large margin classifier?
Let’s say you’ve found a hyperplane that completely separates the two classes in your training set. We expect that when new data comes along (i.e. your test set), the new data will look like your training data. Points that should be classified as one class or the other should lie near the points in your training data with the corresponding class. Now, if your hyperplane is oriented such that it is close to some of the points in your training set, there’s a good chance that the new data will lie on the wrong side of the hyperplane, even if the new points lie close to training examples of the correct class.
So we say that we want to find the hyperplane with the maximum margin. That is, find a hyperplane that divides your data properly, but is also as far as possible from your data points. That way, when new data comes in, even if it is a little closer to the wrong class than the training points, it will still lie on the right side of the hyperplane.
If your data is separable, then there are infinitely many hyperplanes that will separate it. SVM (and some other classifiers) optimizes for the one with the maximum margin, as described above.
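To illustrate, the sketch below computes the geometric margin (the distance from the hyperplane to the closest training point) for two hyperplanes that both separate the same toy data. The data set and both hyperplanes are assumed values for illustration.

```python
import math


def geometric_margin(w, b, data):
    """Smallest signed distance y * (w.x + b) / ||w|| over the data set.

    The result is positive only if every point is on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in data
    )


data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Two hypothetical separating hyperplanes.
margin_a = geometric_margin((1.0, 1.0), -2.0, data)  # x1 + x2 - 2 = 0
margin_b = geometric_margin((1.0, 0.0), -0.5, data)  # x1 - 0.5 = 0

print(margin_a, margin_b)
```

Both hyperplanes classify the training set perfectly, but the first keeps every point further away, which is the property a maximum-margin classifier optimizes for.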
8. What is a kernel in SVM? Why do we use kernels in SVM?
SVM algorithms use a set of mathematical functions known as kernels. The function of a kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions, both linear and nonlinear: for example linear, polynomial, radial basis function (RBF), and sigmoid. Kernel functions can be defined for sequence data, graphs, text, and images, as well as vectors. The most commonly used kernel function is RBF, because it has a localized and finite response along the entire x-axis. A kernel function returns the inner product between two points in a suitable feature space, thereby defining a notion of similarity at little computational cost even in very high-dimensional spaces.
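The statement that a kernel returns an inner product in a feature space can be checked on a small case. The sketch below uses the degree-2 polynomial kernel (x·z)² on 2-D inputs and an explicit feature map `phi` for it; the function names are illustrative.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel, computed in the input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


x = (1.0, 2.0)
z = (3.0, 4.0)

print(poly2_kernel(x, z))      # computed without building features
print(inner(phi(x), phi(z)))   # same value up to rounding, via the feature space
```

The kernel gets the same value without ever constructing `phi`, which is what keeps the cost low when the feature space is very high-dimensional.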
9. Can we apply the kernel trick to logistic regression? Why is it not used in practice then?
- Classification performance is almost identical in both cases.
- KLR (Kernel Logistic Regression) can provide class probabilities whereas SVM is a deterministic classifier.
- KLR has a natural extension to multi-class classification whereas in SVM, there are multiple ways to extend it to multi-class classification (and it is still an area of research whether there is a version which has provably superior qualities over the others).
- Surprisingly or unsurprisingly, KLR also has optimal margin properties that the SVMs enjoy (well in the limit at least)!
Looking at the above, it almost feels like kernel logistic regression is what you should be using. However, there are certain advantages that SVMs enjoy:
- KLR is computationally more expensive than SVM: O(N^3) vs O(N^2 k), where k is the number of support vectors.
- The classifier in SVM is designed such that it is defined only in terms of the support vectors, whereas in KLR, the classifier is defined over all the points and not just the support vectors. This allows SVMs to enjoy some natural speed-ups (in terms of efficient code-writing) that are hard to achieve for KLR.
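A rough sketch of the last point: in the dual form, the SVM decision function is a weighted sum of kernel values, and training points whose coefficient is zero contribute nothing. The points, labels, coefficients `alphas` and bias `b` below are made-up values for illustration, not the output of a real solver.

```python
import math


def rbf(x, z, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision(x, points, labels, alphas, b):
    """Dual-form decision value: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf(p, x) for p, y, a in zip(points, labels, alphas)) + b


# Hypothetical training set; only the two middle points have nonzero alpha.
points = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]
alphas = [0.0, 0.8, 0.8, 0.0]
b = 0.0

# Keep only the support vectors (alpha > 0).
support = [(p, y, a) for p, y, a in zip(points, labels, alphas) if a > 0.0]
sv_points, sv_labels, sv_alphas = zip(*support)

x_new = (3.0, 3.5)
full = decision(x_new, points, labels, alphas, b)
reduced = decision(x_new, sv_points, sv_labels, sv_alphas, b)
print(full, reduced)
```

Prediction cost therefore scales with the number of support vectors rather than with the full training set, whereas a KLR model generally has a nonzero coefficient for every training point.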
10. What is the difference between logistic regression and SVM without a kernel?
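Mostly in implementation. Both fit a linear decision boundary and differ in the loss they minimize (hinge loss for the SVM, log loss for logistic regression); in practice, one is much more efficient and has good optimization packages.
11. What is the difference between logistic regression and SVM?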
Logistic regression assumes that the predictors aren’t sufficient to determine the response variable, but determine a probability that is a logistic function of a linear combination of them. If there’s a lot of noise, logistic regression (usually fit with maximum-likelihood techniques) is a great technique.
On the other hand, there are problems where you have thousands of dimensions and the predictors do nearly-certainly determine the response, but in some hard-to-explicitly-program way. An example would be image recognition. If you have a grayscale image, 100 by 100 pixels, you have 10,000 dimensions already. With various basis transforms (kernel trick) you will be able to get a linear separator of the data.
Non-regularized logistic regression techniques don’t work well (in fact, the fitted coefficients diverge) when there’s a separating hyperplane, because the maximum likelihood is achieved by any separating plane, and there’s no guarantee that you’ll get the best one. What you get is an extremely confident model with poor predictive power near the margin.
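The divergence can be seen numerically. In the sketch below, a 1-D data set is perfectly separated at x = 0, and the log-likelihood of a logistic model p(y=1|x) = sigmoid(w·x) keeps increasing as w grows, so there is no finite maximizer. The data and the choice of a bias-free model are simplifying assumptions.

```python
import math


def log_likelihood(w, data):
    """Log-likelihood of the model p(y=1|x) = sigmoid(w * x) on labelled data."""
    total = 0.0
    for x, y in data:
        # z > 0 means the point is on the correct side for its label.
        z = w * x if y == 1 else -w * x
        # log(sigmoid(z)) written as -log(1 + exp(-z)) for numerical accuracy.
        total += -math.log1p(math.exp(-z))
    return total


# Perfectly separable toy data: negatives left of 0, positives right of 0.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

for w in (1.0, 10.0, 100.0):
    print(w, log_likelihood(w, data))
```

Each larger w gives a log-likelihood closer to zero, so an unregularized optimizer keeps pushing the coefficient upward.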
SVMs get you the best separating hyperplane, and they’re efficient in high dimensional spaces. They’re similar to regularization in terms of trying to find the lowest-normed vector that separates the data, but with a margin condition that favors choosing a good hyperplane. A hard-margin SVM will find a hyperplane that separates all the data (if one exists) and fail if there is none; soft-margin SVMs (generally preferred) do better when there’s noise in the data.
Additionally, SVMs only consider points near the margin (support vectors). Logistic regression considers all the points in the data set. Which you prefer depends on your problem.
Logistic regression is great in a low number of dimensions and when the predictors don’t suffice to give more than a probabilistic estimate of the response. SVMs do better when there’s a higher number of dimensions, and especially on problems where the predictors do certainly (or near-certainly) determine the responses.
12. Suppose you are using RBF kernel in SVM with high Gamma value. What does this signify?
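The gamma parameter of the RBF kernel controls how far the influence of a single training example reaches: a low gamma means far, and a high gamma means close.
For a low gamma, the model will be too constrained: each support vector's influence covers nearly the whole training dataset, so the model cannot really capture the shape of the data.
For a high gamma, only points close to a support vector are influenced by it, so the model follows the shape of the training data closely; if gamma is too high, it overfits.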
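A quick way to see the effect of gamma is to look at the RBF kernel value exp(-gamma · d²) as a function of the distance d between two points. The gamma values and distances below are arbitrary choices for illustration.

```python
import math


def rbf_similarity(distance, gamma):
    """RBF kernel value for two points that are `distance` apart."""
    return math.exp(-gamma * distance ** 2)


for gamma in (0.01, 10.0):
    values = [round(rbf_similarity(d, gamma), 4) for d in (0.1, 1.0, 3.0)]
    print(gamma, values)
```

With the low gamma, even points three units apart still look similar, so every training point influences a wide region. With the high gamma, the similarity is already negligible at distance 1, so each point only affects its immediate neighbourhood.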
13. What is generalization error in terms of the SVM?
Generalization error is the out-of-sample error: a measure of how accurately a model can predict values for previously unseen data.
14. Advantages of Support Vector Machines
1. Avoiding Over-Fitting
Once the hyperplane of the support vector machine has been found, most of the data other than the points closest to the plane (the support vectors) become redundant and can be omitted.
This implies that small changes to the data away from the margin cannot make any significant change to the model and leave the hyperplane unaffected. Hence the name ‘support vector machine’: the solution rests on the support vectors, and such algorithms tend to generalize well.
2. Simplification of Calculation
SVMs come with well-developed algorithms for separating data of two classes. This makes predictions and calculations simpler, since the model can be presented as a graph and used to estimate the class distinction.
This simpler, visual calculation gives faster and more reliable output than individually comparing each support coordinate of the two classes.
15. Disadvantages of Support Vector Machines
The main disadvantages lie in the theory, which only covers determining the parameters for a given set of values (the regularization and kernel parameters). Also, kernel models can be sensitive to overfitting the model selection criterion.
Moreover, the optimal choice of kernel often ends up with all the data points as support vectors, which makes the algorithm more cumbersome to run.