SVM P.1 – Loading Sklearn Datasets
Support Vector Machines (SVM)
SVM stands for support vector machine. SVMs are typically used for classification tasks, similar to what we did with K Nearest Neighbors. They work very well for high dimensional data and allow us to classify data that does not have a linear correspondence, for example a data set whose classes cannot be separated by a straight line. Attempting to use K Nearest Neighbors on such data would likely give us a very low accuracy score, which is not favorable. This is where SVMs are useful.
Importing Modules
Before we start we need to import a few things from sklearn.
import sklearn
from sklearn import svm
from sklearn import datasets
from sklearn import model_selection  # makes sklearn.model_selection available for the split later
Loading Data
In previous tutorials we did quite a bit of work to load in our data sets from places like the UCI Machine Learning Repository. That is a very useful skill and is something you will often have to do when applying these algorithms to your own data. However, now that we have learned this, we will use the data sets that come with sklearn. These are much nicer to work with and have some nice methods that make loading in data very quick.
For this tutorial we will be using a breast cancer data set. It consists of many features describing a tumor and classifies each tumor as either cancerous (malignant) or non-cancerous (benign).
To load our data we will simply do the following.
cancer = datasets.load_breast_cancer()
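To get a feel for what load_breast_cancer() returns, a quick sketch like the one below can help. It only inspects the loaded object; the variable names n_samples and n_features are just for illustration.

```python
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# The returned object behaves like a dictionary that also allows attribute access.
print(cancer.keys())

# data holds one row per tumor and one column per feature.
n_samples, n_features = cancer.data.shape
print("Samples:", n_samples)    # 569 in the bundled data set
print("Features:", n_features)  # 30 in the bundled data set

# target holds one integer label per row.
print("Label array shape:", cancer.target.shape)
```

So each row of cancer.data lines up with the entry at the same index in cancer.target.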
To see a list of the features in the data set we can do:
print("Features: ", cancer.feature_names)
Similarly for the labels.
print("Labels: ", cancer.target_names)
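The output should list the 30 feature names (such as 'mean radius' and 'mean texture') followed by the two labels, 'malignant' and 'benign'.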
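The labels themselves are stored as integers in cancer.target, and cancer.target_names tells us what each integer means. As a small sketch (using numpy's bincount, which is not part of the code in this tutorial), we can count how many tumors fall into each class:

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# target stores integers; target_names maps each integer to its label text.
counts = np.bincount(cancer.target)

for label_id, name in enumerate(cancer.target_names):
    print(label_id, name, counts[label_id])
# 0 malignant 212
# 1 benign 357
```

Note that 0 means malignant and 1 means benign, which is easy to get backwards.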
Splitting Data
Now that we have loaded our data set, it is time to split it into training and testing data. We will do this the same way as in previous tutorials.
x = cancer.data  # All of the features
y = cancer.target  # All of the labels
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
If we want to have a look at our data we can print the first few instances.
print(x_train[:5], y_train[:5])
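Because train_test_split shuffles the rows randomly, every run gives a different split. If you want the same split each time, something like the sketch below should work. The random_state=42 and stratify=y arguments are optional additions for illustration and are not used in the tutorial code.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
x = cancer.data    # All of the features
y = cancer.target  # All of the labels

# random_state fixes the shuffle so the split is identical on every run (42 is arbitrary).
# stratify=y keeps the malignant/benign ratio roughly the same in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

print(x_train.shape, x_test.shape)  # (455, 30) (114, 30)
```

With test_size=0.2, 114 of the 569 rows are held out for testing and the remaining 455 are used for training.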
Full Code
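import sklearn
from sklearn import datasets
from sklearn import svm
from sklearn import model_selection  # makes sklearn.model_selection available below

# Load the breast cancer data set bundled with sklearn
cancer = datasets.load_breast_cancer()

print(cancer.feature_names)
print(cancer.target_names)

x = cancer.data  # All of the features
y = cancer.target  # All of the labels

# Hold out 20% of the data for testing
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

print(x_train, y_train)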
The next tutorial will explain how an SVM works and the math behind it. Following that, I will go over implementing the algorithm.