Cross-validation (CV) is a technique for assessing how well a statistical model generalizes. To perform CV we set aside a sample/portion of the data that is not used to train the model, and later use this sample for testing/validation. Below are a few common techniques used for CV.
1. Train_Test Split approach.
In this approach the complete dataset is split into a training set and a test set. The model is trained on the training set and evaluated on the test set. If the data is limited, we lose a lot of information from the observations that are never used for training.
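As a minimal sketch, the split described above can be done with scikit-learn's train_test_split (the toy arrays, test size, and random seed here are illustrative choices, not from the original):

```python
from sklearn.model_selection import train_test_split
import numpy as np

x = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# hold out 30% of the data for testing; random_state fixes the shuffle
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)

print(x_train.shape, x_test.shape)  # (7, 2) (3, 2)
```

The model would then be fit on x_train/y_train and scored on x_test/y_test only.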
2. K-Folds Cross Validation
K-Fold ensures that every observation from the original dataset has a chance of appearing in both the training and the test set, which is very appealing when data is limited. The method can be described as follows:
- Split the entire dataset randomly into K folds. A higher value of K gives a less biased performance estimate (at the cost of higher variance in that estimate), whereas a lower value of K behaves more like the train-test split approach we saw before.
- Fit the model on K-1 folds and calculate the performance (error, ROC AUC, etc.) on the remaining fold.
- Repeat this process so that each of the K folds serves as the test set once, then average the performance scores. That average is the performance metric for the model.
from sklearn.model_selection import KFold  # import KFold
import numpy as np

x = np.array([[1, 3], [3, 10], [3, 4], [4, 8], [5, 7], [6, 7]])
y = np.array([1, 5, 8, 9, 10, 15])
kf = KFold(n_splits=2)  # split the 6 samples into 2 folds

for train_index, test_index in kf.split(x):
    print('train_index =', train_index, 'test_index =', test_index)
train_index = [3 4 5] test_index = [0 1 2]
train_index = [0 1 2] test_index = [3 4 5]
As you can see, the function splits the original data into complementary subsets of train and test indices, so every observation is used for testing exactly once.
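The fold-by-fold fitting and averaging described in the steps above can be sketched with scikit-learn's cross_val_score; the LinearRegression model and R² scoring here are illustrative assumptions, since the original does not specify a model:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

x = np.array([[1, 3], [3, 10], [3, 4], [4, 8], [5, 7], [6, 7]])
y = np.array([1, 5, 8, 9, 10, 15])

kf = KFold(n_splits=2)
model = LinearRegression()

# one score per fold; the mean is the overall performance estimate
scores = cross_val_score(model, x, y, cv=kf, scoring='r2')
print('fold scores =', scores)
print('mean score =', scores.mean())
```

With n_splits=2 this produces two R² scores, one per held-out fold, and their mean is the cross-validated performance of the model.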