Cross-validation (CV) is a technique for testing the effectiveness of a statistical model. To perform CV we set aside a sample/portion of the data that is not used to train the model, and later use this sample for testing/validating. Below are a few common techniques used for CV.
1. Train/Test Split approach.
In this approach the complete dataset is split into a training set and a test set. The model is trained on the training set and evaluated on the test set. If the data is limited, this is wasteful: the observations left out of the training set carry information the model never learns from.
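A minimal sketch of the train/test split approach using scikit-learn's `train_test_split`; the toy arrays and the 30% test fraction are illustrative choices, not prescribed by the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 30% of the data for testing; the model never trains on this portion.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)

print(x_train.shape, x_test.shape)  # (7, 2) (3, 2)
```

Fixing `random_state` makes the split reproducible, which matters when comparing models on the same held-out data.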
2. K-Folds Cross Validation
K-Fold ensures that every observation from the original dataset has a chance of appearing in both the training set and the test set, which is very appealing when data is limited. This method can be illustrated as follows:
import numpy as np
from sklearn.model_selection import KFold  # import KFold

x = np.array([[1, 3], [3, 10], [3, 4], [4, 8], [5, 7], [6, 7]])
y = np.array([1, 5, 8, 9, 10, 15])

kf = KFold(n_splits=2)
kf.get_n_splits(x)  # number of splitting iterations
for train_index, test_index in kf.split(x):
    print('train_index =', train_index, 'test_index =', test_index)
train_index = [3 4 5] test_index = [0 1 2] train_index = [0 1 2] test_index = [3 4 5]
As you can see, the split produces complementary subsets of the data: each observation appears in the test set exactly once and in the training set in the remaining K - 1 folds.
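The folds above are usually not printed but used to fit and score a model, one fold at a time. A hedged sketch of that step, reusing the same toy data; `LinearRegression` is an assumed example estimator, not one named in the text.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

x = np.array([[1, 3], [3, 10], [3, 4], [4, 8], [5, 7], [6, 7]])
y = np.array([1, 5, 8, 9, 10, 15])

# cross_val_score trains on each fold's training indices and
# scores on its test indices, returning one score per fold.
kf = KFold(n_splits=2)
scores = cross_val_score(LinearRegression(), x, y, cv=kf)

print(len(scores))  # 2 scores, one per fold
```

Averaging the per-fold scores gives a single estimate of how the model generalizes, which is the usual end goal of K-Fold CV.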