[Python] Use ShuffleSplit() To Process Cross-Validation Step

Cross-validation is an important concept in data splitting for machine learning. Simply put, when we want to train a model, we need to split the data into training data and testing data.

We use the training data to train our model and the testing data to evaluate it. No sample in the testing data may also appear in the training data.

That way, we can measure the performance of our model on data it has never seen.

But if we always use the same training data and the same testing data, there is a risk that we are only tuning the model to that particular testing data. If we change the testing data, the model may turn out to be less useful.

In the past, I wrote an article about how to use the train_test_split() function in the scikit-learn package, but today I want to note another useful tool: ShuffleSplit().

ShuffleSplit() lets us quickly generate many different random training/testing splits of the same data.

train_test_split

In fact, train_test_split() is very easy to use.

The following is a simple example.

# coding: utf-8
from sklearn.model_selection import train_test_split


# train_test_split
elements = list(range(10))
train_data, test_data = train_test_split(elements, train_size=0.8)
print('Train: {} Test: {}'.format(train_data, test_data))

Output (the split is random, so it will differ between runs):

Train: [5, 1, 8, 7, 4, 0, 9, 3] Test: [2, 6]
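In practice we usually split features and labels together. The following is a minimal sketch of that, assuming a made-up toy dataset (the names features and labels are only for illustration). It also sets random_state so the split is the same on every run.

```python
# coding: utf-8
from sklearn.model_selection import train_test_split


# Hypothetical toy data: 10 samples with one feature each, and binary labels
features = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]

# Passing several sequences splits them with the same shuffled indices,
# so every sample keeps its own label
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, train_size=0.8, random_state=0
)

print('Train: {} Labels: {}'.format(X_train, y_train))
print('Test: {} Labels: {}'.format(X_test, y_test))
```

Without random_state, each run shuffles differently, which is fine for a quick experiment but makes results harder to compare.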

ShuffleSplit

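The main parameters of ShuffleSplit():

n_splits (int, default=10): the number of random train/test splits to generate
test_size: the proportion of the data used for testing (a float between 0.0 and 1.0), or an int for an absolute number of samples
train_size: the proportion of the data used for training (a float between 0.0 and 1.0), or an int for an absolute number of samples
random_state: the random seed; set it to an int to get reproducible splits

Just like with train_test_split(), you only need to set one of test_size and train_size; the other is then taken as the complement.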
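To make these parameters a bit more concrete, here is a small sketch that sets only test_size and fixes random_state. The variable names (splitter, first_run, second_run) are my own and the seed 42 is arbitrary.

```python
# coding: utf-8
from sklearn.model_selection import ShuffleSplit


elements = list(range(10))

# Only test_size is given, so the remaining 8 of the 10 samples go to training
splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

first_run = [(train.tolist(), test.tolist()) for train, test in splitter.split(elements)]
second_run = [(train.tolist(), test.tolist()) for train, test in splitter.split(elements)]

# With an integer random_state, calling split() again yields the same splits
print(first_run == second_run)
for train, test in first_run:
    print('Train size: {} Test size: {}'.format(len(train), len(test)))
```

If random_state is left as None, the two runs would most likely differ.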

# coding: utf-8
from sklearn.model_selection import ShuffleSplit


# ShuffleSplit
elements = list(range(10))
rs = ShuffleSplit(n_splits=5, train_size=0.8)
for train_data, test_data in rs.split(elements):
    print('Train: {} Test: {}'.format(train_data, test_data))

Output:

Train: [9 5 2 4 3 8 7 6] Test: [1 0]
Train: [0 7 9 1 2 5 3 6] Test: [8 4]
Train: [2 7 4 6 1 5 9 0] Test: [8 3]
Train: [4 9 8 7 0 1 5 6] Test: [2 3]
Train: [9 7 0 6 8 1 2 3] Test: [4 5]

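As you can see, ShuffleSplit can generate many different training/testing splits. Note that split() yields arrays of indices, not the elements themselves; here the two coincide only because the elements are the numbers 0 to 9.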
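To use these splits in an actual cross-validation step, one option is to pass the ShuffleSplit object as the cv argument of cross_val_score(). The following is only a sketch: the iris dataset and LogisticRegression are my own choices for illustration, and any scikit-learn estimator could take their place.

```python
# coding: utf-8
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score


# Example data and model, chosen only for illustration
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Five random 80/20 splits; random_state keeps them repeatable
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# The model is trained and scored once per split
scores = cross_val_score(model, X, y, cv=cv)
print('Scores: {}'.format(scores))
print('Mean accuracy: {:.3f}'.format(scores.mean()))
```

Averaging the scores over several random splits gives a less split-dependent estimate than a single train/test split, which addresses the worry from the beginning of this article.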

References

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html

[Scikit-Learn] Using “train_test_split()” to split your data
