[Scikit-Learn] Using “train_test_split()” to split your data

When we train a model, we need to split our data into training data and test data. We use the training data to fit the model and make sure the model never peeks at the test data. This is very important, because the test data is what lets us assess the quality of the model.

After all, to our model, the test data should be completely new.

Of course, we can do the split manually. But scikit-learn provides a useful function, “train_test_split()”, that splits the data in one line.
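For comparison, here is a rough sketch of what a manual split might look like using only the standard library. The helper name manual_split and its default arguments are made up for this illustration, and it is not how scikit-learn implements the split internally.

```python
import random


def manual_split(data, train_size=0.8, seed=777):
    """Shuffle a copy of the data, then cut it into train and test parts."""
    rng = random.Random(seed)   # private generator, so the result is repeatable
    shuffled = list(data)       # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_size)  # number of training items
    return shuffled[:cut], shuffled[cut:]


train_part, test_part = manual_split(range(1, 11))
print(len(train_part), len(test_part))  # 8 2
```

It works, but we have to take care of the shuffling, the copy and the cut index ourselves.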

If there is something you want to do and you find a function that does it with a single call, there is nothing better than that.


How to use train_test_split()

If the “scikit-learn” package is not installed in your Python environment, use the following command to install it:

pip3 install scikit-learn

After installing scikit-learn, let's try calling the “train_test_split()” function!

First, we generate some demo data.

data = list(range(1, 11))
print(data)



Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Then we import the “train_test_split()” function into our program:

from sklearn.model_selection import train_test_split



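The arguments are very simple: the data, a random seed, and a split ratio.

  • data: the source data to split
  • seed (random_state): the random seed; fixing it makes the split result reproducible
  • ratio (train_size or test_size): the fraction of the data that goes to the training set or to the test set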
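To make the seed and ratio arguments a bit more concrete, here is a small sketch. It assumes the same 1 to 10 demo list as above. The variable names are only for illustration.

```python
from sklearn.model_selection import train_test_split

data = list(range(1, 11))

# The same random_state gives the same split on every call.
first_train, first_test = train_test_split(data, random_state=777, train_size=0.8)
second_train, second_test = train_test_split(data, random_state=777, train_size=0.8)
print(first_train == second_train, first_test == second_test)

# The ratio can also be given from the test side with test_size.
other_train, other_test = train_test_split(data, random_state=777, test_size=0.2)
print(len(other_train), len(other_test))
```

If random_state is left out, the shuffle is generally different on every run, so fixing it is handy whenever results need to be compared.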

And we call the function:

train_data, test_data = train_test_split(data, random_state=777, train_size=0.8)
print(train_data)
print(test_data)



Output:

[9, 4, 5, 6, 2, 10, 7, 8]
[3, 1]

As we can see, the ratio of training data to test data is indeed 8:2, which matches the train_size=0.8 we set.
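In practice we usually have features and labels rather than a single list. The function accepts several sequences of the same length and splits them with the same shuffle, so each row stays paired with its label. Below is a minimal sketch; the features and labels are made-up toy values, not data from this post.

```python
from sklearn.model_selection import train_test_split

# Toy data: each row is [n, n squared]; the label says whether n is odd.
features = [[n, n * n] for n in range(1, 11)]
labels = [n % 2 for n in range(1, 11)]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, random_state=777, train_size=0.8
)

print(len(X_train), len(X_test))  # 8 2

# Every row is still paired with its own label after the split.
for row, label in zip(X_train, y_train):
    assert row[0] % 2 == label
```

Note the order of the returned values: the train and test parts of the first input come first, followed by the train and test parts of the second.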

That is a short note on train_test_split(), which is very convenient and easy to use.

