Training Set
This is the dataset the model actually trains on, i.e. the model sees and learns from this data in order to predict outcomes or make the right decisions. Training data is usually collected from several sources and then preprocessed and organized so that the model can perform well. The quality and diversity of the training data largely determine how well the model generalizes: the more representative the data, the better the model tends to perform on unseen inputs. The training set typically makes up more than 60% of the total data available for the project.
Example:
Python3
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Making a dummy array to represent x, y for the example.
# x is an array ranging from 0-15, reshaped
# to form a matrix of shape 8x2
x = np.arange(16).reshape((8, 2))

# y is just a list of the numbers 0-7
# representing the target variable
y = list(range(8))

# Splitting the dataset in 80-20 fashion, i.e.
# the testing set is 20% of the total data and
# the training set is 80% of the total data
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, random_state=42
)

# Training set
print("Training set x: ", x_train)
print("Training set y: ", y_train)
Output:
Training set x:  [[ 0  1]
 [14 15]
 [ 4  5]
 [ 8  9]
 [ 6  7]
 [12 13]]
Training set y:  [0, 7, 2, 4, 3, 6]
Explanation:
- First, we create a dummy 8×2 matrix with the NumPy library to represent the input x, and a list of the integers 0 to 7 to represent the target variable y.
- To split the dataset into training and testing data, we use the train_test_split function from the sklearn library.
- The input data x and the target variable y are passed to the function, which divides the dataset into two parts according to train_size. With train_size=0.8, the training set gets 80% of the given dataset and the testing set gets the remaining 20%. Here that is 6 training samples and 2 testing samples out of 8.
- train_test_split shuffles the data before splitting by default. Setting random_state to a fixed integer does not turn the shuffling on; it seeds it, so the same split is produced every time the code runs.
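The role of random_state can be checked with a short sketch. It reuses the dummy x and y from the example above; the variable names (y_train_a, y_train_ordered, and so on) are only illustrative. It assumes the default shuffle=True behaviour of train_test_split and contrasts it with shuffle=False, which keeps the original order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same dummy data as in the example above
x = np.arange(16).reshape((8, 2))
y = list(range(8))

# Two calls with the same random_state give the same split
_, _, y_train_a, y_test_a = train_test_split(
    x, y, train_size=0.8, random_state=42
)
_, _, y_train_b, y_test_b = train_test_split(
    x, y, train_size=0.8, random_state=42
)
same_split = (y_train_a == y_train_b) and (y_test_a == y_test_b)
print("Same split with the same seed:", same_split)

# With shuffle=False the data is cut in its original order:
# the first 80% goes to training and the rest to testing
_, _, y_train_ordered, y_test_ordered = train_test_split(
    x, y, train_size=0.8, shuffle=False
)
print("Ordered training set y:", y_train_ordered)
print("Ordered testing set y:", y_test_ordered)
```

A fixed seed is mainly useful for reproducible experiments; leaving random_state unset gives a different shuffle on each run.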
Training vs Testing vs Validation Sets
In this article, we look at training, testing, and validation sets and how to create them.
The fundamental purpose of splitting the dataset is to assess how well the trained model generalizes to new data. The split can be done with the train_test_split function of scikit-learn.
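train_test_split produces only two parts per call, so one common way to obtain a separate validation set is to call it twice. The sketch below is one possible approach, not the only one: the 70/10/20 proportions, the 40-sample dummy data, and the names x_rest and x_val are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 40 samples with 2 features each (illustrative only)
x = np.arange(80).reshape((40, 2))
y = list(range(40))

# Step 1: hold out 20% of the data as the testing set
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Step 2: take 12.5% of the remaining 80% as the validation set,
# which is 10% of the original data (0.125 * 0.8 = 0.1)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.125, random_state=42
)

print("Training samples:  ", len(y_train))
print("Validation samples:", len(y_val))
print("Testing samples:   ", len(y_test))
```

With 40 samples this yields 28 training, 4 validation, and 8 testing samples, and no sample appears in more than one set.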