Pyspark RDD Sample

1. Using sample function:

Here we use the sample function to extract a random sample from a PySpark RDD.

Syntax: sample(withReplacement, fraction, seed=None)

Stepwise Implementation:

Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Next, convert the data frame to an RDD for performing sampling operations.

data_frame_rdd = data_frame.rdd

Step 5: Finally, extract the random sample of the data frame using the sample function with withReplacement, fraction, and seed as arguments.

data_frame_rdd.sample(withReplacement, fraction, seed=None)

Example:

In this example, we extract a sample from the DataFrame (a 5×5 dataset) using the sample function with fraction and withReplacement as arguments. We call sample twice: first with withReplacement set to True and then with it set to False. With True, the sample may contain repeated rows; with False, no row appears more than once.

Python3
# Python program to extract a PySpark random sample through the
# sample function with withReplacement and fraction as arguments

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Convert the data frame to an RDD
data_frame_rdd = data_frame.rdd

# Extract a random sample through the sample function using
# withReplacement=True and fraction as arguments
print(data_frame_rdd.sample(True, 0.2).collect())

# Extract another random sample through the sample function using
# withReplacement=False and fraction as arguments
print(data_frame_rdd.sample(False, 0.2).collect())


Output:

The first call, with withReplacement=True, prints a list of sampled Row objects in which some rows may repeat. The second call, with withReplacement=False, prints a list of sampled Row objects with no repeated rows.

2. Using takeSample function:

Here we use the takeSample function to extract a random sample from a PySpark RDD.

Syntax: takeSample(withReplacement, num, seed=None)

Stepwise Implementation:

Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file and display it to see if it is correctly uploaded.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
data_frame.show()

Step 4: Next, convert the data frame to an RDD for performing sampling operations.

data_frame_rdd = data_frame.rdd

Step 5: Finally, extract the random sample of the data frame using the takeSample function with withReplacement, num, and seed as arguments.

data_frame_rdd.takeSample(withReplacement, num, seed=None)

Example:

In this example, we extract a sample from the DataFrame (a 5×5 dataset) using the takeSample function with num and withReplacement as arguments. We call takeSample twice: first with withReplacement set to True and then with it set to False. With True, the sample may contain repeated rows; with False, no row appears more than once.

Python3
# Python program to extract a PySpark random sample through the
# takeSample function with withReplacement, num and seed as arguments

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data.csv',
                                    sep=',', inferSchema=True, header=True)

# Convert the data frame to an RDD
data_frame_rdd = data_frame.rdd

# Extract a random sample through the takeSample function using
# withReplacement=True, num and seed as arguments
print(data_frame_rdd.takeSample(True, 2, 2))

# Extract another random sample through the takeSample function using
# withReplacement=False, num and seed as arguments
print(data_frame_rdd.takeSample(False, 2, 2))


Output:

The first call, with withReplacement=True, prints a list of 2 sampled Row objects in which rows may repeat. The second call, with withReplacement=False, prints a list of 2 distinct sampled Row objects.



PySpark Random Sample with Example

Do you work with large amounts of data on a daily basis? Then you have probably needed to extract a random sample from a dataset. There are numerous ways to do this. Don't know them all? Read on to learn more about random sample extraction from a PySpark dataset using Python.

Prerequisite

Note: Follow the article about installing PySpark, but install Python instead of Scala; the rest of the steps are the same.

Modules Required:

PySpark: the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. It can be installed with the following command:

pip install pyspark

Student_data.csv file: a 5×5 dataset of student records.
