What is the sample() method
The sample() method returns a random sample of records from a DataFrame. When we work on a large dataset and want to analyze or test only a chunk of it, the sample() method lets us draw a smaller sample to run our logic on.
Syntax of sample()
Syntax: sample(withReplacement=None, fraction=None, seed=None)
Parameters:
- withReplacement: bool, optional # sample with replacement or not (default False)
- fraction: float # fraction of rows to generate, in the range [0.0, 1.0]
- seed: int, optional # seed for the random number generator used in sampling
Using Fraction to get a Sample of Records
When only a fraction is given, an approximate number of records based on that fraction value is returned. The code below returns roughly 5% of the 100 records. Fraction does not give an exact count, so running the code several times may return both different rows and a different number of rows.
Python3
# importing SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# get an existing SparkSession or create a new one
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

# create a DataFrame with 100 records
df = spark.range(100)

# get roughly 5% of the records
print(df.sample(0.05).collect())
Output:
Using Fraction and Seed
A seed makes the random sampling reproducible: the same fraction and seed return the same rows on every run.
Python3
# importing SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# get an existing SparkSession or create a new one
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

# create a DataFrame with 100 records
df = spark.range(100)

# get ~10% of the records; the seed (97) makes the sample reproducible
print(df.sample(0.1, 97).collect())
Output:
Using withReplacement True/False
Whenever we need duplicate rows in our sample, pass withReplacement=True; otherwise the default (False) applies and there is no need to specify anything.
Python3
# importing SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# get an existing SparkSession or create a new one
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

# create a DataFrame with 100 records
df = spark.range(100)

print("With Duplicates:")
# sample with replacement: the same row may appear more than once
print(df.sample(True, 0.2, 98).collect())

print("Without Duplicates: ")
# sample without replacement: each row appears at most once
print(df.sample(False, 0.1, 123).collect())
Output:
PySpark randomSplit() and sample() Methods
In this article, we look under the hood at how randomSplit() and sample() work in PySpark with Python. When working on large datasets in PySpark, we often need to split the data into smaller chunks or take some percentage of it to perform an operation. For this, PySpark provides two methods: randomSplit() and sample(). randomSplit() splits the DataFrame into multiple parts according to the provided weights, whereas sample() returns a random sample of the DataFrame.
Required Module
!pip install pyspark