PySpark randomSplit() and sample() Methods

In PySpark, when we work on a large dataset we often need to split it into smaller chunks, or take some percentage of its rows, to perform an operation. PySpark provides two methods for this: randomSplit() splits a DataFrame into subsets according to the provided weights, while sample() returns a random sample of the DataFrame's rows. Below we walk through the sample() method.

Required Module

!pip install pyspark

What is the sample() method?

The sample() method returns a random sample of records from a dataset. When we work on a large dataset and want to analyze or test only a chunk of it, sample() gives us a smaller subset to run our code against.

Syntax of sample()

Syntax: DataFrame.sample(withReplacement=None, fraction=None, seed=None)

Parameters:

  • withReplacement: bool, optional  # sample with replacement or not (default False)
  • fraction: float  # fraction of rows to generate, in the range [0.0, 1.0]
  • seed: int, optional  # seed for the random number generator
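
All three parameters can be passed positionally or by keyword. A minimal sketch of the common call forms (assuming a SparkSession is already running and df is an existing DataFrame):

Python3

# fraction only: roughly 5% of the rows, a fresh sample on every run
df.sample(0.05)

# fraction + seed: roughly 10% of the rows, reproducible across runs
df.sample(0.1, seed=97)

# withReplacement + fraction + seed: the same row may appear more than once
df.sample(True, 0.2, 98)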

Using Fraction to get a Sample of Records

By using fraction, only an approximate number of records based on the given fraction value is generated. The code below returns roughly 5% of the 100 rows. The fraction is not exact: running the same code again may return a different number of rows, and different rows.

Python3

# importing SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# get an existing SparkSession, or create one
spark = SparkSession.builder.appName(
    "SparkByExamples.com").getOrCreate()

# create a DataFrame with 100 rows (ids 0 to 99)
df = spark.range(100)

# sample roughly 5% of the rows
print(df.sample(0.05).collect())


Output:

A list of roughly 5 of the 100 rows; the exact rows and their count vary between runs.
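
Because the fraction is only approximate, the row count fluctuates from call to call. A small sketch reusing the df from the example above:

Python3

# each call draws a fresh sample, so the count fluctuates around 5
for _ in range(3):
    print(df.sample(0.05).count())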

Using Fraction and Seed

Passing a seed makes the sampling reproducible: the same fraction and seed return the same sample on every run over the same data.

Python3

# importing SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# get an existing SparkSession, or create one
spark = SparkSession.builder.appName(
    "SparkByExamples.com").getOrCreate()

# create a DataFrame with 100 rows (ids 0 to 99)
df = spark.range(100)

# sample roughly 10% of the rows, pinned by seed 97
print(df.sample(0.1, 97).collect())


Output:

A list of roughly 10 of the 100 rows; because the seed is fixed, repeated runs return the same rows.
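
To convince yourself that the seed pins the sample, draw it twice and compare. A short sketch reusing the df from the example above:

Python3

# same fraction and seed over the same data -> the same rows both times
first = df.sample(0.1, 97).collect()
second = df.sample(0.1, 97).collect()
print(first == second)  # expected: True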

Using withReplacement True/False

Whenever we want duplicate rows to be allowed in the sample, we pass True as the withReplacement argument; otherwise there is no need to specify anything, since the default False picks each row at most once.

Python3

# importing SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# get an existing SparkSession, or create one
spark = SparkSession.builder.appName(
    "SparkByExamples.com").getOrCreate()

# create a DataFrame with 100 rows (ids 0 to 99)
df = spark.range(100)
print("With Duplicates:")

# sample roughly 20% of the rows with replacement,
# so the result may contain duplicates
print(df.sample(True, 0.2, 98).collect())
print("Without Duplicates: ")

# sample roughly 10% of the rows without replacement,
# so every row appears at most once
print(df.sample(False, 0.1, 123).collect())


Output:

Two lists of rows: the first (with replacement) may contain repeated values, while the second (without replacement) never does.
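
Since sampling with replacement can draw the same row more than once, the duplicates can be counted directly. A short sketch reusing the df from the example above (the 0.5 fraction and seed 42 are illustrative values, not from the original examples):

Python3

# count how many sampled ids were drawn more than once
rows = df.sample(True, 0.5, 42).collect()
ids = [row.id for row in rows]
print(len(ids) - len(set(ids)))  # usually greater than 0 with replacement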



