What is the sample() method
The sample() method returns a random sample of records from a DataFrame. When we work on a large dataset and want to analyze or test only a chunk of it, the sample() method lets us draw a smaller sample to run our logic on.
Syntax of sample()
Syntax: sample(withReplacement=None, fraction=None, seed=None)
Parameters:
- withReplacement: bool, optional # sample with replacement or not (default False)
- fraction: float # fraction of rows to generate, in the range [0.0, 1.0]
- seed: int, optional # seed for the random number generator used in sampling
Using Fraction to get a Sample of Records
When only a fraction is given, an approximate number of records based on that fraction value is returned. The code below returns roughly 5% of the 100 records. Fraction does not give an exact count, so running the code several times may return both different rows and a different number of rows.
Python3
# importing SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# get an existing SparkSession or create a new one
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

# create a DataFrame with 100 records
df = spark.range(100)

# get roughly 5% of the records
print(df.sample(0.05).collect())
Output:
Using Fraction and Seed
A seed makes the random sampling reproducible: the same fraction and seed return the same rows on every run.
Python3
# importing SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# get an existing SparkSession or create a new one
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

# create a DataFrame with 100 records
df = spark.range(100)

# get ~10% of the records; the seed (97) makes the sample reproducible
print(df.sample(0.1, 97).collect())
Output:
Using withReplacement True/False
Whenever we need duplicate rows in our sample, pass withReplacement=True; otherwise the default (False) applies and there is no need to specify anything.
Python3
# importing SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# get an existing SparkSession or create a new one
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

# create a DataFrame with 100 records
df = spark.range(100)

print("With Duplicates:")
# sample with replacement: the same row may appear more than once
print(df.sample(True, 0.2, 98).collect())

print("Without Duplicates: ")
# sample without replacement: each row appears at most once
print(df.sample(False, 0.1, 123).collect())
Output:
PySpark randomSplit() and sample() Methods
In this article, we look under the hood at how randomSplit() and sample() work in PySpark with Python. When working on large datasets in PySpark, we often need to split the data into smaller chunks or take some percentage of it to perform an operation. For this, PySpark provides two methods: randomSplit() and sample(). randomSplit() splits the DataFrame into multiple parts according to the provided weights, whereas sample() returns a random sample of the DataFrame.
Required Module
!pip install pyspark