PySpark sample() method

PySpark provides several sampling methods that return a random subset of rows from a given PySpark DataFrame.

Here are the details of the sample() method:

Syntax : DataFrame.sample(withReplacement, fraction, seed)

It returns a subset of the DataFrame.

Parameters

withReplacement : bool, optional

Sample with replacement or not (default False).

fraction : float, optional

Fraction of rows to generate, in the range [0.0, 1.0].

seed : int, optional

Seed for sampling; pass the same seed to reproduce the same sample.

Example:

In this example, we pass a fraction of float data type from the range [0.0, 1.0], using the formula:

Expected number of rows = fraction * total number of rows

So to sample roughly one row, the fraction we need is 1 / (total number of rows). Note that sample() treats the fraction as a per-row probability, so the number of rows actually returned is not guaranteed to be exact.
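The formula above can be checked with plain arithmetic: for a four-row DataFrame and a target of one row, the fraction works out to 0.25. This is only the expected count, since sample() applies the fraction as a per-row probability.

```python
# Expected sample size for a given fraction (per-row probability).
# sample() does not guarantee an exact count; this is the mean of
# the distribution of returned row counts.
total_rows = 4    # rows in the example DataFrame below
target_rows = 1   # rows we would like in the sample

fraction = target_rows / total_rows
expected_rows = fraction * total_rows

print(fraction)       # 0.25
print(expected_rows)  # 1.0
```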

Python




# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
  
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()
  
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
  
# Creating a DataFrame
df = random_row_session.createDataFrame(data,
                                        columns)
  
# Printing the DataFrame
df.show()
  
# Taking a sample of df and storing it in df2.
# Note that the second argument is the fraction
# (a float) of the dataset we need:
# expected number of rows = fraction * total number of rows
df2 = df.sample(False, 1.0 / df.count())
  
# printing the sample row which is a DataFrame
df2.show()


Output

+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

+-------+--------+
|Letters|Position|
+-------+--------+
|      b|       2|
+-------+--------+

How to take a random row from a PySpark DataFrame?

In this article, we are going to learn how to take a random row from a PySpark DataFrame in the Python programming language.
