PySpark sample() method

PySpark provides several sampling methods that return a random subset of rows from a given PySpark DataFrame.

Here are the details of the sample() method:

Syntax : DataFrame.sample(withReplacement, fraction, seed)

It returns a subset of the DataFrame.

Parameters

withReplacement : bool, optional

Sample with replacement or not (default False).

fraction : float, optional

Fraction of rows to generate, in the range [0.0, 1.0].

seed : int, optional

Seed for sampling; pass the same seed to reproduce the same sample.

Example:

In this example, we pass a fraction of float data type from the range [0.0, 1.0], using the formula:

Expected number of rows = fraction * total number of rows

So to sample roughly one row, the fraction we need is 1 / (total number of rows). Note that sample() treats the fraction as a per-row probability, so the number of rows actually returned is not guaranteed to be exact.
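The formula above can be checked with plain arithmetic: for a four-row DataFrame and a target of one row, the fraction works out to 0.25. This is only the expected count, since sample() applies the fraction as a per-row probability.

```python
# Expected sample size for a given fraction (per-row probability).
# sample() does not guarantee an exact count; this is the mean of
# the distribution of returned row counts.
total_rows = 4    # rows in the example DataFrame below
target_rows = 1   # rows we would like in the sample

fraction = target_rows / total_rows
expected_rows = fraction * total_rows

print(fraction)       # 0.25
print(expected_rows)  # 1.0
```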

Python




# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
  
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()
  
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
  
# Creating a DataFrame
df = random_row_session.createDataFrame(data,
                                        columns)
  
# Printing the DataFrame
df.show()
  
# Taking a sample of df and storing it in df2.
# Note that the second argument is the fraction
# (a float) of the dataset we need:
# expected number of rows = fraction * total number of rows
df2 = df.sample(False, 1.0 / df.count())
  
# printing the sample row which is a DataFrame
df2.show()


Output

+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

+-------+--------+
|Letters|Position|
+-------+--------+
|      b|       2|
+-------+--------+

How to take a random row from a PySpark DataFrame?

In this article, we are going to learn how to take a random row from a PySpark DataFrame in the Python programming language.
