PySpark sample() method
PySpark provides various methods for Sampling which are used to return a sample from the given PySpark DataFrame.
Here are the details of the sample() method:
Syntax : DataFrame.sample(withReplacement, fraction, seed)
It returns a subset of the DataFrame.
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
fraction : float, optional
Fraction of rows to generate, in the range [0.0, 1.0].
seed : int, optional
Used to reproduce the same random sampling.
Example:
In this example, we pass a fraction of float type from the range [0.0, 1.0]. Using the formula :
Number of rows needed = Fraction * Total number of rows
the fraction needed to sample a single row is 1/(total number of rows). Note that sample() performs Bernoulli sampling, so the returned row count is only approximately Fraction * Total number of rows; the sample can occasionally be empty or contain more than one row.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession

# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()

# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)

# Printing the DataFrame
df.show()

# Taking a sample of df and storing it in df2.
# Please note that the second argument here is the fraction
# of the data set we need (fraction is a float):
# number of rows = fraction * total number of rows.
# df.count() returns the row count without collecting the data.
df2 = df.sample(False, 1.0 / df.count())

# printing the sample row, which is a DataFrame
df2.show()
Output :
+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

+-------+--------+
|Letters|Position|
+-------+--------+
|      b|       2|
+-------+--------+
How to take a random row from a PySpark DataFrame?
In this article, we are going to learn how to take a random row from a PySpark DataFrame in the Python programming language.