Random sampling with replacement
Random sampling with replacement is a type of random sampling in which each randomly chosen element is returned to the population before the next element is drawn. As a result, the same element can be picked more than once.
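To make the idea concrete, here is a minimal plain-Python sketch (not PySpark) that contrasts the two kinds of draws using the standard library. The list `brands` and the seed value 42 are illustrative assumptions, chosen to mirror the example further down.

```python
import random

# Illustrative population (hypothetical data for this sketch)
brands = ["Redmi", "Samsung", "Nokia", "Motorola", "Apple"]

# Fixed seed so the draws are repeatable
rng = random.Random(42)

# random.choices draws WITH replacement: every draw sees the full
# population, so the same brand may be picked more than once.
with_replacement = rng.choices(brands, k=5)

# random.sample draws WITHOUT replacement: each brand appears at most once.
without_replacement = rng.sample(brands, k=5)

print(with_replacement)
print(without_replacement)
```

The first list may contain repeated brands, while the second is always a reordering of the five distinct brands.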
Syntax:
sample(True, fraction, seed)
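Here,
- True: It is the withReplacement argument. True means rows are sampled with replacement, so the same row can appear more than once in the result.
- fraction: It represents the fraction of rows to be generated, in the range 0.0 to 1.0 (inclusive). The result is not guaranteed to contain exactly this fraction of the rows.
- seed: It represents the seed used for sampling (by default, a random seed). Passing the same seed regenerates the same random sample.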
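The roles of fraction and seed can be sketched in plain Python. The sketch below is a simplified model, not Spark's implementation: it assumes each row is emitted a Poisson-distributed number of times with mean `fraction`. The helper names `poisson_draw` and `sample_with_replacement` are hypothetical, and the output will not match what PySpark returns for the same seed.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_with_replacement(rows, fraction, seed=None):
    """Simplified model (assumption): emit each row Poisson(fraction) times."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        # A row can be emitted 0, 1, or several times
        result.extend([row] * poisson_draw(rng, fraction))
    return result


rows = list(range(10))
a = sample_with_replacement(rows, 0.5, seed=42)
b = sample_with_replacement(rows, 0.5, seed=42)

print(a == b)   # True: the same seed regenerates the same sample
print(len(a))   # near 5 on average, but not guaranteed to be exactly 5
```

Under this model the sample size varies from seed to seed, which is consistent with fraction being an expected proportion and not an exact one.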
Example:
Python3
# Python program to demonstrate random
# sampling in pyspark with replacement

# Import libraries
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create a session
spark = SparkSession.builder.getOrCreate()

# Create dataframe by passing list
df = spark.createDataFrame([
    Row(Brand="Redmi", Units=1000000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Samsung", Units=900000, Performance="Outstanding", Ecofriendly="Yes"),
    Row(Brand="Nokia", Units=500000, Performance="Excellent", Ecofriendly="Yes"),
    Row(Brand="Motorola", Units=400000, Performance="Average", Ecofriendly="Yes"),
    Row(Brand="Apple", Units=2000000, Performance="Outstanding", Ecofriendly="Yes")
])

# Apply sample() function with replacement
# (withReplacement=True, fraction=0.5, seed=42)
df_mobile_brands = df.sample(True, 0.5, 42)

# Print to the console
df_mobile_brands.show()
Output:
Simple random sampling and stratified sampling in PySpark
In this article, we will discuss simple random sampling and stratified sampling in PySpark.