Method 2: Using parallelize()
We are going to use parallelize() to create an RDD. Parallelizing means copying the elements of a pre-defined collection into a distributed dataset, on which we can then operate in parallel. Here is the syntax of parallelize():
Syntax: sc.parallelize(data, numSlices)
sc: the SparkContext object
Parameters:
- data: the collection from which the RDD is to be made.
- numSlices: the number of partitions to create. This is an optional parameter (see the short sketch below).
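To see the parameters in action, here is a minimal, self-contained sketch that passes an explicit numSlices value; the app name and sample list used here are illustrative assumptions, not part of the example that follows:
Python
# Illustrative sketch of parallelize() with an explicit numSlices.
from pyspark.sql import SparkSession

# The app name 'parallelize_demo' is an arbitrary placeholder
spark = SparkSession.builder.appName('parallelize_demo').getOrCreate()
sc = spark.sparkContext

# Distribute a six-element list across 2 partitions
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=2)

print(rdd.getNumPartitions())  # 2
print(rdd.collect())           # [1, 2, 3, 4, 5, 6]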
Example:
In this example, we first create an RDD from a list of Row objects using parallelize(), then use createDataFrame() to build a PySpark DataFrame from it, and finally use toPandas() to get a Pandas DataFrame.
Python
# Importing PySpark and, importantly,
# Row from pyspark.sql
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# PySpark session
row_pandas_session = SparkSession.builder.appName(
    'row_pandas_session'
).getOrCreate()

# List of sample Row objects
row_object_list = [Row(Topic='Dynamic Programming', Difficulty=10),
                   Row(Topic='Arrays', Difficulty=5),
                   Row(Topic='Sorting', Difficulty=6),
                   Row(Topic='Binary Search', Difficulty=7)]

# Creating an RDD from the list of Row objects
rdd = row_pandas_session.sparkContext.parallelize(row_object_list)

# DataFrame created using the RDD
df = row_pandas_session.createDataFrame(rdd)

# Checking the PySpark DataFrame
df.show()

# Conversion of the PySpark DataFrame to Pandas
df2 = df.toPandas()

# Final Pandas DataFrame
print(df2)
Output:
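Note: toPandas() collects the entire distributed DataFrame onto the driver, so it should only be used when the data is small enough to fit in the driver's memory.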