How to Use flatMap() in PySpark
This method selects the column, accesses the underlying RDD of the result through rdd, and flattens its rows into a Python list.
Syntax: dataframe.select('Column_Name').rdd.flatMap(lambda x: x).collect()
where,
- dataframe is the PySpark DataFrame
- Column_Name is the column to be converted into a list
- flatMap() is the RDD method that takes a function (here a lambda expression) and flattens whatever that function returns for each row into individual values
- collect() returns all the resulting values to the driver as a Python list
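To see why `flatMap(lambda x: x)` produces a flat list, here is a plain-Python analogy that needs no Spark session. It assumes each row of a one-column selection behaves like a one-element tuple; the helper `flat_map` is hypothetical and only mimics the flattening step on a local list, it is not part of PySpark.

```python
from itertools import chain


def flat_map(func, rows):
    """Local stand-in for RDD.flatMap: apply func to each row, then flatten one level."""
    return list(chain.from_iterable(func(row) for row in rows))


# Hypothetical rows, shaped like the result of selecting a single column.
rows = [("sravan",), ("ojaswi",), ("rohith",)]

mapped = [row for row in rows]           # map-style: still one tuple per row
flattened = flat_map(lambda x: x, rows)  # flatMap-style: row contents spliced together

print(mapped)     # [('sravan',), ('ojaswi',), ('rohith',)]
print(flattened)  # ['sravan', 'ojaswi', 'rohith']
```

The identity lambda works because each row is itself iterable, so flattening one level unwraps the row and leaves only its values.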
Example 1: Python code to convert a particular column to a list using flatMap
Python3
# convert the student NAME column to a list using flatMap
print(dataframe.select('student NAME').rdd.flatMap(lambda x: x).collect())

# convert the student ID column to a list using flatMap
print(dataframe.select('student ID').rdd.flatMap(lambda x: x).collect())
Output:
['sravan', 'ojaswi', 'rohith', 'sridevi', 'sravan', 'gnanesh']
['1', '2', '3', '4', '1', '5']
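The IDs in the output above are strings and contain a repeated value, because the demonstration data stores them that way. If integers or unique values are needed, the collected list can be post-processed like any other Python list. A minimal sketch, with the list copied from the output above and illustrative variable names:

```python
# List copied from the student ID output above.
ids = ['1', '2', '3', '4', '1', '5']

# Convert each string ID to an integer.
ids_as_int = [int(value) for value in ids]

# Drop duplicates while keeping first-seen order (dict keys preserve insertion order).
unique_ids = list(dict.fromkeys(ids_as_int))

print(ids_as_int)  # [1, 2, 3, 4, 1, 5]
print(unique_ids)  # [1, 2, 3, 4, 5]
```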
Example 2: Convert multiple columns to a list. The values come back interleaved row by row, not grouped by column.
Python3
# convert multiple columns to one flat list using flatMap
# ('student NAME' is selected twice, so each name appears twice per row)
print(dataframe.select(['student NAME', 'student NAME', 'college'])
      .rdd.flatMap(lambda x: x).collect())
Output:
['sravan', 'sravan', 'vignan', 'ojaswi', 'ojaswi', 'vvit', 'rohith', 'rohith', 'vvit', 'sridevi', 'sridevi', 'vignan', 'sravan', 'sravan', 'vignan', 'gnanesh', 'gnanesh', 'iit']
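The multi-column result mixes all columns into one list, three values per row, so recovering one list per column takes an extra step. A minimal plain-Python sketch, assuming the flat list always holds exactly as many values per row as columns were selected; the variable names are illustrative only, and the data is the first three rows of the output above.

```python
# Flat, row-interleaved list copied from the first three rows of the output above.
flat = ['sravan', 'sravan', 'vignan',
        'ojaswi', 'ojaswi', 'vvit',
        'rohith', 'rohith', 'vvit']

n_cols = 3  # number of columns passed to select()

# Take every n_cols-th element, starting at each column's offset.
per_column = [flat[i::n_cols] for i in range(n_cols)]

names, names_again, colleges = per_column
print(names)     # ['sravan', 'ojaswi', 'rohith']
print(colleges)  # ['vignan', 'vvit', 'vvit']
```

If per-column lists are the goal from the start, collecting each column separately as in Example 1 avoids this regrouping step.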
Converting a PySpark DataFrame Column to a Python List
In this article, we will discuss how to convert a PySpark DataFrame column to a Python list.
Creating a DataFrame for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan", 67, 89],
        ["2", "ojaswi", "vvit", 78, 89],
        ["3", "rohith", "vvit", 100, 80],
        ["4", "sridevi", "vignan", 78, 80],
        ["1", "sravan", "vignan", 89, 98],
        ["5", "gnanesh", "iit", 94, 98]]

# specify column names
columns = ['student ID', 'student NAME', 'college', 'subject1', 'subject2']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display dataframe
dataframe.show()
Output: