How to use schema.fields in Python
Here we use the schema.fields attribute to get the data type of each column, then check for a particular type using the type classes available in the pyspark.sql.types module.
Let’s check them one by one:
- Integer – IntegerType
- Float – FloatType
- Double – DoubleType
- String – StringType
We use the isinstance() function to check a column's data type against these classes.
Syntax: dataframe[[f.name for f in dataframe.schema.fields if isinstance(f.dataType, datatype)]]
where,
- dataframe is the input dataframe
- f.name is the column name of each field
- datatype is one of the type classes listed above
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType, DoubleType, IntegerType,
                               StructType, StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# select columns that are integer type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, IntegerType)]].collect())

# select columns that are string type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, StringType)]].collect())

# select columns that are float type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, FloatType)]].collect())

# select columns that are double type
print(dataframe[[f.name for f in dataframe.schema.fields
                 if isinstance(f.dataType, DoubleType)]].collect())
Output:
[Row(ID=1), Row(ID=2), Row(ID=3)]
[Row(NAME='sravan'), Row(NAME='ojsawi'), Row(NAME='bobby')]
[Row(GPA=9.800000190734863), Row(GPA=9.199999809265137), Row(GPA=8.899999618530273)]
[Row(FEE=4500.0), Row(FEE=6789.0), Row(FEE=988.0)]
PySpark – Select columns by type
In this article, we will discuss how to select columns by type in PySpark using Python.
Let’s create a dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# import data field types
from pyspark.sql.types import (StringType, DoubleType, IntegerType,
                               StructType, StructField, FloatType)

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of student data
data = [(1, "sravan", 9.8, 4500.00),
        (2, "ojsawi", 9.2, 6789.00),
        (3, "bobby", 8.9, 988.000)]

# specify column names with data types
columns = StructType([
    StructField("ID", IntegerType(), True),
    StructField("NAME", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("FEE", DoubleType(), True),
])

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# display
dataframe.show()
Output:
+---+------+---+------+
| ID|  NAME|GPA|   FEE|
+---+------+---+------+
|  1|sravan|9.8|4500.0|
|  2|ojsawi|9.2|6789.0|
|  3| bobby|8.9| 988.0|
+---+------+---+------+
We can select columns by data type using the following type-string keywords (these are the strings returned by dataframe.dtypes):
- Integer: int
- String: string
- Float: float
- Double: double