How to use DataFrame.select() in Python

Here we will use the select() function to extract substrings from the dataframe columns.

Syntax: pyspark.sql.DataFrame.select(*cols)

Example: Using DataFrame.select()

Python

# Extract the components of LicenseNo and ExpiryDate
# using substring() inside select()
from pyspark.sql.functions import substring

# reg_df is the registration dataframe created below
reg_df.select(
    substring('LicenseNo', 1, 2).alias('State'),    # 2-letter state code
    substring('LicenseNo', 3, 4).alias('RegYear'),  # year of registration
    substring('LicenseNo', 7, 8).alias('RegID'),    # 8-digit registration number
    substring('ExpiryDate', 1, 4).alias('ExpYr'),   # expiry year
    substring('ExpiryDate', 6, 2).alias('ExpMo'),   # expiry month
    substring('ExpiryDate', 9, 2).alias('ExpDt'),   # expiry day
).show()


Output:

How to check for a substring in a PySpark dataframe?

In this article, we are going to see how to check for a substring in a PySpark dataframe.

A substring is a contiguous sequence of characters within a larger string. For example, “learning pyspark” is a substring of “I am learning pyspark from w3wiki”. Let us look at different ways in which we can extract a substring from one or more columns of a PySpark dataframe.

Creating a dataframe for demonstration:

Python

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
  
# Column names for the dataframe
columns = ["LicenseNo", "ExpiryDate"]
  
# Row data for the dataframe
data = [
    ("MH201411094334", "2024-11-19"),
    ("AR202027563890", "2030-03-16"),
    ("UP202010345567", "2035-12-30"),
    ("KN201822347800", "2028-10-29"),
]
  
# Create the dataframe using the above values
reg_df = spark.createDataFrame(data=data,
                               schema=columns)
  
# View the dataframe
reg_df.show()


Output:

In the above dataframe, LicenseNo is composed of three pieces of information: a 2-letter state code, the year of registration, and an 8-digit registration number.
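
To make the field positions concrete, here is a small plain-Python sketch that slices one of the sample LicenseNo values; note that Python slicing is zero-based, while Spark's substring() used in this article takes 1-based positions.

Python

# A sample LicenseNo value from the dataframe above
license_no = "MH201411094334"

print(license_no[0:2])    # 'MH'       -> 2-letter state code
print(license_no[2:6])    # '2014'     -> year of registration
print(license_no[6:14])   # '11094334' -> 8-digit registration number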


Method 1: Using DataFrame.withColumn()

The DataFrame.withColumn(colName, col) method can be used to extract a substring from the column data by using PySpark’s substring() function along with it....
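
A minimal sketch of this approach, assuming the reg_df dataframe created above; the new column names used here ('State', 'RegYear') are only illustrative.

Python

from pyspark.sql.functions import substring

# withColumn() adds one derived column at a time;
# each substring() call reads from the original LicenseNo column
df_with_parts = (
    reg_df
    .withColumn('State', substring('LicenseNo', 1, 2))
    .withColumn('RegYear', substring('LicenseNo', 3, 4))
)
df_with_parts.show()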

Method 2: Using substr in place of substring

Alternatively, we can also use substr from the Column type instead of using substring()....
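
A minimal sketch under the same assumptions, using Column.substr(startPos, length) in place of the substring() function; the column names are again illustrative.

Python

from pyspark.sql.functions import col

# Column.substr(startPos, length) takes 1-based positions,
# just like pyspark.sql.functions.substring()
reg_df.withColumn('State', col('LicenseNo').substr(1, 2)) \
      .withColumn('RegYear', col('LicenseNo').substr(3, 4)) \
      .show()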

Method 3: Using DataFrame.select()

...

Method 4: Using ‘spark.sql()’

...
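
A minimal sketch of this approach, assuming reg_df from above; the temporary view name 'registrations' is only an example.

Python

# Register the dataframe as a temporary view so it can be queried with SQL
reg_df.createOrReplaceTempView('registrations')

# SUBSTRING(col, pos, len) in Spark SQL behaves like the substring() function
spark.sql("""
    SELECT SUBSTRING(LicenseNo, 1, 2)  AS State,
           SUBSTRING(LicenseNo, 3, 4)  AS RegYear,
           SUBSTRING(ExpiryDate, 1, 4) AS ExpYr
    FROM registrations
""").show()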

Method 5: Using DataFrame.selectExpr()

...
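
A minimal sketch, assuming the same reg_df; selectExpr() accepts SQL expression strings, and the aliases below are illustrative.

Python

# Each argument to selectExpr() is a SQL expression string
reg_df.selectExpr(
    "substring(LicenseNo, 1, 2) as State",
    "substring(LicenseNo, 3, 4) as RegYear",
    "substring(ExpiryDate, 1, 4) as ExpYr",
).show()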