How to use DataFrame.select() in Python
Here we will use the select() function to extract substrings from the dataframe columns.
Syntax: pyspark.sql.DataFrame.select(*cols)
Example: Using DataFrame.select()
Python
from pyspark.sql.functions import substring

# Split LicenseNo and ExpiryDate into their parts
# using substring(column, startPos, length)
reg_df.select(
    substring('LicenseNo', 1, 2).alias('State'),
    substring('LicenseNo', 3, 4).alias('RegYear'),
    substring('LicenseNo', 7, 8).alias('RegID'),
    substring('ExpiryDate', 1, 4).alias('ExpYr'),
    substring('ExpiryDate', 6, 2).alias('ExpMo'),
    substring('ExpiryDate', 9, 2).alias('ExpDt'),
).show()
Output:
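As a point of comparison, the same columns can also be carved out with the Column API's substr() method instead of the substring() function. The snippet below is a minimal sketch under that assumption; it reuses the reg_df dataframe created in the demonstration below and is not part of the original example.
Python
from pyspark.sql.functions import col

# Equivalent extraction using Column.substr(startPos, length)
reg_df.select(
    col('LicenseNo').substr(1, 2).alias('State'),
    col('LicenseNo').substr(3, 4).alias('RegYear'),
    col('LicenseNo').substr(7, 8).alias('RegID'),
).show()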
How to check for a substring in a PySpark dataframe?
In this article, we are going to see how to check for a substring in PySpark dataframe.
A substring is a contiguous sequence of characters within a larger string. For example, “learning pyspark” is a substring of “I am learning pyspark from w3wiki”. Let us look at different ways in which we can find a substring in one or more columns of a PySpark dataframe.
Creating Dataframe for demonstration:
Python
# importing module
import pyspark

# importing SparkSession from pyspark.sql module
from pyspark.sql import SparkSession

# creating SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Column names for the dataframe
columns = ["LicenseNo", "ExpiryDate"]

# Row data for the dataframe
data = [
    ("MH201411094334", "2024-11-19"),
    ("AR202027563890", "2030-03-16"),
    ("UP202010345567", "2035-12-30"),
    ("KN201822347800", "2028-10-29"),
]

# Create the dataframe using the above values
reg_df = spark.createDataFrame(data=data, schema=columns)

# View the dataframe
reg_df.show()
Output:
In the above dataframe, LicenseNo is composed of three pieces of information: a 2-letter state code, followed by the 4-digit year of registration, followed by an 8-digit registration number.
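With that structure in mind, a quick way to check whether a column contains a given substring is to combine Column.contains() with filter(). The snippet below is a minimal sketch using the reg_df dataframe created above; the 'MH' state code is just an illustrative value, not part of the original article.
Python
from pyspark.sql.functions import col

# Keep only the rows whose LicenseNo contains the substring 'MH'
reg_df.filter(col('LicenseNo').contains('MH')).show()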