How to use DataFrame.select() in Python
Here we will use the select() function to extract substrings from the dataframe columns.
Syntax: pyspark.sql.DataFrame.select(*cols)
Example: Using DataFrame.select()
Python
from pyspark.sql.functions import substring

# Split LicenseNo and ExpiryDate into their parts
# using substring(column, startPos, length)
reg_df.select(
    substring('LicenseNo', 1, 2).alias('State'),
    substring('LicenseNo', 3, 4).alias('RegYear'),
    substring('LicenseNo', 7, 8).alias('RegID'),
    substring('ExpiryDate', 1, 4).alias('ExpYr'),
    substring('ExpiryDate', 6, 2).alias('ExpMo'),
    substring('ExpiryDate', 9, 2).alias('ExpDt'),
).show()
Output:
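As a point of comparison, the same columns can also be carved out with the Column API's substr() method instead of the substring() function. The snippet below is a minimal sketch under that assumption; it reuses the reg_df dataframe created in the demonstration below and is not part of the original example.
Python
from pyspark.sql.functions import col

# Equivalent extraction using Column.substr(startPos, length)
reg_df.select(
    col('LicenseNo').substr(1, 2).alias('State'),
    col('LicenseNo').substr(3, 4).alias('RegYear'),
    col('LicenseNo').substr(7, 8).alias('RegID'),
).show()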
How to check for a substring in a PySpark dataframe?
In this article, we are going to see how to check for a substring in PySpark dataframe.
A substring is a contiguous sequence of characters within a larger string. For example, “learning pyspark” is a substring of “I am learning pyspark from w3wiki”. Let us look at different ways in which we can find a substring in one or more columns of a PySpark dataframe.
Creating Dataframe for demonstration:
Python
# importing module
import pyspark

# importing SparkSession from pyspark.sql module
from pyspark.sql import SparkSession

# creating SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Column names for the dataframe
columns = ["LicenseNo", "ExpiryDate"]

# Row data for the dataframe
data = [
    ("MH201411094334", "2024-11-19"),
    ("AR202027563890", "2030-03-16"),
    ("UP202010345567", "2035-12-30"),
    ("KN201822347800", "2028-10-29"),
]

# Create the dataframe using the above values
reg_df = spark.createDataFrame(data=data, schema=columns)

# View the dataframe
reg_df.show()
Output:
In the above dataframe, LicenseNo is composed of three pieces of information: a 2-letter state code, followed by the 4-digit year of registration, followed by an 8-digit registration number.
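With that structure in mind, a quick way to check whether a column contains a given substring is to combine Column.contains() with filter(). The snippet below is a minimal sketch using the reg_df dataframe created above; the 'MH' state code is just an illustrative value, not part of the original article.
Python
from pyspark.sql.functions import col

# Keep only the rows whose LicenseNo contains the substring 'MH'
reg_df.filter(col('LicenseNo').contains('MH')).show()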