PySpark drop() Syntax

The drop() method in PySpark takes three optional arguments that control how rows containing NULL values are removed: the check can cover any column, all columns, a single column, or a chosen set of columns. Because drop() is a transformation, it returns a new DataFrame with the offending rows/records removed rather than modifying the current DataFrame.

drop(how='any', thresh=None, subset=None)

All of these parameters are optional; a usage sketch follows the list below.

  • how – Accepts ‘any’ or ‘all’. With ‘any’ (the default), a row is dropped if it contains a NULL in any column; with ‘all’, a row is dropped only if every column is NULL.
  • thresh – An int; rows holding fewer than thresh non-null values are dropped. The default is ‘None’. When thresh is set, it takes precedence over how.
  • subset – An optional list of column names to check for NULL values; columns outside the subset are ignored. The default is ‘None’, meaning all columns are checked.
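
As a minimal sketch (the column names and sample values are illustrative, not taken from the article), the following builds a small DataFrame containing NULLs and exercises each parameter:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DropNullsExample").getOrCreate()

# Illustrative data: Python None becomes NULL in the DataFrame
data = [("James", "Sales", 3000),
        ("Anna", None, 4000),
        (None, None, None)]
df = spark.createDataFrame(data, ["name", "dept", "salary"])

df.na.drop(how="any").show()        # drops rows 2 and 3 (NULL in any column)
df.na.drop(how="all").show()        # drops only row 3 (all columns are NULL)
df.na.drop(thresh=2).show()         # keeps rows with at least 2 non-null values
df.na.drop(subset=["dept"]).show()  # checks only the dept column for NULLs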

How to drop all columns with null values in a PySpark DataFrame?

The pyspark.sql.DataFrameNaFunctions class in PySpark provides several methods for dealing with NULL/None values, one of which is drop(), used to remove rows containing NULL values in DataFrame columns; df.dropna(), as shown in this article, is an alias for it. With drop() you can check any, all, a single, or multiple chosen columns, which makes the method very useful when you need to sanitize data before processing it. When a file is read into the PySpark DataFrame API, any empty value in a column becomes NULL in the resulting DataFrame. In RDBMS SQL you would have to test every column for NULL explicitly to drop such rows, whereas a single PySpark drop() call examines all columns for NULL values and drops the matching rows.
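
For instance, using the df from the sketch above, both spellings return the same new DataFrame:

# df.dropna() is an alias for df.na.drop(); neither modifies df in place
clean_df = df.dropna()  # same result as df.na.drop(how='any')

# The equivalent RDBMS SQL would have to name every column explicitly:
# SELECT * FROM emp WHERE name IS NOT NULL AND dept IS NOT NULL AND salary IS NOT NULL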


Implementation

Before we begin, let’s read a CSV file into a DataFrame. PySpark assigns NULL to empty String and Integer columns wherever a row has no value for them.
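
As a sketch, assuming a hypothetical employees.csv and the SparkSession created earlier:

# 'employees.csv' is a hypothetical file name; adjust the path to your data
df = spark.read.csv("employees.csv", header=True, inferSchema=True)

# Empty fields in the CSV surface as NULL in the DataFrame
df.printSchema()
df.show(truncate=False)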

Drop Columns with NULL Values

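One common approach (a sketch under the assumption that df is the DataFrame read above, not the article’s exact code) is to count the NULLs in every column and then drop each column whose count is non-zero. Note that DataFrame.drop() here removes columns, unlike df.na.drop(), which removes rows:

from pyspark.sql.functions import col, count, when

# One-row summary: the number of NULLs in each column
null_counts = df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
).collect()[0].asDict()

# DataFrame.drop() removes columns (df.na.drop() removes rows)
cols_with_nulls = [c for c, n in null_counts.items() if n > 0]
df_clean = df.drop(*cols_with_nulls)
df_clean.show()

Collecting the summary to the driver is cheap here because the aggregation produces a single row regardless of the DataFrame’s size.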