Convert Using createDataFrame Method
To make conversion simple, Spark's SparkSession provides the createDataFrame method. If your existing data is an RDD (Resilient Distributed Dataset) of tuples or case classes, you can call createDataFrame without specifying a schema (the description of your data's structure), and Spark will infer the column types by reflection. When you need full control over column names, types, and nullability, you can instead pass an explicit schema, as the example below does.
Either way, you can work with your data in DataFrame form without much hassle.
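The schema-inference path can be sketched as follows; the object name and sample data here are illustrative, not from the original article:

```scala
// A minimal sketch of schema inference: when the RDD holds tuples,
// createDataFrame can be called without a schema, and Spark infers the
// column types (string, integer) by reflection.
import org.apache.spark.sql.{DataFrame, SparkSession}

object InferredSchemaExample {
  // Builds a DataFrame from an RDD of (String, Int) tuples without an
  // explicit schema; toDF renames the default _1 / _2 columns.
  def convert(spark: SparkSession): DataFrame = {
    val rdd = spark.sparkContext.parallelize(Seq(("John", 30), ("Alice", 25)))
    spark.createDataFrame(rdd).toDF("Name", "Age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Inferred schema")
      .master("local[*]")
      .getOrCreate()

    val df = convert(spark)
    df.printSchema() // Name inferred as string, Age as integer
    df.show()

    spark.stop()
  }
}
```

Inference is convenient for quick exploration, but the inferred nullability and types may not match what a downstream consumer expects, which motivates the explicit-schema approach shown next.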
Example:
import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types._

object RDDToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RDD to DataFrame")
      .master("local[*]")
      .getOrCreate()

    val data = Seq(
      ("John", 30),
      ("Alice", 25),
      ("Bob", 35)
    )
    val rdd = spark.sparkContext.parallelize(data)

    val schema = StructType(
      Seq(
        StructField("Name", StringType, nullable = true),
        StructField("Age", IntegerType, nullable = true)
      )
    )

    // Each tuple must be converted to a Row before the schema can be applied
    val df = spark.createDataFrame(rdd.map(t => Row(t._1, t._2)), schema)
    df.show()

    spark.stop()
  }
}
Let’s now examine the schema of the DataFrame we just created:
df.printSchema()
Output:
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
- Without an explicit schema, Spark assigns column names in a default order and tries to infer the type of data in each column, but the inference may not always match what you intend.
- To make sure the data is organized correctly and to keep control over its structure, it is better to define the layout beforehand. In programming terms, we create a schema that describes the table's columns, types, and nullability.
- When passing an explicit schema to createDataFrame, Apache Spark requires the data as an RDD of Row objects, so each record must first be converted to a Row. This ensures every record is checked against the declared schema.
Below is the code, which assumes an RDD of three-field tuples (name, department, salary):
val rowRDD: RDD[Row] = rdd.map(t => Row(t._1, t._2, t._3))
Next, let’s create the schema object that we need:
val schema = new StructType()
.add(StructField("EmployeeName", StringType, false))
.add(StructField("Department", StringType, true))
.add(StructField("Salary", DoubleType, true))
Let’s invoke the method once more, this time passing in an extra schema parameter:
val dfWithSchema: DataFrame = spark.createDataFrame(rowRDD, schema)
We will print the schema information once again:
dfWithSchema.printSchema()
Output:
root
 |-- EmployeeName: string (nullable = false)
 |-- Department: string (nullable = true)
 |-- Salary: double (nullable = true)
It is evident that the data types are defined correctly and that the columns have appropriate names.
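Putting the pieces above together, here is one self-contained version of the schema-based conversion; the employee records and the object name are sample values for illustration, not from the original article:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

object ExplicitSchemaExample {
  // Converts an RDD of (name, department, salary) tuples to a DataFrame
  // using an explicit schema.
  def convert(spark: SparkSession): DataFrame = {
    val rdd = spark.sparkContext.parallelize(Seq(
      ("John", "Sales", 50000.0),
      ("Alice", "Engineering", 70000.0)
    ))

    // Each tuple must become a Row before the schema can be applied
    val rowRDD: RDD[Row] = rdd.map(t => Row(t._1, t._2, t._3))

    val schema = new StructType()
      .add(StructField("EmployeeName", StringType, false))
      .add(StructField("Department", StringType, true))
      .add(StructField("Salary", DoubleType, true))

    spark.createDataFrame(rowRDD, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Explicit schema")
      .master("local[*]")
      .getOrCreate()

    val df = convert(spark)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

Note that marking EmployeeName as non-nullable means Spark will reject records whose first field is null, which is the kind of guarantee schema inference alone cannot give you.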
How to Convert RDD to Dataframe in Spark Scala?
This article discusses ways to convert an RDD to a DataFrame in Spark Scala.
Table of Content
- RDD and DataFrame in Spark
- Convert Using createDataFrame Method
- Conversion Using toDF() Implicit Method
- Conclusion
- FAQs