Integration testing for Spark

I like unit tests, but, being realistic, some functionality can get away without being tested, especially if you are writing the tests after the actual implementation. (Sorry, TDD people, but people like that do exist.)

Project configuration

First of all, we need to configure the project settings in build.sbt, assuming you have a multi-project structure.

    lazy val myProject = project
        .in(file("./my-project"))
        .configs(IntegrationTest)  // here
        .settings(
            Defaults.itSettings,   // and here
            name := "myProject",
            settings,
            libraryDependencies ++= 
                commonDependencies ++ sparkDependencies
        )
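The settings, commonDependencies and sparkDependencies values are assumed to be defined elsewhere in the build. As a minimal sketch (the versions here are placeholders, adjust them to your project), they could look like this:

    lazy val settings = Seq(
      scalaVersion := "2.12.18"  // placeholder, use your project's Scala version
    )

    lazy val commonDependencies = Seq(
      // ScalaTest has to be available on both the test and integration-test classpaths
      "org.scalatest" %% "scalatest" % "3.2.19" % "it,test"
    )

    lazy val sparkDependencies = Seq(
      "org.apache.spark" %% "spark-sql" % "3.5.1"
    )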

Since we are using the default settings, the folder for the integration tests will be it (under src, next to main and test). Let's create that folder; in the end you should have a folder structure like this:

    myProject
    ├── src
    │   ├── it    <-- here it is
    │   │   └── scala
    │   │       └── MyIntegrationTest.scala
    │   ├── main
    │   └── test
    └── target

Write the test

Let's write a simple integration test.

    import org.scalatest.flatspec.AnyFlatSpec
    import com.yortuc.MyProject

    class MyIntegrationTest extends AnyFlatSpec with SparkSessionTestWrapper {

        "myProject" should "read the data from Parquet and save the output correctly" in {

            // run the app as it is,
            // in this example with two cli parameters for input and output path
            val inputFilePath = "./input-data.parquet"
            val outputPath = "./output.parquet"
            val expectedOutputPath = "./expected-output.parquet"

            MyProject.main(Array(inputFilePath, outputPath))

            // read the output of the application
            val output = spark.read.parquet(outputPath)

            // read the expected results
            val expected = spark.read.parquet(expectedOutputPath)

            // compare them.
            // you can use a library such as spark-fast-tests to compare two dataframes
            // (see the sketch below), or use this naive approach.
            // note that except() is a set difference: it ignores row order and duplicates,
            // and only checks that every row in output also appears in expected.
            val comparison = output.except(expected)

            // there should be no differing rows
            assert(comparison.count() == 0)
        }
    }
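If you prefer not to hand-roll the comparison, spark-fast-tests provides ready-made assertions. Here is a minimal sketch of the same test using its DataFrameComparer trait; the group id and version in the comment are assumptions, so check the library's documentation for the current coordinates.

    import org.scalatest.flatspec.AnyFlatSpec
    import com.github.mrpowers.spark.fast.tests.DataFrameComparer
    import com.yortuc.MyProject

    // add the dependency to build.sbt, for example:
    //   "com.github.mrpowers" %% "spark-fast-tests" % "1.3.0" % "it,test"
    class MyFastIntegrationTest extends AnyFlatSpec
        with SparkSessionTestWrapper
        with DataFrameComparer {

        "myProject" should "produce the expected Parquet output" in {
            MyProject.main(Array("./input-data.parquet", "./output.parquet"))

            val output = spark.read.parquet("./output.parquet")
            val expected = spark.read.parquet("./expected-output.parquet")

            // fails with a readable diff if the dataframes differ;
            // unlike the one-directional except() check, this compares both ways
            assertSmallDataFrameEquality(output, expected, orderedComparison = false)
        }
    }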

It's very common and handy to create a trait that provides a central place to create the Spark session inside a test. I will call it something like SparkSessionTestWrapper. We can also improve this trait to read the Spark settings from a config file, so you can choose how your tests run with Spark; a sketch of that follows the trait below.

    import org.apache.spark.sql.SparkSession

    trait SparkSessionTestWrapper {
      lazy val spark: SparkSession =
        SparkSession
          .builder()
          .master("local")
          .appName("spark test example")
          .getOrCreate()
    }
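As a small illustration of the config-file idea, here is a sketch of a variant that reads the master URL and app name via Typesafe Config. The spark-test.master and spark-test.app-name keys, and the application.conf file they would live in, are assumptions for this example:

    import com.typesafe.config.ConfigFactory
    import org.apache.spark.sql.SparkSession

    trait ConfiguredSparkSessionTestWrapper {
      // loads application.conf from the classpath (e.g. src/it/resources);
      // requires the "com.typesafe" % "config" dependency
      private lazy val config = ConfigFactory.load()

      lazy val spark: SparkSession =
        SparkSession
          .builder()
          .master(config.getString("spark-test.master"))     // e.g. "local[2]"
          .appName(config.getString("spark-test.app-name"))
          .getOrCreate()
    }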

And finally we can run the integration test via sbt.

    $ sbt it:test
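On sbt 1.x you can also use the slash syntax for the same thing:

    $ sbt IntegrationTest/test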

I hope all your tests pass. Happy hacking.
