Integration testing for Spark
I like unit tests, but realistically, some functionality gets away without being tested, especially if you are writing the tests after the actual implementation. (Sorry, TDD people, but such developers do exist.)
Project configuration
First of all, we need to configure the project settings in build.sbt, assuming you have a multi-project structure.
```scala
lazy val myProject = project
  .in(file("./my-project"))
  .configs(IntegrationTest) // here
  .settings(
    Defaults.itSettings,    // and here
    name := "myProject",
    settings,
    libraryDependencies ++= commonDependencies ++ sparkDependencies
  )
```
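The `settings`, `commonDependencies`, and `sparkDependencies` values above are assumed to be defined elsewhere in the build. As a rough sketch of what they might look like (versions and artifact choices here are illustrative, not prescriptive; note ScalaTest is scoped to `"it,test"` so it is visible to both test configurations):

```scala
// build.sbt (continued) — illustrative definitions, adjust to your project
lazy val settings = Seq(
  scalaVersion := "2.12.18",
  organization := "com.yortuc"
)

lazy val commonDependencies = Seq(
  // available in both src/test and src/it
  "org.scalatest" %% "scalatest" % "3.2.17" % "it,test"
)

lazy val sparkDependencies = Seq(
  "org.apache.spark" %% "spark-core" % "3.3.2",
  "org.apache.spark" %% "spark-sql"  % "3.3.2"
)
```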
Since we are using the default settings, the folder for the integration tests will be src/it. Let's create that folder; at the end you should have this kind of folder structure:
```
myProject
├── src
│   ├── it          <-- here it is
│   │   └── scala
│   │       └── MyIntegrationTest.scala
│   ├── main
│   └── test
└── target
```
Write the test
Let's write one simple integration test.
```scala
import org.scalatest.flatspec.AnyFlatSpec
import com.yortuc.MyProject

class MyIntegrationTest extends AnyFlatSpec with SparkSessionTestWrapper {

  "myProject" should "read the data from Parquet and save the output correctly" in {
    // run the app as it is,
    // in this example with two CLI parameters for input and output paths
    val inputFilePath = "./input-data.parquet"
    val outputPath = "./output.parquet"
    val expectedOutputPath = "./expected-output.parquet"

    MyProject.main(Array(inputFilePath, outputPath))

    // read the output of the application
    val output = spark.read.parquet(outputPath)

    // read the expected results
    val expected = spark.read.parquet(expectedOutputPath)

    // compare them.
    // you can use a library such as spark-fast-tests to compare two dataframes,
    // or use this naive approach. Note that except() is a set operation: it
    // ignores row order, but it also ignores duplicate rows.
    val comparison = output.except(expected)

    // there should be no differing rows
    assert(comparison.count() == 0)
  }
}
```
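If you would rather not rely on the naive `except` comparison, the spark-fast-tests library mentioned in the comment above provides dedicated DataFrame equality assertions with readable diffs. A sketch, assuming you have added the spark-fast-tests dependency for your Spark/Scala version (exact coordinates vary):

```scala
import org.scalatest.flatspec.AnyFlatSpec
import com.github.mrpowers.spark.fast.tests.DatasetComparer
import com.yortuc.MyProject

class MyFastIntegrationTest extends AnyFlatSpec
    with SparkSessionTestWrapper
    with DatasetComparer {

  "myProject" should "produce the expected output" in {
    MyProject.main(Array("./input-data.parquet", "./output.parquet"))

    val output   = spark.read.parquet("./output.parquet")
    val expected = spark.read.parquet("./expected-output.parquet")

    // compares contents ignoring row order and fails with a row-level diff
    assertSmallDatasetEquality(output, expected, orderedComparison = false)
  }
}
```

Unlike `except`, this approach also catches duplicate-row mismatches and prints which rows differ when the assertion fails.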
It's very common and handy to create a trait that provides a central way to build the Spark session inside a test. I will call it something like SparkSessionTestWrapper. We can also improve this trait to read the Spark settings from a config file, which lets you choose how your tests run on Spark.
```scala
import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {
  lazy val spark: SparkSession =
    SparkSession
      .builder()
      .master("local")
      .appName("spark test example")
      .getOrCreate()
}
```
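As mentioned, the trait can be extended to read the Spark settings from a config file instead of hard-coding them. A minimal sketch using Typesafe Config, assuming a `com.typesafe:config` dependency and an `application.conf` on the test classpath with `spark.master` and `spark.appName` keys (these key names are my own choice, not a standard):

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

trait ConfiguredSparkSessionTestWrapper {
  // loads application.conf from the classpath
  private val config = ConfigFactory.load()

  lazy val spark: SparkSession =
    SparkSession
      .builder()
      .master(config.getString("spark.master"))   // e.g. "local[2]"
      .appName(config.getString("spark.appName"))
      .getOrCreate()
}
```

This way, switching the tests from a local session to another master is a config change rather than a code change.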
And finally, we can run the integration tests via sbt:
```
$ sbt it:test
```

On sbt 1.x you can also use the slash syntax: `sbt IntegrationTest/test`.
I hope all your tests pass. Happy hacking.