Integration testing for Spark

I like unit tests, but, realistically, some functionality ends up shipping without being tested, especially if you write the tests after the actual implementation. (Sorry, TDD folks, but such people do exist.)
Project configuration
First of all, we need to configure the project settings in build.sbt, assuming you have a multi-project structure.
lazy val myProject = project
  .in(file("./my-project"))
  .configs(IntegrationTest) // here
  .settings(
    Defaults.itSettings,    // and here
    name := "myProject",
    settings,
    libraryDependencies ++=
      commonDependencies ++ sparkDependencies
  )
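The commonDependencies and sparkDependencies lists are assumed to be defined elsewhere in the build. As a rough sketch (the artifact versions below are mine, not part of the original setup), the important detail is that the test framework is also scoped to the it configuration so it lands on the integration test classpath:

lazy val sparkDependencies = Seq(
  "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided"
)

lazy val commonDependencies = Seq(
  // "it,test" makes ScalaTest visible to both src/test and src/it
  "org.scalatest" %% "scalatest" % "3.2.15" % "it,test"
)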
Since we are using the default settings, the source folder for the integration tests will be src/it. Let's create that folder; at the end you should have a folder structure like this:
myProject
├── src
│   ├── it   <-- here it is
│   │   └── scala
│   │       └── MyIntegrationTest.scala
│   ├── main
│   └── test
└── target
Write the test
Let's write one simple integration test.
import org.scalatest.flatspec.AnyFlatSpec
import com.yortuc.MyProject

class MyIntegrationTest extends AnyFlatSpec with SparkSessionTestWrapper {

  "myProject" should "read the data from Parquet and save the output correctly" in {
    // run the app as it is,
    // in this example with two cli parameters for the input and output paths
    val inputFilePath = "./input-data.parquet"
    val outputPath = "./output.parquet"
    val expectedOutputPath = "./expected-output.parquet"

    MyProject.main(Array(inputFilePath, outputPath))

    // read the output of the application
    val output = spark.read.parquet(outputPath)

    // read the expected results
    val expected = spark.read.parquet(expectedOutputPath)

    // compare them.
    // You can use a library such as spark-fast-tests to compare two dataframes,
    // or use this naive approach. Note that except() is a set difference: it
    // ignores row order and duplicates, and only checks that the output has no
    // rows that are missing from the expected data.
    val comparison = output.except(expected)

    // there should be no differing rows
    assert(comparison.count() == 0)
  }
}
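If you prefer a library over the naive except() check, spark-fast-tests ships a comparer trait for exactly this. The sketch below is a variant of the test above under some assumptions: the package is com.github.mrpowers.spark.fast.tests and the helper is assertSmallDatasetEquality, but verify the method names and parameters against the version you actually pull in.

import org.scalatest.flatspec.AnyFlatSpec
import com.github.mrpowers.spark.fast.tests.DatasetComparer
import com.yortuc.MyProject

// Hypothetical variant of the test above using spark-fast-tests.
class MyFastTestsIntegrationTest
    extends AnyFlatSpec
    with SparkSessionTestWrapper
    with DatasetComparer {

  "myProject" should "produce the expected Parquet output" in {
    MyProject.main(Array("./input-data.parquet", "./output.parquet"))

    val output = spark.read.parquet("./output.parquet")
    val expected = spark.read.parquet("./expected-output.parquet")

    // Compares contents and schemas; orderedComparison = false ignores row order.
    assertSmallDatasetEquality(output, expected, orderedComparison = false)
  }
}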
It's very common and handy to create a trait that provides a central method for creating the Spark session inside a test. I will call it something like SparkSessionTestWrapper. We can also improve this method to read the Spark settings from a config file, and then you can choose how you are going to run your tests with Spark.
import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {
  lazy val spark: SparkSession =
    SparkSession
      .builder()
      .master("local")
      .appName("spark test example")
      .getOrCreate()
}
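As a sketch of the config-file idea, assuming Typesafe Config is on the classpath and an application.conf under src/it/resources defines spark.master and spark.app-name (both key names are made up here), the wrapper could look like this:

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

// Hypothetical variant that reads the Spark settings from application.conf
// instead of hard-coding them.
trait ConfiguredSparkSessionTestWrapper {
  private lazy val conf = ConfigFactory.load()

  lazy val spark: SparkSession =
    SparkSession
      .builder()
      .master(conf.getString("spark.master"))    // e.g. "local[2]"
      .appName(conf.getString("spark.app-name")) // e.g. "spark test example"
      .getOrCreate()
}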
And finally we can run the integration test via sbt.
$ sbt it:test
I hope all your tests pass. Happy hacking.