Tidbit: a small piece of interesting information, or a small dish of pleasant-tasting food. (Cambridge dictionary)

Array in UDF

How to apply a UDF to a column that contains an array type: take a Seq[String] parameter.

The dataframe schema looks like this:

     |-- id: long (nullable = false)
     |-- items: array (nullable = true)
     |    |-- element: string (containsNull = true)


    val countItems = udf((history: Seq[String]) => history.size)
    df.withColumn("items_count", countItems($"items"))
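
Since the items column is nullable, this UDF throws on null rows. A null-safe variant (a sketch of the same idea) wraps the argument in Option:

    // Returns 0 instead of failing when the array column is null.
    val countItemsSafe = udf((history: Seq[String]) =>
      Option(history).map(_.size).getOrElse(0))

    df.withColumn("items_count", countItemsSafe($"items"))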

AlmostEqual for Scala testing


    assert(point.getX === 1.0)

    // result:
    // [info] - test *** FAILED ***
    // [info]   0.9999999999999998 did not equal 1.0 (Test.scala:47)


    // Import
    import org.scalactic.TolerantNumerics

    // In the test class
    val epsilon = 1e-4f
    implicit val doubleEq = TolerantNumerics.tolerantDoubleEquality(epsilon)

With the implicit in scope, === tolerates differences up to epsilon, and the test passes.

Pandas save as Decimal

Python's decimal.Decimal gives exact decimal arithmetic, but it is not a native pandas dtype. This is how you cast values to Decimal; going through str first avoids carrying the binary float error into the Decimal:

    import decimal

    df['dx'] = df['dx'].astype(str).map(decimal.Decimal)
    df['dy'] = df['dy'].astype(str).map(decimal.Decimal)
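
The str step matters. A minimal stdlib sketch showing the difference (no pandas needed):

```python
import decimal

# Constructing Decimal straight from a float keeps the binary representation error:
from_float = decimal.Decimal(0.1)
# Going through str first gives the value you actually see:
via_str = decimal.Decimal(str(0.1))

print(from_float)  # 0.1000000000000000055511151231257827021181583404541015625
print(via_str)     # 0.1
```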

Change the author of the last commit

    git commit --amend --author="username <email@gmail.com>"

Roll back to some previous commit

Here n counts commits back from the current HEAD, starting at 0. HEAD~0 is HEAD itself, so resetting to it leaves history unchanged (though --hard still discards uncommitted changes).

    git reset HEAD~n --hard

Spark compiled or provided?

It's provided. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; they need not be bundled, since the cluster manager supplies them at runtime. Once you have the assembled jar, submit it with the bin/spark-submit script.
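
In sbt this looks like the following (the version number is illustrative):

    // build.sbt: Spark is on the cluster's classpath, so mark it "provided"
    // and the assembly plugin will leave it out of the fat jar.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"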

Write a single csv file with Spark

Coalesce to a single partition before writing. Spark still creates a directory, with one part-*.csv file inside it:

      df.coalesce(1)
        .write
        .option("header", "true")
        .csv("output_dir")

Should I use df.sparkSession?

Yes. Since Spark 2.0, a Dataset exposes the session that created it via df.sparkSession, which saves you from threading a SparkSession through helper functions.

Using traits as interfaces?

As the scala-lang docs say, this is a legitimate approach.
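
A minimal sketch (the names are illustrative): a trait with only abstract members behaves like a Java interface, and a class implements it with extends.

    // A trait with only abstract members acts like a Java interface.
    trait Greeter {
      def greet(name: String): String
    }

    class EnglishGreeter extends Greeter {
      def greet(name: String): String = s"Hello, $name"
    }

    object Demo extends App {
      val g: Greeter = new EnglishGreeter
      println(g.greet("world"))  // Hello, world
    }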

2020 (c) generated by simplest blog engine [get it here]