Best Practices for Using Apache Spark

Learning notes from Databricks talks

Optimizing File Loading And Partition Discovery

Data loading is the first step of a Spark application; as the dataset grows, loading time can become a bottleneck.

  • Use Datasource tables

With tables, the partition metadata and schema are managed by the Hive metastore, which lets partition pruning happen at the logical planning stage. You can also use the Spark SQL API directly when loading from a table:

df = spark.read.table("table_name")

df.write.partitionBy("date").saveAsTable("table_name")
  • Specify basePath if loading directly from external files (e.g. CSV or JSON)

When loading from files directly, partition discovery runs on every DataFrame creation, and Spark also has to infer the schema from the files unless one is supplied:

# Schema inference requires an extra pass over the data
df = spark.read.format("csv").option("inferSchema", True).load(file)

# Supplying a known schema avoids the inference pass
df = spark.read.format("csv").schema(knownSchema).load(file)
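The basePath option itself can be set at read time. A minimal sketch, assuming a hypothetical layout /data/events/date=.../:

# Hypothetical layout: /data/events/date=2020-01-01/, /data/events/date=2020-01-02/, ...
# basePath tells Spark where partition discovery starts, so the "date"
# partition column is kept even when only one partition directory is loaded.
df = (spark.read.format("csv")
      .schema(knownSchema)
      .option("basePath", "/data/events")
      .load("/data/events/date=2020-01-01"))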

Optimizing File Storage and Layout

  • Prefer columnar over text for analytical queries

  • Compression with a splittable storage format

  • Avoid large GZIP files

  • Partitioning and bucketing

Parquet with Snappy compression is a good default choice.
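A minimal sketch combining these points, with hypothetical table and column names (date, user_id); note that bucketBy requires writing to a table:

# Columnar Parquet with splittable Snappy compression, partitioned by date
# and bucketed by user_id (hypothetical columns).
(df.write
   .format("parquet")
   .option("compression", "snappy")
   .partitionBy("date")
   .bucketBy(8, "user_id")
   .sortBy("user_id")
   .saveAsTable("events"))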

Optimizing Queries

  • Tuning spark.sql.shuffle.partitions
    spark.sql.shuffle.partitions controls the number of partitions used by shuffle operations such as groupBy, repartition, join, and window. The default value is 200.

This is hard to tune because the setting applies to every shuffle in the job, which may involve many different operations; a value that works well for one operation may do no good for the others.
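An illustrative way to adjust it for a session (the value 2000 is just an example, not a recommendation):

# Applies to every shuffle (groupBy, join, window, ...) in this session.
spark.conf.set("spark.sql.shuffle.partitions", 2000)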

  • Adaptive Execution (Spark > 2.0)
    spark.sql.adaptive.enabled
    spark.sql.adaptive.shuffle.targetPostShuffleInputSize: the default is 64 MB
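A sketch of enabling it per session, using the config names listed above (later Spark versions rename the target-size setting, e.g. to spark.sql.adaptive.advisoryPartitionSizeInBytes):

# Let Spark coalesce small post-shuffle partitions toward the target input size.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", 64 * 1024 * 1024)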

  • Understanding Unions

  • Data Skipping Index

References

YouTube Talks
Understanding Spark Source Code