Learning notes from Databricks talks
Optimizing File Loading And Partition Discovery
Data loading is the first step of a Spark application; as the dataset grows large, data loading time becomes an issue.
- Use Datasource tables
With Datasource tables, partition metadata and the schema are managed by the Hive metastore, so partition pruning can be done at the logical planning stage. You can also use the Spark SQL API directly when loading from a table:
```python
df = spark.read.table("database.table_name")   # takes a table name, not a file path
```
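For illustration, a minimal sketch of defining a partitioned Datasource table so the metastore tracks its partitions and schema; the database, table, and column names are hypothetical. Because partition metadata lives in the metastore, a filter on the partition column can be pruned during logical planning rather than by listing files:

```python
# Hypothetical database/table/column names, for illustration only.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.events (
        user_id    BIGINT,
        action     STRING,
        event_date DATE
    )
    USING parquet
    PARTITIONED BY (event_date)
""")

# The metastore knows the partitions, so this filter can be pruned
# at the logical planning stage instead of scanning every directory.
df = spark.read.table("sales_db.events").where("event_date = '2018-01-01'")
```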
- Specify `basePath` if loading from external files directly (e.g. CSV or JSON files)
When loading from files directly, partition discovery is done for each DataFrame creation, and Spark also needs to infer the schema from the files:
```python
df = spark.read.format("csv").option("inferSchema", "true").load(file)
```
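A sketch of the `basePath` approach, using hypothetical paths and columns: supplying an explicit schema skips the inference pass over the files, and `basePath` tells Spark where the partitioned directory layout starts so the `year=`/`month=` directories become partition columns:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical paths and columns; an explicit schema avoids the extra pass
# over the files that inferSchema would require.
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
])

df = (spark.read.format("csv")
      .schema(schema)
      .option("basePath", "/data/events")          # root of the partitioned layout
      .load("/data/events/year=2018/month=01"))    # year/month become partition columns
```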
Optimizing File Storage and Layout
- Prefer columnar formats over text for analytical queries
- Use compression with a splittable storage format
- Avoid large GZIP files (they cannot be split across tasks)
- Use partitioning and bucketing (see the sketch after this list)
- Parquet + Snappy is a good candidate
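A sketch of applying the layout advice above, with hypothetical table and column names; Parquet is columnar, Snappy-compressed Parquet stays splittable, and `bucketBy` requires writing through `saveAsTable`:

```python
# Hypothetical table/column names. Parquet + Snappy keeps files splittable,
# unlike one large GZIP text file.
(df.write
   .format("parquet")
   .option("compression", "snappy")     # Snappy is the default codec for Parquet
   .partitionBy("event_date")           # coarse-grained pruning by directory
   .bucketBy(32, "user_id")             # pre-organizes data for joins/aggregations on user_id
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("sales_db.events_bucketed"))
```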
Optimizing Queries
- Tuning `spark.sql.shuffle.partitions`
`spark.sql.shuffle.partitions` controls the number of partitions used by shuffle operations such as groupBy, repartition, join, and window; the default value is 200.
This is hard to tune because the setting applies to the whole job, which may contain many shuffle operations, so a value tuned for one operation may do no good for the others.
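For example, a job whose shuffled data is small might use far fewer than 200 partitions; a sketch (the value 64 is only illustrative, not a recommendation):

```python
# Lower the shuffle parallelism for a job with a small shuffle volume.
spark.conf.set("spark.sql.shuffle.partitions", "64")

counts = df.groupBy("action").count()   # this shuffle now produces 64 partitions
```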
- Adaptive Execution (Spark > 2.0)
`spark.sql.adaptive.enabled`
`spark.sql.adaptive.shuffle.targetPostShuffleInputSize`: default is 64 MB
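A sketch of turning these on (assuming the Spark 2.x property names listed above), so the number of post-shuffle partitions is chosen from the actual shuffle output size rather than one global setting:

```python
# Enable adaptive execution (Spark 2.x property names as listed above).
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Target roughly 64 MB of input per post-shuffle partition (the default).
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
```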
Understanding Unions
Data Skipping Index