Learning notes from Databricks talks
Optimizing File Loading And Partition Discovery
Data loading is the first step of a Spark application; as the dataset grows large, data loading time becomes an issue.
- Use Datasource tables
With Datasource tables, partition metadata and the schema are managed by the Hive metastore, so partition pruning can be done at the logical planning stage. You can also use the Spark SQL API directly when loading from a table:
```python
df = spark.read.table("database.table_name")   # takes a table name, not a file path
```
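For illustration, a minimal sketch of defining a partitioned Datasource table so the metastore tracks its partitions and schema; the database, table, and column names are hypothetical. Because partition metadata lives in the metastore, a filter on the partition column can be pruned during logical planning rather than by listing files:

```python
# Hypothetical database/table/column names, for illustration only.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.events (
        user_id    BIGINT,
        action     STRING,
        event_date DATE
    )
    USING parquet
    PARTITIONED BY (event_date)
""")

# The metastore knows the partitions, so this filter can be pruned
# at the logical planning stage instead of scanning every directory.
df = spark.read.table("sales_db.events").where("event_date = '2018-01-01'")
```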
- Specify `basePath` if loading from external files directly (e.g. CSV or JSON files)
When loading from files directly, partition discovery is done for each DataFrame creation, and Spark also needs to infer the schema from the files:
```python
df = spark.read.format("csv").option("inferSchema", "true").load(file)
```
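A sketch of the `basePath` approach, using hypothetical paths and columns: supplying an explicit schema skips the inference pass over the files, and `basePath` tells Spark where the partitioned directory layout starts so the `year=`/`month=` directories become partition columns:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical paths and columns; an explicit schema avoids the extra pass
# over the files that inferSchema would require.
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
])

df = (spark.read.format("csv")
      .schema(schema)
      .option("basePath", "/data/events")          # root of the partitioned layout
      .load("/data/events/year=2018/month=01"))    # year/month become partition columns
```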
Optimizing File Storage and Layout
- Prefer columnar formats over text for analytical queries
- Use compression with a splittable storage format
- Avoid large GZIP files (they cannot be split across tasks)
- Use partitioning and bucketing (see the sketch after this list)
- Parquet + Snappy is a good candidate
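A sketch of applying the layout advice above, with hypothetical table and column names; Parquet is columnar, Snappy-compressed Parquet stays splittable, and `bucketBy` requires writing through `saveAsTable`:

```python
# Hypothetical table/column names. Parquet + Snappy keeps files splittable,
# unlike one large GZIP text file.
(df.write
   .format("parquet")
   .option("compression", "snappy")     # Snappy is the default codec for Parquet
   .partitionBy("event_date")           # coarse-grained pruning by directory
   .bucketBy(32, "user_id")             # pre-organizes data for joins/aggregations on user_id
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("sales_db.events_bucketed"))
```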
Optimizing Queries
- Tuning `spark.sql.shuffle.partitions`
`spark.sql.shuffle.partitions` controls the number of partitions used by shuffle operations such as groupBy, repartition, join, and window; the default value is 200.
This is hard to tune because the setting applies to the whole job, which may contain many shuffle operations, so a value tuned for one operation may do no good for the others.
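For example, a job whose shuffled data is small might use far fewer than 200 partitions; a sketch (the value 64 is only illustrative, not a recommendation):

```python
# Lower the shuffle parallelism for a job with a small shuffle volume.
spark.conf.set("spark.sql.shuffle.partitions", "64")

counts = df.groupBy("action").count()   # this shuffle now produces 64 partitions
```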
- Adaptive Execution (Spark > 2.0)
`spark.sql.adaptive.enabled`
`spark.sql.adaptive.shuffle.targetPostShuffleInputSize`: default is 64 MB
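A sketch of turning these on (assuming the Spark 2.x property names listed above), so the number of post-shuffle partitions is chosen from the actual shuffle output size rather than one global setting:

```python
# Enable adaptive execution (Spark 2.x property names as listed above).
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Target roughly 64 MB of input per post-shuffle partition (the default).
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
```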
Understanding Unions
Data Skipping Index