Optimize Apache Spark jobs in your applications

Ashutosh Kumar
3 min read · May 8, 2020


Optimize your Spark applications.

If you’re already a Spark user, you have probably learned that the real challenges are memory pressure, wrongly sized executors, and long-running operations.

As mature programmers, we can make the best practices below part of our daily routine. You can speed up jobs with appropriate caching and by allowing for data skew. For best performance, monitor and review long-running and resource-consuming Spark job executions.

Here are the common Spark job optimizations and recommendations.

Data Abstraction

Earlier Spark versions used RDDs to abstract data; Spark 1.3 and 1.6 introduced DataFrames and Datasets, respectively.

DataFrames

  • Best choice in most situations.
  • Provides query optimization through Catalyst.
  • Whole-stage code generation.
  • Direct memory access.
  • Low garbage collection (GC) overhead.

RDDs

  • You don’t need to use RDDs unless you need to build a new custom RDD.
  • Avoid using collect if you can.
  • No whole-stage code generation.
  • High GC overhead.
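
To make the trade-off concrete, here is a minimal sketch (the data and names are made up) of the same aggregation written with the DataFrame API, which Catalyst can optimize, and with the RDD API, whose lambdas are opaque to the optimizer:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()
    import spark.implicits._

    val sales = Seq(("books", 12.0), ("games", 30.0), ("books", 8.0))
      .toDF("category", "amount")

    // DataFrame version: Catalyst optimizes the plan, and whole-stage
    // code generation compiles it to tight JVM bytecode.
    val dfTotals = sales.groupBy("category").sum("amount")

    // RDD version: the lambdas are opaque to the optimizer, so there is
    // no query optimization, and GC overhead is higher.
    val rddTotals = sales.rdd
      .map(row => (row.getString(0), row.getDouble(1)))
      .reduceByKey(_ + _)

    dfTotals.show()  // prefer show/take over collect on large results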

Use optimal data format

Spark supports many formats, such as CSV, JSON, XML, Parquet, ORC, and Avro. The best format for performance is Parquet with Snappy compression, which is the default in Spark 2.x. Parquet stores data in a columnar format and is highly optimized in Spark.
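
A minimal read/write sketch, assuming an existing DataFrame df and a hypothetical bucket path (Snappy is already the default codec in Spark 2.x; the option is spelled out only for clarity):

    df.write
      .option("compression", "snappy")
      .mode("overwrite")
      .parquet("s3a://my-bucket/sales")   // hypothetical path

    val parquetDf = spark.read.parquet("s3a://my-bucket/sales")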

Select default storage

When you create a new Spark cluster, you can select S3 as your cluster’s default storage.

Use the cache

Spark provides its own native caching mechanisms, which you can use through methods such as .persist(), .cache(), and CACHE TABLE.
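
Here is a short sketch of those entry points, assuming a DataFrame df registered as the hypothetical temp view sales_view; pick one mechanism per dataset:

    import org.apache.spark.storage.StorageLevel

    df.cache()                               // shorthand for persist(MEMORY_AND_DISK)
    // df.persist(StorageLevel.MEMORY_ONLY)  // or choose an explicit storage level
    spark.sql("CACHE TABLE sales_view")      // the SQL flavor

    df.count()      // caching is lazy; an action materializes it
    df.unpersist()  // release the memory once you are done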

Use memory efficiently

Spark operates by placing data in memory, so managing memory resources is a key aspect of optimizing the execution of Spark jobs. Below are several techniques:

  • Prefer smaller data partitions, and account for data size, types, and distribution in your partitioning strategy (see the sketch below).
  • Consider the newer, more efficient Kryo data serialization rather than the default Java serialization.
  • Prefer using YARN, as it separates spark-submit by batch.
  • Monitor and tune Spark configuration settings.
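
A sketch of explicit partition sizing, assuming an existing DataFrame df with a "category" column; the partition counts are hypothetical, and in practice you should aim for partitions of roughly 100-200 MB:

    val bySize = df.repartition(200)                  // full shuffle into 200 partitions
    val byKey  = df.repartition(200, df("category"))  // also co-locates rows by key
    val fewer  = byKey.coalesce(50)                   // shrink without a full shuffle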

Optimize data serialization

Spark jobs are distributed, so appropriate data serialization is important for the best performance. There are two serialization options for Spark:

  • Java serialization is the default.
  • Kryo serialization is a newer format and can result in faster and more compact serialization than Java. Kryo requires that you register the classes in your program, and it doesn't yet support all Serializable types.
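
A minimal configuration sketch for switching to Kryo; SalesRecord is a hypothetical stand-in for your application's own classes:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Hypothetical application class that Kryo should know about.
    case class SalesRecord(category: String, amount: Double)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[SalesRecord]))

    val spark = SparkSession.builder.config(conf).appName("kryo-demo").getOrCreate()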

Optimize joins and shuffles

By default, Spark uses the SortMerge join type. This type of join is best suited for large data sets, but it is computationally expensive because it must first sort the left and right sides of the data before merging them.

A Broadcast join is best suited for smaller data sets, or where one side of the join is much smaller than the other side. This type of join broadcasts one side to all executors, and so requires more memory for broadcasts in general.

You can change the threshold for automatic broadcasting by setting spark.sql.autoBroadcastJoinThreshold in your configuration, or you can force a broadcast with a join hint through the DataFrame APIs (dataframe.join(broadcast(df2))).
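
For illustration, a sketch assuming two existing DataFrames, largeDf and smallDf, that share a "category" column (the threshold is in bytes; 10 MB is Spark's default for automatic broadcasting):

    import org.apache.spark.sql.functions.broadcast

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

    // The hint broadcasts smallDf to every executor regardless of the threshold.
    val joined = largeDf.join(broadcast(smallDf), Seq("category"))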

Cluster configuration

Here are some common spark-submit parameters you can adjust:

--num-executors
Sets the appropriate number of executors.

--executor-cores
Sets the number of cores for each executor. Typically you should have middle-sized executors, as other processes consume some of the available memory.

--executor-memory
Sets the memory size for each executor, which controls the heap size on YARN. Leave some memory for execution overhead.
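
The same knobs can be set programmatically instead of on the spark-submit command line; a sketch with hypothetical values, which you should size to your cluster:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("tuned-job")
      .config("spark.executor.instances", "10")  // --num-executors
      .config("spark.executor.cores", "4")       // --executor-cores
      .config("spark.executor.memory", "8g")     // --executor-memory
      .getOrCreate()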

Optimize job execution

  • Cache as necessary; for example, if you use the data twice, cache it.
  • Broadcast variables to all executors. The variables are only serialized once, resulting in faster lookups (see the sketch below).
  • Use a thread pool on the driver, which results in faster operation for many tasks.
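
A broadcast-variable sketch; countryCodes is a made-up lookup map, but anything small enough to fit comfortably in memory qualifies:

    val countryCodes = Map("IN" -> "India", "US" -> "United States")
    val bcCodes = spark.sparkContext.broadcast(countryCodes)

    // Each executor deserializes the map once, not once per task.
    val codes = spark.sparkContext.parallelize(Seq("IN", "US", "IN"))
    val names = codes.map(code => bcCodes.value.getOrElse(code, "unknown"))
    names.collect().foreach(println)  // collect is fine here: the result is tiny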

For more details, please reach out to me at: https://www.linkedin.com/in/ashutosh-kumar-9a109864/

