<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Distributed Computing Archives - relataly.com</title>
	<atom:link href="https://www.relataly.com/tag/distributed-computing/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.relataly.com/tag/distributed-computing/</link>
	<description>The Business AI Blog</description>
	<lastBuildDate>Sat, 27 May 2023 10:13:53 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.relataly.com/wp-content/uploads/2023/04/cropped-AI-cat-Icon-White.png</url>
	<title>Distributed Computing Archives - relataly.com</title>
	<link>https://www.relataly.com/tag/distributed-computing/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">175977316</site>	<item>
		<title>Leveraging Distributed Computing for Weather Analytics with PySpark</title>
		<link>https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/</link>
					<comments>https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Sun, 03 Apr 2022 13:49:00 +0000</pubDate>
				<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[PySpark]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Seaborn]]></category>
		<category><![CDATA[Weather Analytics]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[BigData Analytics]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=2739</guid>

					<description><![CDATA[<p>Apache Spark is a popular distributed computing framework for Big Data processing and analytics. In this tutorial, we will work hands-on with PySpark, Spark&#8217;s Python-specific interface. We build on the conceptual knowledge gained in a previous tutorial: Introduction to BigData Analytics with Apache Spark, in which we learned about the essential concepts behind Apache Spark ... <a title="Leveraging Distributed Computing for Weather Analytics with PySpark" class="read-more" href="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/" aria-label="Read more about Leveraging Distributed Computing for Weather Analytics with PySpark">Read more</a></p>
<p>The post <a href="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/">Leveraging Distributed Computing for Weather Analytics with PySpark</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Apache Spark is a popular distributed computing framework for Big Data processing and analytics. In this tutorial, we will work hands-on with PySpark, Spark&#8217;s Python-specific interface. We build on the conceptual knowledge gained in a previous tutorial: <a href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark/6324/" target="_blank" rel="noreferrer noopener"><em>Introduction to BigData Analytics with Apache Spark</em></a>, in which we learned about the essential concepts behind Apache Spark and its distributed architecture. </p>



<p>The PySpark library provides access to the Apache Spark APIs and modules such as Spark SQL, DataFrames, Streaming, Spark Core, and MLlib for machine learning. Several of these features will help us to prepare and analyze a historical dataset gathered by a weather station in Zurich. You will gain an overview of essential PySpark functions for transforming and querying data in a local computing environment.</p>



<p>The rest of this tutorial proceeds as follows: In the first sections, we will ingest, join, clean, and transform the data using a broad set of PySpark&#8217;s DataFrame functions. In addition, we will touch on the topic of user-defined functions to perform custom transformations. Finally, we will analyze and visualize the data using the Seaborn Python library. As part of the analysis, we work with PySpark SQL, which allows us to register datasets as tables and query them using PySpark SQL syntax.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-kadence-image kb-image_a0bd8f-64 size-large is-resized"><img fetchpriority="high" decoding="async" data-attachment-id="6442" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/image-10-13/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/image-10.png" data-orig-size="1474,547" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-10" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/image-10.png" src="https://www.relataly.com/wp-content/uploads/2022/03/image-10-1024x380.png" alt="Do plot on historical weather data from Zurich, created using PySpark and Seaborn" class="kb-img wp-image-6442" width="693" height="257" srcset="https://www.relataly.com/wp-content/uploads/2022/03/image-10.png 1024w, https://www.relataly.com/wp-content/uploads/2022/03/image-10.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/image-10.png 768w, https://www.relataly.com/wp-content/uploads/2022/03/image-10.png 1474w" sizes="(max-width: 693px) 100vw, 693px" /><figcaption>We can use PySpark and Seaborn to create dot plots like the one above. </figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">What is Weather Analytics?</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Weather analytics is the process of using data and analytics to understand and predict weather patterns and their impacts. This can involve analyzing historical weather data to identify trends and patterns, using mathematical models to simulate and forecast future weather conditions, and using machine learning algorithms to make more accurate predictions. Weather analytics can be used for a wide range of applications, such as predicting the likelihood of severe weather events, improving agricultural planning and decision-making, and managing energy consumption and supply. By providing insights into future weather conditions, weather analytics can help businesses and organizations to make more informed decisions and mitigate the potential impacts of extreme weather.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>






<h2 class="wp-block-heading" id="h-analyzing-historical-weather-data-using-pyspark">Analyzing Historical Weather Data using PySpark</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Weather and climate data are interesting resources, not only for weather forecasting but also for dozens of other purposes across different industries. A central area is certainly understanding climate change. Data scientists analyze climate data collected from weather stations around the globe to forecast how climate trends and events unfold. Because the amounts of data are often large, climate analytics is a common application area for distributed computing.</p>



<p>This tutorial guides us through a good use case for distributed data processing. We will process and analyze a set of historical weather data collected from a weather station in Zurich. We aim to explore the relations between climate variables such as temperature, wind, snowfall, and precipitation. However, the primary purpose of this tutorial is to familiarize ourselves with the essential functions of PySpark. Some topics covered in this part are:</p>



<ul class="wp-block-list">
<li>reading and writing DataFrames</li>



<li>selecting, renaming, and manipulating columns</li>



<li>filtering, dropping, sorting, and aggregating rows</li>



<li>joining and transforming data</li>



<li>working with UDFs and Spark SQL functions</li>



</ul>



<p>While we implement these functions, we will also look into what PySpark does under the hood.</p>



<p>PySpark can run in a simulated distributed environment on your local machine, so there is no need for expensive hardware.</p>



<p>The code is available in the relataly GitHub repository.</p>



<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_03b447-31"><a class="kb-button kt-button button kb-btn_7ddab7-1b kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-tutorials/blob/master/09%20Distributed%20Analytics/300%20Distributed%20Computing%20-%20Analyzing%20Zurich%20Weather%20Data%20using%20PySpark.ipynb" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fe_eye kt-btn-icon-side-left"><svg viewBox="0 0 24 24"  fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M1 12s4-8 11-8 11 8 11 8-4 8-11 8-11-8-11-8z"/><circle cx="12" cy="12" r="3"/></svg></span><span class="kt-btn-inner-text">View on GitHub </span></a>

<a class="kb-button kt-button button kb-btn_236915-75 kt-btn-size-standard kt-btn-width-type-full kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-true  wp-block-button__link wp-block-kadence-singlebtn" href="https://github.com/flo7up/relataly-public-python-API-tutorials" target="_blank" rel="noreferrer noopener"><span class="kb-svg-icon-wrap kb-svg-icon-fa_github kt-btn-icon-side-left"><svg viewBox="0 0 496 512"  fill="currentColor" xmlns="http://www.w3.org/2000/svg"  aria-hidden="true"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg></span><span class="kt-btn-inner-text">Relataly GitHub Repo </span></a></div>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="514" height="511" data-attachment-id="12434" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/view-of-zurich-spark-machine-learning-python-tutorial/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png" data-orig-size="514,511" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="view-of-zurich-spark-machine-learning-python-tutorial" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png" src="https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png" alt="" class="wp-image-12434" srcset="https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png 514w, https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png 300w, https://www.relataly.com/wp-content/uploads/2023/02/view-of-zurich-spark-machine-learning-python-tutorial.png 140w" sizes="(max-width: 514px) 100vw, 514px" /><figcaption class="wp-element-caption">Zurich weather knows all seasons.</figcaption></figure>
</div>
</div>



<h3 class="wp-block-heading" id="h-prerequisites">Prerequisites</h3>



<p>This tutorial assumes that you have set up your <a href="https://www.python.org/downloads/" target="_blank" rel="noreferrer noopener">Python 3</a> environment. You can follow&nbsp;<a href="https://www.relataly.com/anaconda-python-environment-machine-learning/1663/" target="_blank" rel="noreferrer noopener">this tutorial</a> to prepare your Anaconda environment if you have not set it up yet. I recommend using Anaconda or Visual Studio Code, but any other Python environment will do.</p>



<p>It is also assumed that you have the following packages installed: <em>pandas</em>, <em>matplotlib</em>, and <a href="https://seaborn.pydata.org/#" target="_blank" rel="noreferrer noopener"><em>seaborn</em></a> for visualization. You can install these packages with the console command&nbsp;<em>pip install &lt;package name&gt;</em>&nbsp;or, if you use the Anaconda package manager,&nbsp;<em>conda install &lt;package name&gt;</em>. In addition, you need to install the PySpark library (<em>pip install pyspark</em>).</p>



<h3 class="wp-block-heading" id="h-download-the-weather-data">Download the Weather Data</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>We will work with a public weather dataset that contains daily weather information from Zurich, Switzerland. The data was collected at a weather station between 1979-01-01 and 2021-01-01 and has been divided into two sets, &#8220;A&#8221; and &#8220;B.&#8221; After downloading the files, place them under the following directory: <em>root-folder/data/weather/</em></p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full"><img decoding="async" width="511" height="482" data-attachment-id="12445" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/weather-analytics-data-machine-learning-python-spark/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png" data-orig-size="511,482" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="weather-analytics-data-machine-learning-python-spark" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png" src="https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png" alt="" class="wp-image-12445" srcset="https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png 511w, https://www.relataly.com/wp-content/uploads/2023/02/weather-analytics-data-machine-learning-python-spark.png 300w" sizes="(max-width: 511px) 100vw, 511px" /><figcaption class="wp-element-caption">We will work with a set of weather data from Zurich. Image created with <a href="http://www.midjourney.com" target="_blank" rel="noreferrer noopener">Midjourney</a>.</figcaption></figure>
</div>
</div>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_9a621c-00"><a class="kb-button kt-button button kb-btn_4b1def-4b kt-btn-size-standard kt-btn-width-type-fixed kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-false  wp-block-button__link wp-block-kadence-singlebtn" href="https://www.relataly.com/wp-content/uploads/2022/03/zurich-weather-part-a.csv" download="" target="_blank" rel="noreferrer noopener"><span class="kt-btn-inner-text">Download File A </span></a></div>



<ul class="wp-block-list">
<li>date in format YYYY-mm-dd</li>



<li>the wind direction in degrees</li>



<li>max snow depth in mm</li>



<li>precipitation in mm/m²</li>



<li>air pressure in hPa</li>



<li>max wind speed in km/h</li>



<li>average wind speed in km/h</li>



<li>daily sunshine in minutes</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-kadence-advancedbtn kb-buttons-wrap kb-btns_e65253-c2"><a class="kb-button kt-button button kb-btn_a200bc-4e kt-btn-size-standard kt-btn-width-type-fixed kb-btn-global-inherit  kt-btn-has-text-true kt-btn-has-svg-false  wp-block-button__link wp-block-kadence-singlebtn" href="https://www.relataly.com/wp-content/uploads/2022/03/zurich-weather-part-b.csv" download="" target="_blank" rel="noreferrer noopener"><span class="kt-btn-inner-text">Download File B </span></a></div>



<ul class="wp-block-list">
<li>date string in format YYYY-MM-DD</li>



<li>the minimum temperature in °C</li>



<li>maximum temperature in °C</li>



<li>average temperature in °C</li>
</ul>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"></div>
</div>



<h3 class="wp-block-heading">Step #1 Initialize SparkSession</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The first step of any Spark application is to create a SparkSession. Older versions of Spark used separate contexts for accessing Spark&#8217;s different modules. Since Spark 2.0, SparkSession has replaced them as the unified entry point for accessing Spark APIs and communicating with the cluster.</p>



<p>We can create a new SparkSession by using the getOrCreate() method. This method checks if there is an existing&nbsp;<a href="https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession" target="_blank" rel="noreferrer noopener">SparkSession</a>; if it does not find a current session, it creates a new instance. Since we are working with a local PySpark installation, this will create a new local SparkSession. If you have a Spark cluster available, you can connect to it by providing the builder with the cluster address.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized"><img decoding="async" data-attachment-id="6431" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/picture1/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg" data-orig-size="611,360" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Picture1" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg" src="https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg" alt="PySpark is the Python-specific interface of Apache Spark" class="wp-image-6431" width="240" height="142" srcset="https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg 611w, https://www.relataly.com/wp-content/uploads/2022/03/Picture1.jpg 300w" sizes="(max-width: 240px) 100vw, 240px" /><figcaption class="wp-element-caption">PySpark is the Python-specific interface of Apache Spark</figcaption></figure>
</div>
</div>
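


<p>For illustration, here is a hedged sketch of configuring the builder explicitly. The master URL and app name below are placeholders chosen for this example, not part of the tutorial&#8217;s setup: &#8220;local[4]&#8221; runs Spark locally with four worker threads, while a real cluster address would look like &#8220;spark://host:7077&#8221;.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Hypothetical sketch: configure the session explicitly instead of relying on defaults.
from pyspark.sql import SparkSession

# &quot;local[4]&quot; = run locally with 4 worker threads; swap in a cluster
# address such as &quot;spark://host:7077&quot; if you have a cluster available.
spark = SparkSession.builder \
    .master(&quot;local[4]&quot;) \
    .appName(&quot;zurich-weather&quot;) \
    .getOrCreate()</pre></div>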



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># A tutorial for this file is available at www.relataly.com

# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import col, isnan, when, count, udf, year, month, to_date, mean
import pyspark.sql.functions as F
import seaborn as sns
import matplotlib.pyplot as plt

# Create the SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">&lt;pyspark.sql.session.SparkSession object at 0x000001F8C0D10B20&gt;</pre></div>



<h3 class="wp-block-heading">Step #2 Load the Data to a Spark DataFrame</h3>



<p>Next, we want to load two CSV files by invoking the PySpark read function from the SparkSession instance. The function returns a Spark DataFrame, an abstraction layer for interacting with tabular data. </p>



<h4 class="wp-block-heading">2.1 Reading File A</h4>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p>Let&#8217;s read the first file by invoking the read function on the SparkSession. Spark evaluates the read lazily: when we run the read function in the code below, the read operation does not trigger computation by itself; Spark waits until we call an action. </p>



<p>After running the printSchema method in the code below, the Spark engine performs the &#8220;read&#8221; operation and shows us the result. We would not see any computation if we only used the read function (a transformation) without calling any actions. However, we also call describe() to compute some basic statistics on the DataFrame. These action methods trigger Spark to read the data and create a new RDD that contains the data.</p>



<p>Spark supports many formats for reading data, including CSV, JSON, Delta, Parquet, TXT, AVRO, etc. Our data files are in CSV format, separated by a comma. Therefore, we use the .csv function. The code below reads a CSV file into a Spark DataFrame.</p>



<figure class="wp-block-table is-style-regular"><table><thead><tr><th>DataFrame Action</th><th>Description</th></tr></thead><tbody><tr><td>show()</td><td>Displays the top n rows of a DataFrame</td></tr><tr><td>count()</td><td>Counts the number of rows in a DataFrame</td></tr><tr><td>collect()</td><td>Collects the data from the worker nodes and returns them in an array</td></tr><tr><td>summary(), describe()</td><td>Show statistics for string and numeric columns</td></tr><tr><td>first(), head()</td><td>Returns the first row, or several first rows</td></tr><tr><td>take()</td><td>Collects the n first rows and returns them in an array</td></tr></tbody></table><figcaption class="wp-element-caption">Actions Methods on the Spark DataFrame API</figcaption></figure>
</div>
</div>
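


<p>To see the transformation/action distinction in isolation, consider this minimal sketch (illustrative only, using the same local file as below; note that enabling schema inference would already trigger a scan during the read):</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Building the read is a transformation: this line alone processes no rows.
lazy_df = spark.read.option(&quot;sep&quot;, &quot;,&quot;).csv('data/weather/zurich_weather_a')

# Actions such as count() or show() trigger Spark to execute the plan.
print(lazy_df.count())
lazy_df.show(2)</pre></div>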



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Read File A
spark_weather_df_a = spark.read \
    .option(&quot;header&quot;, False) \
    .option(&quot;sep&quot;, &quot;,&quot;) \
    .option(&quot;inferSchema&quot;, True) \
    .csv(path='data/weather/zurich_weather_a')
    
spark_weather_df_a.printSchema()
spark_weather_df_a.describe().show()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)
 |-- _c9: string (nullable = true)

+-------+------------------+----------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+----+
|summary|               _c0|       _c1|              _c2|               _c3|              _c4|              _c5|               _c6|               _c7|               _c8| _c9|
+-------+------------------+----------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+----+
|  count|             15384|     15384|            15363|             15384|            15243|              885|               897|               897|               897|   0|
|   mean|182.61966978679146|      null|9.552613421857709|3.0395995839833567|7.952896411467559|156.8146892655367| 7.334894091415833|27.502229654403596|1017.9076923076926|null|
| stddev|105.74046684963878|      null|7.476849234845438| 6.527361918363559|31.45995738637225|96.54941299923699|4.8985850971655704|16.091181159085917| 8.236341299054192|null|
|    min|                 0|1979-01-01|            -18.0|               0.0|              0.0|              0.0|               2.2|               7.4|             984.7|null|
|    max|               366|2021-01-01|             28.0|              97.8|            550.0|            358.0|              40.4|             113.0|            1042.8|null|
+-------+------------------+----------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+----+</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" src="https://www.relataly.com/wp-content/uploads/2022/04/image-5-1024x349.png" alt="" class="wp-image-6734"/></figure>



<h4 class="wp-block-heading" id="h-2-2-reading-file-b">2.2 Reading File B</h4>



<p>In the code below, you can see the options for header, separator, and inferSchema. We are using Spark&#8217;s schema inference option to read the CSV data. However, we must use this option carefully because inferring the schema requires an extra pass over the data and can affect performance. Spark is sensitive to data types, and inferring the schema does not always produce the intended types. When we are unsure what the data looks like, it is safer to make the data types explicit to Spark. In our case, inferSchema has been tested beforehand to ensure it works.</p>
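


<p>If you prefer making the types explicit, a minimal sketch for File B could look like the following (the StructField entries mirror Spark&#8217;s default _c0 to _c3 column naming and the types we saw above; this block is illustrative and not used further):</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Hedged sketch: declare an explicit schema instead of using inferSchema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

schema_b = StructType([
    StructField(&quot;_c0&quot;, IntegerType(), True),  # row index
    StructField(&quot;_c1&quot;, StringType(), True),   # date string
    StructField(&quot;_c2&quot;, DoubleType(), True),   # max temperature
    StructField(&quot;_c3&quot;, DoubleType(), True),   # min temperature
])

df_b_explicit = spark.read \
    .option(&quot;header&quot;, False) \
    .option(&quot;sep&quot;, &quot;,&quot;) \
    .schema(schema_b) \
    .csv('data/weather/zurich_weather_b')</pre></div>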



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Read File B
spark_weather_df_b = spark.read \
    .option(&quot;header&quot;, False) \
    .option(&quot;sep&quot;, &quot;,&quot;) \
    .option(&quot;inferSchema&quot;, True) \
    .csv(path='data/weather/zurich_weather_b')

spark_weather_df_b.printSchema()
spark_weather_df_b.describe().show()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">root
 |-- _c0: integer (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)

+-------+------------------+----------+------------------+-----------------+
|summary|               _c0|       _c1|               _c2|              _c3|
+-------+------------------+----------+------------------+-----------------+
|  count|             15383|     15383|             15374|            15368|
|   mean|182.63121627770917|      null|13.705762976453755| 6.06695731389901|
| stddev|105.73420482794731|      null|  8.83091914731549|6.620985806040894|
|    min|                 0|1979-01-01|             -15.2|            -20.8|
|    max|               366|2021-01-01|              36.0|             22.5|
+-------+------------------+----------+------------------+-----------------+</pre></div>



<h3 class="kt-adv-heading_ab6af5-fa wp-block-kadence-advancedheading" data-kb-block="kb-adv-heading_ab6af5-fa">Step #3 Join the Data</h3>



<p>Next, we will join the data from the two files and merge them into a single Spark DataFrame. Whenever we run a transformation on a DataFrame, whether via the SQL or the DataFrame API, we submit a query to Spark&#8217;s query execution engine, which optimizes the query plan and then executes it. The result is a new Resilient Distributed Dataset (RDD) with the transformed data. Keep this in mind, and imagine what happens underneath the hood as we proceed.</p>
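


<p>If you want to inspect that query plan yourself, the DataFrame API exposes it directly. A small sketch:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># explain() prints the physical plan Spark has built for a transformation chain;
# explain(True) additionally shows the parsed, analyzed, and optimized logical plans.
spark_weather_df_a.drop(&quot;_c0&quot;).explain(True)</pre></div>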



<p>We now have two files containing specific columns of our weather data. To make sense of the data and prepare them for analytics, we will merge the two files into a single DataFrame. In addition, we perform several transformations to bring the data into shape.</p>



<ul class="wp-block-list">
<li>Join the two datasets using the Join function</li>



<li>Remove some columns that we won&#8217;t need via the drop function</li>



<li>Define new column names</li>



<li>Select and rename columns using select() with alias()</li>
</ul>



<p>First, we drop column _c0 because we will not need it.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Drop unused Columns
spark_weather_df_a = spark_weather_df_a.drop(&quot;_c0&quot;)
spark_weather_df_b = spark_weather_df_b.drop(&quot;_c0&quot;)</pre></div>



<p>There are different ways to rename columns in PySpark. The first (A) is withColumnRenamed, which requires a separate function call for each column we want to rename. The second (B) is a custom function that takes a dictionary; in this way, we can rename multiple columns at once. </p>



<p>After renaming the columns, we use the join function to merge the two datasets based on their date columns.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># method A: rename individual columns
spark_weather_df_b_renamed = spark_weather_df_b.withColumnRenamed(&quot;_c1&quot;, &quot;date&quot;) \
    .withColumnRenamed(&quot;_c2&quot;, &quot;max_temp&quot;) \
    .withColumnRenamed(&quot;_c3&quot;, &quot;min_temp&quot;) 
        
# method B: rename multiple columns at once   
def rename_multiple_columns(df, columns):
    if isinstance(columns, dict):
        return df.select(*[F.col(col_name).alias(columns.get(col_name, col_name)) for col_name in df.columns])
    else:
        raise ValueError(&quot;columns need to be in dict format {'existing_name_a':'new_name_a', 'existing_name_b':'new_name_b'}&quot;)

dict_columns = {&quot;_c1&quot;: &quot;date2&quot;, 
                &quot;_c2&quot;: &quot;avg_temp&quot;, 
                &quot;_c3&quot;: &quot;precip&quot;,
                &quot;_c4&quot;: &quot;snow&quot;,
                &quot;_c5&quot;: &quot;wind_dir&quot;,
                &quot;_c6&quot;: &quot;wind_speed&quot;,
                &quot;_c7&quot;: &quot;wind_power&quot;,
                &quot;_c8&quot;: &quot;air_pressure&quot;,
                &quot;_c9&quot;: &quot;sunny_hours&quot;,}
spark_weather_df_a_renamed = rename_multiple_columns(spark_weather_df_a , dict_columns)

# Join the dataframes
spark_weather_df = spark_weather_df_a_renamed.join(spark_weather_df_b_renamed, spark_weather_df_a_renamed.date2 == spark_weather_df_b_renamed.date, &quot;inner&quot;)</pre></div>



<h3 class="wp-block-heading" id="h-step-4-gain-a-quick-overview-of-the-data">Step #4 Gain a Quick Overview of the Data</h3>



<p>When we perform transformations on a new dataset, we typically want to understand the outcome and call several functions to gain an overview of the data. We can make our lives easier by writing a small helper function for this purpose, which we can call as needed. Running the custom function below gives us a quick overview of the data. It takes a DataFrame as an argument and performs the following steps:</p>



<ul class="wp-block-list">
<li>Return the two first records </li>



<li>Counting NAN values for all columns that have double or float datatype</li>



<li>Print duplicate values based on the date column. To display the result, we can use the Limit function in combination with the toPandas function</li>



<li>Prints the schema</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">def quick_overview(df):
   # display the spark dataframe
   print(&quot;FIRST RECORDS&quot;)
   print(df.limit(2).sort(col(&quot;date&quot;), ascending=True).toPandas())

   # count null values
   print(&quot;COUNT NULL VALUES&quot;)
   print(df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c, y in df.dtypes if y in [&quot;double&quot;, &quot;float&quot;]]
      ).toPandas())

   # print(&quot;DESCRIBE STATISTICS&quot;)
   # print(df.describe().toPandas())
   # # Alternatively to get the max value, we could use max_value = df.agg({&quot;precipitation&quot;: &quot;max&quot;}).collect()[0][0]

   # check for dublicates
   dublicates = df.groupby(spark_weather_df.date) \
    .count() \
    .where('count &gt; 1') \
    .limit(5).toPandas()
   print(dublicates)

   # print schema
   print(&quot;PRINT SCHEMA&quot;)
   print(df.printSchema())

quick_overview(spark_weather_df)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">FIRST RECORDS
        date2  avg_temp  precip  snow  wind_dir  wind_speed  wind_power  \
0  2020-01-01      -1.8     0.0   0.0      63.0         4.2        14.8   
1  2020-01-01      -1.8     0.0   0.0      63.0         4.2        14.8   

   air_pressure sunny_hours        date  max_temp  min_temp  
0        1034.6        None  2020-01-01      -0.9      -2.9  
1        1034.6        None  2020-01-01      -0.9      -2.9  
COUNT NULL VALUES
   avg_temp  precip  snow  wind_dir  wind_speed  wind_power  air_pressure  \
0        21       0   143     14577       14565       14565         14565   

   max_temp  min_temp  
0         9        15  
         date  count
0  1986-01-01      4
1  2011-01-01      4
2  2005-01-01      4
3  1985-01-01      4
4  2006-01-01      4
PRINT SCHEMA
root
 |-- date2: string (nullable = true)
 |-- avg_temp: double (nullable = true)
 |-- precip: double (nullable = true)
...
 |-- max_temp: double (nullable = true)
 |-- min_temp: double (nullable = true)

None</pre></div>



<p>As we can see, our data has duplicates and missing values. But don&#8217;t worry; we will handle these data quality issues in the next step.</p>



<h3 class="wp-block-heading">Step #5 Clean the Data using Chained Operations</h3>



<p>Spark uses a design pattern that allows us to chain multiple operations. In the next step, we use this feature when we clean and filter the data. On execution, Spark will decide the order in which it performs the transformations as part of an optimized execution plan.</p>



<p>We begin cleaning the data by replacing Null values. Afterward, we will reduce the data size by selecting specific columns in a separate section.</p>



<h4 class="wp-block-heading">5.1 Replacing NA Values with Means</h4>



<p>As we saw in the previous step, there are several Null values in the temperature and wind columns. Having complete datasets without gaps is preferable when performing historical data analytics. Therefore, we will replace the missing values with the respective column means. Running the code below, we first calculate the mean values. Subsequently, we use these mean values to replace the Null values. Thanks to Spark&#8217;s method-chaining functionality, we can perform the replace action for several columns in a single chained expression. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># calculate mean values
avg = spark_weather_df.filter(spark_weather_df.avg_temp.isNotNull())\
    .select(mean(col('min_temp')).alias('mean_min'), 
            mean(col('max_temp')).alias('mean_max'), 
            mean(col('wind_speed')).alias('mean_wind')).collect()

mean_min = avg[0]['mean_min']
mean_max = avg[0]['mean_max']
mean_wind = avg[0]['mean_wind']

# replace na values with mean values
spark_weather_df = spark_weather_df \
    .na.fill(value=mean_min, subset=[&quot;min_temp&quot;]) \
    .na.fill(value=mean_max, subset=[&quot;max_temp&quot;]) \
    .na.fill(value=mean_wind, subset=[&quot;wind_speed&quot;]) \
    .na.fill(value=0, subset=[&quot;snow&quot;]) 
    
</pre></div>
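


<p>As a side note, the same mean imputation can be expressed with the Imputer transformer from pyspark.ml. A sketch under the same column names, shown for comparison only and not used further in this tutorial:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from pyspark.ml.feature import Imputer

# Imputer computes the column means itself and fills Null/NaN values in one step.
imputer = Imputer(
    strategy=&quot;mean&quot;,
    inputCols=[&quot;min_temp&quot;, &quot;max_temp&quot;, &quot;wind_speed&quot;],
    outputCols=[&quot;min_temp&quot;, &quot;max_temp&quot;, &quot;wind_speed&quot;],
)
imputed_df = imputer.fit(spark_weather_df).transform(spark_weather_df)</pre></div>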



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">Empty DataFrame Columns: [date, count] Index: []</pre></div>



<h4 class="wp-block-heading">5.2 Removing Duplicates</h4>



<p>We still need to remove duplicate values now that we have a complete dataset. For this, we will use the following DataFrame methods:</p>



<ul class="wp-block-list">
<li><strong>orderBy</strong>: To sort the data.</li>



<li><strong>withColumn</strong>: Select columns and convert data types (here, to the Date type). An overview of the data types supported by PySpark is available in the Spark documentation.</li>



<li><strong>drop</strong>: Used to eliminate individual columns from a dataset.</li>



<li><strong>dropDuplicates</strong>: Used to eliminate duplicate records from the dataset (see the sketch after this list).</li>
</ul>
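


<p>As a brief aside, dropDuplicates compares entire rows by default, but it also accepts a subset of columns. A hypothetical sketch that keeps only one record per date:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Restrict duplicate detection to the date column (hypothetical usage).
deduped_df = spark_weather_df.dropDuplicates([&quot;date&quot;])</pre></div>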



<p>Again, we will use PySpark&#8217;s method chaining feature and invoke all transformations in a single chained expression.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># remove dublicates and drop column date2, convert date to datatype &quot;date&quot;, Sort the Data by Date
# drop date2 column, convert date column from string to date type, order by date
spark_cleaned_df = spark_weather_df.dropDuplicates()\
    .drop(col(&quot;date2&quot;))\ # only for demonstration, actually not necessary because we are using select at the end
    .withColumn(&quot;date&quot;, to_date(col(&quot;date&quot;),&quot;yyyy-MM-dd&quot;)) \
    .orderBy(col(&quot;date&quot;)) \
    .select(col(&quot;date&quot;),col(&quot;avg_temp&quot;),col(&quot;min_temp&quot;),col(&quot;max_temp&quot;),col(&quot;wind_speed&quot;),col(&quot;snow&quot;),col(&quot;precip&quot;)) # select columns

quick_overview(spark_cleaned_df)</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">FIRST RECORDS
         date  avg_temp  min_temp  max_temp  wind_speed  snow  precip
0  1979-01-01      -5.3     -12.2      -2.5    7.326082   0.0     1.0
1  1979-01-02     -10.0     -12.2      -4.2    7.326082  10.0     1.7
COUNT NULL VALUES
   avg_temp  min_temp  max_temp  wind_speed  snow  precip
0         0         0         0           0     0       0</pre></div>



<h3 class="wp-block-heading">Step #6 Transform the Data using User Defined Functions (UDFs)</h3>



<p>In addition to the standard methods, PySpark offers the possibility to execute custom Python code in the Spark cluster. Such functions are called User Defined Functions (UDFs). When we use UDFs, we should be aware that we won&#8217;t achieve the same performance level as with standard PySpark functions, because the data has to be serialized and passed between the JVM and the Python interpreter.</p>



<p>In the following, we want to create several buckets representing temperature classes. We can use UDFs to create these buckets; more generally, sometimes a task can only be achieved by using UDFs. By running the code below, we first create a normal Python function called &#8220;binner&#8221; and then hand this function over to a UDF called udf_binner_temp. </p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># More Infos on User Defined Functions: https://sparkbyexamples.com/pyspark/pyspark-udf-user-defined-function/
# create bucket column for temperature
def binner(min_temp, max_temp):
        if (min_temp is None) or (max_temp is None):
            return &quot;unknown&quot;
        else:
            if min_temp &lt; -10:
                return &quot;freezing cold&quot;
            elif min_temp &lt; -5:
                return &quot;very cold&quot;
            elif min_temp &lt; 0:
                return &quot;cold&quot;
            elif max_temp &lt; 10:
                return &quot;normal&quot;
            elif max_temp &lt; 20:
                return &quot;warm&quot;
            elif max_temp &lt; 30:
                return &quot;hot&quot;
            elif max_temp &gt;= 30:
                return &quot;very hot&quot;
        return &quot;normal&quot;


udf_binner_temp = udf(binner, StringType() )
spark_cleaned_df = spark_cleaned_df.withColumn(&quot;temp_buckets&quot;, udf_binner_temp(col(&quot;min_temp&quot;), col(&quot;max_temp&quot;)))
spark_cleaned_df.limit(10).toPandas()</pre></div>
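


<p>For a bucketing task like this one, the same logic can also be expressed with PySpark&#8217;s built-in when/otherwise expressions, which avoid the Python serialization overhead of a UDF. A sketch under the same column names, shown for comparison only:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from pyspark.sql.functions import when, col

# Equivalent bucketing without a UDF: the whole expression runs inside the JVM.
temp_bucket_expr = (
    when(col(&quot;min_temp&quot;).isNull() | col(&quot;max_temp&quot;).isNull(), &quot;unknown&quot;)
    .when(col(&quot;min_temp&quot;) &lt; -10, &quot;freezing cold&quot;)
    .when(col(&quot;min_temp&quot;) &lt; -5, &quot;very cold&quot;)
    .when(col(&quot;min_temp&quot;) &lt; 0, &quot;cold&quot;)
    .when(col(&quot;max_temp&quot;) &lt; 10, &quot;normal&quot;)
    .when(col(&quot;max_temp&quot;) &lt; 20, &quot;warm&quot;)
    .when(col(&quot;max_temp&quot;) &lt; 30, &quot;hot&quot;)
    .otherwise(&quot;very hot&quot;)
)
df_buckets_builtin = spark_cleaned_df.withColumn(&quot;temp_buckets&quot;, temp_bucket_expr)</pre></div>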



<p>Returning to the UDF approach, we create buckets in the same way to cover different levels of precipitation.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># create new columns for bucket percipitation and for month and year
udf_binner_precip = udf(lambda x: &quot;very rainy&quot; if x &gt; 50 else (&quot;rainy&quot; if x &gt; 0 else &quot;dry&quot;), StringType())
spark_cleaned_df = spark_cleaned_df \
    .withColumn(&quot;precip_buckets&quot;, udf_binner_precip(&quot;precip&quot;)) \
    .withColumn(&quot;month&quot;, month(spark_cleaned_df.date)) \
    .withColumn(&quot;year&quot;, year(spark_cleaned_df.date)) 
spark_cleaned_df.limit(5).toPandas()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	date		avg_temp	min_temp	max_temp	wind_speed	snow		precip		temp_buckets
0	1979-01-01	-5.3		-12.2		-2.5		7.326082	0.0			1.0			freezing cold
1	1979-01-02	-10.0		-12.2		-4.2		7.326082	10.0		1.7			freezing cold
2	1979-01-03	-5.8		-7.9		-3.9		7.326082	110.0		0.0			very cold
3	1979-01-04	-8.4		-10.5		-4.4		7.326082	100.0		0.0			freezing cold
4	1979-01-05	-10.0		-11.2		-7.5		7.326082	70.0		0.0			freezing cold
5	1979-01-06	-6.4		-10.1		-2.8		7.326082	50.0		0.0			freezing cold
6	1979-01-07	-3.8		-5.1		-2.9		7.326082	40.0		0.0			very cold
7	1979-01-08	-2.4		-4.5		1.4			7.326082	40.0		5.1			cold
8	1979-01-09	1.4			-0.4		3.8			7.326082	20.0		3.2			cold
9	1979-01-10	0.4			-1.4		2.8			7.326082	10.0		5.4			cold</pre></div>



<p>These steps complete the data preparation. Next, we will focus on analyzing the data.</p>



<h3 class="wp-block-heading">Step #7 Exemplary Data Analysis using Seaborn</h3>



<p>We can analyze the dataset now that we have a clean and consistent set of historical weather data. To aid our analysis, we will visualize the data. Since PySpark does not yet provide its own plotting functions, we will use the Python library Seaborn for visualization.</p>



<p>PySpark provides two different ways to analyze data. The first is standard PySpark methods executed on a DataFrame (sums, counts, averages, groupBy). The second is PySpark SQL, which allows us to query the data using basic SQL syntax. </p>
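


<p>As a quick illustration of the SQL route (the view name &#8220;weather&#8221; is our own choice for this sketch):</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Register the cleaned DataFrame as a temporary view and query it with SQL.
spark_cleaned_df.createOrReplaceTempView(&quot;weather&quot;)

spark.sql(&quot;&quot;&quot;
    SELECT year, ROUND(AVG(max_temp), 1) AS mean_max_temp
    FROM weather
    GROUP BY year
    ORDER BY year
&quot;&quot;&quot;).show(5)</pre></div>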



<p>The PySpark transformation functions and PySpark SQL are optimized for distributed, parallelized computations on large data sets. In the following, we will explore both ways. This section shows how we can now use the data to create different types of plots:</p>



<ul class="wp-block-list">
<li>Temperature scatterplot with colored dots to highlight historical developments</li>



<li>Heatmap on mean max temperature</li>



<li>Scatterplots that illustrate three-dimensional relationships</li>
</ul>



<h4 class="wp-block-heading">7.1 Temperature Scatterplot with Colored Dots</h4>



<p>In this section, we create a scatterplot colored by temperature buckets.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># set the dimensions for the scatterplot
fig, ax = plt.subplots(figsize=(28,8))
sns.scatterplot(hue=&quot;temp_buckets&quot;, y=&quot;avg_temp&quot;, x=&quot;date&quot;, data=spark_cleaned_df.toPandas(), palette=&quot;Spectral_r&quot;)

# plot formatting 
ax.tick_params(axis=&quot;x&quot;, rotation=30, labelsize=10, length=0)

# title formatting
mindate = str(spark_cleaned_df.agg({'date': 'min'}).collect()[0]['min(date)'])
maxdate = str(spark_cleaned_df.agg({'date': 'max'}).collect()[0]['max(date)'])
ax.set_title(&quot;average temperature in Zurich: &quot; + mindate + &quot; - &quot; + maxdate)
plt.xlabel(&quot;Year&quot;)
plt.show()</pre></div>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	year  	 month   mean(avg_temp)  mean(max_temp)  mean(min_temp)
0  1979      5       19.180000       26.940000       12.900000
1  1979      6       19.200000       25.842857       14.385714
2  1979      7       20.828571       27.071429       15.357143
3  1979      8       20.200000       27.000000       14.216667
4  1979      9       19.000000       25.400000       13.700000</pre></div>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="319" data-attachment-id="11743" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/image-2-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/12/image-2.png" data-orig-size="1619,504" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/12/image-2.png" src="https://www.relataly.com/wp-content/uploads/2022/12/image-2-1024x319.png" alt="" class="wp-image-11743" srcset="https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 768w, https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 1536w, https://www.relataly.com/wp-content/uploads/2022/12/image-2.png 1619w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>The outliers at the bottom and top are interesting. While the upward temperature peaks have remained relatively stable, the outliers toward low temperatures have decreased over the past decades.</p>



<h4 class="wp-block-heading">7.2 Heatmap on Mean Max Temperature</h4>



<p>Next, we create a heatmap illustrating how the maximum temperature develops over years and months. We first use PySpark to compute the mean of the maximum temperature per year and month. Then we use Seaborn to create a heatmap.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">spark_cleaned_df_agg = spark_cleaned_df.select(col(&quot;year&quot;),col(&quot;month&quot;),col(&quot;max_temp&quot;)) \
    .filter(spark_cleaned_df.year &lt; 2021)\
    .groupby(col(&quot;year&quot;),col(&quot;month&quot;))\
    .agg(mean(&quot;max_temp&quot;).alias(&quot;mean_max_temp&quot;)) \
    .orderBy(col(&quot;year&quot;)).toPandas()

plt.figure(figsize=(24,6))
avg_temp_df = spark_cleaned_df_agg.pivot(&quot;month&quot;, &quot;year&quot;, &quot;mean_max_temp&quot;)
sns.heatmap(avg_temp_df, cmap=&quot;Spectral_r&quot;, linewidths=.5)</pre></div>



<figure class="wp-block-image size-large is-resized is-style-default"><img decoding="async" data-attachment-id="6145" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/output-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/output-3.png" data-orig-size="1219,371" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/output-3.png" src="https://www.relataly.com/wp-content/uploads/2022/03/output-3-1024x312.png" alt="PySpark big data analytics tutorial - weather heatmap" class="wp-image-6145" width="933" height="284" srcset="https://www.relataly.com/wp-content/uploads/2022/03/output-3.png 1024w, https://www.relataly.com/wp-content/uploads/2022/03/output-3.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/output-3.png 768w, https://www.relataly.com/wp-content/uploads/2022/03/output-3.png 1219w" sizes="(max-width: 933px) 100vw, 933px" /></figure>



<h4 class="wp-block-heading">7.3 Scatterplots</h4>



<p>Next, the goal is to understand the relationship between temperature and weather effects such as wind or snow. In the following, we will create several scatterplots that display the relationships between different variables. Each plot shows the daily min temperature (x-axis) against the max temperature (y-axis). In addition, we color the dots to visualize the relationship with a third variable. The code below creates three such scatterplots, which display the relationships between the following variables:</p>



<ul class="wp-block-list">
<li>The first plots should illustrate the relationship between temperature and snowfall</li>



<li>We want to demonstrate how wind speed depends on the daily min and max temperature</li>



<li>We want to create a plot that visualizes how daily min and max temperature change over the month</li>
</ul>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">fig, axes= plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(30, 10))
fig.subplots_adjust(hspace=0.5, wspace=0.2)

palette = sns.color_palette(&quot;ch:start=.2,rot=-.3&quot;, as_cmap=True)
sns.scatterplot(ax = axes[0], hue=&quot;snow&quot;, size=&quot;snow&quot;, y=&quot;max_temp&quot;, x=&quot;min_temp&quot;, data=spark_cleaned_df.toPandas(), alpha=1.0, palette=palette)
axes[0].legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
axes[0].set_title(&quot;min - max temperature seperated by snow&quot;)

sns.scatterplot(ax = axes[1], hue=&quot;wind_speed&quot;, size=&quot;wind_speed&quot;, y=&quot;max_temp&quot;, x=&quot;min_temp&quot;, data=spark_cleaned_df.toPandas(), alpha=1.0, palette='rocket_r')
axes[1].legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
axes[1].set_title(&quot;min - max temperature seperated by wind speed&quot;)

sns.scatterplot(ax = axes[2], hue=&quot;month&quot;, y=&quot;max_temp&quot;, x=&quot;min_temp&quot;, data=spark_cleaned_df.toPandas(), alpha=1.0, palette='Spectral_r', hue_norm=(1,12), legend=&quot;full&quot;)
axes[2].legend(bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.)
axes[2].set_title(&quot;min - max temperature seperated by month&quot;)
</pre></div>



<figure class="wp-block-image size-large is-resized is-style-default"><img decoding="async" data-attachment-id="6101" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/output/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/output.png" data-orig-size="1448,497" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/output.png" src="https://www.relataly.com/wp-content/uploads/2022/03/output-1024x351.png" alt="PySpark big data analytics tutorial - weather dot plots" class="wp-image-6101" width="941" height="324" srcset="https://www.relataly.com/wp-content/uploads/2022/03/output.png 1024w, https://www.relataly.com/wp-content/uploads/2022/03/output.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/output.png 768w, https://www.relataly.com/wp-content/uploads/2022/03/output.png 1448w" sizes="(max-width: 941px) 100vw, 941px" /></figure>



<p>There are various things we can learn from these plots:</p>



<ol class="wp-block-list">
<li>It is not surprising that most of the snowfall happens at low temperatures, and most of it falls around zero degrees Celsius. We also observe some outliers in both the very low and the higher temperature ranges. </li>



<li>The highest wind speeds are located in the middle of the plot, where we can also observe the most significant daily difference between the minimum and the maximum temperatures.</li>



<li>The last chart reflects the different seasons. Through the colored areas, we can distinguish the temperature ranges of the individual months.</li>
</ol>



<h3 class="wp-block-heading" id="h-step-8-pyspark-sql">Step #8 PySpark SQL</h3>



<p>PySpark SQL combines classic SQL syntax with distributed computation, which results in increased query performance. Spark SQL is similar to conventional SQL dialects such as MySQL or Microsoft SQL Server. In addition, Spark provides a dedicated set of functions for everyday data operations such as string transformations, type conversions, etc. The Spark SQL <a href="https://spark.apache.org/docs/latest/sql-ref.html" target="_blank" rel="noreferrer noopener">reference guide</a> offers more information on this topic. </p>



<p>This section aims to extend our analysis by using PySpark SQL. We will display how the daily temperature averages (for minimum, maximum, and mean) have changed. </p>



<p>First, we have to register our dataset as a temporary view. Once we have done that, we can run ad-hoc queries against the view using Spark SQL commands. As you can see, we can use traditional SQL syntax here. The sql function returns the results in the form of a DataFrame. Running the code below performs these steps and visualizes the results in a line plot.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Data Analysis using PySpark.SQL

# register the dataset as a temp view, so we can use sq
spark_cleaned_df.createOrReplaceTempView(&quot;waeather_data_temp_view&quot;)

# see how event numbers have evolved over years
events_over_years_df = spark.sql( \
        'SELECT year, month, mean(avg_temp), mean(max_temp), mean(min_temp) \
        FROM waeather_data_temp_view \
        WHERE max_temp &gt; 25 \
        GROUP BY month, year \
        ORDER BY year, month')
print(events_over_years_df.limit(5).toPandas())

plt.figure(figsize=(16,6))
fig = sns.lineplot(y=&quot;mean(avg_temp)&quot;, x=&quot;year&quot;, data=events_over_years_df.toPandas(), color= &quot;orange&quot;)
fig = sns.lineplot(y=&quot;mean(max_temp)&quot;, x=&quot;year&quot;, data=events_over_years_df.toPandas(), color= &quot;red&quot;)
fig = sns.lineplot(y=&quot;mean(min_temp)&quot;, x=&quot;year&quot;, data=events_over_years_df.toPandas(), color= &quot;blue&quot;)
plt.grid()
plt.show()
</pre></div>
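

<p>The first five rows of the query result look as follows:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:false,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text/plain&quot;,&quot;theme&quot;:&quot;3024-day&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}">	year  	 month   mean(avg_temp)  mean(max_temp)  mean(min_temp)
0  1979      5       19.180000       26.940000       12.900000
1  1979      6       19.200000       25.842857       14.385714
2  1979      7       20.828571       27.071429       15.357143
3  1979      8       20.200000       27.000000       14.216667
4  1979      9       19.000000       25.400000       13.700000</pre></div>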



<figure class="wp-block-image size-full is-resized is-style-default"><img decoding="async" data-attachment-id="6146" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/output-4/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/output-4.png" data-orig-size="951,371" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="output-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/output-4.png" src="https://www.relataly.com/wp-content/uploads/2022/03/output-4.png" alt="PySpark big data analytics tutorial 
- Temperature curves created" class="wp-image-6146" width="995" height="388" srcset="https://www.relataly.com/wp-content/uploads/2022/03/output-4.png 951w, https://www.relataly.com/wp-content/uploads/2022/03/output-4.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/output-4.png 768w" sizes="(max-width: 995px) 100vw, 995px" /></figure>



<p>We can express a similar aggregation with the DataFrame API. The query below computes the yearly means of the average, minimum, and maximum temperature directly on the DataFrame.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}"># Using PySpark standard functions
events_over_years_df = spark_cleaned_df\
    .filter(spark_cleaned_df.year &lt; 2021)\
    .groupby(col(&quot;year&quot;))\
    .agg(mean(&quot;avg_temp&quot;).alias(&quot;mean_avg_temp&quot;), 
         mean(&quot;min_temp&quot;).alias(&quot;mean_min_temp&quot;), 
         mean(&quot;max_temp&quot;).alias(&quot;mean_max_temp&quot;))\
    .orderBy(col(&quot;year&quot;)).toPandas()

plt.figure(figsize=(15,6))</pre></div>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>Congratulations, you have made your first steps with distributed computing! This tutorial has demonstrated PySpark &#8211; the Python interface to Apache Spark, a distributed computing framework for big data processing and analytics. After initializing the Spark session, we prepared a weather dataset and tested several PySpark functions to process and analyze it. Among the primary data operations were:</p>



<ul class="wp-block-list">
<li>Ingestion of CSV files with the read method</li>



<li>Joining PySpark DataFrames</li>



<li>Cleaning the data by treating duplicates and null values</li>



<li>Filtering the data and selecting specific columns</li>



<li>Performing custom computations with UDFs (a small sketch follows below)</li>



<li>Running analytics and creating plots</li>
</ul>
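

<p>For reference, here is a minimal, hedged sketch of such a UDF; the bucketing thresholds are illustrative, not the exact ones used in the tutorial:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# user-defined function that maps a daily average temperature to a label
# (the thresholds below are illustrative)
@udf(returnType=StringType())
def temp_bucket(avg_temp):
    if avg_temp is None:
        return None
    if avg_temp &lt; -3:
        return &quot;very cold&quot;
    if avg_temp &lt; 5:
        return &quot;cold&quot;
    return &quot;warm&quot;

# apply the UDF to the DataFrame prepared in the steps above
labeled_df = spark_cleaned_df.withColumn(&quot;temp_buckets&quot;, temp_bucket(&quot;avg_temp&quot;))</pre></div>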



<p>Although we have used a wide range of functions, we have only scratched the surface of PySpark&#8217;s capabilities. We can do many cool things with PySpark, including real-time data streaming and efficient training of big data machine learning models. We will cover these topics in separate tutorials.</p>



<p>Thanks for reading. I am always happy to receive feedback on my articles. If you have any questions, please let me know in the comments. </p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list"><li><a href="https://amzn.to/3gcImKf" target="_blank" rel="noreferrer noopener">Wenig Damji Das and Lee (2020) Learning Spark: Lightning-fast Data Analytics</a></li><li><a href="https://amzn.to/3exFyHb" target="_blank" rel="noreferrer noopener">Chambers and Zaharu (2018) Spark: The Definitive Guide: Big data processing made simple</a></li><li><a href="https://databricks.com/de/wp-content/uploads/2018/12/nsdi_spark.pdf" target="_blank" rel="noreferrer noopener">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for</a></li><li><a href="https://databricks.com/de/wp-content/uploads/2018/12/nsdi_spark.pdf" target="_blank" rel="noreferrer noopener">In-Memory Cluster Computing.pdf</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>



<p>Related articles:</p>



<figure class="wp-block-embed is-type-wp-embed is-provider-relataly-com wp-block-embed-relataly-com"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="5mNHJ6qyTd"><a href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/">Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture</a></blockquote><iframe loading="lazy" class="wp-embedded-content" sandbox="allow-scripts" security="restricted"  title="&#8220;Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture&#8221; &#8212; relataly.com" src="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/embed/#?secret=jTuO4a19jc#?secret=5mNHJ6qyTd" data-secret="5mNHJ6qyTd" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div></figure>
<p>The post <a href="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/">Leveraging Distributed Computing for Weather Analytics with PySpark</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2739</post-id>	</item>
		<item>
		<title>Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture</title>
		<link>https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/</link>
					<comments>https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/#respond</comments>
		
		<dc:creator><![CDATA[Florian Follonier]]></dc:creator>
		<pubDate>Tue, 22 Mar 2022 22:56:52 +0000</pubDate>
				<category><![CDATA[PySpark]]></category>
		<category><![CDATA[Advanced Tutorials]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[BigData Analytics]]></category>
		<category><![CDATA[Distributed Computing]]></category>
		<guid isPermaLink="false">https://www.relataly.com/?p=6324</guid>

					<description><![CDATA[<p>Apache Spark is an absolute powerhouse when it comes to open-source Big Data processing and analytics. It&#8217;s used all over the place for everything from data processing to machine learning to real-time stream processing. Thanks to its distributed architecture, it can parallelize workloads like nobody&#8217;s business, making it a lean, mean data processing machine when ... <a title="Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture" class="read-more" href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/" aria-label="Read more about Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture">Read more</a></p>
<p>The post <a href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/">Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Apache Spark is an absolute powerhouse when it comes to open-source Big Data processing and analytics. It&#8217;s used all over the place for everything from data processing to machine learning to real-time stream processing. Thanks to its distributed architecture, it can parallelize workloads like nobody&#8217;s business, making it a lean, mean data processing machine when it comes to tackling those truly massive data sets.</p>



<p>So, what&#8217;s the deal with Spark? In this article, we&#8217;re going to give you a rundown of its key concepts and architecture. First things first, we&#8217;ll provide a high-level overview of Spark, including its standout features. We&#8217;ll even throw in a comparison to Apache Hadoop, another popular open-source Big Data processing engine, just for good measure.</p>



<p>From there, we&#8217;ll dive into the nitty-gritty of Spark&#8217;s architecture. You&#8217;ll learn all about how it divides up tasks and data among a cluster of machines, all optimized for top-notch performance when dealing with massive workloads. This is the stuff you need to know to get the most out of Spark, and we&#8217;re going to deliver it straight to you in plain language that anyone can understand.</p>



<p>By the end of this article, you&#8217;ll be well-equipped to get started with Apache Spark, armed with a solid understanding of its core concepts and architecture. You&#8217;ll know exactly how it works and how to use it to process and analyze those gargantuan data sets that would make other Big Data processing engines run for the hills. So let&#8217;s get started and unleash the power of Apache Spark!</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">What Makes Apache Spark so Successful? </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Spark has been able to catch a lot of attention in recent years. Due to growing data volumes, more companies are using Apache Spark to handle their steadily increasing data. Companies that use Apache Spark include Amazon, Netflix, Disney, and even NASA. NASA uses Spark to process the enormous data streams that come from its Earth-monitoring instruments and space observatories around the globe (more on the NASA <a href="https://www.nasa.gov/directorates/heo/scan/services/networks/deep_space_network/about" target="_blank" rel="noreferrer noopener">webpage</a>). Pretty cool, right?</p>



<p>There are various reasons for Spark&#8217;s success. One reason is scalability. Spark has been consistently optimized for parallel computation in scalable distributed systems. Its architecture scales with the workload, thus allowing it to handle different workloads efficiently. This is important as companies want to scale their usage of resources in the cloud depending on use case-specific needs.</p>



<p>Another reason is Spark&#8217;s versatility of use. Spark supports diverse workloads, including batch processing and streaming, machine learning, and SQL analytics. In addition, Spark supports various data types and storage formats, further increasing its versatility and use in different domains. The unique combination of several features and capabilities makes Apache Spark an exciting piece of technology and creates a strong demand for Spark knowledge in the market.</p>



<p>Also: <a href="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/" target="_blank" rel="noreferrer noopener">Leveraging Distributed Computing for Weather Analytics with PySpark</a> </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%">
<figure class="wp-block-image size-full is-resized is-style-default"><img decoding="async" data-attachment-id="5939" data-permalink="https://www.relataly.com/pyspark-distributed-computing-tutorial-analyzing-zurich-weather-data-with-python/2739/image-123/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2021/11/image.png" data-orig-size="300,168" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2021/11/image.png" src="https://www.relataly.com/wp-content/uploads/2021/11/image.png" alt="Apache Spark is a powerful library for processing big data in a distributed fashion" class="wp-image-5939" width="252" height="141"/><figcaption class="wp-element-caption">Apache Spark &#8211; one of the best open-source engines for Big Data processing and analysis</figcaption></figure>
</div>
</div>



<h2 class="wp-block-heading">Spark Features</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>Spark was initially launched in 2009 by Matei Zaharia at UC Berkeley&#8217;s AMPLab. It was later released as open source under a BSD license in 2010. Since 2013, Spark has been part of the Apache Software Foundation, which currently (2022) lists it as a top-level project. Key features of the Spark ecosystem are:</p>



<ul class="wp-block-list">
<li><strong>In-Memory Processing</strong>: Spark is designed for in-memory computations to avoid writing data to disk and achieve high speed.</li>



<li><strong>Open Source</strong>: The development and promotion of Spark are driven by a growing community of practitioners and researchers.</li>



<li><strong>Reusability:</strong> Spark promotes code reuse through standard functions and method chaining. </li>



<li><strong>Fault Tolerance:</strong> Spark comes with built-in fault tolerance, making it resilient against outages of cluster nodes. </li>



<li><strong>Real-time Streaming:</strong> Spark provides enhanced capabilities for real-time streaming of big data.</li>



<li><strong>Compatibility:</strong> Spark integrates well with existing big-data technologies such as Hadoop and offers support for various programming languages, incl. R, Python, Scala, and Java. </li>



<li><strong>Built-in Optimization:</strong> Spark has a built-in optimization engine. It is based on the lazy evaluation paradigm and uses directed acyclic graphs (DAGs) to create execution plans that optimize transformations.</li>



<li><strong>Analytics-ready:</strong> Spark provides powerful libraries for querying structured data inside Spark, using either SQL or the DataFrame API. In addition, Spark comes with MLlib, an integrated library for machine learning that can execute large ML workloads in memory. </li>
</ul>



<p>Spark combines these features and provides a one-stop-shop solution, thus making it extremely powerful and versatile.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>






<h2 class="wp-block-heading" id="h-local-vs-distributed-computing">Local vs. Distributed Computing</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>To understand how Spark works, it&#8217;s helpful to compare a local system to a distributed computing system. While a local system can handle smaller workloads, it has several limitations when it comes to Big Data processing. </p>



<p>One limitation is the number of cores. Modern CPUs typically have at least two processing cores, but even a powerful single machine has only on the order of a few dozen cores. The number of cores in a single machine therefore does not scale with the data, which is a problem when working with large data sets.</p>



<p>Another limitation is the amount of RAM that a single computer can handle. While it&#8217;s theoretically possible to increase RAM up to a terabyte (TB), this is not cost-efficient. Additionally, running extensive computations on a single machine is risky, as a single error can require starting the entire job over again.</p>



<p>To overcome these limitations, it makes sense to move toward a distributed architecture. In a distributed system, a network of machines can work on several tasks in parallel. This is where Spark shines, as it is designed to run best on a computing cluster.</p>



<p>By leveraging a distributed architecture, Spark can parallelize workloads and optimize performance to handle massive data sets. With a network of machines working together, Spark can tackle complex computations in a robust and efficient way. In summary, the move to a distributed architecture is critical for overcoming the limitations of a local system and unlocking the full potential of Spark.</p>



<figure class="wp-block-image size-large"><img decoding="async" width="1024" height="581" data-attachment-id="6725" data-permalink="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/image-4-14/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-4.png" data-orig-size="1494,847" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-4" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-4.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-4-1024x581.png" alt="Local System vs Distributed System, Spark Tutorial" class="wp-image-6725" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-4.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-4.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-4.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-4.png 1494w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">A local system, in comparison with a distributed system</figcaption></figure>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading">Cluster Computations</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>When it comes to Big Data, you want to parallelize as much as possible, as this can significantly reduce the time for processing workloads. If we speak about processing Terrabytes or Petabytes of data, we need to distribute the workload among a whole cluster of computers with hundreds or even thousands of cores. Here we run into another limitation: parallelization requires us to distribute tasks among cluster nodes, split the data, and store smaller chunks of data in each node. </p>



<p>Distributing the data across a cluster means transferring the data to the machines and either writing it to the hard disk or storing it in the Random Access Memory (RAM). Both choices have their pros and cons. While disk space is typically cheap, write operations are generally slow and thus expensive in terms of performance. Keeping the data in memory (RAM) is much faster. However, RAM is typically expensive. Compared to disk, there is less RAM space available to store the data, which requires breaking the data up into a larger number of smaller partitions.</p>
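

<p>In PySpark, this trade-off surfaces through so-called storage levels. The following is a hedged sketch (the dataset is a stand-in), persisting a DataFrame in memory and spilling to disk only when the cache is full:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000_000)  # stand-in for a large dataset

# keep partitions in RAM and spill to disk only when memory runs out
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # an action materializes the cached partitions</pre></div>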



<h2 class="wp-block-heading">Spark vs. Hadoop</h2>



<p>Cluster computing frameworks existed long before Apache Spark. For example, Apache Hadoop was developed long before Apache Spark and has been used by many enterprises to process large workloads.</p>



<h3 class="wp-block-heading">A Matter of Persistence</h3>



<p>Hadoop distributes the data across the nodes of a computing cluster. It then runs local calculations on the cluster nodes and collects the results. For this purpose, Hadoop relies on MapReduce, which persists the data to a distributed file system. An example is the <a href="https://databricks.com/de/glossary/hadoop-distributed-file-system-hdfs" target="_blank" rel="noreferrer noopener">Hadoop Distributed File System (HDFS)</a>. The advantage is that users don&#8217;t have to worry about task distributions or fault tolerance. Hadoop handles all of this under the hood. </p>



<p>The downside of MapReduce is that reads and writes take place on the local disk storage, which is typically slow. However, modern analytics use cases often rely on the swift processing of large data volumes. Or, in the case of machine learning and ad-hoc analytics, they require frequent iterative changes to the data. For these use cases, Spark provides a faster solution. Apache Spark avoids the costly persistence of data to local disk storage by keeping it in the RAM of the worker nodes. Spark will only spill data to the disk if the memory is full. Keeping data in memory can improve the performance of large workloads significantly. The speed advantage can be as much as 100x.</p>



<h3 class="wp-block-heading">When to Use Spark?</h3>



<p>Spark&#8217;s in-memory approach offers a significant advantage if speed is crucial and the data fits into RAM. Especially when use cases require real-time analytics, Spark offers superior performance over Hadoop MapReduce. Such cases are, for example:</p>



<ul class="wp-block-list">
<li><strong>Credit card fraud detection</strong>: Analysis of large volumes of transactions in real-time.</li>



<li><strong>Predictive Maintenance: </strong>Prediction of sensor data anomalies and potential machine failures with real-time analysis of IoT sensor data from machines and production equipment.</li>
</ul>



<p>However, Spark is not always the best solution. For example, Hadoop may still be a great choice when processing speed is not critical. Sometimes the data may be too big for in-memory storage, even on a large computing cluster. In such cases, using Hadoop on a cluster optimized for this purpose can be more cost-effective. So, there are also cases where Hadoop offers advantages over Spark.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"></div>
</div>



<h2 class="wp-block-heading" id="h-spark-architecture">Spark Architecture</h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The secret to Spark&#8217;s performance is massive parallelization. Parallelization is achieved by decomposing a complex application into smaller jobs executed in parallel. Understanding this process requires looking at the distributed cluster upon which Spark will run its jobs. </p>



<p>A Spark cluster is a collection of virtual machines that allows us to run different workloads, including ETL and machine learning. Each machine typically has its own RAM and multiple cores, allowing the system to speed up computations massively. Of course, cluster sizes may vary, but the general architecture remains the same.</p>



<p>A Spark computing cluster has a driver node and several worker nodes. The driver node contains a driver program that sends tasks via a cluster manager to the worker nodes. Each worker node has an executor and holds only a fraction of the total data in its cache. </p>



<p>After the tasks complete, the worker nodes send their results back to the driver program, which aggregates them. This way, Spark avoids storing the data in local disk storage and achieves swift execution. </p>



<p>Another advantage of this architecture is easy scaling. In a cluster, you can add machines to the network to increase computational power or remove them again when they are not needed.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-large is-resized"><img decoding="async" data-attachment-id="6366" data-permalink="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/image-15-11/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/03/image-15.png" data-orig-size="1445,974" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-15" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/03/image-15.png" src="https://www.relataly.com/wp-content/uploads/2022/03/image-15-1024x690.png" alt="Interplay between the different components of a Spark computing cluster, Spark Tutorial" class="wp-image-6366" width="417" height="281" srcset="https://www.relataly.com/wp-content/uploads/2022/03/image-15.png 1024w, https://www.relataly.com/wp-content/uploads/2022/03/image-15.png 300w, https://www.relataly.com/wp-content/uploads/2022/03/image-15.png 768w, https://www.relataly.com/wp-content/uploads/2022/03/image-15.png 1445w" sizes="(max-width: 417px) 100vw, 417px" /><figcaption class="wp-element-caption">The interplay between the different components of a Spark computing cluster<br>Source: based on https://spark.apache.org/docs/latest/cluster-overview.html</figcaption></figure>
</div></div>
</div>
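

<p>As a small illustration of this interplay (a sketch that assumes a local test cluster with four worker threads rather than a real multi-machine cluster), the driver program below creates the session, distributes a computation across the available cores, and collects the aggregated result:</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from pyspark.sql import SparkSession

# the driver program creates the Spark session (and context)
spark = SparkSession.builder \
    .master(&quot;local[4]&quot;) \
    .appName(&quot;cluster-demo&quot;) \
    .getOrCreate()

# the data is split into partitions that the executors process in parallel
df = spark.range(0, 10_000_000, numPartitions=8)

# the action triggers the distributed computation; the driver aggregates the results
total = df.selectExpr(&quot;sum(id)&quot;).first()[0]
print(total)</pre></div>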



<h3 class="wp-block-heading">Spark Cluster Components</h3>



<p>Clusters consist of a single driver node, multiple worker nodes, and several other key components:</p>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading">Driver Node</h4>



<p>The driver node has the role of an orchestrator in the computing cluster. It runs a Java process that decomposes complex applications (transformations and actions) into the smallest units of work, called tasks. It distributes them to the worker nodes for parallel execution as part of an execution plan (more on this topic later). </p>



<p>The driver program also launches the workers, which read the input data from a distributed file system, decompose it into smaller partitions, and persist them in memory.</p>



<p>Spark has been designed to optimize task planning and execution automatically. In addition, the driver creates the Spark context (and Spark session), which provides access to the cluster&#8217;s information, functions, and configurations.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading">Worker Node</h4>



<p>Worker nodes are the entities that get the actual work done. Each worker hosts an executor on which the individual tasks run. Workers have a fixed number of executors that contain free slots for several tasks. </p>



<p>Workers also have their dedicated cache memory to store the data during processing. However, when the data is too big to store in memory, the executor will spill it onto the disk. The number of slots and the RAM available to store the data depend on the cluster configuration and its underlying hardware setup.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<h4 class="wp-block-heading">Cluster Manager</h4>



<p>A cluster manager is the core process in Spark responsible for launching executors and managing the optimal distribution of tasks and data across a distributed computation cluster. </p>



<p>Spark supports different disk storage formats, including HDFS and Cassandra. The storage format is essential because it describes how Spark distributes the data across a network. For example, HDFS partitions the data into blocks of 128 MB and replicates each block three times &#8211; in this way creating redundancy to achieve fault tolerance. The fact that Spark can handle different storage formats makes it very flexible.</p>
</div>
</div>



<h4 class="wp-block-heading" id="h-executor">Executor</h4>



<p>Executors are processes launched in coordination with the cluster manager. Once launched, an executor runs as long as the Spark application runs. Executors are responsible for executing tasks and returning the results to the driver. Each executor holds a fraction (a Spark partition) of the data to be processed and persists it in the cache memory of the worker node unless the cache is full. </p>



<h2 class="wp-block-heading">The Resilient Distributed Dataset (RDD) </h2>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:66.66%">
<p>The backbone of Apache Spark is the Resilient Distributed Dataset (RDD). The RDD is a distributed memory abstraction that provides functions for in-memory computations on large computing clusters. </p>



<p>Imagine you want to carry out a transformation on a set of historical temperature data. Running this on Spark will create an RDD. The RDD partitions the data and distributes the partitions to different worker nodes in the cluster.</p>



<p>The RDD has four main features:</p>



<ul class="wp-block-list">
<li><strong>Partitioning</strong>: An RDD is a distributed collection of data elements partitioned across nodes in a cluster. Partitions are sets of records stored on one physical machine in the cluster.</li>



<li><strong>Resilience:</strong> RDDs achieve fault tolerance against failures of individual nodes by tracking the lineage of transformations, which allows Spark to recompute lost partitions. </li>



<li><strong>Interface</strong>: The RDD provides a low-level API that allows performing transformations and actions on the data in parallel.</li>



<li><strong>Immutability:</strong> An RDD is an immutable replication of data, which means that changing the data requires creating a new RDD. </li>
</ul>



<p>Since Spark 2.0, the Spark syntax centers on the DataFrame object. As a result, you will rarely interact with the RDD object directly when writing code. However, the RDD is still how Spark stores data in memory, and it continues to do the heavy lifting under the hood. The sketch below illustrates these properties directly on the RDD API.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow" style="flex-basis:33.33%"><div class="wp-block-image">
<figure class="alignright size-large is-resized"><img decoding="async" data-attachment-id="6722" data-permalink="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/image-3/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-3.png" data-orig-size="1773,1440" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-3" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-3.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-3-1024x832.png" alt="Distribution of Partitioned Data across a Spark Cluster with RDD, Apache Spark Tutorial" class="wp-image-6722" width="413" height="335" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 1536w, https://www.relataly.com/wp-content/uploads/2022/04/image-3.png 1773w" sizes="(max-width: 413px) 100vw, 413px" /><figcaption class="wp-element-caption">Partitioning and distributing data across nodes in the process of creating an RDD</figcaption></figure>
</div></div>
</div>
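

<p>A minimal sketch of these properties (run against a local Spark context; the temperature values are arbitrary):</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# distribute a collection of temperature readings across 4 partitions
temps_rdd = sc.parallelize([-3.8, -2.4, 1.4, 0.4, 19.2, 20.8], numSlices=4)
print(temps_rdd.getNumPartitions())  # 4

# transformations return a new, immutable RDD; the original stays unchanged
fahrenheit_rdd = temps_rdd.map(lambda c: c * 9 / 5 + 32)
print(fahrenheit_rdd.collect())</pre></div>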



<h3 class="wp-block-heading">Performance Optimization</h3>



<p>Processing data in distributed systems leads to an increased need for optimization. Spark already takes care of several optimizations, and new Spark versions constantly get better at improving performance. However, there are still cases where it is necessary to fine-tune things manually. </p>



<p>Essential concepts that allow Spark to distribute the computation cluster&#8217;s work efficiently are &#8220;lazy code evaluation&#8221; and &#8220;data shuffling.&#8221; </p>



<h3 class="wp-block-heading">Lazy Code Evaluation</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p>An essential pillar of Spark is the so-called lazy evaluation of transformations. The underlying premise is that Spark does not execute transformations immediately, but only when the user calls an action. </p>



<p>Examples of transformations are join, select, groupBy, or partitionBy; each of them merely describes a new RDD without computing it. Actions, in contrast, trigger the execution of the accumulated transformations, collect results from the worker nodes, and return them to the driver node. Examples of actions are &#8220;first,&#8221; &#8220;max&#8221;, &#8220;count&#8221;, or &#8220;collect&#8221;. For a complete list of transformations and actions, visit the Spark website.</p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">sparkDF.withColumn('split_text', F.explode(F.split(F.lower('Word'), ''))) \
         .filter(&quot;split_text != ''&quot;) \
         .groupBy(&quot;split_text&quot;, &quot;language&quot;)\
         .pivot(&quot;split_text&quot;) \
         .agg(count(&quot;split_text&quot;)
     ).drop(&quot;Word_new&quot;, &quot;Word&quot;, &quot;split_text&quot;).orderBy([&quot;language&quot;], ascending=[0, 0]).na.fill(0)
</pre></div>
</div>
</div>



<p>Spark can look at all transformations from a holistic perspective by lazily evaluating the code and carrying out optimizations. Before executing the transformation, Spark will generate an execution plan that considers all the operations and beneficially arranges them. </p>



<p>It is worth mentioning that Spark supports method chaining, so programmers can group multiple transformations without worrying about their order. The code example on the right illustrates this approach for the Python-specific API of Spark, called PySpark.</p>
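

<p>To make the distinction concrete, here is a hedged sketch (assuming a running Spark session): the two transformations only extend the execution plan, and no work happens until the action is called.</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# transformations: Spark only records them in the execution plan
evens = df.filter(F.col(&quot;id&quot;) % 2 == 0).withColumn(&quot;half&quot;, F.col(&quot;id&quot;) / 2)

# the action triggers the optimized execution of the whole plan
print(evens.count())</pre></div>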



<h3 class="wp-block-heading">Shuffle Operations</h3>



<div class="wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex">
<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow">
<p>Shuffle is a technique for re-distributing data in a computation cluster. After a shuffle operation, the partitions are grouped differently among the cluster nodes. The illustration demonstrates this procedure. </p>



<h4 class="wp-block-heading">An Expensive Operation</h4>



<p>Shuffle typically requires copying data between executors and machines, making it an expensive operation: it requires data transfer between worker nodes, resulting in disk I/O and network I/O. For this reason, we should avoid shuffle operations if possible. However, several common transformations will invoke shuffle, for example join, repartition, and groupBy operations. These operations cannot work without shuffling the data because they often require the complete dataset. </p>
</div>



<div class="wp-block-column is-layout-flow wp-block-column-is-layout-flow"><div class="wp-block-image">
<figure class="alignright size-large is-resized"><img decoding="async" data-attachment-id="6720" data-permalink="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/image-2-2/#main" data-orig-file="https://www.relataly.com/wp-content/uploads/2022/04/image-2.png" data-orig-size="1092,924" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="image-2" data-image-description="" data-image-caption="" data-large-file="https://www.relataly.com/wp-content/uploads/2022/04/image-2.png" src="https://www.relataly.com/wp-content/uploads/2022/04/image-2-1024x866.png" alt="Spark Shuffle Operation Between Two Worker Nodes, Apache Spark Tutorial" class="wp-image-6720" width="407" height="344" srcset="https://www.relataly.com/wp-content/uploads/2022/04/image-2.png 1024w, https://www.relataly.com/wp-content/uploads/2022/04/image-2.png 300w, https://www.relataly.com/wp-content/uploads/2022/04/image-2.png 768w, https://www.relataly.com/wp-content/uploads/2022/04/image-2.png 1092w" sizes="(max-width: 407px) 100vw, 407px" /><figcaption class="wp-element-caption">Spark shuffle operation between two worker nodes</figcaption></figure>
</div></div>
</div>



<h4 class="wp-block-heading">Avoiding Shuffle</h4>



<p>Spark has several built-in features that help Spark avoid shuffling. However, some of these activities, such as scanning large datasets, will also cost performance. If the user already knows the scope and structure of the data, they should provide this information directly to Spark. Spark offers various parameters in its APIs for this purpose.</p>



<p>Avoiding shuffle operations may require manual optimization. A typical example is working with data sets that differ significantly in size (skewed data). For example, when joining a large and a small dataset, we can avoid shuffle operations by providing each worker node with a full copy of the smaller dataset. For this purpose, Spark has a broadcast function. The broadcast function copies a dataset to all worker nodes in the cluster, reducing the need to exchange partitions between nodes.</p>
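

<p>A hedged sketch of such a broadcast join (the fact table and lookup table below are made up for illustration):</p>



<div class="wp-block-codemirror-blocks-code-block code-block"><pre class="CodeMirror" data-setting="{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:false,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text/x-python&quot;,&quot;theme&quot;:&quot;monokai&quot;,&quot;lineNumbers&quot;:true,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:true,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}">from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# made-up example: a large fact table and a small lookup table
large_df = spark.range(10_000_000).withColumnRenamed(&quot;id&quot;, &quot;station_id&quot;)
small_df = spark.createDataFrame([(0, &quot;Zurich&quot;), (1, &quot;Bern&quot;)], [&quot;station_id&quot;, &quot;city&quot;])

# broadcast() ships the small DataFrame to every worker node,
# so the large DataFrame does not have to be shuffled for the join
joined_df = large_df.join(broadcast(small_df), on=&quot;station_id&quot;, how=&quot;left&quot;)
joined_df.show(3)</pre></div>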



<p>There are many more topics around optimizations. But since it&#8217;s a complex topic, I&#8217;ll cover it in a separate article.</p>



<h2 class="wp-block-heading" id="h-summary">Summary</h2>



<p>This article has introduced Apache Spark, a modern framework for processing and analyzing Big Data on distributed systems. We have covered the basics of Spark&#8217;s architecture and how Spark distributes data and tasks among a computation cluster. Finally, we have looked at how Spark optimizes computations in distributed systems and why it is often necessary to improve things manually. </p>



<p>Now that you have a decent understanding of the essential concepts behind Spark, we can gather hands-on experience. The following tutorial works with PySpark, the Python-specific interface for Apache Spark, to process and analyze weather data. </p>



<p><a href="https://www.relataly.com/analyzing-zurich-weather-data-with-python-and-pyspark/2739/">Analyzing Zurich Weather Data with PySpark</a></p>



<p>Thanks for reading, and if you have any questions, please let me know in the comments. </p>



<h2 class="wp-block-heading" id="h-sources-and-further-reading">Sources and Further Reading</h2>



<ol class="wp-block-list"><li><a href="https://amzn.to/3gcImKf" target="_blank" rel="noreferrer noopener">Wenig Damji Das and Lee (2020) Learning Spark: Lightning-fast Data Analytics</a></li><li><a href="https://amzn.to/3exFyHb" target="_blank" rel="noreferrer noopener">Chambers and Zaharu (2018) Spark: The Definitive Guide: Big data processing made simple</a></li><li><a href="https://databricks.com/de/wp-content/uploads/2018/12/nsdi_spark.pdf" target="_blank" rel="noreferrer noopener">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for</a></li><li><a href="https://databricks.com/de/wp-content/uploads/2018/12/nsdi_spark.pdf" target="_blank" rel="noreferrer noopener">In-Memory Cluster Computing.pdf</a></li><li><a href="https://amzn.to/3MAy8j5" target="_blank" rel="noreferrer noopener">Andriy Burkov (2020) Machine Learning Engineering</a></li><li><a href="https://amzn.to/3EKidwE" target="_blank" rel="noreferrer noopener">David Forsyth (2019) Applied Machine Learning Springer</a></li></ol>



<p class="has-contrast-2-color has-base-3-background-color has-text-color has-background"><em>The links above to Amazon are affiliate links. By buying through these links, you support the Relataly.com blog and help to cover the hosting costs. Using the links does not affect the price.</em></p>
<p>The post <a href="https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/">Getting Started with Big Data Analytics &#8211; Apache Spark Concepts and Architecture</a> appeared first on <a href="https://www.relataly.com">relataly.com</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.relataly.com/introduction-to-bigdata-analytics-with-apache-spark-concepts-and-architecture/6324/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">6324</post-id>	</item>
	</channel>
</rss>
