A Fundamental Guide to Setting Up PySpark for ETL

Due to the massive volumes of data involved in many use cases, Spark is built to handle big data. It is an open-source Apache project.

Spark can use data stored in a variety of formats, including Parquet files.

What is Spark?

Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances.

On top of the Spark core data processing engine, there are libraries for SQL, machine learning, etc. Tasks most frequently associated with Spark include ETL and SQL batch jobs across large data sets.

What Does Spark Do?

It has an extensive set of developer libraries and APIs and supports languages such as Java, Python, R, and Scala. Its flexibility makes it well-suited for a range of use cases; in this blog, we will just talk about data integration. The Spark programming model for working with structured data is exposed to Python through the Spark Python API, which is called PySpark.

Data produced by different application systems across a business needs to be processed for reporting and analysis. Spark is used to reduce the cost and time required for this ETL process.
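As a rough sketch of what such a job can look like in PySpark (the setup itself is covered below), the example reads a raw CSV file, applies a simple transformation, and writes the result as Parquet. The file paths, column names, and SparkSession settings here are assumptions made purely for illustration, not part of any real pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# assumed session and file paths, for illustration only
spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

# Extract: read raw CSV data produced by an application system
orders = spark.read.csv("/data/raw/orders.csv", header=True, inferSchema=True)

# Transform: keep completed orders and derive the order year
completed = (orders
             .filter(F.col("status") == "completed")
             .withColumn("order_year", F.year("order_date")))

# Load: write the result as Parquet for reporting and analysis
completed.write.mode("overwrite").parquet("/data/curated/completed_orders")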

How can we set up the PySpark?

There are heaps of ways to set up PySpark, including with VirtualBox, Databricks, AWS EMR, AWS EC2, Anaconda, etc. In this blog, I will just talk about setting up PySpark with Anaconda.


  1. Download the Anaconda version that matches your operating system and install it
  2. Create a new named environment
  3. Install pyspark through the “Anaconda Prompt” terminal. Just be careful: the Python environment needs to be set to 3.7 or lower, because pyspark (at the time of writing) doesn’t support Python 3.8.
conda create -n yournamedenvironment python=3.7
conda activate yournamedenvironment
conda install pyspark
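Once the environment is created and activated, a quick sanity check is to start Python inside that environment and import pyspark; the snippet below simply prints whichever version conda installed, assuming the steps above succeeded.

# quick sanity check inside the activated environment
import pyspark
print(pyspark.__version__)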

Then we can launch the different IDEs from the Anaconda Navigator home screen:

JupyterLab is highly recommended here.

After we launch JupyterLab, a .ipynb file can be created on localhost.

Spark DataFrame Basics

DataFrames and Spark SQL are the things we need to get familiar with in PySpark. If we have worked with pandas in Python, SQL, R, or Excel before, the DataFrame will feel familiar.

Initiating a SparkSession is the essential first step.

# start a simple Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

After running this in a single cell, we can check whether PySpark has been installed successfully.
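To get a first feel for DataFrames and Spark SQL, here is a minimal sketch that reuses the spark session created above; the column names and sample rows are made up purely for illustration.

# build a tiny DataFrame from in-memory rows (sample data for illustration)
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["name", "age"])

# inspect the schema and contents
df.printSchema()
df.show()

# register the DataFrame as a temporary view and query it with Spark SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()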

If you are interested in or have any problems with PySpark, feel free to contact me.

Or you can connect with me through my LinkedIn.

Author: Jacqui

Data Science|Business Intelligence
