Apache Spark in Python

Apache Spark in Python: Beginner’s Guide

In this article, we will explain Apache Spark and Python in more detail. You may also want to glance at this PySpark Training course, which teaches the skills you'll need to become a professional Python Spark developer. Let's begin by understanding Apache Spark.

What is Apache Spark?

Apache Spark is an open-source framework that has been making headlines since its beginnings in 2009 at UC Berkeley's AMPLab. At its core, it is an engine for distributed processing of big data that can scale at will.

Simply put, as data volumes grow, it becomes increasingly important to handle enormous streams of data while still running other workloads such as machine learning, and Apache Spark can do just that. According to several experts, it may soon become the standard platform for streaming computation.

It's frequently misunderstood as a Hadoop replacement, but it is actually a competitor to Hadoop's MapReduce framework. Rightly so: it is considerably faster, has one of the gentlest learning curves for developers, and is employed by many of the market's top organisations, making it an easy and effective skill to add to a CV.

Distributed processing is among its most important aspects, but it is far from the only one, as the diagram below shows.

Apache Spark Application

As Spark is so popular, you can expect a multitude of vendors to offer Spark services in several ways. The options can be overwhelming at times, and the enthusiasm to learn something new can get lost in trying to find the right one. Understand the various options available, select the one that best suits your needs, and get hands-on experience with it.

1. Go Local

The almighty local setup is the first option. If you're not a fan of online services, this is the one to go with. It's local, so you have complete control over it; however, be aware that it takes time to set up.

 

If you don’t have much time or don’t want the trouble of installations, go ahead and look at options 2 and 3.

 

Ubuntu and VirtualBox are required for local installation.

 

VirtualBox is a program that lets you run a virtual machine on your computer; it's there that you'll install Ubuntu, a Linux-based operating system, and then Spark. You can skip this step if you've already installed Ubuntu.

 

Go to the above link and download the installer for either OS X or Windows, depending on your PC. Double-click the downloaded file and follow the instructions with the default settings. VirtualBox is now set up.

 

Download Ubuntu from the link below, choosing Ubuntu Desktop. After this step, you will have a downloaded .iso file.

 

Get Ubuntu | Download | Ubuntu

 

After that, open the VirtualBox app. At this moment, it's basically empty. To begin, click the New button to create a new virtual machine. Give the machine a good name, select Linux, and proceed to the next step.

VirtualBox virtual machine

Following this, you'll be taken through a number of options for configuring the machine. First, the memory size. You can leave it at the suggested size or, based on your system's specifications, assign a more appropriate amount of RAM.

 

Second, the hard drive. You can leave it at the suggested 8.00 GB and select Create a virtual hard disk now; from the hard disk file type window, select VDI (VirtualBox Disk Image) and click Next.

 

Third, storage. You can select either Dynamically allocated or Fixed size. To improve input/output speeds, Fixed size is suggested; 20 GB is plenty. Click Create.

creating virtual disk

Once you click Create, it will take some time. When it's finished, you'll be led back to the VirtualBox home screen, this time with the newly created machine.

Created Linux os

It's powered off by default, but you can turn it on by double-clicking. You'll be prompted to choose a startup disk during the first boot. This is crucial: it's here that you point to the Ubuntu .iso file you downloaded earlier. Select it and click Start. This installs Ubuntu on your virtual machine. You'll be offered a variety of installation options along the way, which you can alter or leave as is. Either way, you should end up with a functional operating system, a virtual one.

 

The first thing to do is check whether Python is already installed on your virtual machine. Open Ubuntu's terminal application, type python3, and press Enter. This should produce something similar to the output below.


check python version

The Python version may vary; as long as it is 3 or higher, it is fine.

After that, we'll perform a series of installations required for Spark to run on our virtual machine.

 

Jupyter Notebook

One of the simplest ways to work with Python and write great code is to install Jupyter Notebook. Type the following command in the same or a new terminal:

pip3 install jupyter

The Jupyter Notebook system should now be installed. When it finishes, you can test it by typing this command in the console:

jupyter notebook

 

This should launch the Jupyter Notebook interface in a browser (probably your default one), which means our notebook configuration is working.

Java

Now we’ll install Java, which is required for Spark to run. Type the following commands one after the other in a new terminal window:

sudo apt-get update
sudo apt-get install default-jre

The first command updates our apt-get system, while the second installs Java.

Scala

Let’s set up Scala in a similar way:

sudo apt-get install scala

You can run the command below to see if the installation is successful. It will print out the Scala version that was installed.

scala -version

Installation of Py4j

We'll now install Py4J, the Python library that lets Python programs access Java objects; it is how PySpark talks to the JVM, and thus to Java and Scala.

pip3 install py4j

Hadoop and Spark

As we near the end, all that's left is to install Hadoop and Spark. Go to the following URL for a direct download of a Spark release. Be sure to complete this step on your virtual machine so the file is downloaded there.

Downloads | Apache Spark

Open a new terminal and make sure you're in the location the file was downloaded to. You can cd into the appropriate folder and run the following command (note that the filename may differ depending on the Spark version you're using):

sudo tar -zxvf spark-2.1.0-bin-hadoop2.7.tgz

In a word, this unzips the package and creates the necessary folders. The next step is to tell Python where to look for Spark. It may appear to be pure sorcery at first, but stick with it! Run the following on the terminal, pressing Enter after each line. Keep an eye on the path assigned to SPARK_HOME; it should be the folder Spark was unzipped into.

adding environment variables
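The screenshot above contains the exact lines; they typically look something like the sketch below. The SPARK_HOME path and the bundled py4j version are assumptions here and must match your own unzipped folder:

```shell
# Tell Python where Spark lives; adjust the path to your unzipped folder.
export SPARK_HOME="$HOME/spark-2.1.0-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"

# Put Spark's Python bindings (and the py4j zip it ships with) on the Python path.
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH"

# Make PySpark use Python 3.
export PYSPARK_PYTHON=python3
```

Adding these lines to ~/.bashrc makes them permanent, so you don't have to retype them in every new terminal.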

Now open a terminal window and cd into the python folder of the directory Spark was unpacked into (adjust the path if your version or location differs):

cd spark-2.1.0-bin-hadoop2.7/python

Open the Jupyter Notebook once you’ve navigated to the correct directory.

jupyter notebook

You should now have the Jupyter Notebook system running in your browser. Create a new Python notebook, type the command below into an empty cell, and press Ctrl+Enter.

import pyspark

If the import succeeds without errors, Spark is installed and running in all its splendour on our very own virtual machine.

2. Databricks

Databricks is a platform established by the creators of Apache Spark, and it's a wonderful way to use Spark with only a browser. Databricks eliminates the need for time-consuming installations by bringing computational power to your browser. This is the most efficient way to get a quick tour of Spark.

 

Consider Databricks a gourmet dish if the local set-up is a home-cooked meal.

A web browser and a robust internet connection are required.

Though Databricks is geared toward businesses moving to big data and distributed computing, it has a free Community Edition that should serve our needs.

 

To begin, go to the link below and register for the community edition.

 

Try Databricks

 

To sign in for the first time, you'll need to verify your email address. Once you've logged in, you'll be able to work with a pre-configured Python notebook.

 

To get started, click 'Create a Blank Notebook' once you've logged in. You'll be given a Jupyter-style notebook in which you can enter Python code into each cell and execute cells separately.

Creating blank document in spark

You don't have to bother with additional setup because Databricks is built for Spark. In the first cell, type spark and press Ctrl+Enter, or click the small play button on the right side of the cell.

Output of the cell in spark notebook

You will be asked to launch a cluster on the first run. Do so, and you'll see something similar to the image above. You can now add more cells to your Spark workspace and keep exploring, whether you're importing data or implementing machine learning.

 

Additional benefit: Databricks is interesting for all the right reasons, but it's even cooler because behind all that power it comes with a massive bank of datasets to play with.

 

To see what's available, you'll need a magic command to access Databricks' file system (DBFS). Type %fs ls in a new cell and run it; the DBFS paths should be listed.

List of files in notebook

To see all of the datasets available, run ls /databricks-datasets/. You can easily use any of them in your code. For example, take the simple people.json data and load it into your code. This is all you need to do:

Take a look around and experiment with some of the datasets; you won't be disappointed! This might be the simplest approach to Spark.

3. Google Colab

It's a free Jupyter Notebook platform that, like Databricks, runs in the cloud, but the focus is on the free GPU rather than the free notebook system. Yes, Colab provides a free GPU! For deep learning enthusiasts this is a huge deal, but that's a story for another day; right now, we're focused on Spark.

You’ll need a web browser, a fast internet connection, and a Google account to get started.

Google Colab

Go to the above-mentioned link. After logging in with your Google account, you should be prompted to create a 'New Python 3 Notebook'. Create it right away.

After this step we'll be back in familiar territory: an empty cell in a Python notebook. Since Google Colab, unlike Databricks, isn't pre-configured for Spark, we'll have to set it up a little to get started.

 

You'll get an error if you attempt to import pyspark in an empty cell.

No module named pyspark error in google colab

Let's fix it! Google Colab works as a virtual machine and a notebook at the same time. To install Java, type the following command into an empty cell and press Ctrl+Enter:

command to install openjdk
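The screenshot shows the install command; a typical version of it (the exact package and flags in the image are an assumption) installs headless OpenJDK 8 quietly:

```
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
```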

When you start a command with '!', it runs in a shell, indicating that it's not Python code.

Let’s get started with Spark and Hadoop downloads.

download spark and hadoop from server
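Given the tar command that follows, the download shown in the screenshot was presumably a wget against the Apache archive; the URL below is an assumption reconstructed from the 2.3.3 filename:

```
!wget -q https://archive.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
```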

After that, you can check that the file was downloaded by running !ls -l, which executes in a Linux shell.

The file you downloaded is a compressed archive, so let's unpack it:

!tar xf spark-2.3.3-bin-hadoop2.7.tgz

Now all we have to do is set the Java and Spark environment variables as shown below, and we're Sparked.

Adding spark variables in colab
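The variables in the screenshot typically point JAVA_HOME and SPARK_HOME at the paths created by the two previous cells. A sketch, where the /content prefix is Colab's working directory and the exact paths are assumptions:

```python
import os

# Point at the Java and Spark installs from the previous cells (paths assumed).
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.3-bin-hadoop2.7"

# With the findspark helper package (pip3 install findspark), these two lines
# would then make `import pyspark` resolve:
#   import findspark
#   findspark.init()
```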

That’s it for the setup!

Now take a look at this code example

Spark Code example

The code above starts Spark by creating a new SparkSession, then builds a new Spark data frame on the fly using a Python list comprehension, and finally prints out the data frame. Try it in a new cell.

 

That's all there is to it! We hope this setup guide has made the entire process clearer, so you can concentrate on getting started rather than worrying about information overload.

Conclusion:

In this blog, we have covered Apache Spark and discussed the various options for running Spark with Python: a local setup, Databricks, and Google Colab.

 


Author Profile

This post is published by Karna Jyoshna, Post graduate in Marketing, Digital Marketing professional at HKR Trainings.
I aspire to learn new things to grow professionally. My articles focus on the latest programming courses and E-Commerce trends.
