Python Pandas Tutorial : A Complete Guide for Beginners

Python Pandas Tutorial _ A Complete Guide for Beginners

You will see large sets of data everywhere. In fact, it is growing at the fastest rate. How will you analyze a data is a major challenge? It is because data is not always available in a certain format. It can be the structural or unstructured format. Most of the data you will see in the entire field are in UN-structural format. There are various processing require before analyzing the data. For example structuring of data from the unstructured form to structural form. Pandas in Python provides various sets of modules or functions that you will able to process and analyze the data in the fastest way. I have basically tried to cover every general topic helpful for the beginners. In Python Pandas Tutorial you will learn the following things. Before reading the entire post I will recommend taking a look at the Python Pandas Part -1  Tutorial for more understanding.

Python Pandas Tutorial for Beginners

  • Types of Data Structures supported By Pandas Python
  • How to read a CSV file with Pandas?
  • Do operations in Excel Files Using Python Pandas Module?
  • How to plot the excel data Using Matplotlib Module?

Types of Data Structures supported By Pandas Python

There are two very useful data structures supported by Python pandas.

  1. Series Data Structures
  2. Data Frames

Series Data Structures

Series Data Structures are the one-dimensional array. You can store different types of data types. It also allows you to store mix data types. It means you can store float, double, string e.t.c data types in the single list, dictionary or tuple.There is also a series method which allows you to convert any list, dictionary or tuples into series data structures. Below is the syntax of the list, dictionary, and tuple.

Syntax for the list

list = [ value1,value2,value3,...] # values can be float,double,string or integer

Dictionary Syntax

dis = {"string1":value1,"string2":value2,"string3":value3,} # values can be float,double,string or integer

Syntax for the Tuple

Tuple syntax is same as list. The only difference is that tuple is enclosed by brackets whereas the list is enclosed by square brackets.

list = ( value1,value2,value3,...) # values can be float,double,string or integer

How to convert a list, tuple, and dictionary into series?

Conversion of any data structures list, tuple or dictionary can be done by using the series method. It is the submodule of the panda’s python packages.Therefore, first of all, you have to import pandas in all the examples.

import pandas as pd #importing pandas module

Series Conversion

The pd.Series() function has been used for the conversion.

list = [ "Sukesh", 24 , "Data Scientist"] # list

tuple= ("Sahil", "25", "Teaching") #tuple

dis= {'Name':"Abhishek", 'Age':21, 'Profession':"AI"} # dictionary

When you print the output of list and tuples after the conversion to series then you will get the following output with index 0, 1, 2. The index is the integer value that defines the position of the values.

import pandas as pd

list = [ "Sukesh", 24 , "Data Scientist"] # list

tuple= ("Sahil", "25", "Teaching") #tuple

dis= {'Name':"Abhishek", 'Age':21, 'Profession':"AI"} # dictionary

#list series conversion
s1 = pd.Series(list)
print(s1)

#tuple series conversion
s2= pd.Series(tuple)
print(s2)

#dictionary series conversion
s3= pd.Series(dis)
print(s3)

Output

Index

You can also print the values with the label by passing labels as the arguments in the pd.Series(list,labels). Dictionary doesn’t require labels as it is already define in the syntax declaration.

# File name: SeriesConversion.py
# Author: Data Science Learner
# Begin code
import pandas as pd

list = [ "Sukesh", 24 , "Data Scientist"] # list

tuple= ("Sahil", "25", "Teaching") #tuple

# With Labels
labels = ["Name","Age","Profession"] #labels

#list series conversion
s1 = pd.Series(list,labels)
print(s1)

#tuple series conversion
s2= pd.Series(tuple,labels)
print(s2)

Output

With Index

 

Data Frame Data Structures

Series is a one-dimensional array with only one index row index. But the Data Frame data structures is the two-dimensional array. It has two indexes column index and row index.  You can say data frame is the most useful data structures in pandas.

It is a two-dimensional array. Therefore you have to use the dictionary data structures containing two equal lengths list. For example, let’s take the example of Apple company stock market price. Below is the price of Apple for the last five days.

Date Open High Low Close Volume
09-04-2018 169.88 173.09 169.845 170.05 27730207
06-04-2018 170.97 172.48 168.2 168.38 35005290
05-04-2018 172.58 174.2304 172.08 172.8 26933197
04-04-2018 164.88 172.01 164.77 171.61 34605489
03-04-2018 167.64 168.7455 164.88 168.39 30278046
02-04-2018 167.88 168.94 164.47 166.68 37586791

It has five 6 columns and 6 rows. You can create dataframe of this table. You have to create a dictionary of  5 lists of equal length enclosed by curly braces. I have already define the syntax of list and dictionary above you can refer there. Date, Open, High, Low, Close and Volume are the equal length list of dictionary data.

# File name: DataFrame.py
# Author: Data Science Learner
# Begin code
import pandas as pd #importing pandas module

#dictionary
data = {'Date': ['09-04-2018','06-04-2018','05-04-2018','04-04-2018','03-04-2018','02-04-2018'],
 'Open':[169.88,170.97,172.58,164.88,167.64,167.88],
 'High':[173.09,172.48,174.2304,172.01,168.7455,168.94],
 'Low':[169.845,168.2,172.08,164.77,164.88,164.47],
 'Close':[170.05,168.38,172.8,171.61,168.39,166.68,],
 'Volume':[27730207,35005290,26933197,34605489,30278046,37586791]
}

 

Dataframe() is the sub-module of the pandas library for converting the dictionary to the dataframe. Therefore to convert the dictionary ‘data‘, you have to use the statement pd.Dataframe(data). The print() function is use print the ‘data‘ dictionary.

Full Code

# File name: DataFrame.py
# Author: Data Science Learner
# Begin code

import pandas as pd #importing pandas module

#dictionary
data = {'Date': ['09-04-2018','06-04-2018','05-04-2018','04-04-2018','03-04-2018','02-04-2018'],
        'Open':[169.88,170.97,172.58,164.88,167.64,167.88],
        'High':[173.09,172.48,174.2304,172.01,168.7455,168.94],
        'Low':[169.845,168.2,172.08,164.77,164.88,164.47],
        'Close':[170.05,168.38,172.8,171.61,168.39,166.68,],
        'Volume':[27730207,35005290,26933197,34605489,30278046,37586791]
}

df = pd.DataFrame(data) # data frame conversion
print(df) # print data

 

Output

Python Pandas Tutorial Data Frame
Python Pandas Tutorial Data Frame

Other Things you can do With Data Frame

Adding new Column to the Data Frame

You can also add a new column to the data frame with the values after defining the dataframe. You can use df[‘new_column_name] = ‘value’  to add the column_name.

df['new_column_name'] = "Value" # add new column

 

Set Index using the Column

The simple output of the above example will also set the index row as 0,1,2,3,4,5. But you can also use the values of the columns to set the index of the data frame.

For example, I want to set ‘date’ column values as index values. Therefore I will use the function set_index() functions for the index row changes.

df = df.set_index('Date')

 

Accessing the Values of the Data Frame

You can access the values of the data frame using the column index and row index.

Column Index : df[‘column_name]

Row Index : df.ix[: , ‘row_name’]

For example, I want to access the close price of the apple data.Then you have to use the statement df[‘Close’].

print(df['Close']) # print data with close price only

In the same way, if I want to print the volume row only then I will use df.ix[: ,’Volume’].

Full Code

# File name: DataFrame.py
# Author: Data Science Learner
# Begin code
import pandas as pd
data = {'Date': ['09-04-2018','06-04-2018','05-04-2018','04-04-2018','03-04-2018','02-04-2018'],
        'Open':[169.88,170.97,172.58,164.88,167.64,167.88],
        'High':[173.09,172.48,174.2304,172.01,168.7455,168.94],
        'Low':[169.845,168.2,172.08,164.77,164.88,164.47],
        'Close':[170.05,168.38,172.8,171.61,168.39,166.68,],
        'Volume':[27730207,35005290,26933197,34605489,30278046,37586791]
}

df = pd.DataFrame(data) # data frame conversion
df['new_column_name'] = "Value" # add new column
df = df.set_index('Date') # set index row name  using the column

print(df) # print data

print(df['Close']) # print data with close price only
print(df.ix[:,'Volume']) # print all rows with volume  only

 

 

I hope you have easily understood the basic concepts of series and data frames. All the forthcoming examples will be based on Data Frames. In the next section of this Python Pandas Tutorial, you will understand how to read the files using the Pandas module.

How to Read Excel Files Using Python Pandas Module?

There are various ways for reading files. Here I am using the excel file in *.csv format (Comma Separated Values). Before I go to the examples in details. Let’s know the some of the syntax you will use during the example. These are the built-in functions of the pandas useful in reading and doing the operation in the excel sheet.

All the functions are the pandas module, therefore, first of all, you have to import pandas.

import pandas as pd # import the pandas module as pd

Explanation of the various methods

pd.read_csv() : The method read_csv() is used to read the Excel(CSV) Format. You have to pass the excel file path as an argument.

pd.read_csv(‘Path/excel_name’)

head(): This function will display the only first five elements of the dataframe.

tail() : It will print the last five elements of the dataframe.

There are various arguments you can pass inside the above method. Some of the generally used arguments are define below.

index_col = None: It tells that there is no index. It means the first column of the excel is used as a data column.

sheet_name: It is the sheet name inside the excel file. It has various types. Like int, String, 0  or None as the values.

Default values of sheet_name is 0. It means if you have different sheet names inside the excel sheet, then the first sheet of the excel will be read in the data frame. In the same way, if sheet_name =1, then the second sheet of the excel. You can also assign the sheet name that is sheet_name = ‘sheetName’ for reading. You can find the other arguments or parameters from the Pandas Site.

 

We have used the ‘ apple.csv ‘excel file in the csv format for all the examples. You can download the excel from through the download link provided.

Example – 1

Reading the excel file and print the entire data of it.

# File name: DataFrame.py
# Author: Data Science Learner
# Begin code

import pandas as pd # importing pandas as pd
appple_stock = pd.read_csv('apple.csv', index_col=None) # reading the excel sheet
print(appple_stock) # print the data inside the excel sheet

Explanation :

appple_stock = pd.read_csv(‘apple.csv’, index_col=None) : It will read the excel sheet with the file name apple.csv with the index column set to none and assigned the data to data frame appple_stock. It means no column has been chosen as the index column. Therefore you will see 0,1,2,3…… for each corresponding data.

print(appple_stock):  It will print all the data from excel sheet.

Output

read csv file

 

You can also set index column with the name of a column name. For example, if you set the index_col = ‘Date’, then it will give the following output.

Date Index

Reading the excel file and print the first five data of it.

You can also print the first five data from the sheet. You have to use head() function for this.

print(appple_stock.head() ):  It will print first five data from excel sheet.

print(appple_stock.head()) # print the first five data inside the excel sheet

Output:

Read File

Reading the excel file and print the last five data of it.

In the same way, you can print the last five data of the excel using tail() method.

print(appple_stock.tail()) # print the first five data inside the excel sheet

Output

Read File

How to Do operation in Excel Files Using Python Pandas Module?

Now you have understood the concepts of how to read the excel file? Adding the new column e.t.c. Right! In this section of Python Pandas Tutorial, You can also do some mathematical operations between the columns of the excel file. Below is the syntax for the choosing the columns in the data frame and then performing the mathematical operation.

df[‘sum_colum’] = df[‘column1’] + df[‘column2’]

In the same way, you can do multiplication, division, and other mathematic operations.

df[‘multiply_column’] = df[‘column1’] * df[‘column2’]

Let’s learn more from the previous example. Suppose you have to find the average of the apple stocks price using Open, High, Close, and Low price in a given period from the csv. You have to calculate the average of the stock using the following formula.

df[‘Average’] = (df[‘Open’]+df[‘High’] + df[‘Low’] + df[‘Close’])/4

When you will run the code, you will get the average price for each of the row using the formula.

# File name: DataFrame.py
# Author: Data Science Learner
# Begin code

import pandas as pd # importing pandas as pd
appple_stock = pd.read_csv('apple.csv', index_col=None) # reading the excel sheet
# average calulcation
appple_stock['Average'] = (appple_stock['Open'] + appple_stock['High'] + appple_stock['Low'] + appple_stock['Close'] )/4
print(appple_stock) # print the data inside the excel sheet

Output

Average Price
Python Pandas Tutorial for finding the average price

But this code will not modify the excel sheet. Suppose you also want to write the excel sheet with Average as a column and fill the data in it. You can do this by using the function to_csv(). This method will save all the data from the data frame. It will create a new CSV file if it is not available in path described or will update the existing CSV file.

# File name: DataFrame.py
# Author: Data Science Learner
# Begin code

import pandas as pd # importing pandas as pd
appple_stock = pd.read_csv('apple.csv', index_col=None) # reading the excel sheet
# average calulcation
appple_stock['Average'] = (appple_stock['Open'] + appple_stock['High'] + appple_stock['Low'] + appple_stock['Close'] )/4
path = "C:\\Users\\skrsu\\Documents\\Pandas\\output.csv" #path where you want to save the file
appple_stock.to_csv(path) # save the csv file

You can also write only some of the columns instead of the entire columns. You have to pass the columns = { “col1”, “col2″ ,”col3”,…}  as a argument in the to_csv() method. For example, I have to write only three columns

  1. Date
  2. Close
  3. Average
appple_stock.to_csv(path,columns={ "Date","Close","Average"}) # save the csv file

 

Other Data Operations in the Excel Sheet

How to filter the data?

You can also filter the excel data through boolean expression in the data frame.For example, if you want to display the data where the Close price of the stock is less than 170.00. Therefore you have to use the boolean expression on the ‘Close‘ column.

appple_stock = appple_stock[appple_stock['Close'] <170.0] # Filter the Close price less than 170.00
print(appple_stock)  #print the price

appple_stock[appple_stock[‘Close’] <170.0] : It will filter the Close columns with the price less than 170.0. You can also use the ‘&’ (and) ‘|’ or for joining the boolean expression.

Full Code

import pandas as pd # importing pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
appple_stock = pd.read_csv('apple.csv', index_col=None) # reading the excel sheet
# average calulcation
appple_stock['Average'] = (appple_stock['Open'] + appple_stock['High'] + appple_stock['Low'] + appple_stock['Close'] )/4
path = "C:\\Users\\skrsu\\Documents\\Pandas\\output.csv" #path where you want to save the file
appple_stock.to_csv(path,index=False,columns={ "Date","Close","Average"}) # save the csv file

appple_stock = pd.read_csv(path) # reading the excel sheet
appple_stock = appple_stock[appple_stock['Close'] <170.0] # Filter the Close price less than 170.00 
print(appple_stock) #print the price

Output

How to filter the data?
Python Pandas Tutorial to filter the data

Sorting the Data

Just like filtering the data, in the same way, you can also sort the column data in the data frame using the sort_values(‘column_name’) method. For example, I want to sort the close price, then I will pass the Close column as an argument to the sort_values() method.

appple_stock = appple_stock.sort_values('Close') # sort the close price of the apple stock dataframe

appple_stock = appple_stock.sort_values(‘Close’) : It will sort the values of the close price in the ascending order.

Full Code

import pandas as pd # importing pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
appple_stock = pd.read_csv('apple.csv', index_col=None) # reading the excel sheet
# average calulcation
appple_stock['Average'] = (appple_stock['Open'] + appple_stock['High'] + appple_stock['Low'] + appple_stock['Close'] )/4
path = "C:\\Users\\skrsu\\Documents\\Pandas\\output.csv" #path where you want to save the file
appple_stock.to_csv(path,index=False,columns={ "Date","Close","Average"}) # save the csv file

appple_stock = pd.read_csv(path) # reading the excel sheet
appple_stock = appple_stock.sort_values('Close') # sort the close price of the apple stock dataframe
print(appple_stock)  #print the data

Output

Python Pandas Tutorial Sorting the Data
Python Pandas Tutorial Sorting the Data

How to plot the excel data Using Matplotlib Module?

In the previous sections of Python Pandas Tutorial, you have learned the following things

  • Types of Data Structures supported By Pandas Python.
  • How to read the excel file and do simple mathematical operations between the column?
  • The methods to write or update the excel file.

In this section of Python Pandas Tutorial, you will learn how to plot various types of figures using the excel data. I will take the modified excel sheet that has the following columns Date, Close and Average. You can plot the line chart just by calling the plot() function to the data frame. It will plot all the columns on the figure. The figure will be shown after calling the show() method.  These methods are the built-in method of matplotlib. Therefore you have to first import matplotlib module.

appple_stock.plot() # plot the columns in the dataframe 
plt.show() # draw the figure

appple_stock.plot() : The plot() function will plot all the columns in the figure. And the show() method will display the figure.

Full code

import pandas as pd # importing pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
appple_stock = pd.read_csv('apple.csv', index_col=None) # reading the excel sheet
# average calulcation
appple_stock['Average'] = (appple_stock['Open'] + appple_stock['High'] + appple_stock['Low'] + appple_stock['Close'] )/4
path = "C:\\Users\\skrsu\\Documents\\Pandas\\output.csv" #path where you want to save the file
appple_stock.to_csv(path,index=False,columns={ "Date","Close","Average"}) # save the csv file

appple_stock = pd.read_csv(path) # reading the excel sheet
plt.plot(appple_stock['Date'],appple_stock['Average'])
(mdates.DateFormatter('%b %d'))
plt.show()

 

 

End Notes

You can do all the things like operation, graph plotting in the excel sheet easily. It is up to a certain limit. But as you already know that data is rapidly or exponential growing. Then There comes a point where you have to use other libraries or module for manipulating large sets of data. Pandas Python is oneof the libraries. You can easily manipulate data in pandas. Dataframe is the general used data structures in Pandas. In this Python Pandas Tutorial, I have covered all the general things that you will use while coding your projects. If you want to include any topic in this article then please comment or contact us for inclusion. You can also like our Facebook Page to connect socially. Also don’t forget to subscribe us for getting new interesting stuff.

Join our list

Subscribe to our mailing list and get interesting stuff and updates to your email inbox.

Thank you for signup. A Confirmation Email has been sent to your Email Address.

Something went wrong.

Meet Abhishek ( Chief Editor) , a data scientist with major expertise in NLP and Text Analytics. He has worked on various projects involving text data and have been able to achieve great results. He is currently manages Datasciencelearner.com, where he and his team share knowledge and help others learn more about data science.
 
Thank you For sharing.We appreciate your support. Don't Forget to LIKE and FOLLOW our SITE to keep UPDATED with Data Science Learner