Estimated time: 15 minutes

Pandas

pandas is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. — Wikipedia

The pandas package is the most important tool at the disposal of Data Scientists and Analysts working in Python today. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

Simply, pandas can do anything you could imagine of with data.

DataFrame and Series

The primary two components of pandas are the Series and DataFrame. A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.

Creating dataframe

import pandas as pd

data = {
    'durians': [4, 55, 32, 13], 
    'apples': [3, 56, 3, 1]
}

fruit = pd.DataFrame(data)
fruit

Each item (key, value) in data corresponds to a column in the resulting DataFrame.

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

Let's put customer names as our index:

fruit = pd.DataFrame(data, index=['Ali', 'Abu', 'Lim', 'Muthu'])

fruit

Locating the customer

We can locate a customer's order by using their name with loc.

fruit.loc['Ali']

durians    4
apples     3
Name: Ali, dtype: int64

Read CSV files

Reading CSV files is a single line to load in the data.

wget is a comment to retrieves content from web servers.

!wget http://gg.gg/fruit-csv -O fruit.csv

--2021-03-08 21:02:32--  http://gg.gg/fruit-csv
Resolving gg.gg (gg.gg)... 185.15.209.141
Connecting to gg.gg (gg.gg)|185.15.209.141|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/khvmaths/a5eab33562c1a0e4d30901af8203784c/raw/505eca83251888c84c2b336be841aa5ea0f318f2/fruit.csv [following]
--2021-03-08 21:02:33--  https://gist.githubusercontent.com/khvmaths/a5eab33562c1a0e4d30901af8203784c/raw/505eca83251888c84c2b336be841aa5ea0f318f2/fruit.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53 [text/plain]
Saving to: ‘fruit.csv’

fruit.csv           100%[===================>]      53  --.-KB/s    in 0s      

2021-03-08 21:02:33 (1.81 MB/s) - ‘fruit.csv’ saved [53/53]

df = pd.read_csv('fruit.csv')
df

CSV doesn't have indexes like DataFrames. So, we need to designate the index_col when reading.

df = pd.read_csv('fruit.csv',index_col=0)
df

The same goes for reading JSON, SQL, Excel file, etc. Basically you could read anything. Read more from the documentation here

Some more common operations that you should know!

Let's save our fruit.csv into JSON file.

df.to_json('fruit.json')

In Linux system, you could read a file contant through cat command.

!cat fruit.json

{"durians":{"Ali":4,"Abu":55,"Lim":32,"Muthu":13},"apples":{"Ali":3,"Abu":56,"Lim":3,"Muthu":1}}

Viewing few rows in large dataset

In a large file, you can't expect to print all the table, it would be very messy! Luckily, pandas allows to just look for the first few rows or the last few rows through head and tail.

!wget http://gg.gg/cov-1 -O covid19.csv

--2021-03-08 21:02:34--  http://gg.gg/cov-1
Resolving gg.gg (gg.gg)... 185.15.209.141
Connecting to gg.gg (gg.gg)|185.15.209.141|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv [following]
--2021-03-08 21:02:34--  https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 516625 (505K) [text/plain]
Saving to: ‘covid19.csv’

covid19.csv         100%[===================>] 504.52K  1.41MB/s    in 0.3s    

2021-03-08 21:02:35 (1.41 MB/s) - ‘covid19.csv’ saved [516625/516625]

covid = pd.read_csv('covid19.csv')

covid.head()

covid.tail()

Typically when we load in a dataset, we like to view the first five or so rows to see what's under the hood. Here we can see the names of each column, the index, and examples of values in each row.

Getting info about the data

.info() should be one of the very first commands you run after loading your data:

covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274 entries, 0 to 273
Columns: 415 entries, Province/State to 3/7/21
dtypes: float64(2), int64(411), object(2)
memory usage: 888.5+ KB

.info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

Shape of the data

Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):

covid.shape

(274, 415)

This means we have 247 rows with 415 columns in the DataFrame.

Duplicates

To demonstrate, let's simply just double up our movies DataFrame by appending it to itself.

covid_tmp = covid.append(covid)
covid_tmp.shape

(548, 415)

Using append() will return a copy without affecting the original DataFrame. Now, let's drop the duplicates.

covid_tmp = covid.drop_duplicates()
covid_tmp.shape

(274, 415)

It's a little verbose to keep assigning DataFrames to the same variable like in this example. For this reason, pandas has the inplace keyword argument on many of its methods. Using inplace=True will modify the DataFrame object in place. Another important argument for drop_duplicates() is keep, which has three possible options:

first: (default) Drop duplicates except for the first occurrence.
last: Drop duplicates except for the last occurrence.
False: Drop all duplicates.

covid_tmp.drop_duplicates(inplace=True, keep=False)

Columns

When we have a large dataset, we might forget the column name, with upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.

covid.columns

Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '2/26/21', '2/27/21', '2/28/21', '3/1/21', '3/2/21', '3/3/21', '3/4/21',
       '3/5/21', '3/6/21', '3/7/21'],
      dtype='object', length=415)

We can use the .rename() method to rename certain or all columns via a dict.

covid.rename(columns={
        'Province/State': 'Province', 
        'Country/Region': 'Country'
    }, inplace=True)
covid.columns

Index(['Province', 'Country', 'Lat', 'Long', '1/22/20', '1/23/20', '1/24/20',
       '1/25/20', '1/26/20', '1/27/20',
       ...
       '2/26/21', '2/27/21', '2/28/21', '3/1/21', '3/2/21', '3/3/21', '3/4/21',
       '3/5/21', '3/6/21', '3/7/21'],
      dtype='object', length=415)

Missing values

When exploring data, you’ll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's None or NumPy's np.nan, each of which are handled differently in some situations. We can use isnull() to check for missing values in our DataFrame.

covid.isnull()

Removing null values

We can remove the null value with dropna().

covid.dropna()

This operation will delete any row with at least a single null value, but it will return a new DataFrame without altering the original one. You could specify inplace=True in this method as well.

covid.dropna(inplace=True)
covid.shape

(84, 415)

Other than just dropping rows, you can also drop columns with null values by setting axis=1.

covid.dropna(axis=1,inplace=True)
covid.shape

(84, 415)

Understanding our data

Using describe() on an entire DataFrame we can get a summary of the distribution of continuous variables.

covid['1/22/20'].describe()

count     84.000000
mean       6.523810
std       48.442358
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max      444.000000
Name: 1/22/20, dtype: float64

Reference

	Province/State	Country/Region	Lat	Long	...	2/26/21	2/27/21	2/28/21	3/1/21	3/2/21	3/3/21	3/4/21	3/5/21	3/6/21	3/7/21
0	NaN	Afghanistan	33.93911	67.709953	...	55696	55707	55714	55733	55759	55770	55775	55827	55840	55847
1	NaN	Albania	41.15330	20.168300	...	105229	106215	107167	107931	108823	109674	110521	111301	112078	112897
2	NaN	Algeria	28.03390	1.659600	...	112805	112960	113092	113255	113430	113593	113761	113948	114104	114234
3	NaN	Andorra	42.50630	1.521800	...	10822	10849	10866	10889	10908	10948	10976	10998	11019	11042
4	NaN	Angola	-11.20270	17.873900	...	20759	20782	20807	20854	20882	20923	20981	21026	21055	21086

	Province/State	Country/Region	Lat	Long	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	...	2/26/21	2/27/21	2/28/21	3/1/21	3/2/21	3/3/21	3/4/21	3/5/21	3/6/21	3/7/21
269	NaN	Vietnam	14.058324	108.277199	2	2	2	2	2	...	2426	2432	2448	2448	2475	2482	2488	2494	2501	2512
270	NaN	West Bank and Gaza	31.952200	35.233200	0	0	0	0	0	...	180848	181909	183612	185336	187309	189326	191203	193029	194548	196812
271	NaN	Yemen	15.552727	48.516388	0	0	0	0	0	...	2267	2269	2285	2310	2342	2363	2375	2411	2444	2473
272	NaN	Zambia	-13.133897	27.849332	0	0	0	0	0	...	77639	78202	78534	79002	79557	80090	80687	81341	82011	82421
273	NaN	Zimbabwe	-19.015438	29.154857	0	0	0	0	0	...	36044	36058	36089	36115	36148	36179	36223	36248	36260	36271

	Province	Country	Lat	Long	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	...	2/26/21	2/27/21	2/28/21	3/1/21	3/2/21	3/3/21	3/4/21	3/5/21	3/6/21	3/7/21
8	Australian Capital Territory	Australia	-35.473500	149.012400	0	0	0	0	0	0	...	118	118	118	118	120	120	120	120	122	122
9	New South Wales	Australia	-33.868800	151.209300	0	0	0	0	3	4	...	5172	5177	5180	5183	5189	5193	5205	5207	5209	5210
10	Northern Territory	Australia	-12.463400	130.845600	0	0	0	0	0	0	...	105	105	105	105	105	105	105	105	105	105
11	Queensland	Australia	-27.469800	153.025100	0	0	0	0	0	0	...	1329	1329	1331	1335	1335	1342	1344	1347	1349	1356
12	South Australia	Australia	-34.928500	138.600700	0	0	0	0	0	0	...	613	613	616	617	618	618	618	618	618	618
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
259	Gibraltar	United Kingdom	36.140800	-5.353600	0	0	0	0	0	0	...	4236	4238	4239	4239	4243	4243	4243	4244	4247	4249
260	Isle of Man	United Kingdom	54.236100	-4.548100	0	0	0	0	0	0	...	475	484	484	486	494	511	546	605	676	757
261	Montserrat	United Kingdom	16.742498	-62.187366	0	0	0	0	0	0	...	20	20	20	20	20	20	20	20	20	20
262	Saint Helena, Ascension and Tristan da Cunha	United Kingdom	-7.946700	-14.355900	0	0	0	0	0	0	...	4	4	4	4	4	4	4	4	4	4
263	Turks and Caicos Islands	United Kingdom	21.694000	-71.797900	0	0	0	0	0	0	...	2088	2099	2114	2114	2115	2132	2161	2172	2177	2177

	Province	Country	Lat	Long	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	...	2/26/21	2/27/21	2/28/21	3/1/21	3/2/21	3/3/21	3/4/21	3/5/21	3/6/21	3/7/21
0	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
1	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
2	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
3	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
4	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
269	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
270	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
271	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
272	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
273	True	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

	durians	apples
Ali	4	3
Abu	55	56
Lim	32	3
Muthu	13	1

	Unnamed: 0	durians	apples
0	Ali	4	3
1	Abu	55	56
2	Lim	32	3
3	Muthu	13	1

	durians	apples
Ali	4	3
Abu	55	56
Lim	32	3
Muthu	13	1