In this module we will learn how to handle data with the pandas library in Python. Pandas is a powerful library that provides data structures and data analysis tools for Python. It is widely used for data manipulation and analysis, making it an essential tool for data scientists and analysts.
# Import libraries
import pandas as pd
import numpy as np
pandas DataFrames
DataFrames are two-dimensional, size-mutable, potentially heterogeneous tabular data structures with labeled axes (rows and columns). They are similar to SQL tables or Excel spreadsheets. DataFrames can be created from various data sources, such as lists, dictionaries, or CSV files.
kids = [['Mike' , 'boy' , 7 , 128 ], ['George' , 'boy' , 4 , 112 ], ['Lily' , 'girl' , 8 , 132 ], ['Yu' , 'girl' , 10 , 143 ], ['Dan' , 'boy' , 9 , 138 ]]
colNames = ['name' , 'gender' , 'age' , 'height' ]
df = pd.DataFrame(data = kids, columns = colNames)
df
0
Mike
boy
7
128
1
George
boy
4
112
2
Lily
girl
8
132
3
Yu
girl
10
143
4
Dan
boy
9
138
# Check the column data types
df.dtypes
name object
gender object
age int64
height int64
dtype: object
Import data from a csv file
df = pd.read_csv('diabetes.csv' )
0
6
148
72
35
0
33.6
0.627
50
1
1
1
85
66
29
0
26.6
0.351
31
0
2
8
183
64
0
0
23.3
0.672
32
1
3
1
89
66
23
94
28.1
0.167
21
0
4
0
137
40
35
168
43.1
2.288
33
1
...
...
...
...
...
...
...
...
...
...
763
10
101
76
48
180
32.9
0.171
63
0
764
2
122
70
27
0
36.8
0.340
27
0
765
5
121
72
23
112
26.2
0.245
30
0
766
1
126
60
0
0
30.1
0.349
47
1
767
1
93
70
31
0
30.4
0.315
23
0
768 rows × 9 columns
# Select top N number of records (default = 5)
df.head()
0
6
148
72
35
0
33.6
0.627
50
1
1
1
85
66
29
0
26.6
0.351
31
0
2
8
183
64
0
0
23.3
0.672
32
1
3
1
89
66
23
94
28.1
0.167
21
0
4
0
137
40
35
168
43.1
2.288
33
1
# Select bottom N number of records (default = 5)
df.tail()
763
10
101
76
48
180
32.9
0.171
63
0
764
2
122
70
27
0
36.8
0.340
27
0
765
5
121
72
23
112
26.2
0.245
30
0
766
1
126
60
0
0
30.1
0.349
47
1
767
1
93
70
31
0
30.4
0.315
23
0
# Check the column data types
df.dtypes
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object
# Get the number of rows and columns in your dataframe
df.shape
# The info method gives the column datatypes + number of non-null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Subsetting DataFrames
# Select one column using double brackets
df[['Pregnancies' ]].head()
type (df[['Pregnancies' ]].head())
pandas.core.frame.DataFrame
# Select multiple columns using double brackets
df[['Pregnancies' , 'Glucose' ]].head()
0
6
148
1
1
85
2
8
183
3
1
89
4
0
137
type (df[['Pregnancies' , 'Glucose' ]].head())
pandas.core.frame.DataFrame
# Select one column using single brackets
# This produces a pandas series which is a one-dimentional array that can be labeled
df['Pregnancies' ].head()
0 6
1 1
2 8
3 1
4 0
Name: Pregnancies, dtype: int64
type (df['Pregnancies' ].head())
pandas.core.series.Series
# Keep in mind that you can't select multiple columns using single brackets
df['Pregnancies' , 'Glucose' ]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.13.5/x64/lib/python3.13/site-packages/pandas/core/indexes/base.py:3812 , in Index.get_loc (self, key)
3811 try :
-> 3812 return self . _engine . get_loc ( casted_key )
3813 except KeyError as err:
File pandas/_libs/index.pyx:167 , in pandas._libs.index.IndexEngine.get_loc ()
File pandas/_libs/index.pyx:196 , in pandas._libs.index.IndexEngine.get_loc ()
File pandas/_libs/hashtable_class_helper.pxi:7088 , in pandas._libs.hashtable.PyObjectHashTable.get_item ()
File pandas/_libs/hashtable_class_helper.pxi:7096 , in pandas._libs.hashtable.PyObjectHashTable.get_item ()
KeyError : ('Pregnancies', 'Glucose')
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[18] , line 2
1 # Keep in mind that you can't select multiple columns using single brackets
----> 2 df [ ' Pregnancies ' , ' Glucose ' ]
File /opt/hostedtoolcache/Python/3.13.5/x64/lib/python3.13/site-packages/pandas/core/frame.py:4107 , in DataFrame.__getitem__ (self, key)
4105 if self .columns.nlevels > 1 :
4106 return self ._getitem_multilevel(key)
-> 4107 indexer = self . columns . get_loc ( key )
4108 if is_integer(indexer):
4109 indexer = [indexer]
File /opt/hostedtoolcache/Python/3.13.5/x64/lib/python3.13/site-packages/pandas/core/indexes/base.py:3819 , in Index.get_loc (self, key)
3814 if isinstance (casted_key, slice ) or (
3815 isinstance (casted_key, abc.Iterable)
3816 and any (isinstance (x, slice ) for x in casted_key)
3817 ):
3818 raise InvalidIndexError(key)
-> 3819 raise KeyError (key) from err
3820 except TypeError :
3821 # If we have a listlike key, _check_indexing_error will raise
3822 # InvalidIndexError. Otherwise we fall through and re-raise
3823 # the TypeError.
3824 self ._check_indexing_error(key)
KeyError : ('Pregnancies', 'Glucose')
With a pandas series, we can select rows using slicing like this: [start_index:end_index]. The end_index is not inclusive. This behavior is very similar to Python lists.
0 50
1 31
2 32
3 21
4 33
..
763 63
764 27
765 30
766 47
767 23
Name: Age, Length: 768, dtype: int64
0 50
1 31
2 32
3 21
4 33
5 30
6 26
7 29
8 53
9 54
Name: Age, dtype: int64
0 50
1 31
2 32
3 21
4 33
Name: Age, dtype: int64
pandas.core.series.Series
The pandas attribute .loc allow you to select columns, index, and slice your data
# pandas dataframe
df.loc[:, ['Age' ]].head()
# pandas series
df.loc[:, 'Age' ].head()
0 50
1 31
2 32
3 21
4 33
Name: Age, dtype: int64
df['Pregnancies' ].value_counts()
Pregnancies
1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
17 1
15 1
Name: count, dtype: int64
Filtering DataFrames
# Notice that the filter produces a pandas series of True and False values
filter_pregnancies = df['Pregnancies' ] >= 3
filter_pregnancies
0 True
1 False
2 True
3 False
4 False
...
763 True
764 False
765 True
766 False
767 False
Name: Pregnancies, Length: 768, dtype: bool
filter_pregnancies.head()
0 True
1 False
2 True
3 False
4 False
Name: Pregnancies, dtype: bool
0
6
148
72
35
0
33.6
0.627
50
1
2
8
183
64
0
0
23.3
0.672
32
1
5
5
116
74
0
0
25.6
0.201
30
0
6
3
78
50
32
88
31.0
0.248
26
1
7
10
115
0
0
0
35.3
0.134
29
0
...
...
...
...
...
...
...
...
...
...
759
6
190
92
0
0
35.5
0.278
66
1
761
9
170
74
31
0
44.0
0.403
43
1
762
9
89
62
0
0
22.5
0.142
33
0
763
10
101
76
48
180
32.9
0.171
63
0
765
5
121
72
23
112
26.2
0.245
30
0
419 rows × 9 columns
df.loc[filter_pregnancies, :]
0
6
148
72
35
0
33.6
0.627
50
1
2
8
183
64
0
0
23.3
0.672
32
1
5
5
116
74
0
0
25.6
0.201
30
0
6
3
78
50
32
88
31.0
0.248
26
1
7
10
115
0
0
0
35.3
0.134
29
0
...
...
...
...
...
...
...
...
...
...
759
6
190
92
0
0
35.5
0.278
66
1
761
9
170
74
31
0
44.0
0.403
43
1
762
9
89
62
0
0
22.5
0.142
33
0
763
10
101
76
48
180
32.9
0.171
63
0
765
5
121
72
23
112
26.2
0.245
30
0
419 rows × 9 columns
# Notice that it looks like nothing changes
# This is because we didn't update the dataframe after applying the filter
df['Pregnancies' ].value_counts()
Pregnancies
1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
17 1
15 1
Name: count, dtype: int64
# Update the dataframe to contain observations with pregnancies <= 3
df = df.loc[filter_pregnancies, :]
df['Pregnancies' ].value_counts()
Pregnancies
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
17 1
15 1
Name: count, dtype: int64
Age
28 20
41 18
37 18
31 16
27 15
30 15
22 15
29 15
42 14
24 14
25 14
45 13
34 13
38 13
36 13
26 12
43 12
39 12
32 12
40 12
33 10
46 10
35 8
51 8
23 7
50 7
58 6
44 6
60 5
54 5
57 5
21 5
49 5
52 5
48 5
47 5
63 4
55 4
56 3
53 3
66 3
69 2
61 2
67 2
65 2
59 1
81 1
64 1
62 1
70 1
68 1
Name: count, dtype: int64
# Notice that the filter produces a pandas series of True and False values
df['Age' ] <= 38
0 False
2 True
5 True
6 True
7 True
...
759 False
761 False
762 True
763 False
765 True
Name: Age, Length: 419, dtype: bool
filter_age = df['Age' ] <= 38
df = df.loc[filter_age, :]
Age
28 20
37 18
31 16
30 15
22 15
27 15
29 15
24 14
25 14
34 13
36 13
38 13
32 12
26 12
33 10
35 8
23 7
21 5
Name: count, dtype: int64
Combining filters
df.loc[filter_pregnancies & filter_age, :]
2
8
183
64
0
0
23.3
0.672
32
1
5
5
116
74
0
0
25.6
0.201
30
0
6
3
78
50
32
88
31.0
0.248
26
1
7
10
115
0
0
0
35.3
0.134
29
0
10
4
110
92
0
0
37.6
0.191
30
0
...
...
...
...
...
...
...
...
...
...
748
3
187
70
22
200
36.4
0.408
36
1
750
4
136
70
0
0
31.2
1.182
22
1
752
3
108
62
24
0
26.0
0.223
25
0
762
9
89
62
0
0
22.5
0.142
33
0
765
5
121
72
23
112
26.2
0.245
30
0
235 rows × 9 columns
Managing DataFrame columns
# Rename column names
df3 = df.rename(columns= {
'BloodPressure' : 'BP' ,
'SkinThickness' : 'ST' ,
'DiabetesPedigreeFunction' : 'DPF'
})
2
8
183
64
0
0
23.3
0.672
32
1
5
5
116
74
0
0
25.6
0.201
30
0
6
3
78
50
32
88
31.0
0.248
26
1
7
10
115
0
0
0
35.3
0.134
29
0
10
4
110
92
0
0
37.6
0.191
30
0
...
...
...
...
...
...
...
...
...
...
748
3
187
70
22
200
36.4
0.408
36
1
750
4
136
70
0
0
31.2
1.182
22
1
752
3
108
62
24
0
26.0
0.223
25
0
762
9
89
62
0
0
22.5
0.142
33
0
765
5
121
72
23
112
26.2
0.245
30
0
235 rows × 9 columns
# This approach allows you to drop multiple columns at a time
df3 = df3.drop(columns= ['Insulin' , 'BMI' ])
2
8
183
64
0
0.672
32
1
5
5
116
74
0
0.201
30
0
6
3
78
50
32
0.248
26
1
7
10
115
0
0
0.134
29
0
10
4
110
92
0
0.191
30
0
...
...
...
...
...
...
...
...
748
3
187
70
22
0.408
36
1
750
4
136
70
0
1.182
22
1
752
3
108
62
24
0.223
25
0
762
9
89
62
0
0.142
33
0
765
5
121
72
23
0.245
30
0
235 rows × 7 columns
# Approach 2
del df3['Pregnancies' ]
2
183
64
0
0.672
32
1
5
116
74
0
0.201
30
0
6
78
50
32
0.248
26
1
7
115
0
0
0.134
29
0
10
110
92
0
0.191
30
0
...
...
...
...
...
...
...
748
187
70
22
0.408
36
1
750
136
70
0
1.182
22
1
752
108
62
24
0.223
25
0
762
89
62
0
0.142
33
0
765
121
72
23
0.245
30
0
235 rows × 6 columns
Calculating statistics
Glucose 28536.000
BP 15807.000
ST 4272.000
DPF 102.571
Age 7004.000
Outcome 88.000
dtype: float64
<class 'pandas.core.frame.DataFrame'>
Index: 235 entries, 2 to 765
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Glucose 235 non-null int64
1 BP 235 non-null int64
2 ST 235 non-null int64
3 DPF 235 non-null float64
4 Age 235 non-null int64
5 Outcome 235 non-null int64
dtypes: float64(1), int64(5)
memory usage: 12.9 KB
array([[1.83e+02, 6.40e+01, 0.00e+00, 6.72e-01, 3.20e+01, 1.00e+00],
[1.16e+02, 7.40e+01, 0.00e+00, 2.01e-01, 3.00e+01, 0.00e+00],
[7.80e+01, 5.00e+01, 3.20e+01, 2.48e-01, 2.60e+01, 1.00e+00],
...,
[1.08e+02, 6.20e+01, 2.40e+01, 2.23e-01, 2.50e+01, 0.00e+00],
[8.90e+01, 6.20e+01, 0.00e+00, 1.42e-01, 3.30e+01, 0.00e+00],
[1.21e+02, 7.20e+01, 2.30e+01, 2.45e-01, 3.00e+01, 0.00e+00]],
shape=(235, 6))
df3 = df3.loc[filter_age & filter_pregnancies, :]
2
183
64
0
0.672
32
1
5
116
74
0
0.201
30
0
6
78
50
32
0.248
26
1
7
115
0
0
0.134
29
0
10
110
92
0
0.191
30
0
...
...
...
...
...
...
...
748
187
70
22
0.408
36
1
750
136
70
0
1.182
22
1
752
108
62
24
0.223
25
0
762
89
62
0
0.142
33
0
765
121
72
23
0.245
30
0
235 rows × 6 columns
# Export DataFrame to csv file
df3.to_csv(path_or_buf = 'diabetes2.csv' , index = False )