Have you been taught to use R for data analysis your whole career and wish to make the switch to Python? Are you thinking of applying for a new job which will make use of Python’s famous Pandas library and don’t have the know-how? While knowing how to code and manipulate data in R is very important, having the same level of expertise in Pandas is proving to be an invaluable skill for a data scientist in today’s world. My goal with this post is to assist you with the transition and get started by learning how to work with one of Pandas’ more powerful and popular objects: the DataFrame.
Note: the following sections assume the person interested already knows how to code native Python (e.g., use lists, dictionaries, create for/while loops, etc.)
The Pandas DataFrame
Similar to R’s
data.frame, a Pandas
DataFrame is a data structure that stores information. It organises data into rows and columns and is widely used among data scientists to manipulate data and present findings. These objects can be created through several different ways: using native Python lists, dictionaries, sets, or even calling Pandas’
read_xxx() function which takes in csv files, Excel files, etc.
For the following examples, I’ll be using COVID-19 data downloaded from John Hopkin’s Github repo. An example of a csv file turned into a Pandas
DataFrame is shown below:
The output above only shows the first 5 rows of the DataFrame
df using the
.head() method. This is the same thing as running
head(df) in R.
DataFrame is created, one generally would like to select a subset instead of working with all of the rows and columns. Whether it be to simply display that subset or use it later for visualisations and modelling, being able to appropriately slice a dataframe is very important regardless of the programming language being used. Up until this point, you’ve learnt that the way to do this is via numerical indexing or boolean indexing. For example, if I wanted to select rows 10, 20, and 30, I would normally run
df[c(10,20,30),], right? Furthermore, what if I wanted to select the first 5 columns? Run
Let’s see if this works:
As you probably already realised, a
DataFrame in Pandas does not follow the same indexing/slicing rules used in R. This is because Pandas has different methods to slice a
The .loc[ ] Method
This method allows someone to access a group of rows and/or columns using label(s) or a boolean array. In other words, one should not try and access columns via integer indexing using this method. Instead,
.loc comes in really handy when trying to select columns by their name (or label) along the axis. Looking at the output above, notice how each column has a label (“UID”, “Admin2”, “Province_State”, etc.) and each row also has a label (0, 1, 2, etc.).
Therefore, if I wish to select the columns
Long_ for rows 2 and 4, I run the following command:
How come this worked when I used integer indexing for the rows? Well,
.loc interprets integers “as a label of the index, and never as an integer position along the index”. Therefore, when I passed [2, 4],
.loc interpreted the number as labels of the row index.
Another example: display the columns
Country_Region, and column corresponding to the most recent date (
3/17/21) with all rows for Allegheny County:
Even though this indexing may seem a bit too complicated, it’s really made up of two components: boolean indexing and label indexing.
df.loc[:,"Admin2"] == "Allegheny" simply returns a boolean
Series (similar to a list) with each
True corresponding to a row where
Admin2 == “Allegheny”. This is equivalent to running
df$Admin2 == “Allegheny" in R. The column indexing is just a long list of column labels. That’s it. Even though what goes in between the brackets may seem complicated due to the additional characters involved, it’s important to remember what type of indexing
.loc allows: labels and booleans (integers in the row index get converted to labels).
Now you may be wondering, “row labels are generally pretty easy to figure out. They’re just the row indices sorted in ascending order. However, how can I know all of the column labels for a specific
DataFrame if I want to use
.loc?” Recall how in R you’re able to pull out the column names using the
colnames() function. Pandas has a similar feature which allows you to extract the column labels from a
In order to know all the column labels in a
DataFrame, simply call its attribute
columns like so:
That being said, our last example will be a little more complex than the previous two: select the columns
Country_Region, and all columns corresponding to the last 10 days for California:
Rather than typing every single date as a column label inside the brackets (e.g., “3/8/21”, “3/9/21”, etc), why not use
df.columns to our advantage? I start off by defining an initial list
cols that’ll be extended later. Afterwards, I grab the last 10 labels from
df.columns and add them to my original list of
cols. Finally, it’s just a matter of including that list in my
.loc call and I’m all set.
.loc is a great tool to use when working with row or column labels. However, if someone were to ask me, “display the fourth column in the DataFrame
df" and I didn’t know the column name, then
.loc would not work! For those types of problems, we need
(if you wish to learn more about
.loc, check out the documentation here)
The .iloc[ ] Method
As you probably already guessed,
.iloc is perfect for classic integer indexing. This method takes in integers from 0 to length-1 of the axis and can also take boolean arrays. This means I cannot input the row or column label inside the brackets, since this cannot be interpreted as an integer or boolean. Some examples of using
Display the fourth column in the DataFrame
.iloc also accepts boolean arrays: display the last 10 columns for New York:
.values() is used here to effectively pass a boolean
Finally, what if we want to display all columns for specific rows? All examples involved subsetting rows and columns. Let’s only do one now: show all columns for rows 10, 20, and 100:
.loc would’ve produced the same results with the same input! Why? Remember that
.loc interprets integers as labels in the row index. Furthermore, why didn’t we include a comma inside the brackets? Because the default dimension for both
.iloc is the row. Therefore, when all we’re passing is row indices, we don’t have to add the comma in the middle.
.iloc is really useful when looking at positional indices! Think about it as indexing into a Python list. You wouldn’t index using strings, right?
The Classic [ ] Method
The third most common way of slicing a Pandas DataFrame is by simply using square brackets ([ ]). Considered to be the most basic indexing available, this method is commonly used to slice entire columns from a DataFrame. Therefore, if one wishes to slice rows and columns, try using
.iloc instead. A couple examples are shown below:
Admin2 column from the DataFrame
Notice how the output does not look the same as when we displayed the column
code3 before. This is because the result is a Pandas
Series instead of a
DataFrame. The way to get around this is by placing the string in between brackets (e.g.,
df.iloc[:,3] returns a
df.iloc[:,] returned a
DataFrame). Furthermore, this method can be used a lot in boolean indexing to simplify the statement (instead of typing
df.loc[:,"Admin2"] == "Allegheny" someone could type
df["Admin2"] == "Allegheny".
This method also accepts boolean indexing! Another example: display all of the even rows (with all columns):
Similar to how
.columns returns all of the column labels,
.index returns all of the row labels.
Finally, select the columns
Country_Region for the rows between 400 and 500 (inclusive) — only using [ ]:
In this example, focus on going step by step.
df[["Admin2", "Province_State", "Country_Region"]] should return those 3 columns and all rows as a DataFrame, correct? However, since I can’t row and column index inside the same brackets (while using this method), then I have to add another pair of brackets with the boolean array helping us get the rows in between 400 and 500 inclusive.
[ ] is really helpful when trying to select entire columns or rows. It doesn’t support double indexing (rows and columns) and, even though it looks similar to R’s, it does not follow the same rules!
At this point, you’re probably thinking, “why? Why is DataFrame slicing more complicated here than it is with R? How will I be able to remember when to use which method?”
First of all, it all takes practice. I had a bunch of
KeywordErrors in the beginning that forced me to keep on practicing and reading the documentation for examples. Furthermore, just remember, all methods accept boolean arrays. Its a matter of knowing which one accepts labels and which one integers (the first letter of the method should give it away!)
DataFrame slicing is a basic yet very important skill to have as a data analyst or scientist. Acquiring and developing these skills are extremely beneficial when choosing the most relevant and appropriate data to explore and, later on, model. Hopefully by now you have a better understanding on how to slice DataFrames in Pandas. These skills can later be applied (in conjunction with other Python packages) to produce something like the following:
PLEASE MASK UP!