Making the Switch from R Dataframe Slicing to Python’s Pandas

Have you been taught to use R for data analysis your whole career and wish to make the switch to Python? Are you thinking of applying for a new job which will make use of Python’s famous Pandas library and don’t have the know-how? While knowing how to code and manipulate data in R is very important, having the same level of expertise in Pandas is proving to be an invaluable skill for a data scientist in today’s world. My goal with this post is to assist you with the transition and get started by learning how to work with one of Pandas’ more powerful and popular objects: the DataFrame.

Note: the following sections assume the person interested already knows how to code native Python (e.g., use lists, dictionaries, create for/while loops, etc.)

The Pandas DataFrame

Similar to R’s , a Pandas is a data structure that stores information. It organises data into rows and columns and is widely used among data scientists to manipulate data and present findings. These objects can be created through several different ways: using native Python lists, dictionaries, sets, or even calling Pandas function which takes in csv files, Excel files, etc.

For the following examples, I’ll be using COVID-19 data downloaded from John Hopkin’s Github repo. An example of a csv file turned into a Pandas is shown below:

Note: DataFrame may not be displayed like this in other environments

The output above only shows the first 5 rows of the DataFrame using the method. This is the same thing as running in R.

Once the is created, one generally would like to select a subset instead of working with all of the rows and columns. Whether it be to simply display that subset or use it later for visualisations and modelling, being able to appropriately slice a dataframe is very important regardless of the programming language being used. Up until this point, you’ve learnt that the way to do this is via numerical indexing or boolean indexing. For example, if I wanted to select rows 10, 20, and 30, I would normally run , right? Furthermore, what if I wanted to select the first 5 columns? Run .

Let’s see if this works:

Rather than using c(), Python uses lists []
Uh-oh, SyntaxError

As you probably already realised, a in Pandas does not follow the same indexing/slicing rules used in R. This is because Pandas has different methods to slice a !

The .loc[ ] Method

Quick preview of the data we’ll be working with

This method allows someone to access a group of rows and/or columns using label(s) or a boolean array. In other words, one should not try and access columns via integer indexing using this method. Instead, comes in really handy when trying to select columns by their name (or label) along the axis. Looking at the output above, notice how each column has a label (“UID”, “Admin2”, “Province_State”, etc.) and each row also has a label (0, 1, 2, etc.).

Therefore, if I wish to select the columns , , , and for rows 2 and 4, I run the following command:

How come this worked when I used integer indexing for the rows? Well, interprets integers “as a label of the index, and never as an integer position along the index”. Therefore, when I passed [2, 4], interpreted the number as labels of the row index.

Another example: display the columns , , , and column corresponding to the most recent date () with all rows for Allegheny County:

Even though this indexing may seem a bit too complicated, it’s really made up of two components: boolean indexing and label indexing. simply returns a boolean (similar to a list) with each corresponding to a row where == “Allegheny”. This is equivalent to running in R. The column indexing is just a long list of column labels. That’s it. Even though what goes in between the brackets may seem complicated due to the additional characters involved, it’s important to remember what type of indexing allows: labels and booleans (integers in the row index get converted to labels).

Now you may be wondering, “row labels are generally pretty easy to figure out. They’re just the row indices sorted in ascending order. However, how can I know all of the column labels for a specific if I want to use ?” Recall how in R you’re able to pull out the column names using the function. Pandas has a similar feature which allows you to extract the column labels from a .

In order to know all the column labels in a , simply call its attribute like so:

Can be indexed like a native Python list

That being said, our last example will be a little more complex than the previous two: select the columns , , , and all columns corresponding to the last 10 days for California:

Rather than typing every single date as a column label inside the brackets (e.g., “3/8/21”, “3/9/21”, etc), why not use to our advantage? I start off by defining an initial list that’ll be extended later. Afterwards, I grab the last 10 labels from and add them to my original list of . Finally, it’s just a matter of including that list in my call and I’m all set.

Remember, is a great tool to use when working with row or column labels. However, if someone were to ask me, “display the fourth column in the DataFrame " and I didn’t know the column name, then would not work! For those types of problems, we need .

(if you wish to learn more about , check out the documentation here)

The .iloc[ ] Method

As you probably already guessed, is perfect for classic integer indexing. This method takes in integers from 0 to length-1 of the axis and can also take boolean arrays. This means I cannot input the row or column label inside the brackets, since this cannot be interpreted as an integer or boolean. Some examples of using :

Display the fourth column in the DataFrame :

Recall that also accepts boolean arrays: display the last 10 columns for New York:

is used here to effectively pass a boolean into .

Finally, what if we want to display all columns for specific rows? All examples involved subsetting rows and columns. Let’s only do one now: show all columns for rows 10, 20, and 100:

Looks like and would’ve produced the same results with the same input! Why? Remember that interprets integers as labels in the row index. Furthermore, why didn’t we include a comma inside the brackets? Because the default dimension for both and is the row. Therefore, when all we’re passing is row indices, we don’t have to add the comma in the middle.

Bottom line: is really useful when looking at positional indices! Think about it as indexing into a Python list. You wouldn’t index using strings, right?

The Classic [ ] Method

The third most common way of slicing a Pandas DataFrame is by simply using square brackets ([ ]). Considered to be the most basic indexing available, this method is commonly used to slice entire columns from a DataFrame. Therefore, if one wishes to slice rows and columns, try using or instead. A couple examples are shown below:

Select the column from the DataFrame :

The output is a Pandas Series (not a DataFrame!)

Notice how the output does not look the same as when we displayed the column before. This is because the result is a Pandas instead of a . The way to get around this is by placing the string in between brackets (e.g., returns a whereas returned a ). Furthermore, this method can be used a lot in boolean indexing to simplify the statement (instead of typing someone could type .

This method also accepts boolean indexing! Another example: display all of the even rows (with all columns):

Similar to how returns all of the column labels, returns all of the row labels.

Finally, select the columns , , and for the rows between 400 and 500 (inclusive) — only using [ ]:

In this example, focus on going step by step. should return those 3 columns and all rows as a DataFrame, correct? However, since I can’t row and column index inside the same brackets (while using this method), then I have to add another pair of brackets with the boolean array helping us get the rows in between 400 and 500 inclusive.

Bottom line: is really helpful when trying to select entire columns or rows. It doesn’t support double indexing (rows and columns) and, even though it looks similar to R’s, it does not follow the same rules!

Conclusion

At this point, you’re probably thinking, “why? Why is DataFrame slicing more complicated here than it is with R? How will I be able to remember when to use which method?”

First of all, it all takes practice. I had a bunch of and in the beginning that forced me to keep on practicing and reading the documentation for examples. Furthermore, just remember, all methods accept boolean arrays. Its a matter of knowing which one accepts labels and which one integers (the first letter of the method should give it away!)

DataFrame slicing is a basic yet very important skill to have as a data analyst or scientist. Acquiring and developing these skills are extremely beneficial when choosing the most relevant and appropriate data to explore and, later on, model. Hopefully by now you have a better understanding on how to slice DataFrames in Pandas. These skills can later be applied (in conjunction with other Python packages) to produce something like the following:

PLEASE MASK UP!