How to iterate over rows in Pandas: Most efficient options
There are many ways to iterate over rows of a DataFrame or Series in pandas, each with its own pros and cons.
Since pandas is built on top of NumPy, also consider reading through our NumPy tutorial to learn more about working with the underlying arrays.
To demonstrate each row-iteration method, we'll be utilizing the ubiquitous Iris flower dataset, an easy-to-access dataset containing features of different species of flowers.
We'll import the dataset using seaborn and limit it to the top three rows for simplicity:
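A minimal sketch of the setup, assuming seaborn's bundled copy of the dataset and a DataFrame named df:

```python
import seaborn as sns

# Load the bundled Iris dataset and keep only the first three rows
df = sns.load_dataset('iris').head(3)
print(df)
```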
Most straightforward row iteration
The most straightforward method for iterating over rows is with the iterrows() method, like so:
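Something along these lines, assuming the three-row df loaded above:

```python
for index, row in df.iterrows():
    print(index)  # the row's index label
    print(row)    # the row itself, as a Series
```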
iterrows() returns the row index as well as the row itself. Additionally, to improve readability, if you don't care about the index value, you can throw it away with an underscore (_).
Despite its ease of use and intuitive nature, iterrows() is one of the slowest ways to iterate over rows. This article will also look at how you can swap iterrows() for itertuples() or apply() to speed up iteration. Additionally, we'll consider how you can use apply(), NumPy functions, or map() for a vectorized approach to row operations.
Option 1 (worst): iterrows()
Using iterrows() in combination with a dataframe creates what is known as a generator. A generator is an iterable object, meaning we can loop through it.
Let's use iterrows() again, but without pulling out the index in the loop definition:
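A sketch of that loop:

```python
for row in df.iterrows():
    print(row)  # each item is an (index, Series) tuple
```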
You might have noticed that the result looks a little different from the output shown in the intro. Here, we've only assigned the output to the row variable, which now contains both the index and the row in a tuple.
But let's try to extract the sepal length and width from each row:
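A sketch of the failing attempt; because row is a tuple here, indexing it with a string raises a TypeError:

```python
for row in df.iterrows():
    print(row['sepal_length'], row['sepal_width'])
# TypeError: tuple indices must be integers or slices, not str
```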
This error occurs because each item in our iterrows() generator is a tuple with two values: the row index and the row content. If you look at the output at the beginning of the section, the data for each row is inside parentheses, with a comma separating it from the index. By assigning the generator output to two variables, we can unpack the data inside the tuple into a more accessible format.
After unpacking the tuple, we can easily access specific values in each row. In this example, we'll throw away the index variable by using an underscore:
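A sketch using tuple unpacking, with the underscore discarding the index:

```python
for _, row in df.iterrows():
    print(row['sepal_length'], row['sepal_width'])
```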
Row iteration example
A typical machine learning project for beginners is to use the Iris dataset to create a classification model that can predict the species of a flower based on its measurements.
To do this, we'll need to convert the values in the species column of the dataframe into a numerical format. This process is known as label encoding. Plenty of tools exist to do this for us automatically, such as sklearn's LabelEncoder, but let's use iterrows() to do the label encoding ourselves.
First, we need to know the unique species in the dataset:
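Assuming we switch back to the full 150-row dataset rather than the three-row preview, something like:

```python
df = sns.load_dataset('iris')  # reload the full dataset
print(df['species'].unique())
# ['setosa' 'versicolor' 'virginica']
```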
We'll manually assign a number for each species:
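For example, a simple mapping dictionary (the specific numbers are an arbitrary choice):

```python
species_labels = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
```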
After that, we can iterate through the DataFrame and update each row:
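A sketch using iterrows() together with .at to write each encoded value back:

```python
for index, row in df.iterrows():
    df.at[index, 'species'] = species_labels[row['species']]

print(df.head(3))
```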
The output shows that we successfully converted the values. The problem with this approach is that iterrows requires Python to loop through each row in the dataframe. Many datasets have thousands of rows, so we should generally avoid Python-level looping to reduce runtimes.
Before eliminating loops, let's consider a slightly better option than iterrows().
Option 2 (okay): itertuples()
An excellent alternative to iterrows is itertuples, which functions very similarly to iterrows, with the main difference being that itertuples returns named tuples. With a named tuple, you can access specific values as if they were attributes. Thus, in the context of pandas, we can access a row's value for a particular column without needing to unpack the tuple first.
The example below shows how we could use itertuples to label the Iris species:
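A sketch of the same encoding with itertuples, assuming a fresh copy of the dataset and the species_labels dictionary from above:

```python
df = sns.load_dataset('iris')  # start from unencoded values again

for row in df.itertuples():
    # Named-tuple attributes: row.Index is the row label, row.species the column value
    df.at[row.Index, 'species'] = species_labels[row.species]
```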
As the example demonstrates, the setup for itertuples is very similar to iterrows. The only significant differences are that itertuples doesn't require unpacking into index and row, and that the row['column_name'] syntax doesn't work with itertuples. It's also worth mentioning that itertuples is much faster.
As mentioned previously, we should generally avoid looping in pandas. However, there are a few situations where looping may be required; for example, if one of your dataframe columns contained URLs you wanted to visit one at a time with a web scraper. In situations where looping is needed, itertuples is a much better choice than iterrows.
Option 3 (best for most applications): apply()
By using apply and specifying axis=1, we can run a function on every row of a dataframe. This solution also uses looping to get the job done, but apply is better optimized than iterrows, which results in faster runtimes. See below for an example of how we could use apply to label the species in each row.
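A sketch of the row-wise version, again assuming a fresh, unencoded copy of the dataset:

```python
df = sns.load_dataset('iris')

# axis=1 passes each row (as a Series) to the function
df['species'] = df.apply(lambda row: species_labels[row['species']], axis=1)
```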
In this situation, axis=1 is used to specify that we'd like to iterate across the rows of the dataframe. By changing this to axis=0, we could apply a function to each column of the dataframe instead. For example, see below for how we could use apply to get the max value for each column in our dataframe:
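For instance, a sketch along these lines:

```python
# axis=0 passes each column (as a Series) to the function
print(df.apply(lambda col: col.max(), axis=0))
```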
It's worth noting that using axis=0 is much faster, as it acts on all of a column's values at once instead of iterating through rows one at a time. A solution like this, which acts on multiple array values simultaneously, is known as a vectorized solution. Check out the next section to see how we could use a vectorized solution for our label encoding problem.
Option 4 (best for some applications): map()
One option we've got for implementing a vectorized solution is map, which we can use on a Series (such as a dataframe column). As mentioned previously, a vectorized solution is one we can apply to multiple array values simultaneously. By providing map a dictionary as an argument, map will treat column values as dictionary keys and transform them into their corresponding dictionary values. By passing map the species_labels dictionary we've been using, we can solve our label encoding problem:
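A sketch, once more assuming a fresh copy of the dataset:

```python
df = sns.load_dataset('iris')

# Each species string becomes the dictionary value it maps to
df['species'] = df['species'].map(species_labels)
```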
Note that when using map with a dictionary, the dictionary needs to include keys for all possible values in the column you're transforming. Otherwise, pandas will replace any values that don't have a key in the dictionary with a missing (NaN) value.
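For instance, a sketch with a deliberately incomplete dictionary:

```python
partial = {'setosa': 0, 'versicolor': 1}  # no key for 'virginica'

df = sns.load_dataset('iris')
encoded = df['species'].map(partial)
print(encoded.isna().sum())  # the 50 virginica rows all become NaN
```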
Speed testing different options
Let's look at a larger dataset to get a good feel for how much faster a vectorized approach is.
seaborn has another sample dataset called gammas, which contains medical imaging data. More specifically, the data represents blood oxygen levels for different areas of the brain.
The ROI column refers to the region of interest the row's data represents and has the unique values IPS, AG, and V1. The ROI values need to be labeled, so let's implement solutions using iterrows, itertuples, apply, and map while recording the times to get a better idea of the speed differences.
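A sketch of one such timing, applying the same dictionary-based labeling to the ROI column (the label numbers and the roi_labels name are arbitrary choices):

```python
import time

import seaborn as sns

gammas = sns.load_dataset('gammas')
roi_labels = {'IPS': 0, 'AG': 1, 'V1': 2}

start = time.perf_counter()
encoded = gammas['ROI'].map(roi_labels)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f'map: {elapsed_ms:.2f} ms')
```

The timings recorded for all five approaches were: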
| | type | milliseconds |
|---|---|---|
| 4 | map | 7.67 |
| 2 | apply axis=0 | 9.35 |
| 3 | apply axis=1 | 77.74 |
| 1 | itertuples | 82.49 |
| 0 | iterrows | 585.06 |
The results show that apply massively outperforms iterrows. As mentioned previously, this is because apply loops through dataframe rows far more efficiently than iterrows does.
While slower than apply, itertuples is quicker than iterrows, so if looping is required, try implementing itertuples instead.
Using map as a vectorized solution gives even faster results. The gammas dataset is relatively small (only 6,000 rows), but the performance gap between map and apply will widen as dataframes get larger. As a result, vectorized solutions are much more scalable, so you should get used to using them.
Summary
Using iterrows or itertuples to manipulate dataframe rows is an acceptable approach when you're just starting out with dataframes. For a much quicker solution, apply is usually easy to drop in place of iterrows, making it a good quick fix. However, real-life datasets can be very large, so vectorized solutions (such as map) are usually the way to go. As mentioned previously, vectorized approaches scale much better to huge datasets: good scalability means you won't run into long runtimes as your data grows, making vectorized dataframe processing essential when working in the data industry.