Pandas df.explode(): Unnesting Series and DataFrame Columns
LearnDataSci is reader-supported. When you purchase through links on our site, earned commissions help support our team of writers, researchers, and designers at no extra cost to you.
Pandas' explode()
flattens nested Series objects and DataFrame columns by unfurling the list-like values and spreading their content to multiple rows.
Let's have a quick look. Take the DataFrame below:
Student | Subject | |
---|---|---|
0 | Kelly Osborn | [Calculus, Programming] |
1 | Jade Reed | [Biology, Chemistry, Physics] |
We can call explode()
to unpack the values under Subject
, like so:
Student | Subject | |
---|---|---|
0 | Kelly Osborn | Calculus |
0 | Kelly Osborn | Programming |
1 | Jade Reed | Biology |
1 | Jade Reed | Chemistry |
1 | Jade Reed | Physics |
How does explode() work?
Having multiple values bunched up in one cell (in a _list-like_ form) can create a challenge for analysis. explode()
(adopted by Pandas in version 0.25.0) tackles this particular problem.
explode()
has two parameters:
column
- specifies the column(s) to be exploded. It is either a column name or a bracketed list of column names.ignore_index
- decides whether the original indexes are to be reset. It isFalse
by default.
Let's work on an example. Take the DataFrame below:
name | math101 | comp101 | |
---|---|---|---|
0 | Jack | [35, 67] | [50, 73] |
1 | David | [] | [20, 40] |
Let's separate the comp101
column as a Series of its own:
We can apply explode()
to it directly without any parameters:
Notice that even though we now have numbers populating the cells, the dtype
stays object
. This is the case for every column explode()
produces.
Now, let's run explode()
on the DataFrame itself. In this case, we have to specify the column to be exploded:
name | math101 | comp101 | |
---|---|---|---|
0 | Jack | 35 | [50, 73] |
0 | Jack | 67 | [50, 73] |
1 | David | NaN | [20, 40] |
As you can see, the first row repeated itself for each value of math101
, and the empty list of the second row got replaced with NaN
.
In this DataFrame, the row labels do not hold any particular information. We could reset them by passing True
to ignore_index
. Like so:
name | math101 | comp101 | |
---|---|---|---|
0 | Jack | [35, 67] | 50 |
1 | Jack | [35, 67] | 73 |
2 | David | [] | 20 |
3 | David | [] | 40 |
The ignore_index=True
creates numeric indices from 0 to n.
Exploding List-Looking Strings
The previous section defined the explode()
behavior as 'unpacking list-like values'. explode()
, more precisely, works on Python lists, tuples and sets, Pandas Series, and Numpy n-dimensional arrays. It does not affect Strings, and this is known to cause problems.
Take the DataFrame below:
movie | genre | |
---|---|---|
0 | Memento | ['Thriller', 'Mistery'] |
1 | Casablanca | ['Drama', 'Romance', 'War'] |
Let's try to explode movies
' genre
column:
movie | genre | |
---|---|---|
0 | Memento | ['Thriller', 'Mistery'] |
1 | Casablanca | ['Drama', 'Romance', 'War'] |
explode()
fails us here without any apparent error.
This happens -often when working on imported data. Here, it is easy to pinpoint the problem since we intentionally filled the genre
column with list-looking Strings.
We must convert these String values to lists before running explode()
. Since the Strings under genre
have the exact form of a Python list, we can use ast.literal_eval
like so:
movie | genre | |
---|---|---|
0 | Memento | [Thriller, Mistery] |
1 | Casablanca | [Drama, Romance, War] |
And now, we should be able to explode the genre
column:
movie | genre | |
---|---|---|
0 | Memento | Thriller |
0 | Memento | Mistery |
1 | Casablanca | Drama |
1 | Casablanca | Romance |
1 | Casablanca | War |
literal_eval
only applies, however, to Strings that have the _literal form_ of a Python list, set, or tuple.
Let's take a DataFrame where the data is slightly different in structure:
book | tags | |
---|---|---|
0 | Little Women | classics,historical,young adult |
1 | Jane Eyre | classics,romance, gothic |
Here, we can use str.split
to transform the values in the tags
column.
Like so:
book | tags | |
---|---|---|
0 | Little Women | [classics, historical, young adult] |
1 | Jane Eyre | [classics, romance, gothic] |
And now we explode it:
book | tags | |
---|---|---|
0 | Little Women | classics |
0 | Little Women | historical |
0 | Little Women | young adult |
1 | Jane Eyre | classics |
1 | Jane Eyre | romance |
1 | Jane Eyre | gothic |
Exploding Multiple Columns at Once
We can explode more than one column at a time, given their list-like values on each row match in length.
Let's create a new midterms
DataFrame:
name | calc | bio | chem | |
---|---|---|---|---|
0 | Nia | [80, 88] | [80, 43] | [] |
1 | Millie | [45, 50] | [78, 50] | [50, 67] |
Since calc
and bio
entries in each row match in length, we can explode these columns together:
name | calc | bio | chem | |
---|---|---|---|---|
0 | Nia | 80 | 80 | [] |
0 | Nia | 88 | 43 | [] |
1 | Millie | 45 | 78 | [50, 67] |
1 | Millie | 50 | 50 | [50, 67] |
However, we cannot explode the chem
column along with the others since its first value is an empty list, while others in the same row hold two elements each.
Let's try it, though:
Note that the names must be enclosed in brackets when exploding multiple columns. Otherwise, Pandas throws no error yet only changes the first column whose name was given:
Summary
explode()
unnests the multi-value cells in a given Series or DataFrame column, transforming wide-format data into long-format.