Advanced

Strings

pandas has tons of features for manipulating Series of strings.

Setup
cajun = pd.Series([
    'gumbo', 
    'crawfish boil', 
    'Mardi Gras', 
    'pirogue', 
    pd.NA, 
    'Zatarains'
], dtype='string')  # (1)!

print(cajun)
# 0            gumbo
# 1    crawfish boil
# 2       Mardi Gras
# 3          pirogue
# 4             <NA>
# 5        Zatarains
# dtype: string
  1. dtype='string' tells pandas to use its StringDtype extension type, which allows a Series of strings to include missing values (shown as <NA>).

String Methods

cajun.str returns a StringMethods object from which we can call numerous string processing methods.
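
If you're curious, you can check the accessor type yourself. A quick sketch; the exact module path varies across pandas versions.

type(cajun.str)
# <class 'pandas.core.strings.accessor.StringMethods'>  (module path may differ by version)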

String Indexing

You can use string indexing to select characters by position.

For example, cajun.str[0] selects the first character in each string.

cajun.str[0]
# 0       g
# 1       c
# 2       M
# 3       p
# 4    <NA>
# 5       Z
# dtype: string

cajun.str[:2] uses slicing to select all characters up to (but excluding) the third character in each string.

cajun.str[:2]
# 0      gu
# 1      cr
# 2      Ma
# 3      pi
# 4    <NA>
# 5      Za
# dtype: string

cajun.str[-1] uses negative indexing to select the last character in each string.

cajun.str[-1]
# 0       o
# 1       l
# 2       s
# 3       e
# 4    <NA>
# 5       s
# dtype: string

upper()

converts a Series of strings to upper case.

cajun.str.upper()
# 0            GUMBO
# 1    CRAWFISH BOIL
# 2       MARDI GRAS
# 3          PIROGUE
# 4             <NA>
# 5        ZATARAINS
# dtype: string

lower()

converts a Series of strings to lower case.

cajun.str.lower()
# 0            gumbo
# 1    crawfish boil
# 2       mardi gras
# 3          pirogue
# 4             <NA>
# 5        zatarains
# dtype: string

len()

returns the number of characters in each string.

cajun.str.len()
# 0       5
# 1      13
# 2      10
# 3       7
# 4    <NA>
# 5       9
# dtype: Int64

split()

splits each string on a specified delimiter and puts the resulting substrings in a list.

cajun.str.split(' ')
# 0             [gumbo]
# 1    [crawfish, boil]
# 2       [Mardi, Gras]
# 3           [pirogue]
# 4                <NA>
# 5         [Zatarains]
# dtype: object

Chaining that with .str.get() lets us pick out the ith substring from each list. For example,

# get the first substring after splitting by ' '
cajun.str.split(' ').str.get(0)
# 0        gumbo
# 1     crawfish
# 2        Mardi
# 3      pirogue
# 4         <NA>
# 5    Zatarains
# dtype: object

replace()

replaces part of a string with another string. Here we replace all spaces with dashes.

cajun.str.replace(pat=' ', repl='-', regex=False)
# 0            gumbo
# 1    crawfish-boil
# 2       Mardi-Gras
# 3          pirogue
# 4             <NA>
# 5        Zatarains
# dtype: string

Whether replace() treats pat as a regular expression by default depends on your pandas version (older versions defaulted to regex=True; pandas 2.0+ defaults to regex=False), so it's safest to pass regex explicitly, as we did above.

What's a regular expression?

A regular expression is a widely used pattern syntax for advanced string matching.
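
To see replace() with an actual regular expression, here's a small sketch (not from the original, using the same cajun Series) that masks every lowercase vowel:

cajun.str.replace(r'[aeiou]', '*', regex=True)
# 0            g*mb*
# 1    cr*wf*sh b**l
# 2       M*rd* Gr*s
# 3          p*r*g**
# 4             <NA>
# 5        Z*t*r**ns
# dtype: string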

cat()

concatenates a Series of strings together using a specified separator

cajun.str.cat(sep='_')
# 'gumbo_crawfish boil_Mardi Gras_pirogue_Zatarains'

or concatenates a Series of strings with another same-sized Series or list of strings.

cajun.str.cat(['1', '2', '3', '4', '5', '6'], sep=' ')
# 0            gumbo 1
# 1    crawfish boil 2
# 2       Mardi Gras 3
# 3          pirogue 4
# 4               <NA>
# 5        Zatarains 6
# dtype: string

startswith()

checks if each string starts with a specified string.

cajun.str.startswith("p")
# 0    False
# 1    False
# 2    False
# 3     True
# 4     <NA>
# 5    False
# dtype: boolean

endswith()

checks if each string ends with a specified string.

cajun.str.endswith("s")
# 0    False
# 1    False
# 2     True
# 3    False
# 4     <NA>
# 5     True
# dtype: boolean

contains()

checks whether each string contains a specified string or regular expression.

cajun.str.contains('bo', regex=False)
# 0     True
# 1     True
# 2    False
# 3    False
# 4     <NA>
# 5    False
# dtype: boolean
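
Since contains() also accepts regular expressions, a small sketch like this (assumed, not from the original) flags the strings containing a capital letter:

cajun.str.contains('[A-Z]', regex=True)
# 0    False
# 1    False
# 2     True
# 3    False
# 4     <NA>
# 5     True
# dtype: boolean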

extract()

extracts the first matching substring using a regular expression with at least one capture group.

For example, here we extract the first word that starts with a capital letter.

cajun.str.extract(r'(\b[A-Z][a-z]+\b)')
#            0
# 0       <NA>
# 1       <NA>
# 2      Mardi
# 3       <NA>
# 4       <NA>
# 5  Zatarains

extractall()

extracts all matching substrings using a regular expression with at least one capture group.

cajun.str.extractall(r'(\b[A-Z][a-z]+\b)')
#                  0
#   match           
# 2 0          Mardi
#   1           Gras
# 5 0      Zatarains

String Addition

You can add a prefix or suffix to each element in a Series of strings by combining the Series with plain strings using the + operator.

'i like ' + cajun + ' a lot'
# 0            i like gumbo a lot
# 1    i like crawfish boil a lot
# 2       i like Mardi Gras a lot
# 3          i like pirogue a lot
# 4                          <NA>
# 5        i like Zatarains a lot
# dtype: string

Dates and Times

The workhorse datetime type in pandas is Timestamp, which is essentially a wrapper around NumPy's datetime64 type. Like NumPy, pandas doesn't have a dedicated date type, so if you want to represent a date like "January 3rd, 2001", you'd typically use the Timestamp 2001-01-03 00:00:00.

to_datetime()

You can use the to_datetime() function to make a Timestamp in pandas. Usually, you'll pass in a string like this.

pd.to_datetime('2020-03-01')
# Timestamp('2020-03-01 00:00:00')
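
A Timestamp behaves much like a standard Python datetime, so you can read components straight off of it. A small sketch:

ts = pd.to_datetime('2020-03-01')
ts.year, ts.month, ts.day_name()
# (2020, 3, 'Sunday')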

Timestamp Limits

Note that unlike NumPy, pandas Timestamps are stored with nanosecond units by default, which bounds them to roughly the years 1677 through 2262. If you try to go outside that range, you'll get an error

pd.to_datetime('2263-01-01')
# Out of bounds nanosecond timestamp: 2263-01-01 00:00:00

Timezone Aware Timestamps

Unlike NumPy, pandas Timestamps can be timezone aware. To make a timezone aware Timestamp, use the tz_localize() method.

mar8 = pd.to_datetime('March 8, 2001 3PM').tz_localize('America/Los_Angeles')
mar8
# Timestamp('2001-03-08 15:00:00-0800', tz='America/Los_Angeles')

The timezone information is stored in the .tz attribute of the Timestamp.

mar8.tz
# <DstTzInfo 'America/Los_Angeles' PST-1 day, 16:00:00 STD>

To change the timezone, use the tz_convert() method.

mar8.tz_convert('Europe/London')
# Timestamp('2001-03-08 23:00:00+0000', tz='Europe/London')

strptime format

The to_datetime() function smartly interprets various datetime formats. For example,

pd.to_datetime('March 8, 2001 3PM')  # Timestamp('2001-03-08 15:00:00')
pd.to_datetime('2001-03-08 3PM')     # Timestamp('2001-03-08 15:00:00')
pd.to_datetime('2001-03-08 15')      # Timestamp('2001-03-08 15:00:00')

Some formats can't be interpreted automatically, and even when they can, it's safer (and often faster) to state the format explicitly. To do this, pass a strptime format string to the format parameter.

pd.to_datetime('March 8, 2001 3PM', format='%B %d, %Y %I%p')
# Timestamp('2001-03-08 15:00:00')
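
An explicit format also resolves ambiguity. For instance (an assumed illustration), '08/03/2001' could mean August 3rd or March 8th; the format string settles it:

pd.to_datetime('08/03/2001', format='%d/%m/%Y')
# Timestamp('2001-03-08 00:00:00')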

Series of datetimes

If we pass a list of datetime strings into the to_datetime() function, we get back a DatetimeIndex that we could use as the index of a Series or DataFrame.

pd.to_datetime(['2020-01-01', pd.NA, '2020-01-03'])
# DatetimeIndex(['2020-01-01', 'NaT', '2020-01-03'], dtype='datetime64[ns]', freq=None)

If we wrap that with pd.Series(), we get back a Series of datetime64s.

pd.Series(pd.to_datetime(['2020-01-01', pd.NA, '2020-01-03']))
# 0   2020-01-01
# 1          NaT
# 2   2020-01-03
# dtype: datetime64[ns]

Add time to a Series of datetimes

If you want to add or subtract time from a Series of datetimes, one way is to use a Timedelta, specifying a value and a unit.

x = pd.Series(pd.to_datetime(['2020-01-01', pd.NA, '2020-01-03']))
x + pd.Timedelta(value=12, unit='H')
# 0   2020-01-01 12:00:00
# 1                   NaT
# 2   2020-01-03 12:00:00
# dtype: datetime64[ns]

Tip

If you subtract two Timestamps, you get back a Timedelta.

pd.to_datetime('2020-03-02') - pd.to_datetime('2020-03-01')
# Timedelta('1 days 00:00:00')

Sequence of datetimes

You can use the date_range() function to build a sequence of datetimes based on a regular interval. For example, you can make a DatetimeIndex with 10 values starting on March 8th 2020, iterating by 12 hours like so.

pd.date_range('2020-03-08', periods=10, freq='12H')
# DatetimeIndex(['2020-03-08 00:00:00', '2020-03-08 12:00:00',
#                '2020-03-09 00:00:00', '2020-03-09 12:00:00',
#                '2020-03-10 00:00:00', '2020-03-10 12:00:00',
#                '2020-03-11 00:00:00', '2020-03-11 12:00:00',
#                '2020-03-12 00:00:00', '2020-03-12 12:00:00'],
#               dtype='datetime64[ns]', freq='12H')

If we do the same thing with tz='America/Denver', the output changes slightly: daylight saving time in Denver began at 2AM on March 8th, 2020, so clocks jumped from 1:59AM straight to 3:00AM, skipping the 2 o'clock hour entirely.

pd.date_range('2020-03-08', periods=10, freq='12H', tz='America/Denver')
# DatetimeIndex(['2020-03-08 00:00:00-07:00', '2020-03-08 13:00:00-06:00',
#                '2020-03-09 01:00:00-06:00', '2020-03-09 13:00:00-06:00',
#                '2020-03-10 01:00:00-06:00', '2020-03-10 13:00:00-06:00',
#                '2020-03-11 01:00:00-06:00', '2020-03-11 13:00:00-06:00',
#                '2020-03-12 01:00:00-06:00', '2020-03-12 13:00:00-06:00'],
#               dtype='datetime64[ns, America/Denver]', freq='12H')

DatetimeIndex

Here we make a series of stock prices with a DatetimeIndex.

foostock = pd.Series(
    data = np.round(np.random.normal(size=10), 2),
    index = pd.date_range('2020-03-08', periods=10, freq='12H', tz='America/Denver')
)

print(foostock)
# 2020-03-08 00:00:00-07:00   -1.30
# 2020-03-08 13:00:00-06:00   -0.74
# 2020-03-09 01:00:00-06:00   -0.51
# 2020-03-09 13:00:00-06:00   -0.39
# 2020-03-10 01:00:00-06:00   -1.01
# 2020-03-10 13:00:00-06:00    0.46
# 2020-03-11 01:00:00-06:00    1.13
# 2020-03-11 13:00:00-06:00   -1.23
# 2020-03-12 01:00:00-06:00    0.15
# 2020-03-12 13:00:00-06:00    0.38
# Freq: 12H, dtype: float64

We can index the series in powerful ways. For example,

Get stock prices on March 9th, 2020

foostock.loc['2020-03-09']
# 2020-03-09 01:00:00-06:00   -0.51
# 2020-03-09 13:00:00-06:00   -0.39
# Freq: 12H, dtype: float64

Get all stock prices in March

foostock.loc['2020-03']
# 2020-03-08 00:00:00-07:00   -1.30
# 2020-03-08 13:00:00-06:00   -0.74
# 2020-03-09 01:00:00-06:00   -0.51
# 2020-03-09 13:00:00-06:00   -0.39
# 2020-03-10 01:00:00-06:00   -1.01
# 2020-03-10 13:00:00-06:00    0.46
# 2020-03-11 01:00:00-06:00    1.13
# 2020-03-11 13:00:00-06:00   -1.23
# 2020-03-12 01:00:00-06:00    0.15
# 2020-03-12 13:00:00-06:00    0.38
# Freq: 12H, dtype: float64

Get the stock price at exactly 1AM on March 9th, Denver time

foostock.loc[pd.to_datetime('2020-03-09 1:00:00').tz_localize('America/Denver')]
# -0.51

Get stock prices between March 9th and March 10th

foostock.loc['2020-03-09':'2020-03-10'] # (1)!
# 2020-03-09 01:00:00-06:00   -0.51
# 2020-03-09 13:00:00-06:00   -0.39
# 2020-03-10 01:00:00-06:00   -1.01
# 2020-03-10 13:00:00-06:00    0.46
# Freq: 12H, dtype: float64

  1. Note that both endpoints are included.

Categoricals

One of pandas' best features is its ability to represent categorical data using the Categorical type, much like the factor data type in R. A Categorical takes on a limited, and usually fixed, number of possible values (i.e. categories).

Setup
Suppose we have a dataset representing four cars. For each car, we have its VIN number, color, and size classification:

| VIN       | color | size     |
|:----------|:------|:---------|
| AX193Q43  | red   | standard |
| Z11RTV201 | blue  | mini     |
| WA4Q3371  | red   | standard |
| QWP77491  | green | extended |

How should we store these data?

Unique Ids

Categoricals are not appropriate for storing unique identifiers. In the case of our cars dataset, we'd store the VIN numbers using a plain ole Series of strings.

VINs = pd.Series(['AX193Q43', 'Z11RTV201', 'WA4Q3371', 'QWP77491'], dtype='string')

print(VINs)
# 0     AX193Q43
# 1    Z11RTV201
# 2     WA4Q3371
# 3     QWP77491
# dtype: string

Unordered Categories

colors is a classic candidate for categorical data since we have a limited set of colors and a single color can be a feature of multiple cars.

To build a Categorical, use the pd.Categorical() function, very similar to the pd.Series() function.

colors = pd.Categorical(values = ['red', 'blue', 'red', 'green'])

print(colors)  # (1)!
# ['red', 'blue', 'red', 'green']
# Categories (3, object): ['blue', 'green', 'red']
  1. Printing a categorical displays its values and its unique categories.

Note that this is not a Series, but you could make it a Series just by wrapping it in pd.Series(colors).

pd.Series(colors)
# 0      red
# 1     blue
# 2      red
# 3    green
# dtype: category
# Categories (3, object): ['blue', 'green', 'red']
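
Going the other direction, you can convert an existing Series of strings to categorical dtype with astype('category') (a quick sketch, not from the original):

pd.Series(['red', 'blue', 'red', 'green']).astype('category')
# 0      red
# 1     blue
# 2      red
# 3    green
# dtype: category
# Categories (3, object): ['blue', 'green', 'red']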

By default, when you build a categorical, pandas sets the categories as the unique, non-missing values in the data. To explicitly define the categories, use the categories argument. This is especially useful if your collection includes categories not observed in the data.

colors = pd.Categorical(
    values = ['red', 'blue', 'red', 'green'],
    categories = ['black', 'blue', 'green', 'orange', 'red', 'yellow']
)

print(colors)
# ['red', 'blue', 'red', 'green']
# Categories (6, object): ['black', 'blue', 'green', 'orange', 'red', 'yellow']

The categories parameter also lets you control the order in which categories are displayed, which can be handy for things like plots or reports. If you don't specify the categories parameter, pandas displays them in lexical (alphabetical) order. Alternatively, you can define the categories in a specific order. For example, here we specify colors in the order "bright to dark".

colors = pd.Categorical(
    values = ['red', 'blue', 'red', 'green'],
    categories = ['yellow', 'orange', 'red', 'green', 'blue', 'black']
)

print(colors)
# ['red', 'blue', 'red', 'green']
# Categories (6, object): ['yellow', 'orange', 'red', 'green', 'blue', 'black']

Ordered Categories

sizes is similar to colors, but unlike colors, sizes has an inherent order: 'mini' is less than 'standard', which is less than 'extended'. To define an ordered Categorical,

  • pass the ordered categories into the categories argument and
  • set ordered=True
sizes = pd.Categorical(
    values = ['standard', 'mini', 'standard', 'extended'],
    categories = ['mini', 'standard', 'extended'],
    ordered = True
)

print(sizes) # (1)!
# ['standard', 'mini', 'standard', 'extended']
# Categories (3, object): ['mini' < 'standard' < 'extended']
  1. Categories are reported with < "less than" symbols indicating that they have a meaningful order.

With an ordered Categorical, you can do things like compare sizes < 'extended' and get back a boolean array.

sizes < 'extended'
# array([ True,  True,  True, False])
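
Ordering also makes reductions like min() and max() meaningful (a small sketch):

sizes.min(), sizes.max()
# ('mini', 'extended')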

Categoricals don't have .loc or .iloc accessors. To subset sizes to the values less than 'extended', use square bracket notation, just like indexing a NumPy array.

sizes[sizes < 'extended']
# ['standard', 'mini', 'standard']
# Categories (3, object): ['mini' < 'standard' < 'extended']

You could, however, create a Series with a CategoricalIndex.

sizesSeries = pd.Series(
    data = [0,1,2,3],
    index = pd.CategoricalIndex(sizes)
)

print(sizesSeries)
# standard    0
# mini        1
# standard    2
# extended    3
# dtype: int64

In which case you could do

sizesSeries.loc['mini']
# 1

get_dummies()

Another really cool benefit of using Categoricals is that you can one-hot-encode them using pandas' get_dummies() function. For example,

pd.get_dummies(sizes, prefix = 'size') # (1)!
#    size_mini  size_standard  size_extended
# 0          0              1              0
# 1          1              0              0
# 2          0              1              0
# 3          0              0              1
  1. This is a common input format for machine learning models.

MultiIndex

A MultiIndex, or hierarchical index, is an index that uses multiple values in its keys.

There are a few different ways to make a MultiIndex from scratch. Perhaps the most intuitive way is to use the pd.MultiIndex.from_tuples() constructor, passing in a list of tuples. We'll do that here to generate a MultiIndex of (store, product) pairs.

store_products = pd.DataFrame(
    data = {'Price': [35.25, 45.00, 23.50, 1.95, 29.99, 35.65]},
    index = pd.MultiIndex.from_tuples([
        ('super store', 'basketball'), ('super store', 'football'), ('super store', 'soccerball'),
        ('sports dorks', 'golfball'), ('sports dorks', 'basketball'), ('sports dorks', 'football')
    ], names=['store', 'product'])
)

print(store_products)
#                          Price
# store        product          
# super store  basketball  35.25
#              football    45.00
#              soccerball  23.50
# sports dorks golfball     1.95
#              basketball  29.99
#              football    35.65
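
Another common constructor is pd.MultiIndex.from_product(), which builds the Cartesian product of the level values. A sketch (the level values here are just for illustration):

pd.MultiIndex.from_product(
    [['super store', 'sports dorks'], ['basketball', 'football']],
    names=['store', 'product']
)
# a MultiIndex of all four (store, product) combinations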

Levels

It's important to understand that our store_products MultiIndex has two levels.

  1. Level 0 is the "store" level
  2. Level 1 is the "product" level

A MultiIndex can have more levels, but in this example we have two.

MultiIndex to RangeIndex and back

You can use the reset_index() method to convert the index levels into columns and go back to a plain ole RangeIndex.

store_products.reset_index(inplace=True)

print(store_products)
#           store     product  Price
# 0   super store  basketball  35.25
# 1   super store    football  45.00
# 2   super store  soccerball  23.50
# 3  sports dorks    golfball   1.95
# 4  sports dorks  basketball  29.99
# 5  sports dorks    football  35.65

To revert to a DataFrame with a MultiIndex, use the set_index() method passing in a list of column names to use as the MultiIndex levels.

store_products.set_index(['store', 'product'], inplace=True)

print(store_products)
#                          Price
# store        product          
# super store  basketball  35.25
#              football    45.00
#              soccerball  23.50
# sports dorks golfball     1.95
#              basketball  29.99
#              football    35.65

Indexing a MultiIndex

To select the rows with the key ('sports dorks', 'golfball') or ('super store', 'football'), you can use DataFrame.loc, passing in a list of tuples.

store_products.loc[[('sports dorks', 'golfball'), ('super store', 'football')]]
#                        Price
# store        product        
# sports dorks golfball   1.95
# super store  football  45.00

To select every row for 'sports dorks', you can use DataFrame.loc passing in the key 'sports dorks'.

store_products.loc[['sports dorks']]
#                          Price
# store        product          
# sports dorks golfball     1.95
#              basketball  29.99
#              football    35.65

By contrast, store_products.loc[['football']] does not return all rows where product equals 'football'.

store_products.loc[['football']]
# KeyError: "['football'] not in index"

That's because DataFrame.loc only scans the first MultiIndex level.

To fetch all rows where product equals 'football', use the DataFrame.xs() method.

store_products.xs(key='football', level='product')
#               Price
# store              
# super store   45.00
# sports dorks  35.65

To retain both the store and product indexes in the result, set the drop_level argument to False.

store_products.xs(key='football', level='product', drop_level=False)
#                        Price
# store        product        
# super store  football  45.00
# sports dorks football  35.65
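
An alternative to xs() is label-based slicing with pd.IndexSlice; this style generally expects a sorted index, so we sort first (a sketch, not from the original):

idx = pd.IndexSlice
store_products.sort_index().loc[idx[:, 'football'], :]
#                        Price
# store        product        
# sports dorks football  35.65
# super store  football  45.00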

MultiIndex after groupby aggregate

Now let's take a look at a groupby aggregate operation since this is a more common way MultiIndexes arise in practice. We'll start by building a DataFrame with 7 rows and 4 columns.

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'B': [False, True, False, True, True, True, True],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
    'D': [50, 30, 30, 90, 10, 20, 10]
})

print(df)
#      A      B    C   D
# 0  foo  False  2.1  50
# 1  bar   True  1.9  30
# 2  foo  False  3.6  30
# 3  bar   True  4.0  90
# 4  bar   True  1.9  10
# 5  foo   True  7.8  20
# 6  foo   True  2.8  10

Then we'll set stew equal to the following groupby aggregate operation.

stew = df.groupby(by=['A', 'B']).agg({'C': ['sum'], 'D': ['sum', 'mean']})

print(stew)
#               C    D           
#             sum  sum       mean
# A   B                          
# bar True    7.8  130  43.333333
# foo False   5.7   80  40.000000
#     True   10.6   30  15.000000

In this case, the row index is a MultiIndex

stew.index
# MultiIndex([('bar',  True),
#             ('foo', False),
#             ('foo',  True)],
#            names=['A', 'B'])

and so is the column index.

stew.columns
# MultiIndex([('C',  'sum'),
#             ('D',  'sum'),
#             ('D', 'mean')],
#            )

We can use DataFrame.xs() to select all columns representing a sum.

stew.xs(
    key='sum',        # (1)!
    axis=1,           # (2)!
    level=1,          # (3)!
    drop_level=False  # (4)!
)
#               C    D
#             sum  sum
# A   B               
# bar True    7.8  130
# foo False   5.7   80
#     True   10.6   30
  1. Select columns with the key 'sum'
  2. Select columns (not rows axis=0)
  3. Search for the key within level 1 (the second level of the MultiIndex)
  4. Retain this level in the result

We could select D columns by specifying key='D' and level=0.

stew.xs(key='D', axis=1, level=0, drop_level=False)
#              D           
#            sum       mean
# A   B                    
# bar True   130  43.333333
# foo False   80  40.000000
#     True    30  15.000000

There are lots of ways to flatten the columns of a MultiIndex into a regular index. Perhaps the easiest is to use the to_flat_index() method. For example,

stew.columns.to_flat_index()
# Index([('C', 'sum'), ('D', 'sum'), ('D', 'mean')], dtype='object')

This returns a plain Index object that behaves like a list of tuples. With a basic list comprehension, we can convert it to a list of strings

['_'.join(s) for s in stew.columns.to_flat_index()]
# ['C_sum', 'D_sum', 'D_mean']

... which we can set equal to stew.columns. This flattens stew's column MultiIndex.

stew.columns = ['_'.join(s) for s in stew.columns.to_flat_index()]

print(stew)
#            C_sum  D_sum     D_mean
# A   B                             
# bar True     7.8    130  43.333333
# foo False    5.7     80  40.000000
#     True    10.6     30  15.000000

DataFrame Reshaping

Suppose you have a DataFrame like this

|    | row   | col   |   val |
|---:|:------|:------|------:|
|  0 | row0  | col1  |    44 |
|  1 | row1  | col1  |    47 |
|  2 | row2  | col1  |    64 |
|  3 | row0  | col0  |    67 |
|  4 | row1  | col0  |    67 |
|  5 | row2  | col0  |     9 |

There's a good chance you'll want to reshape it like this

| row   |   col0 |   col1 |
|:------|-------:|-------:|
| row0  |     67 |     44 |
| row1  |     67 |     47 |
| row2  |      9 |     64 |

Here, we converted the data from long format to wide format. Some people would say we converted the data from stacked format to unstacked format. For better or worse, there are a ton of ways to do this with pandas. Perhaps the simplest method is to use the pivot() method of a DataFrame...

pivot()

Given the following DataFrame in long format,

df = pd.DataFrame({
    'row': ['row0', 'row1', 'row2', 'row0', 'row1', 'row2'],
    'col': ['col1', 'col1', 'col1', 'col0', 'col0', 'col0'],
    'val': [44, 47, 64, 67, 67,  9]
})

print(df)
#     row   col  val
# 0  row0  col1   44
# 1  row1  col1   47
# 2  row2  col1   64
# 3  row0  col0   67
# 4  row1  col0   67
# 5  row2  col0    9

We can reshape it to wide format using df.pivot(), telling pandas which columns we want to use for the row index, column index, and values.

df.pivot(index='row', columns='col', values='val')
# col   col0  col1
# row             
# row0    67    44
# row1    67    47
# row2     9    64

Now suppose df had slightly different data...

df = pd.DataFrame({
    'row': ['row0', 'row0', 'row2', 'row2', 'row1', 'row1'],
    'col': ['col1', 'col1', 'col1', 'col0', 'col0', 'col0'],
    'val': [44, 47, 64, 67, 67, 9]
})
#     row   col  val
# 0  row0  col1   44
# 1  row0  col1   47
# 2  row2  col1   64
# 3  row2  col0   67
# 4  row1  col0   67
# 5  row1  col0    9

This time, df.pivot() generates an error.

df.pivot(index='row', columns='col', values='val')
# ValueError: Index contains duplicate entries, cannot reshape

The error occurs because there are two instances where multiple values in the input map to the exact same position in the output, and we haven't told pandas how to deal with that. This is where the more general pivot_table() method comes in...

pivot_table()

pivot_table() works just like pivot(), but you can specify an aggregation function that tells pandas how to deal with multiple values mapping to the same place.

For example, here we convert df from tall to wide format using pivot_table() with aggfunc=list.

print(df) # (1)!
#     row   col  val
# 0  row0  col1   44
# 1  row0  col1   47
# 2  row2  col1   64
# 3  row2  col0   67
# 4  row1  col0   67
# 5  row1  col0    9

pivotted = df.pivot_table(
    index='row',
    columns='col',
    values='val',
    aggfunc=list
)

print(pivotted)
# col      col0      col1
# row                    
# row0      NaN  [44, 47]
# row1  [67, 9]       NaN
# row2     [67]      [64]
  1. df = pd.DataFrame({
        'row': ['row0', 'row0', 'row2', 'row2', 'row1', 'row1'],
        'col': ['col1', 'col1', 'col1', 'col0', 'col0', 'col0'],
        'val': [44, 47, 64, 67, 67, 9]
    })
    

Notice that 67 and 9 mapped to the same cell in the output because they had the same row and col values in the input. aggfunc tells pandas how to deal with those values, in this case wrapping them into a list. Also note that the combinations (row0, col0) and (row1, col1) didn't exist in df, so pandas fills those entries with NaN by default.
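
If you'd rather not see those NaNs, pivot_table() also accepts a fill_value argument (a brief sketch):

df.pivot_table(index='row', columns='col', values='val', aggfunc='sum', fill_value=0)
# col   col0  col1
# row             
# row0     0    91
# row1    76     0
# row2    67    64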

You're probably not interested in pivoting data just to build lists of values. More likely, you'll use aggregation functions like mean(), sum(), size(), count(), or a combination of them.

This works just like doing normal DataFrame aggregation. So, for example, to get the count and sum of elements mapping to each cell in the output table, use aggfunc=['count', 'sum'].

df.pivot_table(
    index='row',
    columns='col',
    values='val',
    aggfunc=['count', 'sum']
)

#      count        sum      
# col   col0 col1  col0  col1
# row                        
# row0   NaN  2.0   NaN  91.0
# row1   2.0  NaN  76.0   NaN
# row2   1.0  1.0  67.0  64.0

As you could probably guess, pandas supports this kind of pivot operation with multiple columns for the row index, column index, and values. So if you have input like this

toomuch = pd.DataFrame({
    'row_major': ['A', 'A', 'B', 'A', 'B', 'B'],
    'row_minor': ['x', 'x', 'y', 'y', 'z', 'z'],
    'col_major': ['MAMMAL', 'MAMMAL', 'MAMMAL', 'FISH', 'FISH', 'FISH'],
    'col_minor': ['dog', 'cat', 'dog', 'tuna', 'tuna', 'shark'],
    'val0': [44, 47, 64, 67, 67,  9],
    'val1': [91, 52, 86, 83, 79, 92]
})

print(toomuch)
#   row_major row_minor col_major col_minor  val0  val1
# 0         A         x    MAMMAL       dog    44    91
# 1         A         x    MAMMAL       cat    47    52
# 2         B         y    MAMMAL       dog    64    86
# 3         A         y      FISH      tuna    67    83
# 4         B         z      FISH      tuna    67    79
# 5         B         z      FISH     shark     9    92

you can do crazy stuff like this

toomuch.pivot_table(
    index=['row_major', 'row_minor'],
    columns=['col_major', 'col_minor'],
    values=['val0', 'val1'],
    aggfunc=['count', 'sum']
)

#                     count                                           sum                                            
#                      val0                   val1                   val0                     val1                   
# col_major            FISH      MAMMAL       FISH      MAMMAL       FISH       MAMMAL        FISH       MAMMAL      
# col_minor           shark tuna    cat  dog shark tuna    cat  dog shark  tuna    cat   dog shark  tuna    cat   dog
# row_major row_minor                                                                                                
# A         x           NaN  NaN    1.0  1.0   NaN  NaN    1.0  1.0   NaN   NaN   47.0  44.0   NaN   NaN   52.0  91.0
#           y           NaN  1.0    NaN  NaN   NaN  1.0    NaN  NaN   NaN  67.0    NaN   NaN   NaN  83.0    NaN   NaN
# B         y           NaN  NaN    NaN  1.0   NaN  NaN    NaN  1.0   NaN   NaN    NaN  64.0   NaN   NaN    NaN  86.0
#           z           1.0  1.0    NaN  NaN   1.0  1.0    NaN  NaN   9.0  67.0    NaN   NaN  92.0  79.0    NaN   NaN

melt()

Suppose we have a wide DataFrame like this...

wide = pd.DataFrame({
    'test': [1, 2, 3, 4],
    'john': [95, 81, 47, 99],
    'patty': [90, 85, 93, 97]
})
|    |   test |   john |   patty |
|---:|-------:|-------:|--------:|
|  0 |      1 |     95 |      90 |
|  1 |      2 |     81 |      85 |
|  2 |      3 |     47 |      93 |
|  3 |      4 |     99 |      97 |

and we wanted to reshape it into a long DataFrame like this

|    |   test | student   |   score |
|---:|-------:|:----------|--------:|
|  0 |      1 | john      |      95 |
|  1 |      2 | john      |      81 |
|  2 |      3 | john      |      47 |
|  3 |      4 | john      |      99 |
|  4 |      1 | patty     |      90 |
|  5 |      2 | patty     |      85 |
|  6 |      3 | patty     |      93 |
|  7 |      4 | patty     |      97 |

The tool for the job is melt().

wide.melt(
    id_vars='test',
    value_vars=['john', 'patty'],
    var_name='student',
    value_name='score'
)

#    test student  score
# 0     1    john     95
# 1     2    john     81
# 2     3    john     47
# 3     4    john     99
# 4     1   patty     90
# 5     2   patty     85
# 6     3   patty     93
# 7     4   patty     97

The most important parameter here is value_vars, a list of the columns you want to stack. In fact, that's literally all you need to provide for this function to work (although the output isn't very useful without the other arguments).

wide.melt(value_vars=['john', 'patty'])
#   variable  value
# 0     john     95
# 1     john     81
# 2     john     47
# 3     john     99
# 4    patty     90
# 5    patty     85
# 6    patty     93
# 7    patty     97

var_name and value_name tell pandas how to name the new variable and value columns, and id_vars tells pandas which other columns from the original DataFrame you want to keep in the result.

stack()

If you have a DataFrame of daily stock prices like this

df = pd.DataFrame({
    'ABC': [105.30, 107.17, 101.52],
    'DEF': [40.29, 40.97, 51.00],
    'GHI': [70.05, 64.13, 64.88]
    },
    index = pd.DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'])
)

print(df)
#                ABC    DEF    GHI
# 2020-01-01  105.30  40.29  70.05
# 2020-01-02  107.17  40.97  64.13
# 2020-01-03  101.52  51.00  64.88

You can use df.stack() to reshape the DataFrame into a Series, literally stacking the values of ABC, DEF, and GHI within each original row index, resulting in a 2-level row MultiIndex.

df.stack()
# 2020-01-01  ABC    105.30
#             DEF     40.29
#             GHI     70.05
# 2020-01-02  ABC    107.17
#             DEF     40.97
#             GHI     64.13
# 2020-01-03  ABC    101.52
#             DEF     51.00
#             GHI     64.88
# dtype: float64

Things get a little trickier when df has a column MultiIndex. Consider this DataFrame.

pets = pd.DataFrame(
    data=[[497, 1056, 2047, 3595], [352, 922, 3800, 1048], [217, 2017, 3571, 4521]],
    index=pd.Index(['AL', 'MS', 'LA'], name='state'),
    columns=pd.MultiIndex.from_tuples([
        ('cat', 'persian'), ('cat', 'calico'), ('dog', 'corgie'), ('dog', 'lab')
    ], names=['type', 'subtype'])
)

print(pets)
# type        cat           dog      
# subtype persian calico corgie   lab
# state                              
# AL          497   1056   2047  3595
# MS          352    922   3800  1048
# LA          217   2017   3571  4521

stack()'s most important parameter is level. It tells pandas which level(s) of the column index you want to stack into the row index.

Let's see some examples, keeping in mind that our pets DataFrame has two levels in its column index:

  • type with values dog and cat
  • subtype with values persian, calico, corgie, and lab.

If we call pets.stack(level='type'), you can see how the first level, type, gets extracted from the column index and inserted into the row index, and the data re-aligns itself accordingly.

pets.stack(level='type')
# subtype     calico  corgie     lab  persian
# state type                                 
# AL    cat   1056.0     NaN     NaN    497.0
#       dog      NaN  2047.0  3595.0      NaN
# MS    cat    922.0     NaN     NaN    352.0
#       dog      NaN  3800.0  1048.0      NaN
# LA    cat   2017.0     NaN     NaN    217.0
#       dog      NaN  3571.0  4521.0      NaN

Pretty much the same thing happens when you call stack(level='subtype').

pets.stack(level='subtype')
# type              cat     dog
# state subtype                
# AL    calico   1056.0     NaN
#       corgie      NaN  2047.0
#       lab         NaN  3595.0
#       persian   497.0     NaN
# MS    calico    922.0     NaN
#       corgie      NaN  3800.0
#       lab         NaN  1048.0
#       persian   352.0     NaN
# LA    calico   2017.0     NaN
#       corgie      NaN  3571.0
#       lab         NaN  4521.0
#       persian   217.0     NaN
level=-1

By default, stack() uses level=-1 which automatically stacks the last column level into the row index.
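
A quick way to convince yourself (a small sketch): the default call matches stacking the last level explicitly.

pets.stack().equals(pets.stack(level='subtype'))
# True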

level can also be a list like ['type', 'subtype'], in which case both column levels are stacked into the row index and the result is a Series, not a DataFrame.

pets.stack(level=['type', 'subtype'])
# state  type  subtype
# AL     cat   calico     1056.0
#              persian     497.0
#        dog   corgie     2047.0
#              lab        3595.0
# MS     cat   calico      922.0
#              persian     352.0
#        dog   corgie     3800.0
#              lab        1048.0
# LA     cat   calico     2017.0
#              persian     217.0
#        dog   corgie     3571.0
#              lab        4521.0
# dtype: float64

unstack()

Just as pandas provides stack() for stacking column levels into row levels, it provides unstack() for unstacking row levels into column levels.

So, where we have pets.stack(level='type'), we can do almost the inverse operation with .unstack(level='type'),

pets.stack(level='type').unstack(level='type')
# subtype  calico     corgie         lab         persian    
# type        cat dog    cat     dog cat     dog     cat dog
# state                                                     
# AL       1056.0 NaN    NaN  2047.0 NaN  3595.0   497.0 NaN
# LA       2017.0 NaN    NaN  3571.0 NaN  4521.0   217.0 NaN
# MS        922.0 NaN    NaN  3800.0 NaN  1048.0   352.0 NaN

although in this case 'type' becomes the second level of the column index, and all the NaNs created by stack() get retained as columns with unstack().