Data Preparation

bar_chart_race exposes two functions, prepare_wide_data and prepare_long_data, that transform pandas DataFrames into the form the animation requires.

Wide data

To show how the prepare_wide_data function works, we'll read in the last three rows from the covid19_tutorial dataset.

import bar_chart_race as bcr

df = bcr.load_dataset('covid19_tutorial').tail(3)
df
            Belgium  China  France  Germany  Iran  Italy  Netherlands  Spain    USA  United Kingdom
date
2020-04-10     3019   3340   13215     2767  4232  18849         2520  16081  18595            8974
2020-04-11     3346   3343   13851     2894  4357  19468         2653  16606  20471            9892
2020-04-12     3600   3343   14412     3022  4474  19899         2747  17209  22032           10629

This format is sometimes called 'wide' data because every column holds the same kind of measurement (here, deaths), with one column per country. Each new country adds another column to the DataFrame, making it wider. This is the form of data that the bar_chart_race function requires.

The prepare_wide_data function is what bar_chart_race calls internally, so you don't need to call it directly. However, it is available so that you can view and understand how the data gets prepared. To transition the bars smoothly from one time period to the next, both the length and the position of each bar are interpolated linearly. Two DataFrames of the same shape are returned - one for the values and the other for the ranks.

df_values, df_ranks = bcr.prepare_wide_data(df, steps_per_period=4, 
                                            orientation='h', sort='desc')

Below is the df_values DataFrame, which holds the length of each bar in each frame. Every period except the last now spans four rows: the original row followed by three interpolated ones.

            Belgium    China    France  Germany     Iran     Italy  Netherlands     Spain       USA  United Kingdom
date
2020-04-10  3019.00  3340.00  13215.00  2767.00  4232.00  18849.00      2520.00  16081.00  18595.00         8974.00
2020-04-10  3100.75  3340.75  13374.00  2798.75  4263.25  19003.75      2553.25  16212.25  19064.00         9203.50
2020-04-10  3182.50  3341.50  13533.00  2830.50  4294.50  19158.50      2586.50  16343.50  19533.00         9433.00
2020-04-10  3264.25  3342.25  13692.00  2862.25  4325.75  19313.25      2619.75  16474.75  20002.00         9662.50
2020-04-11  3346.00  3343.00  13851.00  2894.00  4357.00  19468.00      2653.00  16606.00  20471.00         9892.00
2020-04-11  3409.50  3343.00  13991.25  2926.00  4386.25  19575.75      2676.50  16756.75  20861.25        10076.25
2020-04-11  3473.00  3343.00  14131.50  2958.00  4415.50  19683.50      2700.00  16907.50  21251.50        10260.50
2020-04-11  3536.50  3343.00  14271.75  2990.00  4444.75  19791.25      2723.50  17058.25  21641.75        10444.75
2020-04-12  3600.00  3343.00  14412.00  3022.00  4474.00  19899.00      2747.00  17209.00  22032.00        10629.00

The df_ranks DataFrame contains the numerical rank of each country and determines the position of each bar along the y-axis (or the x-axis when the orientation is vertical). Notice that two pairs of bars switch places: Belgium with China, and Italy with USA.

            Belgium  China  France  Germany  Iran  Italy  Netherlands  Spain    USA  United Kingdom
date
2020-04-10     3.00   4.00     7.0      2.0   5.0  10.00          1.0    8.0   9.00             6.0
2020-04-10     3.25   3.75     7.0      2.0   5.0   9.75          1.0    8.0   9.25             6.0
2020-04-10     3.50   3.50     7.0      2.0   5.0   9.50          1.0    8.0   9.50             6.0
2020-04-10     3.75   3.25     7.0      2.0   5.0   9.25          1.0    8.0   9.75             6.0
2020-04-11     4.00   3.00     7.0      2.0   5.0   9.00          1.0    8.0  10.00             6.0
2020-04-11     4.00   3.00     7.0      2.0   5.0   9.00          1.0    8.0  10.00             6.0
2020-04-11     4.00   3.00     7.0      2.0   5.0   9.00          1.0    8.0  10.00             6.0
2020-04-11     4.00   3.00     7.0      2.0   5.0   9.00          1.0    8.0  10.00             6.0
2020-04-12     4.00   3.00     7.0      2.0   5.0   9.00          1.0    8.0  10.00             6.0
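
The interpolation shown in the two tables above can be approximated with plain pandas. This is a minimal sketch, not the library's implementation: the helper name interpolate_wide is hypothetical, ranking with DataFrame.rank and its default settings is an assumption, and only the Belgium and China columns from the data above are used.

```python
import numpy as np
import pandas as pd

# Two columns taken from the covid19_tutorial rows shown above
df = pd.DataFrame(
    {'Belgium': [3019, 3346, 3600], 'China': [3340, 3343, 3343]},
    index=pd.to_datetime(['2020-04-10', '2020-04-11', '2020-04-12']),
)

def interpolate_wide(frame, steps_per_period):
    """Hypothetical helper: linearly interpolate between consecutive rows."""
    n_rows = (len(frame) - 1) * steps_per_period + 1
    spaced = frame.reset_index(drop=True)
    # Place the original rows steps_per_period positions apart ...
    spaced.index = spaced.index * steps_per_period
    # ... then fill the empty rows in between with straight-line values
    return spaced.reindex(np.arange(n_rows)).interpolate()

values = interpolate_wide(df, steps_per_period=4)
# Assumption: rank within each row first, then interpolate the ranks the same way
ranks = interpolate_wide(df.rank(axis=1), steps_per_period=4)

print(values.head(5))
print(ranks.head(5))
```

With only two columns the ranks run between 1 and 2 rather than 3 and 4, but Belgium and China cross over between 2020-04-10 and 2020-04-11 in the same way as in df_ranks.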

Don't use before animation

There is no need to call this function before making the animation if you already have wide data. Pass your original data directly to the bar_chart_race function.

Long data

'Long' data is a format in which all values of the same kind are stored in a single column. Take a look at the baseball data below, which contains the cumulative number of home runs hit by each of the top 20 home run hitters, by year.

df_baseball = bcr.load_dataset('baseball')
df_baseball
               name  year   hr
0        Hank Aaron     0    0
1       Barry Bonds     0    0
2       Jimmie Foxx     0    0
3       Ken Griffey     0    0
4    Reggie Jackson     0    0
...             ...   ...  ...
424       Jim Thome    18  541
425       Jim Thome    19  564
426       Jim Thome    20  589
427       Jim Thome    21  604
428       Jim Thome    22  612

Name, year, and home runs are each in a single column, in contrast to the wide data, where every column held the same type of data. Long data must be converted to wide data by pivoting the categorical column into the columns and placing the period in the index. The prepare_long_data function provides this functionality. It uses the pandas pivot_table method to pivot (and potentially aggregate) the data before passing it to prepare_wide_data. The same two DataFrames are returned.

df_values, df_ranks = bcr.prepare_long_data(df_baseball, index='year', columns='name',
                                            values='hr', steps_per_period=5)
df_values.head(16)

The linearly interpolated values for the first three seasons of each player:

name  Albert Pujols  Alex Rodriguez  Babe Ruth  Barry Bonds  ...  Reggie Jackson  Sammy Sosa  Willie Mays  Willie McCovey
year
0.0             0.0             0.0        0.0          0.0  ...             0.0         0.0          0.0             0.0
0.0             7.4             0.0        0.0          3.2  ...             0.2         0.8          4.0             2.6
0.0            14.8             0.0        0.0          6.4  ...             0.4         1.6          8.0             5.2
0.0            22.2             0.0        0.0          9.6  ...             0.6         2.4         12.0             7.8
0.0            29.6             0.0        0.0         12.8  ...             0.8         3.2         16.0            10.4
1.0            37.0             0.0        0.0         16.0  ...             1.0         4.0         20.0            13.0
1.0            43.8             1.0        0.8         21.0  ...             6.8         7.0         20.8            15.6
1.0            50.6             2.0        1.6         26.0  ...            12.6        10.0         21.6            18.2
1.0            57.4             3.0        2.4         31.0  ...            18.4        13.0         22.4            20.8
1.0            64.2             4.0        3.2         36.0  ...            24.2        16.0         23.2            23.4
2.0            71.0             5.0        4.0         41.0  ...            30.0        19.0         24.0            26.0
2.0            79.6            12.2        4.6         45.8  ...            39.4        21.0         32.2            29.6
2.0            88.2            19.4        5.2         50.6  ...            48.8        23.0         40.4            33.2
2.0            96.8            26.6        5.8         55.4  ...            58.2        25.0         48.6            36.8
2.0           105.4            33.8        6.4         60.2  ...            67.6        27.0         56.8            40.4
3.0           114.0            41.0        7.0         65.0  ...            77.0        29.0         65.0            44.0

The rankings change substantially during this time period.

df_ranks.head(16)
name  Albert Pujols  Alex Rodriguez  Babe Ruth  Barry Bonds  ...  Reggie Jackson  Sammy Sosa  Willie Mays  Willie McCovey
year
0.0            20.0            19.0       18.0         17.0  ...             4.0         3.0          2.0             1.0
0.0            19.8            16.0       15.0         17.0  ...             4.2         4.8          5.2             3.4
0.0            19.6            13.0       12.0         17.0  ...             4.4         6.6          8.4             5.8
0.0            19.4            10.0        9.0         17.0  ...             4.6         8.4         11.6             8.2
0.0            19.2             7.0        6.0         17.0  ...             4.8        10.2         14.8            10.6
1.0            19.0             4.0        3.0         17.0  ...             5.0        12.0         18.0            13.0
1.0            19.2             4.2        3.2         17.0  ...             6.6        11.2         16.6            12.8
1.0            19.4             4.4        3.4         17.0  ...             8.2        10.4         15.2            12.6
1.0            19.6             4.6        3.6         17.0  ...             9.8         9.6         13.8            12.4
1.0            19.8             4.8        3.8         17.0  ...            11.4         8.8         12.4            12.2
2.0            20.0             5.0        4.0         17.0  ...            13.0         8.0         11.0            12.0
2.0            20.0             5.6        3.6         16.6  ...            13.8         7.8         11.6            11.4
2.0            20.0             6.2        3.2         16.2  ...            14.6         7.6         12.2            10.8
2.0            20.0             6.8        2.8         15.8  ...            15.4         7.4         12.8            10.2
2.0            20.0             7.4        2.4         15.4  ...            16.2         7.2         13.4             9.6
3.0            20.0             8.0        2.0         15.0  ...            17.0         7.0         14.0             9.0
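
The pivot step that turns long data into wide data can be sketched with pandas alone. The players A and B and their totals below are made up for illustration, and aggfunc='sum' is an assumption chosen for the sketch rather than a statement about what prepare_long_data uses.

```python
import pandas as pd

# Made-up long data for two hypothetical players, A and B
long_df = pd.DataFrame({
    'name': ['A', 'B', 'A', 'B', 'A', 'B'],
    'year': [0, 0, 1, 1, 2, 2],
    'hr':   [0, 0, 10, 4, 25, 30],
})

# One row per period (year), one column per category (name)
wide_df = long_df.pivot_table(index='year', columns='name',
                              values='hr', aggfunc='sum')
print(wide_df)
```

The result has the same layout as the wide covid19_tutorial data: the period in the index and one column per bar.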

Usage before animation

If you wish to use this function before an animation, set steps_per_period to 1 so that the returned df_values keeps one row per period, then pass df_values to bar_chart_race.

df_values, df_ranks = bcr.prepare_long_data(df_baseball, index='year', columns='name',
                                            values='hr', steps_per_period=1,
                                            orientation='h', sort='desc')

def period_summary(values, ranks):
    top2 = values.nlargest(2)
    leader = top2.index[0]
    lead = top2.iloc[0] - top2.iloc[1]
    s = f'{leader} by {lead:.0f}'
    return {'s': s, 'x': .95, 'y': .07, 'ha': 'right', 'size': 8}

bcr.bar_chart_race(df_values, period_length=1000,
                   fixed_max=True, fixed_order=True, n_bars=10,
                   figsize=(5, 3), period_fmt='Season {x:,.0f}',
                   period_summary_func=period_summary,
                   title='Top 10 Home Run Hitters by Season Played')
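
To check what period_summary produces without rendering an animation, it can be called on a single period by hand. The sketch below repeats the function so that it runs on its own and feeds it the season 3 totals for the four leftmost columns of df_values shown earlier. Because the remaining columns are elided in that output, the leader here is only the leader among those four players. The keys of the returned dictionary look like matplotlib text properties, which is an assumption based on their names.

```python
import pandas as pd

def period_summary(values, ranks):
    # Same function as above: name the leader and the margin over second place
    top2 = values.nlargest(2)
    leader = top2.index[0]
    lead = top2.iloc[0] - top2.iloc[1]
    s = f'{leader} by {lead:.0f}'
    return {'s': s, 'x': .95, 'y': .07, 'ha': 'right', 'size': 8}

# Season 3 totals for the four visible columns only (the rest are elided)
values = pd.Series({'Albert Pujols': 114.0, 'Alex Rodriguez': 41.0,
                    'Babe Ruth': 7.0, 'Barry Bonds': 65.0})
ranks = values.rank()  # not used by period_summary; passed to match its signature

summary = period_summary(values, ranks)
print(summary['s'])
```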