Joining many DataFrames at once in Pandas: "n-ary Join"

By max, Wed 11 February 2015, in category Hacking

::: {.cell .border-box-sizing .text_cell .rendered} ::: {.prompt .input_prompt} :::

::: {.inner_cell} ::: {.text_cell_render .border-box-sizing .rendered_html} Joining many DataFrames at once with Reduce ===========================================

In my last project I wanted to compare many different Gender Inequality Indexes at once, including the one I had just come up with, called "WIGI". The problem was that the rank and score data for each index was in a separate DataFrame. I need to perform repeated SQL-style joins. In this case I actually only had to join 5 dataframes, for 5 indices. But later, in helping my partner with her research, she came across the same problem needed to join more than 100. In my mind I saw that we wanted to accomplish this n-ary join. Mathematically I wanted this type of operation, which I couldn't find in pandasjoin

The answer I enjoyed implementing, perhaps because I saw it as this type of repeated operation, is the reduce of functional programming.

Ok, say we have these two data sets: ::: ::: :::

::: {.cell .border-box-sizing .code_cell .rendered} ::: {.input} ::: {.prompt .input_prompt} In [5]: :::

::: {.inner_cell} ::: {.input_area} ::: {.highlight} wigi ::: ::: ::: :::

::: {.output_wrapper} ::: {.output} ::: {.output_area} ::: {.prompt .output_prompt} Out[5]: :::

::: {.output_html .rendered_html .output_subarea .output_pyout} ::: {style="max-height: 1000px; max-width: 1500px; overflow: auto;"}

::: ::: ::: ::: ::: ::: Rank Score
Economy
Republic of China 1 0.356890
Kingdom of Denmark 2 0.347826
Sweden 3 0.345212
South Korea 4 0.343662
Hong Kong 5 0.342857

::: {.cell .border-box-sizing .code_cell .rendered} ::: {.input} ::: {.prompt .input_prompt} In [6]: :::

::: {.inner_cell} ::: {.input_area} ::: {.highlight} world_economic_forum ::: ::: ::: :::

::: {.output_wrapper} ::: {.output} ::: {.output_area} ::: {.prompt .output_prompt} Out[6]: :::

::: {.output_html .rendered_html .output_subarea .output_pyout} ::: {style="max-height: 1000px; max-width: 1500px; overflow: auto;"}

::: ::: ::: ::: ::: ::: Rank Score
Economy
Iceland 1 0.8594
Finland 2 0.8453
Norway 3 0.8374
Sweden 4 0.8165
Denmark 5 0.8025

::: {.cell .border-box-sizing .text_cell .rendered} ::: {.prompt .input_prompt} :::

::: {.inner_cell} ::: {.text_cell_render .border-box-sizing .rendered_html} We'd probably join them like this: ::: ::: :::

::: {.cell .border-box-sizing .code_cell .rendered} ::: {.input} ::: {.prompt .input_prompt} In [7]: :::

::: {.inner_cell} ::: {.input_area} ::: {.highlight} wigi.join(world_economic_forum, how='outer', lsuffix='_wigi', rsuffix='_wef') ::: ::: ::: :::

::: {.output_wrapper} ::: {.output} ::: {.output_area} ::: {.prompt .output_prompt} Out[7]: :::

::: {.output_html .rendered_html .output_subarea .output_pyout} ::: {style="max-height: 1000px; max-width: 1500px; overflow: auto;"}

::: ::: ::: ::: ::: ::: Rank_wigi Score_wigi Rank_wef Score_wef
Economy
Denmark NaN NaN 5 0.8025
Finland NaN NaN 2 0.8453
Hong Kong 5 0.342857 NaN NaN
Iceland NaN NaN 1 0.8594
Kingdom of Denmark 2 0.347826 NaN NaN
Norway NaN NaN 3 0.8374
Republic of China 1 0.356890 NaN NaN
South Korea 4 0.343662 NaN NaN
Sweden 3 0.345212 4 0.8165

::: {.cell .border-box-sizing .text_cell .rendered} ::: {.prompt .input_prompt} :::

::: {.inner_cell} ::: {.text_cell_render .border-box-sizing .rendered_html} But we want to generalize. Notice here we also inject the name of the DataFrame into the column names to avoid "suffix-hell" as I would like to term it. ::: ::: :::

::: {.cell .border-box-sizing .code_cell .rendered} ::: {.input} ::: {.prompt .input_prompt} In [1]: :::

::: {.inner_cell} ::: {.input_area} ::: {.highlight} import pandas

def make_df(filename):
    df = pandas.DataFrame.from_csv(filename)
    name = filename.split('.')[0]
    df.columns = map(lambda col: '{}_{}'.format(str(col), name), df.columns)
    return df

filenames = !ls

dfs = [make_df(filename) for filename in filenames]

::: ::: ::: ::: :::

::: {.cell .border-box-sizing .text_cell .rendered} ::: {.prompt .input_prompt} :::

::: {.inner_cell} ::: {.text_cell_render .border-box-sizing .rendered_html} Now here's the reducer. I actually end up wanting an inner join in the end, but the type of join is not important to illustrate the fact.

Here we join 5 DataFrames at once. ::: ::: :::

::: {.cell .border-box-sizing .code_cell .rendered} ::: {.input} ::: {.prompt .input_prompt} In [2]: :::

::: {.inner_cell} ::: {.input_area} ::: {.highlight} def join_dfs(ldf, rdf): return ldf.join(rdf, how='inner')

final_df = reduce(join_dfs, dfs) #that's the magic
final_df.head()

::: ::: ::: :::

::: {.output_wrapper} ::: {.output} ::: {.output_area} ::: {.prompt .output_prompt} Out[2]: :::

::: {.output_html .rendered_html .output_subarea .output_pyout} ::: {style="max-height: 1000px; max-width: 1500px; overflow: auto;"}

::: ::: ::: ::: ::: ::: Score_gdi Rank_gdi Score_gei Rank_gei Rank_sigi Score_sigi Rank_wdf Score_wdf Rank_wef Score_wef
Economy
Nicaragua 0.912 102 74 37 53 0.8405 13 0.272727 6 0.7894
Rwanda 0.950 80 77 19 43 0.8661 134 0.096154 7 0.7854
Philippines 0.989 17 76 26 57 0.8235 6 0.322785 9 0.7814
Belgium 0.977 38 79 12 1 0.9984 73 0.163734 10 0.7809
Latvia 1.033 52 77 19 24 0.9489 82 0.157623 15 0.7691

::: {.cell .border-box-sizing .text_cell .rendered} ::: {.prompt .input_prompt} :::

::: {.inner_cell} ::: {.text_cell_render .border-box-sizing .rendered_html} I really like the elegance of this solution. I admit there may be other ways to go about it with pandas only, and I understand the R mentality of "no for loops". Still this is precisely why I like pandas in python - you still get the freedom to play as you wish if it makes more sense to you. ::: ::: :::