Marking Statistically Significant Values using Pandas

Wed 21 May 2014 by Eoin Travers

While writing a results section, I had some data collated using the Pandas library in Python. I wanted to display the mean for a number of groups, and show whether that mean was significantly different from chance (.5) in each case.

Calculating the means and running the binomial test is simple. I'll demonstrate with a data set from UCLA, the details of which aren't important; I'm going to look at average admit, grouped by rank.

import numpy as np
import pandas as pd
from scipy import stats
# Data courtesy of http://www.ats.ucla.edu/stat/r/dae/logit.htm
data = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

data_grouped = data.groupby('rank') # Grouped values
data_means = data_grouped.mean() # Mean values

# Number of values in the first group (this assumes all groups are the same size)
N = data_grouped.count().admit.iloc[0]
# Run a binomial test for each group
# m*N = Mean accuracy * Number of trials = Total number correct
# .5 = Chance.
# Note: binom_test was removed in SciPy 1.12; newer versions provide stats.binomtest
data_means['p'] = [stats.binom_test(m*N, N, .5) for m in data_means.admit]

print(np.round(data_means[['admit', 'p']], 3))
      admit      p
rank              
1     0.541  0.609
2     0.358  0.020
3     0.231  0.000
4     0.179  0.000

[4 rows x 2 columns]
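
The snippet above uses the size of the first group for every test, which only holds when all groups are the same size. Here is a minimal sketch of a per-group version, run on made-up data rather than the UCLA file. It assumes SciPy 1.7 or later, where `stats.binomtest` is available; the toy data frame and the names `toy`, `summary` and `binom_p` are my own, for illustration only.

```python
import pandas as pd
from scipy import stats

# Made-up data (not the UCLA set): binary outcome 'admit', grouping column 'rank'
toy = pd.DataFrame({
    'rank':  [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'admit': [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
})

# Successes, trials and mean per group, so unequal group sizes are handled
summary = toy.groupby('rank')['admit'].agg(['sum', 'count', 'mean'])

def binom_p(row):
    # Two-sided exact binomial test against chance (.5)
    return stats.binomtest(int(row['sum']), int(row['count']), 0.5).pvalue

summary['p'] = summary.apply(binom_p, axis=1)
print(summary[['mean', 'p']].round(3))
```

Passing the observed count and the group's own size to each test avoids the `m*N` step, which can produce a non-integer number of successes when the group sizes differ.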

To output this the way you would expect in a publication, I used the following function, which takes a Pandas dataframe, a list of value column names, a list of p value column names, and the number of decimal places to round the output to. The column arguments are lists, rather than single names, so you can mark several value columns at once, each paired with its own p value column.

def mark_sig(df, val_cols, p_cols, round_to=3):
    df = df.copy() # Don't modify the original data
    mapper = {1:'', .1:' .', .05:' *', .01:' **', .001:' ***', .0001:' ***'}
    possible_p = [.0001, .001, .01, .05, .1, 1]
    for val_col, p_col in zip(val_cols, p_cols):
        # For each value/p value pairing...
        df[val_col] = df[val_col].astype(object) # The column will hold strings
        for i in range(len(df)):
            # For every row...
            val = df[val_col].iloc[i]
            for p in possible_p:
                # Check if the p value is below any of those on the list
                if df[p_col].iloc[i] < p:
                    # If so, add the appropriate asterisks
                    marked = str(np.round(val, round_to)) + mapper[p]
                    df.iloc[i, df.columns.get_loc(val_col)] = marked
                    break
    print_me = val_cols # Only print the value columns
    print(df[print_me])

mark_sig(data_means, ['admit'], ['p'])
          admit
rank           
1         0.541
2       0.358 *
3     0.231 ***
4     0.179 ***

Feel free to use and modify this as you wish, although I'm sure there are nicer ways of doing this built into some R and Python packages that give this kind of output.
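
As one possible tidier variant in plain pandas, here is a rough sketch that replaces the row-by-row loop with `pd.cut` and returns the marked table instead of printing it. The function name `add_stars` and the `demo` frame are hypothetical; the values in `demo` are the rounded means and p values from the table earlier in this post, not a fresh calculation.

```python
import numpy as np
import pandas as pd

def add_stars(df, val_cols, p_cols, round_to=3):
    """Return the value columns as strings with significance marks appended."""
    bins = [0, .001, .01, .05, .1, np.inf]      # p value thresholds
    labels = [' ***', ' **', ' *', ' .', '']    # one mark per interval
    out = pd.DataFrame(index=df.index)
    for val_col, p_col in zip(val_cols, p_cols):
        # right=False makes the intervals [low, high), i.e. p < threshold
        stars = pd.cut(df[p_col], bins=bins, labels=labels, right=False)
        out[val_col] = df[val_col].round(round_to).astype(str) + stars.astype(str)
    return out

# Rounded means and p values copied from the table earlier in the post
demo = pd.DataFrame(
    {'admit': [0.541, 0.358, 0.231, 0.179],
     'p': [0.609, 0.020, 0.000, 0.000]},
    index=pd.Index([1, 2, 3, 4], name='rank'))
print(add_stars(demo, ['admit'], ['p']))
```

Returning a dataframe rather than printing it means the result can be passed on to something like `to_latex` or `to_csv`. Note that this sketch does not handle missing p values specially.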

PS: Analysing the example data set in this way doesn't make any sense: it's used purely for illustrative purposes.