import pandas as pd
import numpy as np
import math
import scipy.stats as stats
Assume a 10-fold cross-validation was run for two different sets of parameters or features, and we obtained these scores for the two models:
BaseModelScores = [0.709202,0.675973,0.690961,0.692875,0.678119,0.699425,0.679891,0.691891,0.705739,0.702819]
OtherModelScores = [0.693766,0.668319,0.678609,0.680208,0.663592,0.682784,0.670627,0.683872,0.68519,0.692516]
We need to estimate whether the results observed on these samples also hold for the population.
alpha=0.05
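For context, per-fold scores like these can be obtained directly from scikit-learn. The sketch below is only an illustration with a hypothetical estimator and synthetic data, not the models that produced the scores above:
#Minimal sketch: one score per fold from 10-fold cross-validation (hypothetical model and data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = make_classification(n_samples=1000, random_state=0)  #placeholder dataset
base_model = LogisticRegression(max_iter=1000)  #placeholder estimator
fold_scores = cross_val_score(base_model, X, y, cv=10, scoring='roc_auc')  #array of 10 scores, one per fold
print(fold_scores)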
The data for a t-test should meet these requirements: the two groups are independent, the data are approximately normally distributed, and there are no extreme outliers.
Let's assume for now that the data in our experiment come from 2 independent groups and verify the 2 other conditions:
Normality can be verified with the Shapiro-Wilk test. The null hypothesis for the Shapiro-Wilk test is that the data are normally distributed. If the p-value is less than the chosen significance level, then the null hypothesis that the data are normally distributed is rejected. If the p-value is greater than the significance level, then the null hypothesis is not rejected.
print('BaseModelScores:')
shapiro_test = stats.shapiro(BaseModelScores)
print(shapiro_test)
if shapiro_test.pvalue < alpha:
    print('The null hypothesis that the data are normally distributed is rejected')
else:
    print('The data are normally distributed')
BaseModelScores: ShapiroResult(statistic=0.9325253963470459, pvalue=0.4731871783733368) The data are normally distributed
print('OtherModelScores:')
shapiro_test = stats.shapiro(OtherModelScores)
print(shapiro_test)
if shapiro_test.pvalue < alpha:
    print('The null hypothesis that the data are normally distributed is rejected')
else:
    print('The data are normally distributed')
OtherModelScores: ShapiroResult(statistic=0.9528518915176392, pvalue=0.7022936940193176) The data are normally distributed
A Z-score of zero represents a value that equals the mean. The further an observation's Z-score is from zero, the more unusual it is. A standard cut-off for identifying outliers is a Z-score of +/-3 or further from zero.
if sum(np.abs(stats.zscore(BaseModelScores))>3)>0:
    print('There are outliers in BaseModelScores')
else:
    print('No outliers in BaseModelScores')
No outliers in BaseModelScores
if sum(np.abs(stats.zscore(OtherModelScores))>3)>0:
    print('There are outliers in OtherModelScores')
else:
    print('No outliers in OtherModelScores')
No outliers in OtherModelScores
The null hypothesis is that the 2 models have identical scores (the average of the individual cross-validation scores). If the t-value is greater than the critical value obtained from Student's distribution, then the difference is significant; otherwise it isn't. The level of significance (p-value) corresponds to the risk indicated by the t-test table for the calculated t-value. A larger t-value indicates that the difference between the group means is large relative to the common variance, i.e. a more significant difference between the groups. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html
t=stats.ttest_rel(BaseModelScores,OtherModelScores)
print(t)
if t.pvalue>=alpha:
    print('No difference between the models with %s significance level'%alpha)
else:
    print('There is a difference between models with %s significance level'%alpha)
Ttest_relResult(statistic=9.772287205694246, pvalue=4.333029637146347e-06) There is a difference between models with 0.05 significance level
From the test output it is not clear what the critical value obtained from Student's distribution is. Let's run the test manually. To get the critical value from the distribution we need the degrees of freedom.
#Paired t-test
diff=[y - x for y, x in zip(BaseModelScores, OtherModelScores)]
n = len(diff)
m = np.mean(diff)
#it's important to provide ddof=1 (delta degrees of freedom) in numpy var to calculate the variance with degrees of freedom n - 1.
v = np.var(diff,ddof=1)
t = m/math.sqrt(v/n)
print(t)
9.772287205694246
#degrees of freedom
df = n - 1
#Critical value for Two-tailed test from t distribution table:
critical_value=stats.t.ppf(q=1-alpha/2, df=df)
print('Critical value from Student`s distribution with significance level %s and degree of freedom %s is %s'%(alpha, df, critical_value))
Critical value from Student`s distribution with significance level 0.05 and degree of freedom 9 is 2.2621571627409915
#p-value - probability of getting a more extreme value - for two-sided test
p = 2*(1-stats.t.cdf(t, df))
print('p-value from Student`s distribution with significance level %s and degree of freedom %s is %s'%(alpha, df, p))
p-value from Student`s distribution with significance level 0.05 and degree of freedom 9 is 4.333029637093588e-06
The t-value (9.772287205694246) is greater than the critical value (2.2621571627409915) obtained from Student's distribution, so the difference is significant.
In fact, the data are not independent in K-fold cross-validation: the training sets of the different folds overlap, so the variance of the differences is underestimated. Nadeau and Bengio's correction accounts for this by inflating the variance term from 1/n to (1/n + n2/n1), where n1 is the size of the training set and n2 is the size of the validation set.
n2=89559
n1=806039
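If these sizes are not recorded explicitly, they can be approximated from the total number of samples and the number of folds; the sketch below only illustrates that relationship and recovers the values above:
#Approximate per-fold training/validation sizes for k-fold cross-validation
k = 10
N_total = n1 + n2                #full dataset size (here 806039 + 89559 = 895598)
n2_check = N_total // k          #validation set size per fold
n1_check = N_total - n2_check    #training set size per fold
print(n1_check, n2_check)        #matches n1 and n2 above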
#Corrected Paired t-test
diff=[y - x for y, x in zip(BaseModelScores, OtherModelScores)]
n = len(diff)
m = np.mean(diff)
#it's important to provide ddof=1 (delta degrees of freedom) in numpy var to calculate the variance with degrees of freedom n - 1.
v = np.var(diff,ddof=1)
t = m/math.sqrt(v*(1/n + n2/n1))
print(t)
6.725766889467009
The corrected t-value (6.725766889467009) is still greater than the critical value (2.2621571627409915) obtained from Student’s distribution and the difference is significant.
# Nadeau and Bengio corrected paired t-test
# https://link.springer.com/content/pdf/10.1023/A:1024068626366.pdf
# https://www.cs.waikato.ac.nz/~eibe/pubs/bouckaert_and_frank.pdf
import numpy as np
import math
import scipy.stats as stats
def corrected_paired_ttest(data1, data2, n_training_size_folds, n_test_size_folds, alpha):
    #corrected paired t-test (Nadeau-Bengio variance correction)
    diff = [y - x for y, x in zip(data1, data2)]
    n = len(diff)
    m = np.mean(diff)
    #it's important to provide ddof=1 (delta degrees of freedom) in numpy var to calculate the variance with degrees of freedom n - 1.
    v = np.var(diff, ddof=1)
    t = m/math.sqrt(v*(1/n + n_test_size_folds/n_training_size_folds))
    #degrees of freedom
    df = n - 1
    #critical value for a two-tailed test from the t distribution:
    critical_value = stats.t.ppf(q=1-alpha/2, df=df)
    #p-value - probability of getting a more extreme value - for a two-sided test
    pvalue = 2*(1-stats.t.cdf(t, df))
    return t, critical_value, pvalue
(c_t, critical_value, pvalue) = corrected_paired_ttest(BaseModelScores, OtherModelScores, n1, n2, alpha)
print('Corrected t-test value is %s , critical value is %s, p-value is %s'%(c_t, critical_value, pvalue))
if pvalue>=alpha:
    print('No difference between the models with %s significance level'%alpha)
else:
    print('There is a difference between models with %s significance level'%alpha)
Corrected t-test value is 6.725766889467009 , critical value is 2.2621571627409915, p-value is 8.598010400850953e-05 There is a difference between models with 0.05 significance level
The difference between the means of the model scores for the entire population lies within this confidence interval. If there is no difference, then the interval contains zero (0). If zero is NOT in the range of values, the difference is statistically significant.
diff=[y - x for y, x in zip(BaseModelScores, OtherModelScores)]
import scipy.stats as st
CI=st.t.interval(1-alpha, len(diff)-1, loc=np.mean(diff), scale=st.sem(diff))
CI
(0.009791778208024765, 0.015690621791975272)
import statsmodels.stats.api as sms
CI=sms.DescrStatsW(diff).tconfint_mean()
CI
(0.009791778208024765, 0.015690621791975272)
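For comparison, the same (uncorrected) interval can be computed by hand from the mean difference, its standard error, and the t critical value; this sketch reuses the quantities already defined above:
#Manual (uncorrected) confidence interval for the mean of the per-fold differences
m = np.mean(diff)
se = math.sqrt(np.var(diff, ddof=1)/len(diff))     #standard error of the mean difference
t_crit = stats.t.ppf(q=1-alpha/2, df=len(diff)-1)  #two-sided critical value
manual_CI = (m - t_crit*se, m + t_crit*se)
print(manual_CI)  #should match the intervals printed above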
if CI[0]<=0:
    print('No difference between the models with %s confidence level'%(1-alpha))
else:
    print('There is a difference between models with %s confidence level'%(1-alpha))
There is a difference between models with 0.95 confidence level
import scipy.stats as st
def corrected_confidence_interval(data1, data2, n1, n2, confidence=0.95):
    diff = [y - x for y, x in zip(data1, data2)]
    n = len(diff)
    m = np.mean(diff)
    v = np.var(diff, ddof=1)
    df = n - 1
    t = stats.t.ppf((1 + confidence)/2, df)
    lower = m - t * math.sqrt(v*(1/n + n2/n1))
    upper = m + t * math.sqrt(v*(1/n + n2/n1))
    return lower, upper
Corrected_CI = corrected_confidence_interval(BaseModelScores, OtherModelScores, n1, n2,1-alpha)
Corrected_CI
(0.00845580068188603, 0.017026599318114007)
if Corrected_CI[0]<=0:
    print('No difference between the models with %s confidence level'%(1-alpha))
else:
    print('There is a difference between models with %s confidence level'%(1-alpha))
There is a difference between models with 0.95 confidence level
data_dict = {}
data_dict['category'] = ['CI','Corrected CI']
data_dict['mean'] = [np.mean(diff),np.mean(diff)]
data_dict['lower'] = [CI[0],Corrected_CI[0]]
data_dict['upper'] = [CI[1],Corrected_CI[1]]
dataset = pd.DataFrame(data_dict)
import matplotlib.pyplot as plt
dim = np.arange(0, dataset['upper'].max() + dataset['upper'].max()/10, dataset['upper'].max()/10)
for lower, mean, upper, x in zip(dataset['lower'], dataset['mean'], dataset['upper'], range(len(dataset))):
    plt.plot((x, x), (lower, upper), '_-', markersize=20, color='blue')
    plt.plot(x, mean, 'o', color='red')
plt.xticks(range(len(dataset)), list(dataset['category']))
plt.yticks(dim)
plt.grid(axis='both')
plt.margins(x=2)
Zero is not in either confidence interval, which means there is a significant difference between the models. The corrected confidence interval is wider and closer to zero.