Part 2: From SPSS to Python

Independent T-Test

  1. Background assumptions of the Independent T-Test
  2. Checking for Outliers
  3. Dealing With Outliers
  4. Homogeneity of Variances
  5. Creating the Output of the Independent T-Test
  6. Exporting to Excel and PNG files

Background

  1. You must have one continuous dependent variable, meaning it is measured on a numeric scale and can take many values within a range, such as running time or test scores (0 to 100).
  2. Your other variable is the independent variable, a categorical variable with exactly two groups, such as employed versus unemployed or Pepsi versus Coke.
  3. You have independence of observations, meaning the two groups are made up of different people or things and no one belongs to both. In the cola example, no one who said they liked Pepsi also said they liked Coke; each person chose one or the other.
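The first two assumptions above can be checked in code before running anything else. The sketch below uses a hypothetical helper, check_ttest_inputs, and a small made-up data frame; neither is part of the original tutorial. Independence of observations comes from how the data was collected, so it cannot be verified this way.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_ttest_inputs(data, group_col, value_col):
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    # Assumption 1: the dependent variable must be numeric
    if not is_numeric_dtype(data[value_col]):
        problems.append(f"'{value_col}' is not numeric")
    # Assumption 2: the independent variable must have exactly two groups
    n_groups = data[group_col].nunique()
    if n_groups != 2:
        problems.append(f"'{group_col}' has {n_groups} groups, expected 2")
    return problems

# Made-up example data, only for illustration
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "free/reduced"],
    "math score": [72, 47, 90, 40],
})
print(check_ttest_inputs(toy, "lunch", "math score"))
```

An empty list means the columns have the right shape for an independent t-test; any message in the list points at the assumption that fails.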

Dataset for this tutorial

Getting Started

Importing Packages

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.stats import mstats
import pylab

Dataset from desktop

df = pd.read_csv("data/StudentsPerformance.csv")
df

Dataset Validation

Checking for outliers

sns.boxplot(x=df['math score'], y=df['lunch'])

Interquartile Range

# Rows for the free/reduced lunch group
df[df["lunch"] == "free/reduced"]
# Quartiles and IQR of the numeric score columns for that group
# (numeric_only=True skips the text columns, which recent pandas versions no longer skip by default)
Q1_f = df[df["lunch"] == "free/reduced"].quantile(0.25, numeric_only=True)
Q3_f = df[df["lunch"] == "free/reduced"].quantile(0.75, numeric_only=True)
IQR_f = Q3_f - Q1_f
print(Q1_f)
print(Q3_f)
print(IQR_f)
math score       49.0
reading score    56.0
writing score    53.0
Name: 0.25, dtype: float64
# Same quartiles and IQR for the standard lunch group
Q1_s = df[df["lunch"] == "standard"].quantile(0.25, numeric_only=True)
Q3_s = df[df["lunch"] == "standard"].quantile(0.75, numeric_only=True)
IQR_s = Q3_s - Q1_s
print(Q1_s)
print(Q3_s)
print(IQR_s)
  1. Multiply the IQR of both the free/reduced and standard lunch groups by 1.5.
  2. Add this value (IQR x 1.5) to Q3. Any number greater than the result is a suspected outlier.
  3. Subtract this value (IQR x 1.5) from Q1. Any number less than the result is a suspected outlier.
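To make the three steps above concrete, here is a minimal sketch on a small made-up array of scores. The helper name iqr_fences and the sample values are assumptions for illustration only, not part of the tutorial dataset.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up scores with one very low and one very high value
scores = np.array([10, 50, 52, 55, 58, 60, 62, 65, 100])
lower, upper = iqr_fences(scores)

# Anything outside the fences is a suspected outlier
suspected = scores[(scores < lower) | (scores > upper)]
print(lower, upper, suspected)
```

Here Q1 is 52 and Q3 is 62, so the IQR is 10 and the fences sit at 37 and 77, which flags 10 and 100.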
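# Keep only the numeric score columns and only the rows of each lunch group,
# then flag values outside that group's 1.5 x IQR fences
scores_f = df.loc[df["lunch"] == "free/reduced", Q1_f.index]
scores_s = df.loc[df["lunch"] == "standard", Q1_s.index]
IQR_freelunch = (scores_f < (Q1_f - 1.5 * IQR_f)) | (scores_f > (Q3_f + 1.5 * IQR_f))
IQR_standardlunch = (scores_s < (Q1_s - 1.5 * IQR_s)) | (scores_s > (Q3_s + 1.5 * IQR_s))
# Rows where the math score is a suspected outlier
IQR_freelunch.loc[IQR_freelunch['math score']]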

Cleaning up the Data

  1. Change the outlier values so they fall in line with the rest of the data
  2. Remove the outliers from the data frame
  3. Keep the outliers and leave the data as is

Changing the outlier values

Winsorizing

#IQR for the whole dataset (numeric score columns only)
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
#outer fence values: 3 x IQR below Q1 and above Q3
outer_fence = 3*IQR
outer_fence_le = Q1-outer_fence #lower
outer_fence_ue = Q3+outer_fence #higher
#print the fence width for every score column, then the width and the
#lower and upper outer fences for math score
print (outer_fence)
print (outer_fence['math score'])
print (outer_fence_le['math score'])
print (outer_fence_ue['math score'])
# Math score at a range of upper and lower quantiles
quantiles = [0.999, 0.99, 0.975, 0.95, 0.925, 0.90, 0.1, 0.0999, 0.075, 0.05, 0.025, 0.009]
outerfencedata={'Quantile Percentage':
['99.9%','99%','97.5%','95%','92.5%','90%','10%','9.99%','7.5%','5%','2.5%','0.9%' ],
'math score':
[df['math score'].quantile(q) for q in quantiles]}
outerfence= pd.DataFrame(outerfencedata)
outerfence
new_df = df.copy(deep=True)
# Winsorize math score: limits are the fractions replaced at the lower (10%) and upper (5%) ends
new_df['math_wins'] = mstats.winsorize(new_df['math score'].to_numpy(), limits=(0.1, 0.05))
new_df
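If it is not obvious what the limits argument does, the sketch below runs the same mstats.winsorize call on a made-up array of the numbers 1 to 20. The array and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import mstats

# Made-up data: the integers 1 through 20
raw = np.arange(1, 21)

# Replace the lowest 10% and the highest 5% of values with the nearest value that is kept
wins = np.asarray(mstats.winsorize(raw, limits=(0.1, 0.05)))

print(raw)
print(wins)
```

With 20 values, 10% is two values and 5% is one, so the two smallest values are pulled up to the third smallest and the largest value is pulled down to the second largest. No rows are removed, which is the main difference from dropping outliers.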
  1. Re-do the boxplot
sns.boxplot(x=new_df['math_wins'], y=new_df['lunch'])
new_df['math score'].equals(new_df['math_wins'])
values =[17,59,76,145,149,327,338,363,451,458,466,528,555,596,601,623,625,683,785,787,842,895,916,962,980]
new_df.iloc[values, :]
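The row numbers in values are typed in by hand above. As a possible alternative, the rows that winsorizing changed can be looked up by comparing the two columns. This is a sketch on a made-up data frame that reuses the tutorial's column names; the numbers are invented.

```python
import pandas as pd

# Made-up scores: the first and last rows were changed by winsorizing
toy = pd.DataFrame({
    "math score": [0, 45, 60, 75, 100],
    "math_wins":  [45, 45, 60, 75, 75],
})

# Index labels of the rows where the winsorized value differs from the original
changed = toy.index[toy["math score"] != toy["math_wins"]].tolist()
print(changed)

# Inspect only those rows
changed_rows = toy.loc[changed, :]
```

On the real data the same comparison would be written against new_df, which avoids retyping the list if the winsorizing limits change.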

Removing the outliers

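values =[17,59,76,145,149,327,338,363,451,458,466,528,555,596,601,623,625,683,785,787,842,895,916,962,980]
rem_df = df.copy(deep=True)
# drop() returns a new data frame and leaves rem_df unchanged unless the result is assigned
rem_df.drop(values)
# Alternatively, drop only a few selected rows
rem_df.drop([17,149,596])
# Or keep only the rows at or below a chosen math score
rem_df[rem_df['math score'] <=97]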

Testing for Normality

# Return only the Shapiro-Wilk test statistic for an array of values
def shapirostat(x):
    statistic, pvalue = stats.shapiro(x)
    return statistic
standard = df[df["lunch"] == "standard"]
stan = standard['math score'].to_numpy()
shapirostat(stan)
# Same test, but taking a data frame and a column name
def shapirostat(x,y):
    x=x[y].to_numpy()
    statistic, pvalue= stats.shapiro(x)
    return statistic
# Companion function that returns the p-value instead of the statistic
def shapirop(x,y):
    x=x[y].to_numpy()
    statistic, pvalue = stats.shapiro(x)
    return pvalue
standard = df[df["lunch"] == "standard"]
freereduced = df[df["lunch"] == "free/reduced"]
shapirostat(standard, 'math score')
# df is the number of observations in each group
norm={'Lunch':["free/reduced","standard"],
'statistics':[shapirostat(freereduced,'math score'),shapirostat(standard,'math score') ],
'df': [len(freereduced),len(standard)],
'Sig(pvalue)':[shapirop(freereduced,'math score'),shapirop(standard,'math score')]}
shar=pd.DataFrame(norm)
shar
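The table above calls stats.shapiro twice per group, once for the statistic and once for the p-value. One possible way to avoid the repeated calls is to loop over the groups and unpack both values at once. This is a sketch on randomly generated scores, not the tutorial dataset, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Randomly generated stand-in data, 30 scores per lunch group
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "lunch": ["standard"] * 30 + ["free/reduced"] * 30,
    "math score": np.concatenate([rng.normal(70, 10, 30), rng.normal(60, 10, 30)]),
})

rows = []
for name, group in toy.groupby("lunch"):
    # One Shapiro-Wilk test per group, keeping both outputs
    statistic, pvalue = stats.shapiro(group["math score"])
    rows.append({"Lunch": name, "statistics": statistic,
                 "df": len(group), "Sig(pvalue)": pvalue})

shapiro_table = pd.DataFrame(rows)
print(shapiro_table)
```

A p-value above your chosen threshold (commonly 0.05) means the test found no evidence against normality for that group.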
stan = standard['math score'].to_numpy()
stats.probplot(stan, dist="norm", plot=pylab)
pylab.show()
free = freereduced['math score'].to_numpy()
stats.probplot(free, dist="norm", plot=pylab)
pylab.show()
  1. Moving over to the Mann-Whitney U test
  2. Transforming the dataset via a different method
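For the first option in the list above, scipy provides stats.mannwhitneyu, which compares two independent groups without assuming normality. The sketch below runs it on randomly generated skewed data; the group names and values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Randomly generated, right-skewed stand-in data for two independent groups
rng = np.random.default_rng(1)
group_a = rng.exponential(scale=10, size=40)
group_b = rng.exponential(scale=15, size=40)

# Two-sided Mann-Whitney U test
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u_stat, p_value)
```

On the tutorial data the two arrays would be the math scores of the standard and free/reduced groups.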

Homogeneity of Variances

levene = stats.levene(free,stan)
levene
LeveneResult(statistic=3.193786657293625, pvalue=0.07422200559323446)
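Two details are worth knowing when comparing this with SPSS. scipy's stats.levene centers on the median by default, while the Levene's test shown in SPSS t-test output is, as far as I know, the mean-based version, so center='mean' may match it more closely. The result also decides which t-test to run: in the output above the p-value is above 0.05, so equal variances can be assumed. The sketch below shows both points on randomly generated data with made-up names.

```python
import numpy as np
from scipy import stats

# Randomly generated stand-in data with clearly different spreads
rng = np.random.default_rng(2)
a = rng.normal(70, 10, 50)
b = rng.normal(60, 20, 50)

lev_median = stats.levene(a, b)               # default: center='median'
lev_mean = stats.levene(a, b, center='mean')  # mean-based version

# Assume equal variances only when Levene's test is not significant
equal_var = bool(lev_mean.pvalue > 0.05)

# equal_var=False switches ttest_ind to Welch's t-test
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
print(lev_median.pvalue, lev_mean.pvalue, equal_var, t_stat, p_value)
```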

Independent T-Test

  1. Build a descriptive statistic table in python
  2. Mean difference, STD error difference, 95% confidence intervals in lower and upper bounds
  3. Independent T-Test results
  4. Displaying the data
  5. Exporting your results

Descriptive Statistic Table

mean = df['math score'].mean()
mean
66.089
# Mean, count, standard deviation, and standard error of the mean for one column
def descript_vals(x,col):
    mean = x[col].mean()
    count = len(x[col])
    std = x[col].std()
    std_err = x[col].sem()
    data = [mean,count,std,std_err]
    return data
descript_vals(df,'math score')
[66.089, 1000, 15.16308009600945, 0.4794986944695449]
#standard = df[df["lunch"] == "standard"]
#freereduced = df[df["lunch"] == "free/reduced"]
stan = descript_vals(standard,'math score')
free = descript_vals(freereduced,'math score')
# List the values in the same order as the Lunch labels: free/reduced first, then standard
descriptive_table = {'Lunch':["free/reduced","standard"],'Mean':[free[0],stan[0]],'Count':[free[1],stan[1]],
'Standard Deviation': [free[2],stan[2]],
'Standard Error of the Mean':[free[3],stan[3]]}
descript =pd.DataFrame(descriptive_table)
descript

Mean difference, STD error difference, 95% confidence intervals in lower and upper bounds

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h
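meandiff = stan[0]-free[0]
# mean_confidence_interval gives the interval for a single mean, so the values
# for the difference between the two groups are computed directly here.
# Standard error of the difference, from the pooled variance (equal variances assumed)
n_s, n_f = stan[1], free[1]
pooled_var = ((n_s-1)*stan[2]**2 + (n_f-1)*free[2]**2) / (n_s+n_f-2)
stderrdif = np.sqrt(pooled_var * (1/n_s + 1/n_f))
# 95% confidence interval of the mean difference
margin = stderrdif * stats.t.ppf(0.975, n_s+n_f-2)
lower_bound, upper_bound = meandiff-margin, meandiff+margin
meandifference = {'Mean Difference':[meandiff],'Standard Error Difference':[stderrdif],'Lower Bound':[lower_bound],
'Upper Bound': [upper_bound]}
mean_df =pd.DataFrame(meandifference)
mean_df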
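A quick way to check a standard error of the difference is to divide the mean difference by it and compare the result with the t statistic from scipy, since the equal-variance t statistic is exactly that ratio. The sketch below does this on two small made-up samples; the function name mean_diff_summary is an assumption for illustration.

```python
import numpy as np
from scipy import stats

def mean_diff_summary(a, b, confidence=0.95):
    """Mean difference, pooled standard error, and confidence bounds (equal variances assumed)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / dof
    se_diff = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = a.mean() - b.mean()
    h = se_diff * stats.t.ppf((1 + confidence) / 2, dof)
    return diff, se_diff, diff - h, diff + h

# Made-up samples, only for illustration
a = [72, 69, 90, 88, 71]
b = [47, 76, 40, 64, 38]

diff, se_diff, lower, upper = mean_diff_summary(a, b)
t_manual = diff / se_diff
t_scipy, _ = stats.ttest_ind(a, b)

# The two t statistics should agree up to floating point error
print(np.isclose(t_manual, t_scipy))
```

If the two values disagree on your own data, the standard error (or the equal-variance assumption) is the first thing to recheck.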

Independent T-Test results

stan_t = standard['math score'].to_numpy()
free_t = freereduced['math score'].to_numpy()
def independent_ttest(data1, data2):
    # degrees of freedom
    dof = len(data1) + len(data2) - 2
    # Student's t-test (equal variances assumed, the scipy default)
    t_stat, pval =stats.ttest_ind(data1,data2)
    return t_stat, dof, pval
independent_ttest(stan_t, free_t)
(11.837180472914612, 998, 2.4131955993137074e-30)
def independent_ttest(data1, data2):
    # degrees of freedom
    dof = len(data1) + len(data2) - 2
    t_stat, pval =stats.ttest_ind(data1,data2)
    # show the p-value as a fixed-point string instead of scientific notation;
    # "{:.30f}".format(pval) is equivalent
    pval = f"{pval:.30f}"
    return t_stat, dof, pval
independent_ttest(stan_t, free_t)
(11.837180472914612, 998, '0.000000000000000000000000000002')
# Run the test once and reuse the results
t_stat, dof, pval = independent_ttest(stan_t, free_t)
meandifference = {'Mean Difference':[meandiff],
'Standard Error Difference':[stderrdif],
'Lower Bound':[lower_bound],
'Upper Bound': [upper_bound],
'T Statistic':[t_stat],
'DF':[dof],
'pval':[pval]}
mean_df =pd.DataFrame(meandifference)
mean_df

Displaying the Data

fig, ax = plt.subplots(figsize=(12, 9))
# order= fixes the category order so it matches the tick labels set below
sns.boxplot(x=df['math score'], y=df['lunch'], order=['standard','free/reduced'], ax=ax, palette="Blues")
ax.set_title('Boxplot of Math Scores Compared by Type of Lunch', fontsize=25)
ax.set_xlabel('Math Scores', fontsize=17)
ax.set_ylabel('Lunch Types',fontsize=17)
ax.set_yticklabels(['Standard','Free/Reduced'])
fig, ax = plt.subplots(figsize=(12, 9))
# Lunch type is on the x axis here, so the axis labels follow that layout
sns.barplot(x=df['lunch'], y=df['math score'], order=['standard','free/reduced'], ax=ax, palette="Blues")
ax.set_title('Math Scores Compared by Type of Lunch', fontsize=25)
ax.set_xlabel('Lunch Types', fontsize=17)
ax.set_ylabel('Math Scores',fontsize=17)
ax.set_xticklabels(['Standard','Free/Reduced'])
fig = plt.figure(figsize=[6, 6], dpi=100)
ax = fig.add_subplot(111)
# Draw on ax and keep fig pointing at the figure; with fit=False only the two quantile arrays are returned
osm, osr = stats.probplot(stan_t, dist="norm", plot=ax, fit=False)
#These next 3 lines just demonstrate that some plot features
#can be changed independent of the probplot function.
ax.set_title("QQ plot of Standard Lunch")
ax.set_xlabel("Quantiles", fontsize=10)
ax.set_ylabel("Ordered Values", fontsize=10)
plt.show()
fig = plt.figure(figsize=[6, 6], dpi=100)
ax = fig.add_subplot(111)
# Draw on ax and keep fig pointing at the figure; with fit=False only the two quantile arrays are returned
osm, osr = stats.probplot(free_t, dist="norm", plot=ax, fit=False)
#These next 3 lines just demonstrate that some plot features
#can be changed independent of the probplot function.
ax.set_title("QQ plot of Free/Reduced Lunch")
ax.set_xlabel("Quantiles", fontsize=10)
ax.set_ylabel("Ordered Values", fontsize=10)
plt.show()

Exporting the Data

new_df.to_excel("output.xlsx")
# Write every table to its own sheet of a single workbook (the output folder must already exist)
with pd.ExcelWriter('output/independentttestresults.xlsx') as writer:
    new_df.to_excel(writer,sheet_name='Sheet1')
    outerfence.to_excel(writer,sheet_name='Sheet2')
    shar.to_excel(writer,sheet_name='Sheet3')
    descript.to_excel(writer,sheet_name='Sheet4')
    mean_df.to_excel(writer,sheet_name='Sheet5')
plt.savefig('output/nameofplot.png')
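One thing to watch for: plt.savefig saves whichever figure is current, and after plt.show() or in a new notebook cell that may be an empty one. Saving through the figure object is more explicit. The sketch below is one way to do it, using made-up numbers, a temporary folder, and a hypothetical file name; the Agg backend line is only there so the example also runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed when no display is available
import matplotlib.pyplot as plt

# Made-up scores for two groups, only for illustration
fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot([[72, 69, 90, 88], [47, 76, 40, 64]])
ax.set_xticklabels(["standard", "free/reduced"])
ax.set_title("Math Scores by Lunch Type")

# Create the output folder if it is missing, then save this specific figure
out_dir = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "boxplot.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```

In the tutorial itself you would replace the temporary folder with 'output' and call fig.savefig right after building each plot.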

Conclusion
