Part 2: From SPSS to Python

Performing an Independent T-Test in Python

Nicoletta Tancred
27 min read · Jun 27, 2021

Make sure you are following from Part 1 in this series. If you haven’t set it up correctly you may encounter quite a few errors.

This tutorial is about translating the output you would usually get when performing an Independent T-Test in SPSS into Python. It follows the principles of project-based learning, so we will be putting you in the deep end to find answers to errors or problems you may encounter. I find project-based learning to be the most beneficial, but if you want to learn more of the fundamentals I would recommend Treehouse, Codecademy and Learn Python. Throughout the tutorial I will link anything discussed to its documentation, so if you want to understand something further or try it out, feel free to do so.

Independent T-Test

One of the simplest tests we can do in statistical analysis is an Independent-Samples T-Test. I figured this would be a good place to start if you are new to Python and using it for data analysis.

We will be going through:

  1. Background assumptions of the Independent T-Test
  2. Checking for Outliers
  3. Dealing With Outliers
  4. Homogeneity of Variances
  5. Creating the Output of the Independent t-test
  6. Exporting to Excel and PNG files

Background

An Independent-samples t-test is about determining if there is a difference between the means of two independent groups. An example of this is determining the difference between a dependent variable like total lung capacity and an independent variable like smoker status.

With that said, before we begin there are a few assumptions we need to make about the dataset, and more specifically about the variables we are comparing:

  1. You must have one continuous dependent variable, meaning the variable can take a whole range of values, like time spent running or test scores (0 to 100).
  2. Your other variable is independent and contains two groups, i.e. a form of categorical data: employed or not employed, Pepsi or Coke, and so on.
  3. You have an assumption of independence, meaning the groups are made up of different people, things, etc. You don't want anything that could skew the data because the populations are too similar. No one who said they liked Pepsi also said they liked Coke; they have chosen one or the other.

Your null hypothesis should be: the population means of the two groups are equal.

While your alternative hypothesis should be: the population means of the two groups are not equal.

Make sure this test works for your data before proceeding. There are alternatives like the Mann-Whitney U test, Shapiro-Wilk test, Kruskal-Wallis test and One-Way ANOVA which might fit better. I will be uploading tutorials on those, so make sure to follow to be notified.

Dataset for this tutorial

I will be using Kaggle to get my dataset, specifically Students Performance in Exams. I will be looking at students' performance in math scores compared by their lunch type ('standard', 'free/reduced').

Feel free to use your own dataset if you feel comfortable.

Getting Started

Make sure to open up Jupyter Notebook as discussed in Part 1. Open a new Python 3 file in Jupyter, name it and we can get started.

You will see a blank line. Click on it to start typing:

Importing Packages

If you remember, we installed a set of packages in Part 1. While we have access to them, we still need to import them into the Python file. So on that blank line type:

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.stats import mstats
import pylab

So we are importing pandas, Matplotlib, seaborn, NumPy, pylab and the stats module (including mstats) from SciPy.

Press Shift + Enter to run this line. A new line should appear below, and your line should be numbered [1].

Image by Author

Note: pressing the Enter key only creates a new line within the current cell. Pressing Shift + Enter runs the cell and creates a new one below it. It's the difference between a space between paragraphs and a page break. In Python you want more of these page breaks, so you can dissect errors in the code and read it more clearly. It also lets you section your code for easier reading.

The numbering of [1] matters. It lets you know which lines of code have been run and in what order. Make sure you run your lines in the correct order. If you run into trouble, a good tip is to go to the top of the Jupyter notebook, select the Kernel dropdown and click Restart & Run All. This will re-run all the lines in order.

Image by Author

Dataset from desktop

Now let's import this dataset and see what we are working with. From this point on, work with CSV files saved as UTF-8. It makes your life a little easier, and if you need to save the data in another format later it's very easy to do. If you are using the Kaggle dataset, download it now.

If you remember, we created a data folder in Part 1; make sure to place your CSV file in there. Underneath the package imports, on a new line type:

df = pd.read_csv("data/StudentsPerformance.csv")
df

Once running this line, you should see your dataset in the notebook.

Image by Author

Now the dataset is in Python and called ‘df’.
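Before moving on, a couple of optional pandas calls can confirm the file loaded as expected; something like:

df.head()  # shows the first five rows of the data
df['lunch'].value_counts()  # counts how many students are in each lunch group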

Dataset Validation

In this section, we will check that the dataset has minimal to no outliers, is approximately normally distributed and has equal variances.

Checking for outliers

The easiest way to check for outliers is a simple boxplot. One of the packages we installed, seaborn, has a boxplot function we can call. Since we imported seaborn as sns (see Importing Packages above), whenever we want to use seaborn we put sns. at the front of the call.

sns.boxplot(x=df['math score'], y=df['lunch'])
Image by Author

To break it down:

sns = seaborn

boxplot = seaborn's boxplot function

x and y = the data we are looking at

df is our data frame, and we use [] brackets to call the specific columns we want to look at. In this case, we are looking at math scores.

The boxplot is revealing that we have a few outliers but I want to go over another method of outlier identification in Python that is less visual.

Interquartile Range

You can also confirm and identify outliers via the interquartile range (IQR). I'm going to do this through our independent variable df['lunch'] (the data frame and the column lunch), looking at each group separately, just like our boxplot above did.

I want to calculate Q1, Q3 and the interquartile range (IQR). We can do this using pandas, one of the packages we imported. Its .quantile() method returns the specified quantile. To get the IQR, we simply subtract Q1 from Q3.

To do this I need to define a different variable for each value I want. A variable in Python is just a way to store data values, in this case the interquartile values. Since I will look at the free/reduced lunch group first, I will add "_f" to the end of each variable name.

So I want the independent variable of free/reduced in the column lunch of my data frame. To get that we follow what we did to get the x and y value for our boxplot.

df["lunch"]

However, we need to place this comparison inside another set of [] to select the free/reduced lunch students only:

df[df["lunch"] == "free/reduced"]

Then I will use .quantile and input the quantile to compute (Q1 = 0.25/25% and Q3 = 0.75/75%). We need the df at the front of this expression so that it searches within the data frame, then the lunch column, and only the rows where it == "free/reduced". (Note: on recent versions of pandas you may need .quantile(0.25, numeric_only=True) so non-numeric columns are skipped.)

df[df["lunch"] == "free/reduced"].quantile(0.25)

Then we store this in a variable (Q1_f) and duplicate it, changing the quantile value to 0.75 and labelling it Q3_f.

Q1_f = df[df["lunch"] == "free/reduced"].quantile(0.25)
Q3_f = df[df["lunch"] == "free/reduced"].quantile(0.75)

Finally, the IQR_f will be the Q3-Q1

Q1_f = df[df["lunch"] == "free/reduced"].quantile(0.25)
Q3_f = df[df["lunch"] == "free/reduced"].quantile(0.75)
IQR_f = Q3_f - Q1_f

Then you want to be able to view the values, so you will use the print function.

print(Q1_f)
print(Q3_f)
print(IQR_f)

For my data frame, it prints out the math, reading and writing scores.

math score       49.0
reading score    56.0
writing score    53.0
Name: 0.25, dtype: float64

Since I'm focusing on the math score today, the values for free/reduced lunch are:

Q1 = 49

Q3 = 69

IQR = 20

Following the same code, all you need to do is replace the string of “free/reduced” with “standard”.

Q1_s = df[df["lunch"] == "standard"].quantile(0.25)
Q3_s = df[df["lunch"] == "standard"].quantile(0.75)
IQR_s = Q3_s - Q1_s
print(Q1_s)
print(Q3_s)
print(IQR_s)

Q1 = 61

Q3 = 80

IQR = 19

So now that we understand our boxplot a little better, let's see if the boxplots are right and whether there are outliers in our dataset. To do this we are going to use the interquartile rule:

  1. Multiply the IQR of both free/reduced and standard lunches by 1.5.
  2. Add this value (IQR x 1.5) to Q3. Any number greater than this is a suspected outlier.
  3. Subtract this value (IQR x 1.5) from Q1. Any number less than this is a suspected outlier.

To do this in Python, I'm going to make two new data frames using pandas, one per lunch group. Using the interquartile rule, I'm going to get each one to return False if a value matches the rule, i.e. is between the lower and upper bounds of our dataset, and True if it is above or below those bounds. Within the data frame creation (pd.DataFrame()) I am inputting the rules above, with the IQR and Q1 variables matched to the name of the data frame.

The new data frame named "IQR_freelunch" needs Q1_f, Q3_f and IQR_f. We do this by first naming our data frame, storing it in a variable and setting up the data frame code:

IQR_freelunch = pd.DataFrame()

Then with pd.DataFrame we input it similarly to how you would in Excel or even a calculator. In Excel it would look something like:

=OR(value < Q1_f - 1.5*IQR_f, value > Q3_f + 1.5*IQR_f)

For Python we put it in very similarly, but instead of a comma (,) separating the conditions we use |. We also specify not a cell but the whole data frame we are comparing, to generate the full boolean data frame:

(df < (Q1_f - 1.5 * IQR_f)) | (df > (Q3_f + 1.5 * IQR_f))

Then place this into your data frame code and duplicate it for the standard lunch:


IQR_freelunch = pd.DataFrame((df < (Q1_f - 1.5 * IQR_f)) | (df > (Q3_f + 1.5 * IQR_f)))
IQR_standardlunch = pd.DataFrame((df < (Q1_s - 1.5 * IQR_s)) | (df > (Q3_s + 1.5 * IQR_s)))

Don't forget to change the _f for free/reduced lunch to _s for standard lunch in the "IQR_standardlunch" data frame variable.

Essentially this checks whether each value in df is below the lower bound or above the upper bound. If it is, the data frame holds True for that cell, meaning it is an outlier.

When I display the data frame I get over 1000 rows to look through. We shouldn’t have to do this manually. So I’m going to use a .loc function to find the True values and return them to me with the index number. The index number on the side will allow me to search for the specific row later on if need be. Let’s look at the free/reduced lunch students:

IQR_freelunch.loc[IQR_freelunch['math score'] == True]

To break this down: our new data frame (IQR_freelunch) is passed through the pandas .loc function. .loc accesses a group of rows and columns by labels and outputs whatever you are looking for. Here I'm looking at the math score column of our data frame (IQR_freelunch['math score']), and I only want the values that equal True.

Image by Author

As we can see via the output, we now know which values are outside the bounds of our IQR test by their index on the left-hand side. If we do the same for the standard lunch students:
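(The code mirrors the free/reduced version, just pointed at the standard-lunch data frame.)

IQR_standardlunch.loc[IQR_standardlunch['math score'] == True]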

Image by Author

We see some overlap of the index but some new outliers as well.

To conclude, we have a lot of outliers in this dataset. Since this isn't my data and I don't know what happened during collection, I'm going to assume these are just unusual values. This is supported by the fact that no one got above 100 or below 0 in their math scores.

Depending on your data it might be beneficial to do a different test, but for now we will stick with this one.

Now that we have identified some outliers I will show you how you can clean this up in Python.

Cleaning up the Data

While I will leave the outliers as is, I do want to show you what to do with outliers in Python.

Your choices should be domain focused and should be based on what you think about the dataset. What you can do is:

  1. Change the outlier to match the data
  2. Remove that data from the data frame
  3. Keep the outliers and leave it as is

Changing the outlier values

When it comes to changing the outlier value there are multiple methods. My personal suggestions are winsorizing and imputation.

Winsorizing means recoding outlier values: anything above the 95th percentile is replaced with the value of the 95th percentile, and anything below the 5th percentile is replaced with the value of the 5th percentile.

Imputation means removing the data in those cells and replacing it with an estimate based on the data that is still there, usually the median value. Imputation is usually done on missing values, and I will cover it another time.
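To give you a feel for it, here is a minimal sketch of median imputation on this dataset, assuming we treated the IQR-flagged math scores as missing values; the variable names are my own:

imp_df = df.copy(deep=True)
q1 = imp_df['math score'].quantile(0.25)
q3 = imp_df['math score'].quantile(0.75)
iqr = q3 - q1
# flag values outside the interquartile rule, then replace them with the median
outliers = (imp_df['math score'] < q1 - 1.5 * iqr) | (imp_df['math score'] > q3 + 1.5 * iqr)
imp_df.loc[outliers, 'math score'] = imp_df['math score'].median()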

Winsorizing

We are going to start by going back to the whole dataset rather than the individual boxplots. Do this by calculating the Q1, Q3 and IQR of the whole data frame (df). We also want to calculate the outer fence for both the lower and higher values. Since you have already done something similar above, you can copy and alter pre-existing code. For example, let's take the standard lunch IQR:

Q1_s = df[df["lunch"] == "standard"].quantile(0.25)
Q3_s = df[df["lunch"] == "standard"].quantile(0.75)
IQR_s = Q3_s - Q1_s
print(Q1_s)
print(Q3_s)
print(IQR_s)

When looking at the whole data frame for Q1, Q3 and IQR we don't need to specify the column name ["lunch"] or the type of lunch == "standard". So all you have to do is remove that to get the variables you want. Also make sure to remove the "_s", as it is not necessary and you don't want to overwrite old variables.

#IQR for Whole DataSet
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

Now we want to calculate the outer fence values. Since the equation for the outer fence is just 3 times the IQR, we can put this into a variable.

outer_fence = 3*IQR

To calculate the lower and higher fence values:

lower fence = Q1 - outer fence

outer_fence_le = Q1 - outer_fence

higher fence = Q3 + outer fence

outer_fence_ue = Q3 + outer_fence

And then place it all together on the same line:

#outerfence values
outer_fence = 3*IQR
outer_fence_le = Q1-outer_fence #lower
outer_fence_ue = Q3+outer_fence #higher

Before we print these values, I've introduced a new element of Python coding:

#what does this do?

These are comments you can place in your code. From now on I will use them to display extra information. A line starting with # (which Jupyter colours differently) doesn't do anything to your code; it just holds your own notes.

Now if we printed these values as

print(outer_fence)

We would get all the score values and anything else in the data frame. For simplicity, use the column name 'math score' to show only the math score's outer fence, lower and upper values.

print(outer_fence['math score'])
print(outer_fence_le['math score'])
print(outer_fence_ue['math score'])

Outer Fence = 60

Outer Fence Lower = -3.0

Outer Fence Upper = 137

Since both the upper and lower fences are outside the range of possible scores (0–100), we will be winsorizing on both tails. However, if you found that the data was skewed, you might consider winsorizing only the right or only the left tail.

This part requires a bit more manual finesse, as it is very data specific. We are going to judge the cut for both tails of the dataset by creating a data frame that shows us the math scores from 0.9%–10% (left tail) and 90%–99.9% (right tail).

We need to input the data manually, and while we could print it out as we have been, putting it into a data frame means it's easier to export later if we need to. So using pandas we can create a data frame from the data. Think of it as writing an Excel sheet, but in Python. To do this, my suggestion is to create the column names first, like so:

outerfencedata={'Quantile Percentage':
[],
'math score':
[]}

From here, inside those [] brackets we can put whatever we want; these will be the rows. For the Quantile Percentage column I put:

outerfencedata={'Quantile Percentage':
['99.9%','99%','97.5%','95%','92.5%','90%','10%','9%','7.5%','5%','2.5%','0.9%' ],
'math score':
[]}

And the math score column is where we input the data we want to see. We want to call the original data frame (df) and the specific column, math score, df['math score'], then add the .quantile function matched to the percentages in the Quantile Percentage column.

For example, the first one should be df['math score'].quantile(0.999).

outerfencedata = {'Quantile Percentage':
['99.9%', '99%', '97.5%', '95%', '92.5%', '90%', '10%', '9%', '7.5%', '5%', '2.5%', '0.9%'],
'math score':
[df['math score'].quantile(0.999), df['math score'].quantile(0.99),
df['math score'].quantile(0.975), df['math score'].quantile(0.95),
df['math score'].quantile(0.925), df['math score'].quantile(0.90),
df['math score'].quantile(0.1), df['math score'].quantile(0.09),
df['math score'].quantile(0.075), df['math score'].quantile(0.05),
df['math score'].quantile(0.025), df['math score'].quantile(0.009)]}

Now to place this data into a data frame we just have to:

outerfence= pd.DataFrame(outerfencedata)
outerfence
Image by Author

We can see in the top percentiles that the closest values are at 90%, 92.5% and 95%. Since 95% is the more common cut-off, we will use that. In the bottom percentiles, 10% and 9% have the same value; since 10% is more common, we will cut off there.

Now, to preserve our current dataset, we will make a copy of it. We do this by using the pandas .copy() function and placing that copy in a new data frame variable.

new_df = df.copy(deep=True)

SciPy's stats.mstats module has a winsorize function that returns a winsorized version of our input. We use it by creating a new column in our new data frame which holds the winsorized values.

If you recall, to call a specific column we usually write dataframe[columnname]. It's the same when making a new column, except the new column name goes on the left of the assignment:

new_df['math_wins'] =

Then we use the mstats.winsorize function, call the data frame's current math scores and set our limits to trim the bottom 10% and the top 5%: limits=(0.1, 0.05).

new_df['math_wins'] = mstats.winsorize(new_df['math score'], 
limits=(0.1, 0.05))
new_df

So to break it down: new_df is the copy of df, and we are creating a column called 'math_wins'. Within that, we are using the scipy mstats function winsorize on our 'math score' column, with the limits set to the bottom 10% and the top 5% (0.1, 0.05). When you finish, you should see this as the output:

Image by Author

Now it looks similar, so how do we know it has changed? There are three ways to check:

  1. Re-do the boxplot
sns.boxplot(x=new_df['math_wins'], y=new_df['lunch'])
Image by Author

As we can see the outliers are all gone.

2. Check the columns to see if they are equal or if they have changed

This is done by the .equals function from pandas which will compare the two columns new_df[‘math score’] and new_df[‘math_wins’] to see if there is a difference.

new_df['math score'].equals(new_df['math_wins'])

If they are different you should get an output of False.

3. Check each outlier row

The easiest way to do this is by using the .iloc function to look up rows by index. What you need to do is get the index of every row that was an outlier and put it into an array:

values =[17,59,76,145,149,327,338,363,451,458,466,528,555,596,601,623,625,683,785,787,842,895,916,962,980]

An array stores values of the same type; in Python we can write this as a plain list.
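Rather than typing the indices out by hand, you could also collect them programmatically from the two IQR data frames we built earlier; a sketch:

free_idx = IQR_freelunch.loc[IQR_freelunch['math score'] == True].index
stan_idx = IQR_standardlunch.loc[IQR_standardlunch['math score'] == True].index
values = sorted(set(free_idx) | set(stan_idx))  # union of both groups' outlier rows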

Then, using .iloc from pandas, which searches by index, look up those rows in the new dataset:

new_df.iloc[values, :]
Image by Author

When comparing the math score and math_wins columns, there is a difference between the two, as expected from the winsorizing method.

Removing the outliers

This is pretty simple and easy to do in Python. Make sure that you document heavily what you removed and why, and run your analysis on both datasets to ensure accuracy.

So with the previous part, we were able to identify all the outliers in the dataset by the index.

values =[17,59,76,145,149,327,338,363,451,458,466,528,555,596,601,623,625,683,785,787,842,895,916,962,980]

My suggestion would be to create a new data frame of the original, just to make sure you are not accidentally removing data from the CSV file.

rem_df = df.copy(deep=True)

There are a few ways to remove rows in Python. Since we have all the values listed, we can just drop them all. Note that .drop returns a new data frame rather than changing rem_df in place, so assign the result if you want to keep it:

rem_df = rem_df.drop(values)

If you only wanted to remove a few, you could select just the ones you didn't like:

rem_df.drop([17,149,596])

You could even do it by value

rem_df[rem_df['math score'] <=97]

Really it depends on the dataset, where you are judging the percentile cut off and so on.

Testing for Normality

Since I’m going forward with outliers, I need to test for normality. I will be using the Shapiro-Wilk test.

Python provides the Shapiro-Wilk test as shapiro in scipy.stats; however, we want to run it on just the math scores of each lunch group. So far we have avoided writing functions, but let's make one together.

Functions are blocks of code that only run when called. Essentially, all the packages we have imported and have been using rely on functions to, well, function.

Functions start with def (as in defining the function), a name of your choice, a placeholder for the input (x) and a colon (:).

def shapirostat(x):

From here we can put whatever we want in this function; when it is called, it will run that code. In this case, I want to make a function that gives me the statistic value of the Shapiro test. Since stats.shapiro gives us both the statistic and the p-value, I will only return the statistic.

def shapirostat(x):
    statistic, pvalue = stats.shapiro(x)
    return statistic

If you try to put in your whole data frame, df, it will raise a type error. That's because shapiro only accepts arrays of numbers. So we will create two arrays, one for free/reduced lunch and the other for standard lunch. We do this, once again, by borrowing previous code. If you remember, when we were getting quantiles for just the standard lunch we used:

df[df["lunch"] == "standard"]

We can re-use this, place it into a different variable and then turn that variable into an array. So first we will create a variable of just the data frame with the standard lunch:

standard = df[df["lunch"] == "standard"]

Then we need to select the math score column in this new data frame, standard, and turn it into an array. We do this using pandas' to_numpy(), which turns the column into a NumPy array.

stan = standard['math score'].to_numpy()

So all together, we can now finally run our shapirostat function on our stan array, an array of the standard lunch students' results.

def shapirostat(x):
    statistic, pvalue = stats.shapiro(x)
    return statistic

standard = df[df["lunch"] == "standard"]
stan = standard['math score'].to_numpy()
shapirostat(stan)

However, we can make this a little neater, and we also need the p-value as well. We can do both by moving the creation of the array into the function:

def shapirostat(x, y):
    x = x[y].to_numpy()
    statistic, pvalue = stats.shapiro(x)
    return statistic

standard = df[df["lunch"] == "standard"]
shapirostat(standard, 'math score')

What we also did is add a new input to the function, y, which is the name of the column we are specifically looking at.

So now make the same function but only for the p-value. It’s the same function but it will return the p-value and only the p-value as so:

def shapirostat(x, y):
    x = x[y].to_numpy()
    statistic, pvalue = stats.shapiro(x)
    return statistic

def shapirop(x, y):
    x = x[y].to_numpy()
    statistic, pvalue = stats.shapiro(x)
    return pvalue

Now all we need is our data, ready to be turned into an array inside the function. I have also added a math array of all the math scores.

standard = df[df["lunch"] == "standard"]
freereduced = df[df["lunch"] == "free/reduced"]
math= df['math score'].to_numpy()

Using our new functions, we can put the results into a neat data frame to export later, with shapirostat() and shapirop() giving us the statistic and p-value. For the degrees of freedom, SPSS reports each group's sample size in its Tests of Normality table, so we can use len() on each group rather than hard-coding a number.

norm = {'Lunch': ["free/reduced", "standard"],
'statistics': [shapirostat(freereduced, 'math score'), shapirostat(standard, 'math score')],
'df': [len(freereduced), len(standard)],
'Sig(pvalue)': [shapirop(freereduced, 'math score'), shapirop(standard, 'math score')]}
shar = pd.DataFrame(norm)
shar
Image by Author

Looking at the p-values: for normality they need to be p > .05. Even though one group achieves normality, the other does not, meaning I do not have normally distributed data. Since I do have a large dataset, I will also produce a Quantile-Quantile (Q-Q) plot just to confirm this.

Seaborn doesn't have a Q-Q plot, although scipy.stats does. What we need to do is create an array outside of our function and feed that in as our data to stats.probplot. Copy the array line from inside the function and paste it above the plot we are going to make. stats.probplot takes as arguments the array of data (stan), the distribution and the plotting target, which here is pylab:

stan = standard['math score'].to_numpy()
stats.probplot(stan, dist="norm", plot=pylab)
pylab.show()
Image by Author
free = freereduced['math score'].to_numpy()
stats.probplot(free, dist="norm", plot=pylab)
pylab.show()
Image by Author

When it comes to normality, while it's common to check it, normality tests are known for rejecting larger datasets. We are asking: is there persuasive evidence of any deviation from the Gaussian ideal? With a large enough sample there almost always will be, because these tests are testing against perfect normality. I would recommend formal normality testing if your sample is on the smaller end (10–100), but not with this dataset.

At this point, I am once again continuing. However, if after the Q-Q plot you are still worried about normality, I would suggest:

  1. Moving over to the Mann-Whitney U test
  2. Transforming the dataset via a different method

The next step is to just verify the homogeneity of variances.

Homogeneity of Variances

To test the equal variance of a dataset, we can do it in two ways: Bartlett's test or Levene's test. These are the most common ways to check this. Since we already have the two arrays of data stored in stan and free, this is going to be a lot quicker. We are going to focus on Levene's test because we found deviations from normality with our Shapiro-Wilk test (p < 0.05).

For this test, all we need to do is use the stats.levene and put in our two arrays.

levene = stats.levene(free,stan)
levene

My results were:

LeveneResult(statistic=3.193786657293625, pvalue=0.07422200559323446)

Since the p-value is > 0.05, we can establish equal variance between the two groups.
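For reference, had both groups passed the normality check, Bartlett's test (mentioned above) uses the same call pattern:

# Bartlett's test is the equivalent check for normally distributed data
bartlett = stats.bartlett(free, stan)
bartlett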

So at this point, your data should be ready to actually do the Independent-sample t-test.

Independent T-Test

In Python we can literally just run stats.ttest_ind and get the results we need, but like above, we want output we can report like SPSS. So in this section we will cover how to:

  1. Build a descriptive statistic table in python
  2. Mean difference, STD error difference, 95% confidence intervals in lower and upper bounds
  3. Independent T-Test results
  4. Displaying the data
  5. Exporting your results

Descriptive Statistic Table

For this table we will need the groups we are looking at (free/reduced and standard lunch), the mean, the number of participants, the standard deviation and the standard error of the mean for each group. Python has all the tools we need.

To get the mean all we need to do is use .mean(), for standard deviation .std(), for standard error of the mean .sem(), for the number of participants len(), and for the groups we can groupby the column lunch.
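In fact, groupby can produce a quick version of this whole table in one line; a sketch, before we build it by hand below so each piece is clear:

# One-line version of the descriptive table using groupby + agg
df.groupby('lunch')['math score'].agg(['mean', 'count', 'std', 'sem'])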

Putting this all together is where it can get a little tricky. So, let’s lay it out.

mean = x[col].mean()
count = len(x[col])
std = x[col].std()
std_err = x[col].sem()

For each of these, if I change x to the data frame and col to math score, we get the value for the whole data frame.

mean = df['math score'].mean()
mean
66.089

What we want to do is make it so that it groups by the lunch column and makes it into a nice dataframe. To do this we are going to make a function which calls all these functions and puts them neatly into said dataframe.

def descript_vals(x, col):
    mean = x[col].mean()
    count = len(x[col])
    std = x[col].std()
    std_err = x[col].sem()
    data = [mean, count, std, std_err]
    return data

descript_vals(df, 'math score')

This will now print:

[66.089, 1000, 15.16308009600945, 0.4794986944695449]

What we want to do now is use the data frames we stored earlier in standard and freereduced to run this function and create a list for both lunch types.

#standard = df[df["lunch"] == "standard"]
#freereduced = df[df["lunch"] == "free/reduced"]
stan = descript_vals(standard,'math score')
free = descript_vals(freereduced,'math score')

As the final step, we are now going to compile all of this together to make a data frame.

(List the free/reduced values first so they line up with the order of the Lunch labels.)

descriptive_table = {'Lunch': ["free/reduced", "standard"],
'Mean': [free[0], stan[0]],
'Count': [free[1], stan[1]],
'Standard Deviation': [free[2], stan[2]],
'Standard Deviation Error': [free[3], stan[3]]}
descript = pd.DataFrame(descriptive_table)
descript
Image by Author

So what can we say about this table in terms of the dataset? We can say that the individuals who had a standard lunch did better in their math scores (70.03 ± 13.65) than individuals who had a free/reduced lunch (58.92 ± 15.16).

Mean difference, STD error difference, 95% confidence intervals in lower and upper bounds

Because we confirmed equal variance with the Levene's test we did earlier, we can figure out the rest of the descriptive statistics.

The mean difference is calculated by simply taking the top mean, in this case the standard lunch group, and subtracting the bottom mean, the free/reduced lunch group.

The 95% confidence interval is tricky. I chose to use this function I found on StackOverflow.

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    h = se * stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h

What it's doing is running the equation for a confidence interval. It first converts the data we pass in to an array (a), then gets the length of the data (n), then calculates the mean (m) and the standard error (se). The final step multiplies the standard error by the critical t value (stats.t.ppf, the inverse of the t distribution's CDF) and returns the mean, mean - h and mean + h.

meandiff = stan[0]-free[0]
stderrdif = free[3]-stan[3]
min_val = df['math score'].min()
max_val = df['math score'].max()
_, lower_bound, upper_bound = mean_confidence_interval(df['math score'])
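One caveat worth flagging: in SPSS, the 'Std. Error Difference' and the 95% confidence interval refer to the difference between the two means, which is not the same as subtracting the two standard errors or taking the interval of the pooled scores. A sketch of the pooled-variance version (valid here since Levene's test supported equal variances; the variable names are mine):

# Standard error and 95% CI of the mean difference, pooled variances
n1, n2 = stan[1], free[1]  # group sizes from descript_vals
sp2 = ((n1 - 1) * stan[2]**2 + (n2 - 1) * free[2]**2) / (n1 + n2 - 2)  # pooled variance
se_diff = np.sqrt(sp2 * (1 / n1 + 1 / n2))  # standard error of the difference
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)  # two-tailed 95% critical value
print(se_diff, meandiff - t_crit * se_diff, meandiff + t_crit * se_diff)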

To once again neaten it up, you can place it into a data frame, similar to how we did above.

meandifference = {'Mean Difference':[meandiff],'Standard Deviation Error Difference':[stderrdif],'Lower Bound':[lower_bound],
'Upper Bound': [upper_bound]}
mean_df =pd.DataFrame(meandifference)
mean_df
Image by Author

Independent T-Test results

So now we are onto the final section, the actual t-test. If you really wanted to, you could create the function yourself and use your math skills to calculate the t statistic, but SciPy does offer ttest_ind. To have it output like SPSS, we are going to create a function that gives us the t statistic, the degrees of freedom (length of the two samples - 2) and the p-value.

stats.ttest_ind lets us pass the data in as arrays, and it also lets us perform a Welch t-test (by passing equal_var=False) if you have come this far and found unequal variance. It is automatically two-tailed.

stan_t = standard['math score'].to_numpy()
free_t = freereduced['math score'].to_numpy()
dof = len(stan_t) + len(free_t) - 2
t_stat, pval = stats.ttest_ind(stan_t, free_t)
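As an aside, if Levene's test had come back significant, the Welch version is one keyword away:

# Welch's t-test, which does not assume equal variances
t_stat_w, pval_w = stats.ttest_ind(stan_t, free_t, equal_var=False)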

Because we re-used the names 'stan' and 'free' above to hold the mean, std and other values, we need to remake the arrays we are working with. Essentially this creates two results: one for the degrees of freedom (dof) and one for the t statistic (t_stat) and p-value (pval). Now we can tidy this up for future use by wrapping it in a function (and, shortly, a data frame):

def independent_ttest(data1, data2):
    # degrees of freedom
    dof = len(data1) + len(data2) - 2
    t_stat, pval = stats.ttest_ind(data1, data2)
    return t_stat, dof, pval

Now, with data1 = stan_t and data2 = free_t, we get:

independent_ttest(stan_t, free_t)
(11.837180472914612, 998, 2.4131955993137074e-30)

We have an issue. Much like Excel when it squishes cells, we are getting scientific notation for the p-value (2.4131955993137074e-30). What we need to do is go back into the function and format the value.

def independent_ttest(data1, data2):
    # degrees of freedom
    dof = len(data1) + len(data2) - 2
    t_stat, pval = stats.ttest_ind(data1, data2)
    pval = f"{pval:.30f}"
    return t_stat, dof, pval

The f"{pval:.30f}" is the f-string (formatted string literal) approach; it uses the same format specification as str.format(). An alternative is to use str.format() directly:

pval= "{:.30f}".format(pval)

Once you run your function again your new output should look like this:

independent_ttest(stan_t, free_t)
(11.837180472914612, 998, '0.000000000000000000000000000002')

Now, personally, I want to put all of this into the mean differences data frame we made above. We do that by adding three new columns holding the values returned by our t-test function, indexing into its result: t_stat = [0], dof = [1] and pval = [2].

meandifference = {'Mean Difference': [meandiff],
'Standard Deviation Error Difference': [stderrdif],
'Lower Bound': [lower_bound],
'Upper Bound': [upper_bound],
'T Statistic': [independent_ttest(stan_t, free_t)[0]],
'DF': [independent_ttest(stan_t, free_t)[1]],
'pval': [independent_ttest(stan_t, free_t)[2]]}
mean_df = pd.DataFrame(meandifference)
mean_df

Now we have all the information neatly compiled into multiple dataframes.

Displaying the Data

Let's go over how we can display this data in Python. We have already done a boxplot and a Q-Q plot, so let's make them more appealing. Let's start with the boxplot.

The current code we have for it is

sns.boxplot(x=df['math score'], y=df['lunch'])

What we are doing here is calling seaborn with the two columns, and that's it. Everything else is set to the default settings. The first step is to change the figure size.

seaborn runs on top of matplotlib, so to change the figsize we need to create the figure with matplotlib and pass the axes into the seaborn boxplot.

To do this we use two values, fig and ax:

fig, ax = plt.subplots(figsize=(12, 9))

Then, when drawing the boxplot, all we need to do is pass ax=ax. The plot is then drawn onto those axes.

fig, ax = plt.subplots(figsize=(12, 9))
sns.boxplot(x=df['math score'], y=df['lunch'], ax=ax)

Now I want to change the x and y axes labels, xticks and the colour of the plot.

For the axes labels ax provides us with a bunch of set_ options such as

ax.set_title
ax.set_xlabel
ax.set_ylabel
ax.set_yticklabels

There are a few more, but we will focus on these four today. With these we can set the text, the size, the font and the location. For today I want to change these elements and make them fit better on the larger plot.

ax.set_title('Boxplot of Math Scores Compared by Type of Lunch', fontsize=25)
ax.set_xlabel('Math Scores', fontsize=17)
ax.set_ylabel('Lunch Types',fontsize=17)
ax.set_yticklabels(['Standard','Free/Reduced'])

Something else I want to discuss is seaborn's colour palettes. All you need to do is pass palette= in the boxplot call. You can set this to seaborn's predesigned colour palettes, create your own or just pass hex codes. Today I want a simple blue colour scheme.

fig, ax = plt.subplots(figsize=(12, 9))
sns.boxplot(x=df['math score'], y=df['lunch'], ax=ax, palette="Blues")
ax.set_title('Boxplot of Math Scores Compared by Type of Lunch', fontsize=25)
ax.set_xlabel('Math Scores', fontsize=17)
ax.set_ylabel('Lunch Types', fontsize=17)
ax.set_yticklabels(['Standard', 'Free/Reduced'])
Image by Author

There are many different customization options in Python, but once you get a handle on this system it becomes quite simple. You could even make your own stylesheet to create uniform plots for your research; each time you need a certain plot, call that function and just input the elements you need.
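For example, a sketch of such a helper (the function name and parameters are my own invention):

def styled_boxplot(data, x_col, y_col, title, xlabel, ylabel):
    # one place to keep the styling so every plot looks uniform
    fig, ax = plt.subplots(figsize=(12, 9))
    sns.boxplot(x=data[x_col], y=data[y_col], ax=ax, palette="Blues")
    ax.set_title(title, fontsize=25)
    ax.set_xlabel(xlabel, fontsize=17)
    ax.set_ylabel(ylabel, fontsize=17)
    return fig, ax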

Now I’ve also done the plots for a barplot and the QQplot below. Feel free to copy and then edit it to your liking.

fig, ax = plt.subplots(figsize=(12, 9))
sns.barplot(x=df['lunch'], y=df['math score'], ax=ax, palette="Blues")
ax.set_title('Math Scores Compared by Type of Lunch', fontsize=25)
ax.set_xlabel('Lunch Types', fontsize=17)
ax.set_ylabel('Math Scores', fontsize=17)
ax.set_xticklabels(['Standard', 'Free/Reduced'])
Image by Author
fig = plt.figure(figsize=[6, 6], dpi=100)
ax = fig.add_subplot(111)
fig = stats.probplot(stan_t, dist="norm", plot=plt,fit=False)
#These next 3 lines just demonstrate that some plot features
#can be changed independent of the probplot function.
ax.set_title("QQ plot of Standard Lunch")
ax.set_xlabel("Quantiles", fontsize=10)
ax.set_ylabel("Ordered Values", fontsize=10)
plt.show()
Image by Author
fig = plt.figure(figsize=[6, 6], dpi=100)
ax = fig.add_subplot(111)
fig = stats.probplot(free_t, dist="norm", plot=plt,fit=False)
#These next 3 lines just demonstrate that some plot features
#can be changed independent of the probplot function.
ax.set_title("QQ plot of Fre/Reduced Lunch")
ax.set_xlabel("Quantiles", fontsize=10)
ax.set_ylabel("Ordered Values", fontsize=10)
plt.show()
Image by Author

Exporting the Data

I made you put your data into data frames for one reason: so you could export it out of Python into a different file format. I'm going to cover exporting to Excel, because I find I can organise my data outputs more clearly there and then re-format the tables for meetings, presentations and papers.

Currently, we have a few data frames:

outerfence: the quantile percentage table we used to judge the winsorizing cut-offs

new_df: the copied dataset containing the winsorized results

shar: Shapiro Wilk test

descript: our descriptive values table

mean_df: mean difference and t-test results

Now, you could just port each of these to an individual Excel file, like:

new_df.to_excel("output.xlsx")

There is a better way to achieve this using pandas' ExcelWriter, which allows us to export all these data frames to one Excel file on different sheets. This uses openpyxl, which should be installed from your setup. Also, remember the output folder we made? You can export directly to it by simply putting 'output/filename.xlsx'.

with pd.ExcelWriter('output/independentttestresults.xlsx') as writer:
    new_df.to_excel(writer, sheet_name='Sheet1')
    outerfence.to_excel(writer, sheet_name='Sheet2')

Run this line and go to the output folder to check that it worked. When it has, feel free to add more sheets to the Excel file.

with pd.ExcelWriter('output/independentttestresults.xlsx') as writer:
    new_df.to_excel(writer, sheet_name='Sheet1')
    outerfence.to_excel(writer, sheet_name='Sheet2')
    shar.to_excel(writer, sheet_name='Sheet3')
    descript.to_excel(writer, sheet_name='Sheet4')
    mean_df.to_excel(writer, sheet_name='Sheet5')

We also have a few charts, which we have to export differently. Simply place this line of code in each plotting cell (before plt.show(), if you call it, so you don't save an empty figure):

plt.savefig('output/nameofplot.png')

Then re-run each plot and they should appear in the output folder.
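For example, for the first boxplot (the filename is my own choice):

fig, ax = plt.subplots(figsize=(12, 9))
sns.boxplot(x=df['math score'], y=df['lunch'], ax=ax, palette="Blues")
fig.savefig('output/boxplot_math_by_lunch.png')  # save before any plt.show() call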

Conclusion

Now, this tutorial was all about diving into the deep end of Python. If you were able to get this all working, congratulations! Python is easy, but you have to get the hang of it first. Today you not only performed an Independent-Samples T-Test, but also created functions and data frames, imported packages and libraries and exported your findings.

Personally, I find this form of learning works best for me, and I hope it was helpful to you too. If there is anything you want me to write a tutorial on, please let me know, as I will be making my way through as much SPSS as I can and translating it into Python.


Nicoletta Tancred

Current PhD Candidate in Australia. Studies in Human-Computer Interactions and Games Design. Owner and Editor of The Games Development Journal