In this article, we will explore relationships between variables.
Overview
We will look at:
- Some new Pandas Tricks
- Correlation
- Simple Linear Models (Bivariate)
  - Fitting
  - Interpreting
  - Evaluating
- Multiple Linear Models (Multivariate)
  - Fitting
  - Interpreting
  - Using for predictions
Configuring Pandas and NumPy
We can use `pd.set_option()` to set some notebook-wide options for Pandas.
Display
Pandas by default will abridge print outs and only show a certain number of rows, but sometimes we want to see more than this! We can use `display.max_rows` and `display.max_columns` to set this.
Pro tip: we can use `pd.set_option('display.max_rows', None)` to always show all rows, but use it with caution. If you ask Pandas to print out a DataFrame with 1,000,000 rows, it can crash your notebook.
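As a rough sketch (the particular limits here are just example values, not the post's settings):

```python
import pandas as pd

# show up to 100 rows and 50 columns before Pandas abridges the output
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 50)
```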
Scientific Notation
Scientific notation is when large or small numbers are shown with just the significant digits, the decimal point placed after the first digit, followed by ×10 to a power that puts the decimal point where it should be (i.e. it shows how many places to move the decimal point).
e.g. 0.0000123 = 1.23 × 10^-5 or 123,000,000 = 1.23 × 10^8
I find this hard to read, so you can suppress it in Pandas and NumPy using the commands below:
```python
import pandas as pd
```
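Only the import survives from the original block; a minimal sketch of suppressing scientific notation in both libraries (the exact float format is my assumption) might be:

```python
import numpy as np
import pandas as pd

# show Pandas floats as plain decimals rather than scientific notation
pd.set_option("display.float_format", lambda x: f"{x:.6f}")

# do the same for NumPy arrays
np.set_printoptions(suppress=True)
```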
The Dataset
The Electronic Frontier Foundation has collected a dataset called the Atlas of Surveillance. The Atlas of Surveillance is a database of surveillance technologies deployed by law enforcement in communities across the United States. This includes drones, body-worn cameras, automated license plate readers, facial recognition, and more. I’m curious if any states are utilising this technology more than others and why this might be. We will see how we might go about using some data science methods to approach this topic.
First, let’s explore the data:
```python
# Load in the CSV
# Look at the columns
```
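Only the comments survive from the original block; a hedged sketch of the load-and-inspect step (the filename is an assumption) could be:

```python
import pandas as pd

# Load in the CSV (hypothetical filename; use the actual Atlas of Surveillance export)
police_df = pd.read_csv("atlas_of_surveillance.csv")

# Look at the columns and the first few rows
print(police_df.columns)
police_df.head()
```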
.value_counts()
We can use `value_counts()` to count up occurrences of unique values.
We use it below to see what types of surveillance tech there are, and how often each one occurs in the dataset. We can also use it to see which states have the most reports.
```python
# Count the types of surveillance tech in the database
# Count the reports for each state
```
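As a sketch under assumptions (the technology column name is a guess; the `State` column is used throughout the rest of the post):

```python
# Count the types of surveillance tech in the database
# ("Type of Technology" is a guessed column name)
print(police_df["Type of Technology"].value_counts())

# Count the reports for each state
print(police_df["State"].value_counts())
```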
.reset_index()
`value_counts()` returned us a Series, with the state abbreviation as the index. Moving forwards, it’s going to be useful to have this data (the state and its count) as its own DataFrame.
We use `.reset_index()` to move all the data into its own columns, then apply new column names.
```python
state_counts_df = pd.DataFrame(police_df["State"].value_counts()).reset_index()
```
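The original block is truncated after the first line; the "apply new column names" step might look like this ("Surveillance" matches the column name used later in the post):

```python
# rename the two columns produced by reset_index()
state_counts_df.columns = ["State", "Surveillance"]
state_counts_df.head()
```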
Bringing in extra data
It looks from above that California is the worst offender, but my limited knowledge of the USA also makes me think that they probably have more people and more police. Is their large number of reports just accounted for by this? One might also assume that higher crime rates could explain higher numbers of reported surveillance technology.
I found an FBI dataset which gives me not only the populations for each state, but also some crime statistics which might be useful later.
First we’re going to have to do some wrangling of the data! Loading it in, I immediately see a few things wrong that I’ll need to sort out:
- There seem to be some extra columns at the end that have no information in them! We can probably just remove these.
- Each `Area` is given, followed by two `NaN` rows. This is because I made a `.csv` out of a weirdly formatted Excel sheet (the original format of the dataset), and there are current year, previous year and change stats for each `Area`. We only need one for each `Area`.
```python
# https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/topic-pages/tables/table-4
```
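Only the source URL survives from that block; loading the data might look something like this (the filename is an assumption):

```python
# FBI Table 4, saved as a CSV from the original Excel sheet (hypothetical filename)
pop_df = pd.read_csv("fbi_table4_2019.csv")
pop_df.head()
```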
.drop()
We can use `.drop()` to remove our unwanted columns. We give the names and then the argument `axis=1` to tell Pandas we are talking about columns that need dropping.
```python
# Drop the columns we do not want
```
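The block above is truncated; a hedged sketch (the "Unnamed" labels are a guess at what the empty trailing columns are called):

```python
# Drop the empty trailing columns (their names here are hypothetical)
pop_df = pop_df.drop(["Unnamed: 4", "Unnamed: 5"], axis=1)
```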
.dropna()
We can use `.dropna()` to drop any rows that have `NaN` values in them. This should remove the unwanted rows under each `Area`.
```python
# Drop all rows with NaN in
```
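A minimal sketch of that step:

```python
# Drop all rows that contain a NaN value
pop_df = pop_df.dropna()
```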
.dtypes, .to_numeric(), str.replace()
Using `.dtypes` we can see that `Population`, `ViolentCrime` and `ViolentCrimePer100000` are all of type `object`. We will need to convert them to numbers if we’re going to use them for calculations.
`pd.to_numeric()` lets us convert the `object` columns to numbers. This will work for `ViolentCrimePer100000` but not for `Population` and `ViolentCrime`. This is because those numbers actually have commas in them and Pandas isn’t smart enough to work it out.
We can remove the commas by using `.str.replace()`. Here we provide the string we want to replace, and the string we want to insert in its place.
```python
print("Before\n", pop_df.dtypes)
```
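The original block only shows the first print; the conversion described above might look like this (a sketch using the column names from the text):

```python
# remove the thousands separators, then convert to numbers
pop_df["Population"] = pd.to_numeric(pop_df["Population"].str.replace(",", ""))
pop_df["ViolentCrime"] = pd.to_numeric(pop_df["ViolentCrime"].str.replace(",", ""))

# this column has no commas, so it converts directly
pop_df["ViolentCrimePer100000"] = pd.to_numeric(pop_df["ViolentCrimePer100000"])

print("After\n", pop_df.dtypes)
```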
.unique()
Now let’s look at the place names. This dataset uses `Area`, and using `.unique()` we can see all the unique values in this column. We see there are states, as well as larger geographic areas (e.g. West North Central).
```python
pop_df["Area"].unique()
```
str.strip(), .rename()
We want to match up this data with the surveillance data (which is separated into states), but there are a few issues facing us:
- Our column is called `Area` and not `State`
- Our states have tab spacing in front of them
- Our states are full names, while the surveillance data uses abbreviations

We can fix the first two issues with `.rename()` and `str.strip()`, the latter removing leading and trailing whitespace.
```python
pop_df = pop_df.rename(columns = {'Area':'State'})
```
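The block is cut off after the rename; the whitespace-stripping step it describes would look roughly like:

```python
# remove the leading tab/whitespace from each state name
pop_df["State"] = pop_df["State"].str.strip()
```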
.merge()
To combine the datasets we’re going to use a Pandas function called `.merge()`. It takes any columns with the same name and matches rows that have the same value for that column.
First, we are going to merge our FBI dataset with another one I found that has both state names and state abbreviations. They’ll match everything along the State column. Luckily, anything with no match gets dropped, so this removes all of our non-state entries as well. Then we are going to merge the new dataset with the surveillance dataset along the Code column and remove all the unwanted columns.
```python
# Load in abbreviations
# Merge abbreviations and FBI
# Merge new FBI and Surveillance
```
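Only the comments survive from the original blocks; a hedged sketch of the merging steps (the abbreviations filename and its exact column names are assumptions) might be:

```python
# Load in abbreviations (hypothetical file with full "State" names and "Code" abbreviations)
abbrev_df = pd.read_csv("state_abbreviations.csv")

# Merge abbreviations and FBI data: .merge() matches on the shared "State" column,
# and anything without a match (the non-state areas) is dropped
fbi_df = pop_df.merge(abbrev_df)

# Merge the new FBI data with the surveillance counts along the state abbreviation
state_counts_df = state_counts_df.merge(fbi_df, left_on="State", right_on="Code", suffixes=("", "_full"))

# keep only the columns we need
state_counts_df = state_counts_df[["State", "Surveillance", "Population", "ViolentCrime", "ViolentCrimePer100000"]]
```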
Examining the relationship between two variables
So we saw that there was a big difference in the number of surveillance reports for each state, but suspected that it may be linked to population.
Here we plot them against each other and it looks like they are related!
```python
import matplotlib.pyplot as plt
```
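The rest of the plotting code is missing from the block above; a minimal sketch of the scatter plot it describes:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(state_counts_df["Population"], state_counts_df["Surveillance"])
ax.set_xlabel("Population")
ax.set_ylabel("Surveillance reports")
plt.show()
```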
Correlation
We’ve previously used correlation to investigate how much of a relationship there is between two variables.
We would say there is probably a positive correlation between `Population` and `Surveillance`. If the relationship ran the other way round, we’d say it had a negative correlation.
But can we calculate this relationship with a number?
.stats.pearsonr()
Pearson’s correlation coefficient runs from -1 to 1 and tells us how correlated two variables are. A correlation of 0 is random (no relationship).
Here we see that Pearson’s coefficient is 0.891820. This closeness to 1 tells us that there is a pretty good correlation between `Population` and `Surveillance`.
```python
# Correlate
```
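A sketch of how that coefficient might be computed with SciPy:

```python
from scipy import stats

# Pearson's r and its p-value
r, p = stats.pearsonr(state_counts_df["Population"], state_counts_df["Surveillance"])
print(r)  # ~0.89, as quoted above
```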
Making Models
So we have determined that we think there is a linear relationship between these two variables; now we’ll look at how we can build a linear model to fit this relationship.
Having a model allows us to make predictions about new values e.g. given what we know, how much surveillance tech should a state have based on its population?
We can define the function for any line in the form
y = ax + b
Where `a` gives us the slope of the line, and `b` tells us where this line intercepts the y axis.
```python
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
```
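The rest of that block is missing; presumably it drew a few lines with different slopes and intercepts. A hedged sketch of that idea (the particular values of a and b are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))

# four example (slope, intercept) pairs, chosen arbitrarily
for ax, (a, b) in zip(axes.flat, [(1, 0), (2, 1), (-1, 5), (0.5, -2)]):
    ax.plot(x, a * x + b)
    ax.set_title(f"y = {a}x + {b}")

plt.show()
```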
What is a model?
The US Federal Reserve defines a model as:
A representation of some aspect of the world that is based on simplifying assumptions.
David Speigelhalter says:
A statistical Model is a formal representation of the relationship between two variables, which we can use for... explanation or prediction.
And broadly we can determine that for any observation:
observation = deterministic model + error
So our model will be defined completely using the formula below:
y = ax + b + e
Where `y` is the dependent variable (surveillance), `x` is the independent variable (population), `a` and `b` are two parameters for us to set, and `e` is the error.
Least Squares Error
Similarly to when we looked at neural networks, we will need to find the parameter values (`a` and `b`) that minimise (get the lowest value of) an error function.
In the case of a linear model, we use the sum of the squared error.
- Take the error of each point (actual y - predicted y)
- Square it (this makes all the differences positive, so they don’t cancel out)
- Sum all the squared errors together
Whilst it is possible to use gradient descent to find the right values of a and b, there is actually an algebraic solution where we can just solve to find the best values!
In this lovely GIF, we see how the errors for each point change with different lines (different values of a and b).
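As a small illustration of the three steps above (a sketch, not the post's own code), the sum of squared errors for a candidate line can be computed like this:

```python
import numpy as np

def sum_squared_error(x, y, a, b):
    # predicted y for each point under the line y = a*x + b
    y_pred = a * x + b
    # square the errors (actual - predicted) and sum them
    return np.sum((y - y_pred) ** 2)
```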
.stats.linregress()
And we can do that using `linregress()` from the `scipy.stats` package. This returns a value for the slope (a) and the intercept (b).
We can then plot the line; it looks like a pretty good model!
```python
data = state_counts_df[["Population", "Surveillance"]].values
```
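The block above is cut off after the first line; a hedged sketch of the fit and plot it describes:

```python
import matplotlib.pyplot as plt
from scipy import stats

# column 0 is Population, column 1 is Surveillance
slope, intercept, r_value, p_value, std_err = stats.linregress(data[:, 0], data[:, 1])

plt.scatter(data[:, 0], data[:, 1])
plt.plot(data[:, 0], slope * data[:, 0] + intercept, color="red")
plt.xlabel("Population")
plt.ylabel("Surveillance reports")
plt.show()
```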
Interpreting the Model
So instead of having a list of points, we now have a model, and this model is defined by a slope and an intercept.
Surveillance = 1.85*Population + 8.5 + error
Beyond plotting the line, what else does this model tell us?
With an intercept of 8.5 and a slope of 1.85, we can say that starting at 8.5, for every additional 100,000 people we’d expect to see roughly 1.85 more reports of surveillance tech.
We can also put in new population values and predict what level of surveillance we might expect.
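For instance (a sketch using the fitted values quoted above; the new population value is arbitrary and in whatever units the model was fitted with):

```python
# fitted values from the model above
slope, intercept = 1.85, 8.5

# predict the expected number of reports for a new, hypothetical population value
new_population = 50
predicted_reports = slope * new_population + intercept
print(predicted_reports)
```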
p-value
We also get a p-value returned by our model, and we know that this is the result of a hypothesis test.
The hypothesis we are looking to test is that there is a non-zero slope for this model, making the null hypothesis that the slope is 0. A p-value of less than 0.05 allows us to reject the null hypothesis and conclude there is a relationship between the two variables.
Our low p-value allows us to conclude this is in fact the case for Surveillance and Population.
Which states don’t fit the model, .annotate()
We can see that our r^2 is close to 1, which means that our model is a good fit.
We can also see that some of the states deviate quite significantly from the general trend, suggesting that these states perhaps use more or less surveillance than one would expect from a state of that size.
In order to see which states to investigate further, we can take the state codes and use the `.annotate()` function in `pyplot` to add in some labels.
```python
def plot_scatter(x, y, xlabel, ylabel):
    ...
```
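The body of that helper is missing; a hedged sketch of a labelled scatter plot using `.annotate()` (not necessarily the post's exact function):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(state_counts_df["Population"], state_counts_df["Surveillance"])

# label each point with its state code
for _, row in state_counts_df.iterrows():
    ax.annotate(row["State"], (row["Population"], row["Surveillance"]))

ax.set_xlabel("Population")
ax.set_ylabel("Surveillance reports")
plt.show()
```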
We can also get this calculation for every state, and we can see that proportionally, South Carolina has the most excess surveillance beyond what our population based model would predict.
```python
fl = state_counts_df[state_counts_df["State"] == "FL"]
```
```python
state_counts_df["ratioToModel"] = state_counts_df["Surveillance"]/((state_counts_df["Population"]/100000)*slope)
```
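A small follow-up sketch to surface the states with the largest ratios, which is how South Carolina would show up at the top:

```python
# states whose surveillance count most exceeds the model's prediction
state_counts_df.sort_values("ratioToModel", ascending=False).head()
```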
Violent Crime is not a good predictor
So Population can explain most of the variation in our data, but what other information could we bring in that might also account for the error that exists between some of the observations and the model?
Average crime budget? Percentage of Republican Senators? Average Income? Ethnic make up?
Our FBI stats also tell us about the violent crimes committed in each state. Perhaps the amount of crime will also be a good predictor of surveillance tech?
```python
data = state_counts_df[["Population", "ViolentCrime"]].values
```
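The rest of that block is missing; a hedged sketch of checking how well violent crime tracks the surveillance counts (not the post's exact code):

```python
from scipy import stats

# how well does violent crime correlate with surveillance reports?
r, p = stats.pearsonr(state_counts_df["ViolentCrime"], state_counts_df["Surveillance"])
print(r, p)
```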
Multiple Regression
Above, we have seen that sometimes just using one independent variable to predict another fails to capture the whole relationship. In this case we can build a model that takes multiple variables into account.
For example, if we had 3 input values for each example, we could try to fit the model below:
y = ax1 + bx2 + cx3 + d + error
This could be:

- Price of a painting at auction (y) given:
  - previous sale price (x1)
  - time since last sold (x2)
  - age of painting (x3)
- Box office takings for a film (y) given:
  - money spent on advertising (x1)
  - number of A-list stars (x2)
  - time of year released (x3)
- Daily minutes spent on a social media app (y) given:
  - number of friends (x1)
  - age (x2)
  - annual wage (x3)
linear_model.LinearRegression()
Again, this can be solved algebraically, and we can use the `linear_model.LinearRegression()` function from the `sklearn` library to do so.
Taking an example from W3Schools, we can fit a linear model to find the CO2 emissions of a car, based on both its weight and volume.
The resulting coefficients and intercept show us that, from a starting point of 80g:
- For every 1kg of car, we expect a further 0.0075g of CO2
AND
- For every 1 cubic cm of car we expect a further 0.0078g of CO2.
The interesting thing about this is that we can determine the contribution of each separate variable to predicting the outcome.
```python
from sklearn import linear_model
```
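Only the import survives from that block; a hedged sketch of the W3Schools-style fit described above (the CSV filename and exact column names are assumptions):

```python
import pandas as pd
from sklearn import linear_model

# hypothetical cars dataset with Weight (kg), Volume (cm3) and CO2 (g) columns
cars_df = pd.read_csv("cars.csv")

X = cars_df[["Weight", "Volume"]]
y = cars_df["CO2"]

regr = linear_model.LinearRegression()
regr.fit(X, y)

print(regr.intercept_)  # the starting point (~80g in the text above)
print(regr.coef_)       # the per-kg and per-cubic-cm contributions
```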
Predicting New Values, .predict()
Now we have fitted the model, we can use it to predict new values using the `.predict()` function.
For example, if we have a car which weighs 800kg and is 2000 cubic cm, we would expect 101.34g of CO2 emission.
```python
regr.predict([[800, 2000]])
```
Caveats
George Box (apparently) said:

"All models are wrong, but some are useful."
So whilst we have found some interesting insights about the reports of surveillance technology across different states in the USA, it’s also important to note the assumptions made:
- The Atlas of Surveillance dataset is not necessarily complete, as it is crowd-sourced by volunteers.
- We have used the fairly coarse measure of just counting reports, modelling all reports as equal.
- Reported crime is only a proxy for crime (more police and surveillance are likely to lead to more reported crime, and certain crimes, such as sexual harassment and abuse or domestic abuse, are under-reported).
Summary
- We can model one variable based on the values of one (or more) other variables
- This model attempts to fit a straight line (or plane) to map this relationship
- We fit the line by trying to minimise the squared error; however, there will always be some error in the model
- We can use this model to predict new values
- Or to explain something about the relationship between the variables
About this Post
This post is written by Siqi Shu, licensed under CC BY-NC 4.0.