July 10, 2021

IDS1 - What is a Dataset

Where does your data come from?

  • Classification with KNNs, Naive Bayes, Decision Trees and Random Forests

  • Regression

  • Clustering

  • Data Ethics

What is Data Science?

It’s a broad church.

  • Improving decision making through the analysis of data

  • Can involve Machine Learning

    Extracting patterns from structured and unstructured data

  • But we also need to know how to capture and clean data, and there are issues of ethics and regulation to consider.

So, what kinds of questions can we answer?

Can we group people exhibiting similar behaviors together? Can we spot abnormal events? Can we predict future values? Can we assign a label to new data?

Importantly, we address these questions using data, rather than hand-coding rules. Often, this is a faster approach that provides more robust solutions, and it allows us to solve complex problems that may otherwise be intractable.
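
As a rough sketch of what "assigning a label to new data" without hand-coded rules can look like, here is a tiny k-nearest-neighbours classifier in plain Python. The features, labels, and numbers are invented for illustration, and `knn_predict` is a hypothetical helper rather than anything from the course materials.

```python
from collections import Counter
from math import dist

# Hypothetical labelled examples: (hours_online, purchases) -> label.
# These values are made up purely to illustrate the idea.
training = [
    ((1.0, 0.0), "casual"),
    ((1.5, 1.0), "casual"),
    ((2.0, 0.0), "casual"),
    ((8.0, 5.0), "frequent"),
    ((9.0, 6.0), "frequent"),
    ((7.5, 4.0), "frequent"),
]

def knn_predict(point, examples, k=3):
    """Label a new point by majority vote among its k closest examples."""
    # Sort the labelled examples by Euclidean distance to the new point.
    nearest = sorted(examples, key=lambda ex: dist(point, ex[0]))[:k]
    # Count the labels of the k nearest examples and return the most common.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((8.5, 5.5), training))
print(knn_predict((1.2, 0.5), training))
```

Nothing in the code says "more than five hours means frequent"; the boundary comes entirely from the examples, so changing the data changes the answer.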

Where is data used?

Walmart! Walmart’s models restocked strawberry Pop-Tarts in stores in the path of Hurricane Frances, based on earlier weather-related shopping data. Another often-cited retail example (a story usually told about Target rather than Walmart) is recommending pregnancy products before the customer knows themselves.

What is data?

Errr… it’s actually “what are data”. “Data” is technically plural; “datum” is the singular. We consider data as having different types:

Quantitative

  • Discrete (the number of songs on an album or the number of tickets sold for a play)
  • Continuous (almost any numeric value)

Qualitative (categorical)

  • Nominal (no order)
  • Ordinal (intrinsic order)
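
The type of a variable decides which summaries make sense. The sketch below uses made-up album records (the field names and `RATING_ORDER` are assumptions for illustration): means for the numeric columns, a mode for the nominal one, and a median for the ordinal one after stating its order explicitly.

```python
from statistics import mean, median_low, mode

# Hypothetical records, invented purely for illustration.
albums = [
    {"tracks": 12, "length_min": 41.5, "genre": "jazz", "rating": "good"},
    {"tracks": 9, "length_min": 38.2, "genre": "rock", "rating": "poor"},
    {"tracks": 14, "length_min": 55.0, "genre": "jazz", "rating": "excellent"},
]

# Ordinal values have an intrinsic order, which we have to spell out.
RATING_ORDER = ["poor", "good", "excellent"]

def summarize(records):
    """Pick a summary statistic that suits each variable's type."""
    ranks = [RATING_ORDER.index(r["rating"]) for r in records]
    return {
        # Discrete and continuous: arithmetic is meaningful.
        "mean_tracks": mean(r["tracks"] for r in records),
        "mean_length_min": mean(r["length_min"] for r in records),
        # Nominal: no order, so we can only count categories.
        "most_common_genre": mode(r["genre"] for r in records),
        # Ordinal: order is meaningful, distances between levels are not,
        # so we take a median over the ranks instead of a mean.
        "median_rating": RATING_ORDER[median_low(ranks)],
    }

summary = summarize(albums)
print(summary)
```

Averaging the genre column would be meaningless, and averaging ratings would quietly assume that "poor" to "good" is the same step as "good" to "excellent".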

“All data are cultural artefacts, created by people…at a time, in a place”

- Yanni Loukissas, All Data Are Local

Digital humanities theorist Johanna Drucker says we should use capta (meaning “taken”) rather than data (meaning “given”).

“Raw Data is an oxymoron.”

- Geoffrey Bowker and Lisa Gitelman

All data are in some way curated (or cooked): collected for a reason and recorded in a certain way for a reason, encapsulating both prejudice and intent (e.g. the way Facebook records gender).

Another example is the underreporting of sexual assault on campuses: higher reporting often means better support and more serious attempts to deal with the problem.

“Data feminism asserts that data are not neutral or objective. They are the products of unequal social relations, and this context is essential for conducting accurate, ethical analysis… Big Dick Data is a formal, academic term that we, the authors, have coined to denote big data projects that are characterized by patriarchal, cis-masculinist, totalizing fantasies of world domination as enacted through data capture and analysis. Big Dick Data projects ignore context, fetishize size, and inflate their technical and scientific capabilities.”

- Catherine D’Ignazio and Lauren Klein, Data Feminism

GDELT is a data aggregator that collects and sells data. FiveThirtyEight used it to write an article suggesting a massive increase in kidnappings in Nigeria; however, the data actually represented media reports of kidnappings, not discrete events, which massively overstated the increase.

Although data is more available than ever, we must properly examine where it comes from. “Zombie Data” (Daniel Kaufman) refers to datasets that have been published without a purpose in mind, and in a state that makes their context, and so their correct analysis, tricky.
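
The gap between counting media reports and counting events, as in the GDELT case, can be sketched with a toy example. The rows below are invented, and the tidy `event_id` field is an assumption: real aggregated news data rarely tells you which articles describe the same event, which is exactly why the mistake is easy to make.

```python
from collections import Counter

# Hypothetical media reports: each row is one news article, not one event.
reports = [
    {"event_id": "A", "year": 2013},
    {"event_id": "A", "year": 2013},
    {"event_id": "B", "year": 2014},
    {"event_id": "B", "year": 2014},
    {"event_id": "B", "year": 2014},
    {"event_id": "B", "year": 2014},
    {"event_id": "C", "year": 2014},
    {"event_id": "C", "year": 2014},
]

def reports_per_year(rows):
    """Count articles per year (what a naive row count measures)."""
    return Counter(r["year"] for r in rows)

def events_per_year(rows):
    """Count distinct events per year by de-duplicating on event_id."""
    unique = {(r["event_id"], r["year"]) for r in rows}
    return Counter(year for _, year in unique)

print(reports_per_year(reports))
print(events_per_year(reports))
```

In this toy data the number of reports triples from 2013 to 2014 (2 to 6), while the number of events only doubles (1 to 2). More media coverage per event looks like more events if we never ask what a row represents.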

Data have heterogeneous sources. Heather Krause investigates violence against women over their lifetimes using UN data and finds some interesting trends.

Data Biographies: Getting to Know Your Data

Some of the data reflected all women, some reflected only women of a certain age, and some only included women of a specific marital status; the methods and definitions change over time and between countries. Sometimes an apparent trend has an alternative explanation rooted in how the data were collected.
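
One practical habit this suggests is recording who each figure actually describes, and only comparing figures that share a definition. The sketch below is a minimal illustration under that assumption: the `Estimate` fields, country names, and rates are all invented placeholders, not UN data.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Estimate:
    country: str
    year: int
    population: str  # who was actually surveyed
    rate: float      # reported prevalence as a fraction (placeholder values)

# Hypothetical figures, invented for illustration only.
estimates = [
    Estimate("Country A", 2005, "all women", 0.30),
    Estimate("Country B", 2005, "all women", 0.25),
    Estimate("Country C", 2008, "ever-married women", 0.40),
    Estimate("Country D", 2010, "women aged 15-49", 0.35),
]

def group_by_population(rows):
    """Group estimates so that only like-for-like figures sit together."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.population].append(row)
    return dict(groups)

for population, rows in group_by_population(estimates).items():
    print(population, [r.country for r in rows])
```

Here only Country A and Country B can be compared directly; ranking all four against each other would mix three different definitions of who was counted.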

Sometimes the data is telling us the “truth”.

“Spots that we’ve left blank reveal our hidden social biases and indifferences”

- Mimi Onuoha, Library of Missing Datasets

Includes:

  • People excluded from public housing because of criminal records
  • Muslim mosques/communities surveilled by the FBI/CIA
  • Undocumented immigrants currently incarcerated and/or underpaid

Those who have the resources to collect data lack the incentive to (corollary: often those who have access to a dataset are the same ones who have the ability to remove, hide, or obscure it). E.g. although police use data for predictive policing, there is little history of rigorously collecting and analysing data about police brutality.

The data to be collected resist simple quantification (corollary: we prioritize collecting things that fit our modes of collection).

The act of collection involves more work than the benefit the presence of the data is perceived to give. E.g. sexual harassment is often underreported because the mechanism for reporting it is difficult and traumatic.

We have seen the issues that need inspection when we first get our dataset. Next, we will look at how best to interpret what comes out of our models, how to approach this critically, and how we use these insights: who benefits and who loses out.

About this Post

This post is written by Siqi Shu, licensed under CC BY-NC 4.0.