Introduction to statistics and data

Goals

  1. Introduce the concept of data
  2. Compare the different types of data
  3. Explore data sampling and types of study designs

Why do we care about statistics?

“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem”

— John Tukey

“By a small sample, we may judge of the whole piece.”

— Miguel de Cervantes from Don Quixote

While we would like to say that we know exactly how the world works, we almost never do. Instead, we use statistics to build an approximate view of it.

Of course, the accuracy of our approximation depends on many factors, including what data we collect, how we collect it, and how we analyze it.


Data are the information gathered for reference or analysis, and the different types of data that are collected are called variables.

Variables can be broken down in this way:

  • Numerical: numbers, either whole or fractions
    • Discrete: Only certain values possible within a given range (e.g. counts)
    • Continuous: Any value is possible within a given range (e.g. measurements)
  • Categorical (factors): identifies a grouping of data

For example, population size is discrete, while measurements are continuous.
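
The variable types above can be illustrated with a small sketch. The pond records, field names, and values below are all made up for illustration; they are not data from these notes.

```python
from collections import Counter
from statistics import mean

# Hypothetical records for five surveyed ponds (illustrative values only).
ponds = [
    {"frog_count": 12, "water_temp_c": 18.4, "habitat": "forest"},
    {"frog_count": 7, "water_temp_c": 21.1, "habitat": "meadow"},
    {"frog_count": 0, "water_temp_c": 16.9, "habitat": "forest"},
    {"frog_count": 23, "water_temp_c": 19.7, "habitat": "wetland"},
    {"frog_count": 5, "water_temp_c": 20.3, "habitat": "meadow"},
]

# Numerical, discrete: counts can only take whole-number values.
frog_counts = [p["frog_count"] for p in ponds]

# Numerical, continuous: a measurement can take any value within a range.
water_temps = [p["water_temp_c"] for p in ponds]

# Categorical (factor): a label that assigns each record to a group.
habitats = [p["habitat"] for p in ponds]

print("mean frog count:", mean(frog_counts))
print("mean water temperature:", round(mean(water_temps), 2))
print("ponds per habitat:", dict(Counter(habitats)))
```

Numerical variables are summarized with quantities such as the mean, while a categorical variable is summarized by counting how many records fall into each group.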

How do we collect data?

Scientists and statisticians design studies to collect data for analysis:

  • Experimental studies: design an experiment that has specific treatments and controls for outside variables. Good for determining cause and effect, but difficult to generalize to real-world situations.
  • Observational studies: make observations about the real world. Good for detecting patterns, but they do not control for outside variables, so it is difficult to determine cause and effect.

How much data do we need?

Population (N): all the data from a particular source. Values calculated from an entire population are parameters – constant and exact

Samples (n): a subset of the population that is collected for analysis. Values calculated from samples are estimates – random and approximate
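
The distinction between a parameter and an estimate can be sketched with a short simulation. The population here is invented (simulated body lengths in millimetres), and the sizes N = 10,000 and n = 25 are arbitrary choices for illustration.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: body lengths (mm) of N = 10,000 individuals.
population = [random.gauss(50, 10) for _ in range(10_000)]

# The population mean is a parameter: one fixed value for this population.
parameter = mean(population)

# Two independent random samples of n = 25 give two different estimates.
estimate_a = mean(random.sample(population, 25))
estimate_b = mean(random.sample(population, 25))

print(f"parameter (N = {len(population)}): {parameter:.2f}")
print(f"estimate from sample A (n = 25): {estimate_a:.2f}")
print(f"estimate from sample B (n = 25): {estimate_b:.2f}")
```

The parameter does not change no matter how often it is recomputed, whereas each new sample produces a slightly different estimate.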

Estimates, because they are based on data from a subset of the total population, are subject to sampling error and sampling bias

Sampling error: the difference between a sample estimate and the true population parameter that arises by random chance.

How to fix: Increase sample size. A larger n results in a lower sampling error
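
A minimal simulation of this effect, assuming a made-up, normally distributed population: for each sample size, many random samples are drawn and the spread of their means is measured. The helper name `spread_of_estimates` and all of the numbers are hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of N = 10,000 measurements.
population = [random.gauss(50, 10) for _ in range(10_000)]


def spread_of_estimates(n, repeats=500):
    """Standard deviation of the sample mean over repeated random samples of size n."""
    estimates = [mean(random.sample(population, n)) for _ in range(repeats)]
    return stdev(estimates)


# A smaller spread means estimates land closer to the parameter on average.
spreads = {n: spread_of_estimates(n) for n in (10, 100, 1000)}
for n, spread in spreads.items():
    print(f"n = {n:>4}: spread of sample means = {spread:.2f}")
```

The spread shrinks as n grows, but with diminishing returns: each tenfold increase in sample size reduces it by roughly a factor of three.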

Sampling bias: Certain members of the population are more likely to be sampled than others.

How to fix: Ensure that samples are taken randomly, though this is difficult under some circumstances.

Sources of sampling bias:

  • Sample of convenience: the easiest samples to collect are taken
  • Volunteer bias: only certain members of the population provide data (e.g. people volunteering to respond to a survey)
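
A sample of convenience can be sketched as follows, assuming a made-up population of trees in which height tends to increase with distance from a road. The scenario and every number in it are hypothetical.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: 5,000 trees stored as (distance_m, height_m).
# By assumption, trees farther from the road tend to be taller.
trees = []
for _ in range(5_000):
    distance_m = random.uniform(0, 1000)
    height_m = 5 + 0.01 * distance_m + random.gauss(0, 1)
    trees.append((distance_m, height_m))

true_mean = mean(height for _, height in trees)

# Random sample: every tree is equally likely to be chosen.
random_sample = random.sample(trees, 200)
random_estimate = mean(height for _, height in random_sample)

# Sample of convenience: the 200 trees closest to the road.
convenience_sample = sorted(trees)[:200]
convenience_estimate = mean(height for _, height in convenience_sample)

print(f"true mean height:            {true_mean:.2f} m")
print(f"random sample estimate:      {random_estimate:.2f} m")
print(f"convenience sample estimate: {convenience_estimate:.2f} m")
```

Both samples have the same size, yet only the convenience sample is systematically off. Unlike sampling error, this kind of bias does not shrink when more trees are measured the same way.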