### Goals

- Introduce the concept of data
- Compare the different types of data
- Explore data sampling and types of study designs

## Why do we care about statistics?

“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem”

— John Tukey

“By a small sample, we may judge of the whole piece.”

— Miguel de Cervantes from Don Quixote

We would like to say that we know exactly how the world works, but we almost never do. Instead, we use statistics to get an **approximate** view of it.

Of course, the accuracy of our approximation depends on many factors, including what data we collect, how we collect them, and how we analyze them.

**Data** are the information gathered for reference or analysis, and the different types of data that are collected are called **variables**.

Variables can be broken down in this way:

**Numerical**: numbers, either whole or fractional

- Discrete: only certain values are possible within a given range (e.g. counts)
- Continuous: any value is possible within a given range (e.g. measurements)

**Categorical** (factors): identifies a grouping of data
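As a rough illustration of the breakdown above, the sketch below tags each field of a single made-up record with its variable type. The `PenguinRecord` class, its field names, its values, and the `variable_type` helper are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class PenguinRecord:
    # All field names and values are invented for illustration.
    eggs_laid: int       # numerical, discrete: a count
    body_mass_g: float   # numerical, continuous: a measurement
    species: str         # categorical (factor): a group label


def variable_type(value):
    """Classify one Python value using the scheme described above."""
    if isinstance(value, bool):
        # True/False labels a group, even though bool is a subclass of int.
        return "categorical"
    if isinstance(value, int):
        return "numerical (discrete)"
    if isinstance(value, float):
        return "numerical (continuous)"
    return "categorical"


record = PenguinRecord(eggs_laid=2, body_mass_g=3750.5, species="Adelie")
types = {name: variable_type(value) for name, value in vars(record).items()}

for name, kind in types.items():
    print(f"{name}: {kind}")
```

The storage type is only a hint. A postal code stored as an integer is still categorical, so the classification ultimately depends on what the variable means, not on how it is stored.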

## How do we collect data?

Scientists and statisticians design **studies** to collect data for analysis:

- **Experimental studies**: design an experiment that has specific treatments and controls for outside variables. Good for determining cause and effect, but difficult to generalize to real-world situations.
- **Observational studies**: make observations about the real world. Good for detecting patterns, but they do not control for outside variables, so it is difficult to determine cause and effect.

## How much data do we need?

- **Population (N)**: all the data from a particular source. Any conclusions from a population are **parameters**, which are constant and exact.
- **Sample (n)**: a subset of the population that is collected for analysis. Any conclusions from a sample are **estimates**, which are random and approximate.
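The difference between a parameter and an estimate can be sketched with a small simulation. The population here is simulated (10,000 values drawn from a normal distribution with arbitrary settings), so every number is an assumption made for illustration rather than real data.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: N = 10,000 simulated measurements.
population = [rng.gauss(170, 10) for _ in range(10_000)]
N = len(population)

# Parameter: computed from every member of the population, so it is one fixed value.
population_mean = statistics.fmean(population)

# Estimates: each one comes from a different random sample of size n,
# so each one lands near the parameter but not exactly on it.
n = 50
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(5)]

print(f"parameter (N={N}): {population_mean:.2f}")
for i, mean in enumerate(sample_means, start=1):
    print(f"estimate from sample {i} (n={n}): {mean:.2f}")
```

The population mean stays the same no matter how many times it is computed, while each new sample gives a slightly different estimate of it.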

Estimates, because they are based on data from a subset of the total population, are subject to **sampling error** and **sampling bias**.

**Sampling error**: the difference between a sample estimate and the true population parameter that arises by random chance.

How to fix: Increase the sample size. A larger **n** results in a lower sampling error.

**Sampling bias**: certain members of the population are more likely to be sampled than others.

How to fix: Ensure that samples are taken **randomly**, though this is difficult under some circumstances.
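The claim that a larger **n** lowers sampling error can be checked with a minimal simulation. The population below is simulated, and `typical_error` is a hypothetical helper that averages the gap between sample mean and population mean over many repeated random samples.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 20,000 values spread evenly between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(20_000)]
true_mean = statistics.fmean(population)


def typical_error(n, repeats=300):
    """Average absolute gap between a sample mean and the population mean."""
    gaps = []
    for _ in range(repeats):
        sample = rng.sample(population, n)  # simple random sample of size n
        gaps.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(gaps)


errors = {n: typical_error(n) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(f"n={n:>4}: typical sampling error = {err:.2f}")
```

Every sample here is drawn at random, so only sampling error is in play. The typical gap shrinks as n grows, but it does not reach zero unless the whole population is sampled.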

#### Sources of sampling bias:

- **Sample of convenience**: the easiest samples to collect are taken
- **Volunteer bias**: only certain members of the population provide data (e.g. people volunteering to respond to a survey)
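To see how a sample of convenience can mislead, the sketch below compares it with a random sample of the same size. The scenario is invented: a simulated population in which commute time grows with distance from a surveyor, who in the convenience case simply asks the nearest people.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: commute time in minutes grows with distance.
# Index 0 is the person living closest to the surveyor.
population = [5 + 0.01 * i + rng.gauss(0, 3) for i in range(5_000)]
true_mean = statistics.fmean(population)

n = 500
convenience = population[:n]               # the n people who are easiest to reach
random_sample = rng.sample(population, n)  # every person is equally likely

convenience_gap = abs(statistics.fmean(convenience) - true_mean)
random_gap = abs(statistics.fmean(random_sample) - true_mean)

print(f"population mean:         {true_mean:.1f}")
print(f"convenience sample mean: {statistics.fmean(convenience):.1f}")
print(f"random sample mean:      {statistics.fmean(random_sample):.1f}")
```

Both samples have the same n, yet the convenience sample misses by far more because it only ever sees one end of the population. Collecting more data the same way would not remove that gap.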