# Basic stats in R

## Basic statistical analysis of data using χ2 and t tests in in R

This is an update to my previous post from 2015.

One of the most important aspects to scientific inquiry, after experimental design and implementation and data collection, is the proper analysis and interpretation of your data.

There are a number of resources out there for learning R and a lot of different tools for authoring R code, implementing that code, and error checking that code. Google is your friend!.

This post has six parts:

- Introduction to R
- Guide to downloading and installing R and RStudio
- Writing your first .R file
- Importing data and basic functions in R
- Advanced analysis: χ2
- Advanced analysis: t tests

# Quick Introduction to R

R is a statical software environment for the analysis of data and the production of graphics (R-project.org).
Using R can be daunting for the uninitiated because it is entirely command-line based. However, this is the greatest strength that R has because it is extremely flexible, customizable, and (for most things) fast. R is also **free and open-source**, meaning that there are a lot of people out there that maintain the project and write specific statistical packages. There are plenty of other programs (SAS, SPSS, MATLAB, Excel) out there that you can use to do data analysis but they are expensive and less customizable. There are other software environments out there too (Python, Julia) for those who need a little more speed and flexibility. R brings to the table some very powerful **analysis** and **graphics** production tools in a package that is pretty easy to use (once you get the hang of it). It is an attractive program for students and professionals alike and is continuing to grow in interest and complexity. Skills in data analysis with R is now becoming a must for many looking for graduate or professional positions and is a great addition to a resumé.

# Download and install R and RStudio Free

For most students, all that needs to be downloaded are R and RStudio Free.

R is hosted on a number of servers around the world. This link is to a USA server, but you can pick any server you want. It is best to choose one that is close to your location.

RStudio is a R Integrated Development Environment (IDE) to write and run R code and R derivatives like RMarkdown provided by a company free of charge for personal use. There are also enterprise pricing options and a neat web platform called Shiny that can be used to produce cool R-based web applications. **RStudio does not come packaged with R so you need to download both programs to use RStudio.**

# Writing your first .R file

R is a program that lives almost exclusively in the command line. This is great for streamlining your work flow: you type in what you want R to do, press `ENTER`

and off you go. Great!

**BUT** once you press `ENTER`

that code is basically gone and you better hope you didn't have any typos.

The **BETTER** way of working with R is writing a plain text document that has all of the R commands you will need and saving it with a ".R" extension.
Any command that you want to pass to R should be written down in an .R file. You can save R commands
in any format you want (.docx, .txt) but if you want to be able to use it directly in R or RStudio, you need to
use the .R extension.

You can use any text editor you want. Some examples are Notepad (built into Windows), TextEdit (built into OSX), Notepad++, Atom, Sublime Text, Emacs or Vim. The last four also have plugins designed to work directly in R from that particular program. I have used all of these and really can't suggest one over the other. (If you are curious, I currently used NeoVim)

You can also use the built-in text editor in RStudio. It is located by default in the top left quarter of the program window. In that box
you can write your code and use a keyboard shortcut, **Ctl + Enter** in Windows, **Cmd + Enter** in OSX to evaluate
that code right in an R console session.

R scripts generally have two parts: **code you want to run**, and **annotations to code using comments ('#')**

Here is an example script:

` ````
## Sample R script
## Author: Matthew Lundquist
## Date: 11-1-2017
## Create vector of 10 random numbers between 1 and 1000 with replacement
x <- sample(1:1000, 10, replace = T)
## Create another vector of 10 randowm numbers between 1 and 1000 with replacement
y <- sample(1:1000, 10, replace = T)
## plot the result
plot(x, y)
```

Copy and paste the above code into the text editor in RStudio and save it as "sample.R". Run the code and it will result in a neat little scatter plot in the lower right. You can also see the values of x and y on the top right. Try this out in R studio. **Note:** You will get different results than I did.

Anything with a **#** is considered an annotation and will not be evaluated by R. Anything not "commented out" will be evaluated by R.

**Congratulations! You just wrote your first .R script.**

# Inputting data and basic functions in R

The next step is to actually use R to input and analyze your data. First we need data. In this case the data below, **iris.txt** is saved as a text file. You can
also save data as .csv (Comma Separated Value) files. Data can be edited directly in those files using a text editor but it is easier just to use a spreadsheet application like Microsoft Excel or Libre Office Calc and then saving them as a .csv.

Once you download iris.txt save it to your **Desktop**, then **open RStudio**

The easiest way to upload data into R is to point it to the directory that your data file is in then calling that data file using the **read.table()** function.

` ````
## Inputting data in R from .txt file
## Point R to the directory that contains my data file using the "set working directory" function
setwd("~/Desktop") # In OSX or Windows 10
## setwd("C://Users/UserName/Desktop") ## For older version of Windows, replace UserName with your user name
## Read in iris.txt and name it iris
iris <- read.table("iris.txt", h = T) # h = T tells R that the first column are column names, not data
## Call the data to make sure it is inputted correctly
iris
```

Your output should look something like this:

Notice that there are 150 rows of data and five columns. Each row represents an individual iris flower. The first four columns indicate what was measured from that flower. The last column identifies the species of the flower. The
first four columns are what we would call **numerical data**, we can perform mathematical operations on them. The last column is what we call a **factor** or a non-numeric identifier for your data. In this case, the column "Species"
indicates from what species of iris the data was collected. There are a total of three species in this data set.

Now that our data in in R, we can start to do basic analyses like mean and median.

` ````
## Mean Sepal.Length
sepal.mean <- mean(iris$Sepal.Length)
sepal.mean
## Median Petal.Width
petal.median <- median(iris$Petal.Width)
petal.median
## Mean Petal.Length for just setosa
setosa.petal <- mean(iris$Petal.Length[iris$Species == "setosa"])
setosa.petal
## Summary statistics for all observations and factors
summary(iris)
```

This is the output:

# Advanced analysis: χ2

A χ2 test is used to compare frequency data observed *versus* some sort of expected frequency. For example, a χ2 would be appropriate for comparing the number of cockroaches that choose one of three foods.
**Note:** This only works if you use a lot of cockroaches and that you observed each cockroach only once. Lets say you expose individually 120 cockroaches to apples, pears, and oranges. You postulate that
if there is no preference for any specific food type, they will visit each food equally. You then set your expected frequencies such that your null hypothesis is H0 : p1 = p2 = p3.
You then expose 120 roaches to your choice experiment and count the number of cockroaches that chose each food. A table of these observations would look like this:

food | apple | pear | orange | total |

observed | 39 | 25 | 56 | 120 |

expected | 40 | 40 | 40 | 120 |

Are the observed frequencies significantly different than you expected frequencies? We can use R to test that!

` ````
food <- c("apple", "pear", "orange")
observed <- c(39, 25, 56)
expected <- c(40, 40, 40)
chisq.test(observed, p = expected, rescale = TRUE)
## Chi-squared test for given probabilities
## data: observed
## X-squared = 12.05, df = 2, p-value = 0.002418
```

The results are that with a χ2 value of 12.05 with 2 degrees of freedom, our P value is < 0.05. Therefore, we reject the null hypothesis that the foods are equally preferred.

**Easy peasy!**

You can also test any number of infinite null hypotheses. For example, are apples 50 % more likely to be chosen over the other two. Your null hypothesis would then be H0: p1 = 0.50, p2 = 0.25, p3 = 0.25.

# Advanced analysis: t test

T tests are a bit more involved and are likely more appropriate for the data that you are collecting. Unlike a χ2 test, t tests use continuous data, not frequencies. A t test is appropriate for when you want to compare the averages of a particular measurement between two populations. Going back to our cockroach example, imagine that you have two groups of cockroaches one given an apple the other group given a pear. You are interested in whether or not food type affects time to feeding. You place a single cockroach in front of a type of food and measure the amount of time (in seconds) it takes for the cockroach to approach the food. You do this for 20 roaches for each food type. The data would look something like this:

apple | pear |
---|---|

27 | 27 |

9 | 26 |

53 | 1 |

35 | 21 |

50 | 55 |

11 | 53 |

14 | 55 |

33 | 4 |

29 | 12 |

46 | 6 |

60 | 17 |

26 | 18 |

45 | 15 |

2 | 31 |

19 | 42 |

36 | 31 |

54 | 1 |

54 | 8 |

31 | 10 |

52 | 55 |

For now we will assume that the variance is equal between the two food types. We can use a t test to compare between the amount of time to feed taken by cockroaches in apples and pears. Our null hypothesis would be that the mean of apples = mean of pears

` ````
apple <- c(27, 9, 53, 35, 50, 11, 14, 33, 29, 46, 60, 26, 45, 2, 19, 36, 54, 54, 31, 52)
pear <- c(27, 26, 1, 21, 55, 53, 55, 4, 12, 6, 17, 18, 15, 31, 42, 31, 1, 8, 10, 55)
t.test(apple, pear, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)
Two Sample t-test
data: apple and pear
t = 1.7369, df = 38, p-value = 0.0905
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.638486 21.438486
sample estimates:
mean of x mean of y
34.3 24.4
```

The R output may be a bit confusing but the important information can be found in the third line. A t statistic of -1.74 with degrees of freedom of 37.7 resulted in a P value of 0.09 which is > 0.05. We failed to reject the null hypothesis that the observed differences in feeding times are due to random chance.

There is infinitely more information out there on R and many, many more statistical tests that can be run based on your data and experimental design. I hope that this post has provided a good start to using R for your statistical analysis needs.