Basic stats in R

Basic statistical analysis of data using χ2 and t tests in in R

This is an update to my previous post from 2015.

One of the most important aspects to scientific inquiry, after experimental design and implementation and data collection, is the proper analysis and interpretation of your data.

There are a number of resources out there for learning R and a lot of different tools for authoring R code, implementing that code, and error checking that code. Google is your friend!.

This post has six parts:

  1. Introduction to R
  2. Guide to downloading and installing R and RStudio
  3. Writing your first .R file
  4. Importing data and basic functions in R
  5. Advanced analysis: χ2
  6. Advanced analysis: t tests

Quick Introduction to R

R is a statical software environment for the analysis of data and the production of graphics (R-project.org). Using R can be daunting for the uninitiated because it is entirely command-line based. However, this is the greatest strength that R has because it is extremely flexible, customizable, and (for most things) fast. R is also free and open-source, meaning that there are a lot of people out there that maintain the project and write specific statistical packages. There are plenty of other programs (SAS, SPSS, MATLAB, Excel) out there that you can use to do data analysis but they are expensive and less customizable. There are other software environments out there too (Python, Julia) for those who need a little more speed and flexibility. R brings to the table some very powerful analysis and graphics production tools in a package that is pretty easy to use (once you get the hang of it). It is an attractive program for students and professionals alike and is continuing to grow in interest and complexity. Skills in data analysis with R is now becoming a must for many looking for graduate or professional positions and is a great addition to a resumé.

Download and install R and RStudio Free

For most students, all that needs to be downloaded are R and RStudio Free.

R is hosted on a number of servers around the world. This link is to a USA server, but you can pick any server you want. It is best to choose one that is close to your location.

RStudio is a R Integrated Development Environment (IDE) to write and run R code and R derivatives like RMarkdown provided by a company free of charge for personal use. There are also enterprise pricing options and a neat web platform called Shiny that can be used to produce cool R-based web applications. RStudio does not come packaged with R so you need to download both programs to use RStudio.

Writing your first .R file

R is a program that lives almost exclusively in the command line. This is great for streamlining your work flow: you type in what you want R to do, press ENTER and off you go. Great!

BUT once you press ENTER that code is basically gone and you better hope you didn't have any typos.

The BETTER way of working with R is writing a plain text document that has all of the R commands you will need and saving it with a ".R" extension. Any command that you want to pass to R should be written down in an .R file. You can save R commands in any format you want (.docx, .txt) but if you want to be able to use it directly in R or RStudio, you need to use the .R extension.

You can use any text editor you want. Some examples are Notepad (built into Windows), TextEdit (built into OSX), Notepad++, Atom, Sublime Text, Emacs or Vim. The last four also have plugins designed to work directly in R from that particular program. I have used all of these and really can't suggest one over the other. (If you are curious, I currently used NeoVim)

You can also use the built-in text editor in RStudio. It is located by default in the top left quarter of the program window. In that box you can write your code and use a keyboard shortcut, Ctl + Enter in Windows, Cmd + Enter in OSX to evaluate that code right in an R console session.

R scripts generally have two parts: code you want to run, and annotations to code using comments ('#')

Here is an example script:

    
    ## Sample R script
    ## Author: Matthew Lundquist
    ## Date: 11-1-2017

    ## Create vector of 10 random numbers between 1 and 1000 with replacement
    x <- sample(1:1000, 10, replace = T)

    ## Create another vector of 10 randowm numbers between 1 and 1000 with replacement
    y <- sample(1:1000, 10, replace = T)

    ## plot the result
    plot(x, y)
    

Copy and paste the above code into the text editor in RStudio and save it as "sample.R". Run the code and it will result in a neat little scatter plot in the lower right. You can also see the values of x and y on the top right. Try this out in R studio. Note: You will get different results than I did.

Anything with a # is considered an annotation and will not be evaluated by R. Anything not "commented out" will be evaluated by R.

Congratulations! You just wrote your first .R script.

Inputting data and basic functions in R

The next step is to actually use R to input and analyze your data. First we need data. In this case the data below, iris.txt is saved as a text file. You can also save data as .csv (Comma Separated Value) files. Data can be edited directly in those files using a text editor but it is easier just to use a spreadsheet application like Microsoft Excel or Libre Office Calc and then saving them as a .csv.

Once you download iris.txt save it to your Desktop, then open RStudio

The easiest way to upload data into R is to point it to the directory that your data file is in then calling that data file using the read.table() function.

    
    ## Inputting data in R from .txt file

    ## Point R to the directory that contains my data file using the "set working directory" function
    setwd("~/Desktop") # In OSX or Windows 10
    ## setwd("C://Users/UserName/Desktop") ## For older version of Windows, replace UserName with your user name

    ## Read in iris.txt and name it iris
    iris <- read.table("iris.txt", h = T) # h = T tells R that the first column are column names, not data

    ## Call the data to make sure it is inputted correctly
    iris

    

Your output should look something like this:

Notice that there are 150 rows of data and five columns. Each row represents an individual iris flower. The first four columns indicate what was measured from that flower. The last column identifies the species of the flower. The first four columns are what we would call numerical data, we can perform mathematical operations on them. The last column is what we call a factor or a non-numeric identifier for your data. In this case, the column "Species" indicates from what species of iris the data was collected. There are a total of three species in this data set.

Now that our data in in R, we can start to do basic analyses like mean and median.

    

    ## Mean Sepal.Length
    sepal.mean <- mean(iris$Sepal.Length)
    sepal.mean

    ## Median Petal.Width
    petal.median <- median(iris$Petal.Width)
    petal.median

    ## Mean Petal.Length for just setosa
    setosa.petal <- mean(iris$Petal.Length[iris$Species == "setosa"])
    setosa.petal

    ## Summary statistics for all observations and factors
    summary(iris)


    

This is the output:

Advanced analysis: χ2

A χ2 test is used to compare frequency data observed versus some sort of expected frequency. For example, a χ2 would be appropriate for comparing the number of cockroaches that choose one of three foods. Note: This only works if you use a lot of cockroaches and that you observed each cockroach only once. Lets say you expose individually 120 cockroaches to apples, pears, and oranges. You postulate that if there is no preference for any specific food type, they will visit each food equally. You then set your expected frequencies such that your null hypothesis is H0 : p1 = p2 = p3. You then expose 120 roaches to your choice experiment and count the number of cockroaches that chose each food. A table of these observations would look like this:

food apple pear orange total
observed 39 25 56 120
expected 40 40 40 120

Are the observed frequencies significantly different than you expected frequencies? We can use R to test that!

    

    food <- c("apple", "pear", "orange")
    observed <- c(39, 25, 56)
    expected <- c(40, 40, 40)

    chisq.test(observed, p = expected, rescale = TRUE)

    ## Chi-squared test for given probabilities

         ## data:  observed
         ## X-squared = 12.05, df = 2, p-value = 0.002418

    

The results are that with a χ2 value of 12.05 with 2 degrees of freedom, our P value is < 0.05. Therefore, we reject the null hypothesis that the foods are equally preferred.

Easy peasy!

You can also test any number of infinite null hypotheses. For example, are apples 50 % more likely to be chosen over the other two. Your null hypothesis would then be H0: p1 = 0.50, p2 = 0.25, p3 = 0.25.

Advanced analysis: t test

T tests are a bit more involved and are likely more appropriate for the data that you are collecting. Unlike a χ2 test, t tests use continuous data, not frequencies. A t test is appropriate for when you want to compare the averages of a particular measurement between two populations. Going back to our cockroach example, imagine that you have two groups of cockroaches one given an apple the other group given a pear. You are interested in whether or not food type affects time to feeding. You place a single cockroach in front of a type of food and measure the amount of time (in seconds) it takes for the cockroach to approach the food. You do this for 20 roaches for each food type. The data would look something like this:

apple pear
27 27
9 26
53 1
35 21
50 55
11 53
14 55
33 4
29 12
46 6
60 17
26 18
45 15
2 31
19 42
36 31
54 1
54 8
31 10
52 55

For now we will assume that the variance is equal between the two food types. We can use a t test to compare between the amount of time to feed taken by cockroaches in apples and pears. Our null hypothesis would be that the mean of apples = mean of pears

    

    apple <- c(27, 9, 53, 35, 50, 11, 14, 33, 29, 46, 60, 26, 45, 2, 19, 36, 54, 54, 31, 52)

    pear <- c(27, 26, 1, 21, 55, 53, 55, 4, 12, 6, 17, 18, 15, 31, 42, 31, 1, 8, 10, 55)

    t.test(apple, pear, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)


   Two Sample t-test

    data:  apple and pear
    t = 1.7369, df = 38, p-value = 0.0905
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
    -1.638486 21.438486
    sample estimates:
    mean of x mean of y
         34.3      24.4


    

The R output may be a bit confusing but the important information can be found in the third line. A t statistic of -1.74 with degrees of freedom of 37.7 resulted in a P value of 0.09 which is > 0.05. We failed to reject the null hypothesis that the observed differences in feeding times are due to random chance.

There is infinitely more information out there on R and many, many more statistical tests that can be run based on your data and experimental design. I hope that this post has provided a good start to using R for your statistical analysis needs.

Posted on Nov 1
Written by Matthew J. Lundquist

Terms of Use     Privacy Policy