Using R for exploratory data analysis (summarising data)

Hi everyone, welcome to the first blog post of my data science journey. In this post I am going to show you how to quickly and easily run a variety of common exploratory data analyses using the R statistical software. Such analyses are commonly used for descriptive studies.

For this blog post, I assume that you have some basic understanding of the R programming language. If not, no worries: there is a plethora of resources on R programming on the internet. One of the brilliant resources to get you up and running with R and RStudio is the R Ladies Sydney webpage.

In this post I will be showing you how to obtain basic frequency data, mean, median, mode, range, interquartile range, variance and a test for normality.

There are many ways to obtain such information. You do not need to install any packages to perform many statistical analyses, because base R has built-in commands for most of the common statistics. For convenience and for various other uses, however, it is worth installing the tidyverse package.

## # A tibble: 6 x 14
##   ID      Age Childnumber childid Gender AutoreractorSE `Myopia Group` SE_P_AVE
##   <chr> <dbl>       <dbl>   <dbl>  <dbl>          <dbl>          <dbl>    <dbl>
## 1 A000…    10           3       2      1           1.12              0  -0.688 
## 2 A000…     7           3       3      1           1.75              0  -0.688 
## 3 A000…    11           2       1      1           1.68              0   0.125 
## 4 A000…     7           2       2      1           3.30              0   0.125 
## 5 A000…    12           3       1      2           1.18              0  -0.0625
## 6 A000…    10           3       2      2           0.75              0  -0.0625
## # … with 6 more variables: NEARWORKtime <dbl>, OUTDOORtime <dbl>,
## #   GROUP_NEAR <dbl>, GROUP_OUT <dbl>, EDUM_P_NEW <dbl>, EDUF_P_NEW <dbl>

Here, I have loaded a dataset that I obtained from this paper. The paper examines the relationship between outdoor activities, nearwork and myopia. I have named the dataset d1, and its first six rows are shown above. The easiest way to get the descriptive statistics of a dataset is to call the function summary(d1).

Let’s see what happens when I type summary(d1):

summary(d1)
##       ID                 Age         Childnumber       childid     
##  Length:574         Min.   : 6.00   Min.   :1.000   Min.   :1.000  
##  Class :character   1st Qu.: 9.00   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Median :11.00   Median :2.000   Median :1.000  
##                     Mean   :10.63   Mean   :1.962   Mean   :1.481  
##                     3rd Qu.:12.00   3rd Qu.:2.000   3rd Qu.:2.000  
##                     Max.   :18.00   Max.   :4.000   Max.   :4.000  
##      Gender      AutoreractorSE       Myopia Group       SE_P_AVE      
##  Min.   :1.000   Min.   :-6.055000   Min.   :0.0000   Min.   :-5.5312  
##  1st Qu.:1.000   1st Qu.:-0.433750   1st Qu.:0.0000   1st Qu.:-0.7812  
##  Median :1.000   Median : 0.305000   Median :0.0000   Median :-0.3750  
##  Mean   :1.449   Mean   : 0.001577   Mean   :0.2422   Mean   :-0.5259  
##  3rd Qu.:2.000   3rd Qu.: 0.750000   3rd Qu.:0.0000   3rd Qu.:-0.0625  
##  Max.   :2.000   Max.   : 3.305000   Max.   :1.0000   Max.   : 1.0312  
##   NEARWORKtime     OUTDOORtime      GROUP_NEAR      GROUP_OUT    
##  Min.   : 2.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 3.429   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 4.429   Median :2.429   Median :2.000   Median :2.000  
##  Mean   : 4.751   Mean   :2.936   Mean   :2.012   Mean   :2.002  
##  3rd Qu.: 5.500   3rd Qu.:3.714   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :14.429   Max.   :9.643   Max.   :3.000   Max.   :3.000  
##    EDUM_P_NEW      EDUF_P_NEW   
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :3.000  
##  Mean   :2.237   Mean   :2.709  
##  3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :4.000

Now you can see the mean, median, range and quartiles of the variables. Remember that the summary() command reports these values for every variable stored as a number, including categorical variables that are coded numerically, so it is your job to identify which results are meaningful. For instance, the mean of gender doesn’t make sense, does it?
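One way to stop summary() from averaging a coded categorical variable is to convert it to a factor first. The sketch below uses a small made-up data frame (toy and its values are hypothetical, not the study data, and I am not assuming which gender the codes 1 and 2 stand for). It also shows table(), which is a simple way to get frequency counts.

```r
# Hypothetical toy data, NOT the study dataset
toy <- data.frame(
  Age    = c(10, 7, 11, 7, 12, 10),
  Gender = c(1, 1, 1, 1, 2, 2)
)

# Convert the numeric codes to a factor so R treats them as categories
toy$Gender <- factor(toy$Gender)

# summary() now reports counts per level for Gender instead of a mean
summary(toy)

# table() gives the same frequencies; prop.table() turns them into proportions
gender_counts <- table(toy$Gender)
gender_counts
prop.table(gender_counts)
```

With the real data the same idea would be applied to d1$Gender, assuming the column really is a coded category.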

A more detailed descriptive analysis of a dataset can be obtained with the psych package. I install this package by typing the command install.packages("psych") and then load it with library(psych).

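To get the descriptive statistics, I type the command describe(d1); remember that d1 is my dataset containing variables such as age and gender. describe() works on numeric columns, so a character column such as ID produces NA or NaN values, as you can see in the first row below (the asterisk after ID marks it as non-numeric). Missing values in numeric columns are dropped by default (na.rm = TRUE), so check how many values are missing before interpreting the results.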

library(psych)
describe(d1)
##                vars   n  mean   sd median trimmed  mad   min   max range  skew
## ID*               1 574   NaN   NA     NA     NaN   NA   Inf  -Inf  -Inf    NA
## Age               2 574 10.63 2.47  11.00   10.55 2.97  6.00 18.00 12.00  0.33
## Childnumber       3 574  1.96 0.78   2.00    1.93 1.48  1.00  4.00  3.00  0.33
## childid           4 574  1.48 0.66   1.00    1.37 0.00  1.00  4.00  3.00  1.15
## Gender            5 574  1.45 0.50   1.00    1.44 0.00  1.00  2.00  1.00  0.20
## AutoreractorSE    6 574  0.00 1.24   0.30    0.14 0.82 -6.05  3.31  9.36 -1.33
## Myopia Group      7 574  0.24 0.43   0.00    0.18 0.00  0.00  1.00  1.00  1.20
## SE_P_AVE          8 574 -0.53 0.76  -0.38   -0.43 0.51 -5.53  1.03  6.56 -1.79
## NEARWORKtime      9 574  4.75 1.62   4.43    4.54 1.48  2.00 14.43 12.43  1.49
## OUTDOORtime      10 574  2.94 1.40   2.43    2.81 1.06  1.00  9.64  8.64  1.06
## GROUP_NEAR       11 574  2.01 0.81   2.00    2.02 1.48  1.00  3.00  2.00 -0.02
## GROUP_OUT        12 574  2.00 0.82   2.00    2.00 1.48  1.00  3.00  2.00  0.00
## EDUM_P_NEW       13 574  2.24 0.74   2.00    2.26 1.48  1.00  4.00  3.00  0.04
## EDUF_P_NEW       14 574  2.71 0.65   3.00    2.72 0.00  1.00  4.00  3.00 -0.41
##                kurtosis   se
## ID*                  NA   NA
## Age               -0.20 0.10
## Childnumber       -0.60 0.03
## childid            0.55 0.03
## Gender            -1.96 0.02
## AutoreractorSE     2.83 0.05
## Myopia Group      -0.56 0.02
## SE_P_AVE           5.90 0.03
## NEARWORKtime       3.70 0.07
## OUTDOORtime        1.31 0.06
## GROUP_NEAR        -1.48 0.03
## GROUP_OUT         -1.52 0.03
## EDUM_P_NEW        -0.45 0.03
## EDUF_P_NEW         0.26 0.03
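The NaN row for ID can be avoided by passing only the numeric columns to describe(). Here is a minimal sketch of that idea on a made-up data frame (toy, its column names and its values are hypothetical, not the study data). The describe() call is guarded so that the snippet still runs if psych is not installed.

```r
# Hypothetical toy data, NOT the study dataset
toy <- data.frame(
  ID  = c("A1", "A2", "A3", "A4"),
  Age = c(10, 7, 11, 12),
  SE  = c(1.12, 1.75, -0.50, 0.75),
  stringsAsFactors = FALSE
)

# Find the columns that hold numbers and keep only those
is_num <- vapply(toy, is.numeric, logical(1))
toy_numeric <- toy[, is_num, drop = FALSE]
names(toy_numeric)

# Run describe() only if the psych package is available
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(toy_numeric))
}
```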

So the additional results that we obtain from the describe() command, compared with the summary() command, are sd (standard deviation), mad (median absolute deviation), kurtosis, skew (skewness), se (standard error of the mean) and trimmed (trimmed mean).
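To make these columns less of a black box, here is a rough sketch of how several of them can be reproduced for a single vector in base R. The vector x is a made-up example, not taken from the study data. I am assuming the describe() defaults of a 10% trim and the scaled median absolute deviation from stats::mad(), which is consistent with the mad values in the output above.

```r
# Hypothetical example vector, NOT taken from the study dataset
x <- c(6, 7, 9, 10, 10, 11, 12, 12, 14, 18)

n       <- length(x)
x_mean  <- mean(x)
x_sd    <- sd(x)                # standard deviation
x_se    <- x_sd / sqrt(n)       # standard error of the mean
x_trim  <- mean(x, trim = 0.1)  # mean after dropping 10% of values from each tail
x_mad   <- mad(x)               # median absolute deviation, scaled by 1.4826
x_range <- max(x) - min(x)      # range as a single number

# Collect the results in one named vector, rounded like describe()
round(c(mean = x_mean, sd = x_sd, se = x_se,
        trimmed = x_trim, mad = x_mad, range = x_range), 2)
```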

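There are some base R commands such as mean(), median(), sd(), var(), range() and IQR(), but each works on one vector at a time rather than on the whole dataset at once. Note that the base R function mode() returns the storage type of an object, not the most frequent value, so the statistical mode has to be computed another way.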
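These vector functions can still be run over several columns in one go with sapply(). The sketch below uses a made-up data frame (toy and its values are hypothetical, not the study data) and a small helper of my own, stat_mode(), since base R's mode() reports the storage type rather than the most frequent value. It ends with shapiro.test(), one common base R test for normality.

```r
# Hypothetical toy data, NOT the study dataset
toy <- data.frame(
  Age         = c(10, 7, 11, 7, 12, 10, 9, 7),
  OUTDOORtime = c(2.0, 3.5, 1.0, 2.4, 4.1, 2.9, 3.7, 1.6)
)

# Apply one function to every column at once
col_means   <- sapply(toy, mean)
col_medians <- sapply(toy, median)
col_sds     <- sapply(toy, sd)
col_vars    <- sapply(toy, var)
col_iqrs    <- sapply(toy, IQR)

# Helper (my own, not part of base R): the most frequent value(s) in a vector.
# Values come back as character strings because they are taken from table names.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}
age_mode <- stat_mode(toy$Age)

# Shapiro-Wilk test; a small p-value suggests a departure from normality
sw <- shapiro.test(toy$OUTDOORtime)
sw$p.value
```

If there is a tie, stat_mode() returns every tied value, which is a design choice rather than the only reasonable behaviour.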

OK, to wrap it up: in this tutorial we learned how to obtain the mean, median, standard deviation, standard error, quartiles (and hence the IQR), range, kurtosis and skewness of a dataset.

Take home messages are:

  1. Remember to run install.packages() just once per package, and library() in every new R session in which you need that package.
  2. summary(dataset) is a base R command, so you do not need to install anything to obtain basic descriptive statistics.
  3. If you want more detailed summary statistics, install the psych package and use the describe(dataset) command.

If you would like to learn more about different methods of summarising data, see the post "My favourite R package for: summarising data" by Adam Medcalf.

Good luck exploring your data. Feedback is welcome.

Nabin Paudel
Postdoctoral Fellow in Optometry and Vision Science

My research interests include Pediatric Visual Development, Amblyopia, Myopia, R programming, Visual Psychophysics, Ophthalmic Imaging, Teaching
