Inspect data

Learning Objectives
1. Glimpse the structure of the dataframe
2. Summarize the structure of a dataframe

Overview

Once you have loaded your data set into R, you have to inspect and get a feel for the data. We are typically interested in the following:

  1. The dimensions of the data set: how many rows (cases) and how many columns (variables) the data frame has.
  2. The types of variables we have: integer, character, logical, factor (categorical), etc.
  3. The number of missing, or NA, values in the data frame.
  4. A quick look at some summary statistics.
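
Each of the four checks above can also be done with base R functions. The sketch below assumes the gapminder package is installed and uses its gapminder data frame; the functions chosen are one common way to run these checks, not the only one.

```r
library(gapminder)  # assumed to be installed; provides the gapminder data frame

# 1. Dimensions: number of rows (cases) and columns (variables)
dim(gapminder)

# 2. Type (class) of each variable
sapply(gapminder, class)

# 3. Number of missing (NA) values in each column
colSums(is.na(gapminder))

# 4. Quick summary statistics for every column
summary(gapminder)
```

These calls answer one question at a time; the functions discussed next bundle most of this information into a single output.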

First, if we wanted to look at the dataset in a spreadsheet-style data viewer, we can just invoke View(gapminder) (View with a capital V).

While this is nice, it is not very useful, as we cannot dig deeper and see what kind of variables we have, whether there are any missing values, etc.

There are two functions that we will talk about, dplyr::glimpse() and skimr::skim().

dplyr::glimpse()

glimpse() is like a transposed version of print(): columns run down the page and the data runs across. It first gives the dimensions (rows and columns), and then lists each column (or variable) with its type (fct, int, dbl) and its first few values. Let us look at the output of

glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...

We see that we have 1704 rows, or cases. We also have 6 columns, or variables, and right underneath we see each column individually:

  • country is a categorical, or factor, variable of type <fct>. The first few cases are all Afghanistan, simply because it is the first country alphabetically.
  • continent is also a factor variable. The first values shown are all "Asia", because the first rows all belong to Afghanistan; other values include "Europe", etc.
  • year is an integer variable of type <int>. This is the year for which we have data for each country, between 1952 and 2007 in 5-year intervals.
  • lifeExp is a double precision, or real number, of type <dbl> that refers to life expectancy.
  • pop is an integer variable of type <int> that refers to the population.
  • gdpPercap is a double precision, or real number, of type <dbl> that refers to GDP per capita.
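
Because glimpse() shows only the first few values of each column, it can help to look at the distinct values directly. The following is a minimal sketch using base R, again assuming the gapminder package is installed.

```r
library(gapminder)  # assumed to be installed

# All levels of the continent factor, not just the first few values
levels(gapminder$continent)

# The distinct years in the data (1952 to 2007, in 5-year steps)
unique(gapminder$year)
```

This is a quick way to confirm what the truncated glimpse() output only hints at.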

skimr::skim()

While glimpse() allows us to look at the contents of the dataframe, skimr::skim() goes further: it also reports missing values and summary statistics for every variable, which is why I always use it in my workflow.

skimr::skim(gapminder)
Table 1: Data summary
Name gapminder
Number of rows 1704
Number of columns 6
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
country                0              1  FALSE         142  Afg: 12, Alb: 12, Alg: 12, Ang: 12
continent              0              1  FALSE           5  Afr: 624, Asi: 396, Eur: 360, Ame: 300

Variable type: numeric

skim_variable  n_missing  complete_rate         mean            sd        p0         p25         p50          p75          p100  hist
year                   0              1      1979.50         17.27   1952.00     1965.75     1979.50      1993.25        2007.0  ▇▅▅▅▇
lifeExp                0              1        59.47         12.92     23.60       48.20       60.71        70.85          82.6  ▁▆▇▇▇
pop                    0              1  29601212.32  106157896.74  60011.00  2793664.00  7023595.50  19585221.75  1318683096.0  ▇▁▁▁▁
gdpPercap              0              1      7215.33       9857.45    241.17     1202.06     3531.85      9325.46      113523.1  ▇▁▁▁▁

skimr::skim() gives us a data summary with the dimensions (rows and columns) of the dataframe and the number of columns of each type; in this case, we have 2 factor and 4 numeric columns (variables).

For all variable types, it gives us the number of missing values (n_missing) and the share of non-missing values (complete_rate); in the gapminder data, there are no missing values, so n_missing = 0 and complete_rate = 1.
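
Since gapminder has no missing values, a small made-up example shows more clearly what these two columns measure. The sketch below uses a hypothetical toy data frame (toy, temp and rain are illustrative names, not part of any dataset used here) and computes both quantities by hand with base R.

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  temp = c(12.1, NA, 15.3, NA),
  rain = c(TRUE, FALSE, NA, TRUE)
)

# n_missing: number of NA values in each column
n_missing <- colSums(is.na(toy))

# complete_rate: share of non-missing values in each column
complete_rate <- colMeans(!is.na(toy))

n_missing       # temp: 2, rain: 1
complete_rate   # temp: 0.50, rain: 0.75
```

A complete_rate of 0.75 therefore means that one value in four is missing for that variable.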

  • For factor variables, skim() provides information on
    • ordered, whether it is an ordered factor; if FALSE, the default ordering of the categories is alphabetical, otherwise one has to explicitly specify the order of the categories.
    • n_unique, the number of distinct values of the variable; in gapminder we have data on 142 distinct countries and 5 continents.
    • top_counts, the most frequent values and their counts; each country has 12 observations, while for continent, Africa has 624 observations, Asia 396, etc.
  • For numeric variables, skim() provides summary statistics: mean, standard deviation, and the 0th (min), 25th, 50th (median), 75th and 100th (max) percentiles. It also gives us a rough histogram to get an idea of the shape of the distribution (normal, skewed, uniform).
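
To see where these numbers come from, the same quantities can be computed one at a time with base R. This is only a sketch of equivalent calculations, assuming the gapminder package is installed; it is not a description of how skim() is implemented internally.

```r
library(gapminder)  # assumed to be installed

# Factor summaries, shown here for continent
is.ordered(gapminder$continent)                      # ordered
length(unique(gapminder$continent))                  # n_unique
sort(table(gapminder$continent), decreasing = TRUE)  # counts, most frequent first

# Numeric summaries, shown here for lifeExp
mean(gapminder$lifeExp)
sd(gapminder$lifeExp)
quantile(gapminder$lifeExp, probs = c(0, 0.25, 0.5, 0.75, 1))  # p0, p25, p50, p75, p100
```

The results should match the corresponding row of the skim() output, up to rounding.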

On your own

The following dataframe has data on London’s cycle hire scheme, Santander Cycles. Besides the number of bikes rented out, the dataframe also contains weather information.

bikes <- read_csv(here("data", "londonBikes.csv"))
skimr::skim(bikes)
Table 2: Data summary
Name bikes
Number of rows 3439
Number of columns 14
_______________________
Column type frequency:
character 1
logical 4
numeric 9
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
date                   0              1    8    8      0      3439           0

Variable type: logical

skim_variable  n_missing  complete_rate  mean  count
rain                 851           0.75  0.62  TRU: 1595, FAL: 993
fog                  851           0.75  0.07  FAL: 2403, TRU: 185
thunderstorm         851           0.75  0.03  FAL: 2512, TRU: 76
snow                 851           0.75  0.02  FAL: 2533, TRU: 55

Variable type: numeric

skim_variable  n_missing  complete_rate      mean       sd      p0       p25      p50      p75     p100  hist
bikes_hired            0           1.00  26158.95  9135.13  3531.0  19626.00  26022.0  32759.0  73094.0  ▃▇▅▁▁
season                 0           1.00      2.46     1.12     1.0      1.00      2.0      3.0      4.0  ▇▇▁▇▇
max_temp            1877           0.45     16.48     6.19    -1.2     11.93     16.7     20.9     36.7  ▁▆▇▃▁
min_temp            1929           0.44      7.62     5.14    -8.2      3.90      7.9     11.8     20.0  ▁▅▇▇▂
avg_temp              27           0.99     11.70     5.41    -4.1      7.60     11.6     15.9     28.6  ▁▆▇▅▁
avg_humidity         745           0.78     74.91    10.84    37.0     67.00     76.0     83.0    100.0  ▁▂▆▇▂
avg_pressure         773           0.78   1015.10    10.24   979.0   1009.00   1016.0   1022.0   1044.0  ▁▂▇▆▁
avg_windspeed        745           0.78     14.01     6.10     3.0     10.00     13.0     18.0     47.0  ▇▇▂▁▁
rainfall_mm           51           0.99      1.67     3.68     0.0      0.00      0.0      1.5     48.0  ▇▁▁▁▁

A couple of graded learnr interactive exercises?

  1. What kind of variable is date? What kind of variable is season?
  2. How often does it rain in London?
  3. What is the average annual temperature (in degrees C)?
  4. What is the maximum rainfall?