Inspect data

Learning Objectives
1. Glimpse the structure of the dataframe
2. Summarize the structure of a dataframe

Overview

Once you have loaded your data set into R, you have to inspect it and get a feel for the data. We are typically interested in the following:

- The dimensions of the data set: how many rows (cases) and how many columns (variables) the data frame has.
- The types of variables we have: are they integer, character, logical, factor (categorical), etc.
- The number of missing, or NA, values in the dataframe.
- A quick look at some summary statistics.
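
Before turning to the two functions below, here is a minimal base R sketch of these four checks. The `toy` data frame and its column names are invented for illustration and are not part of the gapminder data.

```r
# Hypothetical toy data frame, invented for illustration
toy <- data.frame(
  city   = factor(c("Lima", "Oslo", "Pune", "Kiev")),
  visits = c(12L, NA, 7L, 3L),
  rating = c(4.5, 3.8, NA, 4.1)
)

# 1. Dimensions: number of rows, then number of columns
dims <- dim(toy)

# 2. Variable types: the class of each column
col_types <- sapply(toy, class)

# 3. Missing values: count of NA values per column
na_counts <- colSums(is.na(toy))

# 4. Quick summary statistics for every column
toy_summary <- summary(toy)

print(dims)
print(col_types)
print(na_counts)
print(toy_summary)
```

Each check needs a separate call here, which is part of why the single-call summaries from glimpse() and skim() are convenient.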

First, if we wanted to look at the dataset in a spreadsheet-style data viewer, we can just invoke View(gapminder) (View with a capital V).

While the viewer is convenient for browsing, it is of limited use for inspection: it does not tell us what kind of variables we have, whether there are any missing values, etc.

There are two functions that we will talk about, dplyr::glimpse() and skimr::skim().

`dplyr::glimpse()`

glimpse() is like a transposed version of print(): it first gives us the dimensions (rows and columns) and then, for each of the dataframe’s columns (or variables), its name, its type (fct, int, dbl), and its first few values. Let us look at the output of

glimpse(gapminder)

## Rows: 1,704
## Columns: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...

We see that we have 1,704 rows, or cases, and 6 columns, or variables. Right underneath, we see each column individually:

- country is a categorical, or factor, variable of type <fct>. The first few cases are all Afghanistan, just because it is the first one alphabetically.
- continent is also a factor variable. The first values shown are all Asia, because the first rows all belong to Afghanistan; further down, the variable takes other values such as Europe.
- year is an integer variable of type <int>. This is the year for which we have data for each country, between 1952 and 2007 in 5-year intervals.
- lifeExp is a double precision, or real number, variable of type <dbl> that refers to life expectancy.
- pop is an integer variable of type <int> that refers to the population.
- gdpPercap is a double precision, or real number, variable of type <dbl> that refers to GDP per capita.
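
To make the three type abbreviations concrete, the sketch below builds a small data frame with one column of each kind and checks its types with base R. The `toy` data frame and its values are hypothetical, chosen only for illustration.

```r
# Hypothetical toy data frame with one column of each type seen above
toy <- data.frame(
  region = factor(c("north", "south", "north")),  # a factor, shown as <fct>
  count  = c(10L, 20L, 30L),                      # an integer, shown as <int>
  share  = c(0.25, 0.5, 0.25)                     # a double, shown as <dbl>
)

class(toy$region)   # "factor"
typeof(toy$count)   # "integer"
typeof(toy$share)   # "double"

# Factor levels are sorted alphabetically unless we specify otherwise
levels(toy$region)

# str() is a base R function that gives a similar compact overview
str(toy)
```

Note the `L` suffix: in R, `10L` is stored as an integer, whereas a plain `10` is stored as a double.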

`skimr::skim()`

While glimpse() allows us to look at the contents of the dataframe, skimr::skim() goes further and summarises every variable, including its missing values and summary statistics. This is why I always use it in my workflow.

skimr::skim(gapminder)

Table 1: Data summary
Name	gapminder
Number of rows	1704
Number of columns	6
_______________________
Column type frequency:
factor	2
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
country	0	1	FALSE	142	Afg: 12, Alb: 12, Alg: 12, Ang: 12
continent	0	1	FALSE	5	Afr: 624, Asi: 396, Eur: 360, Ame: 300

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1	1979.50	17.27	1952.00	1965.75	1979.50	1993.25	2007.0	▇▅▅▅▇
lifeExp	0	1	59.47	12.92	23.60	48.20	60.71	70.85	82.6	▁▆▇▇▇
pop	0	1	29601212.32	106157896.74	60011.00	2793664.00	7023595.50	19585221.75	1318683096.0	▇▁▁▁▁
gdpPercap	0	1	7215.33	9857.45	241.17	1202.06	3531.85	9325.46	113523.1	▇▁▁▁▁

skimr::skim() gives us a data summary with the dimensions (rows and columns) of the dataframe and the types of its columns; in this case, we have 2 factor and 4 numeric columns (variables).

For all variable types, it gives us the number of missing values (n_missing) and the complete_rate, the share of values that are not missing; in the gapminder data, there are no missing values, so n_missing = 0 and complete_rate = 1.
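
As a rough illustration of how these two columns relate, the sketch below computes both by hand in base R. The vector `x` is hypothetical and has nothing to do with gapminder.

```r
# Hypothetical vector with two missing values out of eight
x <- c(3.1, NA, 4.7, 5.0, NA, 2.2, 6.3, 4.4)

n_missing     <- sum(is.na(x))    # count of NA values
complete_rate <- mean(!is.na(x))  # share of values that are not missing

n_missing      # 2
complete_rate  # 0.75
```

The two numbers always move together: complete_rate equals 1 minus n_missing divided by the number of rows.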

For factor variables, skim() provides information on
- ordered: whether it is an ordered factor; if FALSE, the categories have no inherent ranking and are listed alphabetically by default, whereas an ordered factor requires us to explicitly specify the order of the categories.
- n_unique: the number of distinct values; in gapminder we have data on 142 distinct countries and 5 continents.
- top_counts: the most frequent values and how many times each occurs; each country has 12 observations, whereas for continent, Africa has 624 observations, Asia 396, etc.

For numeric variables, skim() provides summary statistics: the mean, the standard deviation, and the 0th (min), 25th, 50th (median), 75th and 100th (max) percentiles. It also gives us a rough histogram to get an idea of the shape of the distribution (normal, skewed, uniform).
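
The statistics above can be approximated one at a time with base R, as in the sketch below. This is not how skim() is implemented, only a set of base R equivalents; the `values` and `groups` objects are hypothetical toy data.

```r
# Hypothetical toy data, invented for illustration
values <- c(2, 4, 4, 5, 7, 9, 10, 12, 15)
groups <- factor(c("a", "b", "a", "c", "a", "b", "a", "c", "a"))

# Numeric summary: mean, standard deviation, and the five percentiles
avg  <- mean(values)
sdev <- sd(values)
pcts <- quantile(values, probs = c(0, 0.25, 0.5, 0.75, 1))

# Factor summary: number of distinct values and the most frequent ones
n_unique   <- length(unique(groups))
top_counts <- sort(table(groups), decreasing = TRUE)

print(avg)
print(sdev)
print(pcts)
print(n_unique)
print(top_counts)
```

Here the 0th and 100th percentiles coincide with min(values) and max(values), and the 50th percentile is the median.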

On your own

The following dataframe has data on London’s cycle hire scheme, Santander Cycles. Besides the number of bikes rented out, the dataframe also contains weather information.

bikes <- readr::read_csv(here::here("data", "londonBikes.csv"))

skimr::skim(bikes)

Table 2: Data summary
Name	bikes
Number of rows	3439
Number of columns	14
_______________________
Column type frequency:
character	1
logical	4
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
date	0	1	8	8	0	3439	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
rain	851	0.75	0.62	TRU: 1595, FAL: 993
fog	851	0.75	0.07	FAL: 2403, TRU: 185
thunderstorm	851	0.75	0.03	FAL: 2512, TRU: 76
snow	851	0.75	0.02	FAL: 2533, TRU: 55

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bikes_hired	0	1.00	26158.95	9135.13	3531.0	19626.00	26022.0	32759.0	73094.0	▃▇▅▁▁
season	0	1.00	2.46	1.12	1.0	1.00	2.0	3.0	4.0	▇▇▁▇▇
max_temp	1877	0.45	16.48	6.19	-1.2	11.93	16.7	20.9	36.7	▁▆▇▃▁
min_temp	1929	0.44	7.62	5.14	-8.2	3.90	7.9	11.8	20.0	▁▅▇▇▂
avg_temp	27	0.99	11.70	5.41	-4.1	7.60	11.6	15.9	28.6	▁▆▇▅▁
avg_humidity	745	0.78	74.91	10.84	37.0	67.00	76.0	83.0	100.0	▁▂▆▇▂
avg_pressure	773	0.78	1015.10	10.24	979.0	1009.00	1016.0	1022.0	1044.0	▁▂▇▆▁
avg_windspeed	745	0.78	14.01	6.10	3.0	10.00	13.0	18.0	47.0	▇▇▂▁▁
rainfall_mm	51	0.99	1.67	3.68	0.0	0.00	0.0	1.5	48.0	▇▁▁▁▁

A couple of graded learnr interactive exercises?

What kind of variable is date? What kind of variable is season?
How often does it rain in London?
What is the average annual temperature (in degrees C)?
What is the maximum rainfall?

Last updated on August 25, 2020