Inspect data
Learning Objectives
1. Glimpse the structure of the dataframe
2. Summarize the structure of a dataframe
Overview
Once you have loaded your data set into R, you should inspect it to get a feel for the data. We are typically interested in the following:
- The dimensions of the data set; how many rows (cases) and how many columns does the data frame have.
- The types of variables we have; are they integer, character, logical, factor (categorical) etc.
- The number of missing, or NA, values in the dataframe.
- A quick look at some summary statistics.
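As a rough illustration of these four checks, here is a minimal base R sketch. The data frame df and all of its values are made up for this example (they are not part of gapminder); the functions used are plain base R rather than the tidyverse tools discussed next.

```r
# Toy data frame with hypothetical values, standing in for a real data set
df <- data.frame(
  city  = factor(c("London", "Paris", "Rome", "Oslo")),
  year  = c(2001L, 2002L, 2003L, 2004L),
  temp  = c(11.5, NA, 15.2, 6.1),
  rainy = c(TRUE, FALSE, NA, NA)
)

dims      <- dim(df)             # number of rows, then number of columns
col_types <- sapply(df, class)   # class of each variable
n_missing <- colSums(is.na(df))  # count of NA values per column

print(dims)
print(col_types)
print(n_missing)
print(summary(df))               # quick summary statistics per column
```

Each of these calls answers only one question at a time, which is part of the appeal of the single-call summaries covered below.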
First, if we want to look at the dataset in a spreadsheet-style data viewer, we can invoke View(gapminder) (View with a capital V).
While this is handy for browsing, it does not readily tell us what kind of variables we have, whether there are any missing values, etc.
There are two functions that we will talk about, dplyr::glimpse() and skimr::skim().
dplyr::glimpse()
glimpse() is like a transposed version of print(): it first gives us the dimensions (rows and columns), and then lists each of the dataframe’s columns (or variables) together with its type (fct, int, dbl) and its first few values. Let us look at the outcome of
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
We see that we have 1704 rows, or cases. We also have 6 columns, or variables and right underneath we see each column individually:
- country is a categorical, or factor, variable of type <fct>, and the first few cases are all Afghanistan, just because it is the first one alphabetically
- continent is also a factor variable, and the values of this categorical variable are "Asia", "Europe", etc.
- year is an integer variable of type <int>. This is the year for which we have data for each country, between 1952 and 2007 in 5-year intervals.
- lifeExp is a double precision, or real number, of type <dbl> that refers to life expectancy
- pop is an integer variable of type <int> that refers to the population
- gdpPercap is a double precision, or real number, of type <dbl> that refers to GDP per capita
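To see how the <fct>, <int> and <dbl> abbreviations relate to R's own type names, the following sketch rebuilds the first three rows of three gapminder columns by hand (values copied from the glimpse() output above) and inspects them with base R. The object name toy is an assumption made for this illustration.

```r
# First three rows of country, year and lifeExp, as shown by glimpse() above
toy <- data.frame(
  country = factor(c("Afghanistan", "Afghanistan", "Afghanistan")),
  year    = c(1952L, 1957L, 1962L),
  lifeExp = c(28.801, 30.332, 31.997)
)

col_class <- sapply(toy, class)   # "factor", "integer", "numeric"
col_type  <- sapply(toy, typeof)  # "integer", "integer", "double"

print(col_class)
print(col_type)

# str() is the base R function closest in spirit to glimpse()
str(toy)
```

Note that a factor is stored internally as integer codes, which is why typeof() reports "integer" for country, and that <dbl> corresponds to the storage type "double", which class() reports as "numeric".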
skimr::skim()
While glimpse() allows us to look at the contents of the dataframe, skimr::skim() is more useful and I always use it in my workflow.
skimr::skim(gapminder)
| Name | gapminder |
| Number of rows | 1704 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| country | 0 | 1 | FALSE | 142 | Afg: 12, Alb: 12, Alg: 12, Ang: 12 |
| continent | 0 | 1 | FALSE | 5 | Afr: 624, Asi: 396, Eur: 360, Ame: 300 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 1979.50 | 17.27 | 1952.00 | 1965.75 | 1979.50 | 1993.25 | 2007.0 | ▇▅▅▅▇ |
| lifeExp | 0 | 1 | 59.47 | 12.92 | 23.60 | 48.20 | 60.71 | 70.85 | 82.6 | ▁▆▇▇▇ |
| pop | 0 | 1 | 29601212.32 | 106157896.74 | 60011.00 | 2793664.00 | 7023595.50 | 19585221.75 | 1318683096.0 | ▇▁▁▁▁ |
| gdpPercap | 0 | 1 | 7215.33 | 9857.45 | 241.17 | 1202.06 | 3531.85 | 9325.46 | 113523.1 | ▇▁▁▁▁ |
skimr::skim() gives us a data summary with the dimensions (rows and columns) of the dataframe, and the type of columns; in this case, we have 2 factor and 4 numeric columns (variables).
For all variable types, it gives us the number of missing values (n_missing) and the complete_rate; in the gapminder data, there are no missing values, so n_missing = 0 and complete_rate = 1.
- For factor variables, skim() provides information on
  - whether it is an ordered factor; if false, the default ordering is alphabetical, otherwise one has to explicitly specify the order of the categories.
  - the n_unique, or number of distinct values; in gapminder we have data on 142 distinct countries and 5 continents
  - the top_counts, which shows the most frequent values and how often each occurs; each country has 12 observations, but in continent Africa has 624 observations, Asia 396, etc.
- For numeric variables, skim() provides summary statistics: the mean, the standard deviation, and the 0th (min), 25th, 50th (median), 75th and 100th (max) percentiles. It also gives us a rough histogram to get an idea of the shape of the distribution (normal, skewed, uniform).
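To make the columns of the skim() output more concrete, the sketch below computes rough base R counterparts by hand on a small made-up data frame. The data frame df and its values are hypothetical, and this is not how skimr computes its summaries internally; it only illustrates what each column means.

```r
# Hypothetical data with one missing value in each column
df <- data.frame(
  group = factor(c("a", "b", "a", "c", "a", NA)),
  value = c(2.5, 4.0, NA, 1.0, 8.0, 3.5)
)

# n_missing and complete_rate: NA count and share of non-missing values
n_missing     <- colSums(is.na(df))    # 1 for each column
complete_rate <- colMeans(!is.na(df))  # 5/6 for each column

# n_unique and top_counts for the factor column (NA is excluded)
n_unique   <- nlevels(droplevels(df$group))             # 3 distinct levels
top_counts <- sort(table(df$group), decreasing = TRUE)  # "a" comes first with 3

# p0, p25, p50, p75, p100 for the numeric column
value_quantiles <- quantile(df$value,
                            probs = c(0, 0.25, 0.5, 0.75, 1),
                            na.rm = TRUE)               # 1, 2.5, 3.5, 4, 8

print(n_missing)
print(complete_rate)
print(top_counts)
print(value_quantiles)
```

A complete_rate below 1 is simply 1 minus the share of NA values, so n_missing and complete_rate always carry the same information in two forms.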
On your own
The following dataframe has data on London’s cycle hire scheme, Santander Cycles. Besides the number of bikes rented out, the dataframe also contains weather information.
bikes <- read_csv(here("data", "londonBikes.csv"))
skimr::skim(bikes)
| Name | bikes |
| Number of rows | 3439 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 4 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| date | 0 | 1 | 8 | 8 | 0 | 3439 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| rain | 851 | 0.75 | 0.62 | TRU: 1595, FAL: 993 |
| fog | 851 | 0.75 | 0.07 | FAL: 2403, TRU: 185 |
| thunderstorm | 851 | 0.75 | 0.03 | FAL: 2512, TRU: 76 |
| snow | 851 | 0.75 | 0.02 | FAL: 2533, TRU: 55 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bikes_hired | 0 | 1.00 | 26158.95 | 9135.13 | 3531.0 | 19626.00 | 26022.0 | 32759.0 | 73094.0 | ▃▇▅▁▁ |
| season | 0 | 1.00 | 2.46 | 1.12 | 1.0 | 1.00 | 2.0 | 3.0 | 4.0 | ▇▇▁▇▇ |
| max_temp | 1877 | 0.45 | 16.48 | 6.19 | -1.2 | 11.93 | 16.7 | 20.9 | 36.7 | ▁▆▇▃▁ |
| min_temp | 1929 | 0.44 | 7.62 | 5.14 | -8.2 | 3.90 | 7.9 | 11.8 | 20.0 | ▁▅▇▇▂ |
| avg_temp | 27 | 0.99 | 11.70 | 5.41 | -4.1 | 7.60 | 11.6 | 15.9 | 28.6 | ▁▆▇▅▁ |
| avg_humidity | 745 | 0.78 | 74.91 | 10.84 | 37.0 | 67.00 | 76.0 | 83.0 | 100.0 | ▁▂▆▇▂ |
| avg_pressure | 773 | 0.78 | 1015.10 | 10.24 | 979.0 | 1009.00 | 1016.0 | 1022.0 | 1044.0 | ▁▂▇▆▁ |
| avg_windspeed | 745 | 0.78 | 14.01 | 6.10 | 3.0 | 10.00 | 13.0 | 18.0 | 47.0 | ▇▇▂▁▁ |
| rainfall_mm | 51 | 0.99 | 1.67 | 3.68 | 0.0 | 0.00 | 0.0 | 1.5 | 48.0 | ▇▁▁▁▁ |
A couple of graded learnr interactive exercises?
- What kind of variable is date? What kind of variable is season?
- How often does it rain in London?
- What is the average annual temperature (in degrees C)?
- What is the maximum rainfall?