R Syntax, Vectors, missing data
R Syntax
We can type commands in the command prompt and use R as a simple calculator. For instance, try typing 5 + 20, and hitting enter. When you do this, you’ve entered a command, and R will execute that command. However, it’s more interesting when we can create objects or variables and work with these beasts!
Assignment Operator <-
R treats everything (single numbers, lists, vectors, datasets) as objects. To create an object, we use the assignment operator <-. For instance, if we had data on a student whose name is Alex, who is 28 years old and comes from Athens, we would create three objects, name, age, and city, and assign them the values Alex, 28, and Athens respectively by typing
name <- "Alex"
age <- 28
city <- "Athens"
The three objects have now been created; if we want to print their values, we can use the print() function or just type the names of the objects.
print(name); print(age); print(city)
## [1] "Alex"
## [1] 28
## [1] "Athens"
name
## [1] "Alex"
age
## [1] 28
city
## [1] "Athens"
You can mentally read the command age <- 28 as object age becomes equal to the value 28. In RStudio, the keyboard shortcut Alt + - inserts the assignment operator. We can do more interesting and useful things by creating variables and assigning values to them. For instance, if we have the relevant dimensions and want to calculate the area and volume of a room, we could do it as follows:
room_length <- 5.63
room_width <- 6.48
room_height <- 2.93
room_area <- room_length * room_width
room_volume <- room_length * room_width * room_height
room_area
## [1] 36.4824
room_volume
## [1] 106.8934
R is case sensitive
R is case sensitive and needs everything exactly as it was defined. age is different from AgE and Age. So if you type
age <- 28
AgE <- 34
Age <- 55
age; AgE; Age
## [1] 28
## [1] 34
## [1] 55
R will create three different objects.
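One way to check which spellings R actually knows about is the base function exists(), which reports whether an object with a given name has been defined. The following is a minimal sketch; the all-capitals spelling AGE is only an illustration of a name that was never created.

```r
# Define the object in lowercase only
age <- 28

# exists() takes the name as a string and returns TRUE or FALSE
exists("age")
## [1] TRUE

# AGE was never assigned, so R does not know it
exists("AGE")
## [1] FALSE
```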
Typos
R is a brilliant piece of software, but it cannot handle typos. Unlike Google’s search, “Did you mean…”, it takes it on faith that what you typed is exactly what you meant. For example, suppose that you forgot to hit the shift key when trying to type +, and as a result your command ended up being 5 = 20 rather than 5 + 20. Here’s what happens:
5 = 20
## Error in 5 = 20: invalid (do_set) left-hand side to assignment
R attempted to interpret 5 = 20 as a command, and spits out an error message because this makes no sense to it. Even more subtle is the fact that some typos won’t produce errors at all, because they happen to correspond to valid R commands. For instance, suppose that instead of 5 + 20, you mistakenly type 5 - 20. Clearly, R has no way of knowing that you meant to add 20 to 5, not subtract 20 from 5, so what happens this time is this:
5 - 20
## [1] -15
In this case, R produces the right answer, but to the wrong question.
R will always try to do exactly what you ask it to do. There is no autocorrect or equivalent to “Did you mean…” in R, and for good reason. When doing advanced stuff (and even the simplest of statistics is pretty advanced in a lot of ways), it’s dangerous to let a mindless automaton like R try to overrule the human user. But because of this, it’s your responsibility to be careful. Always make sure you type exactly what you mean. When dealing with computers, it’s not enough to type approximately the right thing. In general, you absolutely must be precise in what you say to R: like all machines, it is too stupid to be anything other than absurdly literal in its interpretation.
R knows you’re not finished
If you hit enter in a situation where it’s obvious to R that you haven’t actually finished typing the command, R is just smart enough to keep waiting. For example, if you wanted to calculate 15 - 4, and start by typing 15 - and then press enter by mistake, R is smart enough to realise that you probably wanted to type in another number. So here’s what happens:
> 15 -
+
and there’s a blinking cursor next to the plus + sign. What this means is that R is still waiting for you to finish. It thinks you’re still typing your command, so it hasn’t tried to execute it yet. In other words, this plus sign is actually another command prompt. It’s different from the usual one (i.e., the > symbol) to remind you that R is going to add whatever you type now to what you typed last time. For example, if you then go on to type 4 and hit enter, this is what you get:
> 15 -
+ 4
[1] 11
And as far as R is concerned, this is exactly the same as if you had typed 15 - 4.
By the way, if after entering the 15 - you wanted to stop execution and cancel your command, just hit the escape key. R will return you to the normal command prompt (i.e. >) without attempting to execute the botched command.
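The same behaviour can be used on purpose to split a long command over several lines in a script. This is a small sketch, with the object name total chosen only for illustration: because the first line ends with an operator, R knows the command is unfinished and keeps reading.

```r
# The expression is incomplete at the end of the first line,
# so R keeps reading until the command is finished
total <- 15 -
  4
total
## [1] 11
```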
Arithmetic Operations and Functions
R has the basic operators and you can use it as a simple calculator: addition is +, subtraction is -, multiplication is *, division is /, and ^ is the power operator:
2 + 3
## [1] 5
5 - 8
## [1] -3
13 * 21
## [1] 273
34 / 55
## [1] 0.6181818
(5 * 13)/4 - 7
## [1] 9.25
# ^ : to the power of
2^3
## [1] 8
# for exponentiation, you can also use **
2 ** 3
## [1] 8
# square root
sqrt(25)
## [1] 5
Besides the basic operators, you can use standard mathematical functions:
- Rounding: round(), floor(), ceiling()
- Logarithms and exponentials: exp(), log(), log10(), log2()
# R knows pi = 3.1415926...
# round to 2 decimal places
round(pi, digits = 2); round(pi,2)
## [1] 3.14
## [1] 3.14
#Round down to nearest integer
floor(pi)
## [1] 3
#Round up to nearest integer
ceiling(pi)
## [1] 4
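The logarithm and exponential functions listed above work the same way as the rounding functions. Here is a short sketch using base R only; note that log() is the natural logarithm, so it undoes exp().

```r
# exp(1) is Euler's number e
exp(1)
## [1] 2.718282

# log() is the natural logarithm, so it reverses exp()
log(exp(1))
## [1] 1

# base-10 and base-2 logarithms
log10(1000)
## [1] 3
log2(8)
## [1] 3
```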
Main Data types and Vectors
- character: sometimes referred to as string data; values tend to be surrounded by quotes <chr>
- numeric: real or decimal numbers, sometimes referred to as “double” <dbl>
- integer: a subset of numeric in which numbers are stored as integers <int>
- factor: a categorical variable with different categories, sorted alphabetically by default <fct>
- logical: Boolean data (TRUE and FALSE) <lgl>
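To see these types in practice, you can ask R for the class of an object with class(). The sketch below creates one object of each type; the names n_siblings, degree, and is_student are made up for illustration.

```r
name <- "Alex"           # character
age <- 28                # numeric (stored as a double)
n_siblings <- 2L         # integer: the L suffix asks for integer storage
degree <- factor("MSc")  # factor
is_student <- TRUE       # logical

class(name)
## [1] "character"
class(age)
## [1] "numeric"
class(n_siblings)
## [1] "integer"
class(degree)
## [1] "factor"
class(is_student)
## [1] "logical"
```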
Vectors
A vector is a collection of values of the same type. The function c(), short for combine, joins individual elements into a vector.
# assign vector
ages <- c(20:30, 35, 50, 42, 72)
# recall vector
ages
## [1] 20 21 22 23 24 25 26 27 28 29 30 35 50 42 72
# how many things are in the vector 'ages'?
length(ages)
## [1] 15
# what type of object is 'ages'?
class(ages)
## [1] "numeric"
Many R functions work on a whole vector at once, so we can get the mean or median of ages by just typing
# performing functions with vectors
mean(ages)
## [1] 31.6
median(ages)
## [1] 27
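Arithmetic operators also work element by element on a vector. The following sketch recreates ages so that it runs on its own; the name ages_next_year is only illustrative.

```r
ages <- c(20:30, 35, 50, 42, 72)

# adding a single number applies to every element
ages_next_year <- ages + 1
ages_next_year[1:3]
## [1] 21 22 23

# range() returns the smallest and largest value
range(ages)
## [1] 20 72
```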
We can also have a collection of strings, or characters
# vector of days of the week
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
In this case, each word is encased in quotation marks, indicating they are characters rather than object names.
Please answer the following questions about days:
- How many values are in days?
- What type of data (class) is days?
- Overview of days
Manipulating vectors
# add a value to end of vector
ages <- c(ages, 90)
# add value at the beginning
ages <- c(30, ages)
# extracting second value
days[2]
## [1] "Tuesday"
# excluding (dropping) second value
days[-2]
## [1] "Monday" "Wednesday" "Thursday" "Friday" "Saturday" "Sunday"
# extracting first and third values
days[c(1, 3)]
## [1] "Monday" "Wednesday"
Your turn
R tends to handle interpreting data types in the background of most operations. Usually it tries to coerce data to fit the general pattern of the data given to it.
What type of data is each of the following objects? Anything unusual?
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
Factors
days is a character vector. If we turn it into a factor (a categorical variable), R sorts the levels alphabetically by default, so it thinks that Friday should come first. To convert it, we use
days <- factor(days)
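To see the default ordering, you can inspect the levels of the factor with levels() and count them with nlevels(). This is a self-contained sketch that rebuilds days first; the name days_factor is only illustrative.

```r
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
days_factor <- factor(days)

# the levels are sorted alphabetically, not in calendar order
levels(days_factor)[1:3]
## [1] "Friday"   "Monday"   "Saturday"

# number of distinct categories
nlevels(days_factor)
## [1] 7
```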
We can reorder, or relevel, the factor using fct_relevel() from the tidyverse package forcats, or using the levels argument of factor() in base R.
# fct_relevel() takes the levels, in the desired order, as unnamed arguments
days_sorted <- forcats::fct_relevel(days,
                                    "Monday",
                                    "Tuesday",
                                    "Wednesday",
                                    "Thursday",
                                    "Friday",
                                    "Saturday",
                                    "Sunday")
days_sorted2 <- factor(days, levels = c("Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday"))
Missing data, or NA
# create a vector with missing data
times <- c(2, 4, 4, NA, 6)
NA is not a character string; it is a special value that marks a missing observation, so it is written without quotes.
# calculate mean and max on vector with missing data
mean(times)
## [1] NA
max(times)
## [1] NA
# add argument to remove NA
mean(times, na.rm = TRUE)
## [1] 4
max(times, na.rm = TRUE)
## [1] 6
# remove incomplete cases
na.omit(times)
## [1] 2 4 4 6
## attr(,"na.action")
## [1] 4
## attr(,"class")
## [1] "omit"
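Before removing missing values, it often helps to find out where they are. The sketch below uses the base function is.na() on the same times vector, and also shows that the quoted string "NA" is ordinary text rather than a missing value.

```r
times <- c(2, 4, 4, NA, 6)

# is.na() returns TRUE where a value is missing
is.na(times)
## [1] FALSE FALSE FALSE  TRUE FALSE

# count the missing values
sum(is.na(times))
## [1] 1

# keep only the observed values
times[!is.na(times)]
## [1] 2 4 4 6

# the string "NA" in quotes is ordinary text, not a missing value
is.na("NA")
## [1] FALSE
```

Unlike na.omit(), subsetting with !is.na() returns a plain vector without the extra attributes.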
Comments
It is useful to put comments in your code to make everything more readable. Comments help others, and you, when you go back to your code in the future. R comments start with a hash sign, #. Everything after the # to the end of the line is ignored by R. RStudio by default treats every line you write as a command; to turn a line into a comment, place the cursor on the line and press Ctrl + Shift + C on Windows or Cmd + Shift + C on a Mac.
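As a brief illustration, a comment can take up a whole line or follow a command on the same line. The sketch reuses the room dimensions from earlier.

```r
# This whole line is a comment and is ignored by R
room_length <- 5.63   # a comment can also follow a command on the same line
room_width <- 6.48
room_length * room_width
## [1] 36.4824
```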