In this lesson, we expand upon the simple data types (numeric, character and logical) discussed in Lesson 1 by introducing more complex data structures.
In this lesson, you will get to know the following data structures in R: vectors, matrices, arrays, lists and data frames.
A vector is an ordered list of values. Vectors can be of any of the following simple types: numeric, character or logical.
However, all items in a vector must be of the same type. A vector can be of any length.
Defining a vector variable is similar to declaring a simple type variable, but the vector is created using the function c(), which combines values into a vector:
# Declare a vector variable of strings
a_vector <- c("Birmingham", "Derby", "Leicester", "Lincoln", "Nottingham", "Wolverhampton")
a_vector
[1] "Birmingham" "Derby" "Leicester" "Lincoln"
[5] "Nottingham" "Wolverhampton"
Note that the second line of the returned elements starts with [5], as it begins with the fifth element of the vector.
Other functions for creating vectors include seq() and rep():
# Create a vector of real numbers with an interval of 0.5 between 1 and 7
a_vector <- seq(1, 7, by = 0.5)
a_vector
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
# Create a vector with four identical character string values
a_vector <- rep("Ciao", 4)
a_vector
[1] "Ciao" "Ciao" "Ciao" "Ciao"
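Both functions accept further arguments. As a small sketch (the values below are chosen purely for illustration), length.out asks seq() for a given number of evenly spaced values, while times and each control how rep() repeats its input:

```r
# Five evenly spaced values between 0 and 1
evenly_spaced <- seq(0, 1, length.out = 5)
evenly_spaced
# [1] 0.00 0.25 0.50 0.75 1.00

# Repeat the whole vector twice
repeated_whole <- rep(c("a", "b"), times = 2)
repeated_whole
# [1] "a" "b" "a" "b"

# Repeat each element twice before moving on to the next one
repeated_each <- rep(c("a", "b"), each = 2)
repeated_each
# [1] "a" "a" "b" "b"
```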
Numeric vectors can also be created using a simple syntax:
# Create a vector of integer numbers from 1 to 10
a_vector <- 1:10
a_vector
[1] 1 2 3 4 5 6 7 8 9 10
You can access individual elements of a vector by specifying the index of the element between square brackets, following the vector’s identifier. Remember, in R, the first element of a vector has an index of 1. For example, to retrieve the third element of a vector named a_vector:
a_vector <- 3:8
a_vector[3] # Retrieves the third element
[1] 5
To retrieve multiple elements, use a vector of indices:
a_vector <- 3:8
a_vector[c(2, 4)] # Retrieves the second and fourth elements
[1] 4 6
In this case, the values 4 and 6 are returned, corresponding to indices 2 and 4 in a_vector.
Note that the vector of indices (c(2, 4)) is created on the fly without assigning it to a variable.
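Try creating and selecting elements from a vector yourself. Follow these steps:
1. Create a vector named east_midlands_cities containing the cities: Derby, Leicester, Lincoln, Nottingham.
2. Select some of its elements using a vector of indices, and store the result in a new vector named selected_cities.
east_midlands_cities <- c("Derby", "Leicester", "Lincoln", "Nottingham")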
my_indexes <- 2:4
selected_cities <- east_midlands_cities[my_indexes]
The range() function in R returns the minimum and maximum values of a vector, in that order. This is particularly helpful when analyzing the spread of data in a vector.
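As a minimal sketch of range() in use, assuming a hypothetical vector of temperature readings (the name temperatures and its values are made up for illustration):

```r
# Hypothetical temperature readings
temperatures <- c(12.5, 7.0, 19.3, 15.8)

# range() returns a vector of length two: minimum first, then maximum
temperature_range <- range(temperatures)
temperature_range
# [1]  7.0 19.3

# The difference between the two values gives the spread of the data
spread <- diff(temperature_range)
```

The result is itself a vector, so its elements can be accessed with temperature_range[1] and temperature_range[2].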
In R, functions can be applied to vectors just like they are with individual variables. When a function is applied to a vector, it typically processes each element of the vector, resulting in a new vector of the same length as the input.
For instance, adding a value (like 10) to a numeric vector will add that value to each element of the vector:
numeric_vector <- 1:5
numeric_vector <- numeric_vector + 10 # Adds 10 to each element
numeric_vector
[1] 11 12 13 14 15
Similarly, applying a function like sqrt() to a numeric vector will compute the square root of each element:
numeric_vector <- 1:5
numeric_vector <- sqrt(numeric_vector)
numeric_vector # Displays the square roots
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
A logical condition applied to a vector will return a logical vector indicating whether each element meets the condition:
numeric_vector <- 1:5
logical_vector <- numeric_vector >= 3
logical_vector # Shows TRUE or FALSE for each element
[1] FALSE FALSE TRUE TRUE TRUE
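A logical vector of this kind can also be placed between the square brackets to keep only the elements that meet the condition. The sketch below uses a hypothetical vector named values; which() is a base R function that returns the positions of the TRUE elements:

```r
# Hypothetical values, chosen so that positions and values differ
values <- c(10, 25, 7, 31)

# TRUE where the element is at least 20
is_large <- values >= 20

# Keep only the elements for which the condition is TRUE
large_values <- values[is_large]
large_values
# [1] 25 31

# Positions (indices) of the elements that satisfy the condition
large_positions <- which(is_large)
large_positions
# [1] 2 4
```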
Moreover, functions like any() and all() provide overall evaluations of a vector based on a condition. any() returns TRUE if any elements satisfy the condition, while all() returns TRUE only if all elements satisfy the condition:
numeric_vector <- 1:5
any(numeric_vector >= 3) # Checks if any element is >= 3
[1] TRUE
all(numeric_vector >= 3) # Checks if all elements are >= 3
[1] FALSE
Also, when creating vectors in R, it’s important to understand the concept of type coercion. R is designed to be user-friendly, and when you combine different data types in a vector (e.g., mixing numbers and characters), R will automatically convert all elements to the same type. This process is known as type coercion. For example, if you combine numeric and character data in a vector, all elements will become characters.
mixed_vector <- c(1, "text", TRUE)
print(mixed_vector) # Notice how all elements are coerced to the same type
[1] "1" "text" "TRUE"
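Coercion follows a fixed order: logical values are converted to numeric when mixed with numbers, and anything mixed with a character string becomes character. A short sketch, using made-up variable names, shows how class() can be used to check what R did, and how as.numeric() converts back explicitly:

```r
# Logical mixed with numeric: TRUE becomes 1
logical_and_numeric <- c(TRUE, 2)
class(logical_and_numeric)
# [1] "numeric"

# Numeric mixed with character: the number becomes a string
numeric_and_character <- c(1, "text")
class(numeric_and_character)
# [1] "character"

# Explicit conversion from character strings back to numbers
converted <- as.numeric(c("1", "2.5"))
converted
# [1] 1.0 2.5
```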
Factors are a special data type in R, similar to vectors but limited to predefined values called levels. Factors are not covered in this module, but you can learn more about them in the Programming with R tutorial.
Matrices in R are two-dimensional data structures, where data is organized in rows and columns. They are particularly useful for performing a variety of mathematical operations.
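To create a matrix, use the matrix() function, providing a vector of values and the desired dimensions. By default, the values fill the matrix column by column:
a_matrix <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2)
a_matrix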
[,1] [,2]
[1,] 3 4
[2,] 5 3
[3,] 7 1
R supports numerous operators and functions for matrix algebra. For example, basic arithmetic operations can be performed on matrices:
x <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2)
y <- matrix(c(1, 2, 3, 4, 5, 6), nrow=3, ncol=2)
z <- x * y # Element-wise multiplication
z
[,1] [,2]
[1,] 3 16
[2,] 10 15
[3,] 21 6
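Note that * multiplies element by element. For the matrix product of linear algebra, R provides the %*% operator, and t() transposes a matrix. The following sketch reuses x and y from above; since both are 3 x 2, x is transposed first so that the dimensions are compatible:

```r
x <- matrix(c(3, 5, 7, 4, 3, 1), nrow = 3, ncol = 2)
y <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

# t(x) is 2 x 3 and y is 3 x 2, so the product is 2 x 2
product <- t(x) %*% y
product
#      [,1] [,2]
# [1,]   34   79
# [2,]   13   37
```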
When working with matrices, range() can help you quickly identify the lowest and highest values within a particular row or column. It does not select rows or columns itself; for that, use direct indexing or other functions. Here’s how to use range() with matrices:
# Creating a matrix with numeric values
matrix_data <- matrix(1:9, nrow=3)
# Finding the range of values in the first column
first_column_range <- range(matrix_data[,1])
print(first_column_range) # Displays the minimum and maximum values of the first column
[1] 1 3
Although range() is not used to select rows or columns, knowing the spread of the data in a matrix helps you decide how to manipulate and analyze it. For example, you can test whether all values in a column fall within given bounds:
# Assuming you want to know if the first column contains values within a specific range
is_in_range <- first_column_range[1] >= 2 && first_column_range[2] <= 8
print(is_in_range) # Checks if the range of the first column is between 2 and 8
[1] FALSE
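To obtain the range of every column or row at once, one option is the base R function apply(), which calls a function on each row (margin 1) or each column (margin 2) of a matrix. This is only a sketch of one possible approach; the variable names are illustrative:

```r
matrix_data <- matrix(1:9, nrow = 3)

# Range of each column: the result has one column per original column,
# with the minimum in the first row and the maximum in the second
column_ranges <- apply(matrix_data, 2, range)
column_ranges
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    3    6    9

# Range of each row, laid out in the same way
row_ranges <- apply(matrix_data, 1, range)
```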
You can also exclude specific columns or rows from a matrix using negative indexing. This is particularly useful when you want to focus an analysis or visualization on specific parts of the matrix:
# Creating a matrix
matrix_data <- matrix(1:9, nrow=3)
# Excluding the first column from the matrix
matrix_without_first_column <- matrix_data[, -1] # Excludes the first column
print(matrix_without_first_column)
[,1] [,2]
[1,] 4 7
[2,] 5 8
[3,] 6 9
For a detailed overview of matrix operations, refer to Quick-R.
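Arrays in R are like higher-dimensional matrices, capable of storing data in multiple dimensions. Creating an array requires specifying the values and the dimensions for each axis:
# Create a 4 x 3 x 2 array holding the values 1 to 24 (the variable name an_array is assumed)
an_array <- array(1:24, dim=c(4, 3, 2))
an_array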
, , 1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
, , 2
[,1] [,2] [,3]
[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
[4,] 16 20 24
Note: An array can have a single dimension, resembling a vector. However, arrays have additional attributes like dim and offer different functionalities.
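The difference can be inspected with dim() and is.array(). In this brief sketch, the name one_dim_array is chosen only for illustration:

```r
a_vector <- 1:4
one_dim_array <- array(1:4, dim = 4)

# A plain vector has no dim attribute
dim(a_vector)
# NULL

# A one-dimensional array stores its length in dim
dim(one_dim_array)
# [1] 4

is.array(a_vector)
# [1] FALSE
is.array(one_dim_array)
# [1] TRUE
```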
Selecting elements from matrices and arrays in R is similar to vector selection, but requires specifying an index for each dimension.
For matrices:
a_matrix <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2)
a_matrix
[,1] [,2]
[1,] 3 4
[2,] 5 3
[3,] 7 1
# Selecting the second row, first and second columns
a_matrix[2, c(1, 2)]
[1] 5 3
For arrays with multiple dimensions:
an_array <- array(1:12, dim=c(3, 2, 2))
an_array
, , 1
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
, , 2
[,1] [,2]
[1,] 7 10
[2,] 8 11
[3,] 9 12
# Selecting elements with specific indices
an_array[2, c(1, 2), 2]
[1] 8 11
Create a 3-dimensional array, extract 2 elements to form a vector, and 4 elements to form a matrix.
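One possible way to approach this exercise is sketched below. The array contents, its dimensions and all variable names are arbitrary choices, not the only valid answer:

```r
# A 2 x 3 x 4 array holding the values 1 to 24
exercise_array <- array(1:24, dim = c(2, 3, 4))

# One row, two columns, one slice: the result is a vector of 2 elements
two_elements <- exercise_array[1, c(1, 2), 1]
two_elements
# [1] 1 3

# Two rows, two columns, one slice: the result is a 2 x 2 matrix
four_elements <- exercise_array[c(1, 2), c(1, 2), 1]
four_elements
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
```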
Lists in R are incredibly versatile and can hold elements of different types, including vectors, matrices, other lists, and even functions. This makes lists a powerful tool for organizing and storing complex, heterogeneous collections of data.
Elements in lists are selected using double square brackets.
Basic list:
employee <- list("Christian", 2017)
employee
[[1]]
[1] "Christian"
[[2]]
[1] 2017
# Selecting the first element
employee[[1]]
[1] "Christian"
Named lists allow selection using the $ symbol:
# Named list
employee <- list(employee_name = "Christian", start_year = 2017)
employee
$employee_name
[1] "Christian"
$start_year
[1] 2017
# Selecting by name
employee$employee_name
[1] "Christian"
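Single square brackets also work on lists, but they behave differently from double brackets: they return a shorter list rather than the element itself. The following sketch, with illustrative variable names, contrasts the two:

```r
employee <- list(employee_name = "Christian", start_year = 2017)

# Single brackets return a list containing the selected element
as_list <- employee["start_year"]
class(as_list)
# [1] "list"

# Double brackets return the element itself
as_value <- employee[["start_year"]]
class(as_value)
# [1] "numeric"

# Only the extracted value can be used directly in arithmetic
next_year <- as_value + 1
```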
Data frames are essential in R for representing tables of data. Each data frame is structured similarly to a named list with each element being a vector of equal length. Below is an example of creating a data frame:
employees <- data.frame(
EmployeeName = c("Maria", "Pete", "Sarah"),
Age = c(47, 34, 32),
Role = c("Professor", "Researcher", "Researcher"))
employees
EmployeeName Age Role
1 Maria 47 Professor
2 Pete 34 Researcher
3 Sarah 32 Researcher
Data frames are similar to tables in that each column represents a variable, and each row represents an observation.
Can elements of different types be mixed within a single vector or data frame column?
Vector elements (and by extension, data frame columns) must be of the same type (character, logical, or numeric). For example, EmployeeName contains characters, while Age contains numerics.
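Both points can be checked directly in R. In this sketch, stringsAsFactors = FALSE is passed explicitly as a precaution, since R versions before 4.0 would otherwise convert character columns to factors; sapply() applies class() to each column in turn:

```r
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"),
  stringsAsFactors = FALSE)

# A data frame is a list whose elements are the columns
is.list(employees)
# [1] TRUE
number_of_columns <- length(employees)

# The type of each column
column_types <- sapply(employees, class)
column_types
# EmployeeName          Age         Role
#  "character"    "numeric"  "character"
```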
Elements in each column of a data frame correspond to a row. The first element in EmployeeName represents the name of the first employee, and similarly for other columns.
To rename columns, use the names() function, giving the data frame, the index of the column and the new name: names(data_frame)[column_index] <- "new name"
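As a short illustration of renaming, the sketch below changes one column by position and another by matching its current name. The new names AgeInYears and Name are hypothetical:

```r
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32))

# Rename the second column by its index
names(employees)[2] <- "AgeInYears"

# Rename a column by looking up its current name
names(employees)[names(employees) == "EmployeeName"] <- "Name"

names(employees)
# [1] "Name"       "AgeInYears"
```

Matching on the current name is less fragile than using an index, because it still works if the column order changes.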
Selecting data from a data frame is analogous to vector and list selection, but with a focus on the data frame’s two-dimensional structure. You typically need two indices to extract data.
Example of selecting the first element in the first column:
employees[1, 1]
[1] "Maria"
Selecting whole rows:
employees[1, ]
EmployeeName Age Role
1 Maria 47 Professor
Selecting whole columns:
employees[, 1]
[1] "Maria" "Pete" "Sarah"
Columns can also be selected using dollar signs and column names:
employees$Age
[1] 47 34 32
employees$Age[1] # Selecting the first element in the 'Age' column
[1] 47
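Rows can also be selected with a logical condition in the first index, and columns by name in the second. The sketch below is one way to combine the two; the variable names is_researcher and researchers are illustrative:

```r
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))

# TRUE for each row whose Role is "Researcher"
is_researcher <- employees$Role == "Researcher"

# Keep the matching rows and only two of the columns
researchers <- employees[is_researcher, c("EmployeeName", "Age")]
researchers
#   EmployeeName Age
# 2         Pete  34
# 3        Sarah  32
```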
Modifying a data frame:
employees$Age[2] <- 33
employees$Place <- c("Salzburg", "Salzburg", "Salzburg")
employees
EmployeeName Age Role Place
1 Maria 47 Professor Salzburg
2 Pete 33 Researcher Salzburg
3 Sarah 32 Researcher Salzburg
Our simple data frame includes columns EmployeeName, Age and Role:
employees <- data.frame(
EmployeeName = c("Maria", "Pete", "Sarah"),
Age = c(47, 34, 32),
Role = c("Professor", "Researcher", "Researcher"))
In this exercise you are asked to include an additional column in the data frame that contains the year of birth of employees. This new column can be derived from the column Age.
This step will most likely require consultation of other online resources.
Creating a data frame employees:
employees <- data.frame(
EmployeeName = c("Maria", "Pete", "Sarah"),
Age = c(47, 34, 32),
Role = c("Professor", "Researcher", "Researcher"))
Calculating the current_year:
current_year <- as.integer(format(Sys.Date(), "%Y"))
Calculating Year_of_birth as an extra data frame column (the output shown corresponds to running the code in 2025):
employees$Year_of_birth <- current_year - employees$Age
employees
EmployeeName Age Role Year_of_birth
1 Maria 47 Professor 1978
2 Pete 34 Researcher 1991
3 Sarah 32 Researcher 1993
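Because Sys.Date() returns a different value each year, the result above changes depending on when the code is run. For a reproducible result, the year can be fixed instead, as in this sketch, where reference_year is a hypothetical variable set to match the output above. In either case the result is approximate, since an employee whose birthday has not yet occurred in the reference year was born one year earlier:

```r
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))

# Fixed reference year (an assumption, chosen to match the output above)
reference_year <- 2025

# Subtraction is applied to every element of the Age column
employees$Year_of_birth <- reference_year - employees$Age
employees$Year_of_birth
# [1] 1978 1991 1993
```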