Lesson 3 Data Structures
In this lesson, we expand upon the simple data types (numeric, character and logical) discussed in Lesson 1 by introducing more complex data structures.
You will get to know the following data structures in R:
- Vectors
- Matrices and Arrays
- Lists
- Data Frames
3.1 Vectors
A vector is an ordered collection of values. Vectors can be of any of the following simple types:
- Numeric
- Character
- Logical
However, all items in a vector must be of the same type. A vector can be of any length.
Defining a vector variable is similar to declaring a simple type variable, but the vector is created using the function c(), which combines values into a vector:
# Declare a vector variable of strings
a_vector <- c("Birmingham", "Derby", "Leicester", "Lincoln", "Nottingham", "Wolverhampton")
a_vector
## [1] "Birmingham" "Derby" "Leicester" "Lincoln"
## [5] "Nottingham" "Wolverhampton"
Other functions for creating vectors include seq() and rep():
# Create a vector of real numbers with an interval of 0.5 between 1 and 7
a_vector <- seq(1, 7, by = 0.5)
a_vector
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
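The rep() output that follows is consistent with repeating one string four times. A minimal sketch, assuming that was the original call:

```r
# Assumed call: repeat the string "Ciao" four times
a_vector <- rep("Ciao", 4)
a_vector
```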
## [1] "Ciao" "Ciao" "Ciao" "Ciao"
Numeric vectors can also be created using a simple syntax:
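One such shorthand is the colon operator, which generates consecutive integers. The call below is an assumption, chosen because it reproduces the output shown next:

```r
# Colon operator: the integers from 1 to 10 (assumed original call)
a_vector <- 1:10
a_vector
```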
## [1] 1 2 3 4 5 6 7 8 9 10
3.1.1 Vector Element Selection
You can access individual elements of a vector by specifying the index of the element between square brackets, following the vector’s identifier. Remember, in R, the first element of a vector has an index of 1. For example, to retrieve the third element of a vector named a_vector:
## [1] 5
To retrieve multiple elements, use a vector of indices:
## [1] 4 6
In this case, the values 4 and 6 are returned, corresponding to indices 2 and 4 in a_vector. Note that the vector of indices (c(2, 4)) is created on the fly.
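The two selections above can be reproduced with a short sketch. The definition of a_vector is not shown in the text, so the values 3 to 7 are an assumption chosen to match the printed results; third_element and some_elements are illustrative names:

```r
# Assumed definition: the integers 3, 4, 5, 6, 7
a_vector <- 3:7

# Single element: the third value
third_element <- a_vector[3]
third_element

# Multiple elements: the second and fourth values
some_elements <- a_vector[c(2, 4)]
some_elements
```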
Try creating and selecting elements from a vector yourself. Follow these steps:
- Create a vector named east_midlands_cities containing the cities: Derby, Leicester, Lincoln, Nottingham.
- Select the last three cities and assign them to a new vector named selected_cities.
See solution!
east_midlands_cities <- c("Derby", "Leicester", "Lincoln", "Nottingham")
my_indexes <- 2:4
selected_cities <- east_midlands_cities[my_indexes]
3.1.2 Using the range() Function with Vectors
The range()
function in R, is used to find the minimum and maximum values within a vector. This can be particularly helpful when analyzing the spread of data in a vector. For example:
# Create a numeric vector
numeric_vector <- c(2, 8, 4, 16, 6)
# Apply the range() function
vector_range <- range(numeric_vector)
vector_range # Displays the minimum and maximum values
## [1] 2 16
3.1.3 Applying Functions to Vectors
In R, functions can be applied to vectors just as they are to individual values. When a function is applied to a vector, it typically processes each element, returning a new vector of the same length as the input.
For instance, adding a value (like 10) to a numeric vector will add that value to each element of the vector:
numeric_vector <- 1:5
numeric_vector <- numeric_vector + 10 # Adds 10 to each element
numeric_vector
## [1] 11 12 13 14 15
Similarly, applying a function like sqrt() to a numeric vector will compute the square root of each element:
numeric_vector <- 1:5
numeric_vector <- sqrt(numeric_vector)
numeric_vector # Displays the square roots
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
A logical condition applied to a vector will return a logical vector indicating whether each element meets the condition:
numeric_vector <- 1:5
logical_vector <- numeric_vector >= 3
logical_vector # Shows TRUE or FALSE for each element
## [1] FALSE FALSE TRUE TRUE TRUE
Moreover, functions like any() and all() provide overall evaluations of a vector based on a condition. any() returns TRUE if at least one element satisfies the condition, while all() returns TRUE only if all elements satisfy the condition:
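A minimal sketch of both functions, assuming the numeric_vector and the condition from the previous example; it is consistent with the two results printed next (any_result and all_result are illustrative names):

```r
numeric_vector <- 1:5

# TRUE, because at least one element is greater than or equal to 3
any_result <- any(numeric_vector >= 3)
any_result

# FALSE, because 1 and 2 do not satisfy the condition
all_result <- all(numeric_vector >= 3)
all_result
```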
## [1] TRUE
## [1] FALSE
Also, when creating vectors in R, it’s important to understand the concept of type coercion. R is designed to be user-friendly, and when you combine different data types in a vector (e.g., mixing numbers and characters), R will automatically convert all elements to the same type. This process is known as type coercion. For example, if you combine numeric and character data in a vector, all elements will become characters.
mixed_vector <- c(1, "text", TRUE)
print(mixed_vector) # Notice how all elements are coerced to the same type
## [1] "1" "text" "TRUE"
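To see which type R settles on, class() can be called on the result. The sketch below illustrates the usual coercion order (logical, then numeric, then character); the variable names are illustrative only:

```r
# Logical and numeric values combine into a numeric vector (TRUE becomes 1)
logical_and_numeric <- c(TRUE, 2, 3)
class(logical_and_numeric)

# Adding a character value turns every element into a character string
numeric_and_character <- c(1, "text")
class(numeric_and_character)

# Converting back is explicit; strings that are not numbers become NA
# (R normally warns about this, the warning is suppressed here)
converted <- suppressWarnings(as.numeric(numeric_and_character))
converted
```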
3.2 Multi-dimensional Data Types
3.2.1 Matrices
Matrices in R are two-dimensional data structures, where data is organized in rows and columns. They are particularly useful for performing a variety of mathematical operations.
To create a matrix, use the matrix() function, providing a vector of values and the desired dimensions:
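A minimal sketch, assuming the same values and dimensions as the matrix x used in the arithmetic example later in this section; it reproduces the output shown next. The name a_matrix is an assumption. Note that the values fill the matrix column by column:

```r
# Six values arranged into 3 rows and 2 columns, filled column by column
a_matrix <- matrix(c(3, 5, 7, 4, 3, 1), nrow = 3, ncol = 2)
a_matrix
```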
## [,1] [,2]
## [1,] 3 4
## [2,] 5 3
## [3,] 7 1
R supports numerous operators and functions for matrix algebra. For example, basic arithmetic operations can be performed on matrices:
x <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2)
y <- matrix(c(1, 2, 3, 4, 5, 6), nrow=3, ncol=2)
z <- x * y # Element-wise multiplication
z
## [,1] [,2]
## [1,] 3 16
## [2,] 10 15
## [3,] 21 6
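Note that * multiplies element by element. For the matrix product from linear algebra, R provides the %*% operator, and t() transposes a matrix. A brief sketch reusing x and y from above (matrix_product is an illustrative name):

```r
x <- matrix(c(3, 5, 7, 4, 3, 1), nrow = 3, ncol = 2)
y <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

# t(x) is 2 x 3 and y is 3 x 2, so the matrix product is 2 x 2
matrix_product <- t(x) %*% y
matrix_product
```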
When working with matrices, range() can help you quickly identify the lowest and highest values within a particular row or column. It does not select rows or columns; for selection, use direct indexing. Here’s how to use range() with matrices:
# Creating a matrix with numeric values
matrix_data <- matrix(1:9, nrow=3)
# Finding the range of values in the first column
first_column_range <- range(matrix_data[,1])
print(first_column_range) # Displays the minimum and maximum values of the first column
## [1] 1 3
Although range() does not select rows or columns, knowing the spread of values in a matrix helps you decide how to manipulate and analyze it. Here’s an example of how you might use this information:
# Assuming you want to know if the first column contains values within a specific range
is_in_range <- first_column_range[1] >= 2 && first_column_range[2] <= 8
print(is_in_range) # Checks if the range of the first column is between 2 and 8
## [1] FALSE
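To get the range of every column at once, range() can be combined with apply(). A minimal sketch using the same matrix_data as above (column_ranges is an illustrative name):

```r
matrix_data <- matrix(1:9, nrow = 3)

# MARGIN = 2 applies range() to each column; use MARGIN = 1 for rows
column_ranges <- apply(matrix_data, 2, range)
column_ranges  # row 1 holds the minima, row 2 the maxima
```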
You can also exclude specific columns or rows from a matrix using negative indexing. This is useful for analysis or visualization when you want to focus on specific parts of the matrix:
# Creating a matrix
matrix_data <- matrix(1:9, nrow=3)
# Excluding the first column from the matrix
matrix_without_first_column <- matrix_data[, -1] # Excludes the first column
print(matrix_without_first_column)
## [,1] [,2]
## [1,] 4 7
## [2,] 5 8
## [3,] 6 9
For a detailed overview of matrix operations, refer to Quick-R.
3.2.2 Arrays
Arrays in R are like higher-dimensional matrices, capable of storing data in multiple dimensions. Creating an array requires specifying the values and the dimensions for each axis:
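A minimal sketch, assuming the values 1 to 24 and the dimensions 4 x 3 x 2 that the output shown next implies (the name a3dim_array is borrowed from the exercise later in this section):

```r
# 24 values arranged into 4 rows, 3 columns, and 2 layers
a3dim_array <- array(1:24, dim = c(4, 3, 2))
a3dim_array
```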
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 13 17 21
## [2,] 14 18 22
## [3,] 15 19 23
## [4,] 16 20 24
The dim argument sets the size of each dimension: here, 4 rows, 3 columns, and 2 layers.
3.2.3 Selection in Multi-Dimensional Data Types
Selecting elements from matrices and arrays in R is similar to vector selection, but requires specifying an index for each dimension.
For matrices:
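A sketch consistent with the output shown next: the whole matrix is printed first, then its second row is selected. The name a_matrix and the exact indexing call are assumptions:

```r
a_matrix <- matrix(c(3, 5, 7, 4, 3, 1), nrow = 3, ncol = 2)
a_matrix

# Second row, first and second column
a_matrix[2, c(1, 2)]
```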
## [,1] [,2]
## [1,] 3 4
## [2,] 5 3
## [3,] 7 1
## [1] 5 3
For arrays with multiple dimensions:
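A sketch consistent with the output shown next: a 3 x 2 x 2 array is printed, then two elements of its second layer are selected. The name an_array and the exact indexing call are assumptions:

```r
an_array <- array(1:12, dim = c(3, 2, 2))
an_array

# Second row, first and second column, second layer
an_array[2, c(1, 2), 2]
```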
## , , 1
##
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
##
## , , 2
##
## [,1] [,2]
## [1,] 7 10
## [2,] 8 11
## [3,] 9 12
## [1] 8 11
Create a 3-dimensional array, extract 2 elements to form a vector, and 4 elements to form a matrix.
See solution!
Array creation:
a3dim_array <- array(1:24, dim=c(4, 3, 2))
Extracting elements:
a_vector <- a3dim_array[3, c(1, 2), 2]
a_matrix <- a3dim_array[c(3, 4), c(1, 2), 2]
3.2.4 Lists
Lists in R are incredibly versatile and can hold elements of different types, including vectors, matrices, other lists, and even functions. This makes lists a powerful tool for organizing and storing complex, heterogeneous collections of data.
Elements in lists are selected using double square brackets.
Basic list:
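A sketch consistent with the output shown next. The name employee is an assumption:

```r
# An unnamed list mixing a character string and a number
employee <- list("Christian", 2017)
employee

# Double square brackets return the first element itself
employee[[1]]
```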
## [[1]]
## [1] "Christian"
##
## [[2]]
## [1] 2017
## [1] "Christian"
Named lists allow selection using the $ symbol:
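A sketch consistent with the output shown next. The element names are taken from that output; the variable name employee is an assumption:

```r
# A named list: each element can be reached through its name
employee <- list(employee_name = "Christian", start_year = 2017)
employee

# Select an element by name
employee$employee_name
```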
## $employee_name
## [1] "Christian"
##
## $start_year
## [1] 2017
## [1] "Christian"
Applying range() to lists requires consideration of the list’s diverse elements. For numeric elements, range() can be applied either individually or to the entire list converted into a numeric vector:
# List with numeric and character vectors
list_data <- list(num_vector = 1:5, char_vector = c("a", "b", "c"))
# Applying range() to numeric elements
numeric_ranges <- lapply(list_data, function(x) if(is.numeric(x)) range(x))
numeric_ranges
## $num_vector
## [1] 1 5
##
## $char_vector
## NULL
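The second option, converting the list into one numeric vector, can be sketched with Filter() and unlist(). The list contents and variable names here are hypothetical:

```r
# Hypothetical list: two numeric vectors and one character vector
list_data <- list(first = 1:5, second = c(10, 20), labels = c("a", "b"))

# Keep only the numeric elements, flatten them, then take the overall range
numeric_only <- Filter(is.numeric, list_data)
overall_range <- range(unlist(numeric_only))
overall_range
```

Filtering first matters: calling unlist() on the full list would coerce everything to character, and range() would then compare strings rather than numbers.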
3.2.5 Data Frames
Data frames are essential in R for representing tables of data. Each data frame is structured similarly to a named list with each element being a vector of equal length. Below is an example of creating a data frame:
employees <- data.frame(
EmployeeName = c("Maria", "Pete", "Sarah"),
Age = c(47, 34, 32),
Role = c("Professor", "Researcher", "Researcher"))
employees
## EmployeeName Age Role
## 1 Maria 47 Professor
## 2 Pete 34 Researcher
## 3 Sarah 32 Researcher
Data frames are similar to tables in that each column represents a variable, and each row represents an observation.
Can elements of different types be mixed within a single vector or data frame column?
See solution!
Vector elements (and by extension, data frame columns) must be of the same type (character, logical, or numeric). For example, EmployeeName contains characters, while Age contains numerics.
The elements at the same position across all columns together form one row. The first element in EmployeeName is the name of the first employee, the first element in Age is that employee’s age, and similarly for the other columns.
Selecting data from a data frame is analogous to vector and list selection, but with a focus on the data frame’s two-dimensional structure. You typically need two indices to extract data.
Example of selecting the first element in the first column:
## [1] "Maria"
Selecting whole rows:
## EmployeeName Age Role
## 1 Maria 47 Professor
Selecting whole columns:
## [1] "Maria" "Pete" "Sarah"
Columns can also be selected using dollar signs and column names:
## [1] 47 34 32
## [1] 47
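The selections above can be reproduced with a sketch along these lines. The exact calls are assumptions, chosen to match the printed results:

```r
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))

employees[1, 1]      # first row, first column
employees[1, ]       # whole first row (all columns)
employees[, 1]       # whole first column (all rows)
employees$Age        # column selected by name
employees$Age[1]     # first element of that column
```

Leaving an index empty, as in employees[1, ], selects everything along that dimension.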
Modifying a data frame:
- Changing an element (e.g., updating Pete’s age):
- Adding a new column:
## EmployeeName Age Role Place
## 1 Maria 47 Professor Salzburg
## 2 Pete 33 Researcher Salzburg
## 3 Sarah 32 Researcher Salzburg
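A sketch of both modifications, consistent with the table above. The exact assignment calls are assumptions:

```r
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))

# Change a single element: Pete is in row 2
employees$Age[2] <- 33

# Add a new column; the single value is recycled for every row
employees$Place <- "Salzburg"
employees
```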
Perform operations on data frame columns as you would on vectors. Create a variable to represent the current year and use it to calculate and add a new column for each employee’s year of birth.
See solution!
Creating a data frame employees:
employees <- data.frame(
EmployeeName = c("Maria", "Pete", "Sarah"),
Age = c(47, 34, 32),
Role = c("Professor", "Researcher", "Researcher"))
Calculating the current year:
current_year <- as.integer(format(Sys.Date(), "%Y"))
Calculating year of birth:
employees$Year_of_birth <- current_year - employees$Age
employees