In this lesson, we expand upon the simple data types (numeric, character and logical) discussed in Lesson 1 by introducing more complex data structures.
In this lesson, you will get to know the following data structures in R: vectors, matrices, arrays, lists and data frames.
A vector is an ordered list of values. Vectors can be of any of the simple types: numeric, character or logical.
However, all items in a vector must be of the same type. A vector can be of any length.
Defining a vector variable is similar to declaring a simple type variable, but the vector is created using the function c(), which combines values into a vector:
# Declare a vector variable of strings
a_vector <- c("Birmingham", "Derby", "Leicester", "Lincoln", "Nottingham", "Wolverhampton")
a_vector
[1] "Birmingham"    "Derby"         "Leicester"     "Lincoln"
[5] "Nottingham"    "Wolverhampton"
Note that the second line of the returned elements starts with [5], as it begins with the fifth element of the vector.
Other functions for creating vectors include seq() and rep():
# Create a vector of real numbers with an interval of 0.5 between 1 and 7
a_vector <- seq(1, 7, by = 0.5)
a_vector
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
# Create a vector with four identical character string values
a_vector <- rep("Ciao", 4)
a_vector
[1] "Ciao" "Ciao" "Ciao" "Ciao"
Numeric vectors can also be created using a simple syntax:
# Create a vector of integer numbers from 1 to 10
a_vector <- 1:10
a_vector
[1] 1 2 3 4 5 6 7 8 9 10
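As a brief supplementary sketch (not part of the original lesson), seq() and rep() also accept arguments that control the number of values and the repetition pattern. The variable names below are illustrative only.

```r
# seq() with length.out: request a number of evenly spaced values
# instead of a step size
five_steps <- seq(0, 1, length.out = 5)
five_steps

# rep() with times: repeat the whole vector
repeated_whole <- rep(c("a", "b"), times = 2)
repeated_whole

# rep() with each: repeat every element before moving to the next one
repeated_each <- rep(c("a", "b"), each = 2)
repeated_each
```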
You can access individual elements of a vector by specifying the index of the element between square brackets, following the vector’s identifier. Remember, in R, the first element of a vector has an index of 1. For example, to retrieve the third element of a vector named a_vector:
a_vector <- 3:8
a_vector[3] # Retrieves the third element
[1] 5
To retrieve multiple elements, use a vector of indices:
a_vector <- 3:8
a_vector[c(2, 4)] # Retrieves the second and fourth elements
[1] 4 6
In this case, the values 4 and 6 are returned, corresponding to indices 2 and 4 in a_vector.
Note that the vector of indices (c(2, 4)) is created on the fly without declaring a variable name.
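Beyond positive indices, square brackets also accept negative indices and logical vectors. The following is a minimal sketch of both; the names without_first and greater_than_five are made up for illustration.

```r
a_vector <- 3:8

# Negative index: every element except the first
without_first <- a_vector[-1]
without_first

# Logical index: keep only the elements for which the condition is TRUE
greater_than_five <- a_vector[a_vector > 5]
greater_than_five
```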
Try creating and selecting elements from a vector yourself. Follow these steps:
1. Create a vector named east_midlands_cities containing the cities: Derby, Leicester, Lincoln, Nottingham.
2. Select the cities at indices 2 to 4 and assign them to a new vector named selected_cities.
east_midlands_cities <- c("Derby", "Leicester", "Lincoln", "Nottingham")
my_indexes <- 2:4
selected_cities <- east_midlands_cities[my_indexes]
The range() function in R returns the minimum and maximum values of a vector. This can be particularly helpful when analyzing the spread of data in a vector. For example:
# Create a numeric vector
numeric_vector <- c(2, 8, 4, 16, 6)
# Apply the range() function
vector_range <- range(numeric_vector)
vector_range # Displays the minimum and maximum values
[1] 2 16
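If only one end of the range is needed, or the width of the spread, base R offers min(), max() and diff(). A small sketch using the same values (the result variable names are illustrative):

```r
numeric_vector <- c(2, 8, 4, 16, 6)

lowest  <- min(numeric_vector)          # smallest value
highest <- max(numeric_vector)          # largest value
spread  <- diff(range(numeric_vector))  # maximum minus minimum
c(lowest, highest, spread)
```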
In R, functions can be applied to vectors just like they are with individual variables. When a function is applied to a vector, it typically processes each element of the vector, resulting in a new vector of the same length as the input.
For instance, adding a value (like 10) to a numeric vector will add that value to each element of the vector:
numeric_vector <- 1:5
numeric_vector <- numeric_vector + 10 # Adds 10 to each element
numeric_vector
[1] 11 12 13 14 15
Similarly, applying a function like sqrt() to a numeric vector will compute the square root of each element:
numeric_vector <- 1:5
numeric_vector <- sqrt(numeric_vector)
numeric_vector # Displays the square roots
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
A logical condition applied to a vector will return a logical vector indicating whether each element meets the condition:
numeric_vector <- 1:5
logical_vector <- numeric_vector >= 3
logical_vector # Shows TRUE or FALSE for each element
[1] FALSE FALSE TRUE TRUE TRUE
Moreover, functions like any() and all() provide overall evaluations of a vector based on a condition. any() returns TRUE if any elements satisfy the condition, while all() returns TRUE only if all elements satisfy the condition:
numeric_vector <- 1:5
any(numeric_vector >= 3) # Checks if any element is >= 3
[1] TRUE
all(numeric_vector >= 3) # Checks if all elements are >= 3
[1] FALSE
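Two related base R functions are worth knowing alongside any() and all(): which() reports the positions that satisfy a condition, and sum() counts them, because TRUE is treated as 1. A minimal sketch with illustrative variable names:

```r
numeric_vector <- 1:5

# Positions of the elements that are >= 3
matching_positions <- which(numeric_vector >= 3)
matching_positions

# Number of elements that are >= 3 (each TRUE counts as 1)
matching_count <- sum(numeric_vector >= 3)
matching_count
```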
When creating vectors in R, it is also important to understand type coercion. If you combine different data types in a vector (e.g., mixing numbers and characters), R automatically converts all elements to a single common type. For example, if you combine numeric and character data in a vector, all elements become characters:
mixed_vector <- c(1, "text", TRUE)
print(mixed_vector) # Notice how all elements are coerced to the same type
[1] "1" "text" "TRUE"
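To see which type a vector ended up with, class() can be used. The sketch below also shows an explicit conversion with as.numeric(); the variable names are illustrative, and suppressWarnings() is used only to keep the output quiet when a string cannot be converted.

```r
# Logical values combined with numbers become numeric (TRUE becomes 1)
numbers_and_logical <- c(1, TRUE)
class(numbers_and_logical)

# Anything combined with a character string becomes character
with_text <- c(1, "text", TRUE)
class(with_text)

# Explicit conversion back to numbers; strings that are not numbers become NA
converted <- suppressWarnings(as.numeric(with_text))
converted
```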
Factors are a special data type in R, similar to vectors but limited to predefined values called levels. Factors are not covered in this module, but you can learn more about them in the Programming with R tutorial.
Matrices in R are two-dimensional data structures, where data is organized in rows and columns. They are particularly useful for performing a variety of mathematical operations.
To create a matrix, use the matrix() function, providing a vector of values and the desired dimensions:
a_matrix <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2)
a_matrix
     [,1] [,2]
[1,]    3    4
[2,]    5    3
[3,]    7    1
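Note that matrix() fills the values column by column by default. As a short sketch, the byrow argument switches to row-wise filling, and dim() reports the resulting dimensions (the name row_wise_matrix is illustrative):

```r
# Same values as above, but filled row by row
row_wise_matrix <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2, byrow=TRUE)
row_wise_matrix

# Number of rows and columns
dim(row_wise_matrix)
```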
R supports numerous operators and functions for matrix algebra. For example, basic arithmetic operations can be performed on matrices:
x <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2)
y <- matrix(c(1, 2, 3, 4, 5, 6), nrow=3, ncol=2)
z <- x * y # Element-wise multiplication
z
     [,1] [,2]
[1,]    3   16
[2,]   10   15
[3,]   21    6
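The * operator multiplies element by element. For the matrix product from linear algebra, R provides the %*% operator, which requires the number of columns of the first matrix to equal the number of rows of the second. A minimal sketch, using t() to transpose x so the dimensions match (the name matrix_product is illustrative):

```r
x <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2)
y <- matrix(c(1, 2, 3, 4, 5, 6), nrow=3, ncol=2)

# t(x) is 2 x 3 and y is 3 x 2, so the product is a 2 x 2 matrix
matrix_product <- t(x) %*% y
matrix_product
```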
When working with matrices, range() can help you quickly identify the lowest and highest values within a particular row or column. It does not select rows or columns itself; use direct indexing for the selection and pass the selected values to range():
# Creating a matrix with numeric values
matrix_data <- matrix(1:9, nrow=3)
# Finding the range of values in the first column
first_column_range <- range(matrix_data[,1])
print(first_column_range) # Displays the minimum and maximum values of the first column
[1] 1 3
Although range() is not used for selecting rows or columns, knowing the spread of values within a matrix helps inform later data manipulation and analysis. For example, you can check whether the values of a column fall within given bounds:
# Assuming you want to know if the first column contains values within a specific range
is_in_range <- first_column_range[1] >= 2 && first_column_range[2] <= 8
print(is_in_range) # Checks if the range of the first column is between 2 and 8
[1] FALSE
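To compute the range of every column in one step, one option is base R's apply() function, which calls a function on each row (margin 1) or each column (margin 2) of a matrix. This is a supplementary sketch; the name column_ranges is illustrative.

```r
matrix_data <- matrix(1:9, nrow=3)

# Apply range() to each column (margin 2); the result has one column per
# input column, with the minimum in row 1 and the maximum in row 2
column_ranges <- apply(matrix_data, 2, range)
column_ranges
```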
You can also exclude specific columns or rows from a matrix using negative indexing. This is particularly useful for analysis or visualization when you want to focus on specific parts of the matrix:
# Creating a matrix
matrix_data <- matrix(1:9, nrow=3)
# Excluding the first column from the matrix
matrix_without_first_column <- matrix_data[, -1] # Excludes the first column
print(matrix_without_first_column)
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
For a detailed overview of matrix operations, refer to Quick-R.
Arrays in R are like higher-dimensional matrices, capable of storing data in multiple dimensions. Creating an array requires specifying the values and the dimensions for each axis:
a3dim_array <- array(1:24, dim=c(4, 3, 2)) # Creates a 3-dimensional array
a3dim_array
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24

Note: An array can have a single dimension, resembling a vector. However, arrays have additional attributes like dim and offer different functionalities.
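The difference between a plain vector and a one-dimensional array can be inspected with dim() and is.array(). A minimal sketch with illustrative variable names:

```r
plain_vector  <- 1:5
one_dim_array <- array(1:5, dim=5)

dim(plain_vector)        # NULL: a plain vector has no dim attribute
dim(one_dim_array)       # 5: the array stores its single dimension
is.array(plain_vector)   # FALSE
is.array(one_dim_array)  # TRUE
```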
Selecting elements from matrices and arrays in R is similar to vector selection, but requires specifying an index for each dimension.
For matrices:
# Example matrix
a_matrix <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2)
a_matrix
     [,1] [,2]
[1,]    3    4
[2,]    5    3
[3,]    7    1
# Selecting the second row, first and second columns
a_matrix[2, c(1, 2)]
[1] 5 3
For arrays with multiple dimensions:
# Example 3-dimensional array
an_array <- array(1:12, dim=c(3, 2, 2))
an_array
, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8   11
[3,]    9   12

# Selecting elements with specific indices
an_array[2, c(1, 2), 2]
[1] 8 11
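In the selections above, R simplifies the result to a plain vector because only one row is selected. If the matrix structure should be preserved, the drop argument of the square-bracket operator can be set to FALSE. A short sketch (the name second_row is illustrative):

```r
a_matrix <- matrix(c(3, 5, 7, 4, 3, 1), nrow=3, ncol=2)

# Without drop = FALSE this would return the vector 5 3;
# with it, the result stays a matrix with 1 row and 2 columns
second_row <- a_matrix[2, c(1, 2), drop = FALSE]
second_row
dim(second_row)
```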
Create a 3-dimensional array, extract 2 elements to form a vector, and 4 elements to form a matrix.
Array creation:
a3dim_array <- array(1:24, dim=c(4, 3, 2))
Extracting elements:
a_vector <- a3dim_array[3, c(1, 2), 2]
a_matrix <- a3dim_array[c(3, 4), c(1, 2), 2]
Lists in R are incredibly versatile and can hold elements of different types, including vectors, matrices, other lists, and even functions. This makes lists a powerful tool for organizing and storing complex, heterogeneous collections of data.
Elements in lists are selected using double square brackets.
Basic list:
employee <- list("Christian", 2017)
employee
[[1]]
[1] "Christian"

[[2]]
[1] 2017

# Selecting the first element
employee[[1]]
[1] "Christian"
Named lists allow selection using the $ symbol:
# Named list
employee <- list(employee_name = "Christian", start_year = 2017)
employee
$employee_name
[1] "Christian"

$start_year
[1] 2017

# Selecting by name
employee$employee_name
[1] "Christian"
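Single and double square brackets behave differently on lists: single brackets return a (smaller) list, while double brackets and $ return the element itself. The sketch below illustrates this and also adds a new element by name; the element role and the variable names are made up for illustration.

```r
employee <- list(employee_name = "Christian", start_year = 2017)

as_sub_list <- employee["employee_name"]    # still a list with one element
as_element  <- employee[["employee_name"]]  # the character string itself
class(as_sub_list)
class(as_element)

# Adding a new named element (hypothetical example value)
employee$role <- "Researcher"
length(employee)
```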
Data frames are essential in R for representing tables of data. Each data frame is structured similarly to a named list with each element being a vector of equal length. Below is an example of creating a data frame:
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))
employees
  EmployeeName Age       Role
1        Maria  47  Professor
2         Pete  34 Researcher
3        Sarah  32 Researcher
Data frames are similar to tables in that each column represents a variable, and each row represents an observation.
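A few base R functions help to inspect the structure of a data frame. The following is a minimal sketch; stringsAsFactors = FALSE is set explicitly here as an assumption, so that text columns stay character vectors regardless of the R version in use.

```r
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"),
  stringsAsFactors = FALSE)

n_rows <- nrow(employees)                 # number of observations (rows)
n_cols <- ncol(employees)                 # number of variables (columns)
column_types <- sapply(employees, class)  # type of each column
column_types
str(employees)                            # compact overview of the structure
```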
Can elements of different types be mixed within a single vector or data frame column?
Vector elements (and by extension, data frame columns) must be of the same type (character, logical, or numeric). For example, EmployeeName contains character values, while Age contains numeric values.
Elements in each column of a data frame correspond to a row. The first element in EmployeeName represents the name of the first employee, and similarly for other columns.
To rename columns, use the names() function: names(data_frame)[column_index] <- "new name"
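As a concrete sketch of the renaming pattern, the following renames the second column of the employees data frame. The new name AgeInYears is purely illustrative.

```r
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))

# Rename the second column (Age) to a hypothetical new name
names(employees)[2] <- "AgeInYears"
names(employees)
```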
Selecting data from a data frame is analogous to vector and list selection, but with a focus on the data frame’s two-dimensional structure. You typically need two indices to extract data.
Example of selecting the first element in the first column:
employees[1, 1]
[1] "Maria"
Selecting whole rows:
employees[1, ]
  EmployeeName Age      Role
1        Maria  47 Professor
Selecting whole columns:
employees[, 1]
[1] "Maria" "Pete"  "Sarah"
Columns can also be selected using dollar signs and column names:
employees$Age
[1] 47 34 32
employees$Age[1] # Selecting the first element in the 'Age' column
[1] 47
Modifying a data frame:
employees$Age[2] <- 33
employees$Place <- c("Salzburg", "Salzburg", "Salzburg")
employees
  EmployeeName Age       Role    Place
1        Maria  47  Professor Salzburg
2         Pete  33 Researcher Salzburg
3        Sarah  32 Researcher Salzburg
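Rows can also be selected with a logical condition in the row position, in the same way as elements of a vector. The sketch below starts again from the original data frame (with the original ages); the variable names are illustrative.

```r
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))

# All columns for employees older than 33
older_than_33 <- employees[employees$Age > 33, ]
older_than_33

# Only the names of the employees whose role is "Researcher"
researcher_names <- employees[employees$Role == "Researcher", "EmployeeName"]
as.character(researcher_names)
```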
Our simple data frame includes columns EmployeeName, Age and Role:
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))
In this exercise you are asked to include an additional column in the data frame that contains the year of birth of employees. This new column can be derived from the Age column.
This step will most likely require consultation of other online resources.
Creating a data frame employees:
employees <- data.frame(
  EmployeeName = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))
Calculating the current_year:
current_year <- as.integer(format(Sys.Date(), "%Y"))
Calculating Year_of_birth as extra data frame column:
employees$Year_of_birth <- current_year - employees$Age
employees
  EmployeeName Age       Role Year_of_birth
1        Maria  47  Professor          1978
2         Pete  34 Researcher          1991
3        Sarah  32 Researcher          1993