Strings, Factors, and Regular Expressions

Author

Gabriel I. Cook

Published

April 14, 2024

Overview

In this module, we work with strings, character vectors, and factors. The focus is on factors, so you may skip the section on strings if you wish.

Readings and Preparation

Before Class: First, read to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some basic understanding of the concepts and of how the functions work. The goal of reading is to understand and implement code functions, to support your understanding, and to help you troubleshoot problems. This cannot happen if you just read the content without interacting with it; however, reading is absolutely essential to being successful during class time. Work through some examples so that you have a good idea of your level of understanding and confidence.

Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.

Libraries

{here} 1.0.1: for file path management
{dplyr} 1.1.4: for data frame manipulation

Others:

{forcats} 1.0.0: for working with factors

Strings

A string is an ordered sequence of characters. For example, "Male" is a string.

The {stringr} library contains many functions for handling strings and is part of the {tidyverse} ecosystem.

string_1 <- 'My favorite string used single quotes.'
string_2 <- "My second favorite string uses double quotes."

string_1

[1] "My favorite string used single quotes."

Although string_1 was created with single quotes, it prints on screen with double quotes. The type of quote does not matter except when quotes appear inside of strings. For example, you cannot have unescaped double quotes inside a string delimited by double quotes. You can, however, have single quotes inside of a string with double quotes and vice versa.

string_3 <- 'My favorite string used "single" quotes.'
string_4 <- "My second favorite string uses 'double' quotes."
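Unescaped double quotes cannot appear inside a double-quoted string, but a backslash escape makes it possible. The following is a minimal sketch; the object name string_5 is made up for illustration and does not appear elsewhere in this module.

```r
# string_5 is a hypothetical name used only for this illustration
string_5 <- "My third favorite string uses \"escaped\" double quotes."

print(string_5)       # print() displays the backslash escapes
writeLines(string_5)  # writeLines() displays the characters actually stored
```
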

date_string <- "10/31/2008"

dates_string <- "10/31/2008, 01/08/2009"

date_string

[1] "10/31/2008"

dates_string

[1] "10/31/2008, 01/08/2009"

Splitting Strings

Splitting strings into pieces is easy using stringr::str_split() or stringr::str_split_fixed(). str_split() returns a list() of character vectors, whereas str_split_fixed() returns a character matrix.

Splitting a simple string by space

We can try to split a string at each space but because date_string contains no space, the returned object is a list() holding a character vector with only one element.

stringr::str_split(date_string, pattern = " ")

[[1]]
[1] "10/31/2008"

By contrast, because dates_string contains a space, the returned object is a list() holding a character vector with two elements.

stringr::str_split(dates_string, pattern = " ")

[[1]]
[1] "10/31/2008," "01/08/2009"

Using unlist(), you can convert the list into a vector, which you see has a length() of two (i.e., two dates).

unlist(stringr::str_split(dates_string, pattern = " "))

[1] "10/31/2008," "01/08/2009"

length(unlist(stringr::str_split(dates_string, pattern = " ")))

[1] 2

Notice, however, that the first vector element contains a comma. This is because the split is based on " " which appears after the comma.
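str_split() returns a list because it is vectorized over its input: each input string gets its own character vector of pieces. A small sketch, using a made-up vector many_dates that is not part of the original example:

```r
# many_dates is a hypothetical input vector holding two strings
many_dates <- c("10/31/2008, 01/08/2009", "01/11/2023")

pieces <- stringr::str_split(many_dates, pattern = ", ")

length(pieces)   # number of list elements, one per input string
lengths(pieces)  # number of pieces within each list element
pieces[[1]]      # the character vector for the first input string
```

This is why unlist() or [[1]] is needed before treating the result as an ordinary vector.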

Splitting a string by comma

stringr::str_split(dates_string, pattern = ",")

[[1]]
[1] "10/31/2008"  " 01/08/2009"

The first vector element contains no comma. Neither does the second. That’s because the string was split based on the comma. The second element contains a leading space because the "," was followed by a space.

If this pattern held for the entire string, you could split by a comma followed by a space.

stringr::str_split(dates_string, pattern = ", ")

[[1]]
[1] "10/31/2008" "01/08/2009"

Neither vector element contains a space or a comma.
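When the delimiter is not perfectly consistent, another option is to split on the comma alone and then strip the leftover whitespace. This sketch assumes that trimming with stringr::str_trim() suits your data; the object names parts and trimmed are illustrative.

```r
dates_string <- "10/31/2008, 01/08/2009"

# Split on the comma only; the second piece keeps its leading space
parts <- stringr::str_split(dates_string, pattern = ",")[[1]]

# Remove leading and trailing whitespace from every piece
trimmed <- stringr::str_trim(parts)
trimmed
```
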

Splitting a string by multiple delimiters

Strings can be split by multiple delimiter patterns. Placing a space and a comma inside square brackets ([]) creates a character class that matches either one, so dates_string is split at each space and at each comma.

stringr::str_split(dates_string, pattern = "[ ,]")

[[1]]
[1] "10/31/2008" ""           "01/08/2009"

stringr::str_split(dates_string, pattern = "[, ]")  # same thing: order does not matter

[[1]]
[1] "10/31/2008" ""           "01/08/2009"

In this case, there are three vector elements but the second one is an empty string, because the comma and the space are adjacent and each one triggers its own split. If you wanted to extract all dates from a string, this might not be your solution. Of course, if your string is messy, it might contain commas or spaces and not always both together. For example, if your string is "10/31/2008, 01/08/2009 01/11/2023", some dates are separated by a comma and a space and some by a space only.

Trying to split by pattern "[, ]" will split by one or the other but not both.

stringr::str_split("10/31/2008, 01/08/2009 01/11/2023", pattern = "[, ]")

[[1]]
[1] "10/31/2008" ""           "01/08/2009" "01/11/2023"

You have four vector elements and one is an empty string. This may not be what you want. Changing the pattern to "[, ]+" will allow you to match more than one delimiter at a time. The + is a quantifier, so the pattern will match one or more consecutive occurrences of the preceding character class or character.

stringr::str_split("10/31/2008, 01/08/2009 01/11/2023", pattern = "[, ]+")

[[1]]
[1] "10/31/2008" "01/08/2009" "01/11/2023"

We have three elements, all of which are dates.
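The + quantifier also copes with runs of delimiters longer than two. As a sketch, the made-up string messy_dates below has a comma followed by several spaces in one place and a double space in another:

```r
# messy_dates is a hypothetical string with irregular spacing
messy_dates <- "10/31/2008,   01/08/2009  01/11/2023"

# Each run of commas and/or spaces is treated as a single delimiter
clean_dates <- stringr::str_split(messy_dates, pattern = "[, ]+")[[1]]
clean_dates
length(clean_dates)
```
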

Splitting a string and extracting a specific element

stringr::str_split_fixed("10/31/2008, 01/08/2009 01/11/2023", pattern = "/", n = 3)

     [,1] [,2] [,3]                         
[1,] "10" "31" "2008, 01/08/2009 01/11/2023"
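Unlike str_split(), str_split_fixed() returns a character matrix with one row per input string and n columns, so a specific piece can be extracted by position. A minimal sketch on a single date (the names date_parts, month, and year are illustrative):

```r
date_parts <- stringr::str_split_fixed("10/31/2008", pattern = "/", n = 3)

dim(date_parts)            # one row, three columns
month <- date_parts[1, 1]  # first piece
year  <- date_parts[1, 3]  # third piece
c(month = month, year = year)
```
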

Splitting a string with a regular expression

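If you wanted to split only by digits, you could use a regular expression that matches a run of digits, written "\\d+" in an R string. Note that the pattern used below, "[d+]", is not that expression: inside square brackets, d and + are literal characters, so the class matches a letter d or a plus sign. Because dates_string contains neither, no split occurs and the string is returned whole.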

stringr::str_split(dates_string, pattern = "[d+]")

[[1]]
[1] "10/31/2008, 01/08/2009"
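For comparison, here is a sketch of patterns that do target digits. In an R string the digit class is written "\\d" (the backslash itself must be escaped), and "\\D" matches any non-digit. The object names digit_pieces and digit_runs are illustrative.

```r
dates_string <- "10/31/2008, 01/08/2009"

# Split on every run of non-digit characters, keeping only the numbers
digit_pieces <- stringr::str_split(dates_string, pattern = "\\D+")[[1]]

# Alternatively, extract every run of digits directly
digit_runs <- stringr::str_extract_all(dates_string, pattern = "\\d+")[[1]]

# Both approaches yield the same character vector for this string
identical(digit_pieces, digit_runs)
```
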

Splitting a string with a regular expression and limiting the number of splits

stringr::str_split(dates_string, pattern = "[/d+]", n = 2)  # the class matches "/", "d", or "+"; n = 2 returns at most two pieces

[[1]]
[1] "10"                  "31/2008, 01/08/2009"

Factors

As part of data preparation or cleaning, you will likely have to create factor variables. A factor is a variable that represents categories in data. Factors typically represent qualitative data, such as group membership, levels of a factor, or labels. Factors are particularly useful for statistical analysis and modeling. Factors look like character objects but they are different, and they come in different flavors serving different purposes.

Fixed set of levels: A factor variable has a fixed set of possible values known as levels. Each level represents a distinct category or group within the variable.
Ordered or unordered: Factor variables can be either ordered or unordered. Ordered factors have a meaningful order among their levels, while unordered factors do not. Examples of ordered factors are ranks or ratings, such as low, medium, and high groups.
Factor levels: You can specify factor levels explicitly using the levels argument of factor(), or R will infer levels based on the unique values present in the data. It’s important to ensure that factor levels are correctly specified to avoid unexpected behavior in statistical analyses.
Statistical modeling: Many statistical functions and modeling techniques, such as linear regression or ANOVA, automatically treat factor variables differently from numeric variables. They use factors to understand categorical differences and estimate coefficients or perform hypothesis tests.
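These properties can be seen on a small vector. The sketch below uses a made-up vector sizes; internally, a factor stores integer codes that point to its levels.

```r
# sizes is a hypothetical vector used only for illustration
sizes <- factor(c("low", "high", "medium", "low"))

levels(sizes)      # the fixed set of levels, sorted alphabetically by default
as.integer(sizes)  # the integer code stored for each element
is.ordered(sizes)  # FALSE: factors are unordered unless you request otherwise
```
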

Creating Factors

Let’s get some data and create some factor objects to represent factor variables.

DATA <- data.frame(
  var1 = c(1, 3, 4, 5),
  var2 = c(100, 650, 890, 20),
  var3 = c(1.0, .6, .9, .98),
  group = c("B", "A", "B", "A")
  )

gt::gt(DATA)

var1	var2	var3	group
1	100	1.00	B
3	650	0.60	A
4	890	0.90	B
5	20	0.98	A

In order to convert vectors into factors, we can use factor() or as.factor() from base R, or forcats::as_factor(). The usage of factor() is:

factor(x = character(), 
       levels, 
       labels = levels,
       exclude = NA, 
       ordered = is.ordered(x), 
       nmax = NA
       )

Key Parameters/Arguments

x: a vector of data, usually taking a small number of distinct values.
levels: an optional vector of the unique values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x)).
labels: either an optional character vector of labels for the levels (in the same order as levels after removing those in exclude), or a character string of length 1. Duplicated values in labels can be used to map different values of x to the same factor level.
ordered: logical flag to determine if the levels should be regarded as ordered (in the order given)

Levels vs. Labels

In brief, levels are the input whereas labels are the output for factor(). A factor has a levels attribute, which is set by the labels argument when labels are supplied.

When you convert the vector to a factor, the factor object has levels, which appear in the printout. All factors have levels but, importantly, you must attend to the ordering of those levels.

Because B is the first element of group, you might assume that the first level is B and the second level is A.

The levels() function, however, will tell us the actual order, so let’s verify whether that assumption is true.

levels(as.factor(DATA$group))

[1] "A" "B"

Wrapping the conversion in levels() is informative. as.factor(), like factor(), sorts the levels alphabetically, so A comes first even though B appears first in the data. By contrast, forcats::as_factor() keeps character levels in their order of first appearance (B, then A), so the two functions can return differently ordered levels. This order may appear trivial but it matters when performing statistical modeling. For example, linear regression involving factor/categorical predictors using lm() will treat the first level as the base category to which all other levels will be compared. When a baseline or control group is not positioned first alphabetically, you will need to change the order.
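To see the level orders side by side, along with one way to change the baseline, consider the sketch below. It rebuilds the group values directly so that it runs on its own; relevel() is a base R function that moves a chosen level to the first position. The name group_b_first is illustrative.

```r
group <- c("B", "A", "B", "A")

levels(factor(group))              # sorted alphabetically
levels(forcats::as_factor(group))  # order of first appearance

# Make "B" the first (baseline) level without changing the data
group_b_first <- relevel(factor(group), ref = "B")
levels(group_b_first)
```
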

And what about numbers?

factor_a <- factor(c("1", "8", "3", "5", "2"))

levels(factor_a)

[1] "1" "2" "3" "5" "8"

factor_b <- forcats::as_factor(c("1", "8", "3", "5", "2"))

levels(factor_b)

[1] "1" "8" "3" "5" "2"

If you want more details, the attributes of the object can be examined using attributes(). The class is factor and the levels are the unique values. Importantly, there is no labels attribute.

attributes(factor_a)

$levels
[1] "1" "2" "3" "5" "8"

$class
[1] "factor"

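Again, you see the two functions use the data differently: factor() sorts the levels, whereas forcats::as_factor() keeps them in order of first appearance. Note that factor() sorts character values as text, which matches numeric order here only because every value is a single digit. As the data scientist, you need to know how the function operates.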
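One caution, sketched below with a made-up vector: when numbers are stored as character strings, factor() sorts them as text, so multi-digit values may not land in numeric order. Converting a factor back to numbers also needs care.

```r
# digits_chr is a hypothetical character vector of numbers
digits_chr <- c("10", "2", "1")
f <- factor(digits_chr)

levels(f)                    # sorted as text, not as numbers
as.numeric(f)                # the integer codes, not the original values
as.numeric(as.character(f))  # the original values recovered
```
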

Mutating Factors

Variables that you want to make into factors are typically in a data frame, so let’s mutate factors in a data frame.

DATA |>
  mutate(
    factor_1 = factor(group),
    factor_2 = as.factor(group),
    factor_3 = as.factor(var1)
    ) |>
  glimpse()

Rows: 4
Columns: 7
$ var1     <dbl> 1, 3, 4, 5
$ var2     <dbl> 100, 650, 890, 20
$ var3     <dbl> 1.00, 0.60, 0.90, 0.98
$ group    <chr> "B", "A", "B", "A"
$ factor_1 <fct> B, A, B, A
$ factor_2 <fct> B, A, B, A
$ factor_3 <fct> 1, 3, 4, 5

You see the data frame’s structure lists the new variables as factors.
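When several character columns need converting, one option is across() inside mutate(). This is a sketch that assumes {dplyr} 1.0.0 or later; the data frame toy and its columns are made up for illustration.

```r
library(dplyr)

# toy is a hypothetical data frame with two character columns
toy <- data.frame(
  score = c(1, 3, 4, 5),
  group = c("B", "A", "B", "A"),
  site  = c("x", "x", "y", "y")
)

# Convert every character column to a factor in one step
toy_fct <- toy |>
  mutate(across(where(is.character), factor))

sapply(toy_fct, class)
```
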

Unordered Factors

Even unordered factors store their levels and labels in some order.

For example, let’s create some vectors that could be columns in a data frame.

factor_a <- factor(c("1", "8", "3", "5", "2"))

factor_b <- factor(c("1", "8", "3", "5", "2"))

attributes(factor_a)

$levels
[1] "1" "2" "3" "5" "8"

$class
[1] "factor"

factor_a

[1] 1 8 3 5 2
Levels: 1 2 3 5 8

factor_b

[1] 1 8 3 5 2
Levels: 1 2 3 5 8

Factor Levels and Labels

Looking at the data frame, we see that there are some numeric and character variables but no factors.

Rows: 4
Columns: 4
$ var1  <dbl> 1, 3, 4, 5
$ var2  <dbl> 100, 650, 890, 20
$ var3  <dbl> 1.00, 0.60, 0.90, 0.98
$ group <chr> "B", "A", "B", "A"

We can create several factors. The first factor can be a simple factor based on the character vector group. Using factor() will convert it to a factor variable, which will contain a level for each unique value. The second factor (factor_2a) will make use of the labels parameter, to which we can pass a character vector of labels. The vector passed to labels needs to have one element per unique value, so use unique() to check how many you need. If your vector is of an incorrect length, you will receive an error stating invalid 'labels'. You could instead pass one string label (e.g., "My Label"), but R will create levels by appending a number as a suffix to the string, which likely won’t be as helpful (see factor_2b). Passing only levels that do not match the data (see factor_2c) is a mistake that we will inspect later. Finally, the third factor (factor_3) will make use of both the labels and the levels parameters, to which we can pass character vectors of labels and levels.

DATA$group |> unique()

[1] "B" "A"

DATA <-
  DATA |>
  mutate(
    factor_1  = factor(x = group),
    factor_2a = factor(x = group,
                       labels = c("Old", "Young")
                       ),
    factor_2b = factor(x = group,
                       labels = c("My Label")
                       ),
    factor_2c = factor(x = group,
                       levels = c("Old", "Young")
                       ),    
    factor_3 = factor(x = group,
                      levels = c("A", "B"),
                      labels = c("Young", "Old")   # intentionally reversed 
                      )
    )

var1	var2	var3	group	factor_1	factor_2a	factor_2b	factor_2c	factor_3
1	100	1.00	B	B	Young	My Label2	NA	Old
3	650	0.60	A	A	Old	My Label1	NA	Young
4	890	0.90	B	B	Young	My Label2	NA	Old
5	20	0.98	A	A	Old	My Label1	NA	Young

Piping the data frame to glimpse() will allow you to take inventory of the factors created.

DATA |> glimpse()

Rows: 4
Columns: 9
$ var1      <dbl> 1, 3, 4, 5
$ var2      <dbl> 100, 650, 890, 20
$ var3      <dbl> 1.00, 0.60, 0.90, 0.98
$ group     <chr> "B", "A", "B", "A"
$ factor_1  <fct> B, A, B, A
$ factor_2a <fct> Young, Old, Young, Old
$ factor_2b <fct> My Label2, My Label1, My Label2, My Label1
$ factor_2c <fct> NA, NA, NA, NA
$ factor_3  <fct> Old, Young, Old, Young

The variables are now flagged as <fct>.
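The length rule for labels can be checked directly. The sketch below wraps the call in tryCatch() so that the error message is captured rather than stopping the script; the exact message text may vary with your language settings. The name msg is illustrative.

```r
group <- c("B", "A", "B", "A")

# Three labels for two levels: factor() stops with an error
msg <- tryCatch(
  factor(group, labels = c("Old", "Young", "Extra")),
  error = function(e) conditionMessage(e)
)
msg

# A single label is accepted and numbered once per level
levels(factor(group, labels = "My Label"))
```
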

Inspecting Factor Levels

var1	var2	var3	group	factor_1	factor_2a	factor_2b	factor_2c	factor_3
1	100	1.00	B	B	Young	My Label2	NA	Old
3	650	0.60	A	A	Old	My Label1	NA	Young
4	890	0.90	B	B	Young	My Label2	NA	Old
5	20	0.98	A	A	Old	My Label1	NA	Young

Factor 1

Let’s inspect the variable by pull()ing it from the data frame.

DATA |>
  pull(factor_1)

[1] B A B A
Levels: A B

You see the vector along with all 4 elements. There are also 2 levels listed. To inspect just the levels, pipe the variable to levels().

DATA |>
  pull(factor_1) |>
  levels()

[1] "A" "B"

The levels take on the character elements. For the rows where group is A, factor_1 is shown as A also.

Factor 2a

DATA |>
  pull(factor_2a)

[1] Young Old   Young Old  
Levels: Old Young

The levels take on the labels specified in the function. Note that the levels appear to be alphabetical, "Old" before "Young". Do not be fooled, however. This order is based not on the labels but on the underlying levels: the sorted values "A" and "B" are paired, in order, with labels = c("Old", "Young"), so "A" becomes "Old" and "B" becomes "Young". Had the labels been c("Young", "Old"), "Young" would be listed first. To make this distinction more clear, we will compare factor_2a to factor_3.

In the data frame, the rows where group is A, factor_2a is now shown as Old rather than A.

gt::gt(DATA)

var1	var2	var3	group	factor_1	factor_2a	factor_2b	factor_2c	factor_3
1	100	1.00	B	B	Young	My Label2	NA	Old
3	650	0.60	A	A	Old	My Label1	NA	Young
4	890	0.90	B	B	Young	My Label2	NA	Old
5	20	0.98	A	A	Old	My Label1	NA	Young

Factor 2b

DATA |>
  pull(factor_2b)

[1] My Label2 My Label1 My Label2 My Label1
Levels: My Label1 My Label2

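The levels take on the single label with a number appended to it, one for each underlying level. Note that they follow the order of the underlying levels ("A" becomes My Label1 and "B" becomes My Label2), though the labels alone do not make this distinction clear.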

Factor 2c

DATA |>
  pull(factor_2c)

[1] <NA> <NA> <NA> <NA>
Levels: Old Young

This factor is problematic because it was created incorrectly. None of the values in group ("A" or "B") match the specified levels ("Old" and "Young"), so every row of the factor variable is NA.
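The levels argument must list values that actually occur in the data, while labels supplies the replacement names. A sketch contrasting the two calls (the object names no_match and renamed are illustrative):

```r
group <- c("B", "A", "B", "A")

# No element of group equals "Old" or "Young", so nothing matches
no_match <- factor(group, levels = c("Old", "Young"))
sum(is.na(no_match))

# levels names the data values; labels renames them
renamed <- factor(group, levels = c("A", "B"), labels = c("Old", "Young"))
renamed
```
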

Factor 3

DATA |>
  pull(factor_3)

[1] Old   Young Old   Young
Levels: Young Old

The levels take on the labels specified in the function. Note that they are not presented alphabetically but rather in the order of the labels (labels = c("Young", "Old")) paired to the levels (levels = c("A", "B")). Because "A" comes before "B" in levels, "Young" appears before "Old".

Note: Remember that unordered factors actually have an order, which is based on the underlying characters or level.
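That underlying order shows up whenever R codes or summarizes the factor. In this sketch, which recreates factor_3 as a standalone vector f3, both the integer codes and the table() output follow the level order rather than the alphabet.

```r
group <- c("B", "A", "B", "A")
f3 <- factor(group, levels = c("A", "B"), labels = c("Young", "Old"))

as.integer(f3)    # codes follow the level order: Young = 1, Old = 2
names(table(f3))  # counts are reported in level order
```
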

Ordered Factors

Ordered factors differ from unordered factors in that they are ordered to represent some ordering of levels of a variable. Variables like race, college major, sex, etc. have no order to them. Variables like tumor stage, satisfaction rating, etc. have an order to them. Some values represent outcomes that are less than or greater than others.

Ordered factors should be used when there are more than two categories that are ordinal. Beyond the ordering of the values, ordered factors in statistical models impose constraints on the data that are not imposed with unordered factors. In other words, the magnitude and direction of effects of an outcome variable that is associated with the ordered factor predictor will obey that ordering. Thus, ordering of categories is equivalent to a constraint on the parameter space in the model.

The documentation states that ordered is a “logical flag to determine if the levels should be regarded as ordered (in the order given).”
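One practical consequence of ordering is that comparison operators become meaningful. A sketch with a made-up vector ratings:

```r
# ratings is a hypothetical vector; levels are given from lowest to highest
ratings <- factor(
  c("low", "high", "medium"),
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)

ratings < "high"  # element-wise comparison against a level
max(ratings)      # the highest level present
```
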

An Example with Character Vectors

Taking a variable that you want to turn into a factor and specifying ordered = TRUE will create a new variable that is ordered.

DATA <-
  DATA |>
  mutate(
    factor_4a = factor(x = group,
                      ordered = TRUE
                      )
  )

The factor levels are ordered alphabetically by default. A problem will arise if this order is not the order that you need. Always make sure your ordering is created correctly.

DATA |> pull(factor_4a)

[1] B A B A
Levels: A < B

The attributes show that the object class is now “ordered” and “factor”.

DATA |> pull(factor_4a) |> attributes()

$levels
[1] "A" "B"

$class
[1] "ordered" "factor"

Adjusting the `levels` Order

A safe approach to creating ordered factors is to ensure that you specify the order rather than assume R will read your mind. Do not assume that the alphabetical or alphanumeric order is the proper ordering of the factor. If the letters A and B represented grades, A > B rather than A < B. Similarly, Young < Old but unless you control the order, R will set Old < Young. Your ordered factor would then be incorrect.

Passing levels = c("B", "A") will ensure that A > B.

DATA <-
  DATA |>
  mutate(
    factor_4b = factor(x = group,
                       levels = c("B", "A"),
                       ordered = TRUE
                       )
  )

The factor levels now take on the order passed to levels.

DATA |> pull(factor_4b)

[1] B A B A
Levels: B < A

If you want to add different labels, pass a character vector to labels.

DATA <-
  DATA |>
  mutate(
    factor_4c = factor(x = group,
                       levels = c("B", "A"),
                       labels = c("Young", "Old"),
                       ordered = TRUE
                       )
  )

The factor levels now display the labels, in the order passed to levels.

DATA |> pull(factor_4c)

[1] Young Old   Young Old  
Levels: Young < Old

Comparing the factor variables, versions a, b, and c

DATA |> select(contains("4")) |> gt::gt()

factor_4a	factor_4b	factor_4c
B	B	Young
A	A	Old
B	B	Young
A	A	Old

An Example with Character Vectors

When the variable that you want to use for creating a factor is character in nature, levels needs a character vector.

DATA <-
  DATA |>
  mutate(
    factor_4 = factor(x = group,
                      levels = c("A", "B"),
                      labels = c("Young", "Old"),
                      ordered = TRUE
                      )
  )

Let’s take inventory of the data frame.

DATA |> glimpse()

Rows: 4
Columns: 13
$ var1      <dbl> 1, 3, 4, 5
$ var2      <dbl> 100, 650, 890, 20
$ var3      <dbl> 1.00, 0.60, 0.90, 0.98
$ group     <chr> "B", "A", "B", "A"
$ factor_1  <fct> B, A, B, A
$ factor_2a <fct> Young, Old, Young, Old
$ factor_2b <fct> My Label2, My Label1, My Label2, My Label1
$ factor_2c <fct> NA, NA, NA, NA
$ factor_3  <fct> Old, Young, Old, Young
$ factor_4a <ord> B, A, B, A
$ factor_4b <ord> B, A, B, A
$ factor_4c <ord> Young, Old, Young, Old
$ factor_4  <ord> Old, Young, Old, Young

Now, factor_4 is shown as <ord>, which indicates that it is an ordered factor. Looking at the data frame, the levels and the labels look just like those for factor_3.

DATA |> gt::gt()

var1	var2	var3	group	factor_1	factor_2a	factor_2b	factor_2c	factor_3	factor_4a	factor_4b	factor_4c	factor_4
1	100	1.00	B	B	Young	My Label2	NA	Old	B	B	Young	Old
3	650	0.60	A	A	Old	My Label1	NA	Young	A	A	Old	Young
4	890	0.90	B	B	Young	My Label2	NA	Old	B	B	Young	Old
5	20	0.98	A	A	Old	My Label1	NA	Young	A	A	Old	Young

Nothing appears visually different between factor_3 and factor_4 in the data frame. We know that factor_4 is ordered, however. Let’s see what happens if we inspect the variable and its levels.

DATA |>
  pull(factor_4)

[1] Old   Young Old   Young
Levels: Young < Old

Now we see something different. The elements of the factor are listed as well as the levels. Importantly, the < indicates an order to the levels such that one level is less than/greater than the other.
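The distinction can also be checked programmatically. In this sketch, f_unordered and f_ordered are standalone copies of factor_3 and factor_4.

```r
group <- c("B", "A", "B", "A")

f_unordered <- factor(group, levels = c("A", "B"), labels = c("Young", "Old"))
f_ordered   <- factor(group, levels = c("A", "B"), labels = c("Young", "Old"),
                      ordered = TRUE)

is.ordered(f_unordered)
is.ordered(f_ordered)
class(f_ordered)

# The element values are the same; only the ordering information differs
identical(as.character(f_unordered), as.character(f_ordered))
```
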

An Example with Numeric Vectors

Let’s create another ordered factor. In this instance, the vector used to make the factor is numeric. Assume that the numeric values in var1 represent satisfaction responses to a question like “How satisfied are you with your purchase?”. The options available were 1 = “very dissatisfied”, 2 = “dissatisfied”, 3 = “neutral”, 4 = “satisfied”, 5 = “very satisfied”. If you wanted to use the responses as a grouping variable or as a predictor for another variable, you could treat them as an ordered categorical variable or even as numeric values. Response options clearly have an order to them, but the distances between adjacent options may not be equal. Making the levels ordered preserves that order without assuming equal spacing.

Looking at the values in var1, you see that they are numbers: 1, 3, 4, 5. Not all of the numbers have been used, either. For example, the “dissatisfied” option (2) was not used by anyone.

DATA$var1

[1] 1 3 4 5

Because var1 is numeric, pass a numeric vector to levels and a character vector to labels. If the variable you are trying to convert into the factors is character and not numeric, your vector passed to levels would be character as well.

DATA <-
  DATA |>
  mutate(factor_5a = factor(var1, 
                           levels = 1:5,     # or c(1, 2, 3, 4, 5)
                           labels = c("very dissatisfied",
                                      "dissatisfied", 
                                      "neutral", 
                                      "satisfied", 
                                      "very satisfied"
                                      ),
                           ordered = TRUE
                           )
       )

DATA |>
  pull(factor_5a)

[1] very dissatisfied neutral           satisfied         very satisfied   
5 Levels: very dissatisfied < dissatisfied < neutral < ... < very satisfied
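Because levels = 1:5 lists every response option, the unused option remains a level even though no one chose it. The sketch below rebuilds the factor as a standalone vector f5 (an illustrative name) and shows how table() reports the unused level with a count of zero and how droplevels() removes unused levels if you do not want them.

```r
var1 <- c(1, 3, 4, 5)

f5 <- factor(
  var1,
  levels  = 1:5,
  labels  = c("very dissatisfied", "dissatisfied", "neutral",
              "satisfied", "very satisfied"),
  ordered = TRUE
)

table(f5)                # "dissatisfied" appears with a count of 0
nlevels(f5)              # all five levels are retained
nlevels(droplevels(f5))  # unused levels removed
```
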

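If your variable is numeric and you do not pass an argument to labels, the factor is still created but its levels are the numbers themselves (1 through 5), which are less informative than descriptive labels.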

DATA |> gt::gt()

var1	var2	var3	group	factor_1	factor_2a	factor_2b	factor_2c	factor_3	factor_4a	factor_4b	factor_4c	factor_4	factor_5a
1	100	1.00	B	B	Young	My Label2	NA	Old	B	B	Young	Old	very dissatisfied
3	650	0.60	A	A	Old	My Label1	NA	Young	A	A	Old	Young	neutral
4	890	0.90	B	B	Young	My Label2	NA	Old	B	B	Young	Old	satisfied
5	20	0.98	A	A	Old	My Label1	NA	Young	A	A	Old	Young	very satisfied

Summary

Factors represent categorical groupings of data rather than numeric quantities. They can be unordered, in which case no level is treated as greater or less than another, or they can be ordered. Ordered factors provide information about a hierarchy of the levels, which will be relevant for building appropriate statistical models that depend on the order.

Session Information

sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/Los_Angeles
 date     2024-04-14
 pandoc   3.1.5 @ C:/PROGRA~1/Pandoc/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package     * version date (UTC) lib source
   BiocManager   1.30.22 2023-08-08 [1] RSPM (R 4.3.0)
 P bit           4.0.5   2022-11-15 [?] CRAN (R 4.3.1)
 P bit64         4.0.5   2020-08-30 [?] CRAN (R 4.3.1)
 P cli           3.6.1   2023-03-23 [?] CRAN (R 4.3.1)
 P colorspace    2.1-0   2023-01-23 [?] CRAN (R 4.3.1)
 P crayon        1.5.2   2022-09-29 [?] CRAN (R 4.3.1)
 P digest        0.6.33  2023-07-07 [?] CRAN (R 4.3.1)
   dplyr       * 1.1.4   2023-11-17 [1] RSPM (R 4.3.0)
 P evaluate      0.21    2023-05-05 [?] CRAN (R 4.3.1)
 P fansi         1.0.4   2023-01-22 [?] CRAN (R 4.3.1)
 P fastmap       1.1.1   2023-02-24 [?] CRAN (R 4.3.1)
 P forcats     * 1.0.0   2023-01-29 [?] CRAN (R 4.3.1)
 P generics      0.1.3   2022-07-05 [?] CRAN (R 4.3.1)
   ggplot2     * 3.5.0   2024-02-23 [1] RSPM (R 4.3.0)
   glue          1.6.2   2022-02-24 [1] RSPM (R 4.3.0)
   gt            0.10.0  2023-10-07 [1] RSPM (R 4.3.0)
 P gtable        0.3.4   2023-08-21 [?] CRAN (R 4.3.1)
   here          1.0.1   2020-12-13 [1] RSPM (R 4.3.0)
 P hms           1.1.3   2023-03-21 [?] CRAN (R 4.3.1)
   htmltools     0.5.7   2023-11-03 [1] RSPM (R 4.3.0)
   htmlwidgets   1.6.4   2023-12-06 [1] RSPM (R 4.3.0)
 P jsonlite      1.8.7   2023-06-29 [?] CRAN (R 4.3.1)
   knitr         1.45    2023-10-30 [1] RSPM (R 4.3.0)
 P lifecycle     1.0.3   2022-10-07 [?] CRAN (R 4.3.1)
   lubridate   * 1.9.3   2023-09-27 [1] RSPM (R 4.3.0)
   magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.3.0)
 P munsell       0.5.0   2018-06-12 [?] CRAN (R 4.3.1)
 P pillar        1.9.0   2023-03-22 [?] CRAN (R 4.3.1)
 P pkgconfig     2.0.3   2019-09-22 [?] CRAN (R 4.3.1)
 P purrr       * 1.0.2   2023-08-10 [?] CRAN (R 4.3.1)
 P R.methodsS3   1.8.2   2022-06-13 [?] CRAN (R 4.3.0)
 P R.oo          1.25.0  2022-06-12 [?] CRAN (R 4.3.0)
 P R.utils       2.12.2  2022-11-11 [?] CRAN (R 4.3.1)
 P R6            2.5.1   2021-08-19 [?] CRAN (R 4.3.1)
   readr       * 2.1.4   2023-02-10 [1] RSPM (R 4.3.0)
   renv          1.0.3   2023-09-19 [1] RSPM (R 4.3.0)
 P rlang         1.1.1   2023-04-28 [?] CRAN (R 4.3.1)
   rmarkdown     2.25    2023-09-18 [1] RSPM (R 4.3.0)
 P rprojroot     2.0.3   2022-04-02 [?] CRAN (R 4.3.1)
   rstudioapi    0.15.0  2023-07-07 [1] RSPM (R 4.3.0)
 P sass          0.4.7   2023-07-15 [?] CRAN (R 4.3.1)
   scales        1.3.0   2023-11-28 [1] RSPM (R 4.3.0)
   sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.3.0)
 P stringi       1.7.12  2023-01-11 [?] CRAN (R 4.3.0)
   stringr     * 1.5.1   2023-11-14 [1] RSPM (R 4.3.0)
 P tibble      * 3.2.1   2023-03-20 [?] CRAN (R 4.3.1)
 P tidyr       * 1.3.0   2023-01-24 [?] CRAN (R 4.3.1)
 P tidyselect    1.2.0   2022-10-10 [?] CRAN (R 4.3.1)
   tidyverse   * 2.0.0   2023-02-22 [1] RSPM (R 4.3.0)
 P timechange    0.2.0   2023-01-11 [?] CRAN (R 4.3.1)
 P tzdb          0.4.0   2023-05-12 [?] CRAN (R 4.3.1)
   utf8          1.2.4   2023-10-22 [1] RSPM (R 4.3.0)
   vctrs         0.6.5   2023-12-01 [1] RSPM (R 4.3.0)
   vroom       * 1.6.5   2023-12-05 [1] RSPM (R 4.3.0)
 P withr         2.5.0   2022-03-03 [?] CRAN (R 4.3.1)
 P xfun          0.40    2023-08-09 [?] CRAN (R 4.3.1)
 P xml2          1.3.5   2023-07-06 [?] CRAN (R 4.3.1)
 P yaml          2.3.7   2023-01-23 [?] CRAN (R 4.3.0)

 [1] C:/Users/gcook/Sync/git/fods24/renv/library/R-4.3/x86_64-w64-mingw32
 [2] C:/Users/gcook/AppData/Local/R/cache/R/renv/sandbox/R-4.3/x86_64-w64-mingw32/5b568fc0

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────
