I would like to flatten lists extracted from HTML tables. A minimal working example is presented below; it depends on the stringr package in R. The first example exhibits the desired behavior.
library(stringr)
years <- c("2005-", "2003-")
unlist(str_extract_all(years,"[[:digit:]]{4}"))
[1] "2005" "2003"
The example below produces an undesirable result when I try to match the last 4-digit number in a string that contains several numbers.
years1 <- c("2005-", "2003-", "1984-1992, 1996-")
unlist(str_extract_all(years1,"[[:digit:]]{4}$"))
character(0)
As I understand the documentation, I should include $ at the end of the pattern to request the match at the end of the string. From the second example, I would like to extract the numbers "2005", "2003", and "1996".
To get the first n characters of a string, we can use the built-in substr() function in R. It takes three arguments: the string, the start position, and the end position. Note: base R's substr() does not count negative positions backward from the last character (stringr's str_sub() and stringi's stri_sub() do); to index from the end, compute the positions with nchar().
The substr() and substring() functions can be used either to extract parts of character strings or, in their replacement forms, to change parts of them. Both are vectorized, so they work on a whole character vector or data frame column at once.
To access the individual characters of an R string, use substr(): with str <- 'string', substr(str, 1, 1) evaluates to 's'. Because a string is a single element of a character vector, length() returns the number of elements rather than the number of characters in a string; use nchar() for that instead.
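As a rough illustration of the point about positions, here is a minimal base R sketch that takes the first and last n characters of each string. The vector s and the names first_n and last_n are made up for this example and do not come from the original posts.

```r
# Illustrative input; any character vector would do.
s <- c("string", "substring")
n <- 3

# First n characters of each element.
first_n <- substr(s, 1, n)

# Last n characters: derive the start position from nchar(),
# since base substr() does not count backward from the end.
len <- nchar(s)
last_n <- substr(s, len - n + 1, len)

print(first_n)
print(last_n)
```

Both calls are vectorized, so the same code works unchanged on a data frame column.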
The stringi package has convenient functions that operate on specific parts of a string, so you can find the last occurrence of four consecutive digits with the following:
library(stringi)
x <- c("2005-", "2003-", "1984-1992, 1996-")
stri_extract_last_regex(x, "\\d{4}")
# [1] "2005" "2003" "1996"
Other ways to get the same result are:
stri_sub(x, stri_locate_last_regex(x, "\\d{4}"))
# [1] "2005" "2003" "1996"
## or, since these count as words
stri_extract_last_words(x)
# [1] "2005" "2003" "1996"
## or if you prefer a matrix result
stri_match_last_regex(x, "\\d{4}")
#      [,1]
# [1,] "2005"
# [2,] "2003"
# [3,] "1996"
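If you would rather stay with stringr, as in the question, the following is one possible sketch. The anchored pattern returns nothing because every string ends in "-" rather than a digit, so four digits followed immediately by the end of the string never occur. A lookahead that permits trailing non-digits before the end is one way around that. The variable names last_year and last_year_base are only illustrative, and the base R line is an alternative I am adding, not part of the approach above.

```r
library(stringr)

years1 <- c("2005-", "2003-", "1984-1992, 1996-")

# Each string ends in "-", so four digits directly before $ never match.
anchored <- str_detect(years1, "[[:digit:]]{4}$")

# Match four digits that are followed only by non-digits up to the end.
last_year <- str_extract(years1, "\\d{4}(?=\\D*$)")

# Base R alternative: the greedy ".*" pushes the captured group
# to the last run of four digits in each string.
last_year_base <- sub(".*([0-9]{4}).*", "\\1", years1)

print(last_year)
print(last_year_base)
```

One difference to keep in mind: str_extract() returns NA for a string with no match, while sub() returns the input string unchanged.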