I have an ebook text file named Frankenstein.txt
and I would like to know how many times each letter is used in the novel.
My Setup:
I imported the text file like this in order to get a vector of characters, character_array:
filesize <- file.info("Frankenstein.txt")$size
string <- readChar("Frankenstein.txt", filesize)
character_array <- unlist(strsplit(string, ""))
character_array
gives me something like this.
"F" "r" "a" "n" "k" "e" "n" "s" "t" "e" "i" "n" "\r", ...
My Goal:
I would like to count how many times each character appears in the text file. In other words, I would like a count for each element of unique(character_array):
[1] "F" "r" "a" "n" "k" "e" "s" "t" "i" "\r" "\n" "b" "y" "M"
[15] " " "W" "o" "l" "c" "f" "(" "G" "d" "w" ")" "S" "h" "C"
[29] "O" "N" "T" "E" "L" "1" "2" "3" "4" "p" "5" "6" "7" "8"
[43] "9" "0" "_" "." "v" "," "g" "P" "u" "D" "—" "Y" "j" "m"
[57] "I" "z" "?" ";" "x" "q" "B" "U" "’" "H" "-" "A" "!" ":"
[71] "R" "J" "“" "”" "æ" "V" "K" "[" "]" "‘" "ê" "ô" "é" "è"
My Attempt:
When I call plot(as.factor(character_array))
I get a nice graph which gives me what I want visually.
However, I need the exact count for each of these characters. I would like something like a 2D array:
[,1] [,2] [,3] [,4] ...
[1,] "a" "A" "b" "B" ...
[2,] "1202" "50" "12" "9" ...
One nice way to build these kinds of text-processing pipelines is with magrittr::%>% pipes. Here is one approach, assuming that your text is in "frank.txt" (see bottom for an explanation of each step):
library(magrittr)
# read the text in
frank_txt <- readLines("frank.txt")
# then send the text down this pipeline:
frank_txt %>%
paste(collapse="") %>%
strsplit(split="") %>% unlist %>%
`[`(!. %in% c("", " ", ".", ",")) %>%
table %>%
barplot
Note that you can just stop at the table() step and assign the result to a variable, which you can then manipulate however you want, e.g. by plotting it:
char_counts <- frank_txt %>% paste(collapse="") %>%
strsplit(split="") %>% unlist %>% `[`(!. %in% c("", " ", ".", ",")) %>%
table
barplot(char_counts)
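If only letters matter (the question asks about "each letter"), one possible variation is to keep alphabetic characters only and sort the table by frequency. This is a sketch in base R; the sample sentence stands in for the novel's text, and folding case with tolower() is an assumption, since the question's desired output keeps "a" and "A" separate.

```r
# Stand-in text; in practice this would be the characters read from the file
chars <- unlist(strsplit("It was on a dreary night of November", split = ""))

# Keep alphabetic characters only (drops spaces, digits, punctuation)
letters_only <- chars[grepl("[[:alpha:]]", chars)]

# Fold case (an assumption, remove tolower() to count "a" and "A" separately),
# tabulate, and sort so the most frequent letters come first
letter_counts <- sort(table(tolower(letters_only)), decreasing = TRUE)
head(letter_counts)
```
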
You can also convert the table into a data frame for easier manipulation/plotting later:
counts_df <- data.frame(
char = names(char_counts),
count = as.numeric(char_counts),
stringsAsFactors=FALSE)
head(counts_df)
## char count
## a 13
## b 2
## c 7
## d 5
## e 24
## f 6
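For the two-row layout asked for in the question, one possible sketch is to bind the table's names and counts row-wise in base R. The sample string is a stand-in for the novel's text, and char_matrix is just an illustrative name, not something from the original post.

```r
# Stand-in text; in practice this would be the characters of the novel
chars <- unlist(strsplit("Frankenstein; or, the Modern Prometheus", split = ""))

# Count every distinct character
counts <- table(chars)

# Row 1 holds the characters, row 2 their counts (stored as strings,
# because a matrix can only hold a single type)
char_matrix <- rbind(names(counts), as.character(counts))
char_matrix[, 1:5]

# A single count can also be looked up by name directly from the table
counts[["e"]]
```

Because the counts become strings in the matrix, the table or the data frame is usually the more convenient form for further computation.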
Here is the full pipe-chain with each step explained:
# going to send this text down a pipeline:
frank_txt %>%
# combine lines into a single string (makes things easier downstream)
paste(collapse="") %>%
# tokenize by character (strsplit returns a list, so unlist it)
strsplit(split="") %>% unlist %>%
# remove instances of characters you don't care about
`[`(!. %in% c("", " ", ".", ",")) %>%
# make a frequency table of the characters
table %>%
# then plot them
barplot
Note that this is exactly equivalent to the following horrendous ("monstrous"?!?!) code. The forward pipe %>% just applies the function on its right to the value on its left (and . is a pronoun referring to the value on the left; see the intro vignette):
barplot(table(
unlist(strsplit(paste(frank_txt, collapse=""), split=""))[
!unlist(strsplit(paste(frank_txt, collapse=""), split="")) %in%
c(""," ",".",",")]))
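If you would rather not use pipes at all, a middle ground between the pipeline and the nested call is to name each intermediate result. This is a minimal base R sketch of the same steps; the two short lines of text stand in for readLines("frank.txt"), and the variable name chars is only illustrative.

```r
# Stand-in for readLines("frank.txt"): a character vector, one element per line
frank_txt <- c("You will rejoice to hear", "that no disaster has accompanied")

# Combine the lines into one string, then split it into single characters
chars <- unlist(strsplit(paste(frank_txt, collapse = ""), split = ""))

# Drop the characters that should not be counted
chars <- chars[!chars %in% c("", " ", ".", ",")]

# Tabulate the remaining characters
char_counts <- table(chars)
char_counts
```

Computing chars once also avoids running the paste/strsplit step twice, as the nested version does.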