
Character Frequency from a Vector in R

Tags:

r

I have an ebook text file named Frankenstein.txt and I would like to know how many times each letter is used in the novel.

My Setup:

I imported the text file like this in order to get a vector of characters, character_array:

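filesize <- file.info("Frankenstein.txt")$size  # size in bytes, an upper bound on the character count
string <- readChar("Frankenstein.txt", filesize)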
character_array <- unlist(strsplit(string, ""))

character_array gives me something like this.

 "F" "r" "a" "n" "k" "e" "n" "s" "t" "e" "i" "n" "\r", ...

My Goal:

I would like to count how many times each character appears in the text file. In other words, I would like a count for each element of unique(character_array):

 [1] "F"  "r"  "a"  "n"  "k"  "e"  "s"  "t"  "i"  "\r" "\n" "b"  "y"  "M" 
 [15] " "  "W"  "o"  "l"  "c"  "f"  "("  "G"  "d"  "w"  ")"  "S"  "h"  "C" 
 [29] "O"  "N"  "T"  "E"  "L"  "1"  "2"  "3"  "4"  "p"  "5"  "6"  "7"  "8" 
 [43] "9"  "0"  "_"  "."  "v"  ","  "g"  "P"  "u"  "D"  "—"  "Y"  "j"  "m" 
 [57] "I"  "z"  "?"  ";"  "x"  "q"  "B"  "U"  "’"  "H"  "-"  "A"  "!"  ":" 
 [71] "R"  "J"  "“"  "”"  "æ"  "V"  "K"  "["  "]"  "‘"  "ê"  "ô"  "é"  "è" 

My Attempt:

When I call plot(as.factor(character_array)) I get a nice graph which gives me what I want visually. However, I need to get the exact values for each of these characters. I would like something like a 2D array:

    [,1]   [,2] [,3] [,4] ... 
[1,] "a"    "A"  "b"  "B" ...
[2,] "1202" "50" "12" "9" ...
asked Mar 06 '23 15:03 by Paul Trimor


1 Answer

One nice way to make these kinds of text processing pipelines is with magrittr::%>% pipes. Here is one approach, assuming that your text is in "frank.txt" (see bottom for explanation of each step):

library(magrittr)

# read the text in 
frank_txt <- readLines("frank.txt")

# then send the text down this pipeline:
frank_txt %>% 
  paste(collapse="") %>% 
  strsplit(split="") %>% unlist %>% 
  `[`(!. %in% c("", " ", ".", ",")) %>% 
  table %>% 
  barplot

Note that you can just stop at the table() and assign the result to a variable, which you can then manipulate however you want, e.g. by plotting it:

char_counts <- frank_txt %>% paste(collapse="") %>% 
  strsplit(split="") %>% unlist %>% `[`(!. %in% c("", " ", ".", ",")) %>%
  table

barplot(char_counts)
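
As a minimal sketch of what "manipulate however you want" can look like, the snippet below builds the same kind of table in base R and then looks up and sorts the counts. The two-line sample_txt is an assumption standing in for the contents of the file, so the numbers are not those of the novel.

```r
# Sample text standing in for the novel (an assumption for illustration only)
sample_txt <- c("Frankenstein;", "or, the Modern Prometheus")

# Same steps as the pipeline: join lines, split into characters, drop unwanted ones
chars <- unlist(strsplit(paste(sample_txt, collapse = ""), split = ""))
chars <- chars[!chars %in% c("", " ", ".", ",")]

char_counts <- table(chars)

# Look up the count of a single character by name
e_count <- char_counts[["e"]]

# Most frequent characters first
top_chars <- head(sort(char_counts, decreasing = TRUE), 3)
print(top_chars)
```

Because a table carries its characters as names, indexing by name and sorting by value both work without any conversion.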

You can also convert the table into a data frame for easier manipulation/plotting later:

counts_df <- data.frame(
  char = names(char_counts), 
  count = as.numeric(char_counts), 
  stringsAsFactors=FALSE)

head(counts_df)
## char count
##   a    13
##   b     2
##   c     7
##   d     5
##   e    24
##   f     6
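
The question asks for a two-row layout with characters on top and counts below. One possible way to get there from a table is sketched here; the sample sentence and the names counts_mat and sample_txt are assumptions for illustration, not part of the original answer.

```r
# Short sample sentence standing in for the novel (an assumption)
sample_txt <- "It was on a dreary night of November"

chars <- unlist(strsplit(sample_txt, split = ""))
chars <- chars[chars != " "]
char_counts <- table(chars)

# as.data.frame() on a one-dimensional table gives one row per character
counts_df <- as.data.frame(char_counts, stringsAsFactors = FALSE)
names(counts_df) <- c("char", "count")

# Order rows from most to least frequent
counts_df <- counts_df[order(-counts_df$count), ]

# Two-row character matrix: characters in row 1, counts (as strings) in row 2
counts_mat <- rbind(names(char_counts), as.character(char_counts))
print(counts_mat[, 1:4])
```

A matrix can only hold one type, so the counts become strings in counts_mat; the data frame keeps them numeric and is usually the more convenient of the two.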



Each step explained:

Here is the full pipe chain, annotated step by step:

# going to send this text down a pipeline:
frank_txt %>% 
  # combine lines into a single string (makes things easier downstream)
  paste(collapse="") %>% 
  # tokenize by character (strsplit returns a list, so unlist it)
  strsplit(split="") %>% unlist %>% 
  # remove instances of characters you don't care about
  `[`(!. %in% c("", " ", ".", ",")) %>% 
  # make a frequency table of the characters
  table %>% 
  # then plot them
  barplot

Note that this is exactly equivalent to the following horrendous ("monstrous"?!?!) code. The forward pipe %>% just applies the function on its right to the value on its left, and . is a pronoun referring to the value on the left (see the magrittr intro vignette):

barplot(table(
  unlist(strsplit(paste(frank_txt, collapse=""), split=""))[
    !unlist(strsplit(paste(frank_txt, collapse=""), split="")) %in% 
      c(""," ",".",",")]))
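
A middle ground between the pipe chain and the nested call is to name each intermediate result. The sketch below does that in base R and checks that it produces the same table as the nested form; sample_txt is an assumed stand-in for the lines read from the file, and the plotting step is left out.

```r
# Two sample lines standing in for readLines("frank.txt") (an assumption)
sample_txt <- c("You will rejoice to hear", "that no disaster has accompanied")

# One named step per stage of the pipeline
one_string  <- paste(sample_txt, collapse = "")           # join lines
chars       <- unlist(strsplit(one_string, split = ""))   # split into characters
kept        <- chars[!chars %in% c("", " ", ".", ",")]    # drop unwanted characters
char_counts <- table(kept)                                # count each character

# The nested form from above, minus barplot(), for comparison
nested <- table(
  unlist(strsplit(paste(sample_txt, collapse = ""), split = ""))[
    !unlist(strsplit(paste(sample_txt, collapse = ""), split = "")) %in%
      c("", " ", ".", ",")])

same_counts <- identical(as.vector(char_counts), as.vector(nested))
```

Calling barplot(char_counts) afterwards gives the same plot as the other two versions.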
answered Mar 30 '23 16:03 by lefft