
Extracting a string based on position of found character in R

Tags: text, split, r

I found out the positions of "oo" in the following sentence:

sentence <- "It is a good book. Good for first reading.
This book explains everything in Qdetail with tons of examples and exercises for practice. Good for cracking written tests on campuses and competitive exams. It is cheap so any way one can have a copy along with other books"

pos = gregexpr("oo", sentence)

I got the result as

> pos
[[1]]
[1]  10  15  21  50 136 263
attr(,"match.length")
[1] 2 2 2 2 2 2
attr(,"useBytes")
[1] TRUE

Based on this result, I want to extract about 10 characters around each position (5 before the position and 5 after it).

For example, for the first position I should get "s a good bo", and I want the same extraction for every position. I am new to R and could not figure out how to do this. Please help me out.

Also, what should I do if I want to extract whole words instead? For example, I should get "a good book" for the first match.

asked Jun 15 '16 09:06 by Maddy




2 Answers

We can use substring after unlisting the gregexpr output.

v1 <- unlist(gregexpr("oo", sentence))
substring(sentence, v1 - 5, v1 + 5)
#[1] "s a good bo" "ood book. G" "ok. Good fo" "his book ex" "ce. Good fo" "her books"
answered Sep 19 '22 03:09 by akrun
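As a hedged aside on the substring approach above: gregexpr returns -1 when the pattern does not occur, and in that case the one-liner would return a stray fragment from the start of the string rather than nothing. A minimal sketch of a wrapper that guards against this follows. The function name extract_context and its width parameter are hypothetical (they do not appear in the thread), and fixed = TRUE is an assumption that the pattern should be treated as a literal string rather than a regular expression. A shortened sentence is used to keep the example small.

```r
# Hypothetical helper (not from the thread): return `width` characters on
# each side of the start of every literal match of `pattern` in `text`.
extract_context <- function(text, pattern, width = 5) {
  starts <- unlist(gregexpr(pattern, text, fixed = TRUE))
  # gregexpr() reports -1 when there is no match; return an empty vector then
  if (starts[1] == -1) return(character(0))
  # pmax() keeps the start index at 1 or above for matches near the beginning
  substring(text, pmax(starts - width, 1), starts + width)
}

# Shortened version of the sentence from the question
sentence <- "It is a good book. Good for first reading."

extract_context(sentence, "oo")
# [1] "s a good bo" "ood book. G" "ok. Good fo"

extract_context(sentence, "zz")
# character(0)
```

Passing a larger width widens the window on both sides; the result always has one element per match.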


You could also use mapply with substr, reusing pos from the question:

mapply(
  substr,
  x = sentence,
  start = pos[[1]] - 5,
  stop = pos[[1]] + 5,
  USE.NAMES = FALSE  # spell out FALSE; F is an ordinary variable that can be reassigned
)
# [1] "s a good bo" "ood book. G" "ok. Good fo"
# [4] "his book ex" "ce. Good fo" "her books"
answered Sep 19 '22 03:09 by lukeA
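Neither answer above covers the second part of the question (getting "a good book" for the first match). One possible sketch, which is not taken from either answer, is to work on words instead of character positions: split the sentence on whitespace, find the words that contain "oo", and paste each one together with its neighbours. The variable names are illustrative, and stripping punctuation before matching is an assumption made so that the first result reads "a good book" rather than "a good book.". A shortened sentence is used to keep the example small.

```r
# Shortened version of the sentence from the question
sentence <- "It is a good book. Good for first reading."

# Split on runs of whitespace, then drop punctuation from each word
# (an assumption: the asker's expected output has no trailing period)
words <- strsplit(sentence, "\\s+")[[1]]
words <- gsub("[[:punct:]]", "", words)

# Indices of the words that contain the literal string "oo"
hits <- which(grepl("oo", words, fixed = TRUE))

# For each hit, join the previous word, the hit, and the next word,
# clamping the index range at both ends of the sentence
context <- vapply(hits, function(i) {
  idx <- max(i - 1, 1):min(i + 1, length(words))
  paste(words[idx], collapse = " ")
}, character(1))

context
# [1] "a good book"    "good book Good" "book Good for"
```

Changing the 1 in i - 1 and i + 1 to a larger number would widen the window to more words on each side.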