
Delete characters from a column 'n' characters after the given condition in R

Tags:

substring

r

dplyr

I want to delete everything in this column that comes more than 3 characters after '18'.

MGL18JUNFUT
NATIONALUM18JUNFUT
NTPC18JUNFUT
ONGC18JUNFUT
PCJEWELLER18JUNFUT
PEL18JUNFUT
PFC18JUNFUT
PIDILITIND18JUNFUT
POWERGRID18JULFUT
PTC18JULFUT
RAYMOND18JULFUT
RBLBANK18JULFUT
RECLTD18JULFUT
RPOWER18JULFUT
MGL18JUN800PE

I want my output to look like

MGL18JUN
NATIONALUM18JUN
NTPC18JUN
ONGC18JUN
PCJEWELLER18JUN
PEL18JUN
PFC18JUN
PIDILITIND18JUN
POWERGRID18JUL
PTC18JUL
RAYMOND18JUL
RBLBANK18JUL
RECLTD18JUL
RPOWER18JUL
MGL18JUN

I have tried the following code.

output <- sub('(^.*?)18???.*', '' , df$column)

But the output I get is

8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUNFUT
8JUN800PE

The Excel equivalent is:

=LEFT(A1, FIND("18",A1,1) +4)

I have tried many other options, such as sub, gregexpr and substr, but nothing seems to work.

Aman Nangia asked Jun 28 '18 05:06



2 Answers

We could change the sub call by capturing, in a group ((...)), any characters (.*) followed by 18 and then zero to three characters (.{0,3}), or specifically three characters (.{3}), and replacing the whole match with the backreference (\\1) to the captured group

sub("^(.*18.{0,3}).*", "\\1", df$column)

or

sub("^(.*18.{3}).*", "\\1", df$column)
#[1] "MGL18JUN"        "NATIONALUM18JUN" "NTPC18JUN"       "ONGC18JUN"      
#[5] "PCJEWELLER18JUN" "PEL18JUN"        "PFC18JUN"        "PIDILITIND18JUN"
#[9] "POWERGRID18JUL"  "PTC18JUL"        "RAYMOND18JUL"    "RBLBANK18JUL"   
#[13] "RECLTD18JUL"     "RPOWER18JUL"     "MGL18JUN"       

Based on the OP's comments, if there are multiple instances of 18, a lazy quantifier (.*?) makes the match stop at the first one

v1 <- "PIDILITIND18JUN1180CE"
sub("^(.*?18.{3}).*", "\\1", v1)
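To make the difference visible, here is a small sketch that runs the greedy and the lazy pattern on the same value. The variable names greedy and lazy are only illustrative, and perl = TRUE is an assumption added here so that the lazy quantifier is handled by the PCRE engine; the calls above omit it.

```r
# Same sample value as v1 above: "18" occurs twice (in "18JUN" and in "1180")
v1 <- "PIDILITIND18JUN1180CE"

# Greedy: .* runs to the LAST "18" that still has three characters after it
greedy <- sub("^(.*18.{3}).*", "\\1", v1, perl = TRUE)

# Lazy: .*? stops at the FIRST "18"
lazy <- sub("^(.*?18.{3}).*", "\\1", v1, perl = TRUE)

greedy  # "PIDILITIND18JUN1180CE"
lazy    # "PIDILITIND18JUN"
```

The greedy version leaves the string untouched because the last 18 happens to be followed by exactly three characters, so only the lazy pattern gives the wanted result here.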

It would also work on the initial data

sub("^(.*?18.{3}).*", "\\1", df$column)
#[1] "MGL18JUN"        "NATIONALUM18JUN" "NTPC18JUN"       "ONGC18JUN"      
#[5] "PCJEWELLER18JUN" "PEL18JUN"        "PFC18JUN"        "PIDILITIND18JUN"
#[9] "POWERGRID18JUL"  "PTC18JUL"        "RAYMOND18JUL"    "RBLBANK18JUL"   
#[13] "RECLTD18JUL"     "RPOWER18JUL"     "MGL18JUN"       
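For comparison, the Excel formula from the question, =LEFT(A1, FIND("18",A1,1) +4), can be translated fairly directly with regexpr and substr instead of a regex replacement. This is only a sketch: the helper name keep_through_18, its arguments, and the choice to return strings without an 18 unchanged are assumptions made for illustration.

```r
# A few values from the question plus one hypothetical value without "18"
x <- c("MGL18JUNFUT", "POWERGRID18JULFUT", "MGL18JUN800PE", "NOMATCH")

keep_through_18 <- function(x, marker = "18", extra = 3L) {
  # Position of the first literal occurrence of marker (-1 if absent),
  # which plays the role of FIND() in the Excel formula
  pos <- as.integer(regexpr(marker, x, fixed = TRUE))
  # Last character to keep: the marker itself plus `extra` characters
  end <- pos + nchar(marker) + extra - 1L
  # Like LEFT(); strings without the marker are returned as they are
  ifelse(pos > 0L, substr(x, 1L, end), x)
}

keep_through_18(x)
# "MGL18JUN" "POWERGRID18JUL" "MGL18JUN" "NOMATCH"
```

Because regexpr reports the first occurrence, this behaves like the lazy pattern when a string contains more than one 18.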

data

df <- structure(list(column = c("MGL18JUNFUT", "NATIONALUM18JUNFUT", 
"NTPC18JUNFUT", "ONGC18JUNFUT", "PCJEWELLER18JUNFUT", "PEL18JUNFUT", 
"PFC18JUNFUT", "PIDILITIND18JUNFUT", "POWERGRID18JULFUT", "PTC18JULFUT", 
"RAYMOND18JULFUT", "RBLBANK18JULFUT", "RECLTD18JULFUT", "RPOWER18JULFUT", 
"MGL18JUN800PE")), .Names = "column", class = "data.frame",
row.names = c(NA, 
-15L))

akrun answered Sep 27 '22 21:09


You can also use stringr::str_extract

stringr::str_extract(string, "(.*)18\\w{3}")

Logic:

str_extract returns the first part of each string that matches the regular expression. Here .* matches everything up to 18 (. means any character and * means zero or more repetitions), and \w{3} then matches exactly three word characters (letters, digits, or underscore). If you want to extract between one and three characters instead, you can use the {m,n} form, where m is the minimum and n the maximum number of repetitions; for example, \w{2,3} matches two or three word characters. The parentheses around .* are not needed here, since str_extract returns the whole match. See help(regex) for the details. I hope this is helpful.

Output:

#> stringr::str_extract(string, "(.*)18\\w{3}")
# [1] "MGL18JUN"        "NATIONALUM18JUN" "NTPC18JUN"       "ONGC18JUN"      
# [5] "PCJEWELLER18JUN" "PEL18JUN"        "PFC18JUN"        "PIDILITIND18JUN"
# [9] "POWERGRID18JUL"  "PTC18JUL"        "RAYMOND18JUL"    "RBLBANK18JUL"   
# [13] "RECLTD18JUL"     "RPOWER18JUL"     "MGL18JUN" 
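The {m,n} quantifier described above can be tried out without stringr. The sketch below uses a hypothetical base R helper, extract_first, built on regexpr and regmatches to mimic what str_extract does (first match per string, NA where nothing matches). The two sample strings are made up for illustration.

```r
# Hypothetical stand-in for stringr::str_extract using base R only
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- as.integer(m) > 0L        # TRUE where the pattern matched
  out[hit] <- regmatches(x, m)     # regmatches() returns matched elements only
  out
}

# Second value has only two characters after "18"
s <- c("ABC18JUNFUT", "ABC18JU")

exact3 <- extract_first(s, ".*18\\w{3}")    # needs exactly three after "18"
upto3  <- extract_first(s, ".*18\\w{1,3}")  # accepts one to three after "18"

exact3  # "ABC18JUN" NA
upto3   # "ABC18JUN" "ABC18JU"
```

With \w{3} the short value yields NA, while \w{1,3} still returns what is there.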

Input:

string <- c("MGL18JUNFUT",
"NATIONALUM18JUNFUT",
"NTPC18JUNFUT",
"ONGC18JUNFUT",
"PCJEWELLER18JUNFUT",
"PEL18JUNFUT",
"PFC18JUNFUT",
"PIDILITIND18JUNFUT",
"POWERGRID18JULFUT",
"PTC18JULFUT",
"RAYMOND18JULFUT",
"RBLBANK18JULFUT",
"RECLTD18JULFUT",
"RPOWER18JULFUT",
"MGL18JUN800PE")

EDIT:

If you have multiple 18s in your data and want to match only up to the first 18, you can make the * quantifier lazy (non-greedy) by appending ?, like below:

stringr::str_extract(string, "(.*?)18\\w{3}")

PKumar answered Sep 27 '22 23:09