
How to check if a string contains roman numerals in R?

I have a column of residential addresses in my dataset 'ad'. I want to find the addresses that contain no numbers (including roman numerals). I'm using

ad$check <- grepl("[[:digit:]]",ad$address)

to flag addresses with no digits present. How do I do the same for addresses that contain roman numerals?

Eg: "floor X, DLF Building- III, ABC City"

Priya T asked Nov 07 '22 09:11



1 Answer

You can do this with a regular expression.

Edit (my first answer was nonsense):

x <- c("floor Imaginary,  building- Momentum, ABC City", "floor X, DLF Building- III, ABC City")
# here comes the regex
grepl("\\b[IVXLCDM]+\\b", x, ignore.case = FALSE)
[1] FALSE  TRUE

To break it down:

\\b are word boundaries. It means a roman numeral must stand on its own as a word, delimited on both sides by whitespace, punctuation or the beginning/end of the string.

[IVXLCDM]+ the "word" we are looking for can only consist of the symbols used for roman numerals. These should be all as far as I know. No | is needed between them: inside square brackets a | is matched literally. The + allows numerals of more than one letter, such as III.

ignore.case = FALSE this is the default, which applies if you omit the option. I find it safer, however, to state it explicitly when it matters for the operation at hand.
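To apply this to the data frame from the question, the digit check and the roman numeral check can be combined with a logical OR. The following is a minimal sketch: the sample rows are made up for illustration, and the pattern is a character class without | separators followed by +, so that multi-letter numerals such as III match.

```r
# Hypothetical sample data standing in for the asker's 'ad' data frame
ad <- data.frame(
  address = c("floor X, DLF Building- III, ABC City",
              "12 Main Street, ABC City",
              "Rose Villa, Lake Road, ABC City"),
  stringsAsFactors = FALSE
)

# TRUE if the address contains at least one digit
has_digit <- grepl("[[:digit:]]", ad$address)

# TRUE if the address contains a stand-alone word made only of roman symbols
has_roman <- grepl("\\b[IVXLCDM]+\\b", ad$address)

# TRUE if either kind of number is present
ad$check <- has_digit | has_roman
ad$check
```

Addresses with no number of either kind are then the rows where check is FALSE, e.g. ad[!ad$check, ].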

Use with caution, as a company called e.g. "LCD Industries" would also be flagged as containing a roman numeral. You could combine my approach with this answer to further test whether the symbols are in a valid order.
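One possible way to test the order of the symbols is sketched below. It first extracts every stand-alone word made of roman symbols and then validates each candidate against a commonly used pattern for numerals from 1 to 3999. The pattern and the helper names is_roman and has_roman are assumptions for illustration, not necessarily what the linked answer does.

```r
# Hypothetical helper: TRUE if 'word' is a well-formed roman numeral (1 to 3999)
is_roman <- function(word) {
  pattern <- "^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
  # nzchar() is needed because the pattern alone also matches the empty string
  nzchar(word) & grepl(pattern, word, perl = TRUE)
}

# Hypothetical helper: TRUE if a string contains at least one valid roman numeral
has_roman <- function(x) {
  # Candidate words: consist only of roman symbols and stand on their own
  candidates <- regmatches(x, gregexpr("\\b[IVXLCDM]+\\b", x, perl = TRUE))
  vapply(candidates, function(w) any(is_roman(w)), logical(1))
}

x <- c("floor X, DLF Building- III, ABC City",
       "LCD Industries, Main Road",
       "no numerals here")
has_roman(x)
```

Here "LCD" is extracted as a candidate but rejected because its symbols are not in a valid order. Words that happen to be well-formed numerals, such as "MIX" or the pronoun "I", would still be flagged, so checking the results on real data remains advisable.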

Please test on your data and report if it works.

JBGruber answered Nov 12 '22 15:11
