
Regex to extract numbers and trailing letter or white space

Tags: regex, r

I'm trying to extract data from strings that are always in the same format (scraped from social sites with no API support).

Example strings:

53.2k Followers, 11 Following, 1,396 Posts
5m Followers, 83 Following, 1.1m Posts

I'm currently using the regular expression "[0-9]{1,5}([,.][0-9]{1,4})?" to get the numeric sections, preserving the comma and dot separators.

It yields results like:

53.2, 11, 1,396 
5, 83, 1.1

I need a regular expression that also grabs the character after each numeric section, even if it is whitespace, i.e.

53.2k, 11 , 1,396
5m, 83 , 1.1m

Any help is greatly appreciated.

R code for reproduction:

  library(stringr)

  # Same strings as in the examples above
  string1 <- "53.2k Followers, 11 Following, 1,396 Posts"
  string2 <- "5m Followers, 83 Following, 1.1m Posts"

  info <- str_extract_all(string1,"[0-9]{1,5}([,.][0-9]{1,4})?")
  info2 <- str_extract_all(string2,"[0-9]{1,5}([,.][0-9]{1,4})?")

  info 
  info2 
Permafrost asked Mar 18 '19 03:03




1 Answer

I would suggest the following regex pattern:

[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*

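This pattern returns each number together with any trailing letter suffix such as k or m. When a number is followed by a space, only the number itself is returned. Here is an explanation: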

[0-9]{1,3}      match 1 to 3 initial digits
(?:,[0-9]{3})*  followed by zero or more thousands groups
(?:\\.[0-9]+)?  followed by an optional decimal component
[A-Za-z]*       followed by an optional letter suffix (e.g. k or m)
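
Since the question uses stringr, here is a minimal sketch of passing the same pattern to str_extract_all() for both example strings. The variable names (pattern, strings, info) are only illustrative.

```r
library(stringr)

# Same pattern as above; inside an R string literal the backslash is doubled.
pattern <- "[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*"

strings <- c(
  "53.2k Followers, 11 Following, 1,396 Posts",
  "5m Followers, 83 Following, 1.1m Posts"
)

# str_extract_all() is vectorised: it returns a list with one
# character vector of matches per input string.
info <- str_extract_all(strings, pattern)
info
```

The first list element holds "53.2k", "11" and "1,396"; the second holds "5m", "83" and "1.1m".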

I tend to lean towards base R solutions whenever possible, and here is one using gregexpr and regmatches:

txt <- "53.2k Followers, 11 Following, 1,396 Posts"
m <- gregexpr("[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*", txt)
regmatches(txt, m)

[[1]]
[1] "53.2k" "11"    "1,396"
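
The question also asks to keep the character after each number even when it is whitespace. One possible variant, assuming that means at most one letter or one whitespace character, swaps the last component for [A-Za-z\\s]?. The name pattern_ws and the use of perl = TRUE are my own choices here, not part of the pattern above.

```r
# Assumption: "the character after the number" is at most one letter
# or one whitespace character, so the final component is [A-Za-z\\s]?.
pattern_ws <- "[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z\\s]?"

txt <- c(
  "53.2k Followers, 11 Following, 1,396 Posts",
  "5m Followers, 83 Following, 1.1m Posts"
)

# perl = TRUE selects the PCRE engine, which handles (?: ) and \s.
m <- gregexpr(pattern_ws, txt, perl = TRUE)
with_suffix <- regmatches(txt, m)
with_suffix
```

With this variant the first string gives "53.2k", "11 " and "1,396 " (note the trailing spaces), and the second gives "5m", "83 " and "1.1m". If the trailing space is only wanted in some positions, trimming the results afterwards with trimws() is probably simpler than complicating the pattern.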
Tim Biegeleisen answered Sep 22 '22 15:09