 

Sparklyr/Hive: how to use regex (regexp_replace) correctly?

Consider the following example:

dataframe_test<- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z'))

# A tibble: 2 x 1
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

sdf <- copy_to(sc, dataframe_test, overwrite = TRUE)

> sdf
# Source:   table<dataframe_test> [?? x 1]
# Database: spark_connection
                    mydate
                     <chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

I would like to modify the character timestamp so that it has a more conventional format. I tried to do so using regexp_replace, but it fails (the string is returned unchanged).

> sdf <- sdf %>% mutate(regex = regexp_replace(mydate, '(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})Z', '$1-$2-$3 $4:$5:$6.$7'))
> sdf
# Source:   lazy query [?? x 2]
# Database: spark_connection
                    mydate                    regex
                     <chr>                    <chr>
1 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z

Any ideas? What is the correct syntax?

asked Dec 19 '22 by ℕʘʘḆḽḘ


2 Answers

Spark SQL and Hive provide two different functions:

  • regexp_extract - takes a string, a pattern, and the index of the group to be extracted.
  • regexp_replace - takes a string, a pattern, and a replacement string.

The former can be used to extract a single group, with the same index semantics as java.util.regex.Matcher.
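
For example, in a hedged sketch not taken from the original answer (the column names whole and month are made up), group 0 is the entire match while higher indices pick the corresponding capture group:

sdf %>% mutate(
  whole = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 0),  # "2011-03"
  month = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 2))  # "03"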

For regexp_replace, the pattern has to match the whole string; if there is no match, the input string is returned unchanged:

sdf %>% mutate(
  regex = regexp_replace(mydate, '^([0-9]{4}).*', "$1"),
  regexp_bad = regexp_replace(mydate, '([0-9]{4})', "$1"))

## Source:   query [2 x 3]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 3
##                     mydate regex               regexp_bad
##                      <chr> <chr>                    <chr>
## 1 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z
## 2 2011-03-01T00:00:04.226Z  2011 2011-03-01T00:00:04.226Z

while with regexp_extract a full match is not required:

sdf %>% mutate(regex = regexp_extract(mydate, '([0-9]{4})', 1))

## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate regex
##                      <chr> <chr>
## 1 2011-03-01T00:00:04.226Z  2011
## 2 2011-03-01T00:00:04.226Z  2011

Also, due to the indirect execution (R -> Java), backslashes have to be escaped twice:

sdf %>% mutate(
  regex = regexp_replace(
    mydate, 
    '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$',
    '$1-$2-$3 $4:$5:$6.$7'))
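
To double-check the escaping, one option (a suggestion not from the original answer) is to render the SQL that sparklyr generates before running it; show_query() from dplyr/dbplyr prints the literal pattern that will reach Spark:

# Sketch: inspect how many backslashes survive the R -> SQL translation
sdf %>%
  mutate(regex = regexp_replace(mydate, '^(\\\\d{4}).*$', '$1')) %>%
  show_query()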

Normally one would use Spark datetime functions:

spark_session(sc) %>%  
  invoke("sql",
    "SELECT *, DATE_FORMAT(CAST(mydate AS timestamp), 'yyyy-MM-dd HH:mm:ss.SSS') parsed from dataframe_test") %>% 
  sdf_register


## Source:   query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
## # A tibble: 2 x 2
##                     mydate                  parsed
##                      <chr>                   <chr>
## 1 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226
## 2 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226

but sadly sparklyr seems to be extremely limited in this area, and treats timestamps as strings.
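
Depending on the Spark and sparklyr versions, a possible workaround (an untested sketch, not part of the answer above) relies on the fact that function calls mutate() does not recognize are passed through verbatim to Spark SQL, so the same datetime functions can be used without invoke("sql", ...):

# Sketch only: to_timestamp() needs Spark >= 2.2; both it and
# date_format() are passed through to Spark SQL untranslated.
sdf %>%
  mutate(parsed = date_format(
    to_timestamp(mydate),
    "yyyy-MM-dd HH:mm:ss.SSS"))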

See also change string in DF using hive command and mutate with sparklyr.

answered Dec 20 '22 by zero323


I had some difficulty replacing "." with "", but it finally works with:

mutate(myvar2 = regexp_replace(myvar, "[.]", ""))
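
Following the double-escaping rule from the accepted answer, an equivalent (untested sketch, not part of the original post) would be to escape the dot itself, though the character class is arguably clearer:

# Assumed equivalent: "\\\\." reaches the Spark regex engine as \. (a literal dot)
mutate(myvar2 = regexp_replace(myvar, "\\\\.", ""))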
answered Dec 20 '22 by SébastienQUINAULT