Consider the following example:
dataframe_test <- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z'))
# A tibble: 2 x 1
mydate
<chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z
sdf <- copy_to(sc, dataframe_test, overwrite = TRUE)
> sdf
# Source: table<dataframe_test> [?? x 1]
# Database: spark_connection
mydate
<chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z
I would like to convert the character timestamp to a more conventional format. I tried to do so using regexp_replace, but the string comes back unchanged:
> sdf <- sdf %>% mutate(regex = regexp_replace(mydate, '(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})Z', '$1-$2-$3 $4:$5:$6.$7'))
> sdf
# Source: lazy query [?? x 2]
# Database: spark_connection
mydate regex
<chr> <chr>
1 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z
Any ideas? What is the correct syntax?
Spark SQL and Hive provide two different functions:
regexp_extract - takes a string, a pattern, and the index of the group to be extracted.
regexp_replace - takes a string, a pattern, and the replacement string.
The former can be used to extract a single group, with the index semantics being the same as for java.util.regex.Matcher.
For regexp_replace, only the matched portion of the string is replaced, so to keep just one group the pattern has to match the whole string; if there is no match, the input string is returned unchanged:
sdf %>% mutate(
  regex = regexp_replace(mydate, '^([0-9]{4}).*', "$1"),
  regexp_bad = regexp_replace(mydate, '([0-9]{4})', "$1"))
## Source: query [2 x 3]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
##
## # A tibble: 2 x 3
## mydate regex regexp_bad
## <chr> <chr> <chr>
## 1 2011-03-01T00:00:04.226Z 2011 2011-03-01T00:00:04.226Z
## 2 2011-03-01T00:00:04.226Z 2011 2011-03-01T00:00:04.226Z
while with regexp_extract this is not required:
sdf %>% mutate(regex = regexp_extract(mydate, '([0-9]{4})', 1))
## Source: query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
##
## # A tibble: 2 x 2
## mydate regex
## <chr> <chr>
## 1 2011-03-01T00:00:04.226Z 2011
## 2 2011-03-01T00:00:04.226Z 2011
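Since the answer points to java.util.regex for the group semantics, the difference between the two calls can be reproduced in plain Java without a Spark connection. This is a minimal sketch, not Spark's implementation: the class and method names are made up for illustration, and returning an empty string when nothing matches is a choice made for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSemantics {
    // Replace-style call: every matched span is rewritten, the rest of the input is kept.
    static String replace(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    // Extract-style call: find the first match anywhere in the input and return one group.
    // Returning "" on no match is an assumption of this sketch.
    static String extract(String input, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : "";
    }

    public static void main(String[] args) {
        String ts = "2011-03-01T00:00:04.226Z";
        // The pattern covers the whole string, so only group 1 survives.
        System.out.println(replace(ts, "^([0-9]{4}).*", "$1")); // 2011
        // Only "2011" is matched and it is replaced by itself: the string looks untouched.
        System.out.println(replace(ts, "([0-9]{4})", "$1"));    // 2011-03-01T00:00:04.226Z
        // Extraction needs no anchoring to return the group alone.
        System.out.println(extract(ts, "([0-9]{4})", 1));       // 2011
    }
}
```

The second call shows why regexp_bad appears to do nothing: the match succeeds, but the replacement puts back exactly what was matched.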
Also, due to indirect execution (R -> Java), you have to escape twice:
sdf %>% mutate(
  regex = regexp_replace(
    mydate,
    '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$',
    '$1-$2-$3 $4:$5:$6.$7'))
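To see why a lost layer of escaping leaves the column unchanged, the effect can be simulated in plain Java. This is only a sketch under an assumption: it models an intermediate layer that consumes the backslashes before the regex engine sees the pattern, and it does not reproduce what sparklyr does internally. The class name and helper are hypothetical.

```java
public class EscapeLayers {
    static final String REPLACEMENT = "$1-$2-$3 $4:$5:$6.$7";

    // The pattern as the regex engine needs to receive it: \d with a single backslash
    // (written as \\d here because this is a Java string literal).
    static final String INTACT =
        "^(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})Z$";

    static String reformat(String input, String regex) {
        return input.replaceAll(regex, REPLACEMENT);
    }

    public static void main(String[] args) {
        String ts = "2011-03-01T00:00:04.226Z";

        // Assumed failure mode: an extra unescaping step turns \d into a literal d.
        String stripped = INTACT.replace("\\", "");

        System.out.println(reformat(ts, INTACT));   // 2011-03-01 00:00:04.226
        System.out.println(reformat(ts, stripped)); // 2011-03-01T00:00:04.226Z
    }
}
```

With the backslashes gone, the pattern asks for literal runs of the letter d, nothing matches, and the input comes back as is, which is the symptom shown in the question.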
Normally one would use Spark datetime functions:
spark_session(sc) %>%
  invoke("sql",
    "SELECT *, DATE_FORMAT(CAST(mydate AS timestamp), 'yyyy-MM-dd HH:mm:ss.SSS') parsed FROM dataframe_test") %>%
  sdf_register()
## Source: query [2 x 2]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
##
## # A tibble: 2 x 2
## mydate parsed
## <chr> <chr>
## 1 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226
## 2 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226
but sadly sparklyr seems to be extremely limited in this area and treats timestamps as strings.
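The parse-then-format idea behind the SQL query can also be sketched with java.time, which makes the role of the time zone explicit. The parsed column above reads 01:00:04.226 rather than 00:00:04.226, presumably because the session ran in a UTC+1 zone. The class name is hypothetical and Europe/Paris is only an example zone chosen for illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class IsoTimestamp {
    // Parse an ISO-8601 instant and render it in the given zone.
    static String reformat(String isoInstant, String zoneId) {
        DateTimeFormatter out = DateTimeFormatter
            .ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
            .withZone(ZoneId.of(zoneId));
        return out.format(Instant.parse(isoInstant));
    }

    public static void main(String[] args) {
        String ts = "2011-03-01T00:00:04.226Z";
        System.out.println(reformat(ts, "UTC"));          // 2011-03-01 00:00:04.226
        System.out.println(reformat(ts, "Europe/Paris")); // 2011-03-01 01:00:04.226
    }
}
```

Unlike the regex rewrite, which only reshuffles characters, a real datetime conversion shifts the clock time whenever the target zone differs from UTC.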
See also "change string in DF using hive command and mutate with sparklyr".
I had some difficulty replacing "." with "", but it finally works with:
mutate(myvar2=regexp_replace(myvar, "[.]", ""))
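The character class matters because a bare dot is a regex metacharacter. A small plain-Java sketch of the same replacement shows the difference; the class and method names are hypothetical.

```java
public class LiteralDot {
    // "[.]" is a character class containing only a literal dot.
    static String stripDots(String input) {
        return input.replaceAll("[.]", "");
    }

    // A bare "." matches any character, so every character is removed.
    static String stripAnyChar(String input) {
        return input.replaceAll(".", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDots("1.234.567"));              // 1234567
        System.out.println(stripAnyChar("1.234.567").isEmpty()); // true
    }
}
```

Writing the dot as "[.]" also avoids backslashes entirely, so the escaping issue discussed in the other answer does not come up.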