Convert string data into data frame

Tags:

string

regex

r

I am new to R; any suggestions would be appreciated.

This is the data:

coordinates <- "(-79.43591570873059, 43.68015339477487), (-79.43491506339724, 43.68036886994886), (-79.43394727223847, 43.680578504490335), (-79.43388162422195, 43.68058996121469), (-79.43281544978878, 43.680808044458765), (-79.4326971769691, 43.68079658822322)"

I would like this to become:

Latitude           Longitude
-79.43591570873059 43.68015339477487
-79.43491506339724 43.68036886994886
-79.43394727223847 43.680578504490335
-79.43388162422195 43.68058996121469
-79.43281544978878 43.680808044458765
-79.4326971769691  43.68079658822322
Johnny Tang asked Dec 04 '19 04:12


4 Answers

You can use scan with a little gsub:

matrix(scan(text = gsub("[()]", "", coordinates), sep = ","), 
       ncol = 2, byrow = TRUE, dimnames = list(NULL, c("Lat", "Long")))
# Read 12 items
#            Lat     Long
# [1,] -79.43592 43.68015
# [2,] -79.43492 43.68037
# [3,] -79.43395 43.68058
# [4,] -79.43388 43.68059
# [5,] -79.43282 43.68081
# [6,] -79.43270 43.68080

The precision is still there; the values are only rounded when the matrix is printed.

Two clear advantages:

  • Fast.
  • Handles a multi-element "coordinates" vector as input (e.g. coordinates <- rep(coordinates, 10)).
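
Both points can be checked with a short sketch. The helper name parse_coords and the shortened two-pair input are assumptions made for illustration; they are not part of the original answer.

```r
# Shortened input: the first two pairs from the question.
coordinates <- "(-79.43591570873059, 43.68015339477487), (-79.43491506339724, 43.68036886994886)"

# Hypothetical helper wrapping the scan/gsub approach shown above.
parse_coords <- function(x) {
  m <- matrix(scan(text = gsub("[()]", "", x), sep = ",", quiet = TRUE),
              ncol = 2, byrow = TRUE, dimnames = list(NULL, c("Lat", "Long")))
  as.data.frame(m)
}

df <- parse_coords(coordinates)

# Ask for more significant digits to see that the stored value was not rounded.
format(df$Lat[1], digits = 15)

# A character vector with several elements is parsed in one call.
df10 <- parse_coords(rep(coordinates, 10))
nrow(df10)  # 20 rows: 2 pairs in each of 10 elements
```

Wrapping the matrix in as.data.frame gives the data frame the question asks for; the matrix alone may be enough if only numeric work follows.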

Here's another option:

library(data.table)
fread(gsub("[()]", "", gsub("), (", "\n", toString(coordinates), fixed = TRUE)), header = FALSE)

The toString(coordinates) is for cases when length(coordinates) > 1. You could also use fread(text = gsub(...), ...) and skip using toString. I'm not sure of the advantages or limitations of either approach.

A5C1D2H2I1M1N2O1R2T1 answered Oct 18 '22 18:10


We can use str_extract_all from stringr

library(stringr)

df <- data.frame(Latitude = str_extract_all(coordinates, "(?<=\\()-\\d+\\.\\d+")[[1]], 
      Longitude = str_extract_all(coordinates, "(?<=,\\s)\\d+\\.\\d+(?=\\))")[[1]])
df
#            Latitude          Longitude
#1 -79.43591570873059  43.68015339477487
#2 -79.43491506339724  43.68036886994886
#3 -79.43394727223847 43.680578504490335
#4 -79.43388162422195  43.68058996121469
#5 -79.43281544978878 43.680808044458765
#6  -79.4326971769691  43.68079658822322

The Latitude pattern matches the negative decimal number that follows an opening parenthesis, whereas the Longitude pattern matches the decimal number between the comma and space and the closing parenthesis.

Alternatively, skip the lookahead and lookbehind and capture both numbers with a single pattern using str_match_all:

df <- data.frame(str_match_all(coordinates, 
                        "\\((-\\d+\\.\\d+),\\s(\\d+\\.\\d+)\\)")[[1]][, c(2, 3)])

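Both versions return character columns. To convert them to their respective types, you could use type.convert. Passing as.is = TRUE keeps any non-numeric columns as character instead of factor, and avoids the warning that recent versions of R give when as.is is not specified:

df <- type.convert(df, as.is = TRUE)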
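
One way to confirm the conversion is to compare the column classes before and after. This is a minimal sketch that uses a hypothetical two-row data frame of character values standing in for the stringr result, so it runs without stringr installed.

```r
# Stand-in for the character columns produced by str_extract_all / str_match_all.
df <- data.frame(Latitude  = c("-79.43591570873059", "-79.43491506339724"),
                 Longitude = c("43.68015339477487", "43.68036886994886"),
                 stringsAsFactors = FALSE)

classes_before <- sapply(df, class)   # both columns are "character"

# as.is = TRUE: columns that cannot be parsed as numbers stay character.
df <- type.convert(df, as.is = TRUE)

classes_after <- sapply(df, class)    # both columns are "numeric"
```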
Ronak Shah answered Oct 18 '22 19:10


Here is a base R option:

coordinates <- "(-79.43591570873059, 43.68015339477487), (-79.43491506339724, 43.68036886994886), (-79.43394727223847, 43.680578504490335), (-79.43388162422195, 43.68058996121469), (-79.43281544978878, 43.680808044458765), (-79.4326971769691, 43.68079658822322)"
coordinates <- gsub("^\\(|\\)$", "", coordinates)
x <- strsplit(coordinates, "\\), \\(")[[1]]
df <- data.frame(lat=sub(",.*$", "", x), lng=sub("^.*, ", "", x), stringsAsFactors=FALSE)
df

The strategy here is to first strip the leading and trailing parentheses, then split the string on \), \( to get a character vector with one latitude/longitude pair per element. Finally, we split each pair at the comma to build the data frame. Note that both columns are character, not numeric.

                 lat                lng
1 -79.43591570873059  43.68015339477487
2 -79.43491506339724  43.68036886994886
3 -79.43394727223847 43.680578504490335
4 -79.43388162422195  43.68058996121469
5 -79.43281544978878 43.680808044458765
6  -79.4326971769691  43.68079658822322
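
To make the intermediate steps visible, here is the same strategy on a shortened input. The two-pair string with four decimal places is made up for readability, and the as.numeric calls are an addition to the answer's code for cases where numeric columns are wanted.

```r
# Shortened, made-up input so the intermediate values are easy to read.
s <- "(-79.4359, 43.6801), (-79.4349, 43.6803)"

# Step 1: strip the leading "(" and the trailing ")".
s <- gsub("^\\(|\\)$", "", s)
# s is now "-79.4359, 43.6801), (-79.4349, 43.6803"

# Step 2: split on the literal "), (" between pairs.
x <- strsplit(s, "\\), \\(")[[1]]
# x is c("-79.4359, 43.6801", "-79.4349, 43.6803")

# Step 3: take the text before and after the comma, converting to numeric.
df <- data.frame(lat = as.numeric(sub(",.*$", "", x)),
                 lng = as.numeric(sub("^.*, ", "", x)))
```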
Tim Biegeleisen answered Oct 18 '22 19:10


Yet another base R version with a bit of regex. It relies on the fact that replacing the parentheses with newlines leaves blank lines between the pairs, and read.csv skips blank lines on import.

read.csv(text=gsub(")|(, |^)\\(", "\n", coordinates), col.names=c("lat","long"), header=FALSE)
#        lat     long
#1 -79.43592 43.68015
#2 -79.43492 43.68037
#3 -79.43395 43.68058
#4 -79.43388 43.68059
#5 -79.43282 43.68081
#6 -79.43270 43.68080
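
The intermediate text can be inspected on a shortened input. This is a sketch under two assumptions: the two-pair string is made up, and the closing parenthesis in the pattern is escaped as \\) for clarity, which is a small variation on the pattern used above.

```r
# Shortened, made-up input.
s <- "(-79.4359, 43.6801), (-79.4349, 43.6803)"

# Replace ")" and "(" (with any preceding ", ") by newlines.
txt <- gsub("\\)|(, |^)\\(", "\n", s)
# txt is "\n-79.4359, 43.6801\n\n-79.4349, 43.6803\n"

# read.csv drops the blank lines (blank.lines.skip = TRUE by default)
# and converts both columns to numeric.
df <- read.csv(text = txt, header = FALSE, col.names = c("lat", "long"))
```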

Advantages:

  • Handles vector input, like the scan answer.
  • Converts the output columns to numeric.

Disadvantages:

  • Not super fast
thelatemail answered Oct 18 '22 18:10