
Splitting a string by more than one space

Tags:

r

I am trying to load some data into R that is in the following format (as a text file)

Name                  Country            Age
John,Smith            United Kingdom     20
Washington,George     USA                50
Martin,Joseph         Argentina          43

The problem I have is that the "columns" are separated by spaces such that they all line up nicely, but one row may have 5 spaces between values and the next 10 spaces. So when I load it in using read.delim I get a one column data.frame with

"John,Smith            United Kingdom     20"

as the first observation and so on.

Is there any way I can either:

  1. Load the data into R into a usable format? or
  2. Split the character strings into separate columns once I have loaded the data in the one-column format?

My thought was to split the character strings by spaces, except it would need to be between 2 and x spaces (so, for example, "United Kingdom" stays together and doesn't become "United" "" "Kingdom"). But I don't know if that is possible.

I tried strsplit(data.frame[,1], split="\\s") but it returns a list of character vectors like:

"John,Smith" "" "" "" "" "" "" "" "United" "" "Kingdom" "" ""...

which I don't know what to do with.

moman822 asked Mar 14 '16 04:03


2 Answers

Having columns that all "line up nicely" is a typical characteristic of fixed-width data.

For the sake of this answer, I've written your three lines of data and one line of header information to a temporary file called "x". For your actual use, replace "x" with the file name/path, as you would normally use with read.delim.

Here's the sample data:

x <- tempfile()
cat("Name                  Country            Age\nJohn,Smith            United Kingdom     20\nWashington,George     USA                50\nMartin,Joseph         Argentina          43\n", file = x)

R has its own function for reading fixed-width data (read.fwf), but it is notoriously slow and you need to know the column widths before you can get started. We can count those by hand if the file is small, and then use something like:

read.fwf(x, c(22, 18, 4), strip.white = TRUE, skip = 1, 
         col.names = c("Name", "Country", "Age"))
#                Name        Country Age
# 1        John,Smith United Kingdom  20
# 2 Washington,George            USA  50
# 3     Martin,Joseph      Argentina  43
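
If counting by hand is impractical, the widths can also be derived from the header line. The following is only a minimal sketch and not part of the original answer: it assumes that the header names contain no spaces and that every value starts in the same column as its header. The variable names (lines, header, starts, widths, col_names) are made up for illustration.

```r
# Recreate the sample file from above so this block runs on its own.
x <- tempfile()
cat("Name                  Country            Age\nJohn,Smith            United Kingdom     20\nWashington,George     USA                50\nMartin,Joseph         Argentina          43\n", file = x)

lines  <- readLines(x)
header <- lines[1]

# Start position of each header name (each run of non-space characters).
starts <- as.integer(gregexpr("\\S+", header)[[1]])

# A column runs from its start to the start of the next column;
# the last column runs to the end of the longest line in the file.
widths <- diff(c(starts, max(nchar(lines)) + 1L))

# Column names taken from the header itself (assumes no spaces in names).
col_names <- strsplit(trimws(header), "\\s+")[[1]]

df <- read.fwf(x, widths = widths, skip = 1, strip.white = TRUE,
               col.names = col_names)
df
```

If values can be wider than their header, or are right-aligned, this heuristic will cut them in the wrong place, so it is worth checking the result against a few raw lines.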

Alternatively, you can let fwf_empty from the "readr" package guess the widths for you, and then use read_fwf:

library(readr)
read_fwf(x, fwf_empty(x, col_names = c("Name", "Country", "Age")), skip = 1)
#                Name        Country Age
# 1        John,Smith United Kingdom  20
# 2 Washington,George            USA  50
# 3     Martin,Joseph      Argentina  43
A5C1D2H2I1M1N2O1R2T1 answered Sep 30 '22 06:09


You can do this in base R, assuming that no value contains two or more consecutive spaces:

txt = "Name                  Country            Age
John,Smith            United Kingdom     20
Washington,George     USA                50
Martin,Joseph         Argentina          43"

conn = textConnection(txt)
do.call(rbind, lapply(readLines(conn), function(u) strsplit(u,'\\s{2,}')[[1]]))
#     [,1]                [,2]             [,3] 
#[1,] "Name"              "Country"        "Age"
#[2,] "John,Smith"        "United Kingdom" "20" 
#[3,] "Washington,George" "USA"            "50" 
#[4,] "Martin,Joseph"     "Argentina"      "43" 
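
The result above is a character matrix with the header as its first row. As a hedged sketch of one way to get from there to a usable data.frame (this goes beyond the original answer, and the names lines, m, df and df2 are made up for illustration), the first row can be promoted to column names and the columns converted to their natural types. The same regular expression can also be used to turn the runs of spaces into tabs and hand the text to read.delim.

```r
# The sample data as a character vector, one element per line.
# For a real file, use: lines <- readLines("path/to/file.txt")
lines <- c("Name                  Country            Age",
           "John,Smith            United Kingdom     20",
           "Washington,George     USA                50",
           "Martin,Joseph         Argentina          43")

# Split every line on runs of two or more whitespace characters.
# trimws() guards against leading or trailing blanks on a line.
m <- do.call(rbind, strsplit(trimws(lines), "\\s{2,}"))

# Drop the header row from the data and use it as the column names.
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]

# Convert each column to its natural type (Age becomes integer).
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)

# Alternative: replace the runs of spaces with tabs and let read.delim parse it.
df2 <- read.delim(text = gsub("\\s{2,}", "\t", trimws(lines)),
                  stringsAsFactors = FALSE)
```

Both variants rely on the same assumption as the answer: a single space never separates two columns, and a run of two or more spaces never occurs inside a value. The rbind step additionally needs every line to split into the same number of pieces.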
Colonel Beauvel answered Sep 30 '22 08:09