Splitting a string by more than one space

Tags: r

I am trying to load some data into R that is in the following format (as a text file)

Name                  Country            Age
John,Smith            United Kingdom     20
Washington,George     USA                50
Martin,Joseph         Argentina          43

The problem is that the "columns" are separated by spaces so that they all line up nicely, but one row may have 5 spaces between values and the next 10. So when I load it with read.delim I get a one-column data.frame with

"John,Smith            United Kingdom     20"

as the first observation and so on.

Is there any way I can either:

Load the data into R into a usable format? or
Split the character strings into separate columns once I have loaded the data in the one-column format?

My thought was to split the character strings by spaces, except it would need to be between 2 and x spaces (so, for example, "United Kingdom" stays together and doesn't become "United" "" "Kingdom"). But I don't know if that is possible.

I tried strsplit(data.frame[,1], split="\\s") but it returns a list of character strings like:

"John,Smith" "" "" "" "" "" "" "" "United" "" "Kingdom" "" ""...

which I don't know what to do with.

asked Mar 14 '16 04:03

moman822

2 Answers

Having columns that all "line up nicely" is a typical characteristic of fixed-width data.

For the sake of this answer, I've written your three lines of data and one line of header information to a temporary file whose path is stored in the variable x. For your actual use, replace x with the file name/path, as you would normally use with read.delim.

Here's the sample data:

x <- tempfile()
cat("Name                  Country            Age\nJohn,Smith            United Kingdom     20\nWashington,George     USA                50\nMartin,Joseph         Argentina          43\n", file = x)

R has its own function for reading fixed-width data (read.fwf), but it is notoriously slow and you need to know the widths before you can get started. We can count those if the file is small, and then use something like:

read.fwf(x, c(22, 18, 4), strip.white = TRUE, skip = 1, 
         col.names = c("Name", "Country", "Age"))
#                Name        Country Age
# 1        John,Smith United Kingdom  20
# 2 Washington,George            USA  50
# 3     Martin,Joseph      Argentina  43
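
If counting by hand is tedious, the widths can also be derived from the header line. The following is only a sketch: it assumes every column name starts at the first character of its field, that names are separated by at least two spaces, and that no name contains a double space. The sample lines are rebuilt with sprintf so the layout is explicit; the 22- and 19-character fields are my assumption about the sample, not something read.fwf reports.

```r
# Sample lines rebuilt with sprintf() so the field widths are explicit.
# The 22- and 19-character fields are an assumption about the sample layout.
lines <- sprintf("%-22s%-19s%s",
                 c("Name", "John,Smith", "Washington,George", "Martin,Joseph"),
                 c("Country", "United Kingdom", "USA", "Argentina"),
                 c("Age", "20", "50", "43"))

# A field starts wherever a non-space character follows two or more spaces.
header <- lines[1]
starts <- c(1L, as.integer(gregexpr("(?<=\\s{2})\\S", header, perl = TRUE)[[1]]))

# Each field ends just before the next one starts;
# the last field runs to the end of the longest line.
ends   <- c(starts[-1] - 1L, max(nchar(lines)))
widths <- ends - starts + 1L
widths
# [1] 22 19  3

# Feed the derived widths to read.fwf, skipping the header line.
df <- read.fwf(textConnection(lines), widths, strip.white = TRUE, skip = 1,
               col.names = c("Name", "Country", "Age"))
```

For a real file, lines would come from readLines on the file path instead of sprintf.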

Alternatively, you can let fwf_empty from the "readr" package guess the column positions for you, and then use read_fwf:

library(readr)
read_fwf(x, fwf_empty(x, col_names = c("Name", "Country", "Age")), skip = 1)
#                Name        Country Age
# 1        John,Smith United Kingdom  20
# 2 Washington,George            USA  50
# 3     Martin,Joseph      Argentina  43

answered Sep 30 '22 06:09

A5C1D2H2I1M1N2O1R2T1

You can do this in base R, provided no value in your columns contains two or more consecutive spaces:

txt = "Name                  Country            Age
John,Smith            United Kingdom     20
Washington,George     USA                50
Martin,Joseph         Argentina          43"

conn = textConnection(txt)
do.call(rbind, lapply(readLines(conn), function(u) strsplit(u,'\\s{2,}')[[1]]))
#     [,1]                [,2]             [,3] 
#[1,] "Name"              "Country"        "Age"
#[2,] "John,Smith"        "United Kingdom" "20" 
#[3,] "Washington,George" "USA"            "50" 
#[4,] "Martin,Joseph"     "Argentina"      "43"
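
The result above is a character matrix with the header as its first row. A possible follow-up, sketched here under the same two-or-more-spaces assumption, is to promote that row to column names and convert the types. The variable names (lines, m, df, df2) are only illustrative, and the sample is rebuilt with sprintf using assumed field widths.

```r
# Sample lines rebuilt with sprintf(); the field widths are an assumption.
lines <- sprintf("%-22s%-19s%s",
                 c("Name", "John,Smith", "Washington,George", "Martin,Joseph"),
                 c("Country", "United Kingdom", "USA", "Argentina"),
                 c("Age", "20", "50", "43"))

# Split every line on runs of two or more whitespace characters.
m <- do.call(rbind, strsplit(lines, "\\s{2,}"))

# Drop the header row from the data and use it as column names instead.
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]

# Everything is character at this point; let R pick sensible types.
df[] <- lapply(df, type.convert, as.is = TRUE)

# Alternative: turn each run of spaces into a tab and let read.delim parse it.
df2 <- read.delim(text = gsub("\\s{2,}", "\t", lines))
```

Both routes rely on every row splitting into the same number of pieces, so an empty cell in the middle of a row would shift the values after it.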

answered Sep 30 '22 08:09

Colonel Beauvel
