I am trying to load some data into R that is in the following format (as a text file)
Name Country Age
John,Smith United Kingdom 20
Washington,George USA 50
Martin,Joseph Argentina 43
The problem I have is that the "columns" are separated by spaces such that they all line up nicely, but one row may have 5 spaces between values and the next 10 spaces. So when I load it in using read.delim
I get a one column data.frame with
"John,Smith United Kingdom 20"
as the first observation and so on.
Is there any way I can either:
My thought was to split the character strings by spaces, except it would need to be between 2 and x spaces (so, for example, "United Kingdom"
stays together and doesn't become "United" "" "Kingdom"
). But I don't know if that is possible.
I tried strsplit(data.frame[,1], sep="\\s")
but it returns a list of character strings like:
"John,Smith" "" "" "" "" "" "" "" "United" "" "Kingdom" "" ""...
which I don't know what to do with.
We used the str. split() method to split a string by one or more spaces. The str. split() method splits the string into a list of substrings using a delimiter.
Use the String. split() method to split a string with multiple separators, e.g. str. split(/[-_]+/) . The split method can be passed a regular expression containing multiple characters to split the string with multiple separators.
Having columns that all "line up nicely" is a typical characteristic of fixed-width data.
For the sake of this answer, I've written your three lines of data and one line of header information to a temporary file called "x". For your actual use, replace "x" with the file name/path, as you would normally use with read.delim
.
Here's the sample data:
x <- tempfile()
cat("Name Country Age\nJohn,Smith United Kingdom 20\nWashington,George USA 50\nMartin,Joseph Argentina 43\n", file = x)
R has it's own function for reading fixed width data (read.fwf
) but it is notoriously slow and you need to know the widths before you can get started. We can count those if the file is small, and then use something like:
read.fwf(x, c(22, 18, 4), strip.white = TRUE, skip = 1,
col.names = c("Name", "Country", "Age"))
# Name Country Age
# 1 John,Smith United Kingdom 20
# 2 Washington,George USA 50
# 3 Martin,Joseph Argentina 43
Alternatively, you can let fwf_widths
from the "readr" package do the guessing of widths for you, and then use read_fwf
:
library(readr)
read_fwf(x, fwf_empty(x, col_names = c("Name", "Country", "Age")), skip = 1)
# Name Country Age
# 1 John,Smith United Kingdom 20
# 2 Washington,George USA 50
# 3 Martin,Joseph Argentina 43
You can do base R
, supposing your columns do not contain words with more than 1 space:
txt = "Name Country Age
John,Smith United Kingdom 20
Washington,George USA 50
Martin,Joseph Argentina 43"
conn = textConnection(txt)
do.call(rbind, lapply(readLines(conn), function(u) strsplit(u,'\\s{2,}')[[1]]))
# [,1] [,2] [,3]
#[1,] "Name" "Country" "Age"
#[2,] "John,Smith" "United Kingdom" "20"
#[3,] "Washington,George" "USA" "50"
#[4,] "Martin,Joseph" "Argentina" "43"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With