Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Import raw data into R

please anyone can help me to import this data into R from a text or dat file. It has space delimited, but cities names should not considered as two names. Like NEW YORK.

1 NEW YORK  7,262,700
2 LOS ANGELES  3,259,340
3 CHICAGO  3,009,530
4 HOUSTON  1,728,910
5 PHILADELPHIA  1,642,900
6 DETROIT  1,086,220
7 SAN DIEGO  1,015,190
8 DALLAS  1,003,520
9 SAN ANTONIO  914,350
10 PHOENIX  894,070
like image 958
Mike Avatar asked Jun 14 '26 11:06

Mike


2 Answers

For your particular data frame, where true spaces only occur between capital letters, consider using a regular expression:

gsub("(*[A-Z]) ([A-Z]+)", "\\1-\\2", "1 NEW YORK  7,262,700")
# [1] "1 NEW-YORK 7,262,700"
gsub("(*[A-Z]) ([A-Z]+)", "\\1-\\2", "3 CHICAGO  3,009,530")
# [1] "3 CHICAGO  3,009,530"

You can then interpret spaces as field separators.

like image 161
Hugh Avatar answered Jun 17 '26 01:06

Hugh


A variation on a theme... but first, some sample data:

cat("1 NEW YORK  7,262,700",
    "2 LOS ANGELES  3,259,340",
    "3 CHICAGO  3,009,530",
    "4 HOUSTON  1,728,910",
    "5 PHILADELPHIA  1,642,900",
    "6 DETROIT  1,086,220",
    "7 SAN DIEGO  1,015,190",
    "8 DALLAS  1,003,520",
    "9 SAN ANTONIO  914,350",
    "10 PHOENIX  894,070", sep = "\n", file = "test.txt")

Step 1: Read the data in with readLines

x <- readLines("test.txt")

Step 2: Figure out a regular expression that you can use to insert delimiters. Here, the pattern seems to be (looking from the end of the lines) a set of numbers and commas preceded by space preceded by some words in ALL CAPS. We can capture those groups and insert some "tab" delimiters (\t). The extra slashes are to properly escape them.

gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x)
#  [1] "1\t NEW YORK  \t7,262,700"     "2\t LOS ANGELES  \t3,259,340" 
#  [3] "3\t CHICAGO  \t3,009,530"      "4\t HOUSTON  \t1,728,910"     
#  [5] "5\t PHILADELPHIA  \t1,642,900" "6\t DETROIT  \t1,086,220"     
#  [7] "7\t SAN DIEGO  \t1,015,190"    "8\t DALLAS  \t1,003,520"      
#  [9] "9\t SAN ANTONIO  \t914,350"    "10\t PHOENIX  \t894,070"  

Step 3: Since we know our gsub is working, and we know that read.delim has a "text" argument that can be used instead of a "file" argument, we can use read.delim directly on the result of gsub:

out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x), 
                  header = FALSE, strip.white = TRUE)
out
#    V1           V2        V3
# 1   1     NEW YORK 7,262,700
# 2   2  LOS ANGELES 3,259,340
# 3   3      CHICAGO 3,009,530
# 4   4      HOUSTON 1,728,910
# 5   5 PHILADELPHIA 1,642,900
# 6   6      DETROIT 1,086,220
# 7   7    SAN DIEGO 1,015,190
# 8   8       DALLAS 1,003,520
# 9   9  SAN ANTONIO   914,350
# 10 10      PHOENIX   894,070

One possible last step would be to convert the third column to numeric:

out$V3 <- as.numeric(gsub(",", "", out$V3))
like image 23
A5C1D2H2I1M1N2O1R2T1 Avatar answered Jun 17 '26 00:06

A5C1D2H2I1M1N2O1R2T1



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!