I don't know if this question has been asked before so I wanted to go ahead and post it. I have a data frame which looks like this:
YEAR CountyCode C01 C02 C03 C04 C05 C06 . . .
2012 C01 132 0 3 1 4 9
2012 C02 0 21 0 57 0 0
2012 C03 0 0 56 0 1 1
2012 C04 0 6 0 40 0 26
2012 C06 0 0 1 0 0 0
2012 C07 0 0 1 32 219 6
. .
. .
. .
2013 C01 220 9 2 4 0 0
2013 C02 0 54 0 62 0 2
2013 C03 0 0 24 0 1 1
2013 C04 0 6 0 224 0 2
2013 C06 0 0 2 0 0 0
2013 C07 0 0 1 37 2 3
And so on for 5 years of data. Each year is associated with 75 columns with entries "C0x" associated with a specific "C0x" under county code. Basically its sort of a payer matrix.
All the rows under CountyCode are correct (there are about 65 correct ones) while the vertical columns are not all correct. There are very specific county codes (for example, 1,2,3,4,6 and so on).
So as you can see C05 doesn't match a C05 because it doesn't actually exist, and there is a break on the diagonal. Because the table format was made so that anyone can use it, the person who entered in the values obviously made major errors.
The only goal is to find all values on the diagonal where the columns are incorrect and break the diagonal. I want to weed out these. for example., under C05 (column) and row C06 (under County Code) is zero for 2012-2016. The output can just be a simple list like this : [1] C05,C40, ...
so I can find these outliers and better position them.
The point is to find the wrong counties in the columns.
PS: There are about 65 (rows) x 75 (columns) for each year and while I can try doing it manually, I would like to code it up so that it can be used further.
Thank you!
A possible solution using the data.table
-package:
library(data.table)
setDT(df)
df[, .(cols_not_in_rows = setdiff(names(.SD)[-1], CountyCode),
rows_not_in_cols = setdiff(CountyCode, names(.SD)[-1]))
, by = YEAR]
which gives:
YEAR cols_not_in_rows rows_not_in_cols 1: 2012 C05 C07 2: 2013 C05 C07
Used data:
df <- structure(list(YEAR = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L),
CountyCode = c("C01", "C02", "C03", "C04", "C06", "C07", "C01", "C02", "C03", "C04", "C06", "C07"),
C01 = c(132L, 0L, 0L, 0L, 0L, 0L, 220L, 0L, 0L, 0L, 0L, 0L),
C02 = c(0L, 21L, 0L, 6L, 0L, 0L, 9L, 54L, 0L, 6L, 0L, 0L),
C03 = c(3L, 0L, 56L, 0L, 1L, 1L, 2L, 0L, 24L, 0L, 2L, 1L),
C04 = c(1L, 57L, 0L, 40L, 0L, 32L, 4L, 62L, 0L, 224L, 0L, 37L),
C05 = c(4L, 0L, 1L, 0L, 0L, 219L, 0L, 0L, 1L, 0L, 0L, 2L),
C06 = c(9L, 0L, 1L, 26L, 0L, 6L, 0L, 2L, 1L, 2L, 0L, 3L)),
.Names = c("YEAR", "CountyCode", "C01", "C02", "C03", "C04", "C05", "C06"), class = "data.frame", row.names = c(NA, -12L))
In base R, to see which columns are not in df$CountyCode
, you can do
names(df[,c(-1,-2)])[!(names(df[,c(-1,-2)]) %in% df$CountyCode)]
If you just want to retain the columns that are in df$CountyCode
perhaps you can do something like
library(tidyverse)
correct <- names(df[,c(-1,-2)])[names(df[,c(-1,-2)]) %in% df$CountyCode]
df %>% select(1,2, correct)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With