Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find all the zeroes over a diagonal of a data frame

Tags:

r

dplyr

I don't know if this question has been asked before so I wanted to go ahead and post it. I have a data frame which looks like this:

YEAR    CountyCode         C01   C02 C03 C04 C05 C06 . . .  
2012    C01                132    0   3   1   4   9 
2012    C02                 0    21   0   57  0   0 
2012    C03                 0    0   56   0  1    1
2012    C04                 0    6    0  40 0    26 
2012    C06                 0    0    1   0  0    0 
2012    C07                 0    0    1  32  219  6
.        .      
.        .
.        .
2013    C01                220    9   2   4   0   0 
2013    C02                 0    54   0   62  0   2 
2013    C03                 0    0   24   0   1   1
2013    C04                 0    6    0  224  0   2 
2013    C06                 0    0    2   0   0   0 
2013    C07                 0    0    1  37   2   3

And so on for 5 years of data. Each year is associated with 75 columns with entries "C0x" associated with a specific "C0x" under county code. Basically its sort of a payer matrix.

All the rows under CountyCode are correct (there are about 65 correct ones) while the vertical columns are not all correct. There are very specific county codes (for example, 1,2,3,4,6 and so on).

So as you can see C05 doesn't match a C05 because it doesn't actually exist, and there is a break on the diagonal. Because the table format was made so that anyone can use it, the person who entered in the values obviously made major errors.

The only goal is to find all values on the diagonal where the columns are incorrect and break the diagonal. I want to weed out these. for example., under C05 (column) and row C06 (under County Code) is zero for 2012-2016. The output can just be a simple list like this : [1] C05,C40, ... so I can find these outliers and better position them.

The point is to find the wrong counties in the columns.

PS: There are about 65 (rows) x 75 (columns) for each year and while I can try doing it manually, I would like to code it up so that it can be used further.

Thank you!

like image 522
Anurag Kaushik Avatar asked Mar 06 '23 04:03

Anurag Kaushik


2 Answers

A possible solution using the data.table-package:

library(data.table)
setDT(df)

df[, .(cols_not_in_rows = setdiff(names(.SD)[-1], CountyCode),
       rows_not_in_cols = setdiff(CountyCode, names(.SD)[-1]))
   , by = YEAR]

which gives:

   YEAR cols_not_in_rows rows_not_in_cols
1: 2012              C05              C07
2: 2013              C05              C07

Used data:

df <- structure(list(YEAR = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L),
                     CountyCode = c("C01", "C02", "C03", "C04", "C06", "C07", "C01", "C02", "C03", "C04", "C06", "C07"),
                     C01 = c(132L, 0L, 0L, 0L, 0L, 0L, 220L, 0L, 0L, 0L, 0L, 0L),
                     C02 = c(0L, 21L, 0L, 6L, 0L, 0L, 9L, 54L, 0L, 6L, 0L, 0L),
                     C03 = c(3L, 0L, 56L, 0L, 1L, 1L, 2L, 0L, 24L, 0L, 2L, 1L),
                     C04 = c(1L, 57L, 0L, 40L, 0L, 32L, 4L, 62L, 0L, 224L, 0L, 37L),
                     C05 = c(4L, 0L, 1L, 0L, 0L, 219L, 0L, 0L, 1L, 0L, 0L, 2L),
                     C06 = c(9L, 0L, 1L, 26L, 0L, 6L, 0L, 2L, 1L, 2L, 0L, 3L)),
                .Names = c("YEAR", "CountyCode", "C01", "C02", "C03", "C04", "C05", "C06"), class = "data.frame", row.names = c(NA, -12L))
like image 199
Jaap Avatar answered Mar 13 '23 05:03

Jaap


In base R, to see which columns are not in df$CountyCode, you can do

names(df[,c(-1,-2)])[!(names(df[,c(-1,-2)]) %in% df$CountyCode)]

If you just want to retain the columns that are in df$CountyCode perhaps you can do something like

library(tidyverse)
correct <- names(df[,c(-1,-2)])[names(df[,c(-1,-2)]) %in% df$CountyCode]
df %>% select(1,2, correct)
like image 35
hpesoj626 Avatar answered Mar 13 '23 03:03

hpesoj626