
Using Histogram as input in R

Tags: r, input, histogram

This is admittedly a very simple question that I just can't find an answer to.

In R, I have a file with two columns: one of category names and one of counts (the count for each category). With a small dataset, I would use the 'reshape' package and its 'untable' function to expand this into a single column and do the analysis that way. The question is how to handle this with a large dataset.

In this case, my data is humungous and that just isn't going to work.

My question is, how do I tell R to use something like the following as distribution data:

Cat Count
A   5
B   7
C   1

That is, I give it a histogram as an input and have R figure out that it means there are 5 of A, 7 of B and 1 of C when calculating other information about the data.

In other words, I want R to treat the input above as equivalent to the following data:

A A A A A B B B B B B B C

With reasonably sized data I can do this myself, but what do you do when the data is very large?

Edit

The total sum of all the counts is 262,916,849.

In terms of what it would be used for:

This is new data, and I am trying to understand the correlation between it and other pieces of data. I need to work on linear regressions and mixed models.

Lillian Milagros Carrasquillo asked Dec 11 '22 22:12



2 Answers

I think what you're asking is to reshape a data frame of categories and counts into a single vector of observations, where categories are repeated. Here's one way:

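# stringsAsFactors=TRUE keeps Cat a factor under R >= 4.0, matching the output below
dat <- data.frame(Cat=LETTERS[1:3],Count=c(5,7,1),stringsAsFactors=TRUE)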
#  Cat Count
#1   A     5
#2   B     7
#3   C     1
rep.int(dat$Cat,times=dat$Count)
# [1] A A A A A B B B B B B B C
#Levels: A B C
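
As a quick sanity check on the expansion, tabulating the expanded vector should give back the original counts. This is only a minimal sketch: it uses `rep` rather than `rep.int`, and the names `expanded` and `counts_back` are made up for illustration.

```r
# Rebuild the small example; stringsAsFactors = TRUE so Cat is a factor
dat <- data.frame(Cat = LETTERS[1:3], Count = c(5, 7, 1),
                  stringsAsFactors = TRUE)

# Expand the histogram: each category is repeated Count times
expanded <- rep(dat$Cat, times = dat$Count)

# Tabulating the expanded vector should reproduce the Count column
counts_back <- table(expanded)
print(counts_back)
```

If the tabulated counts differ from `dat$Count`, the category and count columns were probably misaligned before the expansion.
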
Blue Magister answered Jan 05 '23 19:01



To follow up on @Blue Magister's excellent answer, here's a 100,000 row histogram with a total count of 551,245,193:

set.seed(42)
Cat <- sapply(rep(10, 100000), function(x) {
  paste(sample(LETTERS, x, replace=TRUE), collapse='')
  })
dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
> head(dat)
         Cat Count
1 XYHVQNTDRS  5154
2 LSYGMYZDMO  4724
3 XDZYCNKXLV  8691
4 TVKRAVAFXP  2429
5 JLAZLYXQZQ  5704
6 IJKUBTREGN  4635

This is a pretty big dataset by my standards, and the operation Blue Magister describes is very quick:

> system.time(x <- rep(dat$Cat,times=dat$Count))
   user  system elapsed 
   4.48    1.95    6.42 

It uses about 6GB of RAM to complete the operation.
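
Before expanding, it may help to estimate how large the result will be. The sketch below assumes the expanded vector is a factor, which R stores as 4-byte integer codes; a character vector would need 8 bytes per element on a 64-bit build. The helper name `estimate_expanded_gib` is hypothetical, and the figure covers only the finished vector, not any temporary copies R makes along the way, so peak usage can be noticeably higher.

```r
# Rough size in GiB of the expanded vector, computed from the counts alone.
# Assumption: 4 bytes per element (integer codes of a factor).
estimate_expanded_gib <- function(counts, bytes_per_element = 4) {
  # as.numeric avoids integer overflow when the counts sum past 2^31 - 1
  sum(as.numeric(counts)) * bytes_per_element / 1024^3
}

# Total count quoted in the question's edit
question_gib <- estimate_expanded_gib(262916849)

# Total count quoted for the 100,000 row example in this answer
example_gib <- estimate_expanded_gib(551245193)

print(c(question = question_gib, example = example_gib))
```

Passing the whole `Count` column, as in `estimate_expanded_gib(dat$Count)`, gives the same kind of estimate for any histogram.
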

Zach answered Jan 05 '23 18:01
