I have been using R for a little while, but I am still struggling with factors and data frames. Here's my question.
I am trying to pre-allocate a data frame composed of several columns of different types, as follows:
cb <- data.frame(S=character(1000), I=numeric(1000), A=as.Date(rep(0,1000), origin = "1900-01-01"), SD=as.POSIXct(rep(0,1000), origin = "1900-01-01 00:00:00"), CC=numeric(1000), stringsAsFactors=FALSE)
which gets me the column types that I want (output of str(cb)):
'data.frame': 1000 obs. of 5 variables:
$ S : chr "" "" "" "" ...
$ I : num 0 0 0 0 0 0 0 0 0 0 ...
$ A : Date, format: "1900-01-01" "1900-01-01" "1900-01-01" "1900-01-01" ...
$ SD: POSIXct, format: "1900-01-01" "1900-01-01" "1900-01-01" "1900-01-01" ...
$ CC: num 0 0 0 0 0 0 0 0 0 0 ...
When I assign the first item in the data frame, CC and I become characters:
cb[1, ] <- c("ABCD", 4, "2005-12-12", "2008-04-03 20:30", 3)
output of str(cb):
'data.frame': 1000 obs. of 5 variables:
$ S : chr "ABCD" "" "" "" ...
$ I : chr "4" "0" "0" "0" ...
$ A : Date, format: "2005-12-12" "1900-01-01" "1900-01-01" "1900-01-01" ...
$ SD: POSIXct, format: "2008-04-03 20:30:00" "1900-01-01 00:00:00" "1900-01-01 00:00:00" "1900-01-01 00:00:00" ...
$ CC: chr "3" "0" "0" "0" ...
which makes it rather unusable for my purposes.
When I omit stringsAsFactors=FALSE in the data.frame definition, I (obviously) get a different error message (having set warn to 2):
Error in `[<-.factor`(`*tmp*`, iseq, value = "ABCD") :
(converted from warning) invalid factor level, NAs generated
which I understand but I am not sure how to overcome either.
What am I doing wrong? How can I make sure to keep the numeric type for columns I and CC? Thanks so much for your help.
Cheers
B
You can't mix types in a vector, so your vector is being coerced to character.
R> c("ABCD", 4, "2005-12-12", "2008-04-03 20:30", 3)
[1] "ABCD" "4"
[3] "2005-12-12" "2008-04-03 20:30"
[5] "3"
[<-.data.frame then coerces the numeric columns of your data.frame to character, so the column will be one type; though I find it a bit inconsistent that it doesn't also convert the Date/POSIXt fields to character as well...
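As a minimal sketch of that coercion, here is the same effect on a smaller data frame. The name `small` and its two columns are made up for illustration and are not part of the original question.

```r
# Hypothetical 3-row frame with one character and one numeric column
small <- data.frame(S = character(3), I = numeric(3), stringsAsFactors = FALSE)

# c() can only hold one type, so the number 4 becomes the string "4"
row <- c("ABCD", 4)
class(row)        # "character"

# Assigning the character vector drags the numeric column along with it
small[1, ] <- row
class(small$I)    # "character"
```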
You can mix types in a list. This replacement works because data.frames are lists underneath.
cb[1, ] <- list("ABCD", 4, "2005-12-12", "2008-04-03 20:30", 3)
When you look back at your code later, it might make more sense to replace one row of your data.frame with a 1-row data.frame:
cb[1, ] <- data.frame("ABCD", 4, "2005-12-12", "2008-04-03 20:30", 3,
stringsAsFactors=FALSE)
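To check that a list replacement keeps the column types, something along these lines should do. It is only a sketch on a reduced, hypothetical frame (`small`, three rows, three columns), and it converts the date explicitly with as.Date() rather than relying on the string being converted during assignment.

```r
# Hypothetical reduced version of the pre-allocated frame
small <- data.frame(S = character(3),
                    I = numeric(3),
                    A = as.Date(rep(0, 3), origin = "1900-01-01"),
                    stringsAsFactors = FALSE)

# A list keeps each element's own type, one element per column
small[1, ] <- list("ABCD", 4, as.Date("2005-12-12"))

# Inspect the class of every column after the assignment
sapply(small, class)
```

The numeric column should still report as numeric and the date column as Date, because each list element is assigned to its own column without passing through a single atomic vector first.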