I am trying to build a DataFrame in C++. I'm facing some problems, such as dealing with variable data types.
I am thinking of a DataFrame inspired by the Pandas DataFrame (from Python). So my design idea is:
Item 1 is just a regular vector. So, for instance, the user would call
Series.fill({1,2,3,4}) and it would store the vector {1,2,3,4} in some attribute of Series, say Series.data.
Problem 1. How would I make a class that understands {1,2,3,4} as a vector of 4 integers? Is it possible?
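One way to accept a braced list such as {1,2,3,4} is a member function that takes a std::initializer_list. The sketch below assumes a class template named Series with a public data member, following the wording above; the element type is fixed by the template argument, so this is only an illustration of the idea, not a complete design.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical Series class; the name and members follow the question's wording.
template <typename T>
class Series {
public:
    // Accepts a braced list such as {1, 2, 3, 4} and stores it as a vector.
    void fill(std::initializer_list<T> values) { data.assign(values); }

    std::size_t size() const { return data.size(); }

    std::vector<T> data;
};

// Small usage example: fill a Series<int> and read one element back.
inline int series_example() {
    Series<int> s;
    s.fill({1, 2, 3, 4});
    assert(s.size() == 4);
    return s.data[2];  // the third element of the stored vector
}
```

Because T is chosen when the Series is declared (Series<int> here), the braced list is checked against that type at compile time.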
The next problem is:
As for item 2, I can see the DataFrame as a matrix of n columns and m rows, but the columns can have different data types.
I tried to design this as a vector of n pointers, where each pointer would point to a vector of dimension m with its own data type.
I tried to do something like
vector<void*> columns(10)
and fill it with something like
columns[0] = (int*) malloc(8*sizeof(int))
But this does not work. If I try to fill the vector, like
(*columns[0])[0] = 5;
I get an error
::value_type {aka void*}’ is not a pointer-to-object type
(int *) (*a[0])[0] = 5;
How can I do it properly? I still have other questions, such as how to append an undetermined number of Series to a DataFrame, but for now, just building a matrix whose columns have different data types would be a great start.
I know that I must keep track of the types of the pointers inside my void* vector, but I can create a parallel list with all the data types and make it an attribute of my DataFrame class.
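The error above arises because a void* cannot be dereferenced or indexed; it has to be cast back to a typed pointer (for example int*) first. A type-safe alternative to a void* vector plus a parallel list of types is std::variant (C++17), which stores the type tag alongside the data. The sketch below is only one possible design: the DataFrame struct, its member names, and the supported element types (int, double, std::string) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// A column is one of a fixed set of vector types (this set is an assumption).
using Column = std::variant<std::vector<int>,
                            std::vector<double>,
                            std::vector<std::string>>;

// Hypothetical minimal DataFrame: just a list of columns.
struct DataFrame {
    std::vector<Column> columns;

    // Append a column of any supported element type.
    template <typename T>
    void add_column(std::vector<T> values) {
        columns.emplace_back(std::move(values));
    }

    // Typed access; throws std::bad_variant_access if T does not match
    // the stored type, and std::out_of_range if i is not a valid index.
    template <typename T>
    std::vector<T>& column(std::size_t i) {
        return std::get<std::vector<T>>(columns.at(i));
    }
};
```

With this, df.add_column(std::vector<int>(8, 0)) followed by df.column<int>(0)[0] = 5 plays the role of the malloc and assignment above. Since columns is an ordinary std::vector, any number of columns can be appended at run time.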
Building a heterogeneous container (which a dataframe is supposed to be) in C++ is more complex than you might think, because C++ is statically typed: all types must be known at compile time. Your approach uses a vector of pointers (there are a few variations of this approach, which I am not going into). It is very inefficient, because the pointers point all over memory and destroy cache locality. I do not recommend even attempting to implement a dataframe this way, because there is really no point to it.
Look at this implementation of DataFrame in C++: https://github.com/hosseinmoein/DataFrame. You might be able to use it as is, or get insight from it into how to implement a true heterogeneous DataFrame. It uses a collection of static vectors in a hash table to implement a true heterogeneous container. It also uses contiguous memory, so it avoids the pointer-chasing problem.
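To give a rough idea of what "static vectors in a hash table" can mean, here is a minimal sketch of that general pattern. It is not the linked library's actual code; the class name HeteroColumns and its interface are hypothetical, and it keeps only one vector per element type per container.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical container illustrating the pattern; not the library's code.
class HeteroColumns {
public:
    // Returns the contiguous vector of T that belongs to this container,
    // creating an empty one on first use.
    template <typename T>
    std::vector<T>& get() {
        return storage<T>()[static_cast<const HeteroColumns*>(this)];
    }

private:
    // Function-local static: one hash table per element type T,
    // keyed by the address of the owning container.
    template <typename T>
    static std::unordered_map<const HeteroColumns*, std::vector<T>>& storage() {
        static std::unordered_map<const HeteroColumns*, std::vector<T>> table;
        return table;
    }
};
```

Each call such as c.get<int>() returns an ordinary std::vector<int>, so the elements of a column sit next to each other in memory. The sketch deliberately leaves out cleanup on destruction and copy handling, which a real implementation would need.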