
Convert Python xgboost DMatrix to numpy ndarray or pandas DataFrame

I'm following an xgboost example from their main git repository: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/basic_walkthrough.py#L64

In this example the data files are read directly into a DMatrix:

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

I looked at the DMatrix code, and there seems to be no way to take a quick look at how the data is structured, as we normally do in pandas with pandas.DataFrame.head().

The xgboost documentation mentions that we can convert a numpy.ndarray to an xgboost.DMatrix. Can we somehow convert it back, from xgboost.DMatrix to numpy.ndarray or perhaps to a pandas DataFrame? I don't see a way to do this in their code, but perhaps someone knows one?

Or is there a way to take a quick look at what the data in an xgboost.DMatrix looks like?

Thanks in advance, Howard

asked May 18 '16 20:05 by howard



2 Answers

To elaborate on @jcaine's answer, you can use scikit-learn to load the files and then convert them to ordinary numpy arrays:

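from sklearn.datasets import load_svmlight_file

# load_svmlight_file returns (X, y): a scipy.sparse CSR matrix and a numpy array of labels
train_data = load_svmlight_file('demo/data/agaricus.txt.train')
X = train_data[0].toarray()  # densify the sparse features into a numpy.ndarray
y = train_data[1]            # labels, already a numpy.ndarray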

I haven't found a way to convert directly from a DMatrix to numpy arrays yet.
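Since the question asks for something like pandas.DataFrame.head(), here is a minimal sketch that takes the same route one step further and wraps the loaded arrays in a DataFrame. The helper name svmlight_to_dataframe, the 'label' column name, and the tiny temporary file are my own assumptions for illustration; they are not part of xgboost or of the demo data. Note that this reads the file again rather than converting an existing DMatrix, and that toarray() is only sensible for data that fits in memory as a dense array.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_svmlight_file


def svmlight_to_dataframe(path):
    """Load a libsvm/svmlight file into a dense DataFrame (hypothetical helper)."""
    X, y = load_svmlight_file(path)   # X: scipy.sparse CSR matrix, y: numpy array
    df = pd.DataFrame(X.toarray())    # densify; columns are numbered 0..n_features-1
    df.insert(0, "label", y)          # put the response variable in the first column
    return df


# Tiny made-up file so the sketch runs without the xgboost demo data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w") as fh:
        fh.write("1 1:1 3:1\n0 2:1 3:0.5\n")
    df = svmlight_to_dataframe(demo_path)

print(df.head())
```

For the demo data you would pass the path to agaricus.txt.train instead of the temporary file.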

answered Sep 24 '22 12:09 by Peter


Howard,

I believe that xgb.DMatrix assumes the libsvm data format when it loads from a text file. You can get this data into a sparse CSR matrix using scikit-learn's load_svmlight_file: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

You can then separate the response variable from the features, following the example at the bottom of that page.
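To make the format concrete, here is a rough sketch of what a libsvm-style file contains and how the response variable and the features separate. It assumes each line has the form "label index:value index:value ..." with integer indices; the function parse_libsvm_lines and the two sample lines are made up for illustration, and load_svmlight_file remains the practical choice for real files.

```python
from typing import Dict, Iterable, List, Tuple


def parse_libsvm_lines(lines: Iterable[str]) -> Tuple[List[float], List[Dict[int, float]]]:
    """Split libsvm-style text into labels and sparse feature rows.

    Hypothetical helper for illustration only; it does not handle qid fields.
    """
    labels: List[float] = []
    rows: List[Dict[int, float]] = []
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        label, *pairs = line.split()
        labels.append(float(label))        # first token is the response variable
        row: Dict[int, float] = {}
        for pair in pairs:
            index, value = pair.split(":", 1)
            row[int(index)] = float(value)  # only non-zero features are stored
        rows.append(row)
    return labels, rows


# Made-up sample lines in the assumed format.
sample = [
    "1 3:1 10:1 11:1",
    "0 3:1 9:1 20:0.5",
]
labels, rows = parse_libsvm_lines(sample)
print(labels)
print(rows[0])
```

Each row keeps only the features that are present, which is why the loaded matrix is sparse and has to be densified before it can be inspected like a regular array.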

answered Sep 25 '22 12:09 by jcaine