I'm following an XGBoost example from their main repository: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/basic_walkthrough.py#L64
In this example, they read the data files directly into a DMatrix:
dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')
I looked at the DMatrix code, and there seems to be no way to take a quick look at how the data is structured, as we normally do in pandas with pandas.DataFrame.head().
The XGBoost documentation mentions that we can convert a numpy.ndarray to an xgboost.DMatrix. Can we somehow convert it back, from xgboost.DMatrix to numpy.ndarray, or perhaps to a pandas DataFrame? I don't see a way to do this in their code, but perhaps someone knows one?
Or is there a way to take a quick look at what the data looks like inside an xgboost.DMatrix?
Thanks in advance, Howard
To train on a dataset held in a DMatrix, use XGBoost's train() function. It takes two required arguments: the parameters and the training DMatrix. The resulting model can then be used to predict on a validation set.
Data Matrix used in XGBoost. DMatrix is an internal data structure that is used by XGBoost, which is optimized for both memory efficiency and training speed. You can construct DMatrix from multiple different sources of data.
How do you convert an array to a DataFrame in Python? To convert a NumPy array to a DataFrame, you need to 1) have your NumPy array (e.g., np_array), and 2) pass it to the pd.DataFrame() constructor like this: df = pd.DataFrame(np_array, columns=['Column1', 'Column2']).
To elaborate on @jcaine's answer, you can use sklearn to load the files, then convert them to ordinary numpy arrays:
from sklearn.datasets import load_svmlight_file
train_data = load_svmlight_file('demo/data/agaricus.txt.train')
X = train_data[0].toarray()
y = train_data[1]
I haven't found a way to convert directly from a DMatrix to numpy arrays yet.
Howard,
I believe that xgb.DMatrix assumes the libsvm data format when it is given a file path. You can load this data into a sparse CSR matrix using scikit-learn's load_svmlight_file: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html.
You can then partition the response variable and the features using the example at the bottom of the page.
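If all you want is a quick head()-style look at the file before it goes into a DMatrix, a few lines of plain Python may be enough. This is only a minimal sketch, assuming the plain libsvm layout of `label index:value index:value ...` on each line, with no `qid:` tokens. The helper names `parse_libsvm_line` and `libsvm_head` and the sample rows are made up for illustration; they are not part of XGBoost or scikit-learn.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value}).

    Returns None for blank or comment-only lines.
    Assumes the plain libsvm layout (no 'qid:' tokens).
    """
    line = line.split("#", 1)[0]  # drop a trailing comment, if any
    tokens = line.split()
    if not tokens:
        return None
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def libsvm_head(lines, n=5):
    """Return the first n parsed rows from an iterable of libsvm lines."""
    rows = []
    for line in lines:
        if len(rows) >= n:
            break
        parsed = parse_libsvm_line(line)
        if parsed is not None:
            rows.append(parsed)
    return rows


# Hypothetical sample rows in libsvm format, used here instead of a real file.
sample = ["1 3:1 10:1 11:1", "0 3:1 9:1 19:1", "0 1:1 10:1 19:1"]
for label, features in libsvm_head(sample, n=2):
    print(label, features)
```

Because a file object is an iterable of lines, the same helper should work on a real file, for example `with open('../data/agaricus.txt.train') as f: rows = libsvm_head(f)`. Only the first few lines are read, so this stays cheap even for large files.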