
SageMaker KMeans built-in algorithm: list of CSV files as input

I want to use the SageMaker KMeans built-in algorithm in one of my applications. I have a large CSV file in S3 (raw data) that I split into several parts to make it easier to clean. Before cleaning the data, I tried to use the parts as the input to KMeans for a training job, but it doesn't work.

My manifest file:

[
    {"prefix": "s3://<BUCKET_NAME>/kmeans_data/KMeans-2019-28-07-13-40-00-001/"}, 
    "file1.csv", 
    "file2.csv"
]
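
For reference, a manifest of this shape (one "prefix" entry followed by keys relative to that prefix) can be generated rather than written by hand. This is only a minimal sketch: the helper name build_manifest is made up for illustration, and the bucket placeholder is kept exactly as in the manifest above.

```python
import json


def build_manifest(prefix, keys):
    """Return a manifest list: a prefix entry followed by relative object keys.

    build_manifest is a hypothetical helper, not part of any SageMaker library.
    """
    # The prefix and each key are concatenated, so make sure the prefix ends with '/'.
    if not prefix.endswith("/"):
        prefix += "/"
    return [{"prefix": prefix}] + list(keys)


manifest = build_manifest(
    "s3://<BUCKET_NAME>/kmeans_data/KMeans-2019-28-07-13-40-00-001",
    ["file1.csv", "file2.csv"],
)

# Serialize to JSON; this string is what would be uploaded as the manifest file.
manifest_json = json.dumps(manifest, indent=4)
print(manifest_json)
```

Uploading the resulting string to S3 is left out here, since it depends on how your credentials and bucket are set up.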

The error I got:

Failure reason: ClientError: Unable to read data channel 'train'. Requested content-type is 'application/x-recordio-protobuf'. Please verify the data matches the requested content-type. (caused by MXNetError) Caused by: [16:47:31] /opt/brazil-pkg-cache/packages/AIAlgorithmsCppLibs/AIAlgorithmsCppLibs-2.0.1620.0/AL2012/generic-flavor/src/src/aialgs/io/iterator_base.cpp:100: (Input Error) The header of the MXNet RecordIO record at position 0 in the dataset does not start with a valid magic number. Stack trace returned 10 entries: [bt] (0) /opt/amazon/lib/libaialgs.so(+0xb1f0) [0x7fb5674c31f0] [bt] (1) /opt/amazon/lib/libaialgs.so(+0xb54a) [0x7fb5674c354a] [bt] (2) /opt/amazon/lib/libaialgs.so(aialgs::iterator_base::Next()+0x4a6) [0x7fb5674cc436] [bt] (3) /opt/amazon/lib/libmxnet.so(MXDataIterNext+0x21) [0x7fb54ecbcdb1] [bt] (4) /opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7fb567a1e858] [bt] (5) /opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call+0x15f) [0x7fb567a1d95f

My question is: is it possible to use multiple CSV files as input to the SageMaker KMeans built-in algorithm using only the GUI? If so, how should I format my manifest?

bcosta12 asked Dec 09 '25 09:12


1 Answer

The manifest looks fine, but based on the error message, it looks like you haven't set the right data format for your S3 data. The training job is expecting protobuf, which is the default format :)

You have to set the CSV data format explicitly. See https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.s3_input.

It should look something like this:

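import sagemaker

# Point the 'train' channel at the manifest and declare the content as CSV.
# Note: in version 2 of the SageMaker Python SDK, s3_input was renamed to
# sagemaker.inputs.TrainingInput.
s3_input_train = sagemaker.s3_input(
  s3_data='s3://{}/{}/train/manifest_file'.format(bucket, prefix),
  s3_data_type='ManifestFile',
  content_type='csv')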

...

kmeans_estimator = sagemaker.estimator.Estimator(kmeans_image, ...)
kmeans_estimator.set_hyperparameters(...)

s3_data = {'train': s3_input_train}
kmeans_estimator.fit(s3_data)
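
If you call the low-level API instead of the SDK, the same channel settings correspond to one entry of InputDataConfig in boto3's create_training_job request. The following is a hedged sketch of that entry only, not a full request: the helper name csv_manifest_channel and the bucket/prefix values are made up for illustration. It uses 'text/csv;label_size=0', which the SageMaker documentation describes for unsupervised CSV training data with no label column; check that it applies to your algorithm version.

```python
def csv_manifest_channel(manifest_uri, channel_name="train",
                         content_type="text/csv;label_size=0"):
    """Build one InputDataConfig entry that reads CSV files listed in a manifest.

    csv_manifest_channel is a hypothetical helper; the dict layout follows the
    CreateTrainingJob request shape.
    """
    return {
        "ChannelName": channel_name,
        "ContentType": content_type,
        "DataSource": {
            "S3DataSource": {
                # 'ManifestFile' tells SageMaker that S3Uri points at a manifest,
                # not at a data prefix.
                "S3DataType": "ManifestFile",
                "S3Uri": manifest_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


bucket = "my-bucket"    # hypothetical bucket name
prefix = "kmeans_data"  # hypothetical key prefix
channel = csv_manifest_channel(
    "s3://{}/{}/train/manifest_file".format(bucket, prefix)
)
print(channel)
```

The resulting dict would be passed as an element of the InputDataConfig list, alongside the algorithm image, role, output location and resource settings that a real training job also needs.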

Please note that the dedicated KMeans estimator in the SDK only supports protobuf, which is why the example above uses the generic Estimator; see https://sagemaker.readthedocs.io/en/stable/kmeans.html

Julien Simon answered Dec 11 '25 17:12


