Sagemaker KMeans Built-In - List of files csv as input

Question

I Want to use Sagemaker KMeans BuilIn Algorithm in one of my applications. I have a large CSV file in S3 (raw data) that I split into several parts to be easy to clean. Before I had cleaned, I tried to use it as the input of Kmeans to perform the training job but It doesn't work.

My manifest file:

[
    {"prefix": "s3://<BUCKET_NAME>/kmeans_data/KMeans-2019-28-07-13-40-00-001/"}, 
    "file1.csv", 
    "file2.csv"
]

The error I've got:

Failure reason: ClientError: Unable to read data channel 'train'. Requested content-type is 'application/x-recordio-protobuf'. Please verify the data matches the requested content-type. (caused by MXNetError) Caused by: [16:47:31] /opt/brazil-pkg-cache/packages/AIAlgorithmsCppLibs/AIAlgorithmsCppLibs-2.0.1620.0/AL2012/generic-flavor/src/src/aialgs/io/iterator_base.cpp:100: (Input Error) The header of the MXNet RecordIO record at position 0 in the dataset does not start with a valid magic number. Stack trace returned 10 entries: [bt] (0) /opt/amazon/lib/libaialgs.so(+0xb1f0) [0x7fb5674c31f0] [bt] (1) /opt/amazon/lib/libaialgs.so(+0xb54a) [0x7fb5674c354a] [bt] (2) /opt/amazon/lib/libaialgs.so(aialgs::iterator_base::Next()+0x4a6) [0x7fb5674cc436] [bt] (3) /opt/amazon/lib/libmxnet.so(MXDataIterNext+0x21) [0x7fb54ecbcdb1] [bt] (4) /opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) [0x7fb567a1e858] [bt] (5) /opt/amazon/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call+0x15f) [0x7fb567a1d95f

My question is: It's possible to use multiple CSV files as input in Sagemaker KMeans BuilIn Algorithm only in GUI? If it's possible, How should I format my manifest?

Julien Simon · Accepted Answer

the manifest looks fine, but based on the error message, it looks like you haven't set the right data format for you S3 data. It's expecting protobuf, which is the default format :)

You have to set the CSV data format explicitly. See https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.s3_input.

It should look something like this:

s3_input_train = sagemaker.s3_input(
  s3_data='s3://{}/{}/train/manifest_file'.format(bucket, prefix),    
  s3_data_type='ManifestFile',
  content_type='csv')

...

kmeans_estimator = sagemaker.estimator.Estimator(kmeans_image, ...)
kmeans_estimator.set_hyperparameters(...)

s3_data = {'train': s3_input_train}
kmeans_estimator.fit(s3_data)

Please note the KMeans estimator in the SDK only supports protobuf, see https://sagemaker.readthedocs.io/en/stable/kmeans.html

Sagemaker KMeans Built-In - List of files csv as input

Tags:

amazon-web-services

amazon-sagemaker

bcosta12

1 Answers

Julien Simon

Recent Activity

Donate For Us

Sagemaker KMeans Built-In - List of files csv as input

Tags:

amazon-web-services

amazon-sagemaker

bcosta12

1 Answers

Julien Simon

Related questions

Recent Activity

Donate For Us