Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

COPY cassandra table from csv file

I'm setting up a demo landscape for Cassandra, Apache Spark and Flume on my Mac (Mac OS X Yosemite with Oracle jdk1.7.0_55). The landscape shall work as a proof of concept for a new analytics platform and therefore I need some test data in my cassandra db. I am using cassandra 2.0.8.

I created some demo data in excel and exported that as a CSV file. The structure is like this:

ProcessUUID;ProcessID;ProcessNumber;ProcessName;ProcessStartTime;ProcessStartTimeUUID;ProcessEndTime;ProcessEndTimeUUID;ProcessStatus;Orderer;VorgangsNummer;VehicleID;FIN;Reference;ReferenceType
0F0D1498-D149-4FCC-87C9-F12783FDF769;AbmeldungKl‰rfall;1;Abmeldung Kl‰rfall;2011-02-03 04:05+0000;;2011-02-17 04:05+0000;;Finished;SIXT;4278;A-XA 1;WAU2345CX67890876;KLA-BR4278;internal

I then created a keyspace and a column family in cqlsh using:

CREATE KEYSPACE dadcargate 
WITH REPLICATAION  = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };

use dadcargate;

CREATE COLUMNFAMILY Process (
  ProcessUUID uuid, ProcessID varchar, ProcessNumber bigint, ProcessName varchar, 
  ProcessStartTime timestamp, ProcessStartTimeUUID timeuuid, ProcessEndTime timestamp, 
  ProcessEndTimeUUID timeuuid, ProcessStatus varchar, Orderer varchar,
  VorgangsNummer varchar, VehicleID varchar, FIN varchar, Reference varchar,
  ReferenceType varchar, 
PRIMARY KEY (ProcessUUID))
WITH COMMENT='A process is like a bracket around multiple process steps';

The column family name and all columns in it are created with all lower case - will have to investigate into this as well some day, but that is not so relevant at the moment.

Now I take my CSV file, which has around 1600 entries and want to import that in my table named process like this:

cqlsh:dadcargate> COPY process (processuuid, processid, processnumber, processname, 
processstarttime, processendtime, processstatus, orderer, vorgangsnummer, vehicleid,
fin, reference, referencetype) 
FROM 'Process_BulkData.csv' WITH DELIMITER = ';' AND HEADER = TRUE;

It gives the following error:

Record #0 (line 1) has the wrong number of fields (15 instead of 13).
0 rows imported in 0.050 seconds.

Which is essentially true, As I do NOT have the timeUUID Fields in my cvs-export.

If I try the COPY command without explicit column-names like this (given the fact, that I actually do miss two fields):

cqlsh:dadcargate> COPY process from 'Process_BulkData.csv' 
WITH DELIMITER = ';' AND HEADER = TRUE;

I end up with another error:

Bad Request: Input length = 1
Aborting import at record #0 (line 1). Previously-inserted values still present.
0 rows imported in 0.009 seconds.

Hm. Kinda strange, but okay. Maybe the COPY command does not like the fact that there are two fields missing. I still think this to be strange, as the missing fields are of course there (from a structural point of view) but only empty.

I still have another shot: I deleted the missing columns in excel, exported the file again as cvs and try to import WITHOUT header line in my csv BUT explicit column names, like this:

cqlsh:dadcargate> COPY process (processuuid, processid, processnumber, processname, 
processstarttime, processendtime, processstatus, orderer, vorgangsnummer, vehicleid, 
fin, reference, referencetype) 
FROM 'Process_BulkData-2.csv' WITH DELIMITER = ';' AND HEADER = TRUE;

I get this error:

Bad Request: Input length = 1
Aborting import at record #0 (line 1). Previously-inserted values still present.
0 rows imported in 0.034 seconds.

Can ANYONE tell me what I'm doing wrong here? According to the documentation of copy-command, the way I setup my commands, should work for at least two of them. Or so I would think.

But nah, I'm obviously missing something important here.

like image 758
siliconchris Avatar asked Feb 21 '15 12:02

siliconchris


People also ask

How do I import a CSV file into Cassandra?

If you have data in a file so, you can directly insert your data into the database by using the COPY command in Cassandra. It will be very useful when you have a very large database, and you want to store data quickly and your data is in a CSV file then you can directly insert your data.

How do I export Cassandra data to CSV?

Right click on the output and select "Copy All as CSV" to paste the output in CSV.


1 Answers

cqlsh's COPY command can be touchy. However, in the COPY documentation is this line:

The number of columns in the CSV input is the same as the number of columns in the Cassandra table metadata.

Keeping that in-mind, I did manage to get your data to import with a COPY FROM, by naming the empty fields (processstarttimeuuid and processendtimeuuid, respectively):

aploetz@cqlsh:stackoverflow> COPY process (processuuid, processid, processnumber, 
processname, processstarttime, processstarttimeuuid, processendtime, 
processendtimeuuid, processstatus, orderer, vorgangsnummer, vehicleid, fin, reference, 
referencetype) FROM 'Process_BulkData.csv' WITH DELIMITER = ';' AND HEADER = TRUE;

1 rows imported in 0.018 seconds.
aploetz@cqlsh:stackoverflow> SELECT * FROM process ;

 processuuid                          | fin               | orderer | processendtime            | processendtimeuuid | processid         | processname        | processnumber | processstarttime          | processstarttimeuuid | processstatus | reference  | referencetype | vehicleid | vorgangsnummer
--------------------------------------+-------------------+---------+---------------------------+--------------------+-------------------+--------------------+---------------+---------------------------+----------------------+---------------+------------+---------------+-----------+----------------
 0f0d1498-d149-4fcc-87c9-f12783fdf769 | WAU2345CX67890876 |    SIXT | 2011-02-16 22:05:00+-0600 |               null | AbmeldungKl‰rfall | Abmeldung Kl‰rfall |             1 | 2011-02-02 22:05:00+-0600 |                 null |      Finished | KLA-BR4278 |      internal |    A-XA 1 |           4278

(1 rows)
like image 141
Aaron Avatar answered Sep 17 '22 20:09

Aaron