Is there a simple way to load parquet files directly into Cassandra?

I have a parquet file/folder (about 1 GB) that I would like to load into my local Cassandra DB. Unfortunately, I could not find any way (except via Spark, in Scala) to load this file directly into Cassandra. Blowing the parquet file out into CSV makes it far too large for my laptop.

I am setting up a Cassandra DB for a big-data analytics case (about 25 TB of raw data that we need to make searchable fast). Right now I am running some local tests on how to optimally design the keyspaces, indices and tables before moving to Cassandra-as-a-Service on a hyperscaler. Converting the data to CSV is not an option, as it blows up the size too much. For reference, this is the cqlsh COPY command I would otherwise use for the CSV route:

COPY firmographics.company (col1,col2,col3.....) FROM 'C:\Users\Public\Downloads\companies.csv' WITH DELIMITER='\t' AND HEADER=TRUE;
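For the local schema tests, here is a minimal sketch of creating the keyspace and table from Python, assuming the cassandra-driver package and a single local node; the replication settings and column types are illustrative placeholders, not from the original post:

# Minimal sketch, assuming a local single-node Cassandra and the
# 'cassandra-driver' package; column types are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# SimpleStrategy with replication_factor 1 is only suitable for local testing
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS firmographics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS firmographics.company (
        col1 text PRIMARY KEY,
        col2 text,
        col3 text
    )
""")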
asked Oct 16 '25 by Jonathan

1 Answer

It turns out, as Alex Ott said, that it's easy enough to write this up in Spark (PySpark, in my case). Below is my code:

import time
import findspark

# findspark.init() must run before pyspark is imported so that it can
# locate the local Spark installation
findspark.init()

from pyspark.sql import SparkSession

# Pull in the DataStax Spark-Cassandra connector when the session starts
spark = SparkSession\
    .builder\
    .appName("Spark Exploration App")\
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.11:2.3.2')\
    .getOrCreate()

# Read the parquet file/folder directly into a Spark DataFrame
df = spark.read.parquet("/PATH/TO/FILE/")

start = time.time()

# Drop the unneeded 'filename' column and append everything to the
# existing Cassandra table
df.drop('filename').write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="few_com", keyspace="bmbr")\
    .save()

end = time.time()
print(end - start)
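As a quick sanity check (a sketch using the same connector settings, not part of the original answer), the table can be read back through the connector to confirm the load:

# Read the table back through the same connector to spot-check the row count
check = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="few_com", keyspace="bmbr")\
    .load()
print(check.count())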
answered Oct 19 '25 by Jonathan