Is there a simple way to load parquet files directly into Cassandra?

I have a parquet file/folder (about 1 GB) that I would like to load into my local Cassandra DB. Unfortunately, I could not find any way (except via Spark, in Scala) to load this file directly into Cassandra. Blowing the parquet file out into CSV makes it far too large for my laptop.

I am setting up a Cassandra DB for a big-data analytics case (about 25 TB of raw data that we need to make searchable fast). Right now I am running some local tests on how to optimally design the keyspaces, indices and tables before moving to Cassandra-as-a-Service on a hyperscaler. Converting the data to CSV is not an option, as it blows up the size too much. For reference, this is the cqlsh COPY command I would otherwise use for the CSV route:

COPY firmographics.company (col1,col2,col3.....) FROM 'C:\Users\Public\Downloads\companies.csv' WITH DELIMITER='\t' AND HEADER=TRUE;
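For the local schema tests, here is a minimal sketch of creating the keyspace and table from Python, assuming the cassandra-driver package and a single local node; the replication settings and column types are illustrative placeholders, not from the original post:

# Minimal sketch, assuming a local single-node Cassandra and the
# 'cassandra-driver' package; column types are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# SimpleStrategy with replication_factor 1 is only suitable for local testing
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS firmographics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS firmographics.company (
        col1 text PRIMARY KEY,
        col2 text,
        col3 text
    )
""")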
asked Oct 16 '25 by Jonathan

1 Answer

It turns out, as Alex Ott said, that it's easy enough to write this up in Spark (PySpark, in my case). Below is my code:

import time
import findspark

# findspark.init() must run before pyspark is imported so that it can
# locate the local Spark installation
findspark.init()

from pyspark.sql import SparkSession

# Pull in the DataStax Spark-Cassandra connector when the session starts
spark = SparkSession\
    .builder\
    .appName("Spark Exploration App")\
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.11:2.3.2')\
    .getOrCreate()

# Read the parquet file/folder directly into a Spark DataFrame
df = spark.read.parquet("/PATH/TO/FILE/")

start = time.time()

# Drop the unneeded 'filename' column and append everything to the
# existing Cassandra table
df.drop('filename').write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="few_com", keyspace="bmbr")\
    .save()

end = time.time()
print(end - start)
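As a quick sanity check (a sketch using the same connector settings, not part of the original answer), the table can be read back through the connector to confirm the load:

# Read the table back through the same connector to spot-check the row count
check = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="few_com", keyspace="bmbr")\
    .load()
print(check.count())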
answered Oct 19 '25 by Jonathan