Trying to parse a fixed width text file.
my text file looks like the following and I need a row id, date, a string, and an integer:
00101292017you1234
00201302017 me5678
I can read the text file to an RDD using sc.textFile(path). I can createDataFrame with a parsed RDD and a schema. It's the parsing in between those two steps.
Spark's substr function can handle fixed-width columns, for example:
df = spark.read.text("/tmp/sample.txt")
df.select(
df.value.substr(1,3).alias('id'),
df.value.substr(4,8).alias('date'),
df.value.substr(12,3).alias('string'),
df.value.substr(15,4).cast('integer').alias('integer')
).show()
will result in:
+---+--------+------+-------+
| id| date|string|integer|
+---+--------+------+-------+
|001|01292017| you| 1234|
|002|01302017| me| 5678|
+---+--------+------+-------+
Having splitted columns you can reformat and use them as in normal spark dataframe.
Someone asked how to do it based on a schema. Based on above responses, here is a simple example:
x= ''' 1 123121234 joe
2 234234234jill
3 345345345jane
4abcde12345jack'''
schema = [
("id",1,5),
("ssn",6,10),
("name",16,4)
]
with open("personfixed.csv", "w") as f:
f.write(x)
df = spark.read.text("personfixed.csv")
df.show()
df2 = df
for colinfo in schema:
df2 = df2.withColumn(colinfo[0], df2.value.substr(colinfo[1],colinfo[2]))
df2.show()
Here is the output:
+-------------------+
| value|
+-------------------+
| 1 123121234 joe|
| 2 234234234jill|
| 3 345345345jane|
| 4abcde12345jack|
+-------------------+
+-------------------+-----+----------+----+
| value| id| ssn|name|
+-------------------+-----+----------+----+
| 1 123121234 joe| 1| 123121234| joe|
| 2 234234234jill| 2| 234234234|jill|
| 3 345345345jane| 3| 345345345|jane|
| 4abcde12345jack| 4|abcde12345|jack|
+-------------------+-----+----------+----+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With