I want to remove the newline characters inside the field data of a CSV file. The same question has been asked by multiple people on SO and elsewhere, but the provided solutions are all scripting-based. I'm looking for a solution in a programming language such as Python, or in Spark (not only these two), as I have pretty big files.
Previously asked questions on the same topic:
Remove New Line Character from CSV file's string column
Replace new line character between double quotes with space
Remove New Line from CSV file's string column
https://unix.stackexchange.com/questions/222049/how-to-detect-and-remove-newline-character-within-a-column-in-a-csv-file
I have a CSV file of about 1 GB and want to remove the newline characters in the field data. The schema of the CSV file varies dynamically, so I can't hard-code it. The line break doesn't always appear before a comma; it can appear anywhere, even within a field.
Sample Data:
playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,"This is
Team2",BOS,AL,1
gehrilo01,1933,3,"Game name is
Cricket"
,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,"Here it is
Team4",DET,AL,1
dykesji01,1933,5,"Game name is
Hockey"
,"Team name
Team5",CHA,AL,1
Expected Output:
playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,"This is Team2",BOS,AL,1
gehrilo01,1933,3,"Game name is Cricket" ,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,"Here it is Team4",DET,AL,1
dykesji01,1933,5,"Game name is Hockey","Team name Team5",CHA,AL,1
A newline character can appear in any field's data.
If you are using pyspark, I would suggest using sparkContext's wholeTextFiles function to read the file, since the file needs to be read as a whole for it to be parsed correctly.
After reading it with wholeTextFiles, parse it by replacing the end-of-line characters with commas and applying some additional formatting, so that the whole text can be broken down into groups of eight strings.
import re

# Assumes Windows-style "\r\n" line endings and eight columns per row
rdd = sc.wholeTextFiles("path to your csv file")\
    .map(lambda x: re.sub(r'(?!(([^"]*"){2})*[^"]*$),', ' ', x[1].replace("\r\n", ",").replace(",,", ",")).split(","))\
    .flatMap(lambda x: [x[k:k+8] for k in range(0, len(x), 8)])
You should get output as
[u'playerID', u'yearID', u'gameNum', u'gameName', u'teamName', u'lgID', u'GP', u'startingPos']
[u'gomezle01', u'1933', u'1', u'Cricket', u'Team1', u'NYA', u'AL', u'1']
[u'ferreri01', u'1933', u'2', u'Hockey', u'"This is Team2"', u'BOS', u'AL', u'1']
[u'gehrilo01', u'1933', u'3', u'"Game name is Cricket"', u'Team3', u'NYA', u'AL', u'1']
[u'gehrich01', u'1933', u'4', u'Hockey', u'"Here it is Team4"', u'DET', u'AL', u'1']
[u'dykesji01', u'1933', u'5', u'"Game name is Hockey"', u'"Team name Team5"', u'CHA', u'AL', u'1']
If you would like to convert each array row of the rdd into a single string, you can add
.map(lambda x: ", ".join(x))
and you should get
playerID, yearID, gameNum, gameName, teamName, lgID, GP, startingPos
gomezle01, 1933, 1, Cricket, Team1, NYA, AL, 1
ferreri01, 1933, 2, Hockey, "This is Team2", BOS, AL, 1
gehrilo01, 1933, 3, "Game name is Cricket", Team3, NYA, AL, 1
gehrich01, 1933, 4, Hockey, "Here it is Team4", DET, AL, 1
dykesji01, 1933, 5, "Game name is Hockey", "Team name Team5", CHA, AL, 1
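The parsing step of the Spark snippet above can be tried on a small string without a cluster. The following is only a sketch of the same idea in plain Python: the function name rows_from_text and the width parameter are made up for illustration, and it shares the snippet's assumptions (Windows-style "\r\n" line endings, a fixed number of columns, no empty fields, since ",," is collapsed to ",").

```python
import re

# Matches a comma only when an odd number of double quotes follows it,
# i.e. when the comma sits inside a quoted field.
COMMA_INSIDE_QUOTES = re.compile(r'(?!(([^"]*"){2})*[^"]*$),')


def rows_from_text(text, width=8):
    """Split raw CSV text into rows of `width` fields (hypothetical helper)."""
    # Turn every line break into a comma, then collapse the doubled comma
    # produced by a line break that was followed by a comma.
    flat = text.replace("\r\n", ",").replace(",,", ",")
    # Commas that are now inside quotes were line breaks: make them spaces.
    flat = COMMA_INSIDE_QUOTES.sub(" ", flat)
    fields = flat.split(",")
    # Regroup the flat field list into fixed-width rows.
    return [fields[k:k + width] for k in range(0, len(fields), width)]


sample = 'a,b,c\r\n1,"x\r\ny",3\r\n4,"p\r\nq"\r\n,6'
for row in rows_from_text(sample, width=3):
    print(row)
```

Note that the quotes are kept as part of the field values, and that a trailing line break at the end of the file would leave one extra empty field, so the input may need to be stripped first.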
You can use the re, pandas and io modules as follows:
import re
import io
import numpy as np
import pandas as pd

with open('data.csv', 'r') as f:
    data = f.read()

# Drop line breaks that directly follow a closing quote, then let pandas parse the quoted fields
df = pd.read_csv(io.StringIO(re.sub(r'"\s*\n', '"', data)))

# Replace the remaining line breaks in all textual columns
for col in df.columns:
    if df[col].dtype == np.object_:
        df[col] = df[col].str.replace('\n', ' ')
In [78]: df
Out[78]:
    playerID  yearID  gameNum              gameName          teamName  lgID  GP  startingPos
0  gomezle01    1933        1               Cricket             Team1   NYA  AL            1
1  ferreri01    1933        2                Hockey     This is Team2   BOS  AL            1
2  gehrilo01    1933        3  Game name is Cricket             Team3   NYA  AL            1
3  gehrich01    1933        4                Hockey  Here it is Team4   DET  AL            1
4  dykesji01    1933        5   Game name is Hockey   Team name Team5   CHA  AL            1
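To see what the re.sub pre-pass in the snippet above does, it can be run on a tiny string. This is a sketch only; drop_newline_after_quote is a name made up for illustration. It also shows a caveat worth checking against your data: the pattern removes any line break that follows a double quote, so a row whose last field is quoted gets merged with the next row.

```python
import re


def drop_newline_after_quote(text):
    """Remove whitespace plus a line break that directly follows a double quote."""
    return re.sub(r'"\s*\n', '"', text)


# A line break right after the closing quote is removed, so the stray
# ",Team3" line is glued back onto its row.
broken = 'id,game,team\n3,"Game name is\nCricket"\n,Team3\n'
print(repr(drop_newline_after_quote(broken)))

# Caveat: a quoted field at the end of a row loses its row terminator too.
merged = 'a,"b"\nc,d\n'
print(repr(drop_newline_after_quote(merged)))
```

The line break inside the quotes ("Game name is\nCricket") is left alone by this step; pandas parses it as part of the field, and the loop over the textual columns replaces it afterwards.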
If you want this DataFrame as an output CSV file, use:
df.to_csv('./output.csv')
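Since the file is around 1 GB, a streaming variant that never holds the whole file in memory may also be worth trying. The following is a minimal sketch using only the standard csv module, not a tested production solution. The names repair_rows and clean_csv are made up for illustration. It assumes that the first line is a complete header, that its field count is the expected width of every row (so no schema is hard-coded), and that a row with fewer fields than the header was broken by a line break outside quotes and continues on the next line.

```python
import csv


def repair_rows(reader):
    """Yield rows of header width with in-field line breaks turned into spaces."""
    header = next(reader)
    width = len(header)  # schema is taken from the file, not hard-coded
    yield header
    pending = []
    for row in reader:
        if not row:  # csv.reader returns [] for blank lines
            continue
        # Line breaks inside quoted fields arrive as part of the field value.
        row = [f.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
               for f in row]
        if pending:
            # Continuation of a short row: its first field belongs to the
            # last field collected so far.
            if row[0]:
                pending[-1] = pending[-1] + " " + row[0]
            pending.extend(row[1:])
        else:
            pending = row
        if len(pending) >= width:
            yield pending
            pending = []
    if pending:  # incomplete trailing row, emitted as-is
        yield pending


def clean_csv(src_path, dst_path):
    """Stream src_path to dst_path row by row (paths are placeholders)."""
    # newline="" lets the csv module see and handle the line breaks itself.
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(repair_rows(csv.reader(src)))
```

A line break inside an unquoted last field cannot be detected this way, because the row already has the full width when the break occurs. The writer also quotes fields only where needed, so the quotes around values such as "This is Team2" are dropped in the output.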