I have a fairly large csv, looking like this:
+---------+---------+
| Column1 | Column2 |
+---------+---------+
| 1 | 93644 |
| 2 | 63246 |
| 3 | 47790 |
| 3 | 39644 |
| 3 | 32585 |
| 1 | 19593 |
| 1 | 12707 |
| 2 | 53480 |
+---------+---------+
My intent is to
For example, I want to end up with multiple files that look like:
+---+-------+----------------+
| 1 | 19593 | NewColumnValue |
| 1 | 93644 | NewColumnValue |
| 1 | 12707 | NewColumnValue |
+---+-------+----------------+
+---+-------+-----------------+
| 2 | 63246 | NewColumnValue |
| 2 | 53480 | NewColumnValue |
+---+-------+-----------------+
+---+-------+-----------------+
| 3 | 47790 | NewColumnValue |
| 3 | 39644 | NewColumnValue |
| 3 | 32585 | NewColumnValue |
+---+-------+-----------------+
I have managed to do this using separate .py files:
Step1
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
df.to_csv('ready.csv', index=False, header=False)
Step2
import csv
from itertools import groupby
for key, rows in groupby(csv.reader(open("ready.csv")),
lambda row: row[0]):
with open("%s.csv" % key, "w") as output:
for row in rows:
output.write(",".join(row) + "\n")
But I'd really like to learn how to accomplish everything in a single .py file. I tried this:
# -*- coding: utf-8 -*-
#This processes a large CSV file.
#It will dd a new column, populate the new column with a uniform piece of data for each row, sort the CSV, and remove headers
#Then it will split the single large CSV into multiple CSVs based on the value in column 0
import pandas as pd
import csv
from itertools import groupby
df = pd.read_csv('source.csv')
df = df.sort_values('Column1')
df['NewColumn'] = 'NewColumnValue'
for key, rows in groupby(csv.reader((df)),
lambda row: row[0]):
with open("%s.csv" % key, "w") as output:
for row in rows:
output.write(",".join(row) + "\n")
but instead of working as intended, it's giving me multiple CSVs named after each column header.
Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.
Why not just groupby Column1 and save each group?
df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)
Column1 Column2 NewColumn
0 1 93644 NewColumnValue
5 1 19593 NewColumnValue
6 1 12707 NewColumnValue
1 2 63246 NewColumnValue
7 2 53480 NewColumnValue
2 3 47790 NewColumnValue
3 3 39644 NewColumnValue
4 3 32585 NewColumnValue
for i, g in df.groupby('Column1'):
g.to_csv('{}.csv'.format(i), header=False, index_label=False)
Thanks to Unatiel for the improvement. header=False will not write headers and index_label=False will not write an index column.
This creates 3 files:
1.csv
2.csv
3.csv
Each having data corresponding to each Column1 group.
You don't need to switch to itertools for the filtering, pandas has all of the necessary functionality built-in.
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1') # Sorting isn't needed
df['NewColumn'] = 'NewColumnValue'
for key in df['Column1'].unique(): # For each value in Column1
# These two steps can be combined into a single call
# I'll separate for clarity:
# 1) filter the dataframe on the unique value
dw = df[df['Column1']==key]
# 2) write the resulting dataframe without headers
dw.to_csv("%s.csv" % key, header=False)
pandas.DataFrame supports a method to write it's data as a csv to_csv(). You have no need for csv module in this case.
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1').set_index('Column1')
df['NewColumn'] = 'NewColumnValue'
for key in df.index.unique():
df.loc[key].to_csv('%d.csv' % int(key), header=False)
for key df.index.unique(): will loop over every unique value in the index. In your example, it will loop over (1, 2 , 3). header=False willmake sure the header isn't written to the output file.
And to explain why you get the wrong output in your example, try print(list(df)). This should output all the columns in df. This is why for key, rows in csv.reader((df)): iterates over the columns in df.
Actually, you should get 1 csv for every column in your dataframe, and their contents are likely something like ,[NAME_OF_COLUMN] or maybe ,<itertools.... object at 0x.....>.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With