I have a large CSV file with lines that look like:
stringa,stringb
stringb,stringc
stringd,stringa
I need to convert it so that the ids are numbered consecutively from 0. In this case the following would work:
0,1
1,2
3,0
My current code looks like:
import csv

names = {}    # maps each string to its integer id
counter = 0   # next unused id

with open('foo.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row[0] in names:
            id1 = names[row[0]]  # reuse the id assigned earlier
        else:
            names[row[0]] = counter
            id1 = counter
            counter += 1
        if row[1] in names:
            id2 = names[row[1]]
        else:
            names[row[1]] = counter
            id2 = counter
            counter += 1
        print(id1, id2, sep=',')
Sadly, Python dicts use a lot of memory and my input is large.
What can I do when the input is too large for the dict to fit in memory?
I would also be interested to know if there is a better/faster way to solve this problem in general.
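One possible direction when the dict no longer fits in memory is to keep the name-to-id mapping on disk. The following is only a minimal sketch, assuming an SQLite table is an acceptable stand-in for the dict; the function name `renumber` and the `out_path` and `db_path` parameters are hypothetical and not part of the question. Per-row database lookups are likely to be much slower than a dict, so this trades speed for memory.

```python
import csv
import sqlite3


def renumber(in_path, out_path, db_path):
    """Rewrite a two-column CSV with consecutive integer ids.

    The name -> id mapping lives in an on-disk SQLite table (hypothetical
    schema) instead of an in-memory dict.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE names (name TEXT PRIMARY KEY, id INTEGER)")
    counter = 0  # next unused id

    def lookup(name):
        nonlocal counter
        row = conn.execute(
            "SELECT id FROM names WHERE name = ?", (name,)
        ).fetchone()
        if row is not None:
            return row[0]  # name seen before: reuse its id
        conn.execute(
            "INSERT INTO names (name, id) VALUES (?, ?)", (name, counter)
        )
        counter += 1
        return counter - 1

    with open(in_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for first, second in csv.reader(src):
            writer.writerow([lookup(first), lookup(second)])

    conn.commit()
    conn.close()
```

Calling `renumber('foo.csv', 'out.csv', 'names.db')` would write the renumbered pairs to `out.csv`; the database file can be deleted afterwards if the mapping itself is not needed.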
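With pandas, you can build the mapping with factorize and apply it to both columns at once:
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['b', 'c'], ['d', 'a']])
# Collect every distinct string from both columns, in sorted order.
v = df.stack().unique()
v.sort()
# factorize returns (codes, uniques); build a string -> id lookup Series.
f = pd.factorize(v)
m = pd.Series(f[0], f[1])
# Map every cell to its id and restore the two-column shape.
df.stack().map(m).unstack()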
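The snippet above sorts the unique values first, so ids follow sorted order rather than order of first appearance (for this sample the two happen to coincide). If first-appearance numbering is what is wanted, a shorter variant may be enough. This is a sketch that assumes the whole frame fits in memory; the names `codes`, `uniques`, and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['b', 'c'], ['d', 'a']])

# Flatten the frame row by row, then let factorize assign integer codes
# in order of first appearance.
codes, uniques = pd.factorize(df.to_numpy().ravel())

# Reshape the flat codes back into the original two-column layout.
result = pd.DataFrame(codes.reshape(df.shape))
print(result)
```

Here `uniques` holds the original strings, indexed by their new id, so the mapping can be reversed later if needed.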