I have two csv files, one is a single column with a header and the other is multiple columns with headers. I want to lookup the values in the one column file and search the column with the same header in the other file. Once a match is found, I want that whole row to be printed. I know this is a vlookup function, but the csv file with multiple columns is extremely large and always crashes excel when I try to use formulas to accomplish this. So I've been trying to use python as a solution.
I'm taking this column:
age
23
43
18
Searching this table:
Name, age,number,AA,BB,CC,DD,EE
John, 23, 1, 34,35,36,37,38
Mary, 32, 2, 33,34,35,36,37
Jacob , 43, 3, 32,33,34,35,36
Matthew,22, 4, 31,32,33,34,35
Jean, 18, 5, 30,31,32,33,34
trying to print this:
Name, age,number,AA,BB,CC,DD,EE
John, 23, 1, 34,35,36,37,38
Jacob , 43, 3, 32,33,34,35,36
Jean, 18, 5, 30,31,32,33,34
I've been trying to use this code, but I got all mixed up and it just prints the first row as a column:
with open('/home/s/Untitled 1.csv') as f:
r=pandas.read_csv(f)
with open('/home/s/Test1.csv','r') as w:
x=pd.read_csv(w)
col=w['age']
for line in w:
for col in w:
for row in r:
if row in col:
print(line)
I basically want the script to use the first entry in the query column to search the column with the same heading in the data table and print that row, looping for the rest of the entries in the rows below.
Any advice would be appreciated!
There are a lot of problems with your code which shows that you are very confused.
with open('/home/s/Untitled 1.csv') as f:
r = pandas.read_csv(f)
with open('/home/s/Test1.csv','r') as w:
x = pandas.read_csv(w)
# w is not indexable
col = w['age']
for line in w:
# w is not a table.
for col in w:
for row in r:
if row in col:
print(line)
I think it would help if I break down the problem for you:
pandas data framepandas data frame to find matches of agesThis can be done by using a list instead of a data frame. You will be able to see why later on.
ages = []
with open("incsv1.csv", "r") as f:
r = pandas.read_csv(f)
ages = list(r["age"])
You have already done it:
with open("incsv2.csv", "r") as f:
x = pandas.read_csv(f)
Since you know that you are only going to look through the ages column, just index to that and iterate through it:
for i, age in enumerate(x["ages"]):
# You can't do this without a numpy int64
if age in ages:
print x.loc[i]
The whole program will output:
Name John
age 23
number 1
AA 34
BB 35
CC 36
DD 37
EE 38
Name: 0, dtype: object
Name Jacob
age 43
number 3
AA 32
BB 33
CC 34
DD 35
EE 36
Name: 2, dtype: object
Name Jean
age 18
number 5
AA 30
BB 31
CC 32
DD 33
EE 34
Name: 4, dtype: object
Now, I know you want it to print out in a straight line, so I will show you what I think would be much better:
import pandas
ages = []
with open("incsv1.csv", "r") as f:
r = pandas.read_csv(f)
ages = list(r["age"])
with open("incsv2.csv", "r") as f:
# Skip the first line
f.readline()
for line in f:
if int(line.split(",")[1]) in ages:
print line,
As you can see, you don't really need pandas in this problem. In fact, I could remove it:
ages = []
with open("incsv1.csv", "r") as f:
# Skip the first line
f.readline()
for line in f:
ages.append(int(line.strip("\n")))
with open("incsv2.csv", "r") as f:
# Skip the first line
f.readline()
for line in f:
if int(line.split(",")[1]) in ages:
print line,
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With