I am trying to write a python script to manipulate the excel spreadsheet.
Suppose if,Ihave the sample data:
Gene chrom strand TSS TES Name
NM_145215 chr5 + 135485168 135488045 Abhd11
NM_1190437 chr5 + 135485021 135488045 Abhd11
NM_1205181 chr14 + 54873803 54888844 Abhd4
NM_134076 chr14 + 54878906 54888844 Abhd4
NM_9594 chr2 + 31615464 31659747 Abl1
NM_1112703 chr2 + 31544075 31659747 Abl1
NM_207624 chr11 + 105829258 105851278 Abl1
NM_9598 chr11 + 105836521 105851278 Ace2
NM_1130513 chrX + 160577273 160626350 Ace2
NM_27286 chrX + 160578411 160626350 Ace2
For those similar names(column 6), I want to retrieve the whole row with least TSS. Example, for first 2 rows-Abhd11 name, I want to save the 2nd row in my result since the TSS 135485021 < 135485168. So on for all the sets with same NAMES.
Any ideas and comments are appreciated.
Input
If possible I would save the excel file as a csv file and then load into python using the csv module.
Alternatively you could use the xlrd module for reading excel files - though I haven't used this and don't know much about it.
openpyxl is an additional option for parsing the excel file (cheers Just another dunce).
Manipulation
ernie's idea seems workable, and I would implement it as follows.
Assuming that linesreadfromfile is a list of lists as read using csv.reader i.e. each list element is a list of values that corresponds to the delimited entries of that row in the file,
finaldict = {}
for row in linesreadfromfile:
if finaldict.has_key(row[5]):
if finaldict[row[5]][3] > row[3]:
finaldict[row[5]] = row
else:
finaldict[row[5]] = row
I agree with mutzmatron and would recommend the xlrd module. Here's a simple example:
import xlrd
# Create your file handle
file_handle = xlrd.open_workbook(file_name)
# Use the first page in the spreadsheet (0-based indexes)
sheet = file_handle.sheet_by_index(0)
# Create dictionary for storing values
abc = {}
# Loop through every row
for i in range(sheet.nrows):
line = sheet.row_values(i)
# Get your 'Name' and 'TSS' columns
name = line[5]
tss = line[3]
# Add this 'Name' to your dictionary if it's new, or keep the max value
if name not in abc.keys():
abc[name] = tss
else:
abc[name] = max(abc[name],tss)
Obviously changing what you'd need to save (full row, certain values, etc.) based on your spec.
--- EDIT ---
# If this 'Name' is new, save this line
if name not in abc.keys():
abc[name] = {'tss': tss, 'line': line}
# Else, if this 'Name' is not new and the TSS is less, keep this new line
elif tss < abc[name]['tss']:
abc[name]['line'] = line
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With