
Extracting columns containing a certain name

I'm trying to use Python to manipulate data in large txt files.

I have a txt-file with more than 2000 columns, and about a third of these have a title which contains the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestion on how I can do that?

I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.

EDIT 1: Thank you all! At the moment of writing 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer so I didn't check for a day or two, and was happily surprised by this. I'm very impressed.

EDIT 2: I've added a picture that shows what a part of the original txt-file can look like, in case it will help anyone in the future:

[Image: sample from the original txt-file]

asked May 04 '15 by Rickyboy



3 Answers

One way of doing this, without installing third-party modules like numpy/pandas, is as follows. Given an input file called "input.csv" like this:

a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1

The following code does what you want.

import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

# Open both files; newline='' is recommended when using the csv module
with open(input_filename, newline='') as infile, \
     open(output_filename, 'w', newline='') as outfile:

    # Instantiate a CSV reader; check that you have the appropriate delimiter
    reader = csv.reader(infile, delimiter=',')
    writer = csv.writer(outfile, delimiter=',')

    # Get the first row (assuming this row contains the header)
    input_header = next(reader)

    # Record the indices of the columns you want to keep
    columns_to_keep = []
    for i, name in enumerate(input_header):
        if 'net' in name:
            columns_to_keep.append(i)

    # Construct and write the header of the output file
    output_header = []
    for column_index in columns_to_keep:
        output_header.append(input_header[column_index])
    writer.writerow(output_header)

    # Iterate over the remainder of the input file, construct a row
    # with the columns you want to keep, and write it to the output file
    for row in reader:
        new_row = []
        for column_index in columns_to_keep:
            new_row.append(row[column_index])
        writer.writerow(new_row)

Note that there is no error handling. There are at least two cases that should be handled. The first is checking for the existence of the input file (hint: check the functionality provided by the os and os.path modules). The second is handling blank lines or lines with an inconsistent number of columns.
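A minimal sketch of that error handling, using `os.path.isfile` for the existence check and a length comparison against the header to skip blank or ragged rows (the function name here is just for illustration):

```python
import csv
import os.path

def read_valid_rows(filename):
    """Yield the data rows of a CSV file, skipping blank lines and
    rows whose column count does not match the header."""
    if not os.path.isfile(filename):
        raise FileNotFoundError(f'Input file not found: {filename}')
    with open(filename, newline='') as infile:
        reader = csv.reader(infile)
        header = next(reader)
        for row in reader:
            if not row:                   # blank line
                continue
            if len(row) != len(header):   # inconsistent column count
                continue
            yield row

# Usage: for row in read_valid_rows('input.csv'): ...
```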

answered Oct 01 '22 by Marco Nawijn


This could be done, for instance, with pandas:

import pandas as pd

df = pd.read_csv('path_to_file.txt', sep=r'\s+')
print(df.columns)  # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
df_filtered.to_csv('new_file.txt', index=False)  # index=False avoids writing the row index as an extra column

Of course, since we don't have the structure of your text file, you will have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).

This will load the whole file into memory and then filter out the unnecessary columns. If your file is so large that it cannot fit into RAM at once, there is a way to load only specific columns with the usecols argument.
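As a sketch of that approach, `usecols` also accepts a callable that is evaluated against each column name, so only the matching columns are ever parsed. The in-memory sample below stands in for the real file and mirrors the question's layout:

```python
import io
import pandas as pd

# Hypothetical stand-in for the large whitespace-separated txt file
data = io.StringIO(
    "a b c_net d e_net\n"
    "0 0 1 0 1\n"
    "0 0 1 0 1\n"
)

# The callable is applied to each column name; only names for which it
# returns True are loaded into memory
df = pd.read_csv(data, sep=r'\s+', usecols=lambda name: 'net' in name)
print(list(df.columns))  # ['c_net', 'e_net']
```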

answered Sep 30 '22 by rth


You can use pandas' filter function to select columns whose names match a regex:

data_filtered = data.filter(regex='net')
answered Sep 30 '22 by Kathirmani Sukumar
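A self-contained sketch of the filter approach, using a small made-up frame shaped like the sample data:

```python
import pandas as pd

# Hypothetical data mimicking the question's column layout
data = pd.DataFrame({
    'a': [0, 0],
    'c_net': [1, 1],
    'e_net': [1, 1],
})

# filter(regex=...) keeps only the columns whose names match the pattern
data_filtered = data.filter(regex='net')
print(list(data_filtered.columns))  # ['c_net', 'e_net']
```

Note that filter matches the pattern anywhere in the label (re.search semantics), so 'net' would also match a column named 'internet'; anchor the pattern (e.g. 'net$') if that matters.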