Create Pandas DataFrame from txt file with specific pattern

I need to create a Pandas DataFrame from a text file with the following structure:

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]

The rows ending in "[edit]" are States, and the rows that follow (most of them ending in a "[number]" reference) are Regions. I need to split these apart and repeat the State name for each Region Name beneath it, like this:

Index          State          Region Name
0              Alabama        Auburn...
1              Alabama        Florence...
2              Alabama        Jacksonville...
...
8              Alaska         Fairbanks...
9              Arizona        Flagstaff...
10             Arizona        Tempe...

I am not sure how to split the text file on "[edit]", "[number]", or "(characters)" into the respective columns and repeat the State Name for each Region Name. Can anyone give me a starting point?

asked Dec 29 '16 20:12

Peter Wilson

2 Answers

You can first use read_csv with the names parameter to create a DataFrame with a single column, Region Name. Set the separator to a character that does NOT occur in the data (such as ;) so that each line is read as one value:

import pandas as pd

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])

Then insert a new column State: extract the name from the rows that contain [edit] and forward-fill it down, then strip everything from " (" to the end of each value in Region Name:

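df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)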

Finally, remove the rows that contain [edit] by boolean indexing; the mask is created with str.contains:

df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson
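
For anyone who wants to try these steps without a file on disk, here is a minimal self-contained sketch. It assumes a shortened copy of the sample data held in a string (the `sample` variable is only a stand-in for 'filename.txt') and passes `regex=True` explicitly, which recent pandas versions need for pattern-based replacement.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for 'filename.txt' (shortened sample).
sample = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
"""

# ';' does not occur in the data, so each line lands in a single column.
df = pd.read_csv(io.StringIO(sample), sep=";", names=["Region Name"])

# State names appear only on '[edit]' rows; forward-fill them downwards.
state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False).ffill()
df.insert(0, "State", state)

# Drop the state header rows, then trim everything from ' (' onwards.
df = df[~df["Region Name"].str.contains(r"\[edit\]")].reset_index(drop=True)
df["Region Name"] = df["Region Name"].str.replace(r" \(.+$", "", regex=True)

print(df)
```

Each remaining row pairs one town with the state whose [edit] line most recently preceded it.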

If you need to keep the full values, the solution is simpler:

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)
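
If the text in parentheses is also wanted as its own column, one possible approach is str.extract with named groups. This is only a sketch: the column names Town and Universities are made up for illustration, and the pattern assumes every region line has the form "name (text)" optionally followed by [n] references.

```python
import pandas as pd

# Two region strings copied from the sample data in the question.
regions = pd.Series([
    "Auburn (Auburn University)[1]",
    "Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]",
])

# Named groups become column names; the trailing [n] references are not captured.
# 'Town' and 'Universities' are hypothetical names chosen for this example.
parts = regions.str.extract(r"^(?P<Town>.+?) \((?P<Universities>.*)\)")

print(parts)
```

Lines that do not match the pattern come back as NaN in both columns, so they are easy to spot and handle separately.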

answered Sep 20 '22 12:09

jezrael

You could parse the file into tuples first:

import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'state area')
items = []
suffix = '[edit]'

with open('unis.txt') as f:
    for line in f:
        l = line.rstrip('\n')
        if l.endswith(suffix):
            # Slice the suffix off; rstrip('[edit]') would also strip
            # trailing e/d/i/t letters from the state name itself.
            state = l[:-len(suffix)]
        else:
            # Keep only the text before the first ' ('.
            i = l.index(' (')
            area = l[:i]
            items.append(Item(state, area))

df = pd.DataFrame.from_records(items, columns=['State', 'Area'])

print(df)

output:

      State          Area
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson
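
One detail to watch when running this kind of parser over the full list of states: str.rstrip takes a set of characters to strip, not a suffix. The sample states happen to be unaffected, but a name ending in one of the letters e, d, i or t would be cut short. A small sketch, using a hypothetical input line that is not part of the sample above:

```python
# Hypothetical line; 'Vermont' ends in 't', which is in the set '[edit]'.
line = "Vermont[edit]"

# rstrip removes any trailing run of the characters '[', 'e', 'd', 'i', 't', ']'.
stripped = line.rstrip("[edit]")
print(stripped)

# Slicing off the known suffix length leaves the name intact.
suffix = "[edit]"
state = line[:-len(suffix)] if line.endswith(suffix) else line
print(state)
```

On Python 3.9 and later, line.removesuffix('[edit]') does the same job as the slice.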

answered Sep 19 '22 12:09

ultra909

Tags: python, regex, text, pandas, extract
