From .csv, read only or split into sections separated by "<string>"

Question

I have a .csv file that is split into sections, each starting with < string > on a row of its own, as in the example below. Each marker is followed by a set of columns and their respective rows of values. Columns are not consistent between sections.

< section1 >
col1 col2 col3
val1 val2 val3

< section2 >
col3 col4 col5
val4 val5 val6
val7 val8 val9

...etc. Is there a way, with the file as either .txt or .csv, to import each section either: 1) into separate dataframes, or 2) into the same dataframe, accessed with something like df[section][col]?

Many thanks!

sammywemmy · Accepted Answer

Depending on the size of your csv, you could read the entire file into pandas and split the dataframe into multiple dataframes via a list comprehension.

data = '''ï»¿<Network>;;;;;;;;;;;;;;;;;;;;;
            Property;Value;;;;;;;;;;;;;;;;;;;;
            Title;;;;;;;;;;;;;;;;;;;;;
            Version;6.4;;;;;;;;;;;;;;;;;;;;
            ;;;;;;;;;;;;;;;;;;;;;
            <Sites>;;;;;;;;;;;;;;;;;;;;;
            Name;LocationCode;Longitude;Latitude;;;;;;;;;;...'''

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data), header=None)

# create a list of dataframe names (the headers of each df)

df_names = df[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>').dropna().tolist()

# find the indices for the headers
regions = df.loc[df[0].str.contains(r'<[a-zA-Z]+')].index.tolist()

last_row = df.index[-1]

regions.append(last_row)

from more_itertools import windowed

# create windows for each 'sub' dataframe

regions_window = list(windowed(regions,2))
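If installing more_itertools is not an option, the same consecutive pairs can probably be built with the standard library alone. A minimal sketch, where the values in regions are made up for illustration and stand in for the header positions plus the last row index:

```python
# Hypothetical header row positions, with the last row index appended
regions = [0, 5, 9, 14]

# Pair each boundary with the one that follows it, like windowed(regions, 2)
regions_window = list(zip(regions, regions[1:]))
print(regions_window)
```

Each tuple is a (start, end) pair that can be fed to the list comprehension in the same way.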

# the function helps with some cleanup during the dataframe extraction

def some_cleanup(df):
    # use the section name from the first row as the column label
    df.columns = df.iloc[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>')
    # drop the header row itself
    df = df.iloc[1:]
    return df

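# extract the dataframes (.loc slicing is inclusive, so stop one row before the next header)
M = [df.loc[start:(end if end == last_row else end - 1)].pipe(some_cleanup)
     for start, end in regions_window]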

# create a dict with the keys as the dataframe names

dataframe_dict = dict(zip(df_names,M))
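For the layout shown in the question, a line-by-line variant might be easier to follow, since each section then gets parsed with its own column names. This is only a sketch under assumptions: the function name read_sections, the marker regex and the whitespace separator are all hypothetical choices, and the marker is assumed to sit alone on its line as in the question's example.

```python
import re
from io import StringIO

import pandas as pd

# Assumed marker format: a line holding only "< name >" (spaces optional)
SECTION_RE = re.compile(r"^\s*<\s*(\w+)\s*>\s*$")


def read_sections(text, sep=r"\s+"):
    """Split text on marker lines and parse each chunk into its own DataFrame."""
    chunks = {}
    current = None
    for line in text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            # A marker line starts a new section
            current = match.group(1)
            chunks[current] = []
        elif current is not None and line.strip():
            # Non-blank lines belong to the section opened last
            chunks[current].append(line)
    # The first line of each chunk becomes that section's header row
    return {
        name: pd.read_csv(StringIO("\n".join(lines)), sep=sep)
        for name, lines in chunks.items()
        if lines
    }


sample = """< section1 >
col1 col2 col3
val1 val2 val3

< section2 >
col3 col4 col5
val4 val5 val6
val7 val8 val9
"""

frames = read_sections(sample)
print(frames["section2"]["col4"].tolist())
```

The returned dict gives the frames[section][col] style of access asked about in the question, and passing sep=";" or sep="," should adapt it to a delimited file.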

Josh Friedlander · Answer

There are some great answers here already but I'd recommend a Unix tool! It is shorter and will scale to very large files that don't fit into Pandas.

Assuming your file is called foo.csv:

awk '/< section/{x=i++"foo_mini";next}{print > x;}' foo.csv

This creates as many numbered {n}foo_mini files as you have sections. (It looks for the pattern < section, and then starts a new file from the following line.)

Then for completeness' sake, add the csv extension:

for file in *foo_mini; do mv "$file" "${file/foo_mini/foo_mini.csv}"; done

You thus have:

0foo_mini.csv
1foo_mini.csv
etc...

It's then a cinch to read them in with Pandas as separate dataframes, and concat them if you like.
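Reading the split files back could look roughly like this. It is a sketch, not part of the original answer: the helper name load_split_files is hypothetical, and the comma separator is an assumption that may need changing to match the real file.

```python
from pathlib import Path

import pandas as pd


def load_split_files(directory=".", pattern="*foo_mini.csv", sep=","):
    """Read every split file into a dict keyed by file name without extension."""
    paths = sorted(Path(directory).glob(pattern))
    return {path.stem: pd.read_csv(path, sep=sep) for path in paths}


if __name__ == "__main__":
    frames = load_split_files()
    if frames:
        # Stack the sections; the dict keys become the outer index level
        combined = pd.concat(frames, names=["section", "row"])
        print(combined)
```

Because the sections have different columns, the concatenated frame will hold NaN wherever a section lacks a column, so keeping the dict of separate dataframes may be the more practical result.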

Tags: python, pandas

JG89
