I have a .csv file that is split into sections, each starting with < string > on a row of its own, as in the example below. Each header is followed by a set of columns and their respective rows of values. Columns are not consistent between sections.
< section1 >
col1 col2 col3
val1 val2 val3
< section2 >
col3 col4 col5
val4 val5 val6
val7 val8 val9
...etc. Is there a way, whether the file is .txt or .csv, to import each section either 1) into separate dataframes, or 2) into the same dataframe, accessed with something like df[section][col]?
Many thanks!
Depending on the size of your csv, you could read the entire file into Pandas and split the dataframe into multiple dataframes via a list comprehension.
data = '''<Network>;;;;;;;;;;;;;;;;;;;;;
Property;Value;;;;;;;;;;;;;;;;;;;;
Title;;;;;;;;;;;;;;;;;;;;;
Version;6.4;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;
<Sites>;;;;;;;;;;;;;;;;;;;;;
Name;LocationCode;Longitude;Latitude;;;;;;;;;;...'''
import pandas as pd
from io import StringIO
from more_itertools import windowed

df = pd.read_csv(StringIO(data), header=None)
# create a list of dataframe names (the headers of each df)
df_names = df[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>').dropna().tolist()
# find the indices for the headers
regions = df.loc[df[0].str.contains(r'<[a-zA-Z]+')].index.tolist()
# add an end marker one past the last row so the final section is closed off
regions.append(df.index[-1] + 1)
# create windows for each 'sub' dataframe
regions_window = list(windowed(regions, 2))
# the function helps with some cleanup during the dataframe extraction
def some_cleanup(df):
    df = df.copy()
    df.columns = df.iloc[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>')
    df = df.iloc[1:]
    return df
# extract the dataframes (iloc excludes the end row, i.e. the next section's header)
M = [df.iloc[start:end].pipe(some_cleanup) for start, end in regions_window]
# create a dict with the keys as the dataframe names
dataframe_dict = dict(zip(df_names, M))
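For the whitespace-separated layout shown in the question, parsing line by line may be simpler than slicing one big frame. The following is only a minimal sketch and not part of the answer above: `split_sections` and `SECTION_RE` are hypothetical names, and it assumes every section header looks like `< name >`, the first line after a header holds the column names, and no value contains spaces.

```python
import re
import pandas as pd

# Assumed header shape: "< name >" alone on a line
SECTION_RE = re.compile(r'^<\s*(\w+)\s*>$')

def split_sections(text):
    """Return {section name: DataFrame} for text laid out as in the question."""
    raw = {}        # section name -> list of rows, each row a list of strings
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SECTION_RE.match(line)
        if match:
            # a new section starts here
            current = match.group(1)
            raw[current] = []
        elif current is not None:
            raw[current].append(line.split())
    # the first row of each section is its header, the rest are values
    return {name: pd.DataFrame(rows[1:], columns=rows[0])
            for name, rows in raw.items() if rows}

sample = """< section1 >
col1 col2 col3
val1 val2 val3
< section2 >
col3 col4 col5
val4 val5 val6
val7 val8 val9"""

frames = split_sections(sample)        # option 1: separate dataframes
combined = pd.concat(frames, axis=1)   # option 2: combined[section][col]
print(combined['section2']['col4'].tolist())  # ['val5', 'val8']
```

Passing the dict to `pd.concat` with `axis=1` uses the section names as the outer level of the columns, so `combined[section][col]` works. Shorter sections are padded with NaN to the length of the longest one.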
There are some great answers here already but I'd recommend a Unix tool! It is shorter and will scale to very large files that don't fit into Pandas.
Assuming your file is called foo.csv:
awk '/< section/{x=i++"foo_mini";next}{print > x;}' foo.csv
This creates as many numbered {n}foo_mini files as you have sections. (It looks for the pattern < section, skips that header line, and writes the lines that follow to a new file.)
Then for completeness' sake, add the csv extension:
for file in *foo_mini; do mv "$file" "${file/foo_mini/foo_mini.csv}"; done
You thus have:
0foo_mini.csv
1foo_mini.csv
etc...
It's then a cinch to read them in with Pandas as separate dataframes, and concat them if you like.
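Reading the pieces back in might look like the sketch below. It assumes the split files follow the `{n}foo_mini.csv` naming from above and are whitespace-separated as in the question; `load_sections` is a hypothetical helper name, not something Pandas provides.

```python
import glob
import os
import pandas as pd

def load_sections(directory='.', pattern='*foo_mini.csv'):
    """Read every split file into its own DataFrame, keyed by file name."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    # sep=r'\s+' assumes whitespace-separated columns, as in the question;
    # drop it (the default is ',') for comma-separated files
    return {os.path.basename(p): pd.read_csv(p, sep=r'\s+') for p in paths}
```

Calling `frames = load_sections()` gives one dataframe per section, and `pd.concat(frames)` stacks them into a single frame indexed by file name, with NaN wherever a section lacks a column.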