How to read multiple data sets, and create a single dataframe with a year column

Question

I would like to read multiple data sets and combine them into a single Pandas dataframe with a year column.

My sample data sets include newyork2000.txt, newyork2001.txt, newyork2002.txt.

Each data set contains 'address' and 'price'.

Below is the newyork2000.txt:

253 XXX st, 150000
2567 YYY st, 200000
...
3896 ZZZ rd, 350000

My final single dataframe should look like this:

year address      price
2000 253 XXX st   150000
2000 2567 YYY st  200000
...
2000 3896 ZZZ rd  350000
...
2002 789 XYZ ave  450000

So, I need to combine all data sets, create the year column, and name the columns.

Here is my code to create a single dataframe:

import pandas as pd

years = [2000, 2001, 2002]
df = []
for i in years:
    df.append(pd.read_csv("newyork" + str(i) + ".txt", header=None))
dfs = pd.concat(df)

However, I could not work out how to create the year column or name the columns. Please help me solve this problem.

Trenton McKinney · Accepted Answer

It is preferable to extract the year from the filename programmatically, rather than to maintain a list of years by hand.
Use pathlib with .glob to find the files, use the .stem attribute to get the filename without its extension, and then slice the year from the stem with [-4:]. This works provided the filenames are consistent, with the year as the last 4 characters of the stem.
- .stem returns the final path component (e.g. 'newyork2000'), without its suffix (e.g. '.txt')
Use pandas.DataFrame.insert to add the 'year' column at a specific position in the dataframe. This method modifies the dataframe in place and returns None, so do not write x = x.insert(...).
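
As a quick illustration of the two points above, here is a minimal sketch. The path and the one-row dataframe are made up for the example (the file does not need to exist for .stem to work), so treat it as a demonstration, not as part of the solution itself.

```python
from pathlib import Path

import pandas as pd

# hypothetical path, used only to illustrate .stem; nothing is read from disk
p = Path("data/newyork2000.txt")
stem = p.stem          # final path component without the suffix
year = int(stem[-4:])  # last four characters of the stem, converted to int

# made-up single-row dataframe standing in for one file's contents
x = pd.DataFrame({"address": ["253 XXX st"], "price": [150000]})

# insert modifies x in place and returns None
result = x.insert(0, "year", year)

print(stem, year, result)
print(x)
```

Because insert returns None, assigning its result back to x would replace the dataframe with None.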

import pandas as pd
from pathlib import Path

# set the file path
file_path = Path('e:/PythonProjects/stack_overflow/data/example')

# find your files; sorted() keeps them in filename order, since glob does not guarantee an order
files = sorted(file_path.glob('newyork*.txt'))

# create a list of dataframes
df_list = list()

for f in files:
    # extract the year from the filename, by taking the last four characters of the stem
    year = f.stem[-4:]
    
    # read the file and add column names
    x = pd.read_csv(f, header=None, names=['address', 'price'])
    
    # add a year column at index 0; use int(year) if the year should be an int, otherwise use only year
    x.insert(0, 'year', int(year))
    
    # append to the list
    df_list.append(x)
    
# create one dataframe from the list of dataframes
df = pd.concat(df_list).reset_index(drop=True)

Result

 year      address   price
 2000   253 XXX st  150000
 2000  2567 YYY st  200000
 2000  3896 ZZZ rd  350000
 2001  456 XYZ ave  650000
 2002  789 XYZ ave  450000
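
If the year is not always the last four characters of the filename, slicing with [-4:] will pick up the wrong text. One possible alternative, sketched here under the assumption that each filename contains exactly one run of four digits, is to pull the year out with a regular expression. The helper name year_from_name is hypothetical and is not part of pandas or pathlib.

```python
import re
from pathlib import Path


def year_from_name(path):
    """Return the first 4-digit run in the file stem as an int, or None if there is none.

    Hypothetical helper for illustration; assumes the year is the only
    4-digit run in the filename.
    """
    match = re.search(r"\d{4}", Path(path).stem)
    return int(match.group()) if match else None


print(year_from_name("newyork2000.txt"))
print(year_from_name("2001_newyork.txt"))
print(year_from_name("newyork.txt"))
```

Inside the loop, int(year) could then be replaced by year_from_name(f), with a check for None before inserting the column.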

Sample data files

'newyork2000.txt'

253 XXX st, 150000
2567 YYY st, 200000
3896 ZZZ rd, 350000

'newyork2001.txt'

456 XYZ ave, 650000

'newyork2002.txt'

789 XYZ ave, 450000
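
To try the approach without setting up a data folder, the following sketch writes the three sample files above into a temporary directory and then combines them. The temporary directory, the use of skipinitialspace=True (to drop the space after each comma), and ignore_index=True in place of reset_index(drop=True) are choices made for this example, not requirements of the method.

```python
import tempfile
from pathlib import Path

import pandas as pd

# sample contents copied from the files listed above
samples = {
    "newyork2000.txt": "253 XXX st, 150000\n2567 YYY st, 200000\n3896 ZZZ rd, 350000\n",
    "newyork2001.txt": "456 XYZ ave, 650000\n",
    "newyork2002.txt": "789 XYZ ave, 450000\n",
}

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # write the sample files into the temporary folder
    for name, text in samples.items():
        (folder / name).write_text(text)

    frames = []
    # sorted() gives a predictable file order
    for f in sorted(folder.glob("newyork*.txt")):
        x = pd.read_csv(f, header=None, names=["address", "price"], skipinitialspace=True)
        # year is taken from the last four characters of the stem
        x.insert(0, "year", int(f.stem[-4:]))
        frames.append(x)

# ignore_index=True renumbers the rows from 0
df = pd.concat(frames, ignore_index=True)
print(df)
```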
