
How to read multiple data sets, and create a single dataframe with a year column

I would like to read multiple data sets and combine them into a single Pandas dataframe with a year column.

My sample data sets include newyork2000.txt, newyork2001.txt, newyork2002.txt.

Each data set contains 'address' and 'price'.

Below is the newyork2000.txt:

253 XXX st, 150000
2567 YYY st, 200000
...
3896 ZZZ rd, 350000   

My final single dataframe should look like this:

year address      price
2000 253 XXX st   150000
2000 2567 YYY st  200000
...
2000 3896 ZZZ rd  350000
...
2002 789 XYZ ave  450000

So, I need to combine all data sets, create the year column, and name the columns.

Here is my code to create a single dataframe:

import pandas as pd

years = [2000, 2001, 2002]
df = []
for i in years:
    df.append(pd.read_csv("newyork" + str(i) + ".txt", header=None))
dfs = pd.concat(df)

However, I could not work out how to create the year column or name the columns. Please help me solve this problem.

ph7see asked Nov 20 '25 20:11


1 Answer

  • It is preferable to extract the year from the filename programmatically rather than to create a list of years manually.
  • Use pathlib with .glob to find the files, use the .stem attribute to get the filename without its extension, and then slice the year from the stem with [-4:]. This works provided the filenames are consistent, with the year as the last 4 characters of the stem.
    • .stem returns the final path component (e.g. 'newyork2000') without its suffix (e.g. '.txt').
  • Use pandas.DataFrame.insert to add the 'year' column at a specific position in the dataframe. This method inserts the column in place and returns None, so do not use x = x.insert(...).
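The two building blocks in the bullets can be checked on their own before touching any files. This is only a small sketch: the path and the one-row dataframe are made-up stand-ins, and the file does not need to exist for .stem to work.

```python
from pathlib import Path

import pandas as pd

# hypothetical path, used only to show what .stem returns
f = Path("newyork2000.txt")
stem = f.stem            # final path component without the suffix
year = int(stem[-4:])    # last four characters of the stem, as an int

# made-up one-row dataframe standing in for one file's contents
x = pd.DataFrame({"address": ["253 XXX st"], "price": [150000]})

# insert modifies x in place; its return value is None
result = x.insert(0, "year", year)

print(stem, year, result, list(x.columns))
```

Because insert returns None, assigning its result back to x would discard the dataframe.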
import pandas as pd
from pathlib import Path

# set the file path
file_path = Path('e:/PythonProjects/stack_overflow/data/example')

# find your files
files = file_path.glob('newyork*.txt')

# create a list of dataframes
df_list = list()

for f in files:
    # extract year from filename, by slicing the last four characters off the stem
    year = (f.stem)[-4:]
    
    # read the file and add column names
    x = pd.read_csv(f, header=None, names=['address', 'price'])
    
    # add a year column at index 0; use int(year) if the year should be an int, otherwise use only year
    x.insert(0, 'year', int(year))
    
    # append to the list
    df_list.append(x)
    
# create one dataframe from the list of dataframes
df = pd.concat(df_list).reset_index(drop=True)

Result

 year      address   price
 2000   253 XXX st  150000
 2000  2567 YYY st  200000
 2000  3896 ZZZ rd  350000
 2001  456 XYZ ave  650000
 2002  789 XYZ ave  450000
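If the same steps are needed in more than one place, they could be wrapped in a function. The sketch below is one possible variant, not the only way to do it: read_yearly_files is a hypothetical helper name, the regular expression is swapped in for the [-4:] slice so that files without a trailing four-digit year are skipped, and sorted() is added because glob does not promise any particular file order. The demo writes throwaway files to a temporary directory so that it runs anywhere.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def read_yearly_files(folder, pattern="newyork*.txt"):
    """Hypothetical helper: combine matching files into one dataframe with a year column."""
    frames = []
    # sorted() gives a deterministic row order; glob order is not guaranteed
    for f in sorted(Path(folder).glob(pattern)):
        # look for four digits at the end of the stem, e.g. 'newyork2000' -> '2000'
        match = re.search(r"(\d{4})$", f.stem)
        if match is None:
            continue  # skip files whose name does not end in a four-digit year
        # skipinitialspace drops the blank that follows each comma in the sample data
        x = pd.read_csv(f, header=None, names=["address", "price"], skipinitialspace=True)
        x.insert(0, "year", int(match.group(1)))
        frames.append(x)
    # note: pd.concat raises ValueError if no file matched and frames is empty
    return pd.concat(frames, ignore_index=True)


# demo with throwaway files that mimic the sample data
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "newyork2000.txt").write_text("253 XXX st, 150000\n2567 YYY st, 200000\n")
    Path(tmp, "newyork2001.txt").write_text("456 XYZ ave, 650000\n")
    df = read_yearly_files(tmp)

print(df)
```

Passing ignore_index=True to pd.concat has the same effect here as calling .reset_index(drop=True) on the result.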

Sample data files

  • 'newyork2000.txt'
253 XXX st, 150000
2567 YYY st, 200000
3896 ZZZ rd, 350000 
  • 'newyork2001.txt'
456 XYZ ave, 650000
  • 'newyork2002.txt'
789 XYZ ave, 450000
Trenton McKinney answered Nov 22 '25 10:11


