
Use regex to extract file path and save it in Python

I have a text file, file.txt, that holds a lot of file paths:

C:\data\AS\WO\AS_WOP_1PPPPPP20070506.bin
C:\data\AS\WO\AS_WOP_1PPPPPP20070606.bin
C:\data\AS\WO\AS_WOP_1PPPPPP20070708.bin
C:\data\AS\WO\AS_WOP_1PPPPPP20070808.bin
...

This is what I tried with regex to extract the date from each path:

import re

textfile = open('file.txt', 'r')
filetext = textfile.read()
textfile.close()

data = []

for line in filetext:
    matches = re.search("AS_[A-Z]{3}_(.{7})([0-9]{4})([0-9]{2})([0-9]{2})", line)
    data.append(line)

It does not give what I want.

My output should be like this:

year    month
2007     05
2007     06
2007     07
2007     08

and then save it as a list of lists:

[['2007', '5'], ['2007', '6'], ['2007', '7'], ['2007', '8']]

or save it as a pandas Series.

Is there any way to get what I want with regex?

GeoCom asked Nov 03 '15 16:11


2 Answers

You can simplify your regex to this:

/(....)(..)..\.bin$/

Group 1 will hold the year and Group 2 the month. I assume the format is consistent throughout the file.

Here, . matches any character and \. matches a literal dot. $ anchors the match at the end of the string. So I'm matching .bin at the end of the line, skipping the two day characters, and capturing only the year and month.
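
For what it's worth, here is a minimal sketch of how that pattern might be applied in Python. The helper name extract_year_month is hypothetical, and the sample list stands in for the lines of file.txt. Note that Python does not use the /.../ delimiters, so they are dropped, and that the loop iterates over lines rather than over the characters of one big string.

```python
import re

# Pattern from the answer above, without the /.../ delimiters.
PATTERN = re.compile(r"(....)(..)..\.bin$")

def extract_year_month(lines):
    """Return [year, month] pairs for every line that matches PATTERN.

    Hypothetical helper, not part of the original answer.
    """
    result = []
    for line in lines:
        # strip() removes the trailing newline so that $ lines up with ".bin"
        match = PATTERN.search(line.strip())
        if match:
            result.append([match.group(1), match.group(2)])
    return result

# Sample paths copied from the question; an open file object could be
# passed instead, since iterating over a file yields one line at a time.
sample = [
    r"C:\data\AS\WO\AS_WOP_1PPPPPP20070506.bin",
    r"C:\data\AS\WO\AS_WOP_1PPPPPP20070606.bin",
]
print(extract_year_month(sample))  # [['2007', '05'], ['2007', '06']]
```

The months come back zero-padded as strings; if '5' rather than '05' is wanted, str(int(match.group(2))) would do that.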

Amit Joki answered Sep 29 '22 09:09


Try this using pandas:

import pandas as pd

df = pd.read_csv('yourfile.txt', header=None)
df.columns = ['paths']
# the pandas string method extract takes a regex (raw string, so \d is not treated as a string escape)
df['paths'].str.extract(r'(\d{4})(\d{2})')

Output:

       0    1
0   2007    05
1   2007    06
2   2007    07
3   2007    08
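
To get from that DataFrame to the list of lists the question asks for, something along these lines should work. This is only a sketch: the Series below is built from paths copied from the question instead of being read from a file, and the column names year and month are my own choice.

```python
import pandas as pd

# Sample paths copied from the question, standing in for the file contents.
paths = pd.Series([
    r"C:\data\AS\WO\AS_WOP_1PPPPPP20070506.bin",
    r"C:\data\AS\WO\AS_WOP_1PPPPPP20070606.bin",
])

# With two capture groups, str.extract returns a DataFrame with columns 0 and 1.
parts = paths.str.extract(r"(\d{4})(\d{2})")
parts.columns = ["year", "month"]  # assumed names, chosen for readability

# Convert the DataFrame rows to a plain list of lists of strings.
pairs = parts.values.tolist()
print(pairs)
```

Each column of parts (for example parts["year"]) is already a pandas Series, which covers the other form mentioned in the question.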

JAB answered Sep 29 '22 09:09