I have a text file with the following format:
1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
I need to convert this text to a DataFrame with the following format:
Id Term weight
1 frack 0.733
1 shale 0.700
10 space 0.645
10 station 0.327
10 nasa 0.258
4 celebr 0.262
4 bahar 0.345
How can I do it?
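You can use read_csv(), which reads any delimited text file, not only files with a .csv extension. To load a text file into a pandas DataFrame, you provide the filename, the separator/delimiter, and the row that holds the column names, if there is one. Note that the file in the question has no header row and a varying number of fields per line, so read_csv() on its own will not produce the desired long format; the rows still need to be reshaped afterwards.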
To convert data types in pandas, there are three basic options: use astype() to force an appropriate dtype; write a custom function to convert the data; or use pandas functions such as to_numeric() or to_datetime().
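As a rough illustration of how read_csv() and the dtype conversions could fit together for this file, here is a minimal sketch. It assumes pandas 0.25 or later (for explode()), and it reads the sample data from an in-memory string through io.StringIO rather than from a real file; the names raw, wide, long_df and parts are made up for this example.

```python
import io

import pandas as pd

# Sample data copied from the question; in practice pass a file path instead.
raw = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
"""

# Split each line once, on the colon, into an Id and the rest of the line.
wide = pd.read_csv(io.StringIO(raw), sep=":", header=None, names=["Id", "rest"])

# Build one list of "term weight" strings per line, then one row per list element.
pairs = wide["rest"].str.strip(" ,").str.split(",")
long_df = wide[["Id"]].assign(pair=pairs).explode("pair").reset_index(drop=True)

# Split every "term weight" string into two columns and fix the dtypes.
parts = long_df["pair"].str.split(expand=True)
df = pd.DataFrame({
    "Id": long_df["Id"].astype("int64"),   # astype() forces the dtype
    "Term": parts[0],
    "weight": pd.to_numeric(parts[1]),     # to_numeric() parses the strings
})
print(df)
```

Splitting on the colon first keeps the commas inside a single column, which is what makes the later explode() step possible.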
Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.
import re
import pandas as pd

SEP_RE = re.compile(r":\s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)

def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]
    return list(_parse(filepath))
Example:
>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
... columns=["Id", "Term", "weight"])
>>>
>>> df
   Id     Term  weight
0   1    frack   0.733
1   1    shale   0.700
2  10    space   0.645
3  10  station   0.327
4  10     nasa   0.258
5   4   celebr   0.262
6   4    bahar   0.345
>>> df.dtypes
Id          int64
Term       object
weight    float64
dtype: object
SEP_RE looks for an initial separator: a literal colon (:) followed by one or more whitespace characters. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that the format of your entire dataset consistently follows the example format laid out in your question.
After that, DATA_RE.finditer() deals with each (term, weight) pair extracted from rest. The string rest itself will look like "frack 0.733, shale 0.700,". .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).
An easy way to visualize this is to use an example line from your file as a string:
>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']
Now you have the initial ID and rest of the components, which you can unpack into two identifiers.
>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'
The better way to visualize it is with pdb. Give it a try if you dare ;)
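The parse() function above raises an error as soon as a line does not split into exactly an ID and a remainder (a blank line, for instance). If your file might contain such lines, a more tolerant variant could look like the sketch below. The function name parse_lines, the choice to silently skip bad lines, and the sample list are assumptions made for illustration; they are not part of the original answer.

```python
import re

SEP_RE = re.compile(r":\s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)


def parse_lines(lines):
    """Yield [id, term, weight] rows, skipping lines that do not fit the format."""
    for line in lines:
        parts = SEP_RE.split(line.strip(), maxsplit=1)
        # A well-formed line splits into exactly two parts with a numeric ID.
        if len(parts) != 2 or not parts[0].isdecimal():
            continue
        id_, rest = parts
        for match in DATA_RE.finditer(rest):
            yield [int(id_), match["term"], float(match["weight"])]


sample = [
    "1: frack 0.733, shale 0.700,\n",
    "\n",                     # blank line: skipped
    "no separator here\n",    # no "ID: " prefix: skipped
    "4: celebr 0.262, bahar 0.345\n",
]
rows = list(parse_lines(sample))
print(rows)
```

Because parse_lines() accepts any iterable of strings, you can pass it an open file object as well as a list. Whether skipping bad lines is acceptable, or whether you would rather log or raise, depends on your data.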
This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.
For instance, it assumes that each Term can only contain upper or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re character classes such as \w.
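To make the ASCII limitation concrete, here is a small sketch comparing the original pattern with a Unicode-aware one. The pattern UNICODE_RE and the sample string are hypothetical; [^\W\d_] is one common way to say "any letter", that is, \w without digits and underscore.

```python
import re

# ASCII-only pattern from the answer above.
ASCII_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)
# Assumed variant: [^\W\d_] matches any Unicode letter (\w minus digits and "_").
UNICODE_RE = re.compile(r"(?P<term>[^\W\d_]+)\s+(?P<weight>\d+\.\d+)")

# Hypothetical data, not from the question; "caf\u00e9" is "cafe" with an accented e.
rest = "caf\u00e9 0.512, nasa 0.258,"

ascii_terms = [m["term"] for m in ASCII_RE.finditer(rest)]
unicode_terms = [m["term"] for m in UNICODE_RE.finditer(rest)]

print(ascii_terms)    # ['nasa']
print(unicode_terms)  # ['café', 'nasa']
```

Note that the ASCII-only pattern does not raise an error on the accented term; it silently drops it, which is easy to miss in a large file.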
You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:
import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ),
    columns=["Id", "Term", "weight"]
)
print(df)
#   Id     Term  weight
#0   1    frack   0.733
#1   1    shale   0.700
#2  10    space   0.645
#3  10  station   0.327
#4  10     nasa   0.258
#5   4   celebr   0.262
#6   4    bahar   0.345
Explanation
I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :
print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'],
# ['10', ' space 0.645, station 0.327, nasa 0.258'],
# ['4', ' celebr 0.262, bahar 0.345']]
The next step is to split on the comma to separate the values, and assign the Id to each set of values:
print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
# ('10', 'station', '0.327'),
# ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]
Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.
Note: The * unpacking inside a tuple literal requires Python 3.5 or later (PEP 448).
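The map/lambda version above leaves all three columns as strings. If you find the nested expression hard to read, the same steps can be written with plain loops that also convert the types along the way. This is only a sketch: the function name parse_text and the decision to ignore blank lines are assumptions added for illustration.

```python
def parse_text(text):
    """Return (Id, Term, weight) tuples using plain loops instead of map/lambda."""
    rows = []
    for line in text.splitlines():
        line = line.strip(" ,")        # drop surrounding spaces and trailing comma
        if not line:
            continue                   # ignore blank lines
        id_, _, rest = line.partition(":")
        for pair in rest.split(","):
            term, weight = pair.split()
            rows.append((int(id_), term, float(weight)))
    return rows


text = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """

rows = parse_text(text)
print(rows[:2])  # [(1, 'frack', 0.733), (1, 'shale', 0.7)]
```

Passing rows to pd.DataFrame(rows, columns=["Id", "Term", "weight"]) would then give an integer Id column and a float weight column, with no separate conversion step.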