I need to convert a markdown table into a pandas DataFrame. I've managed to do this using the pd.read_csv function with '|' as the separator, but it seems like there's some additional cleanup required. Specifically, I need to remove the row containing '-----', which is used for table separation, and I also want to get rid of the last column.
Here's a simplified example of what I'm doing:
import pandas as pd
from io import StringIO
# The text containing the table
text = """
| Some Title | Some Description | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game | 5 |
| Bloodborne | This one is even better | 2 |
| Sekiro | This one is also pretty good | 110101 |
"""
# Use StringIO to create a file-like object from the text
text_file = StringIO(text)
# Read the table using pandas read_csv with '|' as the separator
df = pd.read_csv(text_file, sep='|', skipinitialspace=True)
# Remove leading/trailing whitespace from column names
df.columns = df.columns.str.strip()
# Drop the empty first column created by the leading '|'
df = df.iloc[:, 1:]
Is there a more elegant and efficient way to convert a markdown table into a DataFrame without needing to perform these additional cleanup steps? I'd appreciate any suggestions or insights on improving this process.
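For reference, the cleanup steps described above (dropping the '-----' row and the empty edge columns) could be folded into the read_csv approach roughly as follows. This is only a sketch: read_markdown_table is a made-up helper name, and it assumes the separator is always the second line of the table and that no real column is completely empty.

```python
from io import StringIO

import pandas as pd

text = """
| Some Title | Some Description | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game | 5 |
| Bloodborne | This one is even better | 2 |
| Sekiro | This one is also pretty good | 110101 |
"""


def read_markdown_table(md):
    """Hypothetical helper (not part of pandas): markdown table -> DataFrame."""
    df = pd.read_csv(
        StringIO(md.strip()),  # strip() makes the header row line 0
        sep="|",
        skiprows=[1],  # line 1 is the |-----| separator row
        skipinitialspace=True,
    )
    # The leading and trailing '|' produce two all-empty columns; drop them.
    # Assumption: no genuine column is entirely empty.
    df = df.dropna(axis=1, how="all")
    df.columns = df.columns.str.strip()
    # Text cells keep their trailing spaces, so strip the non-numeric columns.
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].str.strip()
    return df


df = read_markdown_table(text)
print(df)
```

The separator row is skipped at read time instead of being filtered out afterwards, and dropna removes both edge columns in one step.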
One option is to extract the cells with a regular expression, like this:
import re
import pandas as pd
text = """
| Some Title | Some Description | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game | 5 |
| Bloodborne | This one is even better | 2 |
| Sekiro | This one is also pretty good | 110101 |
"""
# Match rows with exactly three cells; the '-----' separator row does not match
pattern = r"\| ([\w\s]+) \| ([\w\s]+) \| ([\w\s]+) \|"
# Use the findall function to extract all rows that match the pattern
matches = re.findall(pattern, text)
# Extract the header and data rows
header = matches[0]
data = matches[1:]
# Create a pandas DataFrame using the extracted header and data rows
df = pd.DataFrame(data, columns=header)
# Optionally, convert numerical columns to appropriate types
df['Some Number'] = df['Some Number'].astype(int)
print(df)
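This works, and it does not require the io import. Note that the pattern only matches cells made up of letters, digits, underscores, and whitespace, so a row containing punctuation in any cell would be silently skipped.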
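The pattern above is tied to exactly three columns. As a rough sketch, the same regex idea can be loosened to handle any number of columns by matching whole rows and splitting the cells afterwards. The names ROW, SEPARATOR, and markdown_table_to_df are made up for illustration, and the code assumes that cells never contain a pipe character.

```python
import re

import pandas as pd

text = """
| Some Title | Some Description | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game | 5 |
| Bloodborne | This one is even better | 2 |
| Sekiro | This one is also pretty good | 110101 |
"""

# A table row: optional whitespace, then '|' ... '|'
ROW = re.compile(r"^\s*\|(.+)\|\s*$")
# A separator row: only pipes, dashes, colons, and whitespace, with at least one dash
SEPARATOR = re.compile(r"^\s*\|[\s:|-]*-[\s:|-]*\|\s*$")


def markdown_table_to_df(md):
    """Hypothetical helper: parse a simple markdown table into a DataFrame."""
    rows = []
    for line in md.splitlines():
        if SEPARATOR.match(line):
            continue  # skip the |-----| row
        match = ROW.match(line)
        if match:
            # Split the text between the outer pipes into stripped cells
            rows.append([cell.strip() for cell in match.group(1).split("|")])
    header, *data = rows  # first matching row is the header
    return pd.DataFrame(data, columns=header)


df = markdown_table_to_df(text)
print(df)
```

All values come back as strings, so a numeric column still needs a conversion step such as the astype(int) call shown earlier.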
Another answer proposed a method based on the re module, which is a nice alternative. Here is a second method that uses only plain string splitting:
import pandas as pd
text = """
| Some Title | Some Description | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game | 5 |
| Bloodborne | This one is even better | 2 |
| Sekiro | This one is also pretty good | 110101 |
"""
lines = text.split("\n")
# lines[0] is empty because the text starts with a newline; lines[1] is the header
header = [col.strip() for col in lines[1].strip().strip("|").split("|")]
data = []
# Start at line 3 to skip the '-----' separator row on line 2
for line in lines[3:]:
    # Stop once we hit an empty line
    if not line.strip():
        break
    cols = [col.strip() for col in line.strip().strip("|").split("|")]
    row = dict(zip(header, cols))
    data.append(row)
df = pd.DataFrame(data)
print(df)
which will produce this:
   Some Title              Some Description  Some Number
0  Dark Souls            This is a fun game            5
1  Bloodborne       This one is even better            2
2      Sekiro  This one is also pretty good       110101
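A DataFrame built by splitting strings holds every column as text. Rather than casting each column by hand, a small loop over pd.to_numeric can convert whatever parses as a number and leave the rest alone. This is a sketch only: convert_numeric_columns is a made-up helper name, and the small frame below stands in for the parsed table.

```python
import pandas as pd

# Stand-in for a parsed table where every cell is still a string
df = pd.DataFrame(
    {
        "Some Title": ["Dark Souls", "Bloodborne", "Sekiro"],
        "Some Number": ["5", "2", "110101"],
    }
)


def convert_numeric_columns(frame):
    """Hypothetical helper: convert columns that parse as numbers."""
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # leave non-numeric columns as text
    return out


typed = convert_numeric_columns(df)
print(typed.dtypes)
```

A column is converted only if every value in it parses, so a single stray string keeps the whole column as text.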