How to Efficiently Convert a Markdown Table to a DataFrame in Python?

Question

I need to convert a markdown table into a pandas DataFrame. I've managed to do this using the pd.read_csv function with '|' as the separator, but it seems like there's some additional cleanup required. Specifically, I need to remove the row containing '-----', which is used for table separation, and I also want to get rid of the last column.

Here's a simplified example of what I'm doing:

import pandas as pd
from io import StringIO

# The text containing the table
text = """
| Some Title | Some Description             | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game           | 5           |
| Bloodborne | This one is even better      | 2           |
| Sekiro     | This one is also pretty good | 110101      |
"""

# Use StringIO to create a file-like object from the text
text_file = StringIO(text)

# Read the table using pandas read_csv with '|' as the separator
df = pd.read_csv(text_file, sep='|', skipinitialspace=True)

# Remove leading/trailing whitespace from column names
df.columns = df.columns.str.strip()

# Drop the empty first column created by the leading '|'
df = df.iloc[:, 1:]
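For reference, a rough sketch of the full cleanup this approach seems to need is shown below. It assumes the table always has outer pipes and a single separator row directly under the header; the column name used for the integer conversion is specific to this example.

```python
import pandas as pd
from io import StringIO

text = """
| Some Title | Some Description             | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game           | 5           |
| Bloodborne | This one is even better      | 2           |
| Sekiro     | This one is also pretty good | 110101      |
"""

df = pd.read_csv(StringIO(text), sep="|", skipinitialspace=True)

# The leading and trailing '|' produce two columns with no values at all
df = df.dropna(axis=1, how="all")

# The first parsed row is the |-----| separator, so drop it
df = df.iloc[1:].reset_index(drop=True)

# Trim the padding left in the column names and in every cell
df.columns = df.columns.str.strip()
df = df.apply(lambda col: col.str.strip())

# Assumption: this column holds integers in the example table
df["Some Number"] = df["Some Number"].astype(int)
print(df)
```

That is four cleanup steps after the read, which is what prompts the question.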

Is there a more elegant and efficient way to convert a markdown table into a DataFrame without needing to perform these additional cleanup steps? I'd appreciate any suggestions or insights on improving this process.

SuperStew · Accepted Answer

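You can extract the rows with a regular expression. The separator row never matches the pattern, because '-' is not covered by [\w\s], so it is skipped automatically: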

import re
import pandas as pd

text = """
| Some Title | Some Description             | Some Number |
|------------|------------------------------|-------------|  
| Dark Souls | This is a fun game           | 5           |
| Bloodborne | This one is even better      | 2           |
| Sekiro     | This one is also pretty good | 110101      |
"""

pattern = r"\| ([\w\s]+) \| ([\w\s]+) \| ([\w\s]+) \|"

# Use the findall function to extract all rows that match the pattern
matches = re.findall(pattern, text)

# Extract the header and data rows
header = matches[0]
data = matches[1:]

# Create a pandas DataFrame using the extracted header and data rows
df = pd.DataFrame(data, columns=header)

# Optionally, convert numerical columns to appropriate types
df['Some Number'] = df['Some Number'].astype(int)

print(df)
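Note that the pattern above is hard-coded for three columns, and the captured cells can keep some of their padding. A possible generalization, offered as a sketch rather than a tested drop-in, matches one cell at a time so the column count does not matter. The names cell_pattern and separator_cell are made up for this example, and it assumes every row starts and ends with a pipe.

```python
import re
import pandas as pd

text = """
| Some Title | Some Description             | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game           | 5           |
| Bloodborne | This one is even better      | 2           |
| Sekiro     | This one is also pretty good | 110101      |
"""

# Capture the trimmed content between each pair of pipes on a line
cell_pattern = re.compile(r"\|\s*([^|\n]*?)\s*(?=\|)")
# A separator cell is made of dashes, optionally with alignment colons
separator_cell = re.compile(r":?-+:?")

rows = []
for line in text.splitlines():
    cells = cell_pattern.findall(line)
    if not cells:
        continue  # blank line or a line without a table row
    if all(separator_cell.fullmatch(cell) for cell in cells):
        continue  # the |-----| row under the header
    rows.append(cells)

# First remaining row is the header, the rest is data
df = pd.DataFrame(rows[1:], columns=rows[0])

# Assumption: this column is numeric in the example table
df["Some Number"] = pd.to_numeric(df["Some Number"])
print(df)
```

Because the whitespace is matched outside the capture group, the header and cell values come out already stripped.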

D.L · Answer

This approach uses plain string splitting, so it does not need the io import.

The other answer uses the re module, which is a nice alternative, so you have two methods to choose from.

import pandas as pd

text = """
| Some Title | Some Description             | Some Number |
|------------|------------------------------|-------------|  
| Dark Souls | This is a fun game           | 5           |
| Bloodborne | This one is even better      | 2           |
| Sekiro     | This one is also pretty good | 110101      |
"""

lines = text.split("\n")
header = lines[1].strip("|").split("|")

data = [] 

# Loop through lines starting from 2
for line in lines[2:]:
    
    # Break once we hit an empty line
    if not line.strip():
        break
        
    cols = line.strip("|").split("|")
    row = dict(zip(header, cols))
    data.append(row)
    
df = pd.DataFrame(data)
print(df)

which will produce this (the separator row is still included as row 0):

    Some Title    Some Description                Some Number 
0  ------------  ------------------------------  -------------
1   Dark Souls    This is a fun game              5           
2   Bloodborne    This one is even better         2           
3   Sekiro        This one is also pretty good    110101
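To get rid of that leftover dashed row and the padding around each cell, a minimal variant of the same splitting idea could look like the following. The helper name split_row is invented for this sketch, and it assumes the separator is always the second non-empty line of the table.

```python
import pandas as pd

text = """
| Some Title | Some Description             | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game           | 5           |
| Bloodborne | This one is even better      | 2           |
| Sekiro     | This one is also pretty good | 110101      |
"""

# Keep only the non-empty lines, without surrounding whitespace
lines = [line.strip() for line in text.splitlines() if line.strip()]


def split_row(line):
    """Hypothetical helper: drop the outer pipes, split, and trim each cell."""
    return [cell.strip() for cell in line.strip("|").split("|")]


header = split_row(lines[0])
# lines[1] is the |-----| separator row, so the data starts at lines[2]
data = [split_row(line) for line in lines[2:]]

df = pd.DataFrame(data, columns=header)

# Assumption: this column holds integers in the example table
df["Some Number"] = df["Some Number"].astype(int)
print(df)
```

This keeps the answer free of extra imports while returning clean column names and values.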

Tags: python, markdown, nlp

Kilian Shiliao
