Parse prettyprinted tabular data with pandas

Question

What is the best way to copy a table that contains different delimeters, spaces in column names etc. The function pd.read_clipboard() cannot manage this task on its own.

Example 1:

| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
| 31-36        | 3 | 5  | 5  | 2 |
| 36-41        | 2 | 4  | 1  | 7 |
| 41-46        | 0 | 1  | 3  | 2 |
| 46-51        | 0 | 0  | 2  | 3 |

Expected result:

 Age Category  A  B   C   D    
 21-26         2  2   4   1 
 26-31         7  11  12  5 
 31-36         3  5   5   2 
 36-41         2  4   1   7 
 41-46         0  1   3   2 
 46-51         0  0   2   3

EDIT:

Example 2:

+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+

Expected result:

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

I look for a universal approach that can be applied to the most common table types.

jorijnsmit · Accepted Answer

The reason this is so complicated is that these type of ASCII tables or not really designed with data transfer in mind. Their true function is to depict the data in a visually pleasing way.

This doesn't mean it is not possible to use it to transfer into pandas! Let's start with .read_clipboard():

df = pd.read_clipboard(sep='|').iloc[1:,1:-1]

Instead of using a comma as the (default) separator we define | to be the separator.

The .iloc[1:,1:-1] gets rid of the first row (-----------) and the first and last columns: because of the trailing | at the beginning and end of each line pandas sees an 'empty' column there.

Now all that is left is to strip whitespace from the column names and values:

stripped_columns = []
for column_name in df.columns:
    df[column_name] = df[column_name].str.strip()
    stripped_columns.append(column_name.strip())
df.columns = stripped_columns

And if you want Age Category to be your index:

df.set_index('Age Category', inplace=True)

Last pass I would make would be to make sure all your columns are now actually holding numbers and not strings:

df = df.astype('int')

Resulting in:

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 21-26 to 46-51
Data columns (total 4 columns):
A    6 non-null int64
B    6 non-null int64
C    6 non-null int64
D    6 non-null int64
dtypes: int64(4)
memory usage: 400.0+ bytes

I am not sure what your reason is for reading it from the clipboard. A bit more elegant solution might be to paste it into a .csv file and use the more advanced features .read_csv() has to offer. The necessary transformations however would remain the same.

Chris Adams · Answer

Here is another potential solution using re.sub and io.StringIO :

from io import StringIO
import re

text1 = """
| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
| 31-36        | 3 | 5  | 5  | 2 |
| 36-41        | 2 | 4  | 1  | 7 |
| 41-46        | 0 | 1  | 3  | 2 |
| 46-51        | 0 | 0  | 2  | 3 |
"""

text2= """
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+
"""

df1 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', '  ', text1)), sep='\s{2,}', engine='python')
df2 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', '  ', text2)), sep='\s{2,}', engine='python')

[out]

df1

  Age Category  A   B   C  D
0        21-26  2   2   4  1
1        26-31  7  11  12  5
2        31-36  3   5   5  2
3        36-41  2   4   1  7
4        41-46  0   1   3  2
5        46-51  0   0   2  3

df2

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

Parse prettyprinted tabular data with pandas

Tags:

python

clipboard

copy-paste

pandas

Mykola Zotko

2 Answers

jorijnsmit

Chris Adams

Recent Activity

Donate For Us

Parse prettyprinted tabular data with pandas

Tags:

python

clipboard

copy-paste

pandas

Mykola Zotko

2 Answers

jorijnsmit

Chris Adams

Related questions

Recent Activity

Donate For Us