What is the best way to copy a table that contains different delimeters, spaces in column names etc. The function pd.read_clipboard()
cannot manage this task on its own.
Example 1:
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
Expected result:
Age Category A B C D
21-26 2 2 4 1
26-31 7 11 12 5
31-36 3 5 5 2
36-41 2 4 1 7
41-46 0 1 3 2
46-51 0 0 2 3
EDIT:
Example 2:
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
Expected result:
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
I look for a universal approach that can be applied to the most common table types.
The reason this is so complicated is that these type of ASCII tables or not really designed with data transfer in mind. Their true function is to depict the data in a visually pleasing way.
This doesn't mean it is not possible to use it to transfer into pandas! Let's start with .read_clipboard()
:
df = pd.read_clipboard(sep='|').iloc[1:,1:-1]
Instead of using a comma as the (default) separator we define |
to be the separator.
The .iloc[1:,1:-1]
gets rid of the first row (-----------
) and the first and last columns: because of the trailing |
at the beginning and end of each line pandas
sees an 'empty' column there.
Now all that is left is to strip whitespace from the column names and values:
stripped_columns = []
for column_name in df.columns:
df[column_name] = df[column_name].str.strip()
stripped_columns.append(column_name.strip())
df.columns = stripped_columns
And if you want Age Category
to be your index:
df.set_index('Age Category', inplace=True)
Last pass I would make would be to make sure all your columns are now actually holding numbers and not strings:
df = df.astype('int')
Resulting in:
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 21-26 to 46-51
Data columns (total 4 columns):
A 6 non-null int64
B 6 non-null int64
C 6 non-null int64
D 6 non-null int64
dtypes: int64(4)
memory usage: 400.0+ bytes
I am not sure what your reason is for reading it from the clipboard. A bit more elegant solution might be to paste it into a .csv
file and use the more advanced features .read_csv()
has to offer. The necessary transformations however would remain the same.
Here is another potential solution using re.sub
and io.StringIO
:
from io import StringIO
import re
text1 = """
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
"""
text2= """
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
"""
df1 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', ' ', text1)), sep='\s{2,}', engine='python')
df2 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', ' ', text2)), sep='\s{2,}', engine='python')
[out]
df1
Age Category A B C D
0 21-26 2 2 4 1
1 26-31 7 11 12 5
2 31-36 3 5 5 2
3 36-41 2 4 1 7
4 41-46 0 1 3 2
5 46-51 0 0 2 3
df2
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With