I'm trying to pull a txt file which has two series of data into pandas. So far I've tried the variations below which I've source from other posts on stack. So far it will only read in as one series.
The data I'm using is available here
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", delim_whitespace=True, header=None)
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, sep="/t")
icdencoding = pd.read_table("data/icd10cm_codes_2017.txt", header=None, delimiter=r"\s+")
I'm sure I'm doing something really obviously wrong but I can't see it.
try to use sep=r'\s{2,}'
as separator - it means use as separator two or more spaces or tabs:
In [28]: df = pd.read_csv(url, sep=r'\s{2,}', engine='python', header=None, names=['ID','Name'])
In [29]: df
Out[29]:
ID Name
0 A000 Cholera due to Vibrio cholerae 01, biovar cholerae
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
5 A0102 Typhoid fever with heart involvement
6 A0103 Typhoid pneumonia
7 A0104 Typhoid arthritis
8 A0105 Typhoid osteomyelitis
9 A0109 Typhoid fever with other complications
10 A011 Paratyphoid fever A
11 A012 Paratyphoid fever B
12 A013 Paratyphoid fever C
13 A014 Paratyphoid fever, unspecified
14 A020 Salmonella enteritis
15 A021 Salmonella sepsis
16 A0220 Localized salmonella infection, unspecified
17 A0221 Salmonella meningitis
18 A0222 Salmonella pneumonia
19 A0223 Salmonella arthritis
20 A0224 Salmonella osteomyelitis
21 A0225 Salmonella pyelonephritis
22 A0229 Salmonella with other localized infection
23 A028 Other specified salmonella infections
24 A029 Salmonella infection, unspecified
.. ... ...
671 B188 Other chronic viral hepatitis
672 B189 Chronic viral hepatitis, unspecified
673 B190 Unspecified viral hepatitis with hepatic coma
674 B1910 Unspecified viral hepatitis B without hepatic coma
675 B1911 Unspecified viral hepatitis B with hepatic coma
676 B1920 Unspecified viral hepatitis C without hepatic coma
677 B1921 Unspecified viral hepatitis C with hepatic coma
678 B199 Unspecified viral hepatitis without hepatic coma
679 B20 Human immunodeficiency virus [HIV] disease
680 B250 Cytomegaloviral pneumonitis
681 B251 Cytomegaloviral hepatitis
682 B252 Cytomegaloviral pancreatitis
683 B258 Other cytomegaloviral diseases
684 B259 Cytomegaloviral disease, unspecified
685 B260 Mumps orchitis
686 B261 Mumps meningitis
687 B262 Mumps encephalitis
688 B263 Mumps pancreatitis
689 B2681 Mumps hepatitis
690 B2682 Mumps myocarditis
691 B2683 Mumps nephritis
692 B2684 Mumps polyneuropathy
693 B2685 Mumps arthritis
694 B2689 Other mumps complications
695 B269 Mumps without complication
[696 rows x 2 columns]
alternatively you can use read_fwf() method
Your file is a fixed width file so you can use read_fwf
, here the default params are able to infer the column widths:
In [106]:
df = pd.read_fwf(r'icd10cm_codes_2017.txt', header=None)
df.head()
Out[106]:
0 1
0 A000 Cholera due to Vibrio cholerae 01, biovar chol...
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
If you know the names you want for the column names you can pass these to read_fwf
:
In [107]:
df = pd.read_fwf(r'C:\Users\alanwo\Downloads\icd10cm_codes_2017.txt', header=None, names=['col1', 'col2'])
df.head()
Out[107]:
col1 col2
0 A000 Cholera due to Vibrio cholerae 01, biovar chol...
1 A001 Cholera due to Vibrio cholerae 01, biovar eltor
2 A009 Cholera, unspecified
3 A0100 Typhoid fever, unspecified
4 A0101 Typhoid meningitis
Or just overwrite the columns
attribute after reading:
df.columns = ['col1', 'col2']
As to why what you tried failed, read_table
uses tabs as the default separator but the file just has spaces and is fixed width
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With