I have a big CSV file and it comes with two metadata description files. One has a .sas
extension and the other a .sps
. Opening them, they describe the CSV data format and categories. The files describe the data format and possible categories of each column. For example, a column with values 1 or 2 is mapped to yes and no.
How can I use these metadata files to help me read the CSV file?
I can easily read it using read_csv, but these files are useful to automatically create my columns with the possible categories. I can create a parser for them, but there must be a package or function to do it. Maybe I'm not using the correct search terms.
Here is the .sas
file (sorry, it is Portuguese):
proc format;
Value $SG_AREA
CH='Ciךncias Humanas'
CN='Ciךncias da Natureza'
LC='Linguagens e Cףdigos'
MT='Matemבtica';
Value $TP_LINGUA
0='Inglךs'
1='Espanhol';
Value $IN_ITEM_ADAPTADO
0='Nדo'
1='Sim';
DATA WORK.ITENS_2018;
INFILE 'C:\ITENS_PROVA_2018.csv' /*local do arquivo*/
LRECL=33
FIRSTOBS=2
DLM=';'
MISSOVER
DSD ;
INPUT
CO_POSICAO : BEST2.
SG_AREA : $CHAR2.
CO_ITEM : BEST6.
TX_GABARITO : $CHAR1.
CO_HABILIDADE : BEST2.
TX_COR : $CHAR7.
CO_PROVA : BEST3.
TP_LINGUA : $CHAR1.
IN_ITEM_ADAPTADO : $CHAR1. ;
ATTRIB SG_AREA FORMAT = $SG_AREA20.;
ATTRIB TP_LINGUA FORMAT = $TP_LINGUA8.;
ATTRIB IN_ITEM_ADAPTADO FORMAT = $IN_ITEM_ADAPTADO3.;
LABEL
CO_POSICAO='Posiחדo do Item na Prova'
SG_AREA='ֱrea de Conhecimento do Item'
CO_ITEM='Cףdigo do Item'
TX_GABARITO='Gabarito do Item'
CO_HABILIDADE='Habilidade do Item'
TX_COR='Cor da Prova'
CO_PROVA='Identificador da Prova'
TP_LINGUA='Lםngua Estrangeira '
IN_ITEM_ADAPTADO='Item pertencente א prova adaptada para Ledor'
;RUN;
And here you can see the equivalent .sps
file:
GET DATA
/TYPE=TXT
/FILE= "C:\ITENS_PROVA_2018.csv" /*local do arquivo*/
/DELCASE=LINE
/DELIMITERS=";"
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE= ALL
/VARIABLES=
CO_POSICAO F2.0
SG_AREA A2
CO_ITEM F6.0
TX_GABARITO A1
CO_HABILIDADE F2.0
TX_COR A7
CO_PROVA F3.0
TP_LINGUA A1
IN_ITEM_ADAPTADO A1.
CACHE.
EXECUTE.
DATASET NAME ITENS_18 WINDOW=FRONT.
VARIABLE LABELS
CO_POSICAO Posiחדo do Item na Prova
SG_AREA ֱrea de Conhecimento do Item
CO_ITEM Cףdigo do Item
TX_GABARITO Gabarito do Item
CO_HABILIDADE Habilidade do Item
TX_COR Cor da Prova
CO_PROVA Identificador da Prova
TP_LINGUA Lםngua Estrangeira
IN_ITEM_ADAPTADO Item pertencente א prova adaptada para Ledor.
VALUE LABELS
SG_AREA
"CH" Ciךncias Humanas
"CN" Ciךncias da Natureza
"LC" Linguagens e Cףdigos
"MT" Matemבtica
/TP_LINGUA
0 Inglךs
1 Espanhol
/IN_ITEM_ADAPTADO
0 Nדo
1 Sim.
You can see that they describe the metadata for each column.
Pandas read_csv() function imports a CSV file to DataFrame format. header: this allows you to specify which row will be used as column names for your dataframe. Expected an int value or a list of int values. Default value is header=0 , which means the first row of the CSV file will be treated as column names.
Now, when we have done that, we can read the .sas7bdat file into a Pandas dataframe using the read_sas method. In the read SAS example here, we are importing the same data file as in the previous example. In some cases, you may want to work with your data using other Python packages.
The pandas read_csv () function is used to read a CSV file into a dataframe. It comes with a number of different parameters to customize how you’d like to read the file. The following is the general syntax for loading a csv file to a dataframe: import pandas as pd df = pd.read_csv (path_to_file)
The only metadata a CSV file has is the column (variable) names. You can probably do a much better job at guessing what data types to use than PROC import can. You can definitely do a better job of creating meaningul variable names. Plus the code to read the file with a data step is shorter than the code to call PROC IMPORT.
Here are two simple steps on how to read .sav files in Python using Pandas (more details will be provided in this post): 1 import pandas#N#in your script type “ import pandas as pd “#N#Save 2 use read_spss#N#in your script use the read_spss method:#N#df = read_spss (‘PATH_TO_SAV_FILE”)#N#Save More ...
.sas is the program file extension for SAS, and is designed to be used via SAS. It is essentially a command file serving as a dictionary file.
.sps is the program file extension for SPSS, and is designed to be used via SPSS. It is essentially a command file serving as a dictionary file. I'd give a handy link here too but SPSS is an IBM product and their documentation is a hellish landscape none should tread.
What you're trying to do should be possible despite that. Pandas by itself is insufficient, as it has no functions built in to address these situations. Pandas support for SAS only extends to .sas7bdat data files, and for SPSS only extends to .sav data files.
Python (and Pandas) can read the .sas and .sps extensions because they're plain text files, but can't actually do anything with them.
Here are two paths for you to take to get what you're after.
1) Install SAS or SPSS on a trial, use it to read the data and then export in an alternative format.
2) Install and attempt to use the pyreadstat extension for Pandas.
It sounds like the pandas framework is your preference and for that to work you'll need to expand what it can do. In this case, with the pyreadstat extension. It is designed to work with SAS and SPSS data files and processes them far more efficiency than pandas by itself. This solution comes with a caveat.
Pyreadstat is itself a conversion of ReadStat. Quoting the pyreadstat readme file:
This module is a wrapper around the excellent Readstat C library by Evan Miller.
Readstat is the library used in the back of the R library Haven,
meaning pyreadstat is a python equivalent to R Haven.
If you look only at the pyreadstat files you won't find anything touching on .sas or .sps or dictionary files in general. Instead, you'll want to look at the readme for ReadStat found here. It has a section specifically covering such circumstances.
As of yet I have not tested the ReadStat commands and functions that exist for dictionary files in pyreadstat, so there is a possibility this will not work.
If you attempt this solution and it fails for you, follow up to the thread and I'll help you troubleshoot.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With