Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use a .SAS or SPS metadata file to read a CSV as a Pandas dataframe?

I have a big CSV file and it comes with two metadata description files. One has a .sas extension and the other a .sps. Opening them, they describe the CSV data format and categories. The files describe the data format and possible categories of each column. For example, a column with values 1 or 2 is mapped to yes and no.

How can I use these metadata files to help me read the CSV file?

I can easily read it using read_csv, but these files are useful to automatically create my columns with the possible categories. I can create a parser for them, but there must be a package or function to do it. Maybe I'm not using the correct search terms.

Here is the .sas file (sorry, it is Portuguese):

proc format;
Value $SG_AREA
        CH='Ciךncias Humanas'
        CN='Ciךncias da Natureza'
        LC='Linguagens e Cףdigos'
        MT='Matemבtica';

Value $TP_LINGUA
        0='Inglךs'
        1='Espanhol';

Value $IN_ITEM_ADAPTADO
        0='Nדo'
        1='Sim';


DATA WORK.ITENS_2018;
INFILE 'C:\ITENS_PROVA_2018.csv' /*local do arquivo*/
        LRECL=33
        FIRSTOBS=2
        DLM=';'
        MISSOVER
        DSD ;

INPUT
        CO_POSICAO       : BEST2.
        SG_AREA          : $CHAR2.
        CO_ITEM          : BEST6.
        TX_GABARITO      : $CHAR1.
        CO_HABILIDADE    : BEST2.
        TX_COR           : $CHAR7.
        CO_PROVA         : BEST3.
        TP_LINGUA        : $CHAR1.
        IN_ITEM_ADAPTADO : $CHAR1. ;

ATTRIB  SG_AREA          FORMAT = $SG_AREA20.;         
ATTRIB  TP_LINGUA        FORMAT = $TP_LINGUA8.;       
ATTRIB  IN_ITEM_ADAPTADO FORMAT = $IN_ITEM_ADAPTADO3.;

LABEL
CO_POSICAO='Posiחדo do Item na Prova'
SG_AREA='ֱrea de Conhecimento do Item'
CO_ITEM='Cףdigo do Item'
TX_GABARITO='Gabarito do Item'
CO_HABILIDADE='Habilidade do Item'
TX_COR='Cor da Prova'
CO_PROVA='Identificador da Prova'
TP_LINGUA='Lםngua Estrangeira '
IN_ITEM_ADAPTADO='Item pertencente א prova adaptada para Ledor'

;RUN;

And here you can see the equivalent .sps file:

GET DATA
  /TYPE=TXT
  /FILE= "C:\ITENS_PROVA_2018.csv" /*local do arquivo*/ 
  /DELCASE=LINE
  /DELIMITERS=";"
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /IMPORTCASE= ALL
  /VARIABLES=
CO_POSICAO F2.0
SG_AREA A2
CO_ITEM F6.0
TX_GABARITO A1
CO_HABILIDADE F2.0
TX_COR A7
CO_PROVA F3.0
TP_LINGUA A1       
IN_ITEM_ADAPTADO A1.
CACHE.
EXECUTE.
DATASET NAME ITENS_18 WINDOW=FRONT.

VARIABLE LABELS
CO_POSICAO  Posiחדo do Item na Prova
SG_AREA     ֱrea de Conhecimento do Item
CO_ITEM     Cףdigo do Item
TX_GABARITO Gabarito do Item
CO_HABILIDADE   Habilidade do Item
TX_COR      Cor da Prova
CO_PROVA    Identificador da Prova
TP_LINGUA       Lםngua Estrangeira
IN_ITEM_ADAPTADO    Item pertencente א prova adaptada para Ledor.


VALUE LABELS
SG_AREA
        "CH"    Ciךncias Humanas
        "CN"    Ciךncias da Natureza
        "LC"    Linguagens e Cףdigos
        "MT"    Matemבtica
/TP_LINGUA
        0   Inglךs
        1   Espanhol
/IN_ITEM_ADAPTADO
        0   Nדo
        1   Sim.

You can see that they describe the metadata for each column.

like image 932
neves Avatar asked Oct 24 '19 04:10

neves


People also ask

Which method is used to read CSV files using pandas?

Pandas read_csv() function imports a CSV file to DataFrame format. header: this allows you to specify which row will be used as column names for your dataframe. Expected an int value or a list of int values. Default value is header=0 , which means the first row of the CSV file will be treated as column names.

How to read sas7bdat file into a pandas Dataframe?

Now, when we have done that, we can read the .sas7bdat file into a Pandas dataframe using the read_sas method. In the read SAS example here, we are importing the same data file as in the previous example. In some cases, you may want to work with your data using other Python packages.

How do I read a CSV file from a Dataframe in Python?

The pandas read_csv () function is used to read a CSV file into a dataframe. It comes with a number of different parameters to customize how you’d like to read the file. The following is the general syntax for loading a csv file to a dataframe: import pandas as pd df = pd.read_csv (path_to_file)

Is there any metadata in a CSV file?

The only metadata a CSV file has is the column (variable) names. You can probably do a much better job at guessing what data types to use than PROC import can. You can definitely do a better job of creating meaningul variable names. Plus the code to read the file with a data step is shorter than the code to call PROC IMPORT.

How to read Sav file in Python using PANDAS?

Here are two simple steps on how to read .sav files in Python using Pandas (more details will be provided in this post): 1 import pandas#N#in your script type “ import pandas as pd “#N#Save 2 use read_spss#N#in your script use the read_spss method:#N#df = read_spss (‘PATH_TO_SAV_FILE”)#N#Save More ...


Video Answer


1 Answers

.sas is the program file extension for SAS, and is designed to be used via SAS. It is essentially a command file serving as a dictionary file.

.sps is the program file extension for SPSS, and is designed to be used via SPSS. It is essentially a command file serving as a dictionary file. I'd give a handy link here too but SPSS is an IBM product and their documentation is a hellish landscape none should tread.

What you're trying to do should be possible despite that. Pandas by itself is insufficient, as it has no functions built in to address these situations. Pandas support for SAS only extends to .sas7bdat data files, and for SPSS only extends to .sav data files.

Python (and Pandas) can read the .sas and .sps extensions because they're plain text files, but can't actually do anything with them.


Here are two paths for you to take to get what you're after.

1) Install SAS or SPSS on a trial, use it to read the data and then export in an alternative format.

2) Install and attempt to use the pyreadstat extension for Pandas.

It sounds like the pandas framework is your preference and for that to work you'll need to expand what it can do. In this case, with the pyreadstat extension. It is designed to work with SAS and SPSS data files and processes them far more efficiency than pandas by itself. This solution comes with a caveat.

Pyreadstat is itself a conversion of ReadStat. Quoting the pyreadstat readme file:

This module is a wrapper around the excellent Readstat C library by Evan Miller. 
Readstat is the library used in the back of the R library Haven, 
meaning pyreadstat is a python equivalent to R Haven.

If you look only at the pyreadstat files you won't find anything touching on .sas or .sps or dictionary files in general. Instead, you'll want to look at the readme for ReadStat found here. It has a section specifically covering such circumstances.

As of yet I have not tested the ReadStat commands and functions that exist for dictionary files in pyreadstat, so there is a possibility this will not work.

If you attempt this solution and it fails for you, follow up to the thread and I'll help you troubleshoot.

like image 142
Stephen R Avatar answered Sep 28 '22 12:09

Stephen R