How to use a .SAS or SPS metadata file to read a CSV as a Pandas dataframe?

Tags:

I have a big CSV file and it comes with two metadata description files. One has a .sas extension and the other a .sps. Opening them, they describe the CSV data format and categories. The files describe the data format and possible categories of each column. For example, a column with values 1 or 2 is mapped to yes and no.

How can I use these metadata files to help me read the CSV file?

I can easily read it using read_csv, but these files are useful to automatically create my columns with the possible categories. I can create a parser for them, but there must be a package or function to do it. Maybe I'm not using the correct search terms.

Here is the .sas file (sorry, it is Portuguese):

proc format;
Value $SG_AREA
        CH='Ciךncias Humanas'
        CN='Ciךncias da Natureza'
        LC='Linguagens e Cףdigos'
        MT='Matemבtica';

Value $TP_LINGUA
        0='Inglךs'
        1='Espanhol';

Value $IN_ITEM_ADAPTADO
        0='Nדo'
        1='Sim';


DATA WORK.ITENS_2018;
INFILE 'C:\ITENS_PROVA_2018.csv' /*local do arquivo*/
        LRECL=33
        FIRSTOBS=2
        DLM=';'
        MISSOVER
        DSD ;

INPUT
        CO_POSICAO       : BEST2.
        SG_AREA          : $CHAR2.
        CO_ITEM          : BEST6.
        TX_GABARITO      : $CHAR1.
        CO_HABILIDADE    : BEST2.
        TX_COR           : $CHAR7.
        CO_PROVA         : BEST3.
        TP_LINGUA        : $CHAR1.
        IN_ITEM_ADAPTADO : $CHAR1. ;

ATTRIB  SG_AREA          FORMAT = $SG_AREA20.;         
ATTRIB  TP_LINGUA        FORMAT = $TP_LINGUA8.;       
ATTRIB  IN_ITEM_ADAPTADO FORMAT = $IN_ITEM_ADAPTADO3.;

LABEL
CO_POSICAO='Posiחדo do Item na Prova'
SG_AREA='ֱrea de Conhecimento do Item'
CO_ITEM='Cףdigo do Item'
TX_GABARITO='Gabarito do Item'
CO_HABILIDADE='Habilidade do Item'
TX_COR='Cor da Prova'
CO_PROVA='Identificador da Prova'
TP_LINGUA='Lםngua Estrangeira '
IN_ITEM_ADAPTADO='Item pertencente א prova adaptada para Ledor'

;RUN;

And here you can see the equivalent .sps file:

GET DATA
  /TYPE=TXT
  /FILE= "C:\ITENS_PROVA_2018.csv" /*local do arquivo*/ 
  /DELCASE=LINE
  /DELIMITERS=";"
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /IMPORTCASE= ALL
  /VARIABLES=
CO_POSICAO F2.0
SG_AREA A2
CO_ITEM F6.0
TX_GABARITO A1
CO_HABILIDADE F2.0
TX_COR A7
CO_PROVA F3.0
TP_LINGUA A1       
IN_ITEM_ADAPTADO A1.
CACHE.
EXECUTE.
DATASET NAME ITENS_18 WINDOW=FRONT.

VARIABLE LABELS
CO_POSICAO  Posiחדo do Item na Prova
SG_AREA     ֱrea de Conhecimento do Item
CO_ITEM     Cףdigo do Item
TX_GABARITO Gabarito do Item
CO_HABILIDADE   Habilidade do Item
TX_COR      Cor da Prova
CO_PROVA    Identificador da Prova
TP_LINGUA       Lםngua Estrangeira
IN_ITEM_ADAPTADO    Item pertencente א prova adaptada para Ledor.


VALUE LABELS
SG_AREA
        "CH"    Ciךncias Humanas
        "CN"    Ciךncias da Natureza
        "LC"    Linguagens e Cףdigos
        "MT"    Matemבtica
/TP_LINGUA
        0   Inglךs
        1   Espanhol
/IN_ITEM_ADAPTADO
        0   Nדo
        1   Sim.

You can see that they describe the metadata for each column.

932

asked Oct 24 '19 04:10

neves

Video Answer

1 Answers

.sas is the program file extension for SAS, and is designed to be used via SAS. It is essentially a command file serving as a dictionary file.

.sps is the program file extension for SPSS, and is designed to be used via SPSS. It is essentially a command file serving as a dictionary file. I'd give a handy link here too but SPSS is an IBM product and their documentation is a hellish landscape none should tread.

What you're trying to do should be possible despite that. Pandas by itself is insufficient, as it has no functions built in to address these situations. Pandas support for SAS only extends to .sas7bdat data files, and for SPSS only extends to .sav data files.

Python (and Pandas) can read the .sas and .sps extensions because they're plain text files, but can't actually do anything with them.

Here are two paths for you to take to get what you're after.

1) Install SAS or SPSS on a trial, use it to read the data and then export in an alternative format.

2) Install and attempt to use the pyreadstat extension for Pandas.

It sounds like the pandas framework is your preference and for that to work you'll need to expand what it can do. In this case, with the pyreadstat extension. It is designed to work with SAS and SPSS data files and processes them far more efficiency than pandas by itself. This solution comes with a caveat.

Pyreadstat is itself a conversion of ReadStat. Quoting the pyreadstat readme file:

This module is a wrapper around the excellent Readstat C library by Evan Miller. 
Readstat is the library used in the back of the R library Haven, 
meaning pyreadstat is a python equivalent to R Haven.

If you look only at the pyreadstat files you won't find anything touching on .sas or .sps or dictionary files in general. Instead, you'll want to look at the readme for ReadStat found here. It has a section specifically covering such circumstances.

As of yet I have not tested the ReadStat commands and functions that exist for dictionary files in pyreadstat, so there is a possibility this will not work.

If you attempt this solution and it fails for you, follow up to the thread and I'll help you troubleshoot.

142

answered Sep 28 '22 12:09

Stephen R

Related questions
                            
                                Getting Python package distribution version from within a package
                            
                                Using Panda's .at function to modify multiple rows
                            
                                Python pytest pytest_exception_interact customize exception information from VCR.py exception
                            
                                How to hide command prompt popup during launching PyLatex or Latexmk
                            
                                How to document options in an INI file with Sphinx
                            
                                Recommendation system with matrix factorization for huge data gives MemoryError
                            
                                How to provide multiple targets to a Seq2Seq model?
                            
                                RuntimeError: Click will abort further execution because Python 3 was configured to use ASCII as encoding for the environment
                            
                                How do you enable macOS Dark Mode in PyQt5 (5.13)
                            
                                Generating correlated random potential using fast Fourier transform
                            
                                Using numerical values in plotly for creating Gantt-Charts
                            
                                Bool and missing values in pandas
                            
                                How to crawl in desired order or Synchronously in Scrapy?
                            
                                Type hinting values that are multiple types?
                            
                                Why is the 'all' implementation slower than writing a for loop [duplicate]
                            
                                How to combine polynomials in matrix operations in Sympy?
                            
                                Use tf-idf with FastText vectors
                            
                                Making a fullscreen paint program with transparent background of my application
                            
                                ModuleNotFoundError: No module named 'fasttext'

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to use a .SAS or SPS metadata file to read a CSV as a Pandas dataframe?

Tags:

python

pandas

sas

spss

neves

People also ask

Video Answer

1 Answers

Stephen R

Recent Activity

Donate For Us