Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse a tables with unicode chars in variables from JSON with SAS BASE

I've faced with a problem on parsing JSON with unicode char in vars. So, I have the next JSON (example):

 {  
   "SASJSONExport":"1.0",
   "SASTableData+TEST":[  
      {  
         "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
         "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,
         "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"
      },
      {  
         "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
         "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":2,
         "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0032"
      },
      {  
         "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":1,
         "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":42,
         "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0033"
      }
   ]
}

To parse the table from JSON I use SAS engine:

libname jsonfl JSON fileref=injson ;

The code higher decode chars in cells, but name of vars looks like missing vals:

+--------------+---------------------------+------------+---------+---------+
| ordinal_root | ordinal_SASTableData_TEST | __________ | _______ | ______  |
+--------------+---------------------------+------------+---------+---------+
|            1 |                         1 |          2 |       4 | Что-то1 |
|            1 |                         2 |          2 |       2 | Что-то2 |
|            1 |                         3 |          1 |      42 | Что-то3 |
+--------------+---------------------------+------------+---------+---------+

The header must look like:

+--------------+---------------------------+------------+---------+---------+
| ordinal_root | ordinal_SASTableData_TEST | Переменная | Среднее | Строка  |
+--------------+---------------------------+------------+---------+---------+

So I've decide to replace unicoded variables chars with names like this DIM_N_. And for that I must find all strings, that agree with next regexp: /([\s\w\d\\]+)\"\:/

But, to get strings from json I need set as delim the next char '{','}','[',']',','. But if set that chars as dlm , I willn't assemble json again. So I've decide to paste before the char ~ to set it as dlm.

data delim;
    infile injson lrecl=1073741823 nopad;
    file  delim;
    input char1 $char1. @@;
        if char1 in ('{','}','[',']',',') then
            put '7E'x;
        put char1 $CHAR1. @@;
run;

I've get the novalid json file:

~
{"SASJSONExport":"1.0"~
,"SASTableData+TEST":~
[  ~
{"\u0056\u0061\u0072":2~
,"\u006d\u0065\u0061\u006e":4~
,"\u004e\u0061\u006d\u0065":"\u0073\u006d\u0074\u0068\u0031"~
}~
,  ~
{"\u0056\u0061\u0072":2~
,"\u006d\u0065\u0061\u006e":2~
,"\u004e\u0061\u006d\u0065":"\u0073\u006d\u0074\u0068\u0032"~
}~
,  ~
{"\u0056\u0061\u0072":1~
,"\u006d\u0065\u0061\u006e":42~
,"\u004e\u0061\u006d\u0065":"\u0073\u006d\u0074\u0068\u0033"~
}  ~
]~
}   

So as the next step I'm parsing JSON and use ~ as the delimiter:

data transfer;
length column $2000;
retain r;
    infile delim  delimiter='7E'x nopad;
    input char1 : $4000. @@;
            r = prxparse('/([\s\w\d\\]+)\"\:/');
            pos = prxmatch(r,char1);
            column = prxposn(r,1,char1);
        n= _n_;
run;

It works... But I feel that those are too bad practices, and It has confines.

UPD1
Option,

options vAlidfmtname=long VALIDMEMNAME=extend VALIDVARNAME=any;

return:

+--------------+---------------------------+----------------------------+---------+--------------+
| ordinal_root | ordinal_SASTableData_TEST |         __________         | _______ |    ______    |
+--------------+---------------------------+----------------------------+---------+--------------+
|            1 |                         1 | авфа2 фвафв = фвыа - тфвыа |       4 | Что-то1 ,,,, |
|            1 |                         2 | авфа2 фвафв = фвыа - тфвыа |       2 | Что-то2      |
|            1 |                         3 | авфа2 фвафв = фвыа - тфвыа |    2017 | Что-то3      |
+--------------+---------------------------+----------------------------+---------+--------------+

So my questions are:

  1. Can I decode the whole file without the infile statement?
  2. Can I use infile delimiter, but set smth options to not delete the delimiter?

Adequate criticism is welcomed.

like image 904
Sanek Zhitnik Avatar asked May 25 '17 08:05

Sanek Zhitnik


1 Answers

UPD
I came to the solution without having to manually edit the json map file, but using a regex.

libname _all_ clear;
filename _all_ clear;
filename _PDFOUT temp;
filename _GSFNAME temp;
proc datasets lib=work kill memtype=data nolist; quit;
filename jsf '~/sasuser.v94/.json' encoding='utf-8';
data _null_;
  file jsf;
  length js varchar(*);
  retain js;
  input;
  js=unicode(_infile_);
  put js;
  datalines;
{
  "SASJSONExport":"1.0",
  "SASTableData+TEST":[
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"
    },
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":2,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0032"
    },
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":1,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":42,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0033"
    }
  ]
}
;
run;
filename jsm '~/sasuser.v94/.json.map' encoding='utf-8';
libname jsd json fileref=jsf map=jsm automap=replace;
libname jsm json fileref=jsm;
data jsmm;
  merge jsm.datasets jsm.datasets_variables;
  by ordinal_DATASETS;
run;
proc sort data=jsmm; by ordinal_root ordinal_DATASETS; run;
data _null_;
  set work.jsmm end=last;
  if _N_=1 then do;
    length s varchar(*) ds varchar(*);
    retain s ds prx;
    s='{"DATASETS":[';
    ds='';
    prx=prxparse('/[^_]/');
  end;
  if ds=dsname then s=s||',';
  else do;
    ds=dsname;
    if _N_^=1 then s=s||']},';
    s=cats(s,'{"DSNAME":"',ds,'","TABLEPATH":"',tablepath,'","VARIABLES":[');
  end;
  s=cats(s,'{"NAME":"',name,'","TYPE":"',type,'","PATH":"',path,'"');
  if prxmatch(prx,name) > length(name) then
    s=cats(s,',"LABEL":"',scan(path,-1,'/'),'"');
  s=s||'}';
  if last then do;
    s=s||']}]}';
    file jsm;
    put s;
  end;
run;
libname jsd json fileref=jsf map=jsm;
proc print data=jsd.SASTableData_TEST label noobs; run;

The first variant of the solution
It is the quick'n'dirty solution.
First preparing the input data:

libname _all_ clear;
filename _all_ clear;
filename jsf '~/sasuser.v94/.json' encoding='utf-8';
data _null_;
  file jsf;
  length js varchar(*);
  input;
  js=unicode(_infile_);
  put js;
  datalines;
{
  "SASJSONExport":"1.0",
  "SASTableData+TEST": [
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"
    },
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":2,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0032"
    },
    {
      "\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":1,
      "\u0421\u0440\u0435\u0434\u043d\u0435\u0435":42,
      "\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0033"
    }
  ]
}
;
run;

The output file .json:

{
"SASJSONExport":"1.0",
"SASTableData+TEST": [
{
"Переменная":2,
"Среднее":4,
"Строка":"Что-то1"
},
{
"Переменная":2,
"Среднее":2,
"Строка":"Что-то2"
},
{
"Переменная":1,
"Среднее":42,
"Строка":"Что-то3"
}
]
}

Then create the json map file .json.map:

filename jsmf '~/sasuser.v94/.json.map' encoding='utf-8';
libname jsm json fileref=jsf map=jsmf automap=create;

The .json.map contents:

{
  "DATASETS": [
    {
      "DSNAME": "root",
      "TABLEPATH": "/root",
      "VARIABLES": [
        {
          "NAME": "ordinal_root",
          "TYPE": "ORDINAL",
          "PATH": "/root"
        },
        {
          "NAME": "SASJSONExport",
          "TYPE": "CHARACTER",
          "PATH": "/root/SASJSONExport",
          "CURRENT_LENGTH": 3
        }
      ]
    },
    {
      "DSNAME": "SASTableData_TEST",
      "TABLEPATH": "/root/SASTableData+TEST",
      "VARIABLES": [
        {
          "NAME": "ordinal_root",
          "TYPE": "ORDINAL",
          "PATH": "/root"
        },
        {
          "NAME": "ordinal_SASTableData_TEST",
          "TYPE": "ORDINAL",
          "PATH": "/root/SASTableData+TEST"
        },
        {
          "NAME": "____________________",
          "TYPE": "NUMERIC",
          "PATH": "/root/SASTableData+TEST/Переменная"
        },
        {
          "NAME": "______________",
          "TYPE": "NUMERIC",
          "PATH": "/root/SASTableData+TEST/Среднее"
        },
        {
          "NAME": "____________",
          "TYPE": "CHARACTER",
          "PATH": "/root/SASTableData+TEST/Строка",
          "CURRENT_LENGTH": 12
        }
      ]
    }
  ]
}

Let's change the file a bit by removing the description of the unnesessary dataset and adding labels:

{
  "DATASETS": [
    {
      "DSNAME": "SASTableData_TEST",
      "TABLEPATH": "/root/SASTableData+TEST",
      "VARIABLES": [
        {
          "NAME": "ordinal_root",
          "TYPE": "ORDINAL",
          "PATH": "/root"
        },
        {
          "NAME": "ordinal_SASTableData_TEST",
          "TYPE": "ORDINAL",
          "PATH": "/root/SASTableData+TEST"
        },
        {
          "NAME": "____________________",
          "TYPE": "NUMERIC",
          "PATH": "/root/SASTableData+TEST/Переменная",
          "LABEL": "Переменная"
        },
        {
          "NAME": "______________",
          "TYPE": "NUMERIC",
          "PATH": "/root/SASTableData+TEST/Среднее",
          "LABEL": "Среднее"
        },
        {
          "NAME": "____________",
          "TYPE": "CHARACTER",
          "PATH": "/root/SASTableData+TEST/Строка",
          "LABEL": "Строка",
          "CURRENT_LENGTH": 12
        }
      ]
    }
  ]
}

And try again:

libname jsd json fileref=jsf map=jsmf;
proc print data=jsd.SASTableData_TEST label noobs; run;

The result:

+--------------+---------------------------+- ----------+---------+-----------+
| ordinal_root | ordinal_SASTableData_TEST | Переменная | Среднее |    Строка |
+--------------+---------------------------+------------+---------+-----------+
|            1 |                         1 |          2 |       4 | Что-то1   |
|            1 |                         2 |          2 |       2 | Что-то2   |
|            1 |                         3 |          1 |      42 | Что-то3   |
+--------------+---------------------------+------------+---------+-----------+

All it was done in SAS University Edition.

like image 73
Andrei Odegov Avatar answered Nov 14 '22 02:11

Andrei Odegov