
Best approach for data conversion/mapping [closed]

The task is to map the fields from one data set to another; some fields require additional parsing or computation.

(I'm using only a few fields in the examples below, but there are many more in the original data sets.)

Approach 1:

Initially I thought of using a dict for the field mappings and assigning functions to the keys that require additional data manipulation:

import base64
import hashlib
import json

from datetime import datetime


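def id2hash(event):
    # Base64-encoded MD5 digest of the event id (name matches its use in MAPPINGS)
    md5 = hashlib.md5(event['id'].encode())
    # decode() turns the bytes into str so json.dumps can serialize the value
    return base64.b64encode(md5.digest()).decode()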


def ts2iso(event):
    dt = datetime.fromtimestamp(event['timestamp'])
    return dt.isoformat()


MAPPINGS = {
    'id': id2hash,
    'region': 'site',
    'target': 'host',
    'since': ts2iso
}


def parser(event):
    new = dict()
    for k, v in MAPPINGS.items():
        if callable(v):
            value = v(event)
        else:
            value = event.get(v)
        new[k] = value
    return new


def main():
    for event in events:  # dicts
        event = parser(event)
        print(json.dumps(event, indent=2))


if __name__ == '__main__':
    main()

I don't like having to define every parsing function above the MAPPINGS dict so that the dict can reference it, and I'm not sure this is the best approach. I also don't see an easy way to pass a default value to dict.get in the parser function.

Approach 2 (OOP):

import base64
import hashlib
import json

from datetime import datetime


class Event(object):
    def __init__(self, event):
        self.event = event

    @property
    def id(self):
        md5 = hashlib.md5(self.event['id'].encode())
        # decode() turns the bytes into str so json.dumps can serialize the value
        return base64.b64encode(md5.digest()).decode()

    @property
    def region(self):
        return self.event['site']

    @property
    def target(self):
        return self.event['host']

    @property
    def since(self):
        dt = datetime.fromtimestamp(self.event['timestamp'])
        return dt.isoformat()

    def data(self):
        return {
            attr: getattr(self, attr)
            for attr in dir(self)
            if not attr.startswith('__') and attr not in ['event', 'data']
        }


def main():
    for event in events:  # dicts
        event = Event(event).data()
        print(json.dumps(event, indent=2))


if __name__ == '__main__':
    main()

I'm sure there is a better way to collect all the properties (property methods only) and avoid this ugly data method. I would also like to avoid prefixing the relevant methods just so I can filter them with str.startswith or something similar.
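One possible way to collect only the properties is to filter the class namespace for property objects. This is a minimal sketch, assuming all properties are defined directly on the class (inherited ones would need a walk over the MRO); the Event class here is a trimmed copy of the one above:

```python
import base64
import hashlib


class Event:
    def __init__(self, event):
        self.event = event

    @property
    def id(self):
        md5 = hashlib.md5(self.event['id'].encode())
        return base64.b64encode(md5.digest()).decode()

    @property
    def region(self):
        return self.event['site']

    @property
    def target(self):
        return self.event['host']

    def data(self):
        # vars(type(self)) is the class namespace; keep only property objects,
        # so plain methods such as data() itself are skipped automatically.
        return {
            name: getattr(self, name)
            for name, attr in vars(type(self)).items()
            if isinstance(attr, property)
        }


result = Event({'id': 'test', 'site': 'X', 'host': 'Y'}).data()
```

With this filter no name prefix or exclusion list is needed, because the check is on the attribute type rather than its name.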

What would be the best approach for this task? I also looked at @functools.singledispatch, but I don't think it helps in this case.

asked Feb 20 '20 at 15:02 by HTF




2 Answers

I think your first approach makes a lot of sense and, if that matters to you, it should perform better than the OO approach. If you need to process large numbers of events, wrapping every dict in an object adds CPU overhead. I also find the first approach explicit and clear.

In the OO approach you would convert from dict to object for nothing. There is no benefit to having an object, because all you do later is convert it to JSON (which you can't do with a custom class unless you write your own JSON encoder).
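To illustrate the JSON point, here is a minimal sketch. The EventEncoder name and the cut-down Event class are assumptions made for the example, not part of the question's code:

```python
import json


class Event:
    def __init__(self, event):
        self.event = event

    def data(self):
        return {'region': self.event['site'], 'target': self.event['host']}


class EventEncoder(json.JSONEncoder):
    # Hypothetical encoder: json calls default() for objects it cannot serialize.
    def default(self, obj):
        if isinstance(obj, Event):
            return obj.data()
        return super().default(obj)


event = Event({'site': 'X', 'host': 'Y'})

# Without a custom encoder, json.dumps raises TypeError for the Event instance.
try:
    json.dumps(event)
    plain_dump_failed = False
except TypeError:
    plain_dump_failed = True

# With the encoder, the object is converted back to a dict before serializing.
encoded = json.dumps(event, cls=EventEncoder)
```

Either way the object ends up as a dict again before it is serialized.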

That's why my choice would be option number one, which I would modify slightly like this:

import base64
import hashlib
import json

from datetime import datetime


class SimpleConverter:

    def __init__(self, key, default=None):
        self.key = key
        self.default = default

    def __call__(self, event):
        return event.get(self.key, self.default)


class TimestampToISO:

    def __init__(self, key):
        self.key = key

    def __call__(self, event):
        dt = datetime.fromtimestamp(event[self.key])
        return dt.isoformat()


class StringToBase64:

    def __init__(self, key):
        self.key = key

    def __call__(self, event):
        md5 = hashlib.md5(event[self.key].encode())
        return base64.b64encode(md5.digest()).decode()  # on Python 2, drop .decode()


def transform_event(event, mapping):
    return {key: convert(event) for key, convert in mapping.items()}


def main(events, mapping):
    for event in events:  # dicts
        event = transform_event(event, mapping)
        print(json.dumps(event, indent=2))


if __name__ == '__main__':
    mapping = {
        'id': StringToBase64("id"),
        'region': SimpleConverter("site"),
        'target': SimpleConverter("host"),
        'with_default': SimpleConverter("missing_key", "Not missing!"),
        'since': TimestampToISO("timestamp"),
    }
    events = [
        {
            'id': 'test',
            'site': 'X',
            'host': 'Y',
            'timestamp': 1582408754.5111449,
        }
    ]
    main(events, mapping)

Which outputs this (the since value depends on the local time zone):

{
  "id": "CY9rzUYh03PK3k6DJie09g==",
  "region": "X",
  "target": "Y",
  "with_default": "Not missing!",
  "since": "2020-02-22T22:59:14.511145"
}

Notice how with this solution you can reuse all the converter classes for different event keys, which the functions in the question, each hard-coded to a single key, did not allow.
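As a small illustration of that reuse, the sketch below applies the same two converter classes to several keys. The host_hash and owner fields are hypothetical and only serve the example:

```python
import base64
import hashlib


class SimpleConverter:
    def __init__(self, key, default=None):
        self.key = key
        self.default = default

    def __call__(self, event):
        return event.get(self.key, self.default)


class StringToBase64:
    def __init__(self, key):
        self.key = key

    def __call__(self, event):
        md5 = hashlib.md5(event[self.key].encode())
        return base64.b64encode(md5.digest()).decode()


mapping = {
    'id': StringToBase64('id'),
    'host_hash': StringToBase64('host'),           # same class, different key
    'region': SimpleConverter('site'),
    'owner': SimpleConverter('owner', 'unknown'),  # same class, with a default
}

event = {'id': 'test', 'site': 'X', 'host': 'Y'}
result = {key: convert(event) for key, convert in mapping.items()}
```

Each converter instance carries its own key and default, so adding a field means adding one line to the mapping rather than writing a new function.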

answered Sep 28 '22 at 18:09 by matino


That's a pretty cool problem you've got here; however, I feel the solutions are all a bit too code-heavy:

MAPPINGS = {
    'id': id2hash,
    'region': ('site', 'default_region'),
    'target': ('host', 'default_target'),
    'since': ts2iso
}
# Unpack the tuple if action is not callable: event.get(*action) is
# equivalent to event.get(action[0], action[1]).
mapped_event = [
    {key: action(event) if callable(action) else event.get(*action)
     for key, action in MAPPINGS.items()}  # .items() yields (key, action) pairs
    for event in events]

This solution does exactly what your first approach does, but in far fewer lines. I agree it is fairly unreadable, so feel free to reuse only the parts you want (for example, move the dict comprehension into a separate function and call that from the list comprehension).
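Splitting the dict comprehension into its own function might look like the sketch below. The map_event name is an assumption, and id2hash and ts2iso are repeated from the question so the example runs on its own:

```python
import base64
import hashlib
from datetime import datetime


def id2hash(event):
    md5 = hashlib.md5(event['id'].encode())
    return base64.b64encode(md5.digest()).decode()


def ts2iso(event):
    # fromtimestamp() uses the local time zone
    return datetime.fromtimestamp(event['timestamp']).isoformat()


MAPPINGS = {
    'id': id2hash,
    'region': ('site', 'default_region'),
    'target': ('host', 'default_target'),
    'since': ts2iso,
}


def map_event(event, mapping=MAPPINGS):
    # Callables receive the whole event; tuples are (source_key, default).
    return {
        key: action(event) if callable(action) else event.get(*action)
        for key, action in mapping.items()
    }


events = [{'id': 'test', 'site': 'X', 'timestamp': 1582408754.5111449}]
mapped_events = [map_event(event) for event in events]
```

Here the event has no host key, so target falls back to its default while region is read from site.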

If what you wanted to express in your mapping for keys like 'target': 'host' is: event.get('target', 'host'), then the comprehension becomes:

mapped_event = [
    {key: action(event) if callable(action) else event.get(key, action)
     for key, action in MAPPINGS.items()}  # .items() yields (key, action) pairs
    for event in events]

answered Sep 28 '22 at 20:09 by Cal