The task is to map fields from one data set to another; some fields require additional parsing/computation.
(I'm using only a few fields in the examples below, but there are many more in the original data sets.)
Initially I thought to use a dict for the field mappings and just assign functions to the keys that require additional data manipulation:
import base64
import hashlib
import json
from datetime import datetime


def id2hash(event):
    md5 = hashlib.md5(event['id'].encode())
    return base64.b64encode(md5.digest())


def ts2iso(event):
    dt = datetime.fromtimestamp(event['timestamp'])
    return dt.isoformat()


MAPPINGS = {
    'id': id2hash,
    'region': 'site',
    'target': 'host',
    'since': ts2iso
}
def parser(event):
    new = dict()
    for k, v in MAPPINGS.items():
        if callable(v):
            value = v(event)
        else:
            value = event.get(v)
        new[k] = value
    return new


def main():
    for event in events:  # dicts
        event = parser(event)
        print(json.dumps(event, indent=2))


if __name__ == '__main__':
    main()
I don't like the fact that I have to define all the parsing functions at the top so that the MAPPINGS dict can see them, and I'm not sure that's the best approach. Besides, I don't see an easy way to pass a default value to dict.get in the parser function.
So I also tried an object-oriented approach, using a class with properties:
import base64
import hashlib
import json
from datetime import datetime


class Event(object):

    def __init__(self, event):
        self.event = event

    @property
    def id(self):
        md5 = hashlib.md5(self.event['id'].encode())
        return base64.b64encode(md5.digest())

    @property
    def region(self):
        return self.event['site']

    @property
    def target(self):
        return self.event['host']

    @property
    def since(self):
        dt = datetime.fromtimestamp(self.event['timestamp'])
        return dt.isoformat()

    def data(self):
        return {
            attr: getattr(self, attr)
            for attr in dir(self)
            if not attr.startswith('__') and attr not in ['event', 'data']
        }


def main():
    for event in events:  # dicts
        event = Event(event).data()
        print(json.dumps(event, indent=2))


if __name__ == '__main__':
    main()
I'm sure there is a better way to get all the properties (property methods only) and avoid this ugly data method. I would also like to avoid adding a prefix to the relevant methods just so I could filter them with str.startswith or something similar.
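For illustration, the kind of thing I had in mind (a rough sketch, not what I currently use, and I'm not sure it's idiomatic) is to look for property objects on the class instead of calling dir() on the instance:

    def data(self):
        # Properties live on the class, so walk the MRO and keep only
        # attributes that are property descriptors.
        return {
            name: getattr(self, name)
            for klass in type(self).__mro__
            for name, attr in vars(klass).items()
            if isinstance(attr, property)
        }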
What would be the best approach for this task? I was also looking at functools.singledispatch, but I don't think it will be helpful in this case.
I think that your first approach makes a lot of sense and, if that is important to you, will perform much better than the OO approach. If you need to process large numbers of events, converting each dict to an object will be quite CPU intensive. I also find it very explicit and clear.
In the OO approach you would convert from dict to object for nothing. There is no benefit to having an object, because all you do later on is convert it to JSON (which you can't do with a custom class unless you write your own JSON encoder).
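To make that last point concrete, here is a small illustration (the Event class here is hypothetical, not from either post): json.dumps rejects arbitrary objects unless you supply an encoder or a default hook.

import json


class Event:
    def __init__(self, region):
        self.region = region


try:
    json.dumps(Event('X'))
except TypeError as exc:
    print(exc)  # Object of type Event is not JSON serializable

# One workaround: a `default` hook that turns unknown objects into dicts.
print(json.dumps(Event('X'), default=vars))  # prints {"region": "X"}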
That's why my choice would be option number one, which I would modify slightly like this:
import base64
import hashlib
import json
from datetime import datetime


class SimpleConverter:
    def __init__(self, key, default=None):
        self.key = key
        self.default = default

    def __call__(self, event):
        return event.get(self.key, self.default)


class TimestampToISO:
    def __init__(self, key):
        self.key = key

    def __call__(self, event):
        dt = datetime.fromtimestamp(event[self.key])
        return dt.isoformat()


class StringToBase64:
    def __init__(self, key):
        self.key = key

    def __call__(self, event):
        md5 = hashlib.md5(event[self.key].encode())
        return base64.b64encode(md5.digest()).decode()  # .decode() not needed on Python 2


def transform_event(event, mapping):
    return {key: convert(event) for key, convert in mapping.items()}


def main(events, mapping):
    for event in events:  # dicts
        event = transform_event(event, mapping)
        print(json.dumps(event, indent=2))


if __name__ == '__main__':
    mapping = {
        'id': StringToBase64("id"),
        'region': SimpleConverter("site"),
        'target': SimpleConverter("host"),
        'with_default': SimpleConverter("missing_key", "Not missing!"),
        'since': TimestampToISO("timestamp"),
    }
    events = [
        {
            'id': 'test',
            'site': 'X',
            'host': 'Y',
            'timestamp': 1582408754.5111449,
        }
    ]
    main(events, mapping)
Which outputs this:
{
  "id": "CY9rzUYh03PK3k6DJie09g==",
  "region": "X",
  "target": "Y",
  "with_default": "Not missing!",
  "since": "2020-02-22T22:59:14.511145"
}
Notice how with this solution you can reuse all the converter classes for different event keys, which was not possible with pure functions.
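As a small illustration of that reuse (the extra keys here are made up for the example), the same converter types can be instantiated for as many source keys as needed:

mapping = {
    'region': SimpleConverter('site'),
    'target': SimpleConverter('host', default='unknown'),  # same class, different key and default
    'created': TimestampToISO('timestamp'),
    'updated': TimestampToISO('last_seen'),  # hypothetical second timestamp field
}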
That's a pretty cool problem you've got here; however, I feel the solutions are all a bit too code-heavy:
MAPPINGS = {
    'id': id2hash,
    'region': ('site', 'default_region'),
    'target': ('host', 'default_target'),
    'since': ts2iso
}

# Unpack tuple if action is not callable. Equivalent to event.get(action[0], action[1])
mapped_events = [
    {key: action(event) if callable(action) else event.get(*action)
     for key, action in MAPPINGS.items()} for event in events]
This solution does exactly what your first approach did, but in far fewer lines. I agree it is fairly unreadable, so feel free to reuse only the parts you want (maybe put the dict comprehension in a separate function and call that in the list comprehension, as sketched below).
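A possible version of that refactor (names assumed, not from the original answer):

def map_event(event, mappings):
    # Apply one mapping table (callables or (key, default) tuples) to one event.
    return {
        key: action(event) if callable(action) else event.get(*action)
        for key, action in mappings.items()
    }


mapped_events = [map_event(event, MAPPINGS) for event in events]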
If what you wanted to express in your mapping for keys like 'target': 'host' is event.get('target', 'host'), then the comprehension becomes:
mapped_events = [
    {key: action(event) if callable(action) else event.get(key, action)
     for key, action in MAPPINGS.items()} for event in events]
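In that reading, any non-callable value in the mapping acts as a plain default rather than a source key (illustrative values, not from the answer):

MAPPINGS = {
    'id': id2hash,
    'region': 'default_region',  # evaluated as event.get('region', 'default_region')
    'target': 'default_target',
    'since': ts2iso,
}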