I have a PySpark RDD imported from JSON files. A number of the values in the data elements contain undesirable characters. For the sake of argument, only characters in string.printable should appear in those JSON files.
Because a large number of elements contain text, I have been trying to find a way of mapping the incoming RDD through a function that cleans the data and returns a cleansed RDD as output. I can find ways of printing a cleansed element from the RDD, but not of cleaning the entire collection of elements and returning them as an RDD.
An example document is shown below. Undesirable characters might creep into the userAgent, MarketingReference and pageTags elements, or indeed any of the text elements.
{
"documentId": "abcdef12-1234-5678-fedc-cba9876543210",
"documentType": "contentSummary",
"dateTimeCreated": "2017-01-01T03:00:22.478Z",
"body": {
"requestUrl": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
"requestMethod": "GET",
"responseCode": "200",
"userAgent": "Mozilla/5.0 etc",
"requestHeaders": {
"connection": "close",
"host": "www.our-web-site.com",
"accept-language": "en-gb",
"via": "1.1 www.our-web-site.com",
"user-agent": "Mozilla/5.0 etc",
"x-forwarded-proto": "https",
"clientIp": "99.99.99.99",
"referer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
"accept-encoding": "gzip, deflate",
"incap-client-ip": "99.99.99.99"
},
"body": {
"pageId": "/content/our-web-site/en-gb/holidays/interstitial",
"pageVersion": "1.0",
"pageClassification": "product-page",
"pageTags": "spark, python, rdd, other words",
"MarketingReference": "BUYMEPLEASE",
"referrer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
"webSessionId": "abcdef12-1234-5678-fedc-cba9876543210"
}
}
}
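One way to approach the cleaning step is a plain Python function that walks a parsed document and strips any character outside string.printable from every string value. The following is a minimal sketch, not a definitive implementation: the names `clean_text`, `clean_value` and the `dirty` sample are hypothetical, and it assumes each document has already been parsed into nested dicts, lists and strings. Keys are left untouched.

```python
import string

# Characters we are willing to keep; string.printable is the question's own criterion.
ALLOWED = frozenset(string.printable)


def clean_text(text):
    """Drop every character that is not in string.printable."""
    return "".join(ch for ch in text if ch in ALLOWED)


def clean_value(value):
    """Recursively clean string values inside dicts and lists; leave other types alone."""
    if isinstance(value, str):
        return clean_text(value)
    if isinstance(value, dict):
        return {key: clean_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [clean_value(item) for item in value]
    return value


# Hypothetical document with unwanted characters in two text fields.
dirty = {
    "documentType": "contentSummary",
    "body": {
        "userAgent": "Mozilla/5.0\u200b etc",   # contains a zero-width space
        "pageTags": "spark, pyth\u00f6n, rdd",  # contains a non-ASCII letter
        "responseCode": 200,                    # non-string values pass through
    },
}

# clean_value builds a new structure; the input document is not modified.
cleaned = clean_value(dirty)
print(cleaned["body"]["userAgent"])
```

Assuming `rdd` holds one parsed dict per element (for example after `sc.textFile(path).map(json.loads)`, where `path` is a placeholder), `cleaned_rdd = rdd.map(clean_value)` would return a new RDD containing the cleansed documents rather than printing a single element. Simply dropping characters is only one policy; replacing them with a placeholder may suit some fields better.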
Resilient Distributed Datasets (RDDs) are immutable (read-only). You cannot change an existing RDD, but you can create new RDDs by applying coarse-grained operations, such as transformations, to it.
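getNumPartitions() returns the number of partitions in which an RDD is stored. It returns a plain integer to the driver rather than a new RDD, so it is not a transformation, although it does not trigger a job over the data either.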
The underlying problem was that we were trying to clean up data downstream because data quality practices upstream were poor (or totally absent).
Eventually it was accepted that we were attempting to address a symptom and not the cause. The cost of retrospectively fixing data proved to be massively higher than the cost of handling data properly in the first place.