Clean invalid characters from data held in a Spark RDD

I have a PySpark RDD imported from JSON files. The data elements contain a number of values with undesirable characters. For the sake of argument, only characters in string.printable should appear in those JSON files.

Because a large number of elements contain text, I have been trying to find a way to map the incoming RDD through a function that cleans the data and returns a cleansed RDD as output. I can find ways of printing a single cleansed element from the RDD, but not of cleaning the entire collection of elements and returning them as an RDD.

An example document is shown below. Undesirable characters might creep into the userAgent, MarketingReference and pageTags elements, or indeed into any of the text elements.

{
    "documentId": "abcdef12-1234-5678-fedc-cba9876543210",
    "documentType": "contentSummary",
    "dateTimeCreated": "2017-01-01T03:00:22.478Z",
    "body": {
        "requestUrl": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
        "requestMethod": "GET",
        "responseCode": "200",
        "userAgent": "Mozilla/5.0 etc",
        "requestHeaders": {
            "connection": "close",
            "host": "www.our-web-site.com",
            "accept-language": "en-gb",
            "via": "1.1 www.our-web-site.com",
            "user-agent": "Mozilla/5.0 etc",
            "x-forwarded-proto": "https",
            "clientIp": "99.99.99.99",
            "referer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
            "accept-encoding": "gzip, deflate",
            "incap-client-ip": "99.99.99.99"
        },
        "body": {
            "pageId": "/content/our-web-site/en-gb/holidays/interstitial",
            "pageVersion": "1.0",
            "pageClassification": "product-page",
            "pageTags": "spark, python, rdd, other words",
            "MarketingReference": "BUYMEPLEASE",
            "referrer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
            "webSessionId": "abcdef12-1234-5678-fedc-cba9876543210"
        }
    }
}
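
For illustration, the kind of mapping described above might be sketched as follows. This is only a sketch under stated assumptions: Python 3, each RDD element already parsed into a Python dict (for example with json.loads), and string.printable as the allowed set. The names clean_value, ALLOWED, record and raw_rdd are hypothetical and not part of any library.

```python
import string

# Characters permitted in cleaned text; string.printable is the set named in the question.
ALLOWED = frozenset(string.printable)


def clean_value(value):
    """Recursively strip characters outside ALLOWED from every string in a parsed JSON value."""
    if isinstance(value, str):
        return "".join(ch for ch in value if ch in ALLOWED)
    if isinstance(value, dict):
        # Only values are cleaned; keys are left untouched in this sketch.
        return {key: clean_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [clean_value(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


# Hypothetical record with unwanted characters in two text fields.
record = {
    "documentType": "contentSummary",
    "body": {"userAgent": "Mozilla/5.0\u200b etc", "pageTags": ["spark", "pyth\u00f6n"]},
}
cleaned = clean_value(record)

# With PySpark, the same function would be applied to every element, e.g.
#   cleaned_rdd = raw_rdd.map(clean_value)
# where raw_rdd (hypothetical name) holds one parsed dict per document.
```

Because map is a transformation, it returns a new RDD and leaves the original untouched, so the cleansed collection can be passed on as an RDD rather than printed element by element. If the RDD held raw JSON strings instead of dicts, a json.loads step would be needed before the cleaning function.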
Dave Poole asked Jan 24 '17 14:01




1 Answer

The real problem was that we were trying to clean up data downstream when the data quality practices upstream were poor or entirely absent.

Eventually it was accepted that we were attempting to address a symptom and not the cause. The cost of retrospectively fixing the data proved to be massively higher than the cost of handling it properly in the first place.

Dave Poole answered Sep 19 '22 00:09
