Clean invalid characters from data held in a Spark RDD

I have a PySpark RDD imported from JSON files. Many of the data elements contain undesirable characters. For the sake of argument, only characters in string.printable should appear in those JSON files.
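
To make that criterion concrete, here is a minimal sketch of filtering a single string against string.printable. The helper name keep_printable and the sample value are hypothetical, chosen only for illustration.

```python
import string

# Allowed characters, taken directly from string.printable
# (ASCII letters, digits, punctuation and whitespace).
PRINTABLE = frozenset(string.printable)


def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return "".join(ch for ch in text if ch in PRINTABLE)


# Hypothetical sample: a NUL byte, an accented letter and a zero-width space.
sample = "Mozilla/5.0\x00 caf\u00e9 \u200betc"
cleaned = keep_printable(sample)
print(repr(cleaned))
```

Note that string.printable is ASCII-only, so this rule also drops legitimate accented letters, while tabs and newlines are kept because they count as printable whitespace.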

Because a large number of elements contain text, I have been trying to find a way to map the incoming RDD through a function that cleans the data and returns a cleansed RDD as output. I can find ways of printing a single cleansed element from the RDD, but not of cleaning the entire collection of elements and returning them as an RDD.

An example document is shown below. Undesirable characters might creep into the userAgent, MarketingReference and pageTags elements, or indeed any of the text elements.

{
    "documentId": "abcdef12-1234-5678-fedc-cba9876543210",
    "documentType": "contentSummary",
    "dateTimeCreated": "2017-01-01T03:00:22.478Z",
    "body": {
        "requestUrl": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
        "requestMethod": "GET",
        "responseCode": "200",
        "userAgent": "Mozilla/5.0 etc",
        "requestHeaders": {
            "connection": "close",
            "host": "www.our-web-site.com",
            "accept-language": "en-gb",
            "via": "1.1 www.our-web-site.com",
            "user-agent": "Mozilla/5.0 etc",
            "x-forwarded-proto": "https",
            "clientIp": "99.99.99.99",
            "referer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
            "accept-encoding": "gzip, deflate",
            "incap-client-ip": "99.99.99.99"
        },
        "body": {
            "pageId": "/content/our-web-site/en-gb/holidays/interstitial",
            "pageVersion": "1.0",
            "pageClassification": "product-page",
            "pageTags": "spark, python, rdd, other words",
            "MarketingReference": "BUYMEPLEASE",
            "referrer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
            "webSessionId": "abcdef12-1234-5678-fedc-cba9876543210"
        }
    }
}


asked Jan 24 '17 14:01

Dave Poole

1 Answer

The real problem was that we were trying to clean up data downstream when data quality practices upstream were poor (or totally absent).

Eventually it was accepted that we were attempting to address a symptom and not the cause. The cost of retrospectively fixing data was proven to be massively more than the cost of handling data properly in the first place.
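
For reference, the kind of downstream cleanup the question describes can be sketched as follows. This is not the approach that was finally adopted, only an illustration under stated assumptions: each RDD element is either a parsed document (a dict) or one JSON document held as a string, and the function names clean_value and clean_json_line are hypothetical.

```python
import json
import string

# Allowed characters, taken from string.printable.
PRINTABLE = frozenset(string.printable)


def clean_value(value):
    """Recursively strip non-printable characters from every string value."""
    if isinstance(value, str):
        return "".join(ch for ch in value if ch in PRINTABLE)
    if isinstance(value, dict):
        # Keys are left untouched in this sketch; only values are cleaned.
        return {key: clean_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [clean_value(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


def clean_json_line(line):
    """Parse one JSON document, clean it, and serialise it again."""
    return json.dumps(clean_value(json.loads(line)))


# Hypothetical sample shaped like the document in the question.
doc = {"body": {"userAgent": "Mozilla/5.0\x07 etc",
                "body": {"pageTags": "spark, python\u200b, rdd"}}}
cleaned = clean_value(doc)
print(cleaned)
```

Because both functions take one element and return one element, they can be passed to the RDD's map method, for example rdd.map(clean_value) for parsed documents or rdd.map(clean_json_line) for raw JSON strings, which yields a new RDD of cleansed elements rather than printing them one at a time.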


answered Sep 19 '22 00:09

Dave Poole


Tags: python-3.x, apache-spark, rdd, pyspark
