Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pyspark, looping through DataFrame in a more efficient way?

Tags:

python

pyspark

can someone maybe tell me a better way to loop through a df in Pyspark in my specific case. I am new to spark, so sorry for the question.

What I am doing is selecting the value of the id column of the df where the song_name is null. I append these to a list and get the track_ids for these values. With these track_ids I make an API-Request to get the missing song_names and replace the null value at that index with the returned song_name.

missing_track_name = df.filter(df['track_name'].isNull()).select(df['ID']).collect()

missing_list = [x[0] for x in missing_track_name]

for i in missing_list:

    track_id = df.filter(col('ID')==i).select(df.song_id).collect()

    url = 'https://api.spotify.com/v1/tracks/{0}'.format(track_id)
    request = requests.get(url, headers = header, params = {"limit" : 50})
    data = request.json()

    df = df.withColumn("track_name", when(col("ID") == i, data['name']).otherwise(col("track_name")))
    df = df.withColumn("artist_name", when(col("ID") == i, data['artists'][0]['name']).otherwise(col("artist_name")))

Sample rows from my table are (the music is from a friends Spotify, not exactly my taste):

genre artist_name track_name track_id popularity acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence ID
R&B Mary J. Blige Be Without You - Kendu Mix 2YegxR5As7BeQuVp2U6pek 65 0.083 0.724 246333 0.689 0.0 0.304 -5.922 0.135 146.496 0.693 0
R&B Rihanna Desperado 6KFaHC9G178beAp7P0Vi5S 63 0.323 0.685 186467 0.61 0.0 0.102 -5.221 0.0439 94.384 0.323 1
R&B Yung Bleu Ice On My Baby (feat. Kevin Gates) - Remix 6muW8cSjJ3rusKJ0vH5olw 62 0.0675 0.762 199520 0.52 3.95e-06 0.114 -5.237 0.0959 75.047 0.0862 2
R&B Surfaces Heaven Falls / Fall on Me 7yHqOZfsXYlicyoMt62yC6 61 0.36 0.563 240597 0.366 0.00243 0.0955 -6.896 0.121 85.352 0.768 3
R&B Olivia O'Brien Love Myself 4XzgjxGKqULifVf7mnDIQK 68 0.596 0.653 213947 0.621 0.0 0.0811 -5.721 0.0409 100.006 0.466 4
like image 798
paul773 Avatar asked Dec 21 '25 21:12

paul773


1 Answers

you can resort to an UDF. But I'd suggest you filter the null rows and run the UDF only for them.

here's my attempt

def getSpotifyTrackName(trackId, cId, cSecret):
    import requests, base64
    
    # fetch token -- feel free to do this outside the func
    # fetching within this func as token expires within 3600 secs 
    # and might expire midway through big process
    b64_encode_cred = base64.b64encode((cId+':'+cSecret).encode())
    token_url = 'https://accounts.spotify.com/api/token/'
    token_data = {
        'grant_type': 'client_credentials'
    }
    token_header = {
        'Authorization': 'Basic ' + b64_encode_cred.decode()
    }

    token_req_json = requests.post(token_url, headers=token_header, data=token_data).json()

    # fetch track details using aforementioned token
    track_url = 'https://api.spotify.com/v1/tracks/{0}'.format(trackId)
    track_header = {
        'Authorization': token_req_json['token_type']+' '+token_req_json['access_token']
    }

    track_data_json = requests.get(track_url,
                                   params = {"limit" : 50}, 
                                   headers=track_header
                                   ).json()
    
    return track_data_json['name'].strip()

# create UDF
getSpotifyTrackName_udf = func.udf(lambda t: getSpotifyTrackName(t, spotify_auth['client_id'], spotify_auth['client_secret']), 
                                   StringType()
                                   )

# use UDF to fetch track names in a new column
data_sdf. \
    withColumn('api_track_name', getSpotifyTrackName_udf('track_id')). \
    show(truncate=False)

# +-----+--------------+--------------------------+----------------------+------------------------------------------+
# |genre|artist_name   |track_name                |track_id              |api_track_name                            |
# +-----+--------------+--------------------------+----------------------+------------------------------------------+
# |R&B  |Mary J. Blige |Be Without You - Kendu Mix|2YegxR5As7BeQuVp2U6pek|Be Without You - Kendu Mix                |
# |R&B  |Rihanna       |null                      |6KFaHC9G178beAp7P0Vi5S|Desperado                                 |
# |R&B  |Yung Bleu     |null                      |6muW8cSjJ3rusKJ0vH5olw|Ice On My Baby (feat. Kevin Gates) - Remix|
# |R&B  |Surfaces      |Heaven Falls / Fall on Me |7yHqOZfsXYlicyoMt62yC6|Heaven Falls / Fall on Me                 |
# |R&B  |Olivia O'Brien|Love Myself               |4XzgjxGKqULifVf7mnDIQK|Love Myself                               |
# +-----+--------------+--------------------------+----------------------+------------------------------------------+
like image 83
samkart Avatar answered Dec 24 '25 09:12

samkart



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!