can someone maybe tell me a better way to loop through a df in Pyspark in my specific case. I am new to spark, so sorry for the question.
What I am doing is selecting the value of the id column of the df where the song_name is null. I append these to a list and get the track_ids for these values. With these track_ids I make an API-Request to get the missing song_names and replace the null value at that index with the returned song_name.
missing_track_name = df.filter(df['track_name'].isNull()).select(df['ID']).collect()
missing_list = [x[0] for x in missing_track_name]
for i in missing_list:
track_id = df.filter(col('ID')==i).select(df.song_id).collect()
url = 'https://api.spotify.com/v1/tracks/{0}'.format(track_id)
request = requests.get(url, headers = header, params = {"limit" : 50})
data = request.json()
df = df.withColumn("track_name", when(col("ID") == i, data['name']).otherwise(col("track_name")))
df = df.withColumn("artist_name", when(col("ID") == i, data['artists'][0]['name']).otherwise(col("artist_name")))
Sample rows from my table are (the music is from a friends Spotify, not exactly my taste):
| genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R&B | Mary J. Blige | Be Without You - Kendu Mix | 2YegxR5As7BeQuVp2U6pek | 65 | 0.083 | 0.724 | 246333 | 0.689 | 0.0 | 0.304 | -5.922 | 0.135 | 146.496 | 0.693 | 0 |
| R&B | Rihanna | Desperado | 6KFaHC9G178beAp7P0Vi5S | 63 | 0.323 | 0.685 | 186467 | 0.61 | 0.0 | 0.102 | -5.221 | 0.0439 | 94.384 | 0.323 | 1 |
| R&B | Yung Bleu | Ice On My Baby (feat. Kevin Gates) - Remix | 6muW8cSjJ3rusKJ0vH5olw | 62 | 0.0675 | 0.762 | 199520 | 0.52 | 3.95e-06 | 0.114 | -5.237 | 0.0959 | 75.047 | 0.0862 | 2 |
| R&B | Surfaces | Heaven Falls / Fall on Me | 7yHqOZfsXYlicyoMt62yC6 | 61 | 0.36 | 0.563 | 240597 | 0.366 | 0.00243 | 0.0955 | -6.896 | 0.121 | 85.352 | 0.768 | 3 |
| R&B | Olivia O'Brien | Love Myself | 4XzgjxGKqULifVf7mnDIQK | 68 | 0.596 | 0.653 | 213947 | 0.621 | 0.0 | 0.0811 | -5.721 | 0.0409 | 100.006 | 0.466 | 4 |
you can resort to an UDF. But I'd suggest you filter the null rows and run the UDF only for them.
here's my attempt
def getSpotifyTrackName(trackId, cId, cSecret):
import requests, base64
# fetch token -- feel free to do this outside the func
# fetching within this func as token expires within 3600 secs
# and might expire midway through big process
b64_encode_cred = base64.b64encode((cId+':'+cSecret).encode())
token_url = 'https://accounts.spotify.com/api/token/'
token_data = {
'grant_type': 'client_credentials'
}
token_header = {
'Authorization': 'Basic ' + b64_encode_cred.decode()
}
token_req_json = requests.post(token_url, headers=token_header, data=token_data).json()
# fetch track details using aforementioned token
track_url = 'https://api.spotify.com/v1/tracks/{0}'.format(trackId)
track_header = {
'Authorization': token_req_json['token_type']+' '+token_req_json['access_token']
}
track_data_json = requests.get(track_url,
params = {"limit" : 50},
headers=track_header
).json()
return track_data_json['name'].strip()
# create UDF
getSpotifyTrackName_udf = func.udf(lambda t: getSpotifyTrackName(t, spotify_auth['client_id'], spotify_auth['client_secret']),
StringType()
)
# use UDF to fetch track names in a new column
data_sdf. \
withColumn('api_track_name', getSpotifyTrackName_udf('track_id')). \
show(truncate=False)
# +-----+--------------+--------------------------+----------------------+------------------------------------------+
# |genre|artist_name |track_name |track_id |api_track_name |
# +-----+--------------+--------------------------+----------------------+------------------------------------------+
# |R&B |Mary J. Blige |Be Without You - Kendu Mix|2YegxR5As7BeQuVp2U6pek|Be Without You - Kendu Mix |
# |R&B |Rihanna |null |6KFaHC9G178beAp7P0Vi5S|Desperado |
# |R&B |Yung Bleu |null |6muW8cSjJ3rusKJ0vH5olw|Ice On My Baby (feat. Kevin Gates) - Remix|
# |R&B |Surfaces |Heaven Falls / Fall on Me |7yHqOZfsXYlicyoMt62yC6|Heaven Falls / Fall on Me |
# |R&B |Olivia O'Brien|Love Myself |4XzgjxGKqULifVf7mnDIQK|Love Myself |
# +-----+--------------+--------------------------+----------------------+------------------------------------------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With