I'm trying to scrape the WHOLE 'In more languages' table on Wikidata pages, e.g. https://www.wikidata.org/wiki/Q3044
I have tried 2 approaches in R:
library(rvest)
url <- "https://www.wikidata.org/wiki/Q3044"
pg <- url %>% read_html
pg <- pg %>%
  html_nodes(".wikibase-entitytermsforlanguagelistview") %>%
  html_table()
table <- pg[[1]]
But this returns only the English part (1 row).
I have also tried:
library(tidywikidatar)
tw_get_label(id = c("Q3044"),language = "nl")
But this returns only one label, whereas I would like every entry in the 'Also known as' column on Wikidata.
Any help would be much appreciated!
What an excellent question. You're only getting the first row of the table because that's all the initial HTML contains; JavaScript running in the background loads the rest of the table after the page loads. You can see this happen if you reload the page and watch closely. Since rvest doesn't run that JavaScript, all it gets is the original page.

However, all this means is that we need to look for a different URL that's sourcing the full table. Using Chrome's developer tools we learn that the table's coming from https://www.wikidata.org/wiki/Special:EntityData/Q3044.json and that's the page we actually want to scrape. If we download that using jsonlite we don't get the table exactly, but we can reassemble it using some dplyr tools. Here's a snippet of code that does that:
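library(dplyr)

# Download the entity as JSON and pull out the record for Q3044
wiki_data <- jsonlite::read_json("https://www.wikidata.org/wiki/Special:EntityData/Q3044.json")
table_data <- wiki_data$entities$Q3044

# Labels and descriptions: one entry per language
label_col <- bind_rows(table_data$labels) %>% rename(label = value)
desc_col <- bind_rows(table_data$descriptions) %>% rename(description = value)

# Aliases: several entries per language, so collapse them into one string each
alias_col <- bind_rows(table_data$aliases) %>%
  rename(alias = value) %>%
  group_by(language) %>%
  summarise(alias = paste(alias, collapse = ", "))

# Join everything on the language code
full_table <- label_col %>%
  left_join(desc_col, by = "language") %>%
  left_join(alias_col, by = "language")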
The first few rows of the output are shown below:
> full_table
# A tibble: 157 x 4
  language label       description                                        alias
  <chr>    <chr>       <chr>                                              <chr>
1 fr       Charlemagne empereur d'Occident et roi des Francs              Char~
2 en       Charlemagne King of the Franks, King of Italy, and Holy Roman~ Karo~
3 it       Carlo Magno re dei Franchi e dei Longobardi e primo imperator~ NA
4 ilo      Karlomagno  Ari dagiti Pranko ken Lombardo ken Emperador ti N~ NA
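Since the question asks about Wikidata pages in general, the same reassembly could be wrapped in a function that works for any entity. The following is only a sketch in base R: entity_terms(), term_values() and fetch_entity() are hypothetical helper names I made up, not part of jsonlite or any other package, and fetch_entity() assumes the JSON is keyed by the id you asked for. Unlike the joins above, it keeps every language that has a label, a description, or an alias, and fills the gaps with NA.

```r
# Hypothetical helpers (not from any package): flatten one parsed Wikidata
# entity into a language / label / description / alias data frame.

# Pull the `value` field out of each entry; keeps the list names, if any.
term_values <- function(terms) {
  vapply(terms, function(x) x$value, character(1))
}

entity_terms <- function(entity, sep = ", ") {
  labels <- term_values(entity$labels)
  descriptions <- term_values(entity$descriptions)
  # Aliases hold a list of entries per language, so collapse each to one string.
  aliases <- vapply(
    entity$aliases,
    function(entries) paste(term_values(entries), collapse = sep),
    character(1)
  )
  # Every language code that appears in any of the three term types.
  languages <- union(names(labels), union(names(descriptions), names(aliases)))
  data.frame(
    language = languages,
    label = unname(labels[languages]),             # NA where no label exists
    description = unname(descriptions[languages]), # NA where no description exists
    alias = unname(aliases[languages]),            # NA where no alias exists
    stringsAsFactors = FALSE
  )
}

# Assumption: the entity is stored under `entities[[id]]` in the response.
fetch_entity <- function(id) {
  url <- sprintf("https://www.wikidata.org/wiki/Special:EntityData/%s.json", id)
  jsonlite::read_json(url)$entities[[id]]
}

# Usage (needs network access):
# full_table <- entity_terms(fetch_entity("Q3044"))
```

Keeping the download separate from the reshaping means you can fetch each entity once, store the parsed list, and rebuild the table without hitting Wikidata again.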