
How to scrape 'In more languages' table on Wikidata?

Tags:

r

wikidata

I'm trying to scrape the WHOLE 'In more languages' table on Wikidata pages, e.g. https://www.wikidata.org/wiki/Q3044

I have tried 2 approaches in R:

library(rvest)
url <- "https://www.wikidata.org/wiki/Q3044"
pg <- url %>% read_html

pg <- pg %>% 
  html_nodes(".wikibase-entitytermsforlanguagelistview") %>%
  html_table()

table <- pg[[1]]

But this returns only the English part (1 row).

I have also tried:

library(tidywikidatar)
tw_get_label(id = c("Q3044"),language = "nl")

But this returns only one label for one language. I would like every language, including everything in the 'Also known as' column on Wikidata.

Any help would be much appreciated!

user9988485 asked Oct 29 '25 07:10


1 Answer

What an excellent question. You're only getting the first row of the table because that's all the page's initial HTML contains; JavaScript fills in the rest of the table after the page loads. You can see this happen if you reload the page and watch closely (the gif below shows it). Since read_html doesn't execute JavaScript, all R gets is the original page.

[Gif: page refresh with only the first row initially loaded]

However, all this means is that we need to look for a different URL that serves the full table. Using Chrome's developer tools we learn that the table's coming from https://www.wikidata.org/wiki/Special:EntityData/Q3044.json, and that's the page we actually want to read. If we download that using jsonlite we don't get the table exactly, but we can reassemble it using some dplyr tools. Here's a snippet of code that does that:


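library(dplyr)

# Download the entity as JSON and pick out the record for Q3044
wiki_data <- jsonlite::read_json("https://www.wikidata.org/wiki/Special:EntityData/Q3044.json")
table_data <- wiki_data$entities$Q3044

# Labels and descriptions: one {language, value} entry per language
label_col <- bind_rows(table_data$labels) %>% rename(label = value)
desc_col <- bind_rows(table_data$descriptions) %>% rename(description = value)

# Aliases: possibly several per language, so collapse them into one string
alias_col <- bind_rows(table_data$aliases) %>%
  rename(alias = value) %>%
  group_by(language) %>%
  summarise(alias = paste(alias, collapse = ", "))

# Join on language; languages without a description or alias get NA
full_table <- label_col %>%
  left_join(desc_col, by = "language") %>%
  left_join(alias_col, by = "language")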

The first few rows of the output are shown below:

> full_table
# A tibble: 157 x 4
   language label                         description                                        alias
   <chr>    <chr>                         <chr>                                              <chr>
 1 fr       Charlemagne                   empereur d'Occident et roi des Francs              Char~
 2 en       Charlemagne                   King of the Franks, King of Italy, and Holy Roman~ Karo~
 3 it       Carlo Magno                   re dei Franchi e dei Longobardi e primo imperator~ NA   
 4 ilo      Karlomagno                    Ari dagiti Pranko ken Lombardo ken Emperador ti N~ NA   
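
If you need this for more than one item, the reassembly step can be wrapped in a function. Below is a minimal sketch in base R (so it runs without dplyr). The helper names terms_to_df and entity_terms are hypothetical, and sample_entity is a hand-written toy list that only imitates the shape of the JSON (named labels/descriptions, plus a list of aliases per language); it is not real Wikidata output.

```r
# Hypothetical helper: flatten a named list of {language, value} objects
# (the shape of `labels` and `descriptions`) into a two-column data frame.
terms_to_df <- function(terms, column) {
  df <- data.frame(
    language = vapply(terms, function(e) e$language, character(1), USE.NAMES = FALSE),
    value    = vapply(terms, function(e) e$value, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
  names(df)[2] <- column
  df
}

# Hypothetical helper: build the language/label/description/alias table
# from one parsed entity (a nested list).
entity_terms <- function(entity) {
  labels       <- terms_to_df(entity$labels, "label")
  descriptions <- terms_to_df(entity$descriptions, "description")

  # `aliases` holds a list of objects per language; collapse each to one string.
  aliases <- data.frame(
    language = as.character(names(entity$aliases)),
    alias = vapply(
      entity$aliases,
      function(a) paste(vapply(a, function(e) e$value, character(1)), collapse = ", "),
      character(1), USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )

  # Left joins keep every labelled language; missing terms become NA.
  out <- merge(labels, descriptions, by = "language", all.x = TRUE)
  merge(out, aliases, by = "language", all.x = TRUE)
}

# Toy input (assumed shape, made-up alias/description values)
sample_entity <- list(
  labels = list(
    en = list(language = "en", value = "Charlemagne"),
    it = list(language = "it", value = "Carlo Magno")
  ),
  descriptions = list(
    en = list(language = "en", value = "toy description")
  ),
  aliases = list(
    en = list(
      list(language = "en", value = "Alias A"),
      list(language = "en", value = "Alias B")
    )
  )
)

terms_table <- entity_terms(sample_entity)
print(terms_table)
```

With live data you would pass in the entity from the snippet above instead, i.e. entity_terms(wiki_data$entities$Q3044). Note that merge sorts the rows by language, so the order differs from the dplyr version.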
Dubukay answered Oct 31 '25 21:10


