 

Python Wikipedia scraping: getting the links to the same page in other languages?

How can I get all the links from a Wikipedia page to the same page in other languages, using either the wikipedia or the wikitools package?

For example:

I have the page http://en.wikipedia.org/wiki/Stack_overflow and I'm trying to get the links to the same page in the other available languages, such as http://ko.wikipedia.org/wiki/%EC%8A%A4%ED%83%9D_%EC%98%A4%EB%B2%84%ED%94%8C%EB%A1%9C (in Korean)

and http://zh.wikipedia.org/wiki/%E5%A0%86%E7%96%8A%E6%BA%A2%E4%BD%8D (in Chinese).

I want to get all of the possible pages.

My question is similar to "How to get wikipedia page in multi languages?", except that I'm trying to figure out whether the same job can be done with the packages mentioned above (they are easy to install through pip) instead of reinventing the wheel.

I'd also love to hear if it is not possible, or if there are other packages that do this job easily. Thanks!

nivniv asked Oct 21 '22 04:10


1 Answer

I haven't found this exact functionality in either the wikipedia or the wikitools package. The wikipedia package does, however, let you switch languages with its set_lang() function.

I don't see anything wrong with getting the list of languages via BeautifulSoup and then using wikipedia to fetch the page content in each language:

from urllib.request import urlopen

from bs4 import BeautifulSoup
import wikipedia

# Collect (language code, title attribute) for every interlanguage link
url = 'http://en.wikipedia.org/wiki/Stack_Overflow'
soup = BeautifulSoup(urlopen(url), 'html.parser')
links = [(el.get('lang'), el.get('title')) for el in soup.select('li.interlanguage-link > a')]

for language, title in links:
    # Keep only the part of the title attribute before the ' – ' separator
    page_title = title.split(' – ')[0]
    wikipedia.set_lang(language)
    page = wikipedia.page(page_title)
    print(language)
    print(page.summary)
    print("-----")

Prints:

de
Pufferüberläufe (englisch buffer overflow) gehören zu den häufigsten Sicherheitslücken in aktueller Software, die sich u. a. über das Internet ausnutzen lassen können. Im Wesentlichen werden bei einem Pufferüberlauf durch Fehler im Programm zu große Datenmengen in einen dafür zu kleinen reservierten Speicherbereich, den Puffer, geschrieben, wodurch nach dem Ziel-Speicherbereich liegende Speicherstellen überschrieben werden.
Dreht es sich nicht um einen ganzen Datenblock, sondern um eine Zieladresse eines einzelnen Datensatzes, spricht man auch von pointer overflow, nach dem Pointer (Zeiger), der anzeigt, wo der Datensatz im Puffer hingeschrieben werden soll.

-----
es
En informática, un desbordamiento de pila (stack overflow/overrun) es un problema aritmético que hace referencia al exceso de flujo de datos almacenados en la pila de una función, esto permite que la dirección de retorno de la pila pueda ser modificada por otra parte de un atacante para obtener un beneficio propio, que generalmente es malicioso.

...
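As a side note, splitting the title attribute on ' – ' depends on how that attribute happens to be formatted. A possible alternative is to read the language code and the page title from the link's href instead. The sketch below is written for Python 3 and assumes the href has the same shape as the URLs in the question (host starting with the language code, path starting with /wiki/). The helper name split_wiki_url is made up for illustration and is not part of either package.

```python
from urllib.parse import unquote, urlsplit


def split_wiki_url(url):
    """Return (language_code, page_title) for a Wikipedia article URL.

    Assumes a host like "<lang>.wikipedia.org" and a path like "/wiki/<title>".
    Protocol-relative URLs ("//de.wikipedia.org/...") are accepted as well.
    """
    parts = urlsplit(url)
    # The language code is the first label of the host name
    language = parts.netloc.split(".")[0]
    # Take what follows "/wiki/", decode percent-escapes, restore spaces
    encoded_title = parts.path.partition("/wiki/")[2]
    title = unquote(encoded_title).replace("_", " ")
    return language, title


# The Korean URL from the question
print(split_wiki_url(
    "http://ko.wikipedia.org/wiki/"
    "%EC%8A%A4%ED%83%9D_%EC%98%A4%EB%B2%84%ED%94%8C%EB%A1%9C"
))
```

The returned title could then be passed to wikipedia.page() after calling wikipedia.set_lang() with the returned language code.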

You may also switch to BeautifulSoup completely, but this can easily lead to reinventing the wheel:

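from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

base_url = 'http://en.wikipedia.org/wiki/Stack_Overflow'

# Collect (language code, href) for every interlanguage link
soup = BeautifulSoup(urlopen(base_url), 'html.parser')
links = [(el.get('lang'), el.get('href')) for el in soup.select('li.interlanguage-link > a')]

for language, link in links:
    # urljoin resolves protocol-relative hrefs ("//de.wikipedia.org/...") against base_url
    soup = BeautifulSoup(urlopen(urljoin(base_url, link)), 'html.parser')
    print(language, soup.title.text)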

Prints:

de Stack Overflow (Website) – Wikipedia
es Stack Overflow - Wikipedia, la enciclopedia libre
fa استک اورفلو - ویکی‌پدیا، دانشنامهٔ آزاد
fr Stack Overflow — Wikipédia
ko 스택 오버플로 (웹사이트) - 위키백과, 우리 모두의 백과사전
it Stack Overflow - Wikipedia
hu Stack Overflow - Wikipédia
ja Stack Overflow - Wikipedia
pl StackOverflow – Wikipedia, wolna encyklopedia
ro Stack Overflow - Wikipedia
ru Stack Overflow — Википедия
ta இசுட்டாக் ஓவர்ஃபுலோ - தமிழ் விக்கிப்பீடியா
uk Stack Overflow — Вікіпедія
zh Stack Overflow - 维基百科,自由的百科全书
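
Another option that avoids parsing HTML altogether is the MediaWiki API, which can return a page's interlanguage links through a prop=langlinks query. What follows is only a minimal sketch for Python 3 using the standard library. The function names, the User-Agent string, and the choice of formatversion=2 are my own assumptions, and result continuation (for pages with more links than one response holds) is not handled.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_langlinks_url(title, api_url=API_URL):
    """Build a MediaWiki API query URL asking for a page's language links."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",      # as many links as one response allows
        "llprop": "url",       # include the full URL of each linked page
        "format": "json",
        "formatversion": "2",  # "pages" is a list, titles are under "title"
    }
    return api_url + "?" + urlencode(params)


def parse_langlinks(response):
    """Map language code -> (title, url) from a decoded API response."""
    links = {}
    for page in response.get("query", {}).get("pages", []):
        for link in page.get("langlinks", []):
            links[link["lang"]] = (link["title"], link.get("url"))
    return links


def fetch_langlinks(title):
    """Fetch and parse the language links of one page (needs network access)."""
    request = Request(
        build_langlinks_url(title),
        headers={"User-Agent": "langlinks-example/0.1"},  # placeholder value
    )
    with urlopen(request) as resp:
        return parse_langlinks(json.load(resp))


# Example call (not executed here because it needs network access):
# for lang, (title, url) in sorted(fetch_langlinks("Stack Overflow").items()):
#     print(lang, title, url)
```

Keeping the URL building and the response parsing in separate functions makes both easy to check without touching the network.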
alecxe answered Oct 22 '22 16:10