I've been banging my head against this for ages; I must be doing something stupid.
I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias.
Here is my Python code so far, which simply tries to retrieve one of the tables:
import httplib
from lxml import etree
def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET","/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())
    table = root.xpath('//table')
    print table
main()
On my machine this only prints an empty list. To increase speed I cached the page locally and used:
wikipage = open("wikipage.html")
root = etree.parse(wikipage)
but this makes no difference whatsoever (other than the obvious speedup). I have also tried:
root.find('table')
and:
for element in root.iter():
    print("%s - %s" % (element.tag, element.text))
which successfully prints out all of the elements, so I know the tree is being created.
What am I doing wrong?
Any help would be appreciated. Thanks.
Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that involve such element names is the most frequently asked question about XPath, and it has numerous good answers under the SO xpath tag. Just search for them.
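To see the effect in isolation, here is a minimal sketch using lxml on a tiny inline document. The XHTML string is a made-up stand-in for the real page, not the actual Wikipedia markup; it only assumes that the real page declares the XHTML namespace on its root element.

```python
from lxml import etree

# Hypothetical stand-in for the real page: the xmlns attribute puts every
# element in the XHTML namespace, as described above.
XHTML = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><table><tr><td>English</td></tr></table></body>'
    b'</html>'
)

root = etree.fromstring(XHTML)

# In XPath 1.0 an unprefixed name only matches elements in *no* namespace,
# so this finds nothing even though a table is clearly present.
unqualified = root.xpath('//table')

# local-name() ignores the namespace, which is handy for a quick check.
by_local_name = root.xpath('//*[local-name() = "table"]')

print(root.tag)            # {http://www.w3.org/1999/xhtml}html
print(len(unqualified))    # 0
print(len(by_local_name))  # 1
```

The namespace shown in root.tag is the giveaway: if a tag prints with a {...} prefix, a bare element name in XPath will not match it.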
Here is a complete solution:
Use:
(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()
where you have registered the XHTML namespace ("http://www.w3.org/1999/xhtml") bound to the prefix "x".
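In lxml, one way to register the prefix is the namespaces argument of xpath(). The following is a sketch under assumptions: SAMPLE is a hypothetical, heavily simplified table, and the real page may be structured differently (for example, with extra wrapper elements).

```python
from lxml import etree

# Bind the prefix "x" to the XHTML namespace; the prefix name itself is arbitrary.
XHTML_NS = {'x': 'http://www.w3.org/1999/xhtml'}

# Hypothetical miniature of the page: one header row, two data rows.
SAMPLE = b'''<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td><a href="#">English</a></td></tr>
<tr><td>2</td><td><a href="#">German</a></td></tr>
</table>
</body></html>'''

root = etree.fromstring(SAMPLE)

# Same expression as above: first table, rows without header cells,
# all text nodes under the second cell of each row.
expr = '(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()'
languages = root.xpath(expr, namespaces=XHTML_NS)

print(languages)  # ['English', 'German']
```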
I evaluated this XPath expression against the document obtained from http://s23.org/wikistats/wikipedias_html
Because I was working locally and didn't have the DTD for XHTML, I needed to add the following at the start of the document; you may not need it:
<!DOCTYPE html [
<!ENTITY uarr "↑">
<!ENTITY darr "↓">
<!ENTITY ccedil "ç">
<!ENTITY oslash "ø">
<!ENTITY aacute "á">
<!ENTITY aring "å">
<!ENTITY agrave "à">
<!ENTITY egrave "è">
<!ENTITY ograve "ò">
<!ENTITY ocirc "ô">
]>
The result of applying the above XPath expression to this document is:
                    English
                    German
                    French
                    Polish
                    Italian
                    Japanese
                    Spanish
                    Portuguese
                    Dutch
                    Russian
                    Swedish
                    Chinese
                    Catalan
                    Norwegian (Bokmål)
                    Finnish
                    Ukrainian
                    Czech
                    Hungarian
                    Romanian
                    Korean
                    Turkish
                    Vietnamese
                    Indonesian
                    Danish
                    Arabic
                    Esperanto
                    Serbian
                    Lithuanian
                    Slovak
                    Volapük
                    Persian
                    Hebrew
                    Bulgarian
                    Slovenian
                    Malay
                    Waray-Waray
                    Croatian
                    Estonian
                    Newar / Nepal Bhasa
                    Simple English
                    Hindi
                    Galician
                    Thai
                    Basque
                    Norwegian (Nynorsk)
                    Aromanian
                    Greek
                    Haitian
                    Azerbaijani
                    Tagalog
                    Latin
                    Telugu
                    Georgian
                    Macedonian
                    Cebuano
                    Serbo-Croatian
                    Breton
                    Piedmontese
                    Marathi
                    Latvian
                    Luxembourgish
                    Javanese
                    Belarusian (Taraškievica)
                    Welsh
                    Icelandic
                    Bosnian
                    Albanian
                    Tamil
                    Belarusian
                    Bishnupriya Manipuri
                    Aragonese
                    Occitan
                    Bengali
                    Swahili
                    Ido
                    Lombard
                    West Frisian
                    Gujarati
                    Afrikaans
                    Low Saxon
                    Malayalam
                    Quechua
                    Sicilian
                    Urdu
                    Kurdish
                    Cantonese
                    Sundanese
                    Asturian
                    Neapolitan
                    Samogitian
                    Armenian
                    Yoruba
                    Irish
                    Chuvash
                    Walloon
                    Nepali
                    Ripuarian
                    Western Panjabi
                    Kannada
                    Tajik
                    Tarantino
                    Venetian
                    Yiddish
                    Scottish Gaelic
                    Tatar
                    Min Nan
                    Ossetian
                    Uzbek
                    Alemannic
                    Kapampangan
                    Sakha
                    Egyptian Arabic
                    Kazakh
                    Maori
                    Limburgian
                    Amharic
                    Nahuatl
                    Upper Sorbian
                    Gilaki
                    Corsican
                    Gan
                    Mongolian
                    Scots
                    Interlingua
                    Central_Bicolano
                    Burmese
                    Faroese
                    Võro
                    Dutch Low Saxon
                    Sinhalese
                    Turkmen
                    West Flemish
                    Sanskrit
                    Bavarian
                    Malagasy
                    Manx
                    Ilokano
                    Divehi
                    Norman
                    Pangasinan
                    Banyumasan
                    Sorani
                    Romansh
                    Northern Sami
                    Zazaki
                    Mazandarani
                    Wu
                    Friulian
                    Uyghur
                    Ligurian
                    Maltese
                    Bihari
                    Novial
                    Tibetan
                    Anglo-Saxon
                    Kashubian
                    Sardinian
                    Classical Chinese
                    Fiji Hindi
                    Khmer
                    Ladino
                    Zamboanga Chavacano
                    Pali
                    Franco-Provençal/Arpitan
                    Pashto
                    Hakka
                    Cornish
                    Punjabi
                    Navajo
                    Silesian
                    Kalmyk
                    Pennsylvania German
                    Hawaiian
                    Saterland Frisian
                    Interlingue
                    Somali
                    Komi
                    Karachay-Balkar
                    Crimean Tatar
                    Tongan
                    Acehnese
                    Meadow Mari
                    Picard
                    Erzya
                    Lingala
                    Kinyarwanda
                    Extremaduran
                    Guarani
                    Kirghiz
                    Emilian-Romagnol
                    Assyrian Neo-Aramaic
                    Papiamentu
                    Aymara
                    Chechen
                    Lojban
                    Wolof
                    Banjar
                    Bashkir
                    North Frisian
                    Greenlandic
                    Tok Pisin
                    Udmurt
                    Kabyle
                    Tahitian
                    Sranan
                    Zealandic
                    Hill Mari
                    Komi-Permyak
                    Lower Sorbian
                    Abkhazian
                    Gagauz
                    Igbo
                    Oriya
                    Lao
                    Kongo
                    Avar
                    Moksha
                    Mirandese
                    Romani
                    Old Church Slavonic
                    Karakalpak
                    Samoan
                    Moldovan
                    Tetum
                    Gothic
                    Kashmiri
                    Bambara
                    Inupiak
                    Sindhi
                    Bislama
                    Lak
                    Nauruan
                    Norfolk
                    Inuktitut
                    Pontic
                    Assamese
                    Cherokee
                    Min Dong
                    Swati
                    Palatinate German
                    Hausa
                    Ewe
                    Tigrinya
                    Oromo
                    Zulu
                    Zhuang
                    Venda
                    Tsonga
                    Kirundi
                    Dzongkha
                    Sango
                    Cree
                    Chamorro
                    Luganda
                    Buginese
                    Buryat (Russia)
                    Fijian
                    Chichewa
                    Akan
                    Sesotho
                    Xhosa
                    Fula
                    Tswana
                    Kikuyu
                    Tumbuka
                    Shona
                    Twi
                    Cheyenne
                    Ndonga
                    Sichuan Yi
                    Choctaw
                    Marshallese
                    Afar
                    Kuanyama
                    Hiri Motu
                    Muscogee
                    Kanuri
                    Herero
Note that every second selected node is a whitespace-only text node. If you don't want these selected, use:
(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]
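The difference between the two expressions can be checked on a small example. This is only an illustrative sketch: SAMPLE is a hypothetical fragment in which each cell has a line break before the link, which is the kind of markup that produces the whitespace-only nodes.

```python
from lxml import etree

XHTML_NS = {'x': 'http://www.w3.org/1999/xhtml'}

# Hypothetical fragment: each second cell starts with a newline and indentation,
# so it contains a whitespace-only text node before the link text.
SAMPLE = b'''<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td>
    <a href="#">English</a></td></tr>
<tr><td>2</td><td>
    <a href="#">German</a></td></tr>
</table>
</body></html>'''

root = etree.fromstring(SAMPLE)
base = '(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()'

# Every text node, including the whitespace-only ones.
all_text = root.xpath(base, namespaces=XHTML_NS)

# normalize-space() yields an empty string for whitespace-only nodes,
# and an empty string is false in a predicate, so those nodes are dropped.
non_blank = root.xpath(base + '[normalize-space()]', namespaces=XHTML_NS)

print(len(all_text))  # 4
print(non_blank)      # ['English', 'German']
```

Note that the predicate only removes nodes that are entirely whitespace; a node such as ' English ' would be kept with its padding, so calling strip() on each result in Python may still be worthwhile.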
Parse it as HTML.
from lxml import html
url = 'http://meta.wikimedia.org/wiki/List_of_Wikipedias'
tree = html.parse(url)
languages = tree.xpath('//table/tr/td[2]/a/text()')
print('\n'.join(languages))
English
German
French
Polish
Italian
Japanese
Spanish
Portuguese
Dutch
Russian
Swedish
Chinese
Catalan
Norwegian (Bokmål)
Finnish
Ukrainian
Czech
Hungarian
Romanian
Korean
Turkish
Vietnamese
Indonesian
Danish
Arabic
Esperanto
Serbian
Lithuanian
Slovak
Volapük
Persian
Hebrew
Bulgarian
Slovenian
Malay
Waray-Waray
Croatian
Estonian
Newar / Nepal Bhasa
Simple English
Hindi
Galician
Thai
Basque
Norwegian (Nynorsk)
Aromanian
Greek
Haitian
Azerbaijani
Tagalog
Latin
Telugu
Georgian
Macedonian
Cebuano
Serbo-Croatian
Breton
Piedmontese
Marathi
Latvian
Luxembourgish
Javanese
Belarusian (Taraškievica)
Welsh
Icelandic
Bosnian
Albanian
Tamil
Belarusian
Bishnupriya Manipuri
Aragonese
Occitan
Bengali
Swahili
Ido
Lombard
West Frisian
Gujarati
Afrikaans
Low Saxon
Malayalam
Quechua
Sicilian
Urdu
Kurdish
Cantonese
Sundanese
Asturian
Neapolitan
Samogitian
Armenian
Yoruba
Irish
Chuvash
Walloon
Nepali
Ripuarian
Western Panjabi
Kannada
Tajik
Tarantino
Venetian
Yiddish
Scottish Gaelic
Tatar
Min Nan
Ossetian
Uzbek
Alemannic
Kapampangan
Sakha
Kazakh
Egyptian Arabic
Maori
Amharic
Limburgian
Nahuatl
Upper Sorbian
Gilaki
Corsican
Gan
Mongolian
Scots
Interlingua
Central_Bicolano
Burmese
Faroese
Võro
Dutch Low Saxon
Sinhalese
Turkmen
West Flemish
Sanskrit
Bavarian
Malagasy
Manx
Ilokano
Divehi
Norman
Pangasinan
Banyumasan
Sorani
Romansh
Northern Sami
Zazaki
Mazandarani
Wu
Friulian
Uyghur
Ligurian
Maltese
Bihari
Novial
Tibetan
Anglo-Saxon
Kashubian
Sardinian
Classical Chinese
Fiji Hindi
Khmer
Ladino
Zamboanga Chavacano
Pali
Franco-Provençal/Arpitan
Pashto
Hakka
Cornish
Punjabi
Navajo
Silesian
Kalmyk
Pennsylvania German
Hawaiian
Saterland Frisian
Interlingue
Somali
Komi
Karachay-Balkar
Crimean Tatar
Tongan
Acehnese
Meadow Mari
Picard
Kinyarwanda
Erzya
Lingala
Extremaduran
Guarani
Kirghiz
Emilian-Romagnol
Assyrian Neo-Aramaic
Papiamentu
Aymara
Chechen
Lojban
Wolof
Banjar
Bashkir
North Frisian
Greenlandic
Tok Pisin
Udmurt
Kabyle
Tahitian
Sranan
Zealandic
Hill Mari
Komi-Permyak
Lower Sorbian
Abkhazian
Gagauz
Igbo
Oriya
Lao
Kongo
Avar
Moksha
Mirandese
Romani
Old Church Slavonic
Karakalpak
Samoan
Moldovan
Tetum
Gothic
Kashmiri
Bambara
Inupiak
Sindhi
Bislama
Lak
Nauruan
Norfolk
Inuktitut
Pontic
Assamese
Cherokee
Min Dong
Palatinate German
Swati
Hausa
Ewe
Tigrinya
Oromo
Zulu
Zhuang
Venda
Tsonga
Kirundi
Cree
Dzongkha
Sango
Chamorro
Luganda
Buginese
Buryat (Russia)
Fijian
Chichewa
Akan
Sesotho
Xhosa
Fula
Tswana
Kikuyu
Tumbuka
Shona
Twi
Cheyenne
Ndonga
Sichuan Yi
Choctaw
Marshallese
Afar
Kuanyama
Hiri Motu
Muscogee
Kanuri
Herero
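Since the goal in the question is to end up with a text file, here is one possible way to write the names out, sketched with the standard library only. The helper name write_languages, the sample list, and the output path are all assumptions made for illustration; in practice you would pass in the list returned by xpath().

```python
import io
import os
import tempfile


def write_languages(languages, path):
    """Write one language name per line as UTF-8 and return the number written."""
    # Strip padding and skip blank entries such as whitespace-only text nodes.
    cleaned = [name.strip() for name in languages if name.strip()]
    with io.open(path, 'w', encoding='utf-8') as handle:
        for name in cleaned:
            handle.write(name + '\n')
    return len(cleaned)


# A few names from the output above, plus a blank entry that should be skipped.
sample = ['English', '   ', 'Norwegian (Bokmål)', 'Võro']

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), 'wikipedia_languages.txt')
count = write_languages(sample, out_path)

# Read the file back to confirm what was written.
with io.open(out_path, encoding='utf-8') as handle:
    written = handle.read()

print(count)  # 3
```

Opening the file with an explicit UTF-8 encoding matters here, because many of the names (Bokmål, Võro, Volapük) contain non-ASCII characters.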