I was trying to get the movies released from January to March 2018 from the Wikipedia page using pandas read_html.
Here is my code:
import pandas as pd
import numpy as np
link = "https://en.wikipedia.org/wiki/2018_in_film"
tables = pd.read_html(link)
jan_march = tables[5].iloc[1:]
jan_march.columns = ['Opening1','Opening2','Title','Studio','Cast','Genre','Country','Ref']
jan_march.head()
There is some error in reading the columns. If anybody has already scraped Wikipedia tables, maybe they can help me solve the problem.
Thanks a lot.
Related links:
Scraping Wikipedia tables with Python selectively
https://roche.io/2016/05/scrape-wikipedia-with-python
Scraping paginated web table with python pandas & beautifulSoup
I am getting this:
But I am expecting:
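Because of how the table is designed, it is not as simple as calling pd.read_html(). The Month and Day cells appear to span several rows, so rows that lack them come back with their cells shifted to the left.
pd.read_html() is a start, but you will need to do some manipulation to get the data into a desirable format: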
import pandas as pd
link = "https://en.wikipedia.org/wiki/2018_in_film"
tables = pd.read_html(link,header=0)[5]
# find na values and shift cells right
i = 0
while i < 2:
    row_shift = tables[tables['Unnamed: 7'].isnull()].index
    tables.iloc[row_shift,:] = tables.iloc[row_shift,:].shift(1,axis=1)
    i+=1
# create new column names
tables.columns = ['Month', 'Day', 'Title', 'Studio', 'Cast and crew', 'Genre', 'Country', 'Ref.']
# forward fill values
tables['Month'] = tables['Month'].ffill()
tables['Day'] = tables['Day'].ffill()
Output:
  | Month | Day | Title | Studio | Cast and crew | Genre | Country | Ref.
0 | JANUARY | 5 | Insidious: The Last Key | Universal Pictures / Blumhouse Productions | Adam Robitel (director); Leigh Whannell (scree... | Horror, Thriller | US | [33]
1 | JANUARY | 5 | The Strange Ones | Vertical Entertainment | Lauren Wolkstein (director); Christopher Radcl... | Drama | US | [34]
2 | JANUARY | 5 | Stratton | Momentum Pictures | Simon West (director); Duncan Falconer, Warren... | Action, Thriller | IT, UK | [35]
3 | JANUARY | 10 | Sweet Country | Samuel Goldwyn Films | Warwick Thornton (director); David Tranter, St... | Drama | AUS | [36]
4 | JANUARY | 12 | The Commuter | Lionsgate / StudioCanal / The Picture Company | Jaume Collet-Serra (director); Byron Willinger... | Action, Crime, Drama, Mystery, Thriller | US, UK | [37]
5 | JANUARY | 12 | Proud Mary | Screen Gems | Babak Najafi (director); John S. Newman, Chris... | Action, Thriller | US | [38]
6 | JANUARY | 12 | Acts of Violence | Lionsgate Premiere | Brett Donowho (director); Nicolas Aaron Mezzan... | Action, Thriller | US | [39]
...
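If you want to see what the shift-and-fill step does without hitting Wikipedia, here is a minimal offline sketch on a small hand-made frame. The helper name shift_and_fill, the four-column layout, and the toy rows are made up for illustration and are not part of pandas or of the real page. It also assumes the last column is never legitimately empty, since an empty last cell is what marks a row as shifted.

```python
import numpy as np
import pandas as pd


def shift_and_fill(df, last_col, fill_cols, max_shifts=2):
    """Hypothetical helper: push left-shifted rows right, then forward-fill."""
    out = df.astype(object)  # object dtype so cells can move between columns
    for _ in range(max_shifts):
        # A row is "short" while its last column is still empty.
        short = out[last_col].isna()
        if not short.any():
            break
        # Move every cell of the short rows one column to the right.
        out.loc[short, :] = out.loc[short].shift(1, axis=1).to_numpy()
    # The leading cells are now empty on those rows; copy them down from above.
    out[fill_cols] = out[fill_cols].ffill()
    return out


# Toy data mimicking the misaligned rows: row 1 lacks Month and Day,
# row 2 lacks only Month, so their cells start too far to the left.
raw = pd.DataFrame(
    [
        ["JANUARY", "5", "Insidious: The Last Key",
         "Universal Pictures / Blumhouse Productions"],
        ["The Strange Ones", "Vertical Entertainment", np.nan, np.nan],
        ["10", "Sweet Country", "Samuel Goldwyn Films", np.nan],
    ],
    columns=["Month", "Day", "Title", "Studio"],
)

fixed = shift_and_fill(raw, last_col="Studio", fill_cols=["Month", "Day"])
print(fixed)
```

The loop runs at most twice because a row can be missing at most two leading cells (Month and Day), which is the same reason the loop above runs while i < 2.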