
How to reindex malformed columns retrieved from pandas read_html?

I am using pandas read_html to retrieve content from a website that has several tables with the same number of columns. When I read a single link that has several such tables, pandas effectively reads all of them as one (something like a flat/normalized table). However, I would like to do the same for a list of links from the website (i.e. a single flat table for several links), so I tried the following:

In:

import multiprocessing
import pandas as pd

def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False) 
    return df_url

links = ['link1.com','link2.com','link3.com',...,'linkN.com']

pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df

Nevertheless, I guess I am not correctly telling read_html() which columns to use, so I am getting this malformed list of lists:

Out:

[[                Form     Disponibility  \
  0  290090 01780-500-01)  Unavailable - no product available for release.   

                             Relation  \

     Relation drawbacks  
  0                  NaN                        Removed 
  1                  NaN                        Removed ],
 [                                        Form  \

                                   Relation  \
  0  American Regent is currently releasing the 0.4...   
  1  American Regent is currently releasing the 1mg...   

     drawbacks  
  0  Demand increase for the drug  
  1                         Removed ,
                                          Form  \
  0  0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...   

    Disponibility  Relation  \
  0                            Product available                  NaN   
  2                        Removed 
  3                        Removed ]]

So my question is: which parameter should I change in order to get a flat pandas dataframe from the above nested list? I tried header=0, index_col=0 and match='"columns"', but none of them worked. Or do I need to do the flattening when I create the pandas dataframe with pd.DataFrame()? My main objective is to have a pandas dataframe with these columns:

form, Disponibility, Relation, drawbacks
1 
2
...
n
tumbleweed asked Oct 18 '22 23:10



1 Answer

IIUC you can do it this way:

First, you want to return a concatenated DF instead of a list of DFs. read_html returns a list of DFs, and your process() returns df_url (that list) rather than the concatenated df:

def process(url):
    return pd.concat(pd.read_html(url), ignore_index=False) 

and then concatenate them for all URLs:

df = pd.concat(pool.map(process, links), ignore_index=True)
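To see the two concatenation steps end to end without hitting the network, here is a minimal sketch. It is only an illustration under some assumptions: FAKE_PAGES is a hypothetical stand-in for what pd.read_html(url) would return for each link, the row values are made up, and ThreadPool is swapped in for multiprocessing.Pool so the snippet runs in a single process (it exposes the same map() call).

```python
from multiprocessing.pool import ThreadPool

import pandas as pd

COLUMNS = ["Form", "Disponibility", "Relation", "drawbacks"]

# Hypothetical stand-in for the website: each "URL" maps to the list of
# DataFrames that pd.read_html(url) would return for that page.
FAKE_PAGES = {
    "link1.com": [
        pd.DataFrame([["syringe A", "Unavailable", None, "Removed"]], columns=COLUMNS),
        pd.DataFrame([["syringe B", "Product available", None, "Removed"]], columns=COLUMNS),
    ],
    "link2.com": [
        pd.DataFrame(
            [["vial C", "Product available", None, "Demand increase for the drug"]],
            columns=COLUMNS,
        ),
    ],
}


def process(url):
    # Real use (assumption): tables = pd.read_html(url)
    tables = FAKE_PAGES[url]
    # Step 1: collapse the list of tables from one page into one DataFrame.
    return pd.concat(tables, ignore_index=True)


links = list(FAKE_PAGES)

# ThreadPool.map returns one DataFrame per link, in the same order as links.
with ThreadPool(processes=2) as pool:
    per_url = pool.map(process, links)

# Step 2: collapse the per-link DataFrames into a single flat DataFrame.
df = pd.concat(per_url, ignore_index=True)
print(df)
```

With ignore_index=True at both levels the final frame gets a fresh 0..n-1 index. If you go back to multiprocessing.Pool in a script, it is usually safer to create the pool under an `if __name__ == "__main__":` guard.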
MaxU - stop WAR against UA answered Oct 21 '22 04:10
