Tabulate data extracted from a .pdf into pandas

I am extracting the text of the first page of a PDF with PyPDF2:

import PyPDF2
pdf_file = open("123.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
page_content

The pdf I want to extract data from looks like this:

[image of the PDF page]

page_content
Out[157]: "RiderNatio\nn Motorcycle\nTotal Time\nPosKm/hGap\nTeam \nGRAND PRIX OF QATAR\nResults and timing service provided by\n5380 m.osail International Circ\nuMotoGPŽ\nRaceClassification after 20 laps = 107.6 km\n2925YAMAHA\nMaverick VIÑALES\nSPA138'59.999\n165.5\n25Movistar Yamaha MotoGP\n4DUCATI\nAndrea DOVIZIOSO\nITA239'00.460\n165.50.461\n20Ducati Team\n46YAMAHA\nValentino ROSSI\nITA339'01.927\n165.41.928\n16Movistar Yamaha MotoGP\n93HONDAMarc MARQUEZ\nSPA439'06.744\n165.06.745\n13Repsol Honda Team\n26HONDADani PEDROS\nASPA539'07.127\n165.07.128\n11Repsol Honda Team\n41APRILIA\nAleix ESPARGARO\nSPA639'07.660\n164.97.661\n10Aprilia Racing Team Gresini\n45DUCATI\nScott REDDING\nGBR\n739'09.781\n164.89.782\n9OCTO Pramac Racing\n43HONDAJack MILLERAUS\n839'14.485\n164.514.486\n8EG 0,0 Marc VDS\n42SUZUKI\nAlex RINS\nSPA939'14.787\n164.414.788\n7Team SUZUKI ECSTAR\n94YAMAHA\nJonas FOLGER\nGER\n1039'15.068\n164.415.069\n6Monster Yamaha Tech 3\n99DUCATI\nJorge LORENZO\nSPA1139'20.515\n164.020.516\n5Ducati Team\n76DUCATI\nLoris BAZ\nFRA\n1239'21.254\n164.021.255\n4Reale Avintia Racing\n8DUCATI\nHector BARBER\nASPA1339'28.827\n163.528.828\n3Reale Avintia Racing\n17DUCATI\nKarel ABRAHAM\nCZE\n1439'29.122\n163.529.123\n2Pull&Bear Aspar Team\n53HONDATito RABAT\nSPA1539'29.469\n163.429.470\n1EG 0,0 Marc VDS\n44KTMPol ESPARGARO\nSPA1639'33.600\n163.133.601\nRed Bull KTM Factory Racing\n38KTMBradle\ny SMITH\nGBR\n1739'39.703\n162.739.704\nRed Bull KTM Factory Racing\n22APRILIA\nSam LOWESGBR\n1839'47.130\n162.247.131\nAprilia Racing Team Gresini\nNot Classified\n9DUCATI\nDanilo PETRUCCI\nITA27'31.191\n164.26 laps\nOCTO Pramac Racing\n29SUZUKI\nAndrea IANNONE\nITA19'34.409\n164.910 laps\nTeam SUZUKI ECSTAR\n19DUCATI\nAlvaro BAUTISTA\nSPA13'46.030\n164.113 laps\nPull&Bear Aspar Team\n5YAMAHA\nJohann ZARCO\nFRA\n11'44.661\n164.914 laps\nMonster Yamaha Tech 3\n35HONDACal CRUTCHLOW\nGBR\n8'44.974\n147.516 laps\nLCR HondaDryAir: 21°\nGround: 22°\nHumidity: 96%\nPole 
Position:\nFastest Lap:\nMaverick VIÑALES\n1'54.316\n169.4 Km/h\nJohann ZARCO\n1'55.990\n166.9 Km/h\nLap 4Circuit Record Lap:\nCircuit Best Lap:\nJorge LORENZO\n1'54.927\n168.5 Km/h\nJorge LORENZO\n1'53.927\n170.0 Km/h\n2008\n2016\nRace condition:\nSIGHTING LAP START\n 20:40'00\nSIGHTING LAP START\n 21:15'00\nStart delayed\n 21:21'25WARM UP LAP START\n 21:40'00\nRACE START\n 21:45'16\nNo jump start\n 21:46'06\ncrashed out - Rider OK\nCal CRUTCHLOW\n21:53'13re-joined race\nCal CRUTCHLOW\n21:53'57crashed out - Rider OK\nCal CRUTCHLOW\n21:56'08crashed out - Rider OK\nJohann ZARCO\n21:57'16crashed out - Rider OK\nAlvaro BAUTISTA\n22:00'51crashed out - Rider OK\nAndrea IANNONE\n22:05'29retired\nDanilo PETRUCCI\n22:15'06Time limit for protest expires 30' afte\nr publication of the results  -  Mr. ...................................................\n...... Time:   ...................................\nThe results are provisional until the end of the limit for protest and appeals.\nDoha, Sunday, March 26, 2017\nThese data/results cannot be reproduced, stor\ned and/or transmitted in whole or in part \nby any manner of electronic, mechanical,\n photocopying, recording, broadcasting or otherwise now \nknown or herein after developed without the pr\nevious express consent by \nthe copyright owner, except for reproduction in daily p\nress and regular printed publications on sale to the public \nwithin 60 days of the event related to those data/results and \nalways provided that copyright symbol appears together as follows\n below.\n© DORNA, 2017\nOfficial MotoGP Timing by \nwww.mot\nogp.com\nTISSOT\n"

I want to process it and write it to a .csv so that I can load it into a data frame and analyze it, but I don't know how to clean it.

I have tried:

pgs = page_content.split()



pgs[pgs.index("km")+1:pgs.index("Classified")-1]
Out[183]: 
['2925YAMAHA',
 'Maverick',
 'VIÑALES',
 "SPA138'59.999",
 '165.5',
 '25Movistar',
 'Yamaha',
 'MotoGP',
 '4DUCATI',
 'Andrea',
 'DOVIZIOSO',
 "ITA239'00.460",
 '165.50.461',
 '20Ducati',
 'Team',
 '46YAMAHA',
 'Valentino',
 'ROSSI',
 "ITA339'01.927",
 '165.41.928',
 '16Movistar',
 'Yamaha',
 'MotoGP',
 '93HONDAMarc',
 'MARQUEZ',
 "SPA439'06.744",
 '165.06.745',
 '13Repsol',
 'Honda',
 'Team',
 '26HONDADani',
 'PEDROS',
 "ASPA539'07.127",
 '165.07.128',
 '11Repsol',
 'Honda',
 'Team',
 '41APRILIA',
 'Aleix',
 'ESPARGARO',
 "SPA639'07.660",
 '164.97.661',
 '10Aprilia',
 'Racing',
 'Team',
 'Gresini',
 '45DUCATI',
 'Scott',
 'REDDING',
 'GBR',
 "739'09.781",
 '164.89.782',
 '9OCTO',
 'Pramac',
 'Racing',
 '43HONDAJack',
 'MILLERAUS',
 "839'14.485",
 '164.514.486',
 '8EG',
 '0,0',
 'Marc',
 'VDS',
 '42SUZUKI',
 'Alex',
 'RINS',
 "SPA939'14.787",
 '164.414.788',
 '7Team',
 'SUZUKI',
 'ECSTAR',
 '94YAMAHA',
 'Jonas',
 'FOLGER',
 'GER',
 "1039'15.068",
 '164.415.069',
 '6Monster',
 'Yamaha',
 'Tech',
 '3',
 '99DUCATI',
 'Jorge',
 'LORENZO',
 "SPA1139'20.515",
 '164.020.516',
 '5Ducati',
 'Team',
 '76DUCATI',
 'Loris',
 'BAZ',
 'FRA',
 "1239'21.254",
 '164.021.255',
 '4Reale',
 'Avintia',
 'Racing',
 '8DUCATI',
 'Hector',
 'BARBER',
 "ASPA1339'28.827",
 '163.528.828',
 '3Reale',
 'Avintia',
 'Racing',
 '17DUCATI',
 'Karel',
 'ABRAHAM',
 'CZE',
 "1439'29.122",
 '163.529.123',
 '2Pull&Bear',
 'Aspar',
 'Team',
 '53HONDATito',
 'RABAT',
 "SPA1539'29.469",
 '163.429.470',
 '1EG',
 '0,0',
 'Marc',
 'VDS',
 '44KTMPol',
 'ESPARGARO',
 "SPA1639'33.600",
 '163.133.601',
 'Red',
 'Bull',
 'KTM',
 'Factory',
 'Racing',
 '38KTMBradle',
 'y',
 'SMITH',
 'GBR',
 "1739'39.703",
 '162.739.704',
 'Red',
 'Bull',
 'KTM',
 'Factory',
 'Racing',
 '22APRILIA',
 'Sam',
 'LOWESGBR',
 "1839'47.130",
 '162.247.131',
 'Aprilia',
 'Racing',
 'Team',
 'Gresini']

I still need to separate the records, each one starting at the motorcycle brand, and convert them into a data frame. Maybe there is a better approach than the one I am using.
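One way to untangle tokens such as "SPA138'59.999" from the list above is a regular expression. The following is only a minimal sketch: the pattern, the name `FUSED` and the helper `split_fused` are hypothetical, and it assumes the fused token ends with a three-letter nation code, a one- or two-digit position and a total time with two-digit minutes. Tokens where the nation landed on its own line (for example "739'09.781") are not covered.

```python
import re

# Assumed layout: nation code (3 capitals), position (1-2 digits),
# total time as mm'ss.fff, anchored at the end of the token.
FUSED = re.compile(r"([A-Z]{3})(\d{1,2})(\d{2}'\d{2}\.\d{3})$")


def split_fused(token):
    """Return (nation, position, total_time), or None if the token does not match."""
    match = FUSED.search(token)
    if match is None:
        return None
    nation, pos, total_time = match.groups()
    return nation, int(pos), total_time


# Tokens copied from the split() output above; "GBR" is a plain nation token.
examples = ["SPA138'59.999", "SPA1139'20.515", "ASPA539'07.127", "GBR"]
parsed = [split_fused(token) for token in examples]
print(parsed)
```

The regex engine backtracks on the position group, so "SPA1139'20.515" is read as position 11 and "SPA138'59.999" as position 1. This only works because every classified time here has two-digit minutes, so it is fragile compared with an extraction that keeps the columns apart.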

When extracting the data in HTML format I get:

b'<html><head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n</head><body>\n<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>\n<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>\n<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:54px; top:77px; width:94px; height:11px;"><span style="font-family: b\'ArialMT\'; font-size:11px">osail International Circu\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:150px; top:77px; width:188px; height:14px;"><span style="font-family: b\'ArialMT\'; font-size:14px">Results and timing service provided by\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:149px; top:113px; width:257px; height:55px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:16px">GRAND PRIX OF QATAR\n<br>Race\n<br>Classification after 20 laps = 107.6 km\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:85px; top:156px; width:32px; height:11px;"><span style="font-family: b\'ArialMT\'; font-size:11px">5380 m.\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:458px; top:89px; width:106px; height:25px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:25px">MotoGP\xe2\x84\xa2\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:541px; top:152px; width:21px; height:20px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:20px">29\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:59px; top:189px; width:19px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Pos\n<br></span></div><div style="position:absolute; border: textbox 1px solid; 
writing-mode:lr-tb; left:112px; top:189px; width:27px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Rider\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:211px; top:189px; width:32px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Nation\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:249px; top:189px; width:30px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Team \n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:364px; top:189px; width:107px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px"> Motorcycle Total Time\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:481px; top:189px; width:26px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Km/h\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:538px; top:189px; width:21px; height:10px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:10px">Gap\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:226px; width:7px; height:174px;"><span style="font-family: b\'ArialMT\'; font-size:9px">25\n<br>20\n<br>16\n<br>13\n<br>11\n<br>10\n<br>9\n<br>8\n<br>7\n<br>6\n<br>5\n<br>4\n<br>3\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">2\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">1\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:64px; top:225px; width:10px; height:213px;"><span style="font-family: b\'Arial-BoldMT\'; 
font-size:12px">1\n<br>2\n<br>3\n<br>4\n<br>5\n<br>6\n<br>7\n<br>8\n<br>9\n<br>10\n<br>11\n<br>12\n<br>13\n<br>14\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:12px">15\n<br>16\n<br>17\n<br>18\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:97px; top:225px; width:10px; height:212px;"><span style="font-family: b\'ArialMT\'; font-size:11px">25\n<br>4\n<br>46\n<br>93\n<br>26\n<br>41\n<br>45\n<br>43\n<br>42\n<br>94\n<br>99\n<br>76\n<br>8\n<br></span><span style="font-family: b\'ArialMT\'; font-size:11px">17\n<br>53\n<br>44\n<br>38\n<br>22\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:112px; top:225px; width:83px; height:213px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:12px">Maverick VI\xc3\x91ALES\n<br>Andrea DOVIZIOSO\n<br>Valentino ROSSI\n<br>Marc MARQUEZ\n<br>Dani PEDROSA\n<br>Aleix ESPARGARO\n<br>Scott REDDING\n<br>Jack MILLER\n<br>Alex RINS\n<br>Jonas FOLGER\n<br>Jorge LORENZO\n<br>Loris BAZ\n<br>Hector BARBERA\n<br>Karel ABRAHAM\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:12px">Tito RABAT\n<br>Pol ESPARGARO\n<br>Bradley SMITH\n<br>Sam LOWES\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:57px; top:440px; width:60px; height:12px;"><span style="font-family: b\'Arial-BoldItalicMT\'; font-size:12px">Not Classified\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:97px; top:452px; width:10px; height:59px;"><span style="font-family: b\'ArialMT\'; font-size:11px">9\n<br>29\n<br>19\n<br>5\n<br>35\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:112px; top:452px; width:76px; height:59px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:12px">Danilo PETRUCCI\n<br>Andrea IANNONE\n<br>Alvaro BAUTISTA\n<br>Johann ZARCO\n<br>Cal 
CRUTCHLOW\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:227px; top:226px; width:17px; height:211px;"><span style="font-family: b\'ArialMT\'; font-size:10px">SPA\n<br>ITA\n<br>ITA\n<br>SPA\n<br>SPA\n<br>SPA\n<br>GBR\n<br>AUS\n<br>SPA\n<br>GER\n<br>SPA\n<br>FRA\n<br>SPA\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">CZE\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">SPA\n<br>SPA\n<br>GBR\n<br>GBR\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:227px; top:452px; width:17px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">ITA\n<br>ITA\n<br>SPA\n<br>FRA\n<br>GBR\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:250px; top:226px; width:105px; height:211px;"><span style="font-family: b\'ArialMT\'; font-size:10px">Movistar Yamaha MotoGP\n<br>Ducati Team\n<br>Movistar Yamaha MotoGP\n<br>Repsol Honda Team\n<br>Repsol Honda Team\n<br>Aprilia Racing Team Gresini\n<br>OCTO Pramac Racing\n<br>EG 0,0 Marc VDS\n<br>Team SUZUKI ECSTAR\n<br>Monster Yamaha Tech 3\n<br>Ducati Team\n<br>Reale Avintia Racing\n<br>Reale Avintia Racing\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">Pull&amp;Bear Aspar Team\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">EG 0,0 Marc VDS\n<br>Red Bull KTM Factory Racing\n<br>Red Bull KTM Factory Racing\n<br>Aprilia Racing Team Gresini\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:250px; top:452px; width:88px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">OCTO Pramac Racing\n<br>Team SUZUKI ECSTAR\n<br>Pull&amp;Bear Aspar Team\n<br>Monster Yamaha Tech 3\n<br>LCR Honda\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:384px; top:226px; width:34px; height:211px;"><span 
style="font-family: b\'ArialMT\'; font-size:10px">YAMAHA\n<br>DUCATI\n<br>YAMAHA\n<br>HONDA\n<br>HONDA\n<br>APRILIA\n<br>DUCATI\n<br>HONDA\n<br>SUZUKI\n<br>YAMAHA\n<br>DUCATI\n<br>DUCATI\n<br>DUCATI\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">DUCATI\n<br></span><span style="font-family: b\'ArialMT\'; font-size:10px">HONDA\n<br>KTM\n<br>KTM\n<br>APRILIA\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:384px; top:452px; width:33px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">DUCATI\n<br>SUZUKI\n<br>DUCATI\n<br>YAMAHA\n<br>HONDA\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:435px; top:225px; width:35px; height:211px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">38\'59.999\n<br>39\'00.460\n<br>39\'01.927\n<br>39\'06.744\n<br>39\'07.127\n<br>39\'07.660\n<br>39\'09.781\n<br>39\'14.485\n<br>39\'14.787\n<br>39\'15.068\n<br>39\'20.515\n<br>39\'21.254\n<br>39\'28.827\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">39\'29.122\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">39\'29.469\n<br>39\'33.600\n<br>39\'39.703\n<br>39\'47.130\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:435px; top:452px; width:35px; height:58px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">27\'31.191\n<br>19\'34.409\n<br>13\'46.030\n<br>11\'44.661\n<br>8\'44.974\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:492px; top:226px; width:20px; height:211px;"><span style="font-family: b\'ArialMT\'; font-size:10px">165.5\n<br>165.5\n<br>165.4\n<br>165.0\n<br>165.0\n<br>164.9\n<br>164.8\n<br>164.5\n<br>164.4\n<br>164.4\n<br>164.0\n<br>164.0\n<br>163.5\n<br>163.5\n<br>163.4\n<br>163.1\n<br>162.7\n<br>162.2\n<br></span></div><div style="position:absolute; border: textbox 
1px solid; writing-mode:lr-tb; left:492px; top:452px; width:20px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">164.2\n<br>164.9\n<br>164.1\n<br>164.9\n<br>147.5\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:538px; top:237px; width:24px; height:199px;"><span style="font-family: b\'ArialMT\'; font-size:10px">0.461\n<br>1.928\n<br>6.745\n<br>7.128\n<br>7.661\n<br>9.782\n<br>14.486\n<br>14.788\n<br>15.069\n<br>20.516\n<br>21.255\n<br>28.828\n<br>29.123\n<br>29.470\n<br>33.601\n<br>39.704\n<br>47.131\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:537px; top:452px; width:25px; height:57px;"><span style="font-family: b\'ArialMT\'; font-size:10px">6 laps\n<br>10 laps\n<br>13 laps\n<br>14 laps\n<br>16 laps\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:68px; top:526px; width:57px; height:10px;"><span style="font-family: b\'Arial-ItalicMT\'; font-size:10px">Race condition:\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:89px; top:528px; width:56px; height:41px;"><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:11px">Dry\n<br></span><span style="font-family: b\'ELOILF+ArialRoundedMTBold\'; font-size:9px">Air: 21\xc2\xb0\n<br>Humidity: 96%\n<br>Ground: 22\xc2\xb0\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:211px; top:526px; width:70px; height:42px;"><span style="font-family: b\'Arial-ItalicMT\'; font-size:10px">Pole Position:\n<br>Fastest Lap:\n<br>Circuit Record Lap:\n<br>Circuit Best Lap:\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:292px; top:537px; width:20px; height:31px;"><span style="font-family: b\'ArialMT\'; font-size:10px">Lap 4\n<br>2016\n<br>2008\n<br></span></div><div style="position:absolute; 
border: textbox 1px solid; writing-mode:lr-tb; left:287px; top:573px; width:31px; height:136px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">20:40\'00\n<br>21:15\'00\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">21:21\'25\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">21:40\'00\n<br>21:45\'16\n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:9px">21:46\'06\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">21:53\'13\n<br>21:53\'57\n<br>21:56\'08\n<br>21:57\'16\n<br>22:00\'51\n<br>22:05\'29\n<br>22:15\'06\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:347px; top:526px; width:71px; height:11px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">Maverick VI\xc3\x91ALES\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:351px; top:536px; width:63px; height:32px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">Johann ZARCO\n<br>Jorge LORENZO\n<br>Jorge LORENZO\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:474px; top:526px; width:30px; height:42px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">1\'54.316\n<br>1\'55.990\n<br>1\'54.927\n<br>1\'53.927\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:517px; top:526px; width:41px; height:41px;"><span style="font-family: b\'ArialMT\'; font-size:10px">169.4 Km/h\n<br>166.9 Km/h\n<br>168.5 Km/h\n<br>170.0 Km/h\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:324px; top:573px; width:57px; height:136px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px"> \n<br> \n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px"> \n<br></span><span style="font-family: b\'Arial-BoldMT\'; font-size:11px"> \n<br> \n<br></span><span 
style="font-family: b\'Arial-BoldMT\'; font-size:9px"> \n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">Cal CRUTCHLOW\n<br>Cal CRUTCHLOW\n<br>Cal CRUTCHLOW\n<br>Johann ZARCO\n<br>Alvaro BAUTISTA\n<br>Andrea IANNONE\n<br>Danilo PETRUCCI\n<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:447px; top:573px; width:85px; height:136px;"><span style="font-family: b\'Arial-BoldMT\'; font-size:11px">SIGHTING LAP START\n<br>SIGHTING LAP START\n<br></span><span style="font-family: b\'ArialMT\'; font-size:9px">Start '
Gotey asked Jan 16 '18 17:01


1 Answer

Once I have the HTML, I clean it with:

import lxml.html.clean as lhc

and

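from bs4 import BeautifulSoup as bs

# motoh is the HTML extracted from the PDF (shown in the question)
motobs = bs(motoh, "html.parser")
motobsg = motobs.get_text()   # drop the tags, keep only the text
mbs = str(motobsg)
mbss = mbs.split()            # split on whitespace into a list of tokens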

From there I have to write a function that finds the relations between these objects so I can construct a data frame:

mbsd
Out[216]: 
['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '25',
 '46',
 '35',
 '19',
 '5',
 '94',
 '9',
 '45',
 '43',
 '17',
 '76',
 '53',
 '8',
 '44',
 '38',
 '29',
 'Maverick',
 'VIÑALES',
 'Valentino',
 'ROSSI',
 'Cal',
 'CRUTCHLOW',
 'Alvaro',
 'BAUTISTA',
 'Johann',
 'ZARCO',
 'Jonas',
 'FOLGER',
 'Danilo',
 'PETRUCCI',
 'Scott',
 'REDDING',
 'Jack',
 'MILLER',
 'Karel',
 'ABRAHAM',
 'Loris',
 'BAZ',
 'Tito',
 'RABAT',
 'Hector',
 'BARBERA',
 'Pol',
 'ESPARGARO',
 'Bradley',
 'SMITH',
 'Andrea',
 'IANNONE',
 'Not',
 'Classified',
 '4',
 '41',
 '26',
 '22',
 '42',
 '93',
 '99',
 'Andrea',
 'DOVIZIOSO',
 'Aleix',
 'ESPARGARO',
 'Dani',
 'PEDROSA',
 'Sam',
 'LOWES',
 'Alex',
 'RINS',
 'Marc',
 'MARQUEZ',
 'Jorge',
 'LORENZO',
 'SPA',
 'ITA',
 'GBR',
 'SPA',
 'FRA',
 'GER',
 'ITA',
 'GBR',
 'AUS',
 'CZE',
 'FRA',
 'SPA',
 'SPA',
 'SPA',
 'GBR',
 'ITA',
 'Movistar',
 'Yamaha',
 'MotoGP',
 'Movistar',
 'Yamaha',
 'MotoGP',
 'LCR',
 'Honda',
 'Pull&Bear',
 'Aspar',
 'Team',
 'Monster',
 'Yamaha',
 'Tech',
 '3',
 'Monster',
 'Yamaha',
 'Tech',
 '3',
 'OCTO',
 'Pramac',
 'Racing',
 'OCTO',
 'Pramac',
 'Racing',
 'EG',
 '0,0',
 'Marc',
 'VDS',
 'Pull&Bear',
 'Aspar',
 'Team',
 'Reale',
 'Avintia',
 'Racing',
 'EG',
 '0,0',
 'Marc',
 'VDS',
 'Reale',
 'Avintia',
 'Racing',
 'Red',
 'Bull',
 'KTM',
 'Factory',
 'Racing',
 'Red',
 'Bull',
 'KTM',
 'Factory',
 'Racing',
 'Team',
 'SUZUKI',
 'ECSTAR',
 'Ducati',
 'Team',
 'Aprilia',
 'Racing',
 'Team',
 'Gresini',
 'Repsol',
 'Honda',
 'Team',
 'Aprilia',
 'Racing',
 'Team',
 'Gresini',
 'Team',
 'SUZUKI',
 'ECSTAR',
 'Repsol',
 'Honda',
 'Team',
 'ITA',
 'SPA',
 'SPA',
 'GBR',
 'SPA',
 'SPA',
 'SPA',
 'Ducati',
 'Team',
 'YAMAHA',
 'YAMAHA',
 'HONDA',
 'DUCATI',
 'YAMAHA',
 'YAMAHA',
 'DUCATI',
 'DUCATI',
 'HONDA',
 'DUCATI',
 'DUCATI',
 'HONDA',
 'DUCATI',
 'KTM',
 'KTM',
 'SUZUKI',
 'DUCATI',
 'APRILIA',
 'HONDA',
 'APRILIA',
 'SUZUKI',
 'HONDA',
 'DUCATI',
 "41'45.060",
 "41'47.975",
 "41'48.814",
 "41'51.583",
 "42'00.564",
 "42'03.301",
 "42'05.106",
 "42'10.540",
 "42'10.725",
 "42'11.463",
 "42'12.012",
 "42'26.935",
 "42'27.830",
 "42'28.145",
 "42'28.512",
 "42'31.279",
 "23'31.497",
 "23'31.661",
 "21'48.977",
 "18'51.906",
 "19'14.623",
 "5'02.050",
 '172.6',
 '172.4',
 '172.4',
 '172.2',
 '171.6',
 '171.4',
 '171.2',
 '170.9',
 '170.9',
 '170.8',
 '170.8',
 '169.8',
 '169.7',
 '169.7',
 '169.7',
 '169.5',
 '171.6',
 '171.5',
 '171.8',
 '168.1',
 '164.8',
 '171.8',
 '2.915',
 '3.754',
 '6.523',
 '15.504',
 '18.241',
 '20.046',
 '25.480',
 '25.665',
 '26.403',
 '26.952',
 '41.875',
 '42.770',
 '43.085',
 '43.452',
 '46.219',
 '11',
 'laps',
 '11',
 'laps',
 '12',
 'laps',
 '14',
 'laps',
 '14',
 'laps',
 '22',
 'laps',
 'Race',
 'condition:',
 'Dry',
 'Air:',
 '20°',
 'Humidity:',
 '60%',
 'Ground:',
 '25°']
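Splitting the whole text on whitespace breaks multi-word values such as rider and team names. In the HTML shown in the question, each table column sits in its own `<div>` with one value per `<br>`, so a possible alternative is to read each `<div>` as one column and zip the columns into rows. This is only a sketch with the standard library: `TextBoxParser` and `SAMPLE_HTML` are hypothetical names, the sample is a shortened fragment built from the question's data, and it assumes every column has the same number of entries (in the real page the Gap column has no entry for the winner, so it would need padding first).

```python
import csv
import io
from html.parser import HTMLParser


class TextBoxParser(HTMLParser):
    """Collect the text lines of every <div> as one list per <div>."""

    def __init__(self):
        super().__init__()
        self.boxes = []   # one list of cell strings per <div>
        self._depth = 0   # greater than 0 while inside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._depth += 1
            self.boxes.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Each text run between two tags (for example between two <br>) is one cell.
        cell = data.strip()
        if self._depth and cell:
            self.boxes[-1].append(cell)


# Shortened, hypothetical fragment in the same layout as the real HTML.
SAMPLE_HTML = (
    '<div style="left:112px; top:225px;"><span>Maverick VIÑALES\n<br>'
    'Andrea DOVIZIOSO\n<br>Valentino ROSSI\n<br></span></div>'
    '<div style="left:227px; top:226px;"><span>SPA\n<br>ITA\n<br>ITA\n<br></span></div>'
    '<div style="left:435px; top:225px;"><span>38\'59.999\n<br>39\'00.460\n<br>'
    '39\'01.927\n<br></span></div>'
)

parser = TextBoxParser()
parser.feed(SAMPLE_HTML)
parser.close()

columns = parser.boxes       # [[riders...], [nations...], [times...]]
rows = list(zip(*columns))   # one tuple per rider

# Write the rows as CSV text; the header names are taken from the PDF table.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Rider", "Nation", "Total Time"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

If the HTML is held as bytes, as in the question, it would have to be decoded first (for example with `motoh.decode("utf-8")`). The resulting text could then be loaded with `pandas.read_csv(io.StringIO(csv_text))`, or the rows passed directly to `pandas.DataFrame(rows, columns=[...])`.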
Gotey answered Nov 09 '22 07:11