I'm trying to convert an HTML file containing a table to a .csv file using a bash script.

So far I've accomplished the following steps:

  1. Convert to Unix format (with dos2unix)
  2. Remove all spaces and tabs (with sed 's/[ \t]//g')
  3. Join everything into a single line, which gets rid of the blank lines (with sed ':a;N;$!ba;s/\n//g') (this is necessary because the HTML file has a blank line for each cell of the table... that's not my fault)
  4. Remove the unnecessary <td> and <tr> tags (with sed 's/<t.>//g')
  5. Replace </td> with ',' (with sed 's/<\/td>/,/g')
  6. Replace </tr> with end-of-line (\n) characters (with sed 's/<\/tr>/\n/g')
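
For reference, this is roughly what the steps above look like when chained together. It is a simplified sketch, not my exact script: the function name `html_table_to_csv` and the sample rows are made up for illustration, `tr -d '\r'` stands in for dos2unix so the sketch runs without that tool, and the `\t` and `\n` escapes assume GNU sed.

```shell
# Hypothetical helper: each stage mirrors one step from the list above.
html_table_to_csv() {
  tr -d '\r' |                  # step 1: strip carriage returns (stand-in for dos2unix)
  sed 's/[ \t]//g' |            # step 2: remove spaces and tabs
  sed ':a;N;$!ba;s/\n//g' |     # step 3: join all lines into one
  sed 's/<t.>//g' |             # step 4: drop opening <td> and <tr> tags
  sed 's/<\/td>/,/g' |          # step 5: closing </td> becomes a comma
  sed 's/<\/tr>/\n/g'           # step 6: closing </tr> becomes a line break
}

# Made-up input with one tag or value per line.
printf '<tr>\n<td>\n1\n</td>\n<td>\n25/12/2013\n</td>\n</tr>\n' | html_table_to_csv
```

Because every </td> turns into a comma, each output row ends with a trailing comma.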

Of course, I'm putting all this in a pipeline. So far, it's working great. There's one final step I'm stuck on: the table has a column of dates in the format dd/mm/yyyy, and I'd like to convert them to yyyy-mm-dd.

Is there a (simple) way to do it (with sed or awk)?

Data sample (after the whole sed pipe):


Expected result:


I need to do this because I want to import this data into MySQL. I could open the file in Excel and change the format by hand, but I would like to skip that.

sed -E 's,([0-9]{2})/([0-9]{2})/([0-9]{4}),\3-\2-\1,g'
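
Here the commas act as the delimiter of the `s` command, so the slashes in the date need no escaping, and `\3-\2-\1` writes the three captured groups back in reverse order. The sketch below wraps that command in a function and adds an awk variant that only touches one column. The function names and the sample line are made up, and the awk version assumes the date sits in the second comma-separated field, which may not match your file.

```shell
# Hypothetical wrapper around the sed command above: rewrites every
# dd/mm/yyyy it finds on a line, whatever column it is in.
convert_dates_sed() {
  sed -E 's,([0-9]{2})/([0-9]{2})/([0-9]{4}),\3-\2-\1,g'
}

# Hypothetical awk variant: rewrites field 2 only (assumed date column),
# and leaves lines such as a header row untouched.
convert_dates_awk() {
  awk -F, -v OFS=, '
    $2 ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]$/ {
      split($2, d, "/")               # d[1]=dd, d[2]=mm, d[3]=yyyy
      $2 = d[3] "-" d[2] "-" d[1]
    }
    { print }
  '
}

# Made-up sample line.
printf '1,25/12/2013,foo\n' | convert_dates_sed   # prints 1,2013-12-25,foo
printf '1,25/12/2013,foo\n' | convert_dates_awk   # prints 1,2013-12-25,foo
```

The sed version is the simpler choice when dates are the only slash-separated numbers in the file. The awk version is worth the extra lines only if other columns could contain text that looks like a date.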
