I have a table that is output by some open source software, but it is not output in an actual table format, e.g.:
<table>
<thead>
<td>Heading</td>
</thead>
<tbody>
<tr>
<td>Content</td>
</tr>
</tbody>
</table>
Instead, the people who developed the software decided it would be a good idea to output the table like so:
+------------+-------------+-------+-------------+------------+---------------+----------+
| HEADING 1 | HEADING 2 | ETC | ANOTHER | HEADING3 | HEADING4 | SML |
+------------+-------------+-------+-------------+------------+---------------+----------+
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
+------------+-------------+-------+-------------+------------+--------------+----------+
| TOTALS AGENTS:21 | total| total| total| total| total|
+------------+-------------+-------+-------------+------------+--------------+----------+
So I can't build a web scraper to get the data, or rather, I'm not sure whether I could build one to scrape it, since it is all wrapped inside one <pre> </pre> tag. Instead I have been trying to use Ruby and regex to get the job done. So far I have managed to strip all the leading |'s and to match the start of the border line +-------+-----, but only that far: the pattern does not repeat itself along the rest of the line, so it seems I would have to repeat it by hand. Here is the code I have used so far:
text.lines.to_a.each do |line|
  line.sub(/^\| |^\+*-*\+*\-*/) do |match|
    puts "Regexp Match: " << match
  end
  STDIN.getc
  puts "New Line "<< line
end
For example, the output for the first line is only +-----------------+----------
It has to be in CSV format, so I'll use gsub to replace the remaining |'s with ,'s.
I can use PHP or Ruby, so any answer is more than welcome.
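On the scraping doubt: a single <pre> block does not have to be an obstacle, because its text can be pulled out first and parsed afterwards. Below is a minimal sketch using only the standard library. The page source in `html` and the helper name `extract_pre` are made up for illustration, and a regex is only a reasonable choice here on the assumption that the page contains one simple <pre> element; a real HTML parser would be safer for anything more complex.

```ruby
require 'cgi'

# Hypothetical page source: the ASCII table wrapped in a single <pre> tag.
html = <<~HTML
  <html><body>
  <pre>
  +-----------+-----------+
  | HEADING 1 | HEADING 2 |
  +-----------+-----------+
  | content   | a &amp; b |
  +-----------+-----------+
  </pre>
  </body></html>
HTML

# Hypothetical helper: return the text of the first <pre> element, or nil.
def extract_pre(html)
  match = html.match(%r{<pre[^>]*>(.*?)</pre>}mi)
  return nil unless match

  # Decode entities such as &amp; so the cells hold plain text.
  CGI.unescapeHTML(match[1]).strip
end

table_text = extract_pre(html)
puts table_text
```

The resulting `table_text` is the bare table, which can then be handed to any line-by-line parser.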
This may not be as clean as it could be, but it works for this example :) Ruby:
@text = <<END
+------------+-------------+-------+-------------+------------+---------------+----------+
| HEADING 1 | HEADING 2 | ETC | ANOTHER | HEADING3 | HEADING4 | SML |
+------------+-------------+-------+-------------+------------+---------------+----------+
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
+------------+-------------+-------+-------------+------------+--------------+----------+
| TOTALS AGENTS:21 | total| total| total| total| total|
+------------+-------------+-------+-------------+------------+--------------+----------+
END
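# Capture the text between the outer pipes of each data row.
# Border rows start with "+" and so never match.
rows = @text.scan(/^\|\s(.*)\|$/).flatten.map do |row|
  # Split on the inner pipes and trim the padding around every cell,
  # keeping spaces inside a cell such as "more content".
  row.split('|').map(&:strip)
end
rows.each do |cells|
  puts cells
end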
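Since the question asks for CSV output, one possible way to finish the job is Ruby's standard csv library rather than a plain gsub, because it quotes any cell that happens to contain a comma. A minimal sketch, assuming a shortened sample table; the names `text`, `rows` and `csv_text` are illustrative and not taken from the answer:

```ruby
require 'csv'

# Shortened sample in the same layout as the table in the question.
text = <<~TABLE
  +-----------+--------------+------+
  | HEADING 1 | HEADING 2    | ETC  |
  +-----------+--------------+------+
  | content   | more content | cont |
  +-----------+--------------+------+
  | TOTALS AGENTS:21 | total| total|
  +-----------+--------------+------+
TABLE

# Keep only the lines that start with a pipe, then split each into trimmed cells.
rows = text.each_line.map(&:strip).grep(/\A\|/).map do |line|
  line.sub(/\A\|/, '').sub(/\|\z/, '').split('|', -1).map(&:strip)
end

# CSV.generate quotes any cell that contains a comma, a quote or a newline.
csv_text = CSV.generate do |csv|
  rows.each { |cells| csv << cells }
end

puts csv_text
```

For the sample above this prints three lines: the headings, one content row and the totals row, each with its cells separated by commas.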
Here's a complete solution in Ruby. You need to manually add a | to the last line, though.
require 'builder'
table = '+------------+-------------+-------+-------------+------------+---------------+----------+
| HEADING 1 | HEADING 2 | ETC | ANOTHER | HEADING3 | HEADING4 | SML |
+------------+-------------+-------+-------------+------------+---------------+----------+
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
+------------+-------------+-------+-------------+------------+--------------+----------+
| TOTALS AGENTS:21 | total| total| total| total| total|
+------------+-------------+-------+-------------+------------+--------------+----------+';
def parse_table(table)
  rows = []
  table.each_line do |line|
    next if line.match(/^\+/)
    rows << line.split(/\s*\|\s*/).reject(&:empty?)
  end
  rows
end

def html_row(xml, columns)
  xml.tr do
    columns.each do |column|
      xml.td column
    end
  end
end

def html_table(rows)
  head_row = rows.first
  body_rows = rows[1..-1]
  xml = Builder::XmlMarkup.new :indent => 2
  xml.table do
    xml.thead do
      html_row xml, head_row
    end
    xml.tbody do
      body_rows.each do |body_row|
        html_row xml, body_row
      end
    end
  end.to_s
end

rows = parse_table(table)
html = html_table(rows)
puts html
Output:
<table>
<thead>
<tr>
<td>HEADING 1</td>
<td>HEADING 2</td>
<td>ETC</td>
<td>ANOTHER</td>
<td>HEADING3</td>
<td>HEADING4</td>
<td>SML</td>
</tr>
</thead>
<tbody>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>content</td>
<td>more content</td>
<td>cont</td>
<td>More more</td>
<td>content</td>
<td>content 2.0</td>
<td>litl</td>
</tr>
<tr>
<td>TOTALS AGENTS:21</td>
<td>total</td>
<td>total</td>
<td>total</td>
<td>total</td>
<td>total</td>
</tr>
</tbody>
</table>
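Two details of the parsing step may matter on real data: the trailing | has to be added by hand, and `reject(&:empty?)` also throws away cells that are genuinely blank, which shifts the remaining columns to the left. The following is a sketch of a more tolerant variant, not a replacement for the answer's method. The helper names `split_row` and `parse_table_rows` and the sample table are made up for illustration.

```ruby
# Hypothetical helper: split one table line into cells without losing blanks.
def split_row(line)
  line.strip
      .sub(/\A\|/, '')   # drop the leading pipe
      .sub(/\|\z/, '')   # drop the trailing pipe if there is one
      .split('|', -1)    # -1 keeps trailing empty cells
      .map(&:strip)
end

# Hypothetical helper: skip blank lines and border lines, parse the rest.
def parse_table_rows(table)
  table.each_line
       .map(&:strip)
       .reject { |line| line.empty? || line.start_with?('+') }
       .map { |line| split_row(line) }
end

# Sample with a blank cell and a last row that lacks its trailing pipe.
sample = <<~TABLE
  +---------+------+-------+
  | NAME    | NOTE | COUNT |
  +---------+------+-------+
  | alpha   |      | 3     |
  +---------+------+-------+
  | TOTALS  | total| total
  +---------+------+-------+
TABLE

parse_table_rows(sample).each { |cells| p cells }
```

Here the blank NOTE cell comes back as an empty string, so the row keeps three columns, and the totals row parses the same way with or without its closing pipe.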
Check out:
$table = '+------------+-------------+-------+-------------+------------+---------------+----------+
| HEADING 1 | HEADING 2 | ETC | ANOTHER | HEADING3 | HEADING4 | SML |
+------------+-------------+-------+-------------+------------+---------------+----------+
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
| content | more content | cont | More more | content | content 2.0 | litl |
+------------+-------------+-------+-------------+------------+--------------+----------+
| TOTALS AGENTS:21 | total| total| total| total| total|
+------------+-------------+-------+-------------+------------+--------------+----------+';
$lines = preg_split('/\r\n|\r|\n/', $table);
$array = array();
foreach ($lines as $line) {
    if (!preg_match('/\+-+\+/', $line)) {
        $array[] = preg_split('/\s*\|\s*/', trim($line, '| '));
    }
}
print_r($array);
Output:
Array
(
[0] => Array
(
[0] => HEADING 1
[1] => HEADING 2
[2] => ETC
[3] => ANOTHER
[4] => HEADING3
[5] => HEADING4
[6] => SML
)
[1] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[2] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[3] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[4] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[5] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[6] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[7] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[8] => Array
(
[0] => content
[1] => more content
[2] => cont
[3] => More more
[4] => content
[5] => content 2.0
[6] => litl
)
[9] => Array
(
[0] => TOTALS AGENTS:21
[1] => total
[2] => total
[3] => total
[4] => total
[5] => total
)
)
Hope this was helpful :)