
Parsing HTML to fix microtypography & glyph issues

I'm interested in microtypography issues on the web.

I want a tool to fix:

  • Quotes
    • “ (&ldquo;) opening quote (instead of ")
    • ” (&rdquo;) closing quote (instead of ")
  • Apostrophe
    • ’ (&rsquo;) apostrophe (instead of ')
  • Dashes and Hyphens
    • – (&ndash; or &#8211;) en dash, used for ranges, e.g. “13–15 November” (instead of -)
    • — (&mdash; or &#8212;) em dash, used for a change of thought, e.g. “Star Wars is—as everyone knows—amazing.” (instead of - or --)
  • Ellipsis
    • … (&hellip; or &#8230;) horizontal ellipsis, used to indicate an omission or a pause (instead of ...)
  • And more \o/

All of those fixes depend on the language of the content. In French, for example, we must add a non-breaking (insécable) space before every two-part punctuation mark (:, ;, ?, !, ...), and our quotes are « like this ».
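
As a rough illustration of the French rules just described, here is a small sketch in Python that works on plain text only. It is not taken from any existing library: `fix_french` is a made-up name, it uses U+00A0 for every non-breaking space, and it ignores edge cases such as nested quotes or markup.

```python
import re

NBSP = "\u00a0"  # non-breaking space (assumed to be the one wanted everywhere)


def fix_french(text):
    """Apply two French rules to a plain-text string (no HTML handling)."""
    # Straight double quotes become guillemets padded with non-breaking spaces.
    text = re.sub(r'"([^"]+)"', "\u00ab" + NBSP + r"\1" + NBSP + "\u00bb", text)
    # A run of ; : ? ! that follows a word and precedes whitespace (or the end
    # of the string) gets a non-breaking space in front of it. The lookahead
    # keeps things like "12:30" or "http://" untouched.
    text = re.sub(r"(?<=\w)[ \t]*([;:?!]+)(?=\s|$)", NBSP + r"\1", text)
    return text
```

The quotes are handled first so that a mark sitting right before a closing quote is followed by whitespace by the time the second rule runs.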

There are many constraints for such a tool:

  • it must not edit any HTML inside protected tags (pre, code...)
  • it must be fast (used on a CMS output)
  • it must not break the HTML
  • and so on.
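
To make the fixes and constraints above concrete, here is a minimal sketch in Python that uses only the standard library's `html.parser`. It is not how SmartyPants or any of the tools mentioned in this thread work; `TypoFixer`, `fix_text`, `fix_html` and the `PROTECTED` list are illustrative names and assumptions. The idea is to rewrite text nodes only, re-emit tags and attributes exactly as they were read, and skip everything inside protected elements. The rules are English-only and deliberately naive.

```python
import re
from html.parser import HTMLParser

# Elements whose content is left alone (an assumed list; extend as needed).
PROTECTED = {"pre", "code", "script", "style", "kbd", "samp", "textarea"}


def fix_text(text):
    """Apply a few naive English fixes to one run of text."""
    text = text.replace("...", "\u2026")              # horizontal ellipsis
    text = re.sub(r"(?<=\d)-(?=\d)", "\u2013", text)  # en dash between digits
    text = text.replace("--", "\u2014")               # em dash
    text = re.sub(r"(?<=\w)'(?=\w)", "\u2019", text)  # apostrophe inside a word
    text = re.sub(r'"(?=\w)', "\u201c", text)         # quote before a word opens
    text = text.replace('"', "\u201d")                # any remaining quote closes
    return text


class TypoFixer(HTMLParser):
    """Re-emits the document, fixing text outside protected elements."""

    def __init__(self):
        # Keep entities as written so they can be passed through untouched.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # greater than 0 while inside a protected element

    def handle_starttag(self, tag, attrs):
        if tag in PROTECTED:
            self.depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in PROTECTED and self.depth:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data if self.depth else fix_text(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def fix_html(html):
    parser = TypoFixer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, `fix_html('<a title="13-15">13-15 November</a>')` changes the hyphen in the link text to an en dash but leaves the `title` attribute alone, and anything inside `<pre>` or `<code>` comes back unchanged. One known limitation of this sketch: text is processed one chunk at a time, so a quote pair split by an inline tag or an entity is not matched up.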

There already are some tools on the market:

  • http://michelf.ca/projects/php-smartypants/typographer/
  • http://kingdesk.com/projects/php-typography/
  • http://code.google.com/p/typogrify/

They are all more or less based on SmartyPants, a library from 2005 that is untested and undocumented, parses HTML by hand, and handles no rules other than English ones. Hell no.

So my questions are:

  • Do you know of any decent tool like this?
  • How can I do it? I already have a POC using DomCrawler but I'm not convinced. What's the best way to parse and edit HTML in PHP?

Edit July 2013: I have developed JoliTypo from the tests and expertise I gained with this issue. No existing lib was doing what I wanted to do.

Damien asked Dec 04 '12 09:12


2 Answers

My somewhat-friend Sean built something that I use for this purpose quite often. You can view the demo at http://files.seancoates.com/lexentity/, read his blog post about it at http://seancoates.com/blogs/lexentity, and grab the source at https://github.com/scoates/lexentity.

It might not meet your full language needs, but it's a start for English.

preinheimer answered Sep 20 '22 15:09


You might be interested in tidy. It is bundled with PHP 5+ (all you need to use it is libtidy). It not only parses HTML but repairs it too.

But for localization you are on your own: intl does not seem to have any data about quotes, for example; at least I could not find any.

pozs answered Sep 19 '22 15:09