 

Extracting all LaTeX commands from a LaTeX source file

I am trying to extract all the LaTeX commands from a .tex file, and I have to use Python for this. I tried collecting the command names into a list using the re module.

The problem is that this list does not contain the LaTeX commands whose names include special characters (such as \alpha*, \a', \#, \$, +, :, \; etc.). It only contains the commands that consist of letters.

I am currently using Python's re.match:

    import re

    # The index of the backslash '\' is already known: self.i.
    # An example LaTeX string could be:
    #   \documentclass[envcountsame,envcountchap]{svmono}
    match_text = re.match(r"[\w]+", search_string[self.i + 1:])

I am able to extract 'documentclass'. But suppose there is another command like:

     "\abstract*[alpha]{beta}"
     "\${This is a latex document}"
     "\:" 

How do I extract only 'abstract*', '$', ':' from these strings?
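The behaviour described above can be reproduced with a small sketch. The helper name `name_by_word_chars` and the `samples` list are illustrative only and do not come from the original code; the backslash is assumed to sit at index 0 of each sample.

```python
import re

# Sample strings taken from the question; each starts with the backslash.
samples = [
    r"\documentclass[envcountsame,envcountchap]{svmono}",
    r"\abstract*[alpha]{beta}",
    r"\${This is a latex document}",
    r"\:",
]

def name_by_word_chars(text, i=0):
    # Same idea as the re.match call above: take the run of word
    # characters that follows the backslash found at index i.
    m = re.match(r"\w+", text[i + 1:])
    return m.group(0) if m else None

results = [name_by_word_chars(s) for s in samples]
print(results)
```

The star in `\abstract*` is dropped and the one-character commands produce no match at all, because `\w` covers only letters, digits and the underscore.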

I am new to Python and have tried various approaches, but I am not able to extract all of these command names. A general Python regex that handles all these cases would be useful.

NOTE: The book 'The Not So Short Introduction to LaTeX' states that LaTeX commands take one of three forms:

FORMATS:

  • They start with a backslash \ and then have a name consisting of letters only. Command names are terminated by a space, a number or any other ‘non-letter.’

  • They consist of a backslash and exactly one non-letter.

  • Many commands exist in a ‘starred variant’ where a star is appended to the command name.

shanu asked Oct 31 '25 18:10


1 Answer

Here's the exact translation of your format specification:

    \\(?:[^a-zA-Z]|[a-zA-Z]+)\*?

  • non-letter: [^a-zA-Z]
  • or letters: [a-zA-Z]+
  • starred variant: \*?

If your format description is accurate, this should do it. Unfortunately I don't know LaTeX so I'm not sure it's 100% OK.
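As a rough sketch of how the pattern above could be used from Python, assuming the whole source is available as one string: `COMMAND`, `source`, `commands` and `names` are names made up for this example, and the pattern is written as a raw string so the backslashes reach the regex engine unchanged.

```python
import re

# First pattern from this answer: a backslash, then either one non-letter
# or a run of letters, then an optional star.
COMMAND = re.compile(r"\\(?:[^a-zA-Z]|[a-zA-Z]+)\*?")

# Illustrative input built from the strings in the question.
source = (
    r"\documentclass[envcountsame,envcountchap]{svmono} "
    r"\abstract*[alpha]{beta} \${text} \: \#"
)

commands = COMMAND.findall(source)   # matches, backslash included
names = [c[1:] for c in commands]    # strip the leading backslash
print(names)
```

`findall` scans the whole string, so the position of each backslash does not need to be known in advance.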


From the feedback in the comments, it turns out the star is applicable only to letter commands, and there can be some other terminating characters as well. The final regex is:

    \\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)
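Since the question already knows the index of the backslash, the final regex can be anchored at that position instead of slicing the string. This is only a sketch: `FINAL` and `command_name_at` are hypothetical names, and `i` is assumed to point at a backslash.

```python
import re

# Final pattern from this answer: one non-letter, or letters optionally
# followed by one of * = '
FINAL = re.compile(r"\\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)")

def command_name_at(text, i):
    # Pattern.match(text, i) starts matching at index i of text,
    # which is assumed to be the position of the backslash.
    m = FINAL.match(text, i)
    return m.group(0)[1:] if m else None

print(command_name_at(r"\abstract*[alpha]{beta}", 0))
print(command_name_at(r"\${This is a latex document}", 0))
print(command_name_at(r"\:", 0))
```

The function returns None when the text at `i` is not a command, for example a lone backslash at the very end of the string.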

Lucas Trzesniewski answered Nov 02 '25 07:11


