For example, given the following text:
Lorem ipsum dolor sit amet, eum ut vitae quidam mentitum, eu eum malorum eligendi tincidunt. Vix te vitae tamquam, mea nisl praesent ea, vis omnis postulant in
import sys def fun(){print('Hello')} fun()
Mea veri fierent explicari eu, ne appareat convenire mei. Dicat neglegentur definitiones nec id, sit facete cotidieque in. Intellegam referrentur cu cum, an mandamus periculis pro.
How can we use regex (or some other technique) to determine whether there is code in there or not? The code can be in Java, Python, C, CSS, JS, etc.
Whether a text contains "code" is a fundamentally fuzzy question, and it is impossible to answer with perfect accuracy (for example, the empty string is a valid Python program, and every string contains the empty string).
Because of this, the first thing you'll probably want to do is set a minimum length for a span to count as code.
What you do next depends on your requirements for performance and accuracy.
For a low-accuracy, high-performance solution using regex, identify several patterns that are likely to appear in code and unlikely to appear in English text, such as (), and search the text for those patterns. If most of the patterns match, the text probably contains some sort of code.
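A minimal sketch of that heuristic in Python might look like the following. The pattern list, the 0.5 threshold, the minimum length, and the names CODE_PATTERNS, code_score and looks_like_code are all assumptions chosen for illustration; a real detector would need patterns tuned to the languages and prose you expect.

```python
import re

# Hypothetical pattern set: constructs that are common in code
# and rare in ordinary prose. Tune these for your own data.
CODE_PATTERNS = [
    # A name immediately followed by a parenthesised argument list, e.g. fun()
    re.compile(r"\w+\([^()]*\)"),
    # Braces or semicolons
    re.compile(r"[{};]"),
    # A few keywords shared by popular languages
    re.compile(r"\b(import|def|class|function|return|var|const|void)\b"),
    # Comparison and arrow operators
    re.compile(r"[=!<>]=|=>|->"),
]


def code_score(text: str) -> float:
    """Return the fraction of patterns that match somewhere in the text."""
    hits = sum(1 for pattern in CODE_PATTERNS if pattern.search(text))
    return hits / len(CODE_PATTERNS)


def looks_like_code(text: str, threshold: float = 0.5, min_length: int = 10) -> bool:
    """Guess whether the text contains code (assumed threshold and minimum length)."""
    if len(text) < min_length:
        return False
    return code_score(text) >= threshold
```

Calling looks_like_code on the sample text from the question should flag it, because the call pattern, the brace pattern and the keyword pattern all match. Prose with an ordinary parenthetical, such as "regex (or some other technique)", is not matched by the first pattern because of the space before the parenthesis.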
For a high-accuracy, low-performance solution, you could enumerate every possible substring of your text longer than your minimum length, then use something like tree-sitter to try to parse each one as several different languages. This will be very computationally expensive, but some clever use of parallelism (or perhaps pruning your search so you don't have to enumerate every substring) might be enough to make it workable for small snippets.
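The enumerate-and-parse idea can be sketched as below. To stay self-contained, this sketch swaps tree-sitter for Python's built-in ast.parse, so it only recognises Python; it is not a multi-language solution. It also prunes the search by only starting and ending candidates at whitespace boundaries and by capping the span at max_tokens words. The function names, the limits, and the STRUCTURAL node list are assumptions: the node check is there because a bare word or a comma-separated pair of words is already a valid Python expression.

```python
import ast
import re

# Assumed filter: a span only counts as code if its syntax tree contains
# one of these node types, since plain words also parse as expressions.
STRUCTURAL = (ast.Import, ast.ImportFrom, ast.FunctionDef,
              ast.ClassDef, ast.Assign, ast.Call)


def parses_as_python(snippet: str) -> bool:
    """True if the snippet parses and contains a structural node."""
    try:
        tree = ast.parse(snippet)
    except (SyntaxError, ValueError):
        return False
    return any(isinstance(node, STRUCTURAL) for node in ast.walk(tree))


def find_python_spans(text: str, min_length: int = 8, max_tokens: int = 12) -> list:
    """Return non-overlapping spans of text that parse as Python."""
    tokens = [m.span() for m in re.finditer(r"\S+", text)]
    found = []
    i = 0
    while i < len(tokens):
        start = tokens[i][0]
        next_i = i + 1
        # Try the longest candidate first so a maximal span is reported.
        last = min(len(tokens), i + max_tokens) - 1
        for j in range(last, i - 1, -1):
            snippet = text[start:tokens[j][1]]
            if len(snippet) < min_length:
                break  # shorter candidates are below the minimum length
            if parses_as_python(snippet):
                found.append(snippet)
                next_i = j + 1  # skip past the span just reported
                break
        i = next_i
    return found
```

Even with the pruning, this calls the parser once per candidate span, so it is only practical for short texts. Prose such as "concept (example)" still parses as a function call, which illustrates why the result can never be perfectly accurate.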