What is a regular language?

Tags:

syntax

programming-languages

formal-languages

regular-language

bnf

I'm trying to understand the concept of language levels (regular, context-free, context-sensitive, etc.).

I can look this up easily, but all explanations I find are a load of symbols and talk about sets. I have two questions:

1. Can you describe in words what a regular language is, and how the languages differ?
2. Where do people learn to understand this stuff? As I understand it, it is formal mathematics? I had a couple of courses at uni which used it and barely anyone understood it as the tutors just assumed we knew it. Where can I learn it and why are people "expected" to know it in so many sources? It's like there's a gap in education.

Here's an example:

Any language belonging to this set is a regular language over the alphabet.

How can a language be "over" anything?

asked Jul 16 '11 15:07

FBryant87

1 Answer

In the context of computer science, a word is a concatenation of symbols. The set of symbols used is called the alphabet. For example, some words formed out of the alphabet {0,1,2,3,4,5,6,7,8,9} would be 1, 2, 12, 543, 1000, and 002.

A language is then a subset of all possible words. For example, we might want to define a language that captures all elite MI6 agents. Those all start with double-0, so words in the language would be 007, 001, 005, and 0012, but not 07 or 15. For simplicity's sake, we say a language is "over an alphabet" instead of "a subset of words formed by concatenation of symbols in an alphabet".
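To make "over an alphabet" concrete, here is a minimal Python sketch. The names `ALPHABET`, `is_word_over` and `is_double_0` are made up for illustration, and the language is modelled simply as a yes/no test on words:

```python
ALPHABET = set("0123456789")  # the ten digit symbols from the example


def is_word_over(word, alphabet=ALPHABET):
    """A word over the alphabet uses only symbols from that alphabet."""
    return all(symbol in alphabet for symbol in word)


def is_double_0(word):
    """Membership test for the double-0 language: digit words starting with 00."""
    return is_word_over(word) and word.startswith("00")


for w in ["007", "0012", "07", "15", "00x"]:
    print(w, is_double_0(w))
```

The word `00x` is rejected not because of its prefix but because `x` is not a symbol of the alphabet, so it is not a word "over" that alphabet at all.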

In computer science, we now want to classify languages. We call a language regular if an algorithm or machine with constant (finite) memory can decide whether a word is in the language by examining the word's symbols one after another. The language consisting of just the word 42 is regular, because you can decide whether a word is in it without needing arbitrary amounts of memory: you check that the first symbol is 4, that the second is 2, and that no further symbols follow.

All languages with a finite number of words are regular, because we can (in theory) just build a control flow tree of constant size (you can visualize it as a bunch of nested if-statements that examine one digit after the other). For example, we can test whether a word is in the "prime numbers between 10 and 99" language with the following construct, which requires no memory beyond what is needed to record which line of code we are currently on:

    if word[0] == '1':
        if word[1] == '1':  # 11
            return True  # "accept" the word, i.e. it's in the language
        if word[1] == '3':  # 13
            return True
    ...
    return False
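The same idea can be sketched for any finite language without writing the if-statements by hand. This is only an illustration, not the answer's own construction: the helper names `build_tree` and `in_language` are hypothetical, and the tree is stored as nested dictionaries with one level per symbol.

```python
# The two-digit primes, written out as words over the digit alphabet.
PRIMES_10_TO_99 = [
    "11", "13", "17", "19", "23", "29", "31", "37", "41", "43", "47",
    "53", "59", "61", "67", "71", "73", "79", "83", "89", "97",
]


def build_tree(words):
    """Turn a finite list of words into nested dicts, one level per symbol."""
    root = {}
    for word in words:
        node = root
        for symbol in word:
            node = node.setdefault(symbol, {})
        node[None] = True  # the key None marks "a word of the language ends here"
    return root


def in_language(tree, word):
    """Walk the tree symbol by symbol; the only thing remembered is the current node."""
    node = tree
    for symbol in word:
        if symbol not in node:
            return False
        node = node[symbol]
    return node.get(None, False)


TREE = build_tree(PRIMES_10_TO_99)
print(in_language(TREE, "13"), in_language(TREE, "15"))
```

The tree is built once and its size depends only on the language, not on the word being tested, which is the sense in which the memory is constant.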

Note that all finite languages are regular, but not all regular languages are finite. Our double-0 language contains an infinite number of words (007, 008, but also 004242 and 0012345), yet it can be tested with constant memory: check whether the first symbol is 0 and whether the second symbol is 0. If both are, accept the word. If the word is shorter than two symbols or does not start with 00, it's not an MI6 code name.

Formally, the construct of a finite-state machine or a regular grammar is used to prove that a language is regular. These are similar to the if-statements above, but allow for arbitrarily long words. If there's a finite-state machine, there is also a regular grammar, and vice versa, so it's sufficient to show either. For example, the finite-state machine for our double-0 language is:

    start state: if input = 0 then goto state 2
    start state: if input = 1 then fail
    start state: if input = 2 then fail
    ...
    state 2: if input = 0 then accept
    state 2: if input != 0 then fail
    accept: for any input, accept
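A machine like this can be written down as a lookup table. The following is a rough Python sketch rather than a canonical implementation; `TRANSITIONS` and `run_fsm` are illustrative names, "fail" is represented by a missing table entry, and the input alphabet is assumed to be the ten digits.

```python
DIGITS = "0123456789"

# Transition table: (current state, input symbol) -> next state.
# Any pair that is not listed means "fail".
TRANSITIONS = {
    ("start", "0"): "state 2",
    ("state 2", "0"): "accept",
}
# accept: for any (digit) input, stay in accept
TRANSITIONS.update({("accept", d): "accept" for d in DIGITS})


def run_fsm(word):
    """Feed the word to the machine one symbol at a time."""
    state = "start"
    for symbol in word:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False  # "fail": no transition for this input
    return state == "accept"


for w in ["007", "004242", "07", "0"]:
    print(w, run_fsm(w))
```

The only thing the loop remembers between symbols is `state`, which takes one of three values no matter how long the word is.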

The equivalent regular grammar is:

    start → 0 B
    B → 0 accept
    accept → 0 accept
    accept → 1 accept
    ...
    accept → ε
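To see how such a grammar produces a word, the sketch below replays a derivation step by step. It rests on two assumptions: that the elided rules cover the remaining digits, and that the grammar lets `accept` end the derivation through an ε-rule. The names `GRAMMAR` and `derive` are hypothetical.

```python
DIGITS = "0123456789"

# Each nonterminal maps to its right-hand sides as (terminal, next nonterminal).
# (None, None) stands for the assumed ε-rule that ends a derivation.
GRAMMAR = {
    "start": [("0", "B")],
    "B": [("0", "accept")],
    "accept": [(d, "accept") for d in DIGITS] + [(None, None)],
}


def derive(word):
    """Return the derivation steps for word, or None if the grammar cannot produce it.

    Picking the first matching rule is enough here because every nonterminal
    has at most one rule per terminal symbol.
    """
    steps, produced, nonterminal = ["start"], "", "start"
    for symbol in word:
        for terminal, nxt in GRAMMAR[nonterminal]:
            if terminal == symbol:
                produced, nonterminal = produced + symbol, nxt
                steps.append(f"{produced} {nonterminal}")
                break
        else:
            return None  # no rule produces this symbol at this point
    if (None, None) not in GRAMMAR[nonterminal]:
        return None  # the word ended before the derivation could
    steps.append(produced)
    return steps


print(derive("007"))
# → ['start', '0 B', '00 accept', '007 accept', '007']
```

Each step replaces the single nonterminal at the right end of the string, which mirrors the one state a finite-state machine is in at any moment.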

The equivalent regular expression is:

00[0-9]*
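In Python, this expression can be tried out with the standard `re` module. The function name `is_code_name` is made up for this sketch; `re.fullmatch` is used so that the whole word has to match, not just a prefix of it.

```python
import re

DOUBLE_0 = re.compile(r"00[0-9]*")


def is_code_name(word):
    """True if the entire word matches 00 followed by any number of digits."""
    return DOUBLE_0.fullmatch(word) is not None


for w in ["007", "0012345", "07", "1007"]:
    print(w, is_code_name(w))
```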

Some languages are not regular. For example, the language of any number of 1s followed by the same number of 2s (often written as 1ⁿ2ⁿ, for an arbitrary n) is not regular: you need more than a constant amount of memory (= a constant number of states) to store the number of 1s in order to decide whether a word is in the language.
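For contrast, here is a small sketch of a recognizer for 1ⁿ2ⁿ. It is easy to write in Python, but only because it keeps a counter that can grow without bound, which is exactly what a finite-state machine cannot do. The function name `is_1n2n` is hypothetical, and accepting the empty word (n = 0) is a choice made for this sketch.

```python
def is_1n2n(word):
    """Decide 1^n 2^n by counting; the counter can get as large as the word is long."""
    count = 0
    i = 0
    # Count the leading 1s.
    while i < len(word) and word[i] == "1":
        count += 1
        i += 1
    # Cancel one counted 1 for every 2 that follows.
    while i < len(word) and word[i] == "2":
        count -= 1
        i += 1
    # Accept only if the whole word was consumed and the counts matched.
    return i == len(word) and count == 0


for w in ["12", "1122", "112", "1212", "2211"]:
    print(w, is_1n2n(w))
```

However many states a fixed machine has, a word with more 1s than that forces it to revisit a state, and from then on it can no longer tell two different counts apart.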

This should usually be explained in a theoretical computer science course. Luckily, Wikipedia explains both formal and regular languages quite nicely.

answered Sep 28 '22 02:09

phihag
