I'm a Perl programmer who's attempting to learn Python by taking some work I've done before and converting it over to Python. This is NOT a line-by-line translation. I want to learn the Python Technique to do this type of task.
I'm parsing a Windows INI file. Section names are in the format:
[<type> <description>]
The <type> is a single-word field and is not case sensitive. The <description> could be multiple words.
After a section, there are a bunch of parameters and values. These are in the form of:
<parameter> = <value>
Parameters have no blank spaces and can only contain underscores, letters, and numbers (case insensitive). Thus, the first = is the divider between the parameter and the value. There might be white space separating the parameter and value around the equals sign, and there might be extra white space at the beginning or end of the line.
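For illustration, a couple of lines in this format might look like this (the names here are made up, not from my actual file):
# Connection settings
[database primary server]
host        = db01.example.com
retry_count = 3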
In Perl, I used regular expressions for parsing:
while (my $line = <CONTROL_FILE>) {
    chomp($line);
    next if ($line =~ /^\s*[#;']/);    #Comments start with "#", ";", or "'"
    next if ($line =~ /^\s*$/);        #Ignore blank lines
    if ($line =~ /^\s*\[\s*(\w+)\s+(.*)/) {    #Section
        say "This is a '$1' section called '$2'";
    }
    elsif ($line =~ /^\s*(\w+)\s*=\s*(.*)/) {    #Parameter
        say "Parameter is '$1' with a value of '$2'";
    }
    else {    #Not Comment, Section, or Parameter
        say "Invalid line";
    }
}
The problem is that I've been corrupted by Perl, so I think the easiest way to do something is to use a regular expression. Here's the code I have so far...
for line in file_handle:
    line = line.strip()
    # Comment lines and blank lines
    if line.find("#") == 0 \
            or line.find(";") == 0 \
            or not line:
        continue
    # Found a Section Heading
    if line.find("[") == 0:
        print "I want to use a regular expression here"
        print "to split the section up into two pieces"
    elif line.find("=") != -1:
        print "I want to use a regular expression here"
        print "to split the parameter into key and value"
    else:
        print "Invalid Line"
There are several things that irritate me here. I've been going through the various online tutorials, and they've helped me with understanding the syntax, but not much in the way of handling the language itself, especially for someone who tends to think in another language.
My question: what is the Pythonic way to handle this kind of line-by-line parsing, and how would I use Python's regular expressions to split the section and parameter lines apart?
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
The 'r' means that the following is a "raw string", i.e. backslash characters are treated literally instead of signifying special treatment of the following character: http://docs.python.org/reference/lexical_analysis.html#literals. So '\n' is a single newline, and r'\n' is two characters: a backslash and the letter 'n'.
The r prefix is part of the string syntax. With r, Python doesn't interpret backslash sequences such as \n, \t, etc. inside the quotes. Without r, you'd have to type each backslash twice in order to pass it to re.sub.
The maxsplit parameter of re.split() defines how many splits you want to perform. In simple words, if maxsplit is 2, then at most two splits will be done, and the remainder of the string is returned as the final element of the list.
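Putting those pieces together on a parameter line (a quick sketch; the sample line here is made up, not from the poster's file):
import re

line = "   retry_count =  3   "      # made-up example line
line = line.strip()

# r'...' keeps backslashes literal, so re sees \w and \s exactly as written
match = re.match(r'^(\w+)\s*=\s*(.*)$', line)
if match:
    print("parameter %s, value %s" % (match.group(1), match.group(2)))

# re.split with maxsplit=1 splits on the first '=' only,
# so an '=' inside the value is left alone
key, value = re.split(r'\s*=\s*', line, 1)
print("%s -> %s" % (key, value))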
While I don't think this is your intention, the file format appears quite similar to what Python's built-in ConfigParser module handles. Sometimes the most "Pythonic" way is already provided for you. (:
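For example, something like this would read the file (a sketch, assuming the file sticks to standard INI syntax; the filename is made up, ConfigParser won't treat ' as a comment character, and it keeps the whole bracketed text as one section name, so splitting <type> from <description> is still up to you):
import ConfigParser            # 'configparser' in Python 3

config = ConfigParser.ConfigParser()
config.read('control.ini')     # made-up filename

for section in config.sections():          # e.g. 'database primary server'
    for name, value in config.items(section):
        print("[%s] %s = %s" % (section, name, value))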
In more direct answer to your question: regular expressions may be a good way to do this. Otherwise, you could try the more basic (and less robust)
(parameter, value) = line.split('=')
This would throw an error if the line contained no '=' at all or more than one '=' character. You may want to test it first with '=' in line.
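A sketch of that approach, using split's maxsplit argument so a stray '=' inside the value doesn't break the unpacking (the sample line is made up):
line = "retry_count = 3"                     # made-up example line

if '=' in line:
    parameter, value = line.split('=', 1)    # split on the first '=' only
    print("%s -> %s" % (parameter.strip(), value.strip()))
else:
    print("Invalid line")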
Also:
line.find("[") == 1
is probably better replaced by
line.startswith("[")
Hope that helps a little (:
Python includes an INI parsing library. If you want to build a library to parse INI files, then you are looking at an actual parser. Regex won't cut it; use PLY or hook in a flex/bison C parser. Additional Python parsing resources are available as well.
Lexers handle all of the text consumption and tree construction for you, since it's a mechanical task prone to programmer error. For example, this section:
while (my $line = <CONTROL_FILE>) {
    chomp($line);
    next if ($line =~ /^\s*[#;']/);    #Comments start with "#", ";", or "'"
    next if ($line =~ /^\s*$/);        #Ignore blank lines
    if ($line =~ /^\s*\[\s*(\w+)\s+(.*)/) {    #Section
        say "This is a '$1' section called '$2'";
    }
    elsif ($line =~ /^\s*(\w+)\s*=\s*(.*)/) {    #Parameter
        say "Parameter is '$1' with a value of '$2'";
    }
    else {    #Not Comment, Section, or Parameter
        say "Invalid line";
    }
}
is handled for you by the lexer; you just need to define the correct regexes. The parser pulls the tokens from the lexer and determines if they fit the allowable token patterns. That is:
[<type> <description>]
<parameter> = <value>
Define those tokens, and then how they are allowed to fit. Everything else just puts itself together. For those of you who think you can do a better job with a quick for loop and some regex, I suggest you read Lex & Yacc, 2nd Ed.
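A minimal sketch of the lexer half with PLY (the token names, regexes, and sample input are my own illustration, not from the post; a real grammar would hand these tokens to a yacc-style parser):
import ply.lex as lex          # pip install ply

# Token names are illustrative, not from the original post.
tokens = ('SECTION', 'PARAMETER')

# "[<type> <description>]" -- returned as one token; splitting type from
# description can happen here or in the parser.
def t_SECTION(t):
    r'\[\s*\w+\s+[^\]]*\]'
    return t

# "<parameter> = <value>" -- again one token; splitting on the first '='
# is left to a later stage.
def t_PARAMETER(t):
    r'\w+\s*=\s*[^\n]*'
    return t

# Count newlines so error messages can report line numbers.
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# Comments start with '#', ';', or "'" and are discarded outright.
t_ignore_COMMENT = r"[#;'][^\n]*"

# Ignore spaces and tabs between tokens.
t_ignore = ' \t'

def t_error(t):
    print("Invalid input at line %d: %r" % (t.lexer.lineno, t.value[0]))
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("""
# made-up sample input, not the poster's real file
[database primary server]
host = db01.example.com
retry_count = 3
""")
for tok in lexer:
    print("%-10s %s" % (tok.type, tok.value))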
For an example parser I wrote with PLY, go here. It parses a "jetLetter" file, which is just a dialect of groff/troff.