In Perl, to lowercase a text file, I could write the following lowercase.perl:
#!/usr/bin/env perl
use warnings;
use strict;
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
while (<STDIN>) {
    print lc($_);
}
And on the command line: perl lowercase.perl < infile.txt > lowered.txt
In Python, I could do the same with lowercase.py:
#!/usr/bin/env python
import io
import sys
with io.open(sys.argv[1], 'r', encoding='utf8') as fin:
    with io.open(sys.argv[2], 'w', encoding='utf8') as fout:
        fout.write(fin.read().lower())
And on the command line: python lowercase.py infile.txt lowered.txt
Is the Perl lowercase.perl different from the Python lowercase.py?
Does it stream the input and lowercase it as it outputs? Or does it read the whole file like the Python lowercase.py?
Instead of reading in a whole file, is there a way to stream the input into Python and output the lowercased text byte by byte or char by char?
Is there a way to control the command-line syntax so that it follows the Perl STDIN and STDOUT usage? E.g. python lowercase.py < infile.txt > lowered.txt?
A Python 3.x equivalent of your Perl code may look as follows:
#!/usr/bin/env python3.4
import sys
for line in sys.stdin:
    print(line.rstrip('\n').lower(), file=sys.stdout)
It reads stdin line by line and can be used in a shell pipeline.
There seem to be two interleaved issues here, and I address those first. For how to make both Perl and Python accept either invocation with very similar behavior, see the second part of the post.
Short: They differ in how they do I/O. The Perl code works line by line, while the Python code as posted reads the whole file with fin.read(); it is easily changed to work line by line and to allow the same command-line invocation as the Perl code. Also, both can be written so as to allow input either from a file or from the standard input stream.
(1) The Perl code is "streaming" in the sense that it processes input line by line from STDIN. The Python code gets its data from a file and, because of fin.read(), loads all of it into memory before lowercasing. Once the Python code iterates over the file object instead, both get a line at a time, and in that sense they are comparable in efficiency for large files.
A standard way to both read and write files line-by-line in Python is
with open('infile', 'r') as fin, open('outfile', 'w') as fout:
    for line in fin:
        fout.write(line.lower())
See, for example, these SO posts on processing a very large file and read-and-write files. Iterating over the file object is the idiomatic way to process a file line by line; see for example the SO posts on reading a large file line by line, on idiomatic line-by-line reading, and another one on line-by-line reading.
Change the first open here to your io.open(sys.argv[1], ...) to take the file name directly from the first command-line argument, and add modes and an encoding as needed.
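Putting those two pieces together, a minimal sketch could look like the following. The function name lowercase_file and its encoding parameter are my own choices for illustration, not something from your script.

```python
import io


def lowercase_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, lowercasing one line at a time."""
    with io.open(src_path, "r", encoding=encoding) as fin, \
         io.open(dst_path, "w", encoding=encoding) as fout:
        # Iterating over the file object yields one line per step,
        # so only the current line is held in memory.
        for line in fin:
            fout.write(line.lower())
```

In a script you would call it as lowercase_file(sys.argv[1], sys.argv[2]) after import sys, which keeps your original invocation python lowercase.py infile.txt lowered.txt.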
(2) The command line with both input and output redirection that you show is a shell feature
./program < input > output
The program is fed lines through the standard input stream (file descriptor 0). They are provided from the file input by the shell via its < redirection. From the GNU Bash manual (see 3.6.1), where "word" stands for our "input":
Redirection of input causes the file whose name results from the expansion of word to be opened for reading on file descriptor n, or the standard input (file descriptor 0) if n is not specified.
Any program can be written to do that, i.e. to act as a filter. For Python you can use
import sys
for line in sys.stdin:
    print(line.lower(), end='')
See for example a post on writing filters. Now you can invoke it as script.py < input in a shell.
The code prints to standard output, which can then be redirected by the shell using >. Then you get the same invocation as for the Perl script.
I take it that the standard output redirection > is clear in both cases.
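If you also want the filter to force UTF-8 on both streams, roughly what binmode(STDIN, ":utf8") and binmode(STDOUT, ":utf8") do in your Perl script, one possible sketch for Python 3 is below. The names lowercase_stream and main are mine, and wrapping sys.stdin.buffer and sys.stdout.buffer is just one way to do it.

```python
import io
import sys


def lowercase_stream(src, dst):
    """Read text lines from src and write them lowercased to dst."""
    for line in src:
        dst.write(line.lower())


def main():
    # Wrap the underlying byte streams so that input is decoded and
    # output is encoded as UTF-8, whatever the locale says.
    stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
    stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
    lowercase_stream(stdin, stdout)
    stdout.flush()

# In a script, call main() under: if __name__ == "__main__":
```

Keeping the loop in a function that takes any two text streams makes it easy to try out with io.StringIO objects before hooking it up to the real stdin and stdout.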
Finally, you can bring both to nearly identical behavior, allowing either invocation, in this way.
In Perl, there is the following idiom
while (my $line = <>) {
    # process $line
}
The diamond operator <> either takes line by line from all files submitted on the command line (which are found in @ARGV), or it gets its lines from STDIN (if data is somehow piped into the script). From I/O Operators in perlop:
The null filehandle <> is special: it can be used to emulate the behavior of sed and awk, and any other Unix filter program that takes a list of filenames, doing the same to each line of input from all of them. Input from <> comes either from standard input, or from each file listed on the command line. Here's how it works: the first time <> is evaluated, the @ARGV array is checked, and if it is empty, $ARGV[0] is set to "-", which when opened gives you standard input. The @ARGV array is then processed as a list of filenames.
In Python you get practically the same behavior by
import fileinput
for line in fileinput.input():
    pass  # process line
This also goes through the lines of the files named in sys.argv[1:], defaulting to sys.stdin if the list is empty. From the fileinput documentation:
This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty. If a filename is '-', it is also replaced by sys.stdin. To specify an alternative list of filenames, pass it as the first argument to input(). A single file name is also allowed.
In both cases, if there are command-line arguments other than file names, more needs to be done.
With this you can use both Perl and Python scripts in either way
lowercase < input > output
lowercase input > output
Or, for that matter, as cat input | lowercase > output.
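For the Python side, here is a hedged sketch of what "more" might look like when the script also takes options. The --upper flag is purely hypothetical, made up to have a non-file argument, and parse_args and run are my own names. The idea is to let argparse separate options from file names and then hand only the file names to fileinput.

```python
import argparse
import fileinput
import sys


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Lowercase text (sketch).")
    # --upper is a hypothetical option, here only to show a non-file argument.
    parser.add_argument("--upper", action="store_true", help="uppercase instead")
    parser.add_argument("files", nargs="*", help="input files; stdin if none")
    return parser.parse_args(argv)


def run(argv, out=None):
    out = sys.stdout if out is None else out
    args = parse_args(argv)
    convert = str.upper if args.upper else str.lower
    # Passing files= explicitly keeps fileinput from reading sys.argv itself;
    # an empty list makes it fall back to standard input.
    with fileinput.input(files=args.files) as lines:
        for line in lines:
            out.write(convert(line))
```

Calling run(sys.argv[1:]) at the bottom of the script then supports lowercase input, lowercase < input and cat input | lowercase alike. Note that fileinput decodes with the default encoding here, so this sketch does not force UTF-8.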
All methods here read input and write output line by line. This may be further optimized (buffered) by the interpreter, the system, and the shell's redirections. It is possible to change that so as to read and/or write in smaller chunks, but that would be very inefficient and would noticeably slow programs down.
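Since the question asks about going char by char, here is a minimal sketch of what reading in smaller chunks could look like in Python, mainly to show the shape of it rather than to recommend it. The name lowercase_chunks and the default chunk_size of 1 are my own choices.

```python
from functools import partial


def lowercase_chunks(src, dst, chunk_size=1):
    """Copy text from src to dst, lowercasing chunk_size characters at a time."""
    # iter(callable, sentinel) keeps calling src.read(chunk_size)
    # until it returns the empty string, which signals end of input.
    for chunk in iter(partial(src.read, chunk_size), ""):
        dst.write(chunk.lower())
```

With text streams such as sys.stdin and sys.stdout, chunk_size counts characters, not bytes. One caveat, as far as I know: str.lower() is context-sensitive for a few characters (Greek capital sigma is the usual example), so lowercasing chunk by chunk can differ from lowercasing whole lines when a chunk boundary falls inside a word.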
Slightly off topic (depending on your definition of "Perl") but maybe of interest...
perl6 -e ' .lc.say for "infile.txt".IO.lines ' > lowered.txt
This processes neither "byte by byte" nor "whole file" but "line by line". .lines creates a lazy list, so you will not use a ton of memory if your file is large. The file is presumed to be text (meaning you get Strs rather than Bufs of bytes when you read) and the encoding defaults to "Unicode", meaning open will try to figure out what UTF is used, and if it can't it will presume UTF-8. Details here.
By default line endings are chomped as you read and put back on by say. If the processing requirements prohibit that, you can pass the boolean named parameter :chomp to .lines (and use .print rather than .say):
$ perl6 -e ' .lc.print for "infile.txt".IO.lines(:!chomp) ' > lowered.txt
You can avoid the IO redirection and do it all in perl6, but this will read the whole file in as one Str:
$ perl6 -e ' "lowered.txt".IO.spurt: "infile.txt".IO.slurp.lc '