
Perl split a text file into chunks

Tags: perl

I have a large txt file made up of thousands of articles, and I am trying to split it into individual files, one per article, saved as article_1, article_2, etc. Each article begins with a line containing the word /DOCUMENTS/. I am totally new to Perl and any insight would be great (even advice on good documentation websites). Thanks a lot. So far, what I have tried looks like this:

#!/usr/bin/perl
use warnings;
use strict;

my $id = 0;
my $source = "2010_FTOL_GRbis.txt";
my $destination = "file$id.txt";

open IN, $source or die "can t read $source: $!\n";

while (<IN>)
  {
    {  
      open OUT, ">$destination" or die "can t write $destination: $!\n";
      if (/DOCUMENTS/)
       {
         close OUT ;
         $id++;
       }
    }
  }
close IN;

user1562471 asked Apr 06 '26 16:04


1 Answer

Let's assume that /DOCUMENTS/ appears on a line by itself. In that case you can make it the record separator, so that each read returns one whole article instead of one line.

use strict;
use warnings;
use English     qw<$RS>;
use File::Slurp qw<write_file>;

my $id     = 0;
my $source = "2010_FTOL_GRbis.txt";

{   # "local" restores the default record separator when this block ends
    local $RS = "\n/DOCUMENTS/\n";
    open my $in, '<', $source or die "can't read $source: $!\n";
    while ( <$in> ) {
        chomp;    # removes the trailing separator "\n/DOCUMENTS/\n"
        write_file( 'file' . ( ++$id ) . '.txt', $_ );
    }
    # $in is lexical to this block and would be closed automatically
    # when the block ends, so this explicit close is optional.
    close $in;
}
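
To see what the custom record separator does without touching the disk, here is a minimal sketch that reads from an in-memory string instead of a file. The sample text is made up for illustration (the real file is not shown in the question), and it uses $/ directly, the built-in name for $RS.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real file.
my $text = "first article\nline two\n/DOCUMENTS/\n"
         . "second article\n/DOCUMENTS/\n"
         . "third article\n";

my @records;
{
    local $/ = "\n/DOCUMENTS/\n";    # same separator as above
    open my $in, '<', \$text or die "can't open in-memory file: $!\n";
    while ( my $record = <$in> ) {
        chomp $record;    # strips the separator only if the record ends with it
        push @records, $record;
    }
    close $in;
}

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

Two details are worth noticing. The last record is not followed by a separator, so chomp leaves its final newline in place. Likewise, if the input begins with a /DOCUMENTS/ line, there is no newline in front of it, so that marker is not treated as a separator and stays at the top of the first record.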

NOTES:

  • use English provides the readable name $RS for the global record-separator variable. The "messy name" for it is $/. See perldoc perlvar.
  • The default record separator is a newline. That is, the standard unit of file reading is a record, which by default just happens to be a line.
  • As the documentation explains, $RS takes only literal strings, not regular expressions. So, assuming the division between articles is '/DOCUMENTS/' on a line by itself, I specified newline + '/DOCUMENTS/' + newline. If /DOCUMENTS/ is instead part of a path that occurs somewhere within a line, that particular value will not work as the record separator.
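
If /DOCUMENTS/ can appear anywhere within a line, one alternative is to keep the default record separator, read line by line, and start a new output file whenever a line matches. The sketch below is only one way to do that, using core Perl. The function name split_articles and the decision to drop the marker line itself are my own assumptions, not something taken from the question.

```perl
use strict;
use warnings;

# Split $source into numbered files named "<prefix>1.txt", "<prefix>2.txt", ...
# A new file starts at every line that contains DOCUMENTS.
# (split_articles is a hypothetical helper name.)
sub split_articles {
    my ( $source, $prefix ) = @_;
    my $id = 0;
    my $out;

    open my $in, '<', $source or die "can't read $source: $!\n";
    while ( my $line = <$in> ) {
        if ( $line =~ /DOCUMENTS/ ) {
            close $out if $out;    # finish the previous article
            $id++;
            my $destination = "$prefix$id.txt";
            open $out, '>', $destination
                or die "can't write $destination: $!\n";
            next;                  # assumption: the marker line is not copied
        }
        # Lines before the first marker have no file to go to and are skipped.
        print {$out} $line if $out;
    }
    close $out if $out;
    close $in;
    return $id;                    # number of articles written
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $count = split_articles( $ARGV[0], 'article_' );
    print "wrote $count files\n";
}
```

Unlike the attempt in the question, the output file is opened only when a marker line is seen, not on every input line, and each non-marker line is actually printed to the current file.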

Axeman answered Apr 09 '26 08:04