Compare many text files that contain duplicate "stubs" from the previous and next file and remove duplicate text automatically

Question

I have a large number of text files (1000+) each containing an article from an academic journal. Unfortunately each article's file also contains a "stub" from the end of the previous article (at the beginning) and from the beginning of the next article (at the end).

I need to remove these stubs in preparation for running a frequency analysis on the articles because the stubs constitute duplicate data.

There is no simple field that marks the beginning and end of each article in all cases. However, the duplicate text does seem to formatted the same and on the same line in both cases.

A script that compared each file to the next file and then removed 1 copy of the duplicate text would be perfect. This seems like it would be a pretty common issue when programming so I am surprised that I haven't been able to find anything that does this.

The file names sort in order, so a script that compares each file to the next sequentially should work. E.G.

bul_9_5_181.txt
bul_9_5_186.txt

are two articles, one starting on page 181 and the other on page 186. Both of these articles are included bellow.

There is two volumes of test data located at [http://drop.io/fdsayre][1]

Note: I am an academic doing content analysis of old journal articles for a project in the history of psychology. I am no programmer, but I do have 10+ years experience with linux and can usually figure things out as I go.

Thanks for your help

FILENAME: bul_9_5_181.txt

SYN&STHESIA

ISI

the majority of Portugese words signifying black objects or ideas relating to black. This association is, admittedly, no true synsesthesia, but the author believes that it is only a matter of degree between these logical and spontaneous associations and genuine cases of colored audition. REFERENCES

DOWNEY, JUNE E. A Case of Colored Gustation. Amer. J. of Psycho!., 1911, 22, S28-539MEDEIROS-E-ALBUQUERQUE. Sur un phenomene de synopsie presente par des millions de sujets. / . de psychol. norm, et path., 1911, 8, 147-151. MYERS, C. S. A Case of Synassthesia. Brit. J. of Psychol., 1911, 4, 228-238.

AFFECTIVE PHENOMENA — EXPERIMENTAL BY PROFESSOR JOHN F. .SHEPARD University of Michigan

Three articles have appeared from the Leipzig laboratory during the year. Drozynski (2) objects to the use of gustatory and olfactory stimuli in the study of organic reactions with feelings, because of the disturbance of breathing that may be involved. He uses rhythmical auditory stimuli, and finds that when given at different rates and in various groupings, they are accompanied by characteristic feelings in each subject. He records the chest breathing, and curves from a sphygmograph and a water plethysmograph. Each experiment began with a normal record, then the stimulus was given, and this was followed by a contrast stimulus; lastly, another normal was taken. The length and depth of breathing were measured (no time line was recorded), and the relation of length of inspiration to length of expiration was determined. The length and height of the pulsebeats were also measured. Tabular summaries are given of the number of times the author finds each quantity to have been increased or decreased during a reaction period with each type of feeling. The feeling state accompanying a given rhythm is always complex, but the result is referred to that dimension which seemed to be dominant. Only a few disconnected extracts from normal and reaction periods are reproduced from the records. The author states that excitement gives increase in the rate and depth of breathing, in the inspiration-expiration ratio, and in the rate and size of pulse. There are undulations in the arm volume. In so far as the effect is quieting, it causes decrease in rate and depth of

182

JOHN F. SHEPARD

breathing, in the inspiration-expiration ratio, and in the pulse rate and size. The arm volume shows a tendency to rise with respiratory waves. Agreeableness shows

Waylon Flinn · Accepted Answer

It looks like a much simpler solution would actually work.

No one seems to be using the information provided by the filenames. If you do make use of this information, you may not have to do any comparisons between files to identify the area of overlap. Whoever wrote the OCR probably put some thought into this problem.

The last number in the file name tells you what the starting page number for that file is. This page number appears on a line by itself in the file as well. It also looks like this line is preceded and followed by blank lines. Therefore for a given file you should be able to look at the name of the next file in the sequence and determine the page number at which you should start removing text. Since this page number appears in your file just look for a line that contains only this number (preceded and followed by blank lines) and delete that line and everything after. The last file in the sequence can be left alone.

Here's an outline for an algorithm

choose a file; call it: file1
look at the filename of the next file; call it: file2
extract the page number from the filename of file2; call it: pageNumber
scan the contents of file1 until you find a line that contains only pageNumber
make sure this line is preceded and followed by a blank line.
remove this line and everything after
move on to the next file in the sequence

MarkusQ · Answer

You should probably try something like this (I've now tested it on the sample data you provided):

#!/usr/bin/ruby

class A_splitter
    Title   = /^[A-Z]+[^a-z]*$/
    Byline  = /^BY /
    Number = /^\d*$/
    Blank_line = /^ *$/
    attr_accessor :recent_lines,:in_references,:source_glob,:destination_path,:seen_in_last_file
    def initialize(src_glob,dst_path=nil)
        @recent_lines = []
        @seen_in_last_file = {}
        @in_references = false
        @source_glob = src_glob
        @destination_path = dst_path
        @destination = STDOUT
        @buffer = []
        split_em
        end
    def split_here
        if destination_path
            @destination.close if @destination
            @destination = nil
          else
            print "------------SPLIT HERE------------
" 
          end
        print recent_lines.shift
        @in_references = false
        end
    def at_page_break
        ((recent_lines[0] =~ Title  and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Number) or
         (recent_lines[0] =~ Number and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Title))
        end
    def print(*args)
        (@destination || @buffer) << args
        end
    def split_em
        Dir.glob(source_glob).sort.each { |filename|
            if destination_path
                @destination.close if @destination
                @destination = File.open(File.join(@destination_path,filename),'w')
                print @buffer
                @buffer.clear
              end
            in_header = true
            File.foreach(filename) { |line|
                line.gsub!(/\f/,'')
                if in_header and seen_in_last_file[line]
                    #skip it
                  else 
                    seen_in_last_file.clear if in_header
                    in_header = false
                    recent_lines << line
                    seen_in_last_file[line] = true
                  end
                3.times {recent_lines.shift} if at_page_break
                if recent_lines[0] =~ Title and recent_lines[1] =~ Byline
                    split_here
                  elsif in_references and recent_lines[0] =~ Title and recent_lines[0] !~ /\d/
                    split_here
                  elsif recent_lines.length > 4
                    @in_references ||= recent_lines[0] =~ /^REFERENCES *$/
                    print recent_lines.shift
                  end
                }
            } 
        print recent_lines
        @destination.close if @destination
        end
    end

A_splitter.new('bul_*_*_*.txt','test_dir')

Basically, run through the files in order, and within each file run through the lines in order, omitting from each file the lines that were present in the preceding file and printing the rest to STDOUT (from which it can be piped) unless a destination director is specified (called 'test_dir' in the example see the last line) in which case files are created in the specified directory with the same name as the file which contained the bulk of their contents.

It also removes the page-break sections (journal title, author, and page number).

It does two split tests:

a test on the title/byline pair
a test on the first title-line after a reference section

(it should be obvious how to add tests for additional split-points).

Retained for posterity:

If you don't specify a destination directory it simply puts a split-here line in the output stream at the split point. This should make it easier for testing (you can just less the output) and when you want them in individual files just pipe it to csplit (e.g. with

csplit -f abstracts - '---SPLIT HERE---' '{*}'

or something) to cut it up.

Compare many text files that contain duplicate "stubs" from the previous and next file and remove duplicate text automatically

Tags:

text

scripting

nlp

duplicate-data

fdsayre

2 Answers

Waylon Flinn

MarkusQ

Recent Activity

Donate For Us

Compare many text files that contain duplicate "stubs" from the previous and next file and remove duplicate text automatically

Tags:

text

scripting

nlp

duplicate-data

fdsayre

2 Answers

Waylon Flinn

MarkusQ

Related questions

Recent Activity

Donate For Us