
Open source html parsing class not properly parsing spaces between paragraphs

I'm using an open source method that converts HTML text into a plain-text NSString.

The resulting strings have large amounts of white space between the first couple of paragraphs, but only one line of space between subsequent paragraphs. Here is an example of the output:

[Image: screenshot of the example output]

Below is the method I'm calling. I've changed only two lines of the code: for stopCharacters and newLineAndWhitespaceCharacters, I removed \n from the character set, because when it was included the entire text came out as one long paragraph.

- (NSString *)stringByConvertingHTMLToPlainText {

    // Pool
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    // Character sets
    NSCharacterSet *stopCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@"< \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
    NSCharacterSet *newLineAndWhitespaceCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@" \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
    NSCharacterSet *tagNameCharacters = [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"];

    // Scan and find all tags
    NSMutableString *result = [[NSMutableString alloc] initWithCapacity:self.length];
    NSScanner *scanner = [[NSScanner alloc] initWithString:self];
    [scanner setCharactersToBeSkipped:nil];
    [scanner setCaseSensitive:YES];
    NSString *str = nil, *tagName = nil;
    BOOL dontReplaceTagWithSpace = NO;
    do {

        // Scan up to the start of a tag or whitespace
        if ([scanner scanUpToCharactersFromSet:stopCharacters intoString:&str]) {
            [result appendString:str];
            str = nil; // reset
        }

        // Check if we've stopped at a tag/comment or whitespace
        if ([scanner scanString:@"<" intoString:NULL]) {

            // Stopped at a comment or tag
            if ([scanner scanString:@"!--" intoString:NULL]) {

                // Comment
                [scanner scanUpToString:@"-->" intoString:NULL];
                [scanner scanString:@"-->" intoString:NULL];

            } else {

                // Tag - remove and replace with space unless it's
                // a closing inline tag then dont replace with a space
                if ([scanner scanString:@"/" intoString:NULL]) {

                    // Closing tag - replace with space unless it's inline
                    tagName = nil; dontReplaceTagWithSpace = NO;
                    if ([scanner scanCharactersFromSet:tagNameCharacters intoString:&tagName]) {
                        tagName = [tagName lowercaseString];
                        dontReplaceTagWithSpace = ([tagName isEqualToString:@"a"] ||
                                                   [tagName isEqualToString:@"b"] ||
                                                   [tagName isEqualToString:@"i"] ||
                                                   [tagName isEqualToString:@"q"] ||
                                                   [tagName isEqualToString:@"span"] ||
                                                   [tagName isEqualToString:@"em"] ||
                                                   [tagName isEqualToString:@"strong"] ||
                                                   [tagName isEqualToString:@"cite"] ||
                                                   [tagName isEqualToString:@"abbr"] ||
                                                   [tagName isEqualToString:@"acronym"] ||
                                                   [tagName isEqualToString:@"label"]);
                    }

                    // Replace tag with string unless it was an inline
                    if (!dontReplaceTagWithSpace && result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "];

                }

                // Scan past tag
                [scanner scanUpToString:@">" intoString:NULL];
                [scanner scanString:@">" intoString:NULL];

            }

        } else {

            // Stopped at whitespace - replace all whitespace and newlines with a space
            if ([scanner scanCharactersFromSet:newLineAndWhitespaceCharacters intoString:NULL]) {
                if (result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "]; // Dont append space to beginning or end of result
            }

        }

    } while (![scanner isAtEnd]);

    // Cleanup
    [scanner release];

    // Decode HTML entities and return
    NSString *retString = [[result stringByDecodingHTMLEntities] retain];
    [result release];

    // Drain
    [pool drain];

    // Return
    return [retString autorelease];

}

EDIT:

Here is the NSLog output of the string (only the first few paragraphs are pasted):

Mitt Romney spent the past six years running for president. After his loss to President Barack Obama, he'll have to chart a different course.  


 His initial plan: spend time with his family. He has five sons and 18 grandchildren, with a 19th on the way.  






 "I don't look at postelection to be a time of regrouping. Instead it's a time of forward focus," Romney told reporters aboard his plane Tuesday evening as he returned to Boston after the final campaign stop of his political career. "I have, of course, a family and life important to me, win or lose."  

 The most visible member of that family — wife Ann Romney — says neither she nor her husband will seek political office again.  

etc....

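NSUInteger start = [completeTrimmed rangeOfString:@"chart a different course."].location;
if (start != NSNotFound) {
    // Log the code of each character following the end of the first paragraph
    for (NSUInteger j = 25; j < 50 && start + j < completeTrimmed.length; j++) {
        // characterAtIndex: returns a unichar (UTF-16 code unit), not a char
        unichar test = [completeTrimmed characterAtIndex:(start + j)];
        NSLog(@"%d", (int)test);
    }
}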

2012-11-11 17:15:57.668 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 72
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 115
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 110
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 116
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 97
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 112
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 97
asked Nov 08 '12 05:11 by Mahir


2 Answers

Try trimming the result at the end of the method, before it is returned:

  // Decode HTML entities, trim leading/trailing whitespace and newlines,
  // and retain the result so it survives the pool drain below
  NSString *retString = [[[result stringByDecodingHTMLEntities]
      stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] retain];
  [result release];

  // Drain
  [pool drain];

  // Return
  return [retString autorelease];
}

If that doesn't work, also try

completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@"\n" withString:@""]; 

and

completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@"\r" withString:@""];
answered Oct 06 '22 15:10 by iDev


You could replace @"\n\n" with @"\n" to reduce the number of line breaks.
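
Note that the character codes logged in the question show spaces (32) between the line feeds (10), so a literal search for two adjacent newlines may find nothing to replace. Below is a minimal sketch of a whitespace-tolerant variant. It assumes a hypothetical helper named CollapseBlankLines, which is not part of the parsing class in the question, built on NSRegularExpression:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption, not part of the HTML parsing category):
// collapses every run of whitespace that contains at least one line feed
// into a single "\n", then trims whitespace from both ends of the string.
static NSString *CollapseBlankLines(NSString *text) {
    NSError *error = nil;
    // Optional spaces/tabs/CRs, one line feed, then any further whitespace
    // (including more line feeds and the space that starts the next paragraph).
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"[ \\t\\r]*\\n\\s*"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        // Pattern failed to compile; return the input unchanged.
        return text;
    }
    NSString *collapsed =
        [regex stringByReplacingMatchesInString:text
                                        options:0
                                          range:NSMakeRange(0, text.length)
                                   withTemplate:@"\n"];
    return [collapsed stringByTrimmingCharactersInSet:
                          [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}
```

It would be called on the converted string, for example completeTrimmed = CollapseBlankLines(completeTrimmed);. The pattern treats any run of whitespace containing a line feed as one paragraph break, so it should also drop the stray space at the start of each paragraph. If a blank line between paragraphs is wanted, the template could be @"\n\n" instead.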

answered Oct 06 '22 13:10 by Darren