HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster. The tree-construction stage consumes the tokenized input and builds up the document tree.
The constructor method is a special method of a class for creating and initializing an object instance of that class.
A constructor is a special function that creates and initializes an object instance of a class. In JavaScript, a constructor gets called when an object is created using the new keyword. The purpose of a constructor is to create a new object and set values for any existing object properties.
A class can have multiple constructors that assign the fields in different ways. Sometimes it's beneficial to specify every aspect of an object's data by assigning parameters to the fields, but other times it might be appropriate to define only one or a few.
I normally follow one easy principle:
Everything that is mandatory for the correct existence and behavior of the class instance should be passed to, and done in, the constructor.
Every other activity is done by other methods.
The constructor should never:
Because I learned the hard way that while you are in the constructor, the object is in an incoherent, intermediate state which is too dangerous to handle. Some of this unexpected behavior could come from your own code, some from the language architecture and compiler decisions. Never guess, stay safe, be minimal.
In your case, I would use a Parser::parseHtml(file) method. The instantiation of the parser and the parsing are two different operations. When you instantiate a parser, the constructor puts it in a condition to perform its job (parsing). Then you use its method to perform the parsing. You then have two choices:
The second strategy grants you better granularity, as the Parser is now stateless, and the client needs to interact with the methods of the ParsingResult interface. The Parser interface remains sleek and simple. The internals of the Parser class will tend to follow the Builder pattern.
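As a rough illustration of that second strategy, here is a minimal Java sketch. The names Parser, ParsingResult and tagNames() are hypothetical, and the "parsing" is a toy regular expression that only collects start-tag names; it is not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical result interface: the client talks to this, not to the parser.
interface ParsingResult {
    List<String> tagNames();
}

// Hypothetical stateless parser: construction is cheap, parsing is an explicit call.
class Parser {
    // Toy tokenizer: matches start tags such as <p> or <a href="...">, not end tags.
    private static final Pattern START_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    ParsingResult parseHtml(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = START_TAG.matcher(html);
        while (m.find()) {
            names.add(m.group(1).toLowerCase());
        }
        // The result owns an immutable copy; the parser keeps no state between calls.
        List<String> frozen = List.copyOf(names);
        return () -> frozen;
    }
}
```

Because the parser holds no per-document state, one instance can be reused for any number of documents, and everything the client needs lives in the returned result object.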
You comment: "I feel as though returning an instance of a parser that hasn't parsed anything (as you suggest) is a constructor that's lost its purpose. There's no use in initializing a parser without the intent of actually parsing the information. So if parsing is going to happen for sure, should we parse as early as possible and report any errors early, such as during the construction of the parser? I feel as though initializing a parser with invalid data should result in an error being thrown."
Not really. If you return an instance of a Parser, of course it's going to parse. In Qt, when you instantiate a button, of course it's going to be shown. However, you still have to call QWidget::show() manually before anything is visible to the user.
Any object in OOP has two concerns: initialization and operation (ignore finalization, it's not under discussion right now). If you keep these two operations together, you both risk trouble (having an incomplete object operating) and lose flexibility. There are plenty of reasons why you would perform intermediate setup of your object before calling parseHtml(). Example: suppose you want to configure your Parser to be strict (so that it fails if a given column in a table contains a string instead of an integer) or permissive. Or to register a listener object which is notified every time a parsing run starts or ends (think GUI progress bar). This is optional information, and if your architecture makes the constructor the übermethod that does everything, you end up with a huge list of optional parameters and conditions to handle in a method which is inherently a minefield.
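To make the strict/permissive and listener example concrete, here is a small Java sketch. ColumnParser, ParseListener, setStrict() and addListener() are names invented for illustration, and the "table column" is reduced to a list of cell strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener, e.g. something a GUI progress bar could implement.
interface ParseListener {
    void onParsed(int cellsParsed);
}

// Hypothetical parser for one table column that is expected to hold integers.
class ColumnParser {
    private boolean strict = false;  // permissive by default
    private final List<ParseListener> listeners = new ArrayList<>();

    // Optional setup happens between construction and parsing.
    ColumnParser setStrict(boolean strict) {
        this.strict = strict;
        return this;
    }

    ColumnParser addListener(ParseListener listener) {
        listeners.add(listener);
        return this;
    }

    List<Integer> parse(List<String> cells) {
        List<Integer> out = new ArrayList<>();
        for (String cell : cells) {
            try {
                out.add(Integer.parseInt(cell.trim()));
            } catch (NumberFormatException e) {
                if (strict) {
                    throw new IllegalArgumentException("Not an integer: " + cell, e);
                }
                // Permissive mode: silently skip the bad cell.
            }
        }
        for (ParseListener listener : listeners) {
            listener.onParsed(out.size());
        }
        return out;
    }
}
```

The constructor stays trivial, and each optional concern gets its own small method instead of another constructor parameter.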
"Caching should not be the responsibility of a parser. If data is to be cached, a separate cache class should be created to provide that functionality."
On the contrary. If you know that you are going to use the parsing functionality on a lot of files, and there's a significant chance that the files are going to be accessed and parsed again later on, it is the internal responsibility of the Parser to perform smart caching of what it has already seen. The client is totally oblivious to whether this caching is performed or not. It is still calling the parser and still obtaining a result object, but it is getting the answer much faster. I think there's no better demonstration of separation of concerns than this. You boost performance with absolutely no change in the contract interface or the whole software architecture.
However, note that I am not advocating that you should never use a constructor call to perform parsing. I am just claiming that it's potentially dangerous and you lose flexibility. There are plenty of examples out there where the constructor is at the center of the actual activity of the object, but there are also plenty of examples of the opposite. Example (although biased, since it arises from C style): in Python, I would consider something like this very weird:
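A minimal Java sketch of such transparent caching follows. CachingParser and countStartTags() are hypothetical names, the cache is keyed on the document text rather than on a file (an assumption made to keep the example self-contained), and the realParses counter exists only so the effect can be observed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser that caches results internally; the public contract is unchanged.
class CachingParser {
    private final Map<String, Integer> cache = new HashMap<>();
    private int realParses = 0;  // demonstration only: counts uncached parses

    // Toy "parse result": the number of start tags in the document.
    int countStartTags(String html) {
        return cache.computeIfAbsent(html, this::parseUncached);
    }

    private Integer parseUncached(String html) {
        realParses++;
        int count = 0;
        for (int i = 0; i + 1 < html.length(); i++) {
            // A '<' followed by a letter starts a start tag; "</" starts an end tag.
            if (html.charAt(i) == '<' && Character.isLetter(html.charAt(i + 1))) {
                count++;
            }
        }
        return count;
    }

    int realParses() {
        return realParses;
    }
}
```

The caller uses countStartTags() in exactly the same way whether or not a cached value exists, which is the point being made about the interface staying unchanged.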
f = file()
f.setReadOnly()
f.open(filename)
instead of the actual
f = file(filename,"r")
But I am sure there are IO access libraries using the first approach (with the second as syntactic sugar).
Edit: finally, remember that while it's easy and compatible to add a constructor "shortcut" in the future, it is not possible to remove this functionality if you find it dangerous or problematic. Additions to the interface are much easier than removals, for obvious reasons. Sugary behavior must be weighed against the future support you have to provide for that behavior.
"Should the parsing code be placed within a void parseHtml() method and the accessors only return valid values once this method is called?"
Yes.
"The design of the class is such that the class' constructor does the parsing"
This prevents customization, extension, and -- most importantly -- dependency injection.
There will be times when you want to do the following:
1. Construct a parser.
2. Add Features to the parser: Business Rules, Filters, Better Algorithms, Strategies, Commands, whatever.
3. Parse.
Generally, it's best to do as little as possible in a constructor so that you are free to extend or modify.
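The three steps above might look like this in Java. PluggableParser, TokenStrategy and setStrategy() are hypothetical names, and splitting on whitespace stands in for a real parsing algorithm.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical strategy: how the raw input is turned into tokens.
interface TokenStrategy {
    List<String> tokenize(String input);
}

class PluggableParser {
    // Default strategy (an assumption for this sketch): split on whitespace.
    private TokenStrategy strategy = input -> Arrays.asList(input.trim().split("\\s+"));

    // Step 1: construction does no parsing at all.
    PluggableParser() {
    }

    // Step 2: features are injected after construction and before parsing.
    void setStrategy(TokenStrategy strategy) {
        this.strategy = strategy;
    }

    // Step 3: parsing is an explicit, separate call.
    List<String> parse(String input) {
        return strategy.tokenize(input);
    }
}
```

A unit test can inject a stub strategy in step 2, which would be impossible if the constructor had already parsed the input with the default algorithm.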
Edit
"Couldn't extensions simply parse the extra information in their constructors?"
Only if they don't have any kind of features that need to be injected.  If you want to add features -- say a different strategy for constructing the parse tree -- your subclasses have to also manage this feature addition before they parse.  It may not amount to a simple super() because the superclass does too much.
"Also, parsing in the constructor allows me to fail early"
Kind of. Failing during construction is a weird use case. Failing during construction makes it difficult to construct a parser like this...
class SomeClient {
    Parser p = new Parser();
    void aMethod() {...}
}
Usually a construction failure means you're out of memory. There's rarely a good reason to catch construction exceptions because you're doomed anyway.
You're forced to build the parser in a method body because it has too complex arguments.
In short, you've removed options from the clients of your parser.
"It's inadvisable to inherit from this class to replace an algorithm."
That's funny. Seriously. It's an outrageous claim. No algorithm is optimal for all possible use cases. Often a high-performance algorithm uses a lot of memory. A client may want to replace the algorithm with a slower one that uses less memory.
You can claim perfection, but it's rare. Subclasses are the norm, not an exception. Someone will always improve on your "perfection". If you limit their ability to subclass your parser, they'll simply discard it for something more flexible.
"I don't see needing step 2 as described in the answer."
A bold statement. Dependencies, Strategies and related injection design patterns are common requirements. Indeed, they're so essential for unit testing that a design which makes it difficult or complex often turns out to be a bad design.
Limiting the ability to subclass or extend your parser is a bad policy.
Bottom Line.
Assume nothing. Write a class with as few assumptions about its use cases as possible. Parsing at construction time makes too many assumptions about client use cases.
A constructor should do whatever is necessary to put that instance into a runnable, valid, ready-to-use state. If that means some validation or analysis, I'd say it belongs there. Just be careful about how much the constructor does.
There might be other places in your design where validation fits as well.
If the input values are coming from a UI, I'd say that it should have a hand in ensuring valid input.
If the input values are being unmarshalled from an incoming XML stream, I'd think about using schemas to validate it.
I'd probably just pass enough to initialize the object and then have a 'parse' method. The idea is that expensive operations should be as obvious as possible.
You should try to keep the constructor from doing unnecessary work. In the end, it all depends on what the class should do, and how it should be used.
For instance, will all the accessors be called after constructing your object? If not, then you've processed data unnecessarily. Also, there's a bigger risk of throwing a "senseless" exception (oh, while trying to create the parser, I got an error because the file was malformed, but I didn't even ask it to parse anything...)
On second thought, you might need fast access to this data once the object is built, even if building the object takes a long time. It might be OK in this case.
Anyway, if the building process is complicated, I'd suggest using a creational pattern (factory, builder).
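One possible shape for the factory variant, sketched in Java: the constructor only assigns a field, while a static factory method does the expensive and fallible work. ParsedPage, fromHtml() and the naive title extraction are all made up for this example.

```java
// Hypothetical immutable value produced by a static factory method.
final class ParsedPage {
    private final String title;

    // Trivial constructor: it only stores an already-computed value.
    private ParsedPage(String title) {
        this.title = title;
    }

    // The expensive, fallible work is visible in the factory method's name and call site.
    static ParsedPage fromHtml(String html) {
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("No <title> element found");
        }
        String raw = html.substring(start + "<title>".length(), end);
        return new ParsedPage(raw.trim());
    }

    String title() {
        return title;
    }
}
```

Callers still get a fully initialized object in one expression, but the class never exposes a half-built instance and the constructor cannot fail on malformed input.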
It is a good rule of thumb to only initialize fields in constructors, and otherwise do as little as possible to initialize the object. Using Java as an example, you could run into problems if you call methods in your constructor, especially if you subclass your object. This is because, due to the order of operations in the instantiation of objects, a subclass's instance field initializers do not run until after the super constructor has finished. If an overridden method called from the super constructor reads such a field, it sees the default value (null for references), and dereferencing it throws a NullPointerException.
Suppose you have a superclass
class Test {
   Test () {
      doSomething();
   }
   void doSomething() {
     ...
   }
 }
and you have a subclass:
class SubTest extends Test {
    Object myObj = new Object();
    @Override
    void doSomething() {
        System.out.println(myObj.toString()); // throws a NullPointerException          
    }
 }
This is an example specific to Java, and while different languages handle this sort of ordering differently, it serves to drive the point home.
edit as an answer to your comment:
Though I would normally shy away from methods in constructors, in this case you have a few options:
In your constructor, set the HTML string as a field in your Class, and parse every time your getters are called. This most likely will not be very efficient.
Set the HTML as a field on your object, and then introduce a dependency on parse(), with it needing to be called either right after the constructor is finished or through some sort of lazy parsing, by adding something like 'ensureParsed()' at the head of your accessors. I don't like this all that much, as you could have the HTML around after you've parsed, and your ensureParsed() call could be coded to set all of your parsed fields, thereby introducing a side effect to your getter.
You could call parse() from your constructor and run the risk of throwing an exception.  As you say, you are setting the fields to initialize the Object, so this is really OK.  With regard to the Exception, stating that there was an illegal argument passed into a constructor is acceptable.  If you do this, you should be careful to ensure that you understand the way that your language handles the creation of Objects as discussed above.  To follow up with the Java example above, you can do this without fear if you ensure that only private methods (and therefore not eligible for overriding by subclasses) are called from within a constructor.
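A small Java sketch of this last option, with hypothetical class names SafeDocument and SubDocument: the constructor validates its argument, then calls a private method, which a subclass cannot override.

```java
// Hypothetical class that parses in its constructor, but only via a private method.
class SafeDocument {
    final String parsed;

    SafeDocument(String html) {
        if (html == null) {
            throw new IllegalArgumentException("html must not be null");
        }
        // Private call: bound to this class, never dispatched to a subclass.
        this.parsed = parse(html);
    }

    // Toy "parsing": normalize the text.
    private String parse(String html) {
        return html.trim().toLowerCase();
    }
}

class SubDocument extends SafeDocument {
    // Still null while the SafeDocument constructor is running.
    private final Object helper = new Object();

    SubDocument(String html) {
        super(html);
    }

    // Not an override: SafeDocument.parse is private, so the super constructor never calls this.
    String parse(String html) {
        return helper.toString();
    }
}
```

Constructing a SubDocument does not hit the null helper field, because the superclass constructor never reaches the subclass method. Had parse() been an overridable method, the same construction would fail in the way the earlier Test/SubTest example describes.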
Misko Hevery has a nice story on this subject, from a unit testing perspective, here.