I'm writing a microformats parser in C# and am looking for some refactoring advice. This is probably the first "real" project I've attempted in C# for some time (I program almost exclusively in VB6 at my day job), so I have the feeling this question may become the first in a series ;-)
Let me provide some background about what I have so far, so that my question will (hopefully) make sense.
Right now, I have a single class, MicroformatsParser
, doing all the work. It has an overloaded constructor that lets you pass a System.Uri
or a string
containing a URI: upon construction, it downloads the HTML document at the given URI and loads it into an HtmlAgilityPack.HtmlDocument
for easy manipulation by the class.
The basic API works like this (or will, once I finish the code...):
MicroformatsParser mp = new MicroformatsParser("http://microformats.org");
List<HCard> hcards = mp.GetAll<HCard>();
foreach(HCard hcard in hcards)
{
Console.WriteLine("Full Name: {0}", hcard.FullName);
foreach(string email in hcard.EmailAddresses)
Console.WriteLine("E-Mail Address: {0}", email);
}
The use of generics here is intentional. I got my inspiration from the way that the the Microformats library in Firefox 3 works (and the Ruby mofo
gem). The idea here is that the parser does the heavy lifting (finding the actual microformat content in the HTML), and the microformat classes themselves (HCard
in the above example) basically provide the schema that tells the parser how to handle the data it finds.
The code for the HCard
class should make this clearer (note this is a not a complete implementation):
[ContainerName("vcard")]
public class HCard
{
[PropertyName("fn")]
public string FullName;
[PropertyName("email")]
public List<string> EmailAddresses;
[PropertyName("adr")]
public List<Address> Addresses;
public HCard()
{
}
}
The attributes here are used by the parser to determine how to populate an instance of the class with data from an HTML document. The parser does the following when you call GetAll<T>()
:
T
has a ContainerName
attribute (and it's not blank)class
attribute that matches the ContainerName
. Call these the "container nodes".T
.MemberInfo[]
) for type T
via reflectionMemberInfo
PropertyName
attribute
T
created in the first step)T
to a List<T>
List<T>
, which now contains a bunch of microformat objectsI'm trying to figure out a better way to implement the step in bold. The problem is that the Type
of a given field in the microformat class determines not only what node to look for in the HTML, but also how to interpret the data.
For example, going back to the HCard
class I defined above, the "email"
property is bound to the EmailAddresses
field, which is a List<string>
. After the parser finds all the "email"
child nodes of the parent "vcard"
node in the HTML, it has to put them in a List<string>
.
What's more, if I want my HCard
to be able to return phone number information, I would probably want to be able to declare a new field of type List<HCard.TelephoneNumber>
(which would have its own ContainerName("tel")
attribute) to hold that information, because there can be multiple "tel"
elements in the HTML, and the "tel"
format has its own sub-properties. But now the parser needs to know how to put the telephone data into a List<HCard.TelephoneNumber>
.
The same problem applies to Float
S, DateTime
S, List<Float>
S, List<Integer>
S, etc.
The obvious answer is to have the parser switch on the type of field, and do the appropriate conversions for each case, but I want to avoid a giant switch
statement. Note that I'm not planning to make the parser support every possible Type
in existence, but I will want it to handle most scalar types, and the List<T>
versions of them, along with the ability to recognize other microformat classes (so that a microformat class can be composed from other microformat classes).
Any advice on how best to handle this?
Since the parser has to handle primitive data types, I don't think I can add polymorphism at the type level...
My first thought was to use method overloading, so I would have a series of a GetPropValue
overloads like GetPropValue(HtmlNode node, ref string retrievedValue)
, GetPropValue(HtmlNode, ref List<Float> retrievedValue)
, etc. but I'm wondering if there is a better approach to this problem.
Mehrdad's approach is basically the one I'd suggest to start with, but as the first step out of potentially more.
You can use a simple IDictionary<Type,Delegate>
(where each entry is actually from T
to Func<ParseContext,T>
- but that can't be expressed with generics) for single types (strings, primitives etc) but then you'll also want to check for lists, maps etc. You won't be able to do this using the map, because you'd have to have an entry for each type of list (i.e. a separate entry for List<string>
, List<int>
etc). Generics make this quite tricky - if you're happy to restrict yourself to just certain concrete types such as List<T>
you'll make it easier for yourself (but less flexible). For instance, detecting List<T>
is straightforward:
if (type.IsGenericType && type.GetGenericTypeDefinition() == typeof(List<>))
{
// Handle lists
// use type.GetGenericArguments() to work out the element type
}
Detect whether a type implements IList<T>
for some T
(and then discovering T
) can be a pain, especially as there could be multiple implementations, and the concrete type itself may or may not be generic. This effort could be worthwhile if you really need a very flexible library used by thousands of developers - but otherwise I'd keep it simple.
Instead of a large switch statement you can construct a dictionary that maps the string to a delegate and look it up when you want to parse using the appropriate method.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With