Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How would you refactor a "switch on Type" to be polymorphic if you don't control the types involved?

I'm writing a microformats parser in C# and am looking for some refactoring advice. This is probably the first "real" project I've attempted in C# for some time (I program almost exclusively in VB6 at my day job), so I have the feeling this question may become the first in a series ;-)

Let me provide some background about what I have so far, so that my question will (hopefully) make sense.

Right now, I have a single class, MicroformatsParser, doing all the work. It has an overloaded constructor that lets you pass a System.Uri or a string containing a URI: upon construction, it downloads the HTML document at the given URI and loads it into an HtmlAgilityPack.HtmlDocument for easy manipulation by the class.

The basic API works like this (or will, once I finish the code...):

MicroformatsParser mp = new MicroformatsParser("http://microformats.org");
List<HCard> hcards = mp.GetAll<HCard>();

foreach(HCard hcard in hcards) 
{
    Console.WriteLine("Full Name: {0}", hcard.FullName);

    foreach(string email in hcard.EmailAddresses)
        Console.WriteLine("E-Mail Address: {0}", email);
}

The use of generics here is intentional. I got my inspiration from the way that the the Microformats library in Firefox 3 works (and the Ruby mofo gem). The idea here is that the parser does the heavy lifting (finding the actual microformat content in the HTML), and the microformat classes themselves (HCard in the above example) basically provide the schema that tells the parser how to handle the data it finds.

The code for the HCard class should make this clearer (note this is a not a complete implementation):

[ContainerName("vcard")]
public class HCard
{
    [PropertyName("fn")]
    public string FullName;

    [PropertyName("email")]
    public List<string> EmailAddresses;

    [PropertyName("adr")]
    public List<Address> Addresses;

    public HCard()
    { 
    }
}

The attributes here are used by the parser to determine how to populate an instance of the class with data from an HTML document. The parser does the following when you call GetAll<T>():

  • Checks that the type T has a ContainerName attribute (and it's not blank)
  • Searches the HTML document for all nodes with a class attribute that matches the ContainerName. Call these the "container nodes".
  • For each container node:
    • Uses reflection to create an object of type T.
    • Get the public fields (a MemberInfo[]) for type T via reflection
    • For each field's MemberInfo
      • If the field has a PropertyName attribute
        • Get the value of the corresponding microformat property from the HTML
        • Inject the value found in the HTML into the field (i.e. set the value of the field on the object of type T created in the first step)
        • Add the object of type T to a List<T>
    • Return the List<T>, which now contains a bunch of microformat objects

I'm trying to figure out a better way to implement the step in bold. The problem is that the Type of a given field in the microformat class determines not only what node to look for in the HTML, but also how to interpret the data.

For example, going back to the HCard class I defined above, the "email" property is bound to the EmailAddresses field, which is a List<string>. After the parser finds all the "email" child nodes of the parent "vcard" node in the HTML, it has to put them in a List<string>.

What's more, if I want my HCard to be able to return phone number information, I would probably want to be able to declare a new field of type List<HCard.TelephoneNumber> (which would have its own ContainerName("tel") attribute) to hold that information, because there can be multiple "tel" elements in the HTML, and the "tel" format has its own sub-properties. But now the parser needs to know how to put the telephone data into a List<HCard.TelephoneNumber>.

The same problem applies to FloatS, DateTimeS, List<Float>S, List<Integer>S, etc.

The obvious answer is to have the parser switch on the type of field, and do the appropriate conversions for each case, but I want to avoid a giant switch statement. Note that I'm not planning to make the parser support every possible Type in existence, but I will want it to handle most scalar types, and the List<T> versions of them, along with the ability to recognize other microformat classes (so that a microformat class can be composed from other microformat classes).

Any advice on how best to handle this?

Since the parser has to handle primitive data types, I don't think I can add polymorphism at the type level...

My first thought was to use method overloading, so I would have a series of a GetPropValue overloads like GetPropValue(HtmlNode node, ref string retrievedValue), GetPropValue(HtmlNode, ref List<Float> retrievedValue), etc. but I'm wondering if there is a better approach to this problem.

like image 728
Mike Spross Avatar asked Feb 16 '09 06:02

Mike Spross


2 Answers

Mehrdad's approach is basically the one I'd suggest to start with, but as the first step out of potentially more.

You can use a simple IDictionary<Type,Delegate> (where each entry is actually from T to Func<ParseContext,T> - but that can't be expressed with generics) for single types (strings, primitives etc) but then you'll also want to check for lists, maps etc. You won't be able to do this using the map, because you'd have to have an entry for each type of list (i.e. a separate entry for List<string>, List<int> etc). Generics make this quite tricky - if you're happy to restrict yourself to just certain concrete types such as List<T> you'll make it easier for yourself (but less flexible). For instance, detecting List<T> is straightforward:

if (type.IsGenericType && type.GetGenericTypeDefinition() == typeof(List<>))
{
    // Handle lists
    // use type.GetGenericArguments() to work out the element type
}

Detect whether a type implements IList<T> for some T (and then discovering T) can be a pain, especially as there could be multiple implementations, and the concrete type itself may or may not be generic. This effort could be worthwhile if you really need a very flexible library used by thousands of developers - but otherwise I'd keep it simple.

like image 174
Jon Skeet Avatar answered Nov 02 '22 23:11

Jon Skeet


Instead of a large switch statement you can construct a dictionary that maps the string to a delegate and look it up when you want to parse using the appropriate method.

like image 22
mmx Avatar answered Nov 02 '22 23:11

mmx