
How to Represent Conjugation Tables in C#

I'm designing a linguistic analyzer for French text. I have a dictionary in XML format that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<Dictionary>
  <!--This is the base structure for every entry in the dictionary. Values on attributes are given
    as explanations for the attributes. Though this is the structure of the finished product for each word, definition, context and context examples will be omitted as they don't have a real effect on the application at this moment. Defini-->
  <Word word="The word in the dictionary (any word that would be defined)." aspirate="Whether or not the word starts with an aspirate h. Some adjectives that come before words that start with a non-aspirate h have an extra form (AdjectiveForms -&gt; na [non-aspirate]).">
      <GrammaticalForm form="The grammatical form of the word is the grammatical context in which it is used. Forms may consist of a word in noun, adjective, adverb, exclamatory or other form. Each form (generally) has its own definition, as the meaning of the word changes in the way it is used.">
        <Definition definition=""></Definition>
      </GrammaticalForm>
    <ConjugationTables>
      <NounForms ms="The masculin singular form of the noun." fs="The feminin singular form of the noun." mpl="The masculin plural form of the noun." fpl="The feminin plural form of the noun." gender="The gender of the noun. Determines"></NounForms>
      <AdjectiveForms ms="The masculin singular form of the adjective." fs="The feminin singular form of the adjective." mpl="The masculin plural form of the adjective." fpl="The feminin plural form of the adjective." na="The non-aspirate form of the adjective, in the case where the adjective is followed by a non-aspirate word." location="Where the adjective is placed around the noun (before, after, or both)."></AdjectiveForms>
      <VerbForms group="What group the verb belongs to (1st, 2nd, 3rd or exception)." auxillary="The auxillary verb taken by the verb." prepositions="A CSV list of valid prepositions this verb uses; for grammatical analysis." transitive="Whether or not the verb is transitive." pronominal="The pronominal infinitive form of the verb, if the verb allows pronominal construction.">
        <Indicative>
          <Present fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></Present>
          <SimplePast fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></SimplePast>
          <PresentPerfect fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></PresentPerfect>
          <PastPerfect fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></PastPerfect>
          <Imperfect fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></Imperfect>
          <Pluperfect fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></Pluperfect>
          <Future fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></Future>
          <PastFuture fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></PastFuture>
        </Indicative>
        <Subjunctive>
          <Present fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></Present>
          <Past fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></Past>
          <Imperfect fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></Imperfect>
          <Pluperfect fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></Pluperfect>
        </Subjunctive>
        <Conditional>
          <Present fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></Present>
          <FirstPast fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></FirstPast>
          <SecondPast fps="(Je) first person singular." sps="(Tu) second person singular." tps="(Il) third person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural." tpp="(Ils) third person plural."></SecondPast>
        </Conditional>
        <Imperative>
          <Present sps="(Tu) second person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural."></Present>
          <Past sps="(Tu) second person singular." fpp="(Nous) first person plural." spp="(Vous) second person plural."></Past>
        </Imperative>
        <Infinitive present="The present infinitive form of the verb." past="The past infinitive form of the verb."></Infinitive>
        <Participle present="The present participle of the verb." past="The past participle of the verb."></Participle>
      </VerbForms>
    </ConjugationTables>
  </Word>
</Dictionary>

Sorry it's so long, but it's necessary to show exactly how the data is modeled (tree-node structure).

Currently I am using structs to model the conjugation tables, nested structs to be more specific. Here is the class I created to model what is a single entry in the XML file:

class Word
{
    public string word { get; set; }
    public bool aspirate { get; set; }
    public List<GrammaticalForms> forms { get; set; }

    struct GrammaticalForms
    {
        public string form { get; set; }
        public string definition { get; set; }
    }

    struct NounForms
    {
        public string gender { get; set; }
        public string masculinSingular { get; set; }
        public string femininSingular { get; set; }
        public string masculinPlural { get; set; }
        public string femininPlural { get; set; }
    }

    struct AdjectiveForms
    {
        public string masculinSingular { get; set; }
        public string femininSingular { get; set; }
        public string masculinPlural { get; set; }
        public string femininPlural { get; set; }
        public string nonAspirate { get; set; }
        public string location { get; set; }
    }

    struct VerbForms
    {
        public string group { get; set; }
        public string auxillary { get; set; }
        public string[] prepositions { get; set; }
        public bool transitive { get; set; }
        public string pronominalForm { get; set; }

        struct IndicativePresent
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct IndicativeSimplePast
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct IndicativePresentPerfect
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct IndicativePastPerfect
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct IndicativeImperfect
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct IndicativePluperfect
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct IndicativeFuture
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct IndicativePastFuture
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }

        struct SubjunctivePresent
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct SubjunctivePast
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct SubjunctiveImperfect
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct SubjunctivePluperfect
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }

        struct ConditionalPresent
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct ConditionalFirstPast
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }
        struct ConditionalSecondPast
        {
            public string firstPersonSingular { get; set; }
            public string secondPersonSingular { get; set; }
            public string thirdPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
            public string thirdPersonPlural { get; set; }
        }

        struct ImperativePresent
        {
            public string secondPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
        }
        struct ImperativePast
        {
            public string secondPersonSingular { get; set; }
            public string firstPersonPlural { get; set; }
            public string secondPersonPlural { get; set; }
        }

        struct Infinitive
        {
            public string present { get; set; }
            public string past { get; set; }
        }

        struct Participle
        {
            public string present { get; set; }
            public string past { get; set; }
        }
    }
}

I'm new to C#, and I'm not too familiar with the data structures. Based on my limited knowledge of C++, I know that structs are useful when you are modeling small, highly-related pieces of data, which is why I am currently using them in this fashion.

All of these structs could realistically be made into a ConjugationTables class, and would have, to a high degree, the same structure. I'm unsure of whether to make these into a class, or use a different data structure that would be better suited to the problem. In order to give some more information about the problem specifications, I'll say the following:

  1. Once these values have been loaded from the XML file, they will not be changed.
  2. These values will be read/fetched very often.
  3. The table-like structure must be maintained - that is to say that IndicativePresent must be nested under VerbForms; the same applies to all other structs that are members of the VerbForms struct. These are conjugation tables after all!
  4. Perhaps the most important: I need the organization of the data to be set up in a way that, if for example a Word in the XML file does not have a GrammaticalForm of verb, that no VerbForms struct will actually be created for that entry. This is in an effort to improve efficiency - why instantiate VerbForms if the word is not actually a verb? This idea of avoiding unnecessary creation of these "forms" tables (which are currently represented as struct XXXXXForms) is absolutely imperative.

In accordance with (primarily) point #4 above, what kinds of data structures would be best used in modeling conjugation tables (not database tables)? Do I need to change the format of my data in order to be compliant with #4? If I instantiate a new Word, will the structs, in their current state, be instantiated as well and take up a lot of space? Here's some math... after Googling around and eventually finding this question...

In all the conjugation tables (nouns, adjectives, verbs), there are a total of (coincidence?) 100 strings allocated, all of them empty. So 100 x 18 bytes = 1800 bytes for each Word, at minimum, if these data structures are created and remain empty (there will always be at least some overhead for the values that would actually be filled in). So assuming (just randomly, could be more or less) 50,000 Words that would need to be in memory, that's 90 million bytes, or approximately 85.8307 megabytes.

That's a lot of overhead just to have empty tables. So what is a way that I can put this data together to allow me to instantiate only certain tables (noun, adjective, verb) depending on what GrammaticalForms the Word actually has (in the XML file)?

I want these tables to be a member of the Word class, but only instantiate the tables that I need. I can't think of a way around it, and now that I did the math on the structs I know that it's not a good solution. My first thought is to make a class for each type of NounForms, AdjectiveForms, and VerbForms, and instantiate the class if the form appears in the XML file. I'm not sure if that is correct though...

Any suggestions?

asked Feb 01 '14 22:02 by Chris Cirefice


3 Answers

Suggestions:

  1. I'd switch to using class.
  2. I'd name the class in the singular and express cardinality through the property name.
  3. I'd remove the nesting from the class definitions.
  4. It looks like you can introduce an abstract ConjugationForm class with multiple subclasses (NounForm, AdjectiveForm, an abstract VerbForm, and then all the VerbForm subclasses: IndicativePresent, IndicativeSimplePast, etc.). The Word class then holds a collection of ConjugationForm instances, maybe List<ConjugationForm> conjugationTable { get; set; }.
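
A minimal sketch of how suggestion 4 could look. Only the names ConjugationForm, NounForm, VerbForm, IndicativePresent, IndicativeSimplePast and Word come from the text above; the property names and the trimmed property lists are assumptions made for illustration, not a definitive design.

```csharp
using System.Collections.Generic;

// Common base type for every table that can be attached to a word.
public abstract class ConjugationForm { }

public sealed class NounForm : ConjugationForm
{
    public string Gender { get; set; }
    public string MasculineSingular { get; set; }
    public string MasculinePlural { get; set; }
    // Feminine forms omitted here for brevity.
}

// The six person/number slots are declared once and shared by every tense.
public abstract class VerbForm : ConjugationForm
{
    public string FirstPersonSingular { get; set; }
    public string SecondPersonSingular { get; set; }
    public string ThirdPersonSingular { get; set; }
    public string FirstPersonPlural { get; set; }
    public string SecondPersonPlural { get; set; }
    public string ThirdPersonPlural { get; set; }
}

public sealed class IndicativePresent : VerbForm { }
public sealed class IndicativeSimplePast : VerbForm { }

public class Word
{
    public string Text { get; set; }
    public bool Aspirate { get; set; }

    // Starts empty; a table object exists only if the loader adds one.
    public List<ConjugationForm> ConjugationTable { get; } = new List<ConjugationForm>();
}
```

With this shape, a word that is only a noun carries a single NounForm object and no verb tables at all, because nothing is created until the loader adds it to the list.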

As for the memory / GC pressure, have you actually measured how bad the situation is? I'd recommend coding up something and actually testing to verify whether you're going to experience GC problems, rather than trying to guess or make an assumption. The GC is pretty well optimized to allocate and de-allocate a lot of small objects, but long lived and/or large objects might cause you problems.
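
One rough way to do that measurement is to compare the managed heap size before and after building the data. The helper below is a sketch: MemoryProbe and MeasureRetainedBytes are hypothetical names, and GC.GetTotalMemory only gives an approximation, so treat the result as an estimate rather than an exact figure.

```csharp
using System;

public static class MemoryProbe
{
    // Returns the approximate number of managed-heap bytes still in use
    // after build() has run. The built object is handed back through
    // 'result' so the caller can keep it alive while reading the number.
    public static long MeasureRetainedBytes<T>(Func<T> build, out T result)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);
        result = build();
        long after = GC.GetTotalMemory(forceFullCollection: true);
        return after - before;
    }
}
```

Calling it with a delegate that loads the 50,000 entries (or a synthetic stand-in for them) gives a number to compare against the estimate in the question. GC.CollectionCount(2) can be read before and after in the same way to see how many generation 2 collections the load triggered.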

answered Nov 05 '22 10:11 by Gordon


I would change to using classes since everything in your data is a string and you have so much nesting. Having structs doesn't buy you anything for what you are trying to do.

Also, I would recommend creating a base class for each hierarchy and either using a single class for each level (with some indication of what type it is) or using inheritance to create each class. A single base class reduces the amount of code, making it more maintainable and much easier to read.

Example: instead of having all of those structs under your VerbForms, you could have something like this:

enum VerbConjugationType
{
    IndicativePresent,
    IndicativeSimplePast,
    // ... (remaining mood/tense combinations)
}

class VerbConjugation
{
    public VerbConjugationType ConjugationType { get; set; }
    public string FirstPersonSingular { get; set; }
    public string SecondPersonSingular { get; set; }
    public string ThirdPersonSingular { get; set; }
    public string FirstPersonPlural { get; set; }
    public string SecondPersonPlural { get; set; }
    public string ThirdPersonPlural { get; set; }
}
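
To keep the table-like lookup the question asks for, the conjugations could be stored in a dictionary keyed by the enum. This is a sketch under assumptions: the VerbForms container and its Add and Find methods are hypothetical, and the enum and class are repeated in trimmed form only so the snippet compiles on its own.

```csharp
using System.Collections.Generic;

public enum VerbConjugationType
{
    IndicativePresent,
    IndicativeSimplePast
    // Remaining mood/tense combinations omitted.
}

public class VerbConjugation
{
    public VerbConjugationType ConjugationType { get; set; }
    public string FirstPersonSingular { get; set; }
    // The other five person/number properties are omitted for brevity.
}

// Hypothetical container: one entry per tense that the XML actually defines.
public class VerbForms
{
    private readonly Dictionary<VerbConjugationType, VerbConjugation> tables =
        new Dictionary<VerbConjugationType, VerbConjugation>();

    public void Add(VerbConjugation conjugation) =>
        tables[conjugation.ConjugationType] = conjugation;

    // Returns null when the verb has no table for the requested tense.
    public VerbConjugation Find(VerbConjugationType type) =>
        tables.TryGetValue(type, out var conjugation) ? conjugation : null;
}
```

A lookup such as forms.Find(VerbConjugationType.IndicativePresent) then replaces the fifteen separate struct types, and tenses missing from the XML simply have no entry.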

answered Nov 05 '22 10:11 by brcpar


First, I'll clear up a misunderstanding that might exist:

struct Outer {
 struct Inner {
  int X;
 }
}

Outer o;

This does not allocate any storage because Inner is never used.

struct Outer {
 struct Inner {
  int X;
 }
 Inner i;
}

Outer o;

This allocates 4 bytes of storage (generally). The nesting of structs changes nothing. It is purely an organisational instrument.

Most of the struct types in your data structures are never used as the type of a field, which is why I cannot quite see what you intended. Maybe you misunderstood how struct instances are created because you are coming from a C++ background.


"I need the organization of the data to be set up in a way that, if for example a Word in the XML file does not have a GrammaticalForm of verb, that no VerbForms struct will actually be created for that entry."

For that you'd need VerbForms to be a class so that it always lives on the heap and the object reference to it can be null.
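
A minimal sketch of that idea, with class and property names chosen for illustration (Nouns, Verbs and IsVerb are assumptions, and the table classes are cut down to one property each):

```csharp
public class NounForms
{
    public string MasculineSingular { get; set; }
}

public class VerbForms
{
    public string Group { get; set; }
}

public class Word
{
    public string Text { get; set; }

    // Each property is just a reference. It stays null, and no table object
    // is allocated, unless the loader assigns one for this entry.
    public NounForms Nouns { get; set; }
    public VerbForms Verbs { get; set; }

    public bool IsVerb => Verbs != null;
}
```

The loader would assign Verbs only when the entry has a GrammaticalForm of verb; every other Word pays for one null reference per table type and nothing more.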

"If I instantiate a new Word, will the structs, in their current state, be instantiated as well and take up a lot of space?"

Yes, once Word actually has fields of those struct types (see above). Struct-typed fields are inlined into the runtime layout of the containing object, and all fields have constant offsets. Making the fields nullable would not change that; it would only add a boolean to each one.
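
The inlining can be observed by comparing sizes. This sketch assumes a runtime where System.Runtime.CompilerServices.Unsafe is available (built into .NET 5 and later, a separate package on .NET Framework); LayoutDemo is a hypothetical helper, and the exact byte counts depend on the runtime, so none are stated here.

```csharp
using System.Runtime.CompilerServices;

public struct Outer
{
    public struct Inner
    {
        public int X;
    }

    // Declaring this field is what makes Outer contain an Inner.
    // The nested type declaration by itself adds no storage.
    public Inner I;
}

public static class LayoutDemo
{
    public static int InnerSize => Unsafe.SizeOf<Outer.Inner>();

    // Outer embeds Inner directly, so it is at least as large as Inner.
    public static int OuterSize => Unsafe.SizeOf<Outer>();

    // Nullable<Inner> carries an extra has-value flag, so it is larger.
    public static int NullableInnerSize => Unsafe.SizeOf<Outer.Inner?>();
}
```

Printing the three values shows that a nullable struct field still reserves space for the whole struct plus the flag, which is why nullable structs do not give the "not created at all" behaviour asked for in the question.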

"So what is a way that I can put this data together to allow me to instantiate only certain tables (noun, adjective, verb) depending on what GrammaticalForms the Word actually has (in the XML file)?"

Probably just put them on the heap, i.e. make them a class.

You seem to be worried about load performance, not about memory usage. In that sense all your optimizations do not really help. Reading and parsing XML is by far more expensive than creating objects from it. Also, you will have big GC costs.

The cost of a G2 collection is proportional to the number of references, objects and heap size (in approximation). You will produce tons of garbage and long-lived objects. You will have tons of G2 collections.

If you insist on more efficient storage, you could have a global array for each of the value types you are using. When you want to reference an instance you store its index in the global array. This reduces the number of heap-allocated objects and makes the pointer structure faster to traverse for the GC. See http://marcgravell.blogspot.de/2011/10/assault-by-gc.html

So instead of

class VerbForms...
..
public VerbForms VerbForms { get; set; } //nullable reference to heap object

you have

struct VerbForms...
...
List<VerbForms> verbFormsGlobal = new List<VerbForms>(); //stores *all* VerbForms
...
public int? VerbFormsIndex { get; set; } //no reference, no GC cost
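
A fuller sketch of the index-based variant, under assumptions: VerbFormsStore with its Add and Get methods is a hypothetical helper, the struct is trimmed to two fields, and the store is not thread-safe. The strings inside the struct are still heap objects; what this saves is one heap object and one reference per verb table.

```csharp
using System.Collections.Generic;

public struct VerbForms
{
    public string Group;
    public string Auxiliary;
}

public static class VerbFormsStore
{
    // One shared list holds every VerbForms value by value.
    private static readonly List<VerbForms> All = new List<VerbForms>();

    // Stores the value and returns the index that identifies it.
    public static int Add(VerbForms forms)
    {
        All.Add(forms);
        return All.Count - 1;
    }

    // Returns a copy of the stored value.
    public static VerbForms Get(int index) => All[index];
}

public class Word
{
    public string Text { get; set; }

    // Null when the word is not a verb; otherwise an index into the store.
    public int? VerbFormsIndex { get; set; }
}
```

Because the values are never changed after loading (point 1 in the question), the fact that Get returns a copy is harmless here.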

answered Nov 05 '22 11:11 by usr