Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse Gedcom to SQLite-Database

I am a Hobby Xojo-User. I wanna import a Gedcom-File to my Program, espacially to a SQLite-Database.

Structure of the Database

Tables

Persons

 - ID: Integer
 - Gender: Varchar // M, F or U
 - Surname: Varchar
 - Givenname: Varchar

Relationships

 - ID: Integer
 - Husband: Integer
 - Wife: Integer

Children

 - ID: Integer
 - PersonID: Integer
 - FamilyID: Integer
 - Order: Integer

PersonEvents

 - ID: Integer
 - PersonID: Integer
 - EventType: Varchar // e.g. BIRT, DEAT, BURI, CHR
 - Date: Varchar
 - Description: Varchar
 - Order: Integer

RelationshipEvents

 - ID: Integer
 - RelationshipID: Integer
 - EventType: Varchar // e.g. MARR, DIV, DIVF
 - Date: Varchar
 - Description: Integer
 - Order: Integer

I wrote a working Gedcom-Line-Parser. He splits a single Gedcomline into:

 - Level As Integer
 - Reference As String // optional
 - Tag As String
 - Value As String // optional

I load the Gedcom-File via TextInputStream (working fine). No i need to parse every Line.

Gedcom-Individual-Sample

0 @I1@ INDI
1 NAME George /Clooney/
2 GIVN George
2 SURN Clooney
1 BIRT
2 DATE 6 MAY 1961
2 PLAC Lexington, Fayette County, Kentucky, USA

You'll see, the Level-Numbers shows us a "Tree-Structure". So i thought it would be the best and simplest way to parse the File into separated Objects (PersonObj, RelationshipObj, EventObj etc.) into a JSONItem, because there its easy to get the Childs of a Node. Later on, i can simple read the Nodes, Child-Nodes to create the Database-Entries. But i don't know how to create such an Algorithm.

Can anyone help my please?

like image 753
Genealogy Avatar asked Oct 19 '22 03:10

Genealogy


1 Answers

To parse the Gedcom lines with a good speed, try these ideas:

Read the entire file into a String and split the lines up:

dim f as FolderItem = ...
dim fileContent as String = TextInputStream.Open(f).ReadAll
fileContent = fileContent.DefineEncoding (Encodings.WindowsLatin1)
dim lines() as String = ReplaceLineEndings(fileContent,EndOfLine).Split(EndOfLine)

Parse every line using RegEx to extract its 3 columns

dim re as new RegEx
re.SearchPattern = "^(\d+) ([^ ]+)(.*)$"
for each line as String in lines
  dim rm as RegExMatch = re.Search (line)
  if rm = nil then
    // nothing found in this line. Is this correct?
    break
    continue // -> onward with next line
  end
  dim level as Integer = rm.SubExpressionString(1).Val
  dim code as String = rm.SubExpressionString(2)
  dim value as String = rm.SubExpressionString(3).Trim
  ... process the level, code and value
next

The RegEx search pattern means that it looks for the start of the line ("^"), then for one or more digits ("\d"), a blank, one or more non-blank chars ("[^ ]"), and finally any more chars (".") before the end of the string ("$"). The parentheses around each of these groups is for extracting their results with SubExpression() then.

The check for rm = nil hits whenever the line does not contain at least a number, a blank and at least one more character. If the Gedcom file is malformed or has blank lines, this may be the case.

Hope this helps.

like image 53
Thomas Tempelmann Avatar answered Oct 22 '22 00:10

Thomas Tempelmann