
How can you parse a document stored in the MARC21 format with Python

Yesterday Harvard released open access to all of its library metadata (some 12 million records).

I wanted to parse the data and play with it, since the stated goal of the release was to "support innovation".

I downloaded the 12GB tarball and unpacked it to find 13 .mrc files of about 800MB each.

MARC21 format

When I looked at the head and tail of the first few files, the data looked very unstructured, even after I had read a bit about MARC21.

Here's what the first 4k of the first file looks like:

$ head -c 4000 ab.bib.00.20120331.full.mrc

00857nam a2200253 a 4500001001200000005001700012008004100029010001700070020001200087035001600099040001800115043001200133050002500145245011100170260004900281300002100330504004100351610006400392650005300456650003500509700003800544988001300582906000800595002000001-420020606093309.7880822s1985    unr      b    000 0 ruso   a   86231326   c0.45rub0 aocm18463285  aDLCcDLCdHLS  ae-ur-un0 aJN6639.A8bK665 198500aInformat︠s︡ii︠a︡ v rabote partiĭnykh komitetov /c[sostavitelʹ Stepan Ivanovich I︠A︡lovega].  aKiev :bIzd-vo polit. lit-ry Ukrainy,c1985.  a206 p. ;c20 cm.  aIncludes bibliographical references.20aKomunistychna partii︠a︡ UkraïnyxInformation services. 0aParty committeeszUkrainexInformation services. 0aInformation serviceszUkraine.1 aI︠A︡lovega, Stepan Ivanovich.  a20020608  0DLC00418nam a22001335u 4500001001200000005001700012008004100029110003000070245004600100260006000146500005800206988001300264906000700277002000002-220020606093309.7900925|1944    mx           |||||||spa|d1 aCampeche (Mexico : State)10aLey del notariado del estado de Campeche.0 a[Campeche]bDepartamento de prensa y publicidad,c1944.  aAt head of title: Gobierno constitucional del estado.  a20020608  0MH00647nam a2200229M  4500001001200000005001700012008004100029010001700070035001600087040001300103041001100116050003600127100004200163245004100205246005600246260001600302300001900318500001500337650004400352988001300396906000800409002000003-020051201172535.0890331s1902    xx       d    000 0 ota    a   73960310 0 aocm23499219  aDLCcEYM0 aotaara0 aPJ6636.T8bU5 1973 (Orien Arab)1 aUnsī, Muḥammad ʻAlī ibn Ḥasan.10aQāmūs al-lughah al-ʻUthmānīyah.3 aDarārī al-lāmiʻāt fī muntakhabāt al-lughāt.  c[1902 1973]  a564 p.c22 cm.  aRomanized. 0aTurkish languagevDictionariesxArabic.  
a20020608  0DLC00878nam a2200253 a 4500001001200000005001700012008004100029010001700070035001600087040001800103043001200121050002300133245012800156246004600284260006300330300004800393500003300441610003200474650005000506700002400556710002300580988001300603906000800616002000004-920020606093309.7880404s1980    yu fa         000 0 scco   a   82167322 0 aocm17880048  aDLCcDLCdHLS  ae-yu---0 aL53.P783bT75 198000aTrideset pet godina Prosvetnog pregleda, 1945-1980 /c[glavni i odgovorni urednik i urednik publikacije Ružica Petrović].3 a35 godina Prosvetnog pregleda, 1945-1980.  aBeograd :bNovinska organizacija Prosvetni pregled,c1980.  a146 p., [21] p. of plates :bill. ;c29 cm.  aIn Serbo-Croatian (Cyrillic)20aProsvetni pregledxHistory. 0aEducationzYugoslaviaxHistoryy20th century.1 aPetrović, Ružica.2 aProsvetni pregled.  a20020608  0DLC00449nam a22001455u 4500001001200000005001700012008004100029245008200070260002800152300001100180440006600191700002600257988001300283906000700296002000005-720020606093309.7900925|1981    pl           |||||||pol|d10aZ zagadnień dialektyki i świadomości społecznej /cpod red. K. Ślęczka.0 aKatowice :bUŚ,c1981.  a135 p. 0aPrace naukowe Uniwersytetu Śląskiego w Katowicach ;vnr 4621 aŚlęczka, Kazimierz.  a20020608  0MH00331nam a22001455u 4500001001200000005001700012008004100029100002200070245002200092250001200114260002800126300001100154988001300165906000700178002000006-520020606093309.7900925|1980    pl           |||||||pol|d1 aMencwel, Andrzej.10aWidziane z dołu.  aWyd. 1.0 aWarszawa :bPIW,c1980.  a166 p.  a20020608  0MH00746cam a2200241 a 4500001001200000005001700012008004100029010001700070020001500087035001600102040001800118050002400136082001600160100001600176245008000192260007100272300002500343504004100368650003400409650004000443988001300483906000800496002000007-300000000000000.0900123s1990    enk      b    001 0 eng    a   90031350   a03910368230 aocm21081069  aDLCcDLCdHBS00aHF5439.8b.O35 199

Has anyone ever had to work with MARC21 before? Does it typically look like this, or do I need to parse it differently?

MattoTodd asked Apr 26 '12 01:04


2 Answers

pymarc is the best option to parse MARC21 records using Python (full disclosure: I'm one of its maintainers). If you're unfamiliar with working with MARC21, it's worth reading through some of the specification you linked to on the Library of Congress website. I'd also read through the Working with MARC page on the Code4lib wiki.
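To see why the dump in the question looks unstructured but is not, it helps to know the record layout described in the MARC21 specification: each record starts with a 24-character leader (the first 5 characters are the record length, characters 12-16 are the offset where field data begins), followed by a directory of 12-character entries (3-character tag, 4-digit field length, 5-digit start offset), followed by the field data itself. The sketch below is a minimal standard-library illustration of that layout, not pymarc's implementation. The names `iter_raw_records`, `parse_record`, and `SAMPLE` are made up for this example, and the code assumes the data is UTF-8, which is not true of every MARC file.

```python
import io

FIELD_END = b"\x1e"   # terminates the directory and every field
SUBFIELD = b"\x1f"    # precedes each subfield code
RECORD_END = b"\x1d"  # terminates every record


def iter_raw_records(fh):
    """Yield the raw bytes of each record from a binary file object."""
    while True:
        head = fh.read(5)  # first 5 leader bytes hold the record length
        if len(head) < 5:
            return
        yield head + fh.read(int(head) - 5)


def parse_record(raw):
    """Return (leader, fields) for one raw record.

    Control fields (tags 00X) become (tag, text); all other fields
    become (tag, (indicators, [(subfield_code, text), ...])).
    """
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])        # offset where field data starts
    directory = raw[24:base - 1]     # drop the directory terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3].decode("ascii")
        size = int(entry[3:7])
        start = int(entry[7:12])
        data = raw[base + start:base + start + size].rstrip(FIELD_END)
        if tag.startswith("00"):
            fields.append((tag, data.decode("utf-8", "replace")))
            continue
        indicators = data[:2].decode("ascii", "replace")
        subfields = [
            (chunk[:1].decode("ascii", "replace"),
             chunk[1:].decode("utf-8", "replace"))  # assumption: UTF-8 data
            for chunk in data[2:].split(SUBFIELD)[1:]
            if chunk
        ]
        fields.append((tag, (indicators, subfields)))
    return leader, fields


# A tiny hand-built record (hypothetical data) with one control field
# and one title field, used here instead of a real .mrc file.
SAMPLE = (
    b"00064nam a2200049 a 4500"
    b"001000400000" b"245001000004" + FIELD_END
    + b"123" + FIELD_END
    + b"10" + SUBFIELD + b"aTitle" + FIELD_END
    + RECORD_END
)

for raw in iter_raw_records(io.BytesIO(SAMPLE)):
    leader, fields = parse_record(raw)
    for tag, value in fields:
        print(tag, value)
```

Because `iter_raw_records` reads one record at a time, the same loop should work on a file opened with `open(path, "rb")` without loading an 800MB file into memory. For real work, pymarc handles character encodings and malformed records that this sketch ignores.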

anarchivist answered Nov 14 '22 23:11


You may want to check out pymarc.

dpp answered Nov 14 '22 23:11