Yesterday Harvard released open access to all of its library metadata (some 12 million records).
I wanted to parse the data and play with it, since the goal of the release was to "support innovation".
I downloaded the 12GB tarball and unpacked it to find 13 .mrc files, about 800MB each.
They are in MARC21 format.
When I looked at the head and tail of the first few files, the data looked very unstructured, even after reading a bit about MARC21.
Here's what the first 4k of the first file looks like:
$ head -c 4000 ab.bib.00.20120331.full.mrc
00857nam a2200253 a 4500001001200000005001700012008004100029010001700070020001200087035001600099040001800115043001200133050002500145245011100170260004900281300002100330504004100351610006400392650005300456650003500509700003800544988001300582906000800595002000001-420020606093309.7880822s1985 unr b 000 0 ruso a 86231326 c0.45rub0 aocm18463285 aDLCcDLCdHLS ae-ur-un0 aJN6639.A8bK665 198500aInformat︠s︡ii︠a︡ v rabote partiĭnykh komitetov /c[sostavitelʹ Stepan Ivanovich I︠A︡lovega]. aKiev :bIzd-vo polit. lit-ry Ukrainy,c1985. a206 p. ;c20 cm. aIncludes bibliographical references.20aKomunistychna partii︠a︡ UkraïnyxInformation services. 0aParty committeeszUkrainexInformation services. 0aInformation serviceszUkraine.1 aI︠A︡lovega, Stepan Ivanovich. a20020608 0DLC00418nam a22001335u 4500001001200000005001700012008004100029110003000070245004600100260006000146500005800206988001300264906000700277002000002-220020606093309.7900925|1944 mx |||||||spa|d1 aCampeche (Mexico : State)10aLey del notariado del estado de Campeche.0 a[Campeche]bDepartamento de prensa y publicidad,c1944. aAt head of title: Gobierno constitucional del estado. a20020608 0MH00647nam a2200229M 4500001001200000005001700012008004100029010001700070035001600087040001300103041001100116050003600127100004200163245004100205246005600246260001600302300001900318500001500337650004400352988001300396906000800409002000003-020051201172535.0890331s1902 xx d 000 0 ota a 73960310 0 aocm23499219 aDLCcEYM0 aotaara0 aPJ6636.T8bU5 1973 (Orien Arab)1 aUnsī, Muḥammad ʻAlī ibn Ḥasan.10aQāmūs al-lughah al-ʻUthmānīyah.3 aDarārī al-lāmiʻāt fī muntakhabāt al-lughāt. c[1902 1973] a564 p.c22 cm. aRomanized. 0aTurkish languagevDictionariesxArabic. 
a20020608 0DLC00878nam a2200253 a 4500001001200000005001700012008004100029010001700070035001600087040001800103043001200121050002300133245012800156246004600284260006300330300004800393500003300441610003200474650005000506700002400556710002300580988001300603906000800616002000004-920020606093309.7880404s1980 yu fa 000 0 scco a 82167322 0 aocm17880048 aDLCcDLCdHLS ae-yu---0 aL53.P783bT75 198000aTrideset pet godina Prosvetnog pregleda, 1945-1980 /c[glavni i odgovorni urednik i urednik publikacije Ružica Petrović].3 a35 godina Prosvetnog pregleda, 1945-1980. aBeograd :bNovinska organizacija Prosvetni pregled,c1980. a146 p., [21] p. of plates :bill. ;c29 cm. aIn Serbo-Croatian (Cyrillic)20aProsvetni pregledxHistory. 0aEducationzYugoslaviaxHistoryy20th century.1 aPetrović, Ružica.2 aProsvetni pregled. a20020608 0DLC00449nam a22001455u 4500001001200000005001700012008004100029245008200070260002800152300001100180440006600191700002600257988001300283906000700296002000005-720020606093309.7900925|1981 pl |||||||pol|d10aZ zagadnień dialektyki i świadomości społecznej /cpod red. K. Ślęczka.0 aKatowice :bUŚ,c1981. a135 p. 0aPrace naukowe Uniwersytetu Śląskiego w Katowicach ;vnr 4621 aŚlęczka, Kazimierz. a20020608 0MH00331nam a22001455u 4500001001200000005001700012008004100029100002200070245002200092250001200114260002800126300001100154988001300165906000700178002000006-520020606093309.7900925|1980 pl |||||||pol|d1 aMencwel, Andrzej.10aWidziane z dołu. aWyd. 1.0 aWarszawa :bPIW,c1980. a166 p. a20020608 0MH00746cam a2200241 a 4500001001200000005001700012008004100029010001700070020001500087035001600102040001800118050002400136082001600160100001600176245008000192260007100272300002500343504004100368650003400409650004000443988001300483906000800496002000007-300000000000000.0900123s1990 enk b 001 0 eng a 90031350 a03910368230 aocm21081069 aDLCcDLCdHBS00aHF5439.8b.O35 199
Has anyone ever had to work with MARC21 before? Does it typically look like this, or do I need to parse it differently?
pymarc is the best option to parse MARC21 records using Python (full disclosure: I'm one of its maintainers). If you're unfamiliar with working with MARC21, it's worth reading through some of the specification you linked to on the Library of Congress website. I'd also read through the Working with MARC page on the Code4lib wiki.
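To show why the dump in the question is structured rather than random, here is a minimal standard-library sketch of the MARC21 record layout: a 24-character leader (the first 5 digits are the record length, characters 12-16 the base address of the data), then a directory of 12-character entries (3-digit tag, 4-digit field length, 5-digit start offset), then the fields themselves. The function names `iter_raw_records` and `parse_record` are made up for illustration and are not part of pymarc, and the sketch assumes the records are UTF-8 encoded (leader position 9 is `a`, as in the sample above). For real work, a maintained library such as pymarc is the safer choice.

```python
FIELD_TERMINATOR = b"\x1e"    # ends the directory and every field
RECORD_TERMINATOR = b"\x1d"   # ends every record
SUBFIELD_DELIMITER = b"\x1f"  # precedes each subfield code


def iter_raw_records(stream):
    """Yield raw records (bytes) from a binary stream of concatenated MARC21 records."""
    while True:
        length_bytes = stream.read(5)
        if len(length_bytes) < 5:
            return  # end of file
        record_length = int(length_bytes.decode("ascii"))
        rest = stream.read(record_length - 5)
        if len(rest) < record_length - 5:
            return  # truncated final record, e.g. a file cut with `head -c`
        yield length_bytes + rest


def parse_record(raw):
    """Return a list of (tag, indicators, subfields) tuples for one raw record.

    Control fields (tags 001-009) have no indicators or subfield codes, so
    they come back as (tag, "", [("", value)]).
    """
    leader = raw[:24]
    base_address = int(leader[12:17].decode("ascii"))
    # The directory sits between the leader and the base address;
    # its last byte is a field terminator, which we skip.
    directory = raw[24:base_address - 1]
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7].decode("ascii"))
        start = int(entry[7:12].decode("ascii"))
        begin = base_address + start
        data = raw[begin:begin + length - 1]  # drop the trailing field terminator
        if tag.startswith("00"):
            fields.append((tag, "", [("", data.decode("utf-8", errors="replace"))]))
            continue
        indicators = data[:2].decode("ascii", errors="replace")
        subfields = [
            (chunk[:1].decode("ascii", errors="replace"),
             chunk[1:].decode("utf-8", errors="replace"))
            # The first split element is empty because the data starts with a delimiter.
            for chunk in data[2:].split(SUBFIELD_DELIMITER)[1:]
        ]
        fields.append((tag, indicators, subfields))
    return fields
```

To try it on the file from the question, open `ab.bib.00.20120331.full.mrc` in binary mode (`"rb"`), loop over `iter_raw_records(fh)`, and pass each raw record to `parse_record`. The title lives in field 245, subfield `a`. Records in the older MARC-8 encoding would need different decoding than the UTF-8 assumed here.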
You may want to check this out - pymarc