Current Research

Current Research on Common Terminology

The current research on Common Terminology (CT) aims to improve metadata interoperability among the metadata records of cooperating organizations: Europeana, the Digital Public Library of America (DPLA), the National Library of Korea, the Massachusetts Institute of Technology (MIT), and Harvard Library. The goals are to convert the provided records into the developed Common Terminology (CT), to build linked open data with the CT, to provide a multilingual service, and to provide an international digital library portal through which the public can freely access their quality collections.

February of 2017.

January of 2017.

In 2016, the work consisted of developing the CT SKOS crosswalk based on the analyzed usage of the provided records, and developing CT conversions that convert those records into the developed Common Terminology (CT) in rdf/xml form.

  • EuropeanaEDMtoCTConversion has been developed since September 2016 to convert Europeana Data Model (EDM) records of Europeana into the developed Common Terminology (CT). Europeana offers ways to access their data at http://labs.europeana.eu/api. Thanks to their kind offer, the HarvestEDM program was developed, which harvests their metadata records set by set via their OAI-PMH service. Thanks to God, at the beginning of December 2016, all sets of Europeana EDM records were harvested in rdf/xml form, and the harvested EDM records were then converted into the developed Common Terminology (CT) in rdf/xml form. The conversion statistics show excellent performance of CT: converted rate 99.99%; SKOS semantic exact match rate 67.16%; narrow match rate 22.99%; broad match rate 9.85%; and not converted rate 0.0098%.
  • MITQDCtoCTConversion was developed to convert the MIT DSpace records harvested in September 2015 into the developed Common Terminology (CT). The conversion statistics show excellent performance of CT: converted rate 100%; SKOS semantic exact match rate 89.5%; narrow match rate 10.5%; broad match rate 0%; and not converted rate 0%.
  • NLKMODStoCTConversion was developed to convert the MODS records for the ancient rare resources of the National Library of Korea, provided in August 2015, into the developed Common Terminology (CT). The conversion statistics show excellent performance of CT: converted rate 100%; SKOS semantic exact match rate 85.7%; narrow match rate 14.3%; broad match rate 0%; and not converted rate 0%.
  • HarvardtoCTConversion was developed to convert the provided Library Cloud dataset of Harvard Library into the developed Common Terminology (CT). The conversion statistics show excellent performance of CT: converted rate 100%; SKOS semantic exact match rate 83.7%; narrow match rate 16.3%; broad match rate 0%; and not converted rate 0%.
  • A CT conversion (DPLAMAPtoCTConversion) was developed that converts Metadata Application Profile (MAP) records of DPLA into the developed Common Terminology (CT). The conversion statistics show very high performance of CT: converted rate 96.5%; SKOS semantic exact match rate 62.7%; narrow match rate 35.3%; broad match rate 2.0%; and a not converted rate as low as 3.5%.
  • CT SKOS crosswalks were developed to enhance the understanding and usage of Common Terminology, using the usages of element names/terms found in the provided metadata records. The crosswalks in csv format were designed in March 2016.
  • The planned activities have started. A Simple Knowledge Organization System (SKOS) crosswalk is being developed for MARC, MODS, QDC, and DC. 18 million Europeana Data Model records of Europeana were harvested, and the usage of their element names in 31 of the 1827 OAI list sets was analyzed.

In 2013-2015, the work consisted of organizing International Open Public Digital Library (IOPDL), Inc., a not-for-profit, tax-exempt organization, and receiving the metadata records provided by cooperating organizations.

  • International Open Public Digital Library (IOPDL), Inc. has been organized and operating since January 2015. The Common Terminology (CT) project has continued at IOPDL since January 2015.
  • IOPDL was provided with, or harvested, the metadata records of cooperating organizations, and
  • the usages of the terms/element names used in their records were analyzed again in January 2016.

For 2012-2014,

The updated CT version 1.1 Schemas

The updated CT version 1.1 SKOS concept 

Metadata Records

Metadata records are provided by cooperating organizations: Europeana, the Digital Public Library of America, the National Library of Korea, the Massachusetts Institute of Technology (MIT), Harvard Library, and the University of Illinois at Urbana-Champaign (UIUC) library. The provided original records are preserved, and copies are saved separately into new files to fit the IOPDL prototype project. The IOPDL prototype is to provide a single portal to cooperating national libraries and Well-Designed Digital Libraries (WDDLs), so that the public can freely access their high-quality collections.

Metadata Element Names Usage Analyses

To analyze the usage of their element names, a usage-analysis Python program was developed that covers all of the metadata and file formats used:

  • DPLA MAP metadata format in json file format;
  • Europeana EDM format in rdf file format;
  • National Library of Korea MODS format in xlsx file format;
  • Harvard Library Cloud in tsv file format;
  • MIT QDC format in xml file format.
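The counting step of such a usage analysis could look like the following minimal sketch for the xml case. It is not the project's usage program: the function name `count_element_names` and the sample record are assumptions made for illustration, and the json, xlsx, and tsv formats would each need their own reader.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_element_names(xml_text):
    """Count how often each element name occurs in one XML document."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree reports names as '{namespace-uri}localname'
        counts[elem.tag] += 1
    return counts

# Hypothetical record, used only to exercise the function
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A</dc:title>
  <dc:creator>B</dc:creator>
  <dc:creator>C</dc:creator>
</record>"""

usage = count_element_names(SAMPLE)
```

Summing the counters of all files in a folder (`Counter` objects support `+`) would give the usage of each element name across a whole collection.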

The element names found and their usages are as follows:

CT SKOS Crosswalks

Based on the usages of element names/terms found in the provided metadata records, the SKOS crosswalks were developed first.
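As an illustration of how a crosswalk in csv format can drive a conversion, the sketch below loads a tiny crosswalk and looks up one element name. The column names, the `ct:` term names, and the sample rows are hypothetical and are not taken from the actual CT crosswalk files.

```python
import csv
import io

# Hypothetical crosswalk rows: source element, CT term, SKOS mapping relation
CROSSWALK_CSV = """source_element,ct_term,skos_relation
dc:title,ct:title,exactMatch
dc:creator,ct:agent,narrowMatch
dc:coverage,ct:place,broadMatch
"""

def load_crosswalk(text):
    """Map each source element name to (CT term, SKOS relation)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["source_element"]: (row["ct_term"], row["skos_relation"])
            for row in reader}

crosswalk = load_crosswalk(CROSSWALK_CSV)

def convert_element(name):
    """Return (ct_term, relation), or None when the crosswalk has no entry."""
    return crosswalk.get(name)
```

An element name that returns `None` here corresponds to what the conversion statistics report as a not converted element name.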

http://ct.iopdl.org/1.1/ctskos_Crosswalk7.pdf

  • CT SKOS crosswalk for version 1.1

http://ct.iopdl.org/1.1/ctskos_CrosswalkIII-MARC_MODS_QDC_DC_QDCMIT.rdf.pdf

CT Crosswalks

http://www.ct.iopdl.org/1.1/CT_crosswalk7_doc.pdf

CT Conversions

CT conversion programs are being developed to convert the Metadata Application Profile (MAP) of DPLA, the Library Cloud dataset of Harvard, the QDC of MIT, the MODS of the National Library of Korea, and the EDM of Europeana into the developed Common Terminology (CT). These conversions are based on CT version 1.2, which is slightly upgraded to accommodate the diverse metadata formats (MAP and EDM) and forms (json) of cooperating organizations.

Metadata Application Profile (MAP) of DPLA to CT conversion

Using the Metadata Application Profile (MAP) metadata records, a CT conversion program was developed in the Python programming language that converts them into the developed Common Terminology (CT). The final result of the DPLA MAP to CT conversion and its statistics are:

The total Match rates of 8012390 records  in the folder, C:\Python27\metadata\DPLA
The number of total Statement= 228248135
Converted rate= 96.4896142525
exactMatch rate= 62.6515663931
narrowMatch rate= 35.3335726678
broadMatch rate= 2.01486093913
noConverted rate= 3.51038574751
Not converted Element Names are  {u'originalRecord': 8012390}
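The reported figures are consistent with the following reading, sketched here as an assumption rather than as the program's actual code: the converted and not converted rates are shares of all statements, while the exact, narrow, and broad match rates are shares of the converted statements (the three match rates above sum to 100). The function names are illustrative.

```python
def pct(part, whole):
    """Percentage of part in whole; 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0

def conversion_rates(total_statements, exact, narrow, broad):
    """Derive the five reported rates from raw statement counts."""
    converted = exact + narrow + broad
    return {
        "converted": pct(converted, total_statements),
        "noConverted": pct(total_statements - converted, total_statements),
        # Match rates are taken relative to the converted statements only
        "exactMatch": pct(exact, converted),
        "narrowMatch": pct(narrow, converted),
        "broadMatch": pct(broad, converted),
    }

# Check against the log above: 8012390 'originalRecord' statements
# were not converted, out of 228248135 statements in total.
dpla_no_converted = pct(8012390, 228248135)
```

Under this reading, `dpla_no_converted` reproduces the noConverted rate in the log to within rounding, which suggests the only not converted statements are the one `originalRecord` element per record.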

The sample record of the original MAP
The Converted CT by DPLAMAPtoCTConversion
MAP to CT Conversion Program

Library Cloud of Harvard to CT Conversion

The HarvardtoCTConversion program converts the provided Library Cloud dataset of Harvard Library into the developed Common Terminology (CT).

HarvardtoCTConversion Match Rates

The total Match rates of 1525223 records  in the folder, C:\Python27\metadata\harvard_library_cloud_urlsII
The number of total Statement= 38847221
HarvardtoCTConversion Converted rate= 100.0
exactMatch rate= 83.703912823
narrowMatch rate= 16.2956856039
broadMatch rate= 0.000401573126685
noConverted rate= 0.0
Not converted Element Names are  {}

An Example of Original Harvard Record
Converted CT record in rdf/xml form by HarvardtoCTConversion

 

MITQDCtoCTConversion

MITQDCtoCTConversion converts the MIT DSpace records harvested in September 2015 into the developed Common Terminology (CT).

MITQDCtoCTConversion Conversion Rates

The measured total match rates of 358017 records in the folder C:\Python27\metadata\MIT are as follows:
The number of total Statement= 7343157
Converted rate= 100.0%
exactMatch rate= 89.4742955925
narrowMatch rate= 10.5257044075
broadMatch rate= 0.0
noConverted rate= 0.0
Not converted Element Names are  {}

Findings
  • A few records have no URL identifier that makes them accessible online. For these records, the MITQDCtoCTConversion program generates a URL identifier using the header identifier that starts with 'oai:dspace.mit.edu'.
  • Records marked as deleted in the header are not converted.
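The two findings can be sketched as follows. OAI-PMH marks a deleted record with `status="deleted"` on its header, so such records can be skipped while reading. How the program actually builds the fallback URL is not stated; the handle-style URL below, the function names, and the sample identifiers are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def live_identifiers(list_records_xml):
    """Yield header identifiers of records that are not marked deleted."""
    root = ET.fromstring(list_records_xml)
    for header in root.iter(OAI + "header"):
        if header.get("status") == "deleted":
            continue  # deleted records are not converted
        yield header.findtext(OAI + "identifier")

def fallback_url(oai_identifier, prefix="oai:dspace.mit.edu:"):
    """Assumed rule: turn the local part of the identifier into a handle URL."""
    if not oai_identifier.startswith(prefix):
        return None
    return "http://hdl.handle.net/" + oai_identifier[len(prefix):]

# Hypothetical response with one live and one deleted record
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
<record><header><identifier>oai:dspace.mit.edu:1721.1/1</identifier></header></record>
<record><header status="deleted"><identifier>oai:dspace.mit.edu:1721.1/2</identifier></header></record>
</ListRecords></OAI-PMH>"""

live = list(live_identifiers(SAMPLE))
```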
An Example Original Record in xml form
Converted CT record in rdf/xml form by MITQDCtoCTConversion

 

NLKMODStoCTConversion

NLKMODStoCTConversion is a conversion program that converts MODS records of the National Library of Korea to the developed Common Terminology (CT).

NLKMODStoCTConversion Conversion Rates

The total measured match rates of 43762 records in the folder C:\Python27\metadata\NationalLibraryOfKorea are as follows:
The number of total Statement= 893425
Converted rate= 100.0%
exactMatch rate= 85.7332176736%
narrowMatch rate= 14.2667823264%
broadMatch rate= 0.0%
noConverted rate= 0.0%
Not converted Element Names are  {}

Findings
  • The W3C RDF validator reports warnings like the ones below for some specific characters that are not in Unicode Normalization Form C. Because these are warnings rather than fatal errors, and we do not yet know how to fix them, the warnings remain unfixed.
    • Warning: {W131} String not in Unicode Normal Form C: “不分卷1冊; 21.3 x 14.5 cm”[Line = 43, Column = 49] in nlk1.rdf  (不 causes the error).
    • Warning: {W131} String not in Unicode Normal Form C: “그 女子의 戀人”[Line = 187, Column = 32]
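One possible way to address the W131 warnings, offered as a suggestion and not as something the conversion programs do, is to normalize string values to Normalization Form C before writing the rdf/xml. The sketch assumes the offending characters are of the kind shown in its example, a CJK compatibility ideograph (U+F967) that normalizes to the unified ideograph U+4E0D. Whether changing the source characters is acceptable for preservation purposes is a separate decision.

```python
import unicodedata

def to_nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

def is_nfc(text):
    """True when text is already in Normalization Form C."""
    return to_nfc(text) == text

# U+F967 is a CJK compatibility ideograph; NFC maps it to U+4E0D.
compat = "\uf967"
normalized = to_nfc(compat)

# A decomposed 'e' + combining acute accent composes to a single code point.
composed = to_nfc("e\u0301")
```

Running `is_nfc` over each literal value before serialization would show which records trigger the warning.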
An Example Original Record in xlsx form
Converted CT record in rdf/xml form by NLKMODStoCTConversion

 

EuropeanaEDMtoCTConversion

EuropeanaEDMtoCTConversion is a conversion program that converts Europeana Data Model (EDM) records of Europeana to the developed Common Terminology (CT).

Europeana offers ways to access its data at http://labs.europeana.eu/api. Thanks to this kind offer, the HarvestEDM program was developed, which harvests Europeana's metadata records set by set through the OAI-PMH service.
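Harvesting one set over OAI-PMH follows a standard pattern: issue a ListRecords request for the set, then keep following the resumptionToken until the server returns an empty one. The sketch below shows that loop only; it is not the HarvestEDM program. The base URL passed in, the `edm` metadata prefix, and the injectable `fetch` parameter are assumptions made for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def harvest_set(base_url, set_spec, metadata_prefix="edm", fetch=http_get):
    """Yield each ListRecords response page for one OAI-PMH set."""
    params = {"verb": "ListRecords",
              "metadataPrefix": metadata_prefix,
              "set": set_spec}
    while True:
        page = fetch(base_url + "?" + urllib.parse.urlencode(params))
        yield page
        token = ET.fromstring(page).find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one, marks the last page
        # Follow-up requests carry only the verb and the resumption token
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}
```

Each yielded page can be saved to a file per set, which matches the per-set folder layout that the conversion statistics refer to.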

EuropeanaEDMtoCTConversion Conversion Rates

The average match rates were calculated from the values measured for groups of sets. For example, the first group consists of sets 1 to 30. The values measured for the 11271544 records in sets 1-30 are: Converted rate= 100.0; exactMatch rate= 69; narrowMatch rate= 22.18; broadMatch rate= 8.6; noConverted rate= 0.0%; Not converted Element Names are {}.

The Average of total Match rates of 39937489 EDM records of Europeana are:
The number of Statements=  1925054834.0
The Average of Converted rate=  99.9901938321
The Average of exactMatch rate=  67.1639836621
The Average of narrowMatch rate=  22.9867034728
The Average of broadMatch rate=  9.84931286504
The Average of closeMatch rate=  0.0
The Average of noConverted rate=  0.00980965763118
Not converted Element Names are  {'ore:Proxy/dcterms:isRequiredBy': 12, 'ore:Proxy/edm:isDerivativeOf': 5, 'edm:EuropeanaAggregation/edm:hasView': 1282, 'ore:Aggregation/edm:ugc': 48500, 'ore:Proxy/edm:isRepresentationOf': 2, 'ore:Proxy/edm:incorporates': 21, 'ore:Proxy/edm:isSuccessorOf': 2}

Difficulties

There were some difficulties in the EuropeanaEDMtoCTConversion conversion. The main cause is the diversity of the values that providers described.

  • Some records use language codes that the ISO 639 series does not define, which causes W3C RDF validation errors such as {W116} RFC 3066 section 2.3 mandates the use of 'en' instead of 'eng'. The language codes used that are not defined in the ISO series are ['als', 'ang', 'arz', 'ast', 'azb', 'bar', 'bcl', 'bjn', 'bpy', 'bxr', 'cas', 'cdo', 'ckb', 'diq', 'en-gb', 'en-us', 'eur', 'ext', 'frp', 'gag', 'gan', 'glk', 'gml', 'gom', 'gr', 'hak', 'hbs', 'hif', 'iten', 'japani', 'jp', 'jut', 'koi', 'ksh', 'lad', 'lbe', 'lij', 'lmi-2010', 'lmo', 'lrc', 'ltg', 'lzh', 'mhr', 'mo', 'mrj', 'mzn', 'nan', 'nap', 'nov', 'nrm', 'olo', 'osx', 'pcd', 'pdc', 'pfl', 'pih', 'pms', 'pnb', 'pnt', 'prg', 'ran', 'rgn', 'rmy', 'rue', 'sgs', 'sh', 'sk-SK', 'Spa', 'stq', 'szl', 'tcy', 'uri', 'vec', 'vep', 'vls', 'wuu', 'xmf', 'xxx', 'yue', 'zea', 'zh-hant', 'zh-latn-pinyin-x-hanyu', 'zh-latn-pinyin-x-notone', 'zh-latn-wadegile']
  • Some languages used in the SKOS concepts to provide the multilingual service cause W3C RDF validation warnings such as
    “Warning: {W131} String not in Unicode Normal Form C: “(sl)pozidano območje, strnjeno naselje;(sk)zastavaná oblasť;(da)bebygget område;(eu)eremu eraiki; eraikitako eremu;(ro)zonă construită;(it)area edificata;(tr)yerleşim alanı;(mt)żona mibnija;(no)bebygd område;(hu)beépített terület;(lv)apbūvēta teritorija;(ar)منطقة مشيَّدة;(lt)apstatyta teritorija;(cs)území zastavěné;(de)Bebaute Fläche;(el)(πυκνο)δομημένη περιοχή/οικιστική περιοχή;built-up area;城市建成;(fi)rakennettu alue, asutusalue;(pl)teren zabudowany;(pt)povoações;(bg)Застроен район;(fr)agglomération;(sv)tätbebyggelse;(en)built-up area;(ru)застроенная территория;(et)täisehitatud ala;(es)zona edificada;(nl)bebouwde kom”[Line = 22, Column = 783]”
  • Diverse prefixes are used, such as 'odrl' and 'cc' in 'odrl:inheritFrom="http://www.europeana.eu/rights/out-of-copyright-non-commercial/"' and 'cc:deprecatedOn="2027-11-10"'.
  • Rarely used terms were omitted from the CT crosswalk, such as 'ore:Aggregation/edm:ugc'.
  • HTML tags that include '>'. For example,

'<edm:isShownAt rdf:resource="http://galenet.galegroup.com/servlet/ECCO?c=1&amp;stp=Author&amp;ste=11&amp;&lt;>af=BN&amp;ae=T152600&amp;tiPG=1&amp;dd=0&amp;dc=flc&amp;docNum=CW119814160&amp;vrsn=1.0&lt; >&amp;srchtp=a&amp;d4=0.33&amp;n=10&amp;SU=0LRF"/>'

In particular, this causes significant semantic errors, because I use '>' as a separator to find the terms/element names and values used in the xml form. A '>' inside a value causes the original value to be lost, and in particular it results in broken links when '>' is used in URLs. Changing the logic of the program fixes the problem and recovers the original values, but the broken-link problem remains if the original value was already a broken link.
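One way to avoid the separator problem altogether, which is not necessarily the logic change made in the program, is to let an XML parser find element names and attribute values. The sketch below contrasts the two approaches on a shortened, hypothetical URL that contains a raw '>' in the same position as the example above.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EDM = "http://www.europeana.eu/schemas/edm/"

# Hypothetical, shortened version of the problematic element
SAMPLE = (
    '<edm:isShownAt xmlns:edm="' + EDM + '" xmlns:rdf="' + RDF + '" '
    'rdf:resource="http://example.org/view?c=1&amp;&lt;>af=BN&amp;n=10"/>'
)

# A parser returns the whole attribute value and also decodes &amp; and &lt;
elem = ET.fromstring(SAMPLE)
resource = elem.get("{" + RDF + "}resource")

# Splitting on '>' cuts the value in two at the raw '>' inside the URL
naive_parts = SAMPLE.split(">")
```

Here `resource` holds the complete URL, while `naive_parts` breaks the element into fragments, which is the value loss described above.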

  • Broken links in values and in rdf:resource.
  • A few files in a set have no records, such as <ListRecords></ListRecords>.
  • A few records have no metadata description, only a 'null' value. For example,

<record><header><identifier>http://data.europeana.eu/item/2048605/data_item_bbaw_dta_30400</identifier><datestamp>2015-07-18T07:24:17Z</datestamp><setSpec>2048605_Ag_EU_DM2E_bbaw_dta</setSpec></header><metadata>null</metadata></record>

  • A few records have neither a data provider nor a provider. In this case, the default provider is Europeana.
  • In some records, a few descriptions have no values, such as '<dc:rights xmlns:dc="http://purl.org/dc/elements/1.1/"></dc:rights>'
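For the language-code difficulty listed above, a cleanup step could map three-letter codes to their two-letter equivalents and flag codes it does not recognize. This is a minimal sketch, not part of EuropeanaEDMtoCTConversion: the mapping table is a small illustrative subset, the function name is hypothetical, and two-letter codes are passed through without being checked against ISO 639-1.

```python
# Small illustrative subset of ISO 639-2 to ISO 639-1 equivalents
ISO639_2_TO_1 = {
    "eng": "en", "fra": "fr", "fre": "fr",
    "deu": "de", "ger": "de", "spa": "es",
}

def normalize_language(tag):
    """Return a cleaned-up language tag, or None when it is not recognized."""
    parts = tag.strip().split("-")
    primary = parts[0].lower()
    if len(primary) == 3:
        primary = ISO639_2_TO_1.get(primary)
        if primary is None:
            return None  # three-letter code missing from the table
    elif len(primary) != 2:
        return None  # values such as 'japani' or 'iten' are not codes
    # Keep any subtags (region, script) as they were given
    return "-".join([primary] + parts[1:])
```

Values for which the function returns `None` would still need a manual decision, for example dropping the language attribute or mapping the code by hand.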
An Example Original Record in RDF/XML form
Converted CT record in rdf/xml form by EuropeanaEDMtoCTConversion
