Matthew Milner: September 12, 2017 at 13:06

Over the last year I've been slowly working through the revised metadata I created for Voyant DREaM using the TEI headers from Phase I and the pre-release version of Phase II of EEBO-TCP. While EEBO-TCP employs TEI, and seems to have created an authority list of authors and some agents, the work for DREaM also focused on agents and places within the unparsed . These agents aren't recognized, nor do they appear in any authority or canonical form. The DREaM revised metadata mashed up OCLC and VIAF metadata with EEBO-TCP's in an attempt to find identifiers and actual authority versions of these agents. But with Nanohistory the problem becomes a bit more complex. Nanohistory doesn't treat titles, or dates, as primary data for agents, while EEBO-TCP doesn't disambiguate or resolve authority identities between languages, and at its worst, it doesn't have authority names for organizations. This means, in practice, that the Church of England can appear with sub-organizations in the name, such as the Diocese of London, or with a bishop or other officer, like Edmund Bonner, for instance. Or it can appear as Eglise de l'Angleterre. This becomes even messier when considering authors with titles - kings or lords often appear under their main title, sometimes with their family names and sometimes without; sometimes secondary titles appear as well. And then there are the dates.

Nanohistory requires much more consistent authority or canonical names than those currently used in EEBO-TCP. It also permits the tracking of various kinds of dates, some of which are available in VIAF XML, and some of which appear within the names themselves in EEBO-TCP. These need not correspond, of course. Furthermore, Nanohistory permits the documentation of orthographic variants for particular languages, and it has a different approach to thinking about titles. And so, while EEBO-TCP provides considerable amounts of data for tracking authors and agents, it does so very messily. The EEBO-TCP metadata requires considerable cleaning in order to integrate into Nanohistory as usable historical data, even aside from the agents contained within the .
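To illustrate the point about dates living inside the names themselves, here is a minimal sketch of pulling a trailing date statement off a name heading. It assumes headings shaped roughly like "Bonner, Edmund, 1500?-1569."; the pattern, the function name `split_heading`, and the sample strings are illustrative assumptions, not the actual DREaM or Nanohistory cleaning code.

```python
import re

# Assumed heading shape: "Surname, Forename, 1500?-1569." (illustrative only).
# The pattern accepts an optional qualifier (b., d., fl., ca.), a four-digit
# year with an optional "?", and an optional range that may be left open.
DATE_TAIL = re.compile(
    r",\s*(?P<dates>(?:(?:b\.|d\.|fl\.|ca\.)\s*)?"
    r"\d{4}\??(?:\s*-\s*(?:\d{4}\??)?)?)\.?\s*$"
)

def split_heading(heading):
    """Return (name, dates); dates is None when no trailing date is found."""
    match = DATE_TAIL.search(heading)
    if match is None:
        return heading.strip().rstrip("."), None
    return heading[:match.start()].strip(), match.group("dates")

if __name__ == "__main__":
    for sample in (
        "Bonner, Edmund, 1500?-1569.",
        "Henry VIII, King of England, 1491-1547.",
        "Church of England.",
    ):
        print(split_heading(sample))
```

Keeping the date string separate from the name means it can be compared against whatever dates VIAF supplies, rather than being baked into the authority form. Note that titles such as "King of England" stay in the name part here; separating those would need its own pass.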

The DREaM metadata revision resulted in 13031 possible individuals for Phase I and pre-release Phase II of EEBO-TCP. Many are orthographic variants, while others are duplicates with different dates arising from VIAF, EEBO-TCP or OCLC metadata. Of these, 2330 are long names, which include those with possible titles, but also pseudonyms or attributions. There are also 732 possible organizations, though there are many individuals contained in this set because VIAF does not have clean data when it comes to distinguishing an individual from a collective agent. In many cases these individuals are publishers, and so appear as a 'press' rather than as a person. Of these 13031, 6534 are distinct individuals in VIAF, and have been added fairly easily. A further 4315 aren't easily identified in VIAF, but further checking yields viable candidates, and they can be added quickly as well. This leaves some 2182 or so remaining, which require thorough checking and editing in order to be added - including the agents with titles, alternative spellings, and organizations.
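As a quick sanity check on the figures above, the three triage groups can be tallied against the total. The dictionary labels and the helper `shares` are made up for this sketch; only the counts come from the text.

```python
# Counts taken from the paragraph above; the labels are illustrative.
TOTAL = 13031
triage = {
    "distinct in VIAF": 6534,
    "viable VIAF candidate after checking": 4315,
    "needs thorough checking and editing": 2182,
}

def shares(groups, total):
    """Return each group's share of the total as a percentage (1 decimal)."""
    if sum(groups.values()) != total:
        raise ValueError("groups do not add up to the total")
    return {label: round(100 * count / total, 1) for label, count in groups.items()}

if __name__ == "__main__":
    for label, pct in shares(triage, TOTAL).items():
        print(f"{label}: {pct}%")
```

The three groups do sum to 13031, so roughly half the agents resolve directly and about a sixth need manual work.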

Once this work is complete, and we've bound EEBO-TCP agents to VIAF identifiers, we'll pull in the VIAF XML to create a list of othernames for each agent in Nanohistory. These will act as a gazetteer for recognizing existing Nanohistory data when we generate RIS-formatted data for EEBO-TCP texts to dump en masse into Nanohistory using the Thing importing tool. Fingers crossed, the result will be nicely formatted data for all 44418 EEBO-TCP Phase I and pre-release Phase II headers that are currently in DREaM. The new data will offer us a further revision for DREaM metadata: cleaned and clear authority names for EEBO-TCP agents and places, and those appearing in the . And we'll publish this third, revised set as new metadata headers for EEBO-TCP itself.
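The gazetteer idea can be sketched as a lookup from normalized name variants to agent identifiers. This is only an illustration under stated assumptions: the identifiers such as `agent:0001` are placeholders rather than real VIAF or Nanohistory IDs, the variant lists are invented samples, and the normalization (case folding, stripping accents and simple punctuation) is one possible choice, not the project's actual matching rule.

```python
import unicodedata

def normalize(name):
    """Fold case, strip accents, and collapse punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = without_accents.casefold().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())

def build_gazetteer(agents):
    """Map every normalized variant to the set of agent IDs that use it."""
    gazetteer = {}
    for agent_id, names in agents.items():
        for name in names:
            gazetteer.setdefault(normalize(name), set()).add(agent_id)
    return gazetteer

def lookup(gazetteer, name):
    """Return candidate agent IDs for a name; an empty set means no match."""
    return gazetteer.get(normalize(name), set())

# Placeholder identifiers and sample variants, for illustration only.
AGENTS = {
    "agent:0001": ["Church of England", "Eglise de l'Angleterre"],
    "agent:0002": ["Bonner, Edmund", "Edmund Bonner"],
}

if __name__ == "__main__":
    gaz = build_gazetteer(AGENTS)
    print(lookup(gaz, "Église de l'Angleterre"))
    print(lookup(gaz, "BONNER, Edmund."))
    print(lookup(gaz, "Diocese of London"))
```

Returning a set rather than a single identifier keeps ambiguous variants visible: if two agents share an othername, the import step can flag the record for checking instead of silently picking one.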