Matthew Milner: September 12, 2017 at 13:06
Over the last year I've been slowly working through the revised metadata I created for Voyant DREaM using the TEI headers from Phase I and the pre-release version of Phase II of EEBO-TCP. While EEBO-TCP employs TEI, and seems to have created an authority list of authors and some agents, the work for DREaM also focused on agents and places within the unparsed
Nanohistory requires much more consistent authorial or canonical names than those currently used in EEBO-TCP. It also permits the tracking of various kinds of dates, some of which are available in VIAF XML and some of which appear within the names themselves in EEBO-TCP. These need not correspond, of course. Furthermore, Nanohistory permits the documentation of orthographic variants for particular languages, and it takes a different approach to titles. So, while EEBO-TCP provides a considerable amount of data for tracking authors and agents, it does so very messily. The EEBO-TCP metadata requires considerable cleaning before it can be integrated into Nanohistory as usable historical data, even aside from the agents contained within the
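As a rough illustration of the "dates inside names" problem, here is a minimal Python sketch that splits a trailing date expression off a name string. The pattern, the function name `split_name_and_dates`, and the sample strings are all assumptions made for illustration; they are not drawn from the DREaM data and would certainly need extending for real headers.

```python
import re

# Hypothetical pattern for a trailing date expression such as
# "1608-1674", "1608?-1674", "d. 1612", "b. 1590", "fl. 1640" or "1608-".
DATE_RE = re.compile(
    r",?\s*(?P<dates>(?:(?:b\.|d\.|fl\.|ca\.)\s*)?\d{4}\??"
    r"(?:\s*-\s*(?:ca\.\s*)?(?:\d{4}\??)?)?)\.?\s*$"
)


def split_name_and_dates(raw):
    """Return (name, dates); dates is None when no trailing date is found."""
    match = DATE_RE.search(raw)
    if match is None:
        # Leave the string untouched apart from surrounding whitespace.
        return raw.strip(), None
    # Everything before the date expression is treated as the name.
    name = raw[:match.start()].strip().rstrip(",").strip()
    return name, match.group("dates").strip()


# Illustrative inputs only.
for sample in ["Milton, John, 1608-1674.", "Smith, John, d. 1612", "Church of England."]:
    print(split_name_and_dates(sample))
```

Keeping the extracted dates as a separate string, rather than discarding them, leaves room to compare them later against whatever dates VIAF reports for the same agent.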
The DREaM metadata revision resulted in 13031 possible individuals for Phase I and pre-release Phase II of EEBO-TCP. Many are orthographic variants, while others are duplicates with different dates arising from VIAF, EEBO-TCP or OCLC metadata. Of these, 2330 are long names, which include those with possible titles as well as pseudonyms or attributions. There are also 732 possible organizations, though this set contains many individuals because VIAF does not have clean data when it comes to distinguishing an individual from a collective agent. In many cases these individuals are publishers, and so appear as a 'press' rather than as a person. Of the 13031, 6534 are distinct individuals in VIAF, and have been added fairly easily. A further 4315 aren't easily identified in VIAF, but further checking yields viable candidates, and they can be added quickly as well. This leaves 2182 remaining, which require thorough checking and editing in order to be added, including the agents with titles, alternative spellings, and organizations.
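The breakdown above can be checked with a few lines of Python. The counts are the ones given in this post; the bucket labels and variable names are just illustrative.

```python
# Counts taken from the post; labels are illustrative.
total_candidates = 13031
buckets = {
    "distinct in VIAF": 6534,
    "identified after further checking": 4315,
}

# Whatever is not covered by the two easy buckets needs thorough checking.
remaining = total_candidates - sum(buckets.values())

report = {**buckets, "needs thorough checking": remaining}
for label, count in report.items():
    print(f"{label}: {count} ({count / total_candidates:.1%})")
```

The remainder works out to 2182, so the three groups account for every one of the 13031 candidates.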
Once this work is complete, and we've bound EEBO-TCP agents to VIAF identifiers, we'll pull in the VIAF XML to create a list of othernames for each agent in Nanohistory. These will act as a gazetteer for recognizing existing Nanohistory data when we generate RIS-formatted data for EEBO-TCP texts to dump en masse into Nanohistory using the Thing importing tool. Fingers crossed, the result will be nicely formatted data for all 44418 EEBO-TCP Phase I and pre-release Phase II headers that are currently in DREaM. The new data will offer us a further revision for DREaM metadata: cleaned and clear authority names for EEBO-TCP agents and places, and those appearing in the
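To make the gazetteer idea concrete, here is a minimal Python sketch of a lookup table built from name variants. It assumes the variants have already been extracted from the VIAF XML into plain lists; the identifiers, the sample names, and the functions `normalize`, `build_gazetteer` and `lookup` are hypothetical, and nothing here reflects how Nanohistory or the Thing importing tool actually work.

```python
import unicodedata


def normalize(name):
    """Fold case, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_marks)
    return " ".join(kept.casefold().split())


def build_gazetteer(agents):
    """Map each normalized name variant to the set of agent ids that use it."""
    gazetteer = {}
    for agent_id, names in agents.items():
        for name in names:
            gazetteer.setdefault(normalize(name), set()).add(agent_id)
    return gazetteer


def lookup(gazetteer, name):
    """Return the candidate agent ids for a name; empty set if unknown."""
    return gazetteer.get(normalize(name), set())


# Hypothetical identifiers and variants, for illustration only.
agents = {
    "agent:EXAMPLE-1": ["Milton, John", "Miltonus, Joannes"],
    "agent:EXAMPLE-2": ["Church of England", "Ecclesia Anglicana"],
}
gazetteer = build_gazetteer(agents)
print(lookup(gazetteer, "MILTON, John."))
print(lookup(gazetteer, "Unknown, Person"))
```

Returning a set of ids rather than a single id is deliberate: the same variant can legitimately point at more than one agent, and those collisions are exactly the cases that need manual checking.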