Authors: Philippe Gray and Anne Lawrence
Inspired by the Association of Southeastern Regional Libraries webinar, “Adding Patent Records to Clemson’s IR — Highlighting the University’s Output,” VTechWorks, Virginia Tech’s institutional repository, now offers a similar collection, Virginia Tech Patents. The collection contains 645 U.S. patents that were assigned to Virginia Tech at the time of patent application, with issue dates ranging from 1919 to 2016. The collection’s display is customized with fields, search filters, and facets particular to patents, such as patent type, inventor, assignee, patent and application numbers, and patent classifications. We created the collection for two reasons: a sizable body of useful public domain content could be harvested programmatically, and the collection offers an opportunity to spotlight how Virginia Tech “invents the future.”
To enable other repositories to develop a similar collection, we offer our software, Patent-Harvest, in a GitHub repository. Patent-Harvest contains a Java program written to harvest all patents with Virginia Tech as the assignee. It can be adapted to harvest patents and associated files for other organizations or search parameters.
The harvesting program uses the PatentsView API to retrieve relevant metadata for all Virginia Tech patents and outputs a CSV spreadsheet. If desired, the program also downloads the corresponding files for each patent and gives them logical names. Since most United States patent documents are image-only PDFs, a script is included that uses optical character recognition to recognize the text content and embed it in the patent documents. This makes the text of the patent documents searchable without changing how they appear to the reader.
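To make the CSV and renaming steps concrete, here is a minimal Java sketch of how harvested metadata might be written as a CSV row and how a downloaded file might be given a logical name. It is not taken from Patent-Harvest: the class name PatentCsvSketch, the method names, the column order, the "US<number>_<suffix>.pdf" naming scheme, and the sample values are all assumptions made for illustration, and the sketch deliberately leaves out the PatentsView request itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names and conventions here are hypothetical,
// not the ones used by Patent-Harvest.
final class PatentCsvSketch {

    // Escape one CSV field: wrap it in double quotes when it contains a comma,
    // a double quote, or a line break, and double any embedded double quotes.
    static String escapeField(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Join already-harvested metadata values into a single CSV row.
    static String toCsvRow(List<String> fields) {
        List<String> escaped = new ArrayList<>();
        for (String field : fields) {
            escaped.add(escapeField(field));
        }
        return String.join(",", escaped);
    }

    // Build a predictable file name from a patent number (assumed scheme).
    // Punctuation such as thousands separators is stripped from the number.
    static String logicalFileName(String patentNumber, String suffix) {
        String digitsAndLetters = patentNumber.replaceAll("[^A-Za-z0-9]", "");
        return "US" + digitsAndLetters + "_" + suffix + ".pdf";
    }

    public static void main(String[] args) {
        // Made-up sample record: patent number, title, patent type.
        List<String> record = Arrays.asList(
                "7654321", "Method for \"smart\" sensing, version 2", "utility");
        System.out.println("patent_number,title,type");
        System.out.println(toCsvRow(record));
        System.out.println(logicalFileName("7,654,321", "fulltext"));
    }
}
```

Quoting matters here because patent titles and inventor lists often contain commas, which would otherwise shift values into the wrong spreadsheet columns. A production harvester would more likely rely on an established CSV library than on hand-written escaping.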