Open@VT

Open Access, Open Data, and Open Educational Resources

Category Archives: Open Data

Open Data Week in Review

Last week Virginia Tech’s University Libraries hosted its inaugural Open Data Week with six programs on a variety of open data topics. The new format builds on last year’s Open Data Day, which incorporated a hackathon and roundtable discussions. However, the weekend scheduling and a conflict with spring break this year spurred us to create a new event friendlier to academic schedules, with programs throughout the week. Though we hadn’t heard of anyone having an Open Data Week before, we know that Virginia Tech is supposed to “Invent The Future,” so we did. Here’s a summary of the week’s programs.

In our first program of the week, Data Anonymization: Lessons from a Millennium Challenge Corporation Impact Evaluation, Ralph P. Hall (Urban Affairs and Planning) and Eric Vance (Director of LISA, the Laboratory for Interdisciplinary Statistical Analysis) described their evaluation of a rural water supply project in Mozambique, which involved household surveys (slides, MCC documentation).

Ralph P. Hall

The first lesson learned from their evaluation was that everything is linked to informed consent. The primary takeaway here is the importance of distinguishing between anonymity and confidentiality (see slide 18), the latter of which gives researchers much more flexibility. In addition, there were difficulties with translating the informed consent into Portuguese and local languages. Other lessons include not underestimating the time required to anonymize data, and designing survey instruments to minimize anonymization challenges. Unfortunately, the anonymization challenges resulted in an analysis that is not reproducible and data that cannot be shared with a follow-up evaluation team. Data anonymization is a persistent and complex issue that needs to be discussed more frequently, and will certainly be on the agenda of future Open Data Weeks.
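To make the anonymization work more concrete, here is a minimal Python sketch of two common de-identification steps: replacing a direct identifier with a keyed pseudonym, and coarsening a quasi-identifier into a band. It is not the method used in the MCC evaluation. The field names, the sample record, and the secret key are all hypothetical, and a keyed pseudonym like this supports confidentiality (whoever holds the key can re-link records) rather than true anonymity.

```python
import hashlib
import hmac

# Hypothetical secret held only by the research team. Anyone with this key
# can regenerate the pseudonyms, so this is confidentiality, not anonymity.
SECRET_KEY = b"replace-with-a-private-key"


def pseudonym(household_id, key=SECRET_KEY):
    """Replace a direct identifier with a stable keyed hash."""
    digest = hmac.new(key, household_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def coarsen_age(age, width=10):
    """Report an age band (e.g. 30-39) instead of an exact age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record):
    """Keep only the fields needed for analysis, in de-identified form."""
    return {
        "household": pseudonym(record["household_id"]),
        "age_band": coarsen_age(record["age"]),
        "district": record["district"],
        "water_source": record["water_source"],
    }


# Hypothetical survey record, not real data.
raw = {
    "household_id": "HH-0042",
    "respondent_name": "Jane Doe",
    "age": 37,
    "district": "Example District",
    "water_source": "handpump",
}
clean = deidentify(raw)
```

Building the output from an explicit list of fields, rather than deleting fields from the input, means that a column added to the survey later is dropped by default instead of leaking through.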

Our session on The Freedom of Information Act (FOIA) featured three speakers: Wat Hopkins (Dept. of Communication), Steve Capaldo (University Legal Counsel), and Siddhartha Roy (Flint Water Study team).

Wat Hopkins

Wat Hopkins focused on FOIA in Virginia. FOIA first emerged at the federal level from a 1964 Supreme Court case, and subsequently Virginia was among the first to implement FOIA at the state level in the late 1960s. FOIA laws vary greatly from state to state. In Virginia, FOIA applies to records and meetings. Record requests must receive a response within 5 days and do not need to be in writing (though federal FOIA does require it), and there are around 130 exemptions. Requests must come from a Virginia citizen, or a news organization with circulation or broadcast in some part of the state. For more information, see Virginia’s Freedom of Information Advisory Council, and the Virginia Coalition for Open Government’s FOI Citizens Guide. Ultimately, we can’t be responsible citizens without access to government information.

Steve Capaldo said that since Virginia Tech is a state agency, it is governed by Virginia FOIA. However, the university responds to requests from everyone, not just residents or the media, and will do so within 5 days. There are many exemptions, including some involving research (proprietary or classified research, and grant proposals), personnel records, and records involving security, such as building plans. He emphasized the importance of making requests as specific as possible in order to reduce the time and effort required to respond. And although it’s not required, Capaldo suggested that it can be helpful when requestors explain the context of their request, because sometimes information needs can be met in alternative ways.

Sid Roy, a member of the Flint Water Study team and a graduate student in Civil and Environmental Engineering, described the Flint water crisis, which has spanned 18 months and affected 100,000 people. In the process, an EPA employee was silenced and the fallout has included several resignations. The crisis response involved FOIA requests to the city of Flint, the Michigan Department of Environmental Quality, and the EPA. Interestingly, federal FOIA requires an acknowledgement of the request within 2 weeks, but there is no time limit for responding with the requested information. Roy relayed the FOIA advice of the project’s leader, Dr. Marc Edwards: first, be as specific as possible in your request, and second, make requests to a related agency that is not the primary target. For example, the team made FOIA requests to Flint in order to obtain communications and data from EPA. Although we ran out of time to discuss FOIA costs, according to the Flint Water Study GoFundMe page, their FOIA expenses came to $3,180 (while you are on that page, consider a donation!). In short, Roy recommended that FOIA be in every scientist’s toolbox.

In Library Data Services: Supporting Data-Enabled Teaching and Research @ VT, Andi Ogier gave an overview of the three services offered: education (data management and fluency), curation (capturing context and ensuring reuse), and consulting (embedding informatics methods into research, and teaching about proprietary formats and the need for using open standards). Data Services strives to help researchers’ data achieve impact on the scholarly record, remain useful over time and across disciplines, and be openly shared for the benefit of humanity. The library helps with data management plans required by funders, and can assign DOIs to datasets. The presentation coincided with the beta release of VTechData, a data repository to help Virginia Tech researchers provide access to and preserve their data.

Show Me the (Open) Data! with librarians Ginny Pannabecker and Andi Ogier was a conversational, exploratory session devoted to identifying open data sets. At the session, they introduced a new guide to finding data, which in addition to listing data sources also includes definitions and information on citing data.

Web Scraping session with Ben Schoenfeld

Scraping Websites: How to Automate the Collection of Data from the Web was led by Ben Schoenfeld of Code for New River Valley, a Code for America brigade that meets biweekly to work on civic projects. As the slides explain, some programming skills are needed to effectively obtain and clean up data from websites lacking an API, and the basic steps are outlined. The live demonstration, using local restaurant health inspection data, did a good job of showing what is possible. One of our developers in the library, Keith Gilbertson, wrote a blog post about the session and how he applied the skills he learned to a database of state salaries.
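As a rough illustration of the obtain-and-clean-up steps, here is a minimal Python sketch that extracts rows from an HTML table using only the standard library. It is not the code from the session's demonstration. The sample page, its column names, and its values are hypothetical, and a real scraper would first download the HTML from the site.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table id="inspections">
  <tr><th>Restaurant</th><th>Date</th><th>Violations</th></tr>
  <tr><td> Example Diner </td><td>2016-03-01</td><td> 2 </td></tr>
  <tr><td>Sample Cafe</td><td>2016-03-02</td><td>0</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Clean-up step: strip stray whitespace around the cell text.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_inspections(html):
    """Turn the table into a list of dicts keyed by lower-cased header."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    keys = [h.lower() for h in header]
    records = [dict(zip(keys, row)) for row in body]
    for record in records:
        record["violations"] = int(record["violations"])  # text -> number
    return records


records = scrape_inspections(SAMPLE_HTML)
```

Most of the effort in practice goes into the clean-up: trimming whitespace, converting text to numbers and dates, and coping with pages whose layout changes without notice.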

Intro to APIs: What’s an API and How Can I Use One? was led by Neal Feierabend, also of Code for NRV (his slides follow the scraping slides, beginning with slide 17). After an explanation of what APIs (application programming interfaces) are and what types are available, the live demo explored a few APIs, beginning with the Google Maps API. Use of this API is free up to a certain number of page loads, and usage beyond that requires a fee, a model used by many popular APIs. This is one reason Craigslist switched from Google Maps to OpenStreetMap, which as an open mapping tool enables download of the data. Generally, good APIs are those that are well documented. Both Neal and Ben attested to the value of using Stack Overflow and searching the web when encountering coding problems. After the session I found out that there are also web services for data extraction, such as import.io.
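For readers who have not used a web API, the basic pattern is to build a URL with query parameters, request it, and parse the structured response (usually JSON). The Python sketch below shows the building and parsing steps without touching the network. The endpoint, parameter names, and response shape are hypothetical and do not describe the Google Maps API or any other real service.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/v1/inspections"


def build_request_url(base_url, **params):
    """Encode query parameters onto the endpoint URL (sorted for stability)."""
    return f"{base_url}?{urlencode(sorted(params.items()))}"


# Hypothetical JSON body of the kind such an API might return.
SAMPLE_RESPONSE = (
    '{"results": ['
    '{"name": "Example Diner", "score": 92}, '
    '{"name": "Sample Cafe", "score": 88}]}'
)


def parse_scores(body):
    """Map each restaurant name to its score."""
    payload = json.loads(body)
    return {item["name"]: item["score"] for item in payload["results"]}


url = build_request_url(BASE_URL, city="Blacksburg", limit=2)
scores = parse_scores(SAMPLE_RESPONSE)
```

With a real service, the response body would come from an HTTP request to `url` (for example with `urllib.request.urlopen`), often with an API key added as a parameter or header, which is how providers meter the usage limits mentioned above.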

Thanks to all of our presenters and attendees, and please let us know if you have suggestions for Open Data Week programs. We hope to do it again next year!

Grad Students: Travel to Brussels to Learn About Openness!

Graduate students at Virginia Tech are encouraged to apply for a travel scholarship to OpenCon 2015, the student and early career researcher conference on Open Access, Open Education, and Open Data to be held on November 14-16, 2015 in Brussels, Belgium.

One scholarship will be awarded to a Virginia Tech graduate student, which will cover travel expenses, lodging, and some meals. Applicants must use the following URL to apply by Monday, September 21:

http://opencon2015.org/virginia_tech

To find out more about the conference, see the Participant FAQ and the conference program. This international conference offers an unparalleled opportunity to learn about the growing culture of openness in academia and how to become a participant in it. The travel scholarship is sponsored by the Graduate School and the University Libraries. For questions, please contact Philip Young, pyoung1@vt.edu (please note that the general application process for the conference closed earlier this summer, and related details in the participant FAQ will not apply).

Last year two graduate students received scholarships to the conference (which was in Washington, D.C.), and you can read about their experiences.

This year’s winner will be selected by the Graduate School and the University Libraries based on answers to the application questions, and announced on September 24. Please share this opportunity with all VT graduate students, and best of luck to the applicants!

Book Review: Issues in Open Research Data

Moore, Samuel A. (ed.), Issues in Open Research Data (London: Ubiquity Press, 2014).

Bringing together contributed chapters on a wide variety of topics, Issues in Open Research Data is a highly informative volume of great current interest. It’s also an open access book, available to read or download online and released under a CC BY license. Three of the nine chapters have been previously published, but benefit from inclusion here. In the interest of full disclosure, I’m listed as a book supporter (through unglue.it) in the initial pages.

In his Editor’s Introduction, Samuel A. Moore introduces the Panton Principles for data sharing, inspired by the idea that “sharing data is simply better for science.” Moore believes each principle builds on the previous one:

  1. When publishing data, make an explicit and robust statement of your wishes.
  2. Use a recognized waiver or license that is appropriate for data.
  3. If you want your data to be effectively used and added to by others, it should be open as defined by the Open Knowledge/Data Definition; in particular, non-commercial and other restrictive clauses should not be used.
  4. Explicit dedication of data underlying published science into the public domain via PDDL or CC0 is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

In “Open Content Mining” Peter Murray-Rust, Jennifer C. Molloy and Diane Cabell make a number of important points regarding text and data mining (TDM). Both publisher restrictions and law (recently liberalized in the UK) can block TDM. And publisher contracts with libraries, often made under non-disclosure agreements, can override copyright and database rights. This chapter also includes a useful table of the TDM restrictions of major journal publishers. (Those interested in exploring further may want to check out ContentMine.)

“Data sharing in a humanitarian organization: the experience of Médecins Sans Frontières” by Unni Karunakara covers the development of MSF’s data sharing policy, adopted in 2012 (its research repository was established in 2008). MSF’s overriding imperative was to ensure that patients were not harmed due to political or ethnic strife.

Sarah Callaghan makes a number of interesting points in her chapter “Open Data in the Earth and Climate Sciences.” Because much of earth science data is observational, it is not reproducible. “Climategate,” the exposure of researcher emails in 2009, has helped drive the field toward openness. However, there remain several barriers. The highly competitive research environment causes researchers to hoard data, though funder policies on open data are changing this. Where data has commercial value, non-disclosure agreements can come into play. Callaghan notes the paradox that putting restrictions on collaborative spaces makes sharing more likely (the Open Science Framework is a good example). She also shares a case in which an article based on open data was published three years before the researchers who produced the data published their own work. It is becoming likely that funders will increasingly monitor data use and require acknowledgement of data sources used in a publication. Data papers (short articles describing a dataset and the details of collection, processing, and software) may encourage open data. Researchers are more likely to deposit data if given credit through a data journal. However, data journals need to certify data hosts and provide guidance on how to peer review a dataset.

In “Open Minded Psychology” Wouter van den Bos, Mirjam A. Jenny, and Dirk U. Wulff share a discouraging statistic: 73% of corresponding authors failed to share data from published papers on request. A significant barrier is that providing data means substantial work. Usability can be enhanced by avoiding proprietary software and following standards for structuring data sets (an example of the latter is OpenfMRI). The authors discuss privacy issues as well; in the case of fMRI, the data include a 3D image of the participant’s face. The value of open data is that data sets can be combined, used to address new questions, analyzed with novel statistical methods, or used as an independent replication data set. The authors conclude:

Open science is simply more efficient science; it will speed up discovery and our understanding of the world.

Ross Mounce’s chapter “Open Data and Palaeontology” is interesting for its examination of specific data portals such as the Paleobiology Database, focusing in particular on the licensing of each. He advocates open licenses such as the CC0 license, and argues against author choice in licensing, pointing out that it creates complexity and results in data sharing compatibility problems. And even though articles with data are cited more often, Mounce points out that traditionally indexing occurs only for the main paper, not supplementary files where data usually resides.

Probably the most thought-provoking yet least data-focused chapter is “The Need to Humanize Open Science” by Eric Kansa of Open Context, an open data publishing venue for archaeology and related fields. The chapter starts with open data but is mostly about the interaction of neoliberal policies and openness. It deserves a more extensive analysis than I can give here, but those interested in the context against which openness struggles may want to read his blog post on the subject in addition to this chapter.

Other chapters cover the role of open data in health care, drug discovery, and economics. Common themes include:

  • encouraging the adoption of open data practices and the need for incentives
  • the importance of licensing data as openly as possible
  • the challenges of anonymization of personal data
  • an emphasis on the usability of open data

As someone without a strong background in data (open or not), I learned a great deal from this book, and highly recommend it as an introduction to a range of open data issues.

Open Data Day/CodeAcross Event Recap

Blacksburg’s first celebration of Open Data Day and CodeAcross was organized by Code for NRV, our local Code for America brigade, and the University Libraries, which hosted the event in Newman Library’s Multipurpose Room. The event was originally scheduled for Saturday, February 21 (the official Open Data Day, observed in hundreds of cities around the world), but rapidly accumulating snow forced us to postpone until Sunday. As it turned out, a water leak closed the library around mid-day Saturday, so things worked out for the best. (Our apologies to registrants for the sudden change in plans.)

The first event of the morning was a mapping roundtable led by Peter Sforza, director of the Center for Geospatial Information Technology at Virginia Tech. In addition to looking at a lot of cool maps, we identified three potential areas for collaboration:

  • 3D Blacksburg – an effort to develop a common, shared 3D spatial reference model for Blacksburg and the New River Valley.
  • Contributing more authoritative data to OpenStreetMap for Blacksburg and Virginia by working with GeoGig.
  • Opening data that CGIT compiles for projects and research, for example crash data from the Virginia Department of Transportation.

Peter Sforza Leads the Mapping Roundtable

For the journalism roundtable, we were joined by Scott Chandler, Design/Production Adviser for the Educational Media Company at Virginia Tech, and Cameron Austin, former editor of the Collegiate Times. One problem the CT has is finding and keeping programmers to help with data projects, such as its academic salaries database. Code for NRV will try to help with recruitment. A database of textbook costs was identified as a possible project that would be of particular interest to students.

Blacksburg town council member Michael Sutphin joined us for the public policy roundtable, which included interesting discussions of town planning notifications and ways to encourage citizen engagement (such as the underutilized site Speak Up Blacksburg). Some of the project ideas included:

  • Visualizations of the town’s historical budget data that could benefit the public and town officials.
  • Opening the raw data used to create tables and maps in the town’s comprehensive plans.
  • Analysis of emails to and from local government officials to create visualizations of the most commented on topics in the town, e.g. word clouds and tag lists.

Our hackathon emerged from the morning’s mapping roundtable, so perhaps it’s not surprising that the projects were geographic in nature:

  • One volunteer used the Virginia Restaurant Health Inspection API created by Code for Hampton Roads to create a map of Blacksburg restaurants and their health scores.
  • An architecture student started a project that will use open 3D geospatial data from Virginia Tech to design pathways that are sculpted for the landscape.
  • Researchers from the Virginia Bioinformatics Institute adapted a model used in Ebola research to optimize placement of EMS staging areas during flood emergencies in Hampton Roads, Virginia. The model uses open data sets like the location and elevation of every roadway in Virginia to determine which streets would still be navigable during a flood.

Waldo Jaquith

To kick off our events Friday evening, we were very happy to have Waldo Jaquith speaking on “Open Government Data in Virginia” prefaced by a brief introduction to Open Data Day/CodeAcross by Ben Schoenfeld, co-leader of the Code for NRV brigade. Waldo Jaquith is the director of the U.S. Open Data Institute, an organization building the capacity of open data and supporting government in that mission. See the video of his talk below.

Thanks to everyone who turned out Friday and/or Sunday!

Thanks to the University Libraries’ Event Capture Service for the video below.

Learn About Open Data at Open Data Day/CodeAcross!

Join us for Blacksburg’s first observance of Open Data Day/CodeAcross, organized by Virginia Tech’s University Libraries and Code for NRV, our local Code for America brigade, this Friday and Saturday, February 20-21, 2015. We will be one of more than 100 Open Data Day and CodeAcross events taking place around the world on February 21. We welcome area residents and local government officials as well as faculty, staff, and students at Virginia Tech to find out how open data can improve our community (coding not required!). Registration is requested to help us with logistics, and for VT faculty, NLI credit is available (look for the sign-in sheet as well).

Waldo Jaquith

Friday, February 20, 2015
5:30pm to 7:00pm
Newman Library Multipurpose Room (first floor)

To kick off our events, we are very pleased to have Waldo Jaquith speaking on “Open Government Data in Virginia” which will be followed by a brief introduction to Open Data Day/CodeAcross. Waldo Jaquith is the director of the U.S. Open Data Institute, an organization building the capacity of open data and supporting government in that mission. In 2011, in acknowledgement of his open data work, Jaquith was named a “Champion of Change” by the White House and, in 2012, an “OpenGov Champion” by the Sunlight Foundation. He went on to work in open data with the White House Office of Science and Technology Policy. Jaquith, a 2005 Virginia Tech graduate, lives near Charlottesville, Virginia with his wife and son.

Saturday, February 21, 2015
9:30am to 5:00pm (lunch provided)
Newman Library Multipurpose Room (first floor)
Registration requested!

Open Data Day/CodeAcross will offer three tracks for coders and non-coders alike. First, there will be a sequence of one-hour discussion roundtables led by experts on the relationship of open data with mapping (10am), journalism (11am), public policy (1pm), health (2pm), and research (3pm). Second, there will be a mapping project emerging from the mapping roundtable and lasting the rest of the day. Third, for the coders there will be a hackathon using open government data in Virginia. Around 4pm, we will gather together, talk about our projects and what we learned, and plan for the continuation of projects. Attendees may move between these three tracks as they like, or just come for one roundtable. Lunch is provided! While all events are free and open to the public, please register online to help us plan for the roundtables, lunch, and wireless access for those without a Virginia Tech affiliation. If you have questions, please contact me, Philip Young, at pyoung1@vt.edu or 540-231-8845. Hope to see you there! #OpenDataDay #CodeAcross
