James Evans 1), Bev Acreman 2)
1) Open Repository, E-mail: info@openrepository.com
2) BioMed Central, 236 Gray's Inn Road, London, WC1X 8HB, United Kingdom, E-mail: Bev.Acreman@biomedcentral.com
Mandated storage, curation and management of research data is likely to significantly challenge and change established institutional repository processes in the coming years. At the major repository conferences, management of research data and metadata has been the hot topic of recent years. The biggest institutions, with large research communities and well-resourced digital libraries, are already actively working on integrating ‘data’ into their repository ecosystems. But institutional repository systems within institutions of all sizes will face significant challenges in the way they collect, curate and store different types of datasets, as well as competition from a range of external services seeking to serve as the record of authority.
Governments are increasingly looking at open data as a societal good [1]. Scientists are scrutinising the reproducibility of research (and the related rise in retractions [2], due to poor training as well as to fraud), and authors increasingly want tools that help visualise their data alongside the manuscript “wrapper”.
In 2013, the Office of Science & Technology Policy (OSTP) in the USA called on federal agencies with budgets of more than $100 million to develop a “strategy for improving the public's ability to locate and access digital data resulting from federally funded research” [3].
Unsurprisingly, given her trenchant support for all things “open”, Neelie Kroes, EU Commissioner for the Digital Agenda, stated in March 2012:
“Let me underline one initiative that I am supporting to make digital technology work for governance and transparency: by opening up public data. In the digital age, data takes on a whole new value, and with new technology we can do great things with it. Opening it up is not just good for transparency, it also stimulates great web content, and provides the fuel for a future economy… That's why I say that data is the new oil for the digital age.” [4]
And the EU Commission, in 2013, included an Open Research Data Pilot in its Horizon 2020 plans [5].
Clearly, technological changes to repository systems are needed to underpin these requirements for data deposit.
Current and emerging online software tools will challenge the way research publishing and preservation is done in the future. With these tools, datasets are created and stored at a far earlier point than traditional institutional repository workflows engage, yet the repository infrastructure seems suited to data capture and classification. Institutions must find a way for repositories to become usefully involved in dataset ingestion at an earlier stage, for example within a laboratory context (see the deposit sketch below). With a mixture of commercial and open source tools such as online cloud storage, collaborative writing, versioning, real-time processing and visualisation, a dataset may easily be stored and catalogued elsewhere. The other issue facing repositories is preservation at scale, as a survey presented at the 2014 Confederation of Open Access Repositories (COAR) annual conference reported:
“A 2011 survey of 1700 researchers across disciplines undertaken by the journal Science found that 48.3% of respondents were working with datasets that were less than 1 GB in size and over half of those polled store their data only in their laboratories” [6]
But the long tail of dataset types, individually and collectively, can also scale over time to terabyte size at the institutional level.
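Whatever the eventual scale, a practical first step towards earlier engagement is machine deposit. DSpace and EPrints both expose the SWORD v2 deposit protocol, so a laboratory script can push a packaged dataset straight into a repository collection for later curation. The following is a minimal sketch only: the collection URL, credentials and file name are hypothetical placeholders, and a real workflow would add error handling and richer packaging.

```python
import requests

# Hypothetical SWORD v2 collection endpoint and laboratory credentials.
COLLECTION = "https://repository.example.ac.uk/swordv2/collection/lab-datasets"
AUTH = ("lab-deposit-account", "secret")

# Push a zipped dataset as a SWORD v2 binary deposit. "In-Progress: true"
# leaves the item open so curators can enrich the metadata before release.
with open("experiment-2014-11.zip", "rb") as package:
    response = requests.post(
        COLLECTION,
        auth=AUTH,
        data=package,
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "filename=experiment-2014-11.zip",
            "Packaging": "http://purl.org/net/sword/package/SimpleZip",
            "In-Progress": "true",
        },
    )

# A successful deposit returns 201 Created plus an Atom entry
# carrying the new item's identifiers.
print(response.status_code)
```

Because the deposit is a plain HTTP request, the same script can run from instrument control software or a scheduled job inside the laboratory, well before publication.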
Further complicating the institutional role in data preservation, dedicated data repository platforms have emerged, both free and commercial, such as Dryad, FigShare, CERN's Zenodo platform, EUDAT, CKAN and a range of other subject-specific data repositories. Their aim is to give researcher-authors a place to store, visualise and make available the data associated with a publication, along with a dedicated DOI and other useful identifiers. They also allow otherwise disregarded data to be stored, or embargoed where required. Deposit workflow is clearer and greatly simplified compared to many traditional repositories, utilising commercial cloud API services such as Dropbox. These new data repository services are filling the gaps between the laboratory, publication workflows and the institutional repository. The industry noise around these platforms, and their rapid development iteration, will impact on traditional institutional repository platforms such as DSpace, EPrints, Invenio and Fedora Commons, open source platforms driven by contributed code and varying levels of developer engagement. Some of the new data repository platforms are also moving into offering services for institutions, where their purpose and functionality will overlap significantly with traditional institutional repository platforms. Dryad, Fedora and Invenio are leading the way in driving institutional repository relevance for dataset storage and visualisation.
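To see why these deposit workflows feel simpler, consider Zenodo's public REST API: a dataset can be deposited, described and assigned a DOI in a handful of HTTP calls. This sketch follows the API as publicly documented; the access token, file name and metadata values are placeholders, and endpoint and field names should be checked against Zenodo's current documentation.

```python
import requests

API = "https://zenodo.org/api/deposit/depositions"
PARAMS = {"access_token": "YOUR-TOKEN"}  # placeholder personal token

# 1. Create an empty deposition.
deposition = requests.post(API, params=PARAMS, json={}).json()
dep_id = deposition["id"]

# 2. Attach the data file.
with open("results.csv", "rb") as fh:
    requests.post(f"{API}/{dep_id}/files", params=PARAMS,
                  data={"name": "results.csv"}, files={"file": fh})

# 3. Describe the dataset, then publish; Zenodo mints a DOI on publication.
metadata = {"metadata": {
    "title": "Example dataset",
    "upload_type": "dataset",
    "description": "Illustrative deposit only.",
    "creators": [{"name": "Researcher, A."}],
}}
requests.put(f"{API}/{dep_id}", params=PARAMS, json=metadata)
published = requests.post(f"{API}/{dep_id}/actions/publish",
                          params=PARAMS).json()
print(published["doi"])
```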
The institutional repository world is reacting to this changing landscape in varying ways. Organisations such as the Confederation of Open Access Repositories (COAR) [7] and JISC [8] in the UK have set up working groups to examine the best ways to handle datasets and to build support networks of repositories; they also consider policies and standards around institutional-level dataset preservation. Great emphasis has been placed on linking with, and harvesting from, existing data repositories. Beyond OAI-PMH compliance, definitive interoperability standards for linked data are yet to emerge. However, repositories can already assist with the citation of dataset contributors by implementing ORCID, and can enrich dataset metadata with funder information, as in the EU's OpenAIRE initiative.
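Harvesting of this kind typically rides on OAI-PMH, which both the new data repositories and the traditional platforms already speak. As a minimal sketch, using the open source Sickle library against Zenodo's OAI-PMH endpoint (any compliant endpoint would do):

```python
from sickle import Sickle  # pip install sickle

# Harvest Dublin Core records from an OAI-PMH-compliant repository;
# Zenodo's endpoint is shown, but the URL is interchangeable.
harvester = Sickle("https://zenodo.org/oai2d")
records = harvester.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True)

for record in records:
    meta = record.metadata  # dict of Dublin Core fields; each value is a list
    print(meta.get("identifier"), meta.get("title"))
```

Sickle follows resumption tokens transparently, so the loop pages through the repository's full record set.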
A further challenge to dataset preservation is the multitude of metadata standards and approaches to their representation. Within an individual institution there may be no single policy on the representation of descriptive metadata, which presents a considerable challenge for library data-curation tasks. In this regard, COAR has been consulting on ways to improve interoperability between repositories and with external systems. More complex datasets often require granular metadata description to enable discovery and use by humans or machines, and to align usefully with wider web information architecture. It is also unclear whether institutional repositories should include tools to easily visualise datasets from within the repository platform, or whether those tools should exist outside the repository domain.
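The granularity gap is easiest to see side by side. The records below are hand-written illustrations, loosely modelled on Dublin Core and on the DataCite schema respectively (field names simplified, identifiers dummy values): the flat record describes a publication adequately, but only the richer one tells a machine what was collected, where, when and under which grant.

```python
# A flat Dublin Core-style record: fine for a paper, thin for a dataset.
dc_record = {
    "dc:title": "Coastal water temperature readings 2010-2014",
    "dc:creator": "Researcher, A.",
    "dc:type": "Dataset",
    "dc:date": "2014",
}

# A DataCite-style record: the granularity curators and machines need.
datacite_record = {
    "identifier": {"identifierType": "DOI",
                   "value": "10.5072/example"},  # 10.5072 is the DataCite test prefix
    "creators": [{
        "name": "Researcher, A.",
        "nameIdentifier": {"scheme": "ORCID",
                           "value": "0000-0000-0000-0000"},  # dummy ORCID
    }],
    "titles": ["Coastal water temperature readings 2010-2014"],
    "resourceType": {"resourceTypeGeneral": "Dataset",
                     "resourceType": "Time series"},
    "dates": [{"date": "2010-01-01/2014-12-31", "dateType": "Collected"}],
    "geoLocations": [{"geoLocationPlace": "Cardigan Bay, UK"}],
    "fundingReferences": [{"funderName": "Example Research Council",
                           "awardNumber": "EX/123"}],
}
```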
To illustrate what can happen when a university is able to coordinate data curation within an institutional framework: at the 2014 Open Repositories Conference in Helsinki, David Groenewegen and colleagues from Monash University in Australia gave a presentation entitled ‘Breaking down the boundaries to storing, sharing and publishing research data’ [9]. It describes how research data is created, captured and moved between different repository use-cases as part of Monash University's ‘Data Curation Continuum’, a process that also integrates with a wider national storage and preservation system. It is an example of what is possible when a university unifies its approach to research data capture and preservation throughout the research lifecycle, and it serves as a model that other institutions and national preservation structures can study and emulate.
Deposit of research data into an institutional or digital repository serves a set of aims broadly aligned with those of more traditional document or multimedia submissions. Institutional repositories offer open access to stored content, granular metadata classification, and a range of tools that enable content discovery and dissemination via search engines and aggregators. The institutional repository (IR) therefore provides a solid basis for meeting emerging research data preservation mandates within an institutional curation framework.
Successful institutional or digital repositories, as measured by levels of internal engagement and external usage, require the sustained commitment of a number of stakeholders over a long period of time. For long-term research data preservation, and for useful, discoverable metadata classification, the wider investment an institution makes in running the repository and fitting it into a broader research-preservation ecosystem is paramount to the repository's success and relevance.
Conflicts: BioMed Central is the owner and developer of Open Repository, customised open source institutional repository software.
[1] OSTP, 2013. White House Office of Science & Technology Policy memorandum. http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf (Accessed 12 November 2014).
[2] Van Noorden R. Science publishing: The trouble with retractions [J]. Nature, 2011, 478: 26-28. doi:10.1038/478026a.
[3] OSTP, 2013. White House Office of Science & Technology Policy memorandum. http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf (Accessed 12 November 2014).
[4] Kroes N., 2012. Speech. http://europa.eu/rapid/press-release_SPEECH-12-149_en.htm (Accessed 12 November 2014).
[5] European Commission, 2013. Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020. http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf (Accessed 12 November 2014).
[6] Science Staff. Challenges and Opportunities [J]. Science, 2011, 331(6018): 692-693. DOI:10.1126/science.331.6018.692.
[7] Confederation of Open Access Repositories. https://www.coar-repositories.org/ (Accessed 12 November 2014).
[8] Notay B., 2014. Building a Cohesive Repository Infrastructure for the UK. http://www.doria.fi/handle/10024/97568 (Accessed 12 November 2014).
[9] Groenewegen D., et al., 2014. Breaking down the boundaries to storing, sharing and publishing research data. https://www.doria.fi/bitstream/handle/10024/97736/Breaking_down_the_boundaries_to_storing,_sharing_and_publishing_research_data.pdf?sequence=3 (Accessed 12 November 2014).