Research Data Management: Managing Data

Managing Data

Data management can broadly be considered any activity involving data other than actually using the data.

Data management includes any or all of the following:

  • Organising data into directories/folders and using meaningful filenames.
  • Keeping backups of data in case you accidentally delete or lose data.
  • Storing final state data in an archive.
  • Making data available to others via an archive or website.
  • Ensuring security of confidential data.
  • Collaboratively creating and using data with other researchers.
  • Synchronising data between desktop, laptop, USB key, cloud storage, etc.
  • Maintaining a bibliography and electronic copies of relevant literature.

Standard Methods of Data Organisation

File Storage and Transfers

Researchers collaborating on projects often need to share primary data and preliminary results, which means transferring data between computers. The most common method is email attachments, but mail systems limit the size of files that can be sent. Removable storage media, such as USB keys, CDs or DVDs, can carry large amounts of data, but require the researcher to physically take the data to its destination.

To assist good data management, VU has made the R: drive available to all researchers and research students. The VU R: drive is a central storage space which is secure and backed up. The default allocation for a project is 10 GB and more is available on request. The storage can be used for anything from working files to long term retention of research data/files.
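
Before requesting more than the default allocation, it can help to know how much space a project folder actually uses. The following is a minimal sketch, assuming that "10 GB" means 10 × 1024³ bytes (the source does not say which convention applies); the function names are illustrative and not part of any VU tool.

```python
from pathlib import Path

# Assumption: the 10 GB default allocation is read as 10 * 1024**3 bytes.
DEFAULT_ALLOCATION_BYTES = 10 * 1024**3


def directory_size(root: Path) -> int:
    """Return the total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def fits_allocation(root: Path, allocation: int = DEFAULT_ALLOCATION_BYTES) -> bool:
    """Return True if the folder's contents fit within the allocation."""
    return directory_size(root) <= allocation
```

Calling fits_allocation(Path("my_project")) on a hypothetical project folder returns False when the folder has outgrown the default allocation.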

Benefits of using the R: drive:

  • safety and security - greater than desktop storage or portable USB storage.
  • convenience - provided and managed by the University, including backups and maintenance of hardware.
  • collaboration - enables researchers and students to share data and work on the same project, reducing duplicate copies and the effort of transferring or emailing files.
  • retention - keep your research data in the long term, either to build a repository of your valuable data or to meet retention requirements (minimum 5 years after project).
  • supervision - HDR project data can be retained centrally, reducing the risk of loss for supervisors and students.

 

Remote Access and File Synchronisation

Researchers often work on their university desktop as well as a laptop, and possibly a home computer. Typically files are just copied back and forth between the computers. This is the most obvious method but has a number of drawbacks:

  • It is time consuming to manually copy files.
  • You have multiple copies of data and you can easily lose track of which copy is the latest version.
  • If both copies have been modified, then it is easy to overwrite some changes without knowing.
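
The last drawback can be made concrete with a short sketch. It assumes you recorded a checksum of the file the last time the two copies were in sync; the function names and the returned labels are hypothetical, chosen only for illustration.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_status(copy_a: Path, copy_b: Path, last_synced_hash: str) -> str:
    """Classify two copies of a file against the hash recorded at the last sync."""
    hash_a, hash_b = file_hash(copy_a), file_hash(copy_b)
    if hash_a == hash_b:
        return "identical"
    if hash_a == last_synced_hash:
        return "only B changed"   # safe to copy B over A
    if hash_b == last_synced_hash:
        return "only A changed"   # safe to copy A over B
    return "conflict"             # both changed: copying either way loses edits
```

Without the recorded checksum there is no way to tell "only one copy changed" from "both changed", which is why manual copying makes it so easy to overwrite work.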

 
If you find you are synchronising your data regularly and are experiencing difficulties with this, then you should consider using the R: drive and accessing it remotely through the VU VPN (Virtual Private Network). This way you can edit all your data in one place. You will need to install the VU VPN software on the remote computer, run it and then connect. If you're using a VU laptop you may have access to the R: drive as soon as you are connected. If not, you may need to follow the next section on non-VU PCs.

Download the instructions from the ITS Wireless Network Knowledgebase (ignore the confusing name) and download the software from the ITS downloads page (bottom of page). There are instructions and downloads for both Windows and Mac.

 

Collaboration

Many research projects are carried out collaboratively: between postgraduates and their supervisors; within departmental research groups; as cross-discipline research, and as inter-university research. When working with a large volume of research data, it is worth considering using collaborative tools such as the R: drive or the AARNet CloudStor service. CloudStor allows people to send or distribute a large file to a number of internal and external colleagues. Despite the name, you cannot "store" files in CloudStor for any reasonable period. CloudStor is the equivalent of email for large files.

 

Version Control

When data is constantly being edited, especially by multiple users, it is a good idea to implement some form of version control to keep track of changes. This can be as simple as appending a number or date to the end of a file name after each major edit. For example:

Journal_v1.0.tex, Journal_v1.2.tex
Journal_Feb12.tex, Journal_May5.tex
Journal_Feb12_John_DRAFT_WithSallysEdits_NewDiagram.tex
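
If you do use numbered file names, a consistent, zero-padded number keeps files sorted in order. The sketch below shows one possible way to pick the next name; the stem_vNN pattern and the function name are assumptions made for illustration, not a required convention.

```python
import re


def next_version_name(existing_names, stem, suffix):
    """Return the next 'stem_vNN<suffix>' name given the names already in use."""
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(suffix)}$")
    versions = []
    for name in existing_names:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    # Start at v01 when no earlier version exists; pad to two digits so
    # that v02 sorts before v10 in a file listing.
    return f"{stem}_v{max(versions, default=0) + 1:02d}{suffix}"
```

For example, next_version_name(["Journal_v01.tex", "Journal_v02.tex"], "Journal", ".tex") gives "Journal_v03.tex".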

Such conventions are good for simple work but quickly become unmanageable when you have multiple authors or make lots of edits. The alternative is to use revision (or version) control software. Such programs offer several advantages:

  • The software requires you to input a description of the changes made, which makes it easier to pick up where you left off and for collaborators to see what you are doing.
  • You can be confident with making major changes as you can revert to an old version if you make a mistake. You can also easily compare two versions to help you find errors.
  • Useful for people who use more than one computer. It implicitly provides synchronisation and is good for resolving conflicting changes.

TortoiseSVN is a popular client for the Subversion version control system. It integrates with Windows Explorer, making it one of the easiest to use.
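
To illustrate what "comparing two versions" means, the following minimal sketch uses Python's standard difflib module to show the lines that differ between two versions of a text. It is not how Subversion or TortoiseSVN works internally; the function name and labels are illustrative only.

```python
import difflib


def compare_versions(old_text: str, new_text: str,
                     old_label: str = "old", new_label: str = "new") -> str:
    """Return a unified diff of two versions of a text ('' if identical)."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    )
    return "".join(diff)
```

In the result, lines starting with "-" appear only in the old version and lines starting with "+" appear only in the new one, which is the same kind of view a version control client gives you when you compare two revisions.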

Benefits of Good Data Management

Managing your data allows you to work more efficiently, produce higher quality data, achieve greater exposure for your research, and protect your data from being lost or misused.

Back Up, Back Up, Back Up

Making regular backups of data is probably the most important and, fortunately, one of the easiest tasks to manage.

Although most people are quite aware of the risk and cost of losing data through hard drive failure or accidental deletion, it is best to have a policy and schedule in place for maintaining data backups. When considering your backup strategy, you need to know:

  • How often will you make backups?
  • How long will backups need to be stored?
  • How much hard-drive space, or how many DVDs, will be required to maintain this backup schedule?
  • If the data is sensitive, how will it be secured and (possibly) destroyed?
  • What backup services are available that meet these needs? If there are none, what will be done instead?
  • Who will be responsible for ensuring backups are available?
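
The storage question above comes down to simple arithmetic. The sketch below is a rough estimate only: it assumes every backup is a full copy of the data (no compression or incremental backups) and uses the nominal 4.7 GB capacity of a single-layer DVD. The function names are illustrative.

```python
import math

DVD_CAPACITY_GB = 4.7  # nominal capacity of a single-layer DVD


def retained_backup_count(interval_days: int, retention_days: int) -> int:
    """Number of backups held at once when one is made every interval_days."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    return math.ceil(retention_days / interval_days)


def storage_needed_gb(data_size_gb: float, interval_days: int,
                      retention_days: int) -> float:
    """Total space needed, assuming each backup is a full copy of the data."""
    return data_size_gb * retained_backup_count(interval_days, retention_days)


def dvds_needed(total_gb: float) -> int:
    """Minimum number of single-layer DVDs needed to hold total_gb."""
    return math.ceil(total_gb / DVD_CAPACITY_GB)
```

For example, 10 GB of data backed up weekly and kept for four weeks needs storage_needed_gb(10, 7, 28) = 40 GB, or at least 9 DVDs.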

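Backup security requires further mention. If the data is sensitive then it should not be stored on a computer that is connected to the internet, and preferably not on one connected to any network. If the data needs to be destroyed at the end of a project then consider what level of destruction is required: simply deleting files does not remove the data from the disk. For a modern magnetic hard drive, overwriting the entire drive once is generally considered enough to prevent recovery, while solid-state drives and USB keys are more reliably handled with the device's built-in secure erase function or by physical destruction.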

You can use the VU R: drive for backup. If you use it as your main storage, backups happen automatically.
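
If you also make your own backups, for example to an external drive, it is worth checking that each copy matches the original. The following is a minimal sketch using only Python's standard library; the function name and folder layout are hypothetical.

```python
import filecmp
import shutil
from pathlib import Path


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and verify the copy byte for byte."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    # shallow=False forces a comparison of the file contents,
    # not just size and modification time.
    if not filecmp.cmp(source, target, shallow=False):
        raise OSError(f"Backup of {source} does not match the original")
    return target
```

A backup that has never been checked may turn out to be unreadable on the day you need it, so a verification step like this is a cheap safeguard.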