GenAI in Research: Research Data & GenAI

Copyright

Copyright law as it applies to GenAI inputs and outputs is still evolving and may vary across jurisdictions.

An important factor to be aware of is that the training materials for these tools can contain copyrighted or proprietary material for which copyright clearance has not been obtained and attribution has not been given.

This means that, from their very inception, AI tools are conceived with no regard to intellectual and creative property rights, which makes using these tools and respecting creators' rights difficult to reconcile.

There are, however, some general rules that will help you use GenAI in a compliant and responsible way:

 

Copyright & GenAI Inputs

Generally, you, the user, are responsible for obtaining the appropriate license or copyright clearance for any third-party material that you input into a GenAI tool.

University researchers often work with materials for which copyright clearance is difficult to obtain, or is not theirs to obtain. You should never input these materials into any GenAI tool.

These include:

  • University-copyrighted material, such as course materials, lectures, and other university IP
  • Large datasets scraped from systematic or structured searches
  • Any Indigenous Cultural and Intellectual Property
  • Any resources (such as journal articles, books, images, etc.) obtained from library databases or the library collection. These materials are licensed to VU and should not be entered into GenAI tools.
  • Any creative works for which you have not obtained the artist's or creator's permission to use in this way

 

Copyright & GenAI Outputs

Because GenAI tools are trained and developed with no regard for the copyright and IP of authors and artists, any output you obtain from them (to help with a grant application, figures, a literature synthesis, and so on) may contain copyright-protected material for which no permission or license has been sought.

In other words, it is impossible to know definitively whether your AI tool is giving you pirated material.

Privacy

Participant and Sensitive Data

You likely already know that GenAI tools are not secure or compliant repositories for data, and therefore:

  Never input confidential or sensitive data into a GenAI tool

This applies especially to data generated by human participants. Inputting such data would contravene your research ethics agreement, state and federal privacy acts, and the code of responsible research, to name a few.

  • No research data from human subjects
  • No student essays or other materials students may have provided you as part of their course participation or assessments 

 

Your Privacy

A lesser-known issue, however, is that LLMs are able to

 accurately infer your private information from your text prompts

even when you have consciously anonymised your inputs, or prompted about something completely unrelated to your identifying data.

LLMs are trained on large datasets, including geolocated census data, that help them develop identifying capacities (WIRED, 17 Oct 2023).

Because LLMs are not secure or compliant data repositories, the private information that they infer from prompts can be exposed in various ways.

If you are curious, researchers from the Secure, Reliable, and Intelligent Systems Lab (SRILAB) at ETH Zurich have developed a tool that lets you test your privacy inference skills against current LLMs' capabilities.

 

Personal data is subject to data privacy legislation.

  • Australia. The Privacy Act applies to all uses of AI involving personal information (OAIC, 2024).
    • Personal information may not be input without consent.
  • Europe. The EU General Data Protection Regulation (GDPR) limits collection, use, and retention of personal data.
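    • De-identified data that can still be linked back to a person (pseudonymised data) is still covered by the GDPR; only fully anonymised data falls outside it.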

Legal requirements on privacy vary with jurisdiction. If your research involves other jurisdictions, you will need to check the relevant requirements.