Sensitive data

Data that can in some way identify individuals require special caution. You should never publish data that identify individual people, and you must treat such data with extra care when collecting, transferring or storing them for research purposes.

Personal data and sensitive data

Combining data from different categories increases their degree of sensitivity

The possibility of identifying individuals from a data set depends greatly on the data themselves, but also on the population from which the data are collected. Note that data that do not in themselves identify people may do so when they are combined with other data. In addition, data concerning behavioural patterns should be treated as personal; EU case law has established that even dynamic IP addresses can count as personal data.

The following main categories of data should always be treated as personal data:

Geographical data: country, county, municipality, institution
Personal characteristics: body, personality traits, voice
Demographic data: age, gender, income, education, social status
Family data: number of children, family structure

Further, some other personal data can also be regarded as sensitive data under the Norwegian Personal Data Act. Data on the following must always be treated as sensitive:

Health-related issues
Criminal offences
Racial or ethnic origin
Religious beliefs
Union membership
Sexual matters

Although these data are sensitive personal data, we are still dealing with a scale of sensitivity here: you would normally treat data on jogging as less sensitive than data on sexually transmitted diseases, even though both are 'health-related issues'.

Data could of course be both personal and sensitive! If your data are neither of these, you probably do not need to seek approval before you start collecting or analysing them.

We will look at a few example research cases and illustrate how changes in the data collected, the sample size and the population can create a need for more rigorous data security, as both the sensitivity of the data and the probability of identifying the individuals behind the data increase.

We cannot propose a security solution that will work for all projects, but the following section will help you ask two important questions about your PhD project: how much security is required, and what measures are necessary to ensure the required level of security? You should preferably have answered these questions before you start any data management plan you may need.

Example research cases involving sensitive data

Imagine the following two research cases:

DATA ON HEALTH-RELATED MATTERS

Consider the following questions asked as part of the data collection phase of a project:

Do you work out regularly?
Have you walked for at least ten minutes every day the last week?
What kind of physical activity have you done the last week?
To what extent does physical activity help you to increase mastery in other areas?

DATA ON CRIMINAL OFFENCES

Imagine a project focusing on the connection between a criminal offence and difficulty in finding a job:

How long were you in prison for?
Did you work before going to prison?
How long were you unemployed after leaving prison?
At which stage in the process did the facts about your sentence come into focus?

(Adapted from HiOA)

These questions undoubtedly concern 'health-related matters' or 'criminal offences', and the answers must therefore be treated as sensitive data. Of course, the answers to the above questions are not problematic on their own, as they do not make it possible to identify any individual. When you start combining data for the purpose of analysis, however, the possibility of identification can arise, as described below.

Low-risk personal data

If your sample size and population are both fairly large, the risk of identifying the individuals answering the questions is small. For instance, with a sample of 5000 people from the population 'Norway', the possibility of identification is generally low. Collecting data such as age, gender, nationality and level of education for analytical purposes will still not make it possible to identify any individual. Remember, however, that combining data from different categories will increase their level of sensitivity, and will therefore require more security even though the potential for identifying individuals is not much greater. Asking the same questions as above, with the same sample and population size, could very well give you data with a greater need for security if you targeted only immigrant girls. Combining the sensitive data types 'ethnicity' and 'health' can make your data more sensitive!

If you look at the second example above, involving data on criminal offences, and sample the same 5000 people from the same population, data on age and gender alone would probably not put the individuals at immediate risk of identification. The high degree of sensitivity of the data would nonetheless make any identification following a data leak far more serious, and you should therefore require a higher level of security for the data.

High-risk personal data

The same two projects above could easily generate data that you should treat as having a much higher risk of personal identification. For instance, imagine that you put the questions on physical exercise to a sample of about 100 schoolgirls aged 12–14, and that you collect data on weight and height, size of school, parents’ income, education and employment status, number of siblings, parents’ marital status and friends’ activities. In a data set such as this, the combination of otherwise 'anonymous' data would indeed make it much easier to identify the individual schoolchildren answering the questions. Although the degree of sensitivity is not too high, as the data concern physical exercise rather than sexually transmitted diseases, the higher risk of identifying individuals still requires that the data be protected.
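One rough way to see how combining otherwise 'anonymous' variables raises the risk of identification is to count how many respondents share each combination of values. The sketch below is only an illustration: the function name smallest_group, the column names and the six made-up records are assumptions, not part of any real project or tool. A smallest group of size 1 means that at least one respondent is unique on the chosen variables.

```python
from collections import Counter

def smallest_group(records, variables):
    """Size of the smallest group of records sharing the same values
    for all the given variables (0 if there are no records)."""
    counts = Counter(tuple(r[v] for v in variables) for r in records)
    return min(counts.values(), default=0)

# Hypothetical, made-up records for illustration only.
records = [
    {"age": 12, "school_size": "small", "siblings": 1},
    {"age": 12, "school_size": "small", "siblings": 1},
    {"age": 13, "school_size": "small", "siblings": 2},
    {"age": 13, "school_size": "large", "siblings": 2},
    {"age": 14, "school_size": "large", "siblings": 0},
    {"age": 14, "school_size": "large", "siblings": 0},
]

# Age alone: every value is shared by two respondents.
print(smallest_group(records, ["age"]))
# Age combined with school size: some respondents become unique.
print(smallest_group(records, ["age", "school_size"]))
```

In this made-up data, age alone leaves every respondent in a group of two, while adding school size leaves some respondents alone in their group. A check like this is no substitute for a proper assessment, but it can show early on which combinations of variables deserve attention.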

Data on criminal offences would normally be considered more sensitive than data on physical activity. If your research involved the questions in the second case above, and you also collected data on gender, age, previous work status, level of education, parents’ income and family structure from a population of about 100 former inmates from western Norway, your data would probably be in the highest risk category! The high level of sensitivity of the data combined with the high potential of identification would require very high levels of security.

Unclear boundaries

The above examples show quite clearly that there are no fixed boundaries in terms of the risk of identification; combining different types of data may identify individuals, and the sizes of the sample and the population play an important role in the continuum of risk. The addition of sensitive data complicates this even more, as there is no fixed ranking of sensitivity of types of data, or within the categories themselves. Further, combining sensitive data from different categories will normally increase their level of sensitivity.
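Even without fixed boundaries, it can help to place a project roughly along the two dimensions discussed above: how sensitive the data are, and how easily individuals can be identified. The sketch below orders the four example cases along these lines. The two-step scale, the multiplication and the names Level and risk_score are assumptions made purely for illustration; they are not an official classification, and a real assessment must be made case by case.

```python
from enum import IntEnum

class Level(IntEnum):
    LOW = 1
    HIGH = 2

def risk_score(sensitivity, identifiability):
    """Crude combined score: higher means more security is needed.
    The multiplication is an illustrative assumption, not a standard."""
    return int(sensitivity) * int(identifiability)

# The four example cases from the text, placed on the assumed scale
# as (sensitivity, identifiability).
cases = {
    "exercise, 5000 people from Norway": (Level.LOW, Level.LOW),
    "criminal offences, 5000 people from Norway": (Level.HIGH, Level.LOW),
    "exercise, 100 schoolgirls, many variables": (Level.LOW, Level.HIGH),
    "criminal offences, 100 former inmates": (Level.HIGH, Level.HIGH),
}

# List the cases from the lowest to the highest combined score.
for name, levels in sorted(cases.items(), key=lambda kv: risk_score(*kv[1])):
    print(risk_score(*levels), name)
```

The point of the exercise is the ordering, not the numbers: the project that combines highly sensitive data with a high potential for identification ends up at the top, as in the examples above.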

You need to plan your work in advance with regard to these issues; using your smartphone to record highly sensitive interview data and sending the data to your Gmail account for storage would be considered very poor data management and could put your entire project at risk. Writing a proper data management plan in advance can save you from such a situation.

Useful resources for sensitive data

The University of Oslo offers a set of services for sensitive data, known as TSD. These services can be used for collecting, storing and analysing sensitive data. TSD is available to researchers at the University of Oslo and other research institutions.