Researchers at Trinity College Dublin have been working on a system to improve technology that harvests data from social media by analysing how it may intrude upon individual privacy. The system, called an Intrusion Index, detects potentially private information in digital data so that this information can be deleted if necessary.
During a natural disaster, a large volume of information is shared on social media sites like Facebook and Twitter. Some of this information contains private data that could be used to identify individuals, but it is difficult to process all of this data. Slándáil researchers have been looking at ways to better protect sensitive information, including encryption and anonymisation methods, and part of this work is a novel system based on Named Entity Recognition.
Work on the Intrusion Index began in 2014 for Slándáil, and testing and development are ongoing. The index searches online text for named entities, including place-names and people’s names, and creates a log when this data is detected in social media text. The system is now being tested on social media data.
The Intrusion Index is being designed for the Slándáil system in order to better protect the privacy of individuals who may be named on public social media sites. When implemented, it will highlight named entities that appear in text so that the system can later privatise this data, for example by automatically deleting all place and person names. Other data, such as the content of tweets, can still be useful to train the system for future natural disasters, so deleting only the sensitive data leaves the rest of the text available for that purpose.
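The detect, log and privatise steps described above can be illustrated with a minimal Python sketch. This is not the Slándáil implementation: the name lists, the handle pattern, the function names and the sample message are all hypothetical, and a simple lookup stands in for a trained Named Entity Recognition model. The sketch also swaps each detected entity for a placeholder rather than deleting it outright, so that the rest of the message stays readable.

```python
import re

# Hypothetical lookup lists; a real system would rely on a trained NER model.
PERSON_NAMES = {"Mary Murphy"}
PLACE_NAMES = {"Dublin", "Cork"}

# Assumed pattern for Twitter-style handles: "@" followed by word characters.
HANDLE_RE = re.compile(r"@\w+")

def find_entities(text):
    """Return a log of (start, end, label, surface) tuples sorted by position."""
    log = []
    for m in HANDLE_RE.finditer(text):
        log.append((m.start(), m.end(), "HANDLE", m.group()))
    for label, names in (("PERSON", PERSON_NAMES), ("PLACE", PLACE_NAMES)):
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                log.append((m.start(), m.end(), label, m.group()))
    return sorted(log)

def redact(text, log):
    """Replace each logged span with a placeholder, working right to left
    so that earlier character offsets stay valid."""
    for start, end, label, _ in sorted(log, reverse=True):
        text = text[:start] + "[" + label + "]" + text[end:]
    return text

message = "Flooding near Cork, @rescue_ie says Mary Murphy is safe"
entity_log = find_entities(message)
print(redact(message, entity_log))
# prints: Flooding near [PLACE], [HANDLE] says [PERSON] is safe
```

Keeping the log separate from the redaction step mirrors the description above: entities are first recorded, and the decision to remove them can be taken later.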
Early tests on the Intrusion Index have shown that fewer than one in five words in public media sources is a person’s name. Many of these names have been found to belong to public officials, such as heads of state or emergency managers.
However, named entities can include Twitter handles and place-names, and any data that can identify an individual person should be removed from the Slándáil system unless it can be used to help protect them from danger.
Tests were conducted on sample sets of social media messages collected from sites such as Twitter and Facebook to determine potential entity occurrences. Processing the messages according to the Intrusion Index method identified names of institutions, events, people and places, as well as Facebook and Twitter names. In one particular sample of 27,000 tagged words, some 8.4% were institutions, 7.9% events, 7.5% person names, 4% places, and 57.1% Twitter and Facebook handles.
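To give a sense of scale, the reported shares can be turned into approximate word counts. The short Python sketch below does only this arithmetic; the variable names are illustrative, and because the published percentages are rounded, the counts are estimates rather than the project's actual tallies. The listed categories add up to 84.9% of the sample, and the remainder is not broken down here.

```python
# Reported sample size and category shares (percent of tagged words).
TOTAL_WORDS = 27_000
SHARES = {
    "institutions": 8.4,
    "events": 7.9,
    "person names": 7.5,
    "places": 4.0,
    "Twitter and Facebook handles": 57.1,
}

# Approximate counts, rounded to whole words.
counts = {name: round(TOTAL_WORDS * pct / 100) for name, pct in SHARES.items()}
remainder = TOTAL_WORDS - sum(counts.values())

for name, n in counts.items():
    print(f"{name}: about {n} words")
print(f"not broken down: about {remainder} words")
```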
Research on the Intrusion Index is ongoing, and as the Slándáil prototypes are developed over the coming 24 months, the index will be tested more frequently.