Data Collection in Online Terrorism and Extremism Research: Future Directions

26th January 2024 Dr. Ryan Scrivens

By Dr. Ryan Scrivens, Dr. Joshua D. Freilich, Dr. Steven Chermak and Dr. Richard Frank

Introduction

The role of the Internet in facilitating violent extremism and terrorism is a primary concern for many researchers, practitioners, and policymakers around the world. The so-called Islamic State, an internationally designated terrorist organisation, released a steady stream of video-recorded beheadings of Western hostages and other atrocity footage to radicalise some and intimidate others. Violent anti-fascist extremists used social media to instigate widespread violence against law enforcement during COVID-19 lockdowns and following George Floyd’s killing. Many of the right-wing extremists who engaged in violence during the January 6th Capitol Riot also used online channels to coordinate and/or boast about their involvement. Understandably, law enforcement and intelligence communities have become invested in examining the digital footprints of violent extremist movements. It also comes as no surprise that online terrorism and extremism research has grown rapidly in recent years, with a variety of data collection techniques emerging to address key research questions in the space. The primary focus of this effort has been on extracting open-source, publicly available information from active data sources (e.g., social media platforms, websites, blogs, forums) and informative sources (e.g., online newspapers, government reports, existing databases).

As studies in this research area have increased, researchers have adopted techniques ranging from manual extraction to computational tools to collect online information. Yet, despite the growing efforts in online terrorism and extremism research to collect open-source information, little is known about the methodological, practical, and ethical challenges of open-source data collection in this research space in particular, or in terrorism and extremism studies in general. Instead, what we generally know comes from studies that briefly highlight limitations specific to a project or a particular research method.

In our forthcoming article in the special issue ‘The strengths and struggles of different methods of research on radicalization, extremism and terrorism’ in Studies in Conflict & Terrorism, edited by Frank Weerman and Elanie Rodermond, we examine key strengths, limitations, and ethical concerns associated with open-source data collection methods commonly used in online terrorism and extremism research. The purpose of this Insight is to assist researchers and analysts in choosing between commonly used data collection methods in this regard, as summarised in Table 1.

Table 1. Strengths and limitations of data collection methods in online terrorism and extremism research.

Technique	Approach	Strengths	Limitations
Computational	Third-party APIs	Time-efficient, large-scale data collection; Relative ease of data collection; Standard, simplified data collection; Minimal technical skills required for data collection, formatting, or archiving; Minimal need to clean, transform, or prepare collected data	Can be costly; Restrictions on data collection; Black box problem
	Commercial crawlers	Time-efficient, large-scale data collection; Relative ease of data collection; Technical skills not required for data collection, formatting, or archiving; Minimal need to clean, transform, or prepare collected data; Flexible data filtering options via interactive interfaces	Can be costly; Restrictions on data collection; Black box problem
	Custom-written crawlers	Time-efficient, large-scale data collection; Source code available for data collection; Ability to collect and then merge disparate data sources to be analysed as one; Filtering tools to extract specific information of interest; Customisable for various platforms	Technical skills required for data collection, formatting, or archiving; Specific parameters required for data collection; Rate-limiting techniques from websites can slow data collection; Software, which can be expensive, is needed to format and archive collected data
Manual	Database development	Relevant and quality information is collected; Data is focused on the specific problem; Follow up on uncovered leads to track down specific materials or variables; Targeted searches to fill in missing values; Better positioned to capture latent constructs and improve validity	Labour-intensive and time-consuming data collection; Coder training is critical to guard against reliability issues; Amount and type of information available about different types of events varies considerably; A universal protocol for data collection does not exist across databases; Inclusion/exclusion criteria for databases are often not transparent; Missing data

Importantly, we also set forth suggestions for progressing data collection efforts in online terrorism and extremism research based on the methods highlighted above, which is the focus of this Insight. By no means, however, do we provide suggestions based on every study on, or trend in, data collection in online terrorism and extremism research. Instead, our suggestions are derived from what we view as key current and emerging trends from our involvement in the field. We have contributed to the expansion of online terrorism and extremism research, from developing computational tools for large-scale extraction and analysis of extremist content online at the International CyberCrime Research Centre to creating the open-source database US Extremist Cyber Crime Database to better understand online pathways to radicalisation and mobilisation. These experiences have provided us with unique insights regarding the usefulness of various open-source data collection efforts in online terrorism and extremism research and avenues for future work.

First, combining data extraction techniques in online terrorism and extremism research, such as blending manual and automated data extraction techniques or linking commercial crawlers with other data extraction tools, will advance research in this space. These combinations, although relatively rare in the online terrorism and extremism literature, have shown signs of success, in part because a technical background is not required for data collection and because researchers can draw from the abovementioned strengths of each extraction technique. Combining techniques will also help researchers better understand what is captured and what is missing using different strategies, as well as identify areas where adjustments in the process should be made. In addition, combining techniques may be helpful in addressing some of the more challenging aspects of data collection in contemporary online terrorism and extremism research, such as identifying and then collecting image and video-based content from online sharing apps such as Instagram and TikTok or from encrypted communication apps such as Telegram and Signal, or even from gaming platforms such as Steam and Twitch. Here, violent extremist content, users, or networks of interest could be manually identified from these platforms, and then the data could be extracted using computational techniques.
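The manual-then-computational workflow described above can be illustrated with a minimal sketch. It assumes that posts have already been gathered by some crawler or API into simple records and that a coder has flagged accounts of interest by hand. The field names (`author`, `text`), the sample records, and the helper `extract_seed_posts` are hypothetical and are not drawn from any particular tool or study.

```python
from collections import defaultdict


def extract_seed_posts(posts, seed_accounts):
    """Keep only posts written by manually identified accounts.

    Returns a pair: posts grouped by (lower-cased) account name, and the
    seed accounts for which no posts were captured at all.
    """
    wanted = {name.lower() for name in seed_accounts}
    by_account = defaultdict(list)
    for post in posts:
        author = post["author"].lower()
        if author in wanted:
            by_account[author].append(post["text"])
    # Seeds with no captured posts show what the automated step missed.
    missing = sorted(wanted - set(by_account))
    return dict(by_account), missing


# Hypothetical records standing in for the output of a crawler or API.
collected = [
    {"author": "UserA", "text": "first post"},
    {"author": "userB", "text": "unrelated post"},
    {"author": "usera", "text": "second post"},
]
# Hypothetical accounts flagged by hand during manual review.
seeds = ["UserA", "UserC"]

captured, missing = extract_seed_posts(collected, seeds)
print(captured)
print(missing)
```

Returning the seeds that produced no captured posts is one simple way to see what an automated step missed, which is the kind of check that combining techniques makes possible.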

Second, future data collection efforts would benefit from the integration of traditional methods (e.g., in-depth interviews or surveys) with computational methods to address key research questions with policy implications. Scrivens and colleagues, for example, used a customised web crawler to extract online content from a sample of violent and non-violent right-wing extremists who were identified by a former violent extremist during an in-depth interview. Here, the researchers were in a unique position to identify which online users engaged in violent extremism offline and to compare an array of their online behaviours with those of their non-violent counterparts. Such an open-source dataset containing users’ offline violent behaviour is indeed rare in online terrorism and extremism research, as most studies that draw from open-source data simply do not have access to ground truth. This is a main limitation of open-source data generally, not only in terrorism and extremism research, because developing a high level of confidence in the accuracy of second-hand information is challenging without first-hand collection of such data.

Third, researchers must make archives of extremist online content accessible to other researchers. Access to data in online terrorism and extremism research remains a challenge for many in the field, especially junior and early career scholars who may not have the resources or skillsets. This is despite the various calls from researchers to make such content more widely available for research purposes. Surprisingly, to date, only a small number of individuals have contributed to this initiative. The Dark Web Project, for example, collected and made available the content of 28 jihadi forums comprising over 13 million messages. The Dark Crawler database includes, but is not limited to, over 11 million posts from the most conspicuous right-wing extremist forum, Stormfront; over 8 million posts that include Islamist content; as well as over 49 million posts drawn from 11 right-wing extremist subreddits – all are available to users for research upon request. The Repository of Extremist Aligned Documents (READ), developed by the International Centre for the Study of Radicalisation at King’s College London, is a database comprising far-right/ideologically motivated violent extremism documents and primary sources which have been collected and curated with academic rigour. The purpose of READ is to provide researchers and practitioners with a secure and controlled database to explore the various dimensions of radicalisation, violent extremism, and terrorism.

These exceptional databases are few and far between, and they have not been widely used by researchers, perhaps because they are less known than widely used databases such as the Global Terrorism Database. Regardless, providing researchers with access to non-traditional data sources, especially open-source intelligence and social media data, will undoubtedly transform the future understanding of violent extremism and terrorism in general and online terrorism and extremism in particular. Indeed, the tech industry must play a role in collaborating with researchers collecting data on their platforms. The Global Internet Forum to Counter Terrorism (GIFCT) is at the helm of these efforts, from its Hash-Sharing Database to its various Working Groups to combat terrorists’ and extremists’ use of digital platforms. But even with these important and ongoing efforts, much more is needed to cultivate academic-practitioner cooperation and data-sharing in online terrorism and extremism research.

Lastly, in addition to collecting and then sharing open-source data among key stakeholders generally, those working in online terrorism and extremism research should triangulate data across databases and datasets. Taking a lead in this respect are, for example, Holt, Freilich, Chermak, and LaFree, who triangulated data between the Extremist Crime Database and the Profiles of Individual Radicalization in the United States database, testing whether various criminological theories account for on- and offline pathways to extremist violence. This provided multiple observational points to explore the similarities and differences across offenders’ backgrounds, attitudes, and behaviour. Perhaps equally valuable would be for researchers to merge such databases with databases that include extremist online content, such as the abovementioned Dark Web Project and The Dark Crawler databases, and to develop a central database in which content from the various online platforms that violent extremists and terrorists are known to frequent is made available in one space. This would place researchers in a better position to explore key questions in online terrorism and extremism research, such as whether consumption of violent extremist online content leads directly to violent acts that would not have occurred without the Internet.
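To make the idea of triangulating across databases concrete, here is a minimal sketch that links two record sets and reports which cases appear in only one of them. It assumes the two sources share a reliable identifier, which is often not the case in practice and may require careful manual matching. The function `triangulate`, the field names, and the sample records are hypothetical and do not reflect the structure of any database named above.

```python
def triangulate(records_a, records_b, key):
    """Link two record sets on a shared identifier and report the gaps.

    Returns linked cases (keeping each source's fields separate so that
    provenance is preserved), plus the identifiers found in only one source.
    """
    index_b = {row[key]: row for row in records_b}
    keys_a = {row[key] for row in records_a}
    linked = [
        {"key": row[key], "a": row, "b": index_b[row[key]]}
        for row in records_a
        if row[key] in index_b
    ]
    only_a = sorted(keys_a - set(index_b))
    only_b = sorted(set(index_b) - keys_a)
    return linked, only_a, only_b


# Hypothetical records; the identifiers and fields are illustrative only.
offences = [
    {"case_id": "c1", "offence_type": "online"},
    {"case_id": "c2", "offence_type": "offline"},
]
backgrounds = [
    {"case_id": "c2", "prior_record": True},
    {"case_id": "c3", "prior_record": False},
]

linked, only_offences, only_backgrounds = triangulate(
    offences, backgrounds, "case_id"
)
print(linked)
print(only_offences, only_backgrounds)
```

Keeping the fields from each source side by side, rather than overwriting one with the other, lets disagreements between databases remain visible for coders to resolve.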

Tags: P/CVE, Research
