Data Journalism on Twitter - 2020

Preface

Acknowledgements

This research was carried out by Marta Fioravanti, Tong Zhang and Weixuan Yang for the Data Journalism module of the MA in Digital Humanities at King's College London.

Introduction

This research observes users and communities around the #ddj hashtag on Twitter. It intends to find the most important ones and to explore the geographical factors associated with them. It aims to understand the online ecosystem of data journalism and the real people behind it.

In general, the accounts actively engaging in data journalism discussion in 2020 have formed several groups based on their roles in the news industry, the media they are involved in, and the geographic area in which they are located. Most of the narrative readings resonate with the values and themes associated with news and investigative journalism at different levels (Bounegru et al., 2017).

This research combines qualitative and quantitative approaches. It is composed of three stages: data preprocessing, data mining and data interpretation. Only the users with the most incoming or outgoing mentions were kept, and the Leiden algorithm was used to generate well-connected clusters. These users’ Twitter profile information was then analysed in order to map their network and explore the characteristics of each cluster.
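As a rough illustration of the preprocessing stage, the following Python sketch builds a mention network from tweets and keeps only the users with the most incoming or outgoing mentions. It is not the code used in this research: the input format (pairs of author and text), the function names and the cut-off k are assumptions made for the example, and the Leiden clustering step is left out because it requires a third-party library.

```python
import re
from collections import Counter

# Assumption: a mention is any "@" followed by word characters.
MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Yield (source, target) pairs from (author, text) tuples.

    Handles are lower-cased because Twitter handles are case-insensitive;
    self-mentions are dropped.
    """
    for author, text in tweets:
        source = author.lower()
        for target in MENTION.findall(text.lower()):
            if target != source:
                yield source, target


def top_users(tweets, k):
    """Keep the k users with most outgoing and the k with most incoming mentions.

    Returns the set of kept users and the edges running between them.
    The value of k is a hypothetical parameter, not the study's threshold.
    """
    edges = list(mention_edges(tweets))
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    keep = {user for user, _ in out_degree.most_common(k)}
    keep |= {user for user, _ in in_degree.most_common(k)}
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges
```

The kept edges could then be passed to a community detection routine; which implementation of the Leiden algorithm was used is not stated in this report.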

Analysis #00

The characteristics of each cluster

Description

Using the Leiden algorithm, twelve communities were extracted: two of them were pruned since they were too small and sparse; the others were examined qualitatively to discern their characteristics.

Question

Mapping the communities in the online discussion around data journalism hashtags and finding out characteristics of different clusters.

Research protocol

Findings

The communities were generated by analysing the network of mentions between users, which is why some clusters seem to exist around one specific account. These clusters are generally associated with the partnership of organizations or with geographical factors.

Communities

The communities

How does each community contribute and act in the Data Journalism field? In this section, each cluster is analysed to discern its nature.
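The hashtag frequency tables listed for each cluster in this section can be reproduced with a simple count. The sketch below is only an assumption about how such tables might be built: the input format (author and text pairs), the `cluster_of` mapping from account to cluster number and the function name are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtags_by_cluster(tweets, cluster_of, top_n=10):
    """Count hashtags per cluster and return the top_n for each cluster.

    tweets: iterable of (author, text) pairs.
    cluster_of: dict mapping an author to a cluster id; authors that do not
    belong to any cluster are skipped.
    """
    counts = defaultdict(Counter)
    for author, text in tweets:
        cluster = cluster_of.get(author)
        if cluster is None:
            continue
        # Lower-case the text so that #DDJ and #ddj are counted together.
        for tag in HASHTAG.findall(text.lower()):
            counts[cluster]["#" + tag] += 1
    return {c: counter.most_common(top_n) for c, counter in counts.items()}
```

Lower-casing merges spelling variants of the same hashtag, but it does not merge misspellings or synonyms such as #covid and #covid19.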

Cluster 0: the big soup
5050oD Agencia_TatuDataharvestEIJCEl_Universal_MxEscolaDeDadosHarry_StevensHelenaWittlichLucasMarchesiniMapboxMediacareerngrMetropolesNYTmagNataliaMazotteOCCRPRosentalSCMPNewsSZTDataScienceTagesspiegelThaMatos_UOLNoticiasYouTube_davidmeidinger_fiquemsabendoabrajiagenciapublicabasedosdadosbellingcatbrauliolorentzcijournalismcolaboradadosdatentaeterindigitalcampbellelsalvadorcomflowingdatafolhafunkeinteraktivgijnjburnmurdochjourntoolboxjuditecyprestemarilia_gehrkemunichrockernhofmeisterortegopolisplateautonpudoraphael_veledaschoolofdatakgstonepeoplestorybenchtaisecoisasthe_claustoledoluizftrbrasilutknightcenterwwwproektmedia
# hashtag freq.
#ddj 345
#covid19 37
#dataviz 30
#data 19
#maps 14
#mapping 10
#coronavirus 8
#journalism 6
#cjsummer 4
#scraping 4

Cluster 0 shows a high level of diversity in its accounts' nationalities, active regions and identities. It involves users from Europe (about 63% from Germany), North America (90% from the US), Central and South America (nearly 92% from Brazil), and some from Asia and Africa. Nearly all accounts are mentioned by the Global Investigative Journalism Network (@gijn). As the most active account, @gijn frequently mentions other accounts in cluster 0, partly because it publishes daily lists of top 10 ddj works.

Accounts based in Central and South America make up a large proportion: one from Mexico (@El_Universal_Mx), one from El Salvador (@elsalvadorcom) and twenty-two from Brazil. With Brazilian actors taking up 32% of the whole cluster, it is not surprising to find that this clique consists of journalists working for media organisations like @Metropoles, @_fiquemsabendo (both members of cluster 0) and @G1 (cluster 8). Brazilian accounts are predominant in number and include data journalists, data aggregators, news media, investigative journalism organisations, and non-governmental organisations.

Germany, the most frequent European country in cluster 0, is represented by press organisations like the Süddeutsche Zeitung (@SZ) and Der Tagesspiegel (@Tagesspiegel) and by their journalists, whose influence sometimes reaches a global range. Their professional labels (data journalist, data visualisation personnel, programming journalist) demonstrate a clear division of work and a mature data journalism ecosystem. Actors in North America (the US and Canada) also have global influence.

Cluster 1: news organizations and data teams
5050oD AJEnglishAJLabsAdaLovelaceInstBBGVisualDataBigLocalNewsCNNCode4AfricaDataVizSocietyDatawrapperECONdailychartsFiveThirtyEightGuardianVisualsICIJorgJHUSystemsJSKstanfordNiemanLabODIHQONSPostGraphicsReutersReutersGraphicsSimonScarrTheEconomistUpshotNYTWarningGraphicCZDFheutecephillipscommonslibraryel_paisf_l_o_u_r_i_s_hgijnAfricaguardianinstituteforgovlisacrostnesta_uknewreldopenDemocracypewresearchpinardagpropublicarealDonaldTrumprisj_oxfordthemarkupwashingtonpost
# hashtag freq.
#ddj 238
#dataviz 105
#datajournalism 66
#opendata 55
#coronavirus 21
#covid19 17
#digitalgov 16
#jskbln 16
#covid 11
#datajourno 8

Cluster 1 mainly consists of big news organizations and their data teams, which seldom mention other accounts, whether inside or outside this community. The reason for this grouping is probably that its users are mentioned by either @gijnAfrica or @WarningGraphicC (the former shares investigative journalism references, the latter data journalism and visualization references).

Cluster 2: non-profit journalist network organizations and Sigma Awards
ARIJNetworkData_BlogEdjNetEdjQuoteFinderEvaConstantarasFinancialTimesGoogleNewsInitGurmanBhatiaICFJICFJKnightIJNetIJNetPortuguesIRE_NICARJuliaAngwinMaid_MarianneSCMPgraphicsTmarcoHaeleraqiajlabscraignewmarkdarrenlonghkdavidemancino1ejcnethaddadmekuangkengmtrpirespilhoferreginaldchuasigmaawardssmfrogerszeitonline
# hashtag freq.
#ddj 79
#datajournalism 79
#dataviz 23
#covid19 21
#data 16
#journalismmatters 12
#resources 12
#dataanalysis 12
#sigma21 10
#nisdata 8

Cluster 2 collects non-profit journalism organizations and accounts linked to the Sigma Awards. Most of these accounts are linked with each other and belong to the Sigma Awards management team or are media partners (such as @IJNet and @IRE_NICAR). This indicates that the organisation has a regional and international role.

Cluster 3: around datajournalism.com
AlbertoCairoCraigSilvermanEDHNoticiasFTGoogleJanWillemTulpMathias_FelipeRodrigoMenegatTextyOrgUaagencialupacivioclaralmmdatajournalismevabelmontefirstdraftnewsftdatajournalismfestkarim_douiebmaartenzamnoeL_maSnytgraphicsnytimespetransparentepuddingviz pulitzercenter
# hashtag freq.
#datajournalism 82
#dataviz 59
#ddj 44
#infographics 19
#data 10
#podcast 7
#covid19 6
#conversationswithdata 4
#journojobs 4
#gistribe 3

Cluster 3 revolves around @datajournalism, which mentions almost every account in the cluster. These accounts are mainly regional media organizations and data journalists from around the world. Six of them are contributors to datajournalism.com.

Cluster 4: BBC and BCU
AnnaekhooBBCNewsBCUJournalismC_AguilarGarciaEngland_Rob_GWJournalismHuffPostUKPaul_theChronTimHarfordalexhomerpaulbradshaw
# hashtag freq.
#ddj 48
#bcujournos 5
#councilcookies 4
#vis 2
#businessrates 2
#bbcshareddataunit 2
#covid19 2
#nodata 2
#charities 2
#podcast 1

Cluster 4 involves British data journalists’ accounts, which can be divided into BBC-related and BCU-related ones. The former work or used to work for the BBC; the latter either studied or teach on the Data, Multiplatform and Mobile Journalism MA at Birmingham City University. These two groups are merged into one cluster because @paulbradshaw interacts with both. He has worked with the BBC England data unit since 2015 and leads the MA at Birmingham City University.

Cluster 5: Nigeria
BifolaXBudgITngConnected_devDataphyteSchoolOfDataTheICIRfisayosoyombojayangbayippmonitorNGptcij
# hashtag freq.
#datajounalism 8
#ddj 5
#sigmas21 3
#nigeriadebts 1
#training 1
#thursdaymotivation 1
#radio 1
#data4radio 1
#covid19 1
#data4dev 1

80% of this cluster's users are from Nigeria. The accounts vary in category: there are organisations interested in civic topics, research and international development, non-governmental associations and news agencies, as well as journalists and entrepreneurs who run a website for learning data methods (@Dataphyte).

90% of them aim to promote the transparency of the Nigerian government and the financial progress of Africa.

Cluster 6: Northern Europe
EU_DataPortalEU_opendataEmmaQvirinHolstHecquetKKGattermannMarieSchoenningclaesdevreeseeuroparljournalismfund
# hashtag freq.
#ddj 2
#datajournalism 1
#dataharvest2020 1

All of this cluster's accounts originate in Western and Northern Europe (Belgium, the Netherlands, France and Denmark). About 44% of them are organisations, including the EU-funded data portal and a non-profit organisation focusing on journalism. Nearly 46% of them are personal accounts of either journalism practitioners or academics. The former, like @EmmaQvirinHolst and @HecquetK, work for Altinget.dk, a party-politically independent Danish online newspaper. The latter, like @claesdevreese and @KGattermann, are researchers from the University of Amsterdam specialising in political communication.

The most active accounts of this cluster are @EU_opendata, the account of an EU event to be held on 23-25 November 2021, and @MarieSchoenning, a student assistant editor at the University of Amsterdam (UvA), an intellectual hub of the Netherlands.

Cluster 7 and Cluster 8: Brazil
ceciliadolagocepespgabrielacaesarge_brandinoturicasEstadaog1generonumerogiubianconi
# hashtag freq.
#datajournalism 8
#journalismodedados 7
#ddj 3
#sql 3
#covid19 1
#dataviz 1
#gender 1
#genero 1
#opendata 1
#python 1

These clusters are located in Brazil and discuss politics, education, health and security issues in Brazil and, sometimes, in Latin America. They are composed of journalism professionals.

Cluster 7 is a group of Brazilian journalists, with one exception: the Center for Politics and Economics in the Public Sector Studies (@cepesp). @gabrielacaesar is a journalist working for @g1, a global news portal based in Brazil and an active member of Cluster 8.

As the only personal account in Cluster 8, @giubianconi is a data journalist who mainly covers gender, race and politics in Latin America, and seems to connect the cluster. Likewise, Cluster 7 is linked through @ge_brandino, a journalist who uses data reporting in her work.

Cluster 9: global
UNDataForumearthjournalismresource_watch
# hashtag freq.
#ddj 1
#opendata 1
#datajournalism 1

This cluster consists of three organisation accounts that are led from Western countries but address the whole world. It includes the UN World Data Forum (@UNDataForum), an annual event funded by the United Nations, which gathers data experts and users to spur data innovation and build a pathway to better data for sustainable development.

The other two accounts, Earth Journalism Network (@earthjournalism) and Resource Watch (@resource_watch), were both initiated in the US and empower journalists and data visualisation experts to explore global sustainable development by addressing environmental topics.

Analysis #01

#ddj protagonists

Description

This section studies individual contributions to the #ddj environment.

Question

Finding the most active users and the most influential ones, and trying to understand whether they coincide, whether they belong to any community, and whether any lone wolves exist.

Research protocol

The analysis was carried out following two approaches, which were then compared.

Local approach

This approach only considered people belonging to a community, to understand who, among the most connected users, emerges.

Global approach

This approach considered the whole dataset to verify if the local rankings corresponded to the global ones.

Findings

Roles in the network

There are three classes of users: the mostly inactive, the spreaders (those who mention other users), and the ones that are mostly mentioned: the stars.

Distribution of incoming mentions
Distribution of outgoing mentions

As might be expected, the majority of users fall into the first category: the casual users. Most of them were excluded from the analysis. As for the other two classes, they rarely overlap: the accounts at the extremes of the distributions tend either to primarily mention other users or to be cited by many people.

This is not surprising, since some accounts exist to spread specific content made by and addressed to a precise target, in this case (data) journalists, activists, and designers.
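The three roles described in this subsection can be expressed as a small rule over each user's incoming and outgoing mention counts. The sketch below is a hypothetical formalisation, not the procedure followed in this research: the threshold `min_activity` and the tie-breaking rule are assumptions chosen only for illustration.

```python
def classify_role(mentions_in, mentions_out, min_activity=5):
    """Assign a user to one of three roles from mention counts.

    mentions_in: how many times the user is mentioned by others.
    mentions_out: how many times the user mentions others.
    min_activity: assumed cut-off below which a user counts as mostly inactive.
    """
    if mentions_in + mentions_out < min_activity:
        return "inactive"
    # Arbitrary choice: a user with equal counts is treated as a spreader.
    if mentions_in > mentions_out:
        return "star"
    return "spreader"
```

For example, `classify_role(40, 2)` returns `"star"`, while `classify_role(3, 60)` returns `"spreader"`.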

A local glance at the most important users

It is useful to understand in what measure each user contributes to the Data Journalism community on Twitter. This subsection reveals the most active or influential users, and the spreaders. This classification follows the first approach, excluding the users outside the communities.

A profile was considered active if it produces a large number of tweets, and influential if many other users cite its work. It was named a spreader if it mainly retweets other users' work.

Two rankings were produced: one considering retweets, and the other excluding them, thus prioritising original material.
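The two rankings can be sketched in Python as two counters over the same tweets. The input format (pairs of author and a retweet flag) and the function name are assumptions for this example; they do not describe the dataset actually used.

```python
from collections import Counter


def activity_rankings(tweets, top_n=10):
    """Rank authors by number of tweets, with and without retweets.

    tweets: iterable of (author, is_retweet) pairs.
    Returns two lists of (author, count): the first counts every tweet,
    the second counts original tweets only.
    """
    including_retweets = Counter()
    excluding_retweets = Counter()
    for author, is_retweet in tweets:
        including_retweets[author] += 1
        if not is_retweet:
            excluding_retweets[author] += 1
    return (
        including_retweets.most_common(top_n),
        excluding_retweets.most_common(top_n),
    )
```

An account that mostly retweets can lead the first ranking and drop in the second, which is the contrast the two rankings are meant to expose.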

Activity ranking including retweets
Activity ranking excluding retweets

At first glance, it seems more useful to look at the data without retweets. Here, 4 of the top 10 users belong to cluster 1. This is not so surprising, considering that it is primarily populated by big news agencies.

@gijn can be considered the most active user, also in light of the other plot: it retweeted only 22 times, far fewer than the 192 retweets of @pinardag. Both users are significantly above the average number of #ddj tweets per user in 2020, which is around 2.5 (excluding the zero values).
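An average that excludes zero values, as used for the figure of around 2.5 above, can be computed as follows. The function name and the sample numbers are made up for illustration and are not taken from the dataset.

```python
def mean_nonzero(counts):
    """Return the mean of the strictly positive values in counts.

    Users with zero tweets are ignored; an input without positive values
    yields 0.0 instead of raising a division error.
    """
    nonzero = [count for count in counts if count > 0]
    if not nonzero:
        return 0.0
    return sum(nonzero) / len(nonzero)


# Toy data: two users with no tweets and four users with 1 to 4 tweets.
example_average = mean_nonzero([0, 0, 1, 2, 3, 4])
```

Excluding zeros raises the average compared with a plain mean, so the two figures should not be compared directly.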

The mentioners
The most mentioned

This second analysis focuses on the level of interaction between the users (excluding retweets). The situation displayed by the plots shows cluster 1 as very present among the top 10 spreaders.

While @pinardag disappears, @gijn continues to dominate the rankings, consolidating its role as an authority in Data Journalism, at least on Twitter. @datajournalism, the third most active user, also confirms its importance by becoming the second most mentioned account in the network.

Another aspect to notice is that cluster 0 receives a large number of mentions. Given the nature of its composition, these are probably mostly interactions inside the cluster. Qualitative evidence of this is the visual display of the community in the network visualisation at the beginning of the article.

A global glance at the most important users

Global placements (inside the network) are fundamental to see local rankings in perspective, because important actors not strongly socially connected could emerge.

Ranking including retweet activity

The ranking including retweets placed @jornalismoDados first, with 6,773 tweets. We decided to drop this information for the reasons explained in the following subsection.

Ranking excluding retweet activity

The global plots show a considerable number of independent users at the top of the ranking. This could suggest the presence of less represented communities, like the Russian one (suggested by @gijnRu) and the Turkish one (suggested by @DagmedyaVeri).

The mentioners

The ranking of the mentioners placed @jornalismoDados first, with 13,649 outgoing mentions. We decided to drop this information for the reasons explained in the next paragraphs.

The most mentioned

An anomaly was found: the two users at the top of the ranking, @EscolaDeDados and @jornalismoDados, probably represented the same institution. The latter was deleted before March 2021 in favor of @EscolaDeDados (cluster 0). This study therefore excluded @jornalismoDados because it unbalanced the data without adding any contribution. In any case, it can be assumed that @EscolaDeDados has quite an important role, even if its first place could be due to its strong relation with the twin account.

Again, @gijn, @pinardag and @datajournalism appear to be often mentioned, confirming the local observations.

Conclusion

This analysis revealed the presence of users of overall importance, like @gijn and @datajournalism, and showed that the most mentioned users, also in a global context, belong to a community. It also confirmed that the number of tweets is not the only key to popularity: in the global analysis, many active users disappear from the rankings of mentions. This suggests the importance for an individual of being part of a community and, at the same time, the difficulty small clusters have in emerging in the Twitter environment.

Analysis #02

The geography of voices

Description

The analysis explores the geographical distribution of the data journalism community. In addition, it will analyse the factors that may influence the formation of communities on Twitter in Europe, the US, Nigeria and Brazil, and the reasons why some regions are missing from this space.

Question

What is the geography of these voices, and who is missing from this space?

Research protocol

Findings

The main countries in the communities

Overview

The accounts located in the European countries and in the US constitute the majority of the 197 users in the communities and have a relatively close connection with users inside and outside their cluster. Also, the most mentioned ones and the top mentioners are mainly from these regions, suggesting the importance of European and American organizations in the data journalism Twitter environment.

In parallel, the accounts from other parts of the world, like Nigeria and Brazil, tend to interact within their cluster, except for some local organisations being mentioned by international accounts.

The absence of voices from most of the countries on other continents may suggest that these countries have not built up a Data Journalism system of international relevance.

Brazil

In recent years, the availability of open data in Brazil has significantly increased. Since the introduction of the Freedom of Information law in 2012, the country has been recognized as a leader in disclosing government data (Faleiros, 2012), and its government has demonstrated concern for improving information disclosure and transparency, to facilitate public scrutiny and encourage democratic participation.

Data journalism has been on the rise since the Access to Information Act came into force in 2012. Journalists gather at conferences and regular events where they can share information and help each other learn new methods (Kunova, 2018). The School of Data Brazil (@EscolaDeDados in cluster 0) founded the main Data Journalism conference in Brazil, supporting community meet-ups and training more than 6,000 people.

Nigeria

Cluster 5 is characterised by Nigerian nonprofit investigative news agencies, data journalism organizations and civic tech organizations. These groups have developed a network and work collaboratively to improve journalism. In Nigeria, corruption has permeated almost every sector of the economy and has contributed to the country’s current economic decline. Therefore, the fight against corruption is one of the top priorities of the Government and the whole society, and investigative journalism plays a crucial role. Many organizations have trained journalists to expose malpractice and combat corruption. However, despite tremendous improvements in computer and internet technology, many reporters still ‘struggle to work at this crucial intersection of journalism and numbers’ (Adebajo, 2020).

To fill the vacuum of data-driven conversations in some underreported sectors in Nigeria, Joshua Olufemi (@jayangbayi, cluster 5) founded @Dataphyte, a startup focusing on data-driven storytelling (Adebajo, 2020). Olufemi worked for the Premium Times Centre for Investigative Journalism (@ptcij) and has experience in journalism. Although Dataphyte shares some similarities with BudgIT (@BudgITng), it is the first organization in Nigeria to treat data journalism as the core of its work and to dedicate itself to combining journalism with open data.

Europe and the United States

The world has witnessed the domination of the Western model in media education and practice. This hegemony over the mass media has long been held by the Western world, especially Europe and the United States, where complete media systems have been built. Western models of teaching and practising journalism are also imported into other regions of the world. For example, Dataphyte was inspired by FiveThirtyEight (@FiveThirtyEight), a US-based establishment that uses statistical analysis to tell stories (Adebajo, 2020), while J-Forum, an event to develop Japanese journalists’ skills and knowledge of investigative reporting, was prompted by organizations like Investigative Reporters and Editors (@IRE_NICAR) in the US and the Global Investigative Journalism Network (@gijn) (Alecci, 2020).

Some organizations located in Europe and the US not only focus on regional affairs, but also look to matters of global reach. The Global Investigative Journalism Conference held by GIJN is the world’s largest international gathering of investigative reporters. By focusing on skills and training, the conferences help to spread state-of-the-art data journalism and promote cross-border collaboration around the world. Accounts like @gijn and @datajournalism also play an important role in connecting journalists and organizations around the world by frequently mentioning and being mentioned by these accounts.

Missing voices

While major Western media such as the BBC and the New York Times build their Data Journalism teams with cutting-edge methods, most of the Asian accounts are either few in number or post content that has little connection with data journalism. Japan and China are taken as two examples to illustrate why some countries may be considered the missing voices in this research.

Japan

As one of the most frequently used social media platforms in Japan (Bigbeat, 2020), Twitter might be expected to be a seedbed for all kinds of interactions, including data journalism. However, Japanese accounts appear to have little influence. The reasons behind this are fourfold: a terminological rift, differences in talent cultivation, a lag in digitalisation, and limits on freedom of information.

In the researched dataset, nearly 78% of the Japanese tweets concern the promotion and users’ discussion of Rekordbox, a disc jockey software. The abbreviation of disc jockey (DJ) and Rekordbox’s new product, the DDJ-400, share the same hashtags as data-driven journalism, but with different meanings. Aside from this ambiguity, the idea of journalism in Japan differs from that in Western countries. Japanese practitioners who would be defined as journalists in Western countries usually consider themselves Kisha (reporters) or company employees (Hayashi and Kopper, 2014, p. 1144).

A recent Japanese graduate of Kanagawa University was interviewed to better understand the context. He pointed out that students aiming to be journalists would choose to major in Advertising. Despite their impressive educational backgrounds, elite journalists in Japan mainly gain their skills from practice, because they lack disciplinary training (Hayashi and Kopper, 2014, pp. 1140-1141). Moreover, the Japanese apprenticeship tradition makes it hard to master digital tools.

Yagi (2020) stated that, compared with Asian countries like China, South Korea and Singapore, Japan’s digital economy lags behind, which leads to a lack of awareness of and demand for data-oriented products. As a result, Japan does not have enough drive to build journalism tools, and lacks professionals who can handle digitalisation.

Regardless of the law’s requirements on transparency, the majority of officials and politicians don’t hand down information to the public (Hayashi and Kopper, 2014, p. 1144). They prefer organising conferences and interviews for practitioners within the Kisha-kurabu system, that is, an arrangement by which news-gathering activities are administered by the editorial offices of the member companies of the NSK (Hayashi and Kopper, 2014, p. 1143), the Japan Newspaper Publishers and Editors Association.

The concept of public information seems to be absent in Japan: its Freedom of Information Act has failed to help journalists access statistics. Japan’s recent State Secrets Law has also had a negative impact on press freedom because it blurs ‘personal information’ and ‘privacy’ (Alecci, 2020). Thus, limited access to governmental and public information also contributes to data journalism’s low profile in Japan.

China

Most of the few Chinese tweets detected in the dataset encourage people to learn data journalism skills and tools. It is safe to say that this term has a shared meaning in the English and Chinese contexts.

However, for Chinese users, Twitter is not as accessible as local microblog services such as Sina Weibo. Meanwhile, a data journalism environment is beginning to take shape on Sina Weibo thanks to the emergence of professional accounts, including Pictorial Digital Room with 774 thousand followers.

Consequently, Twitter may not be the only suitable research environment for data journalism, since it is not the most prevalent social media platform in some countries outside Europe and America. Hallin and Mancini (2012, p. 280) stated that detailed empirical analysis of particular media systems in their own historical and structural context is essential to conceptualising them. To explore data journalism’s development in nations whose media landscapes differ from the Euro-Atlantic media system, applying tailored research objects and methods provides more flexibility and authenticity. In general, data journalism discussion clusters remain a rather exclusive niche that requires educational, financial, and digital resources, which leads them to be scattered unevenly around the world.

Conclusions

This research strongly depends on two factors: the time frame and the data source. To begin with, 2020 saw two important events that could have distorted the analysis: the start of the pandemic and the US presidential election; it is possible that the discussion topics determined to some extent whether some users appeared. However, the top of the rankings does not seem to have been much altered: the most important users actually constitute a reference for data journalists.

The dataset source could pose more problems, since some data journalism communities do not organize through the same hashtags, or do not publish on Twitter at all. This bias suggests that most of the ten communities extracted in analysis #00 could be considered part of a single, larger Western cluster, since the majority of accounts and communities come from Europe and the United States.

This imbalance could be associated with local media freedom and access to information, or with the use of different means of communication. Since other datasets to measure these phenomena are quite difficult to produce, and since local factors seem to be so influential, a qualitative approach was preferred: looking at a few specific cases made it possible to reflect on the global Data Journalism environment and on how social media could influence its growth by limiting contact with communities that are less active on the Western mainstream platforms.