Title: | Coordinated Networks Detection on Social Media |
Description: | Detects a variety of coordinated actions on social media and outputs the network of coordinated users along with related information. |
Authors: | Nicola Righetti [aut, cre] |
Maintainer: | Nicola Righetti <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.1.2 |
Built: | 2025-03-11 15:24:42 UTC |
Source: | https://github.com/nicolarighetti/coortweet |
Calculate account statistics: total posts shared, average time delta, average edge symmetry score.
account_stats(coord_graph, result, weight_threshold = c("full", "fast", "none"))
coord_graph | an igraph object generated by generate_coordinated_network. |
result | a data.table generated by detect_groups. |
weight_threshold | The threshold to be used for filtering the graph (options: "full", "fast", or "none"). |
This helper function returns summary statistics for the accounts in the network. When applied to a network for which a narrower time_window has been calculated with the flag_speed_share function, the summary statistics are computed separately for the full and faster networks, depending on the weight_threshold option. When this option is set to "full", metrics are computed on the set of nodes and edges surpassing the user-defined edge_weight threshold in the generate_coordinated_network function; metrics for nodes and edges in the fastest network are also returned, but calculated on that subgraph. The same applies when weight_threshold is set to "fast": metrics are calculated on the fast subgraph. When the option is set to "none", the entire input graph is considered without further subsetting.
The node share count is computed from the table returned by detect_groups. If the optional flag_speed_share function was used and statistics are calculated on the fastest graph (weight_threshold = "fast"), the share count considers only shares made in the fastest time window; otherwise (weight_threshold = "full" or weight_threshold = "none"), shares in the largest time window are considered. The share count includes all shares made by an account, regardless of whether they were made in a coordinated fashion according to the edge weight threshold. In other words, it is a measure of an account's activity in the time window under consideration.
a data.table with summary statistics for each account
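A minimal usage sketch (the 0.99 threshold is an illustrative choice, and `input` is a hypothetical data.table already standardized with prep_data):

library(CooRTweet)
# `input` has the columns object_id, account_id, content_id, timestamp_share
result <- detect_groups(input, time_window = 10, min_participation = 2)
coord_graph <- generate_coordinated_network(result, edge_weight = 0.99)
# Per-account summary statistics on the threshold-filtered ("full") network
stats <- account_stats(coord_graph, result, weight_threshold = "full")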
Function to perform the initial stage in detecting coordinated behavior. It identifies pairs of accounts that share the same objects in a time_window. See details.
detect_groups(x, time_window = 10, min_participation = 2, remove_loops = TRUE, ...)
x | a data.table with the columns: object_id, account_id, content_id, timestamp_share (see prep_data). |
time_window | the number of seconds within which shared contents are considered coordinated (defaults to 10 seconds). |
min_participation | The minimum number of actions required for an account to be included in subsequent analysis (defaults to 2). This ensures that only accounts with a minimum level of activity in the original dataset are included. It is important to distinguish this from the frequency of repeated interactions an account has with another specific account, which is represented by edge weight; the edge weight parameter is used in the generate_coordinated_network function. |
remove_loops | Should loops (shares of the same objects made by the same account within the time window) be removed? (defaults to TRUE). |
... | keyword arguments for backwards compatibility. |
This function performs the initial stage of coordinated behavior detection by identifying accounts that share identical objects within the same temporal window; it is preliminary to the network analysis conducted with the generate_coordinated_network function.
detect_groups groups the data by object_id (which uniquely identifies content) and calculates the time differences between all content_id (IDs of account-generated contents) within each group. It then filters out all pairs of content_id whose time difference exceeds the time_window (in seconds) and returns a data.table with all IDs of coordinated contents. The object_id can be, for example, hashtags, IDs of retweeted tweets, or shared URLs. For Twitter data, it is best to use reshape_tweets.
a data.table with the IDs of coordinated contents. Columns: object_id, account_id, account_id_y, content_id, content_id_y, time_delta. The account_id and content_id represent the "older" data points, account_id_y and content_id_y the "newer" ones. For example, if account A retweets account B, then account A's content is newer (i.e., account_id_y).
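A self-contained sketch using the package's own simulate_data to produce input with the required columns:

# The first list element is a simulated input table with the columns
# object_id, account_id, content_id, timestamp_share
sim <- simulate_data(n_accounts_coord = 5, n_accounts_noncoord = 4)
input <- sim[[1]]
# Pairs of accounts sharing the same object_id within 10 seconds
result <- detect_groups(input, time_window = 10, min_participation = 2)
head(result)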
A private utility function that removes loops (i.e., accounts sharing their own content) from the result.
do_remove_loops(result)
result | The result of the previous filtering steps. |
The result with loops removed.
This private function filters the result by the minimum number of participations required.
filter_min_participation(x, result, min_participation)
x | The original data table, to which a preliminary filter is applied. |
result | A data table containing the result data from calc_group_combinations. |
min_participation | The minimum activity threshold. Accounts with a participation count greater than this threshold are retained in the final 'result' table. |
A data table with filtered rows based on the specified minimum participation.
This function takes the results of detect_groups and generates a network from the data. It performs the second step in coordinated detection analysis by identifying users who repeatedly engage in identical actions within a predefined time window. The function offers multiple options to identify various types of networks, allowing for filtering based on different edge weights and facilitating the extraction of distinct subgraphs. See details.
generate_coordinated_network(x, fast_net = FALSE, edge_weight = 0.5, subgraph = 0, objects = FALSE)
x | a data.table (result from detect_groups) with the columns: object_id, account_id, account_id_y, content_id, content_id_y, time_delta. |
fast_net | If the data.table x has been updated with the flag_speed_share function and this parameter is set to TRUE, two columns, weight_full and weight_fast, are created: the first contains the edge weights of the full graph, the second those of the subgraph comprising the shares made in the narrower time window. |
edge_weight | The edge weight threshold, expressed as a percentile of the edge weight distribution within the network. It also applies to the faster network, if 'fast_net' is set to TRUE (and the data has been updated using the flag_speed_share function). Edges are marked 1 if their weight exceeds this threshold and 0 otherwise. The parameter accepts any numeric value between 0 and 1; the default is 0.5, the median of the edge weights in the network. |
subgraph | Generate and return the specified subgraph (default value is 0, meaning that no subgraph is created). |
objects | Keep track of the IDs of shared objects for further analysis with group_stats. |
Two users may coincidentally share the same objects within the same time window, but it is unlikely that they do so repeatedly (Giglietto et al., 2020). Such repetition is thus considered an indicator of potential coordination. This function utilizes percentile edge weight to represent recurrent shares by the same user pairs within a predefined time window. By considering the edge weight distribution across the data and setting the percentile value p between 0 and 1, we can identify edges that fall within the top p percentile of the edge weight distribution. Selecting a sufficiently high percentile (e.g., 0.99) allows us to pinpoint users who share an unusually high number of objects (for instance, more than 99% of user pairs in the network) in the same time window.
The graph also incorporates each node's contribution to the pair's edge weight, specifically the number of shared content_id that contribute to the edge weight. Additionally, an edge_symmetry_score is included, which equals 1 in cases of equal contributions from both users and approaches 0 as the contributions become increasingly unequal. The edge_symmetry_score is determined as the proportion of the unique content_ids (unique content) shared by each vertex to the total content_ids shared by both users.
This score, along with the contribution counts, can be used for further filtering or for examining cases where the score is particularly low. In an undirected graph, the activity of highly active users can disproportionately affect the weight of edges connecting them to less active users. For instance, if user A shares the same objects (object_id) 100 times, and user B shares the same object only once, but within a time frame that matches the time_window for all of user A's 100 shares, then the edge weight between A and B will be 100, although this weight is almost entirely driven by the hyperactivity of user A. The edge_symmetry_score, along with the counts of shares by each user user_id and user_id_y (n_content_id and n_content_id_y), allows this phenomenon to be monitored and controlled.
A weighted, undirected network (igraph object) where the vertices (nodes) are users and the edges (links) represent shared membership in coordinated groups (object_id).
Giglietto, F., Righetti, N., Rossi, L., & Marino, G. (2020). It takes a village to manipulate the media: coordinated link sharing behavior during 2018 and 2019 Italian elections. Information, Communication & Society, 23(6), 867-891.
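A hedged sketch of this second step (`result` is the output of detect_groups; the 0.99 percentile is an illustrative choice):

# Flag edges above the 99th percentile of the edge weight distribution
# and keep object IDs for later use with group_stats
coord_graph <- generate_coordinated_network(
  result,
  edge_weight = 0.99,
  objects = TRUE
)
igraph::vcount(coord_graph) # number of accounts (vertices)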
An anonymized multi-platform and multi-modal dataset of social media messages from the 2021 German election campaign. Includes Facebook and Twitter posts.
german_elections
A data.frame with 218,971 rows and 7 columns:

- character, shorthand for platform
- integer
- integer, anonymized URL contained in post
- integer, anonymized hashtag contained in post
- integer, anonymized domain of the URL
- integer, anonymized perceptual hash of shared image
- numeric, timestamp of post
Righetti, N., Giglietto, F., Kakavand, A. E., Kulichkina, A., Marino, G., & Terenzi, M. (2022). Political Advertisement and Coordinated Behavior on Social Media in the Lead-Up to the 2021 German Federal Elections. Düsseldorf: Media Authority of North Rhine-Westphalia.
With this helper function, you can obtain summary statistics for the objects in the network.
group_stats(coord_graph, weight_threshold = c("full", "fast", "none"))
coord_graph | an igraph object generated by generate_coordinated_network. |
weight_threshold | The level of the network for which to calculate the statistics. It can be "full", "fast", or "none". The first two options are applicable only if the data includes information on a faster network, as calculated with the flag_speed_share function. These options preliminarily filter the nodes based on their inclusion in the subgraph filtered by the edge weight threshold ("full"), in the subgraph of edges created in the faster time window and surpassing the edge weight threshold in that network ("fast"), or apply to the unfiltered graph ("none"). |
a data.table with summary statistics
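Continuing the sketch above (`coord_graph` built with objects = TRUE):

# Summary statistics per shared object on the threshold-filtered network
obj_stats <- group_stats(coord_graph, weight_threshold = "full")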
EXPERIMENTAL. Batched version of load_tweets_json with control over retained columns. Not as efficient as load_tweets_json, but requires less memory. A wrapper around the function fload from RcppSimdJson.
load_many_tweets_json(data_dir, batch_size = 1000, keep_cols = c("text", "possibly_sensitive", "public_metrics", "lang", "edit_history_tweet_ids", "attachments", "geo"), query = NULL, query_error_ok = TRUE)
data_dir | string leading to the directory containing the JSON files. |
batch_size | integer specifying the number of JSON files to load per batch. Default: 1000. |
keep_cols | character vector with the names of the columns you want to keep. Set it to NULL to keep all columns. |
query | (string) JSON Pointer query passed on to fload (optional). Default: NULL. |
query_error_ok | (Boolean) whether to continue if the query causes an error. Default: TRUE. |
Unlike load_tweets_json, this function loads the JSON files in batches and processes each batch before loading the next. You can specify which columns to keep, which in turn requires less memory. For example, you can decide not to keep the text column, which requires quite a lot of memory.
a data.table with all tweets loaded
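A sketch, assuming "tweets/" is a directory of Twitter V2 JSON files (the path and column choice are illustrative):

# Load in batches of 1000 files, dropping the memory-hungry "text" column
tweets <- load_many_tweets_json(
  data_dir = "tweets/",
  batch_size = 1000,
  keep_cols = c("lang", "public_metrics")
)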
Very efficient and fast way to load tweets stored in JSON files. A wrapper around the function fload from RcppSimdJson.
load_tweets_json(data_dir, query = NULL, query_error_ok = TRUE)
data_dir | string leading to the directory containing the JSON files. |
query | (string) JSON Pointer query passed on to fload (optional). Default: NULL. |
query_error_ok | (Boolean) whether to continue if the query causes an error. Default: TRUE. |
This function is optimized to load tweets that were collected using the academictwitteR package (Twitter API V2). It uses RcppSimdJson to load the JSON files, which is extremely fast and efficient. It returns the Twitter data as is. The only changes are that the function renames the id of tweets to tweet_id and deduplicates the data (by tweet_id). The function expects that the individual JSON files start with data.
a data.table with all tweets loaded
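A minimal sketch (the directory path is illustrative):

# Load every JSON file in the directory (file names starting with "data")
tweets <- load_tweets_json("tweets/")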
Very efficient and fast way to load user information from JSON files. A wrapper around the function fload from RcppSimdJson.
load_twitter_users_json(data_dir, query_error_ok = TRUE)
data_dir | string leading to the directory containing the JSON files. |
query_error_ok | (Boolean) whether to continue if a query causes an error. Default: TRUE. |
This function is optimized to load user data JSON files that were collected using the academictwitteR package (Twitter API V2). It uses RcppSimdJson to load the JSON files, which is extremely fast and efficient. It returns the user data as is. The only changes are that the function renames the id of users to user_id and deduplicates the data (by user_id). The function expects that the individual JSON files start with user.
a data.table with all users loaded
Utility function that normalizes text by removing mentions of other accounts, removing "RT", converting to lower case, and trimming whitespace.
normalize_text(x)
x | The text to be normalized. |
The normalized text.
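A small illustration (the exact output depends on the normalization rules described above):

x <- "RT @SomeAccount: Check THIS out  "
# mentions and "RT" removed, lower-cased, whitespace trimmed
normalize_text(x)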
Function to rename columns of a given data.table. This function standardizes column names to "object_id", "account_id", "content_id", and "timestamp_share". It is useful for preparing datasets for further analysis by ensuring consistent column naming.
prep_data(x, object_id = NULL, account_id = NULL, content_id = NULL, timestamp_share = NULL)
x | A data.table or an object that can be converted into a data.table. This is the dataset whose columns will be renamed. |
object_id | The current name of the column that should be renamed to "object_id". If NULL, no renaming is performed on this column. |
account_id | The current name of the column that should be renamed to "account_id". If NULL, no renaming is performed on this column. |
content_id | The current name of the column that should be renamed to "content_id". If NULL, no renaming is performed on this column. |
timestamp_share | The current name of the column that should be renamed to "timestamp_share". The data in this column should be either in UNIX format or in a "%Y-%m-%d %H:%M:%S" format. If the data is in a different format or conversion is unsuccessful, the function stops with an error. If NULL, no renaming or conversion is performed on this column. |
This function allows the user to specify the current names of columns in their data.table that they wish to rename to a standard format. The function checks for each parameter and renames the corresponding column in the data.table. If the parameter is NULL, no change is made to that column. The function ensures the data input is a data.table; if not, it converts it before renaming. For the 'timestamp_share' column, the function expects the format to be either UNIX format (integer representing seconds since the Unix epoch) or "%Y-%m-%d %H:%M:%S". If the 'timestamp_share' is in a different format, the function attempts to convert it to UNIX format using base R functions.
A data.table with the specified columns renamed according to the input parameters. If no renaming is required, the original data.table is returned unaltered.
dt <- data.table::data.table(old_object_id = 1:3, old_account_id_y = 4:6)
dt <- prep_data(dt, object_id = "old_object_id", account_id = "old_account_id_y")
Reformat nested Twitter data (retrieved from the Twitter V2 API). Spreads out columns and reformats a nested data.table into a named list of unnested data.tables. All output is in long format.
preprocess_tweets(tweets, tweets_cols = c("possibly_sensitive", "lang", "text", "public_metrics_retweet_count", "public_metrics_reply_count", "public_metrics_like_count", "public_metrics_quote_count"))
tweets | a data.table to unnest: Twitter data loaded with load_tweets_json. |
tweets_cols | a character vector specifying the columns to keep (optional). |
Restructures the nested Twitter data that you loaded with load_tweets_json. The function unnests the following columns: public_metrics (likes, retweets, quotes), referenced_tweets (IDs of "replied to" and "retweet"), and entities (hashtags, URLs, other accounts). It returns a named list of data.tables, each representing one aspect of the nested data. The function also expects the following additional columns to be present in the data.table: created_at, tweet_id, author_id, conversation_id, text, in_reply_to_user_id. Implicitly dropped column: edit_history_tweet_ids.
a named list with 5 data.tables:

- tweets (contains all tweets and their meta-data)
- referenced (information on referenced tweets)
- urls (all URLs mentioned in tweets)
- mentions (other accounts mentioned in tweets)
- hashtags (hashtags mentioned in tweets)
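A sketch of the unnesting step (the path is illustrative):

tweets <- load_tweets_json("tweets/")
pre <- preprocess_tweets(tweets)
# Named list: access the unnested aspects individually
pre$tweets   # tweets and their meta-data
pre$hashtags # hashtags mentioned in tweets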
Reformat nested Twitter user data (retrieved from the Twitter V2 API). Spreads out columns and reformats the nested data.table to long format.
preprocess_twitter_users(users)
users | a data.table with unformatted (nested) user data. |
Takes the Twitter user data that you loaded with load_twitter_users_json and unnests the following columns: public_metrics and entities.
a data.table with reformatted user data.
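The analogous step for user data (the path is illustrative):

users <- load_twitter_users_json("users/")
users_flat <- preprocess_twitter_users(users)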
Utility function that removes hashtags from text.
remove_hashtags(x)
x | The text to be processed. |
The text without hashtags.
Reshape twitter data for coordination detection.
reshape_tweets(tweets, intent = c("retweets", "hashtags", "urls", "urls_domains", "cotweet"), drop_retweets = TRUE, drop_replies = TRUE, drop_hashtags = FALSE)
tweets | a named list of Twitter data (output of preprocess_tweets). |
intent | the desired intent for analysis (see details). |
drop_retweets | Should retweets be dropped? (defaults to TRUE). |
drop_replies | Should replies be dropped? (defaults to TRUE). |
drop_hashtags | Should hashtags be dropped? (defaults to FALSE). |
This function takes the pre-processed Twitter data (output of preprocess_tweets) and reshapes it for coordination detection with detect_groups. You can choose the intent for reshaping the data: "retweets" to detect coordinated retweeting behaviour; "hashtags" for coordinated usage of hashtags; "urls" to detect coordinated link sharing behaviour; "urls_domains" to detect coordinated link sharing behaviour at the domain level; "cotweet" to detect coordinated co-tweeting behaviour (accounts posting the same text). The output of this function is a reshaped data.table that can be passed to detect_groups.
a reshaped data.table
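A sketch of the Twitter pipeline up to detection (`pre` is the preprocess_tweets output from the sketch above; the intent is an illustrative choice):

# Reshape for coordinated retweeting and hand over to detect_groups
retweets <- reshape_tweets(pre, intent = "retweets")
result <- detect_groups(retweets, time_window = 10, min_participation = 2)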
An anonymized dataset of tweets. All IDs have been obscured using the SHA-256 algorithm.
russian_coord_tweets
A data frame with 35,125 rows and 4 columns:

- ID of retweeted content. The Twitter API calls this "referenced_tweet_id".
- ID of the user who tweeted. Twitter API: "author_id".
- Tweet ID.
- Integer. Timestamp (POSIX time).
Kulichkina, A., Righetti, N., & Waldherr, A. (2024). Protest and repression on social media: Pro-Navalny and pro-government mobilization dynamics and coordination patterns on Russian Twitter. New Media & Society. https://doi.org/10.1177/14614448241254126
Create a simulated input and output of the detect_groups function.
simulate_data(approx_size = 200, n_accounts_coord = 5, n_accounts_noncoord = 4, n_objects = 5, min_participation = 3, time_window = 10, lambda_coord = NULL, lambda_noncoord = NULL)
approx_size | the approximate size of the desired dataset. It automatically calculates the lambdas passed to rpois (see details). |
n_accounts_coord | the desired number of coordinated accounts. |
n_accounts_noncoord | the desired number of non-coordinated accounts. |
n_objects | the desired number of objects. |
min_participation | the minimum number of repeated coordinated actions required to define two accounts as coordinated. |
time_window | the time window of coordination. |
lambda_coord | (optional) the lambda of the Poisson distribution (rpois) used to populate the coordination matrix for coordinated accounts (see details). |
lambda_noncoord | (optional) the lambda of the Poisson distribution (rpois) used to populate the coordination matrix for non-coordinated accounts (see details). |
This function generates a simulated dataset with fixed numbers for coordinated accounts, uncoordinated accounts, and shared objects. The user can set minimum participation and time window parameters and the coordinated accounts will "act" randomly within these restrictions.
The size of the resulting dataset can be adjusted using the approx_size parameter, and the function will return a dataset of approximately the requested size. The size can also be adjusted with the lambda_coord and lambda_noncoord parameters, which correspond to the lambda of the Poisson distribution (rpois) used to populate the coordination matrix. If lambda is between 0.0 and 1.0, the dataset will be smaller than with lambdas greater than 1. The approx_size parameter simply serves to set the lambda of the rpois function in a more intuitive way.
a list with two data frames: a data frame with the columns required by the detect_groups function (object_id, account_id, content_id, timestamp_share), and the output table of the same detect_groups function, with columns: object_id, account_id, account_id_y, content_id, content_id_y, time_delta.
# Example usage of simulate_data
## Not run:
set.seed(123) # For reproducibility
simulated_data <- simulate_data(
  n_accounts_coord = 100,
  n_accounts_noncoord = 50,
  n_objects = 20,
  min_participation = 2,
  time_window = 10
)

# Extract input
input_data <- simulated_data[[1]]

# Extract output and keep coordinated actors.
# This is expected to correspond to CooRTweet results from `detect_groups`
simulated_results <- simulated_data[[2]]
simulated_results <- simulated_results[simulated_results$coordinated == TRUE, ]
simulated_results$coordinated <- NULL

# Run CooRTweet using the input_data and the parameters used for simulation
results <- detect_groups(
  x = input_data,
  time_window = 10,
  min_participation = 2
)

# Sort data tables and check whether they are identical
data.table::setkeyv(simulated_results, names(simulated_results))
data.table::setkeyv(results, names(simulated_results))
identical(results, simulated_results)
## End(Not run)