This dataset contains all protest-related posts on the LIHKG forum between June 10 and July 11, 2019. The corpus comprises 2,389,590 individual posts organized into 49,658 threads and contributed by 12,624 distinct users. Note: all data were publicly accessible on the LIHKG forum.
Key data fields:
thread_id: Unique identifier for a thread.
cat_id: Identifier for the thread category.
user_id: ID of the user who created the thread.
item_data_reply_time: Date and time of the reply to the post within the thread data.
item_data_user_id: ID of the user who posted within the thread data.
post_text_token: Tokens of the thread data.
push_count: Whether the post contains any of the following terms: “push”, “pish”, “posh”, “pash”, “psuh”, “up”, “tui”, “推”, or “幫推”.
issues_pred: Strategic framing identified in the thread by the Bayesian algorithm.
topic: Substantive topics identified in the thread by the LDA model.
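As an illustration of how the push_count flag could be reproduced from raw post text, here is a minimal Python sketch. The term list comes from the field description above. The matching rules are assumptions, not the dataset's documented procedure: Latin-script terms are matched case-insensitively as whole words, and Chinese terms are matched as substrings. The function name contains_push_term is hypothetical.

```python
import re

# Terms taken from the push_count field description.
LATIN_TERMS = ["push", "pish", "posh", "pash", "psuh", "up", "tui"]
CJK_TERMS = ["推", "幫推"]  # "幫推" already contains "推"; listed for fidelity

# Assumption: Latin-script terms count only as whole words (not inside
# "support" or "update") and are matched case-insensitively.
_LATIN_RE = re.compile(
    r"(?<![A-Za-z])(?:" + "|".join(LATIN_TERMS) + r")(?![A-Za-z])",
    re.IGNORECASE,
)


def contains_push_term(text: str) -> bool:
    """Return True if the text contains any of the listed push terms."""
    if _LATIN_RE.search(text):
        return True
    # Assumption: Chinese terms are matched as plain substrings.
    return any(term in text for term in CJK_TERMS)
```

If the dataset's actual rule is a plain substring match for every term, the whole-word guard should be dropped, since it excludes words such as "support".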