Twitter Collections

Authors: Gonley, Matt; Nicholas, Ryan; Fitz, Nicole; Knock, Griffin; Bruce, Derek
Language: English
Year of publication: 2022
Subject:
Description: TwitterCollections is a continuation of work from a previous semester's team, Library6Btweets. The prior team, which worked during Fall 2021, was composed of Yash Bhargava, Daniel Burdisso, Pranav Dhakal, Anna Herms, and Kenneth Powell. The current team, which took over the work during Spring 2022, is composed of Matt Gonley, Ryan Nicholas, Nicole Fitz, Griffin Knock, and Derek Bruce. Billions of tweets have been collected by the Digital Library Research Laboratory (DLRL). The tweets were collected in three formats: Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT), yourTwapperKeeper (YTK), and Social Feed Manager (SFM). The collected tweets should be converted into a standard data format to allow for ease of access and data research. The goal is to convert the collected tweets into a unified JSON format. A secondary goal is to create a machine learning model to categorize uncategorized tweets. The standardized format has two levels: an individual level and a collection level. Conversion differs between the levels: at the individual level, each tweet and its attributes are converted to a JSON object; at the collection level, a whole collection of tweets is converted to a separate JSON object. Our work began with familiarizing ourselves with the previous semester's work and its schema. The previous team designed that schema with the three tweet formats in mind, as well as the Twitter version 2 schema. They also created a collection-level schema that lists all of the tweet IDs in a given collection, to allow for determining which tweets belong to which collection; they designed this in accordance with the events archive website. We were also given the previous team's conversion scripts for each of the tweet formats. Each format needed a different script, since the attributes and metadata collected from the tweets differed among the formats.
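The two conversion levels described above can be illustrated with a minimal sketch. The attribute names below are hypothetical placeholders; the actual unified schema is the one the previous team designed and is documented in the project report.

```python
import json

# A minimal sketch of the two conversion levels. The attribute names here
# are hypothetical placeholders; the real unified schema is defined in the
# project report, not reproduced in this record.
def tweet_to_json(raw):
    """Individual level: convert one raw tweet record to a JSON-ready dict."""
    return {
        "id": str(raw["id"]),
        "text": raw.get("text", ""),
        "created_at": raw.get("created_at"),
        "author_id": raw.get("user_id"),
    }

def collection_to_json(name, tweets):
    """Collection level: the collection name plus the IDs of its tweets."""
    return {"collection": name, "tweet_ids": [str(t["id"]) for t in tweets]}

raw = [{"id": 1, "text": "hello", "created_at": "2022-01-01", "user_id": 42}]
print(json.dumps([tweet_to_json(t) for t in raw]))
print(json.dumps(collection_to_json("example-event", raw)))
```

Each of the six provided scripts produces one of these two output shapes for one of the three source formats.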
The storage format also differed: DMI-TCAT split the data for any given topic into six SQL tables, YTK kept the data in separate tables per topic, and SFM stored JSON. The original scripts were written in Python; for simplicity, we continued using Python as well. Our focus was on optimizing the scripts, as some of them were unusably slow. The scripts also needed to be modified to accommodate scale, since all of the data could not be loaded into memory at once. We were provided six scripts, two for each tweet format: one for the individual-level schema and one for the collection-level schema. In addition to the optimizations and modifications, we created a machine learning model to accurately classify the events for unlabeled tweet collections. The model can classify tweets fed from any of the formats. We experimented with a Naive Bayes model and a BERT-based neural network model, and found the latter superior. Our deliverables for this semester are the new scripts, optimized versions of the prior scripts, the best machine learning model, and the converted Twitter collection JSON files. We hope that a standardized data set can enable fast and effective research for those who want to incorporate tweets into their studies.
TwitterCollectionsReport.pdf: The project report, as a PDF.
TwitterCollectionsReport.docx: The project report, as a .docx file produced by Word.
TwitterCollectionsPresentation.pdf: The PDF version of our slide deck for the final project class presentation.
TwitterCollectionsPresentation.pptx: The .pptx (PowerPoint) version of our slide deck for the final project class presentation.
TwitterCollections_TweetsPerCollection_ytk.csv: A CSV file giving the total number of tweets per collection in the YTK database. It lists 1399 collections, 25 of which have a zero count. The total number of tweets covered is 1,821,081,265.
TwitterCollections_NumberOfTweetsPerEvent_ytk.csv: A CSV file giving the total number of tweets per event in the YTK database. It lists 790 events, 6 of which have a zero count. The total number of tweets covered is 1,821,081,265.
TwitterCollections_Collection_Table_for_IA20180620_Labeled5.xlsx: An .xlsx (Excel) spreadsheet describing 1498 collections, giving the event name for each, along with other attributes for many of the collections.
TwitterCollections_AllEventNames.csv: A list of the 810 event names, as a one-column CSV file.
TwitterCollections_CollectionInformation.xlsx: Contains the mapping between event names (given on row 1) and collection names (given on as many rows as needed below the corresponding event name). There are 809 columns. Most events have just one collection, but column 546 (Paleontology) has 59 collection names.
TwitterCollectionsPythonScripts.zip: A ZIP file containing the code used for the project, including the project readme and the requirements.txt file. It follows the file structure described in the report, with the zipped directory being 2022_optimizations.
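The scale constraint described in the abstract, that the full data cannot be loaded into memory, can be sketched as below. This is a minimal illustration, not the team's actual script: the table and column names are hypothetical, and SQLite stands in for the lab's SQL databases. The key idea is fetching rows in batches and writing JSON Lines so the whole table never resides in memory.

```python
import json
import sqlite3  # stand-in for the lab's SQL databases, for illustration only

def stream_convert(db_path, table, out_path, batch_size=1000):
    """Convert a tweet table to a JSON Lines file in fixed-size batches."""
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    cur = con.execute(f"SELECT id, text FROM {table}")  # hypothetical columns
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            rows = cur.fetchmany(batch_size)  # only batch_size rows in memory
            if not rows:
                break
            for row in rows:
                record = {"id": str(row["id"]), "text": row["text"]}
                out.write(json.dumps(record) + "\n")
    con.close()
```

At no point does the script hold more than one batch of rows, so memory use stays flat regardless of table size.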
Database: OpenAIRE