Cardimom: A Twitter Bot

October 10, 2021

This post is about Cardimom, a Twitter Bot that tweets interesting posts related to JavaScript and TypeScript!

Scope

This project is designed to tweet new posts published by a given set of blogs. It is inspired by Planet Clojure, which tweets new posts on Clojure-related topics published by a set of approved blogs.

Here’s how new posts get published by the Twitter bot, Cardimom. First, relevant details of a blog need to be added to the project’s config file, namely blog_spec.json. These details include: the link to the feed of the blog, the Twitter username of the author, and relevant filtering logic (discussed below).

Once a blog has been added to the config file, the system will parse the blog feed periodically and tweet any new post(s) published by that blog. To this end, a post will be considered to be a ‘new post’ if it was published after the most recent post tweeted by Cardimom.

Here’s an example of a post that was tweeted by Cardimom:

Config

The config file contains the list of all the blogs that will be tracked by the system. To add a new blog, a pull request will need to be made to the config file, namely blog_spec.json. Each entry of the config file relates to a specific blog and it should contain the following properties:

Blogs on JavaScript and TypeScript

The purpose of this project is to create a Twitter bot that shares blog posts related to JavaScript and TypeScript. To this end, contributors adding their blog to the config file are encouraged to include (at the least) the keywords JavaScript, NodeJs and TypeScript in the includes_all list. Contributors may add other related keywords as well.
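
As a rough illustration of keyword filtering, here is a minimal sketch. It assumes that a post passes when its title or content mentions at least one inclusion keyword and no exclusion keyword, compared case-insensitively. The function name passesFilter, the excludeKeywords parameter and the matching rule itself are assumptions made for this example; Cardimom may interpret includes_all differently.

```javascript
// Hypothetical sketch: the matching rule (any inclusion keyword, no exclusion
// keyword, case-insensitive) is an assumption made for illustration only.
function passesFilter(post, includeKeywords, excludeKeywords = []) {
  const text = `${post.title} ${post.content}`.toLowerCase();
  const mentions = (keyword) => text.includes(keyword.toLowerCase());
  return includeKeywords.some(mentions) && !excludeKeywords.some(mentions);
}

const keywords = ["JavaScript", "NodeJs", "TypeScript"];
const samplePost = { title: "Narrowing types in TypeScript", content: "" };
console.log(passesFilter(samplePost, keywords)); // true
```

A plain substring match like this one is simple but coarse, so a stricter rule (for example, whole-word matching) may be preferable in practice.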

System Design


Launcher: This module triggers the system. It is run periodically, every 60 minutes.

Config Reader: It parses the config file and returns a list of all the (valid) blogs from the config file.

Database Manager: It connects with the database maintaining the state of the project.

Parser: It parses each blog feed and extracts the necessary details of each post published by that blog, namely, the title of the post, the date of publication, the contents of the post, and the link to the post.

Filter: It filters out the posts published by each blog that fail to meet the relevant filter logic (inclusion and exclusion of keywords).

Fetcher: It fetches the list of all the previously unpublished posts of every blog.

Twitter Publisher: It handles the tweeting of new posts published by each blog, on behalf of the Cardimom Twitter account. This is done by accessing the Twitter REST API’s /status endpoint.

The following sequence diagram provides a more detailed description of how the control flows through the system:

Control Flow

Database Schema

The system remains connected to a database, which maintains the list of all the posts that were previously fetched and tweeted by the system. This ensures that during each run, the system can rely on the contents of the database to determine whether a particular post should be fetched, in accordance with the design goals of the system (see below).

To maintain the state of the project, the database should maintain a table containing the following four columns:

In this project, the table (called posts) was created using PostgreSQL. Here’s its structure:

    Column    |           Type           | Collation | Nullable | Default 
--------------+--------------------------+-----------+----------+---------
 link         | text                     |           | not null | 
 author       | text                     |           | not null | 
 posted_at    | timestamp with time zone |           |          | 
 published_at | timestamp with time zone |           |          | 
Indexes:
    "posts_pkey" PRIMARY KEY, btree (link)

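Because link is the primary key, the table itself can help guard against recording the same post twice. The sketch below builds a parameterised INSERT for one row of posts. The helper name buildInsertPostQuery and the shape of the post object are hypothetical, and it assumes that posted_at holds the time of the tweet while published_at holds the blog's publication date. The ON CONFLICT clause is one possible approach, not necessarily the query the project uses.

```javascript
// Hypothetical helper: builds a query object with `text` and `values` fields.
// Assumes posted_at = time of tweeting, published_at = blog publication date.
function buildInsertPostQuery(post, tweetedAt = new Date()) {
  return {
    text:
      "INSERT INTO posts (link, author, posted_at, published_at) " +
      "VALUES ($1, $2, $3, $4) ON CONFLICT (link) DO NOTHING",
    values: [post.link, post.author, tweetedAt, post.publishedAt],
  };
}
```

An object of this shape can be passed to the query method of a node-postgres client, if that is the driver in use.
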
Content Aggregation Logic

The system expects blog feeds to be in one of the two standardised web feed formats: RSS and Atom. If a blog feed does not conform to the specification of either format, the system will reject that feed. Thus, it is important that blog feeds adhere to the specification requirements of RSS or Atom.

Each feed contains a list of items (called ‘items’ for RSS and ‘entries’ for Atom), each with a set of extensible meta-data (such as title, date of publication, link etc). In the case of blogs, each of these items represents a post published by that blog.

Depending on the format of a given blog feed, the relevant details of each post can be extracted from the elements of each item using the standard DOM parsing API. For example, for an RSS feed, window.document.getElementsByTagName("item") returns the list of all items (i.e., post objects). The getElementsByTagName function can then be called on each post object to extract the element with a particular tag (such as the date of publication). Finally, the innerHTML property of the extracted element provides its text.

Here’s an illustration for extracting the date of publication of the first post object from an RSS feed:

let listOfItems = window.document.getElementsByTagName("item");
let firstPost = listOfItems[0];
let dateOfFirstPost = firstPost.getElementsByTagName("pubDate")[0].innerHTML;

For the purpose of this project, the following meta-data needs to be fetched from each post published by every blog: the title of the post, the date of publication, the contents of the post, and the link to the post. In the case of an RSS feed, the corresponding elements are title, pubDate, description / content:encoded, and link. Similarly, for Atom feeds, the corresponding elements are title, published / updated, content / summary, and link.

Note that getElementsByTagName returns an array-like collection of all elements with a given tag name. In an RSS or Atom feed, each post object will contain only one element for each of the tags relating to its title, date of publication, link and content. So we only need to extract the first element of each collection.
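
The extraction described above could be sketched for RSS as follows. The helper names firstTagText and extractRssPost are hypothetical, and preferring content:encoded over description is an assumption. The sketch also reads textContent rather than innerHTML, which is a swap made for this example and not the approach shown earlier.

```javascript
// Returns the text of the first element matching any of the given tag names,
// tried in order, or null when none of them is present.
function firstTagText(item, tagNames) {
  for (const tagName of tagNames) {
    const matches = item.getElementsByTagName(tagName);
    if (matches.length > 0) {
      return matches[0].textContent;
    }
  }
  return null;
}

// Extracts the four required fields from one RSS <item> element.
function extractRssPost(item) {
  return {
    title: firstTagText(item, ["title"]),
    publishedAt: firstTagText(item, ["pubDate"]),
    content: firstTagText(item, ["content:encoded", "description"]),
    link: firstTagText(item, ["link"]),
  };
}
```

An Atom variant would follow the same pattern with the entry's own tag names, for example ["published", "updated"] for the date.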

Design Goals

The project is designed to ensure that each post is tweeted at most once. To ensure this, an idempotency check is conducted before tweeting a new post. To this end, the following steps are undertaken:

Note that for the very first system run, the database would be empty and potentially every post would be considered ‘new and unique’. This may produce undesirable results, as the system would try to tweet every single post published by the blogs listed in the config file. To prevent this, a cut-off timestamp (January 1, 2021 00:00:00 UTC+5:30) is applied when the database is empty. The system will only tweet posts that are published after this cut-off timestamp.
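
One way to picture this selection is the sketch below. It is an illustration under stated assumptions, not the project's actual code: the name selectNewPosts and its parameters are hypothetical. Here latestPublishedAt stands for the publication date of the most recently tweeted post (null when the database is empty), and tweetedLinks is a Set of links already recorded in the database.

```javascript
// Cut-off used when the database is empty: January 1, 2021 00:00:00 UTC+5:30.
const CUTOFF = new Date("2021-01-01T00:00:00+05:30");

// Hypothetical helper: keeps only posts published after the threshold whose
// links have not been recorded yet.
function selectNewPosts(posts, latestPublishedAt, tweetedLinks) {
  const threshold = latestPublishedAt ?? CUTOFF;
  return posts.filter(
    (post) => post.publishedAt > threshold && !tweetedLinks.has(post.link)
  );
}
```

Checking the link as well as the date means a post that is already recorded is skipped even if its feed later reports a newer date.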

The system runs periodically, every 60 minutes. During each run, the system identifies every new and unique post and proceeds to tweet it. The fairly large interval between consecutive system runs was chosen keeping in mind Twitter’s rate limits.

Deployment and Testing

The project is deployed on Heroku. Heroku provides a developer-friendly platform for deploying and managing applications on the cloud. Once deployed, Heroku packages the source code of the application along with its dependencies into virtual containers (called ‘dynos’), which are responsible for executing the code of the application in the relevant run-time environment.

Need for deployment to Heroku

By its nature, the system needs to run continuously (at periodic intervals). Unlike the earlier project implementing a simplified version of Lisp, the present project requires the system to check for new blog posts from time to time. Given this requirement, it would be difficult to reliably deploy the project locally (i.e., on a local machine), as the execution of the project would depend on the state of the local machine for an indefinite period of time. In fact, to reliably deploy the project locally, one would need a substantial allocation of resources (to guard against breaks in connectivity, failure of the local machine etc.). This is difficult to achieve on a small scale. Thus, to avoid such roadblocks, this project is deployed on a cloud-based platform (namely, Heroku), which offers a virtual container to consistently execute the project and maintain its database.

Secrets Management

To connect to Twitter and the database, we need certain shared secrets (e.g., passwords). These cannot be part of the code, as they would be openly accessible (for example, on GitHub). Therefore, the standard practice of exposing secrets through environment variables has been used in this project. To run the system locally, we can use shell environment variables. On Heroku, environment variables were set using the web UI, ensuring that the code on GitHub does not reveal secrets.
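
Reading secrets from the environment might look like the sketch below. The function name loadSecrets and the variable names TWITTER_API_KEY and TWITTER_API_SECRET are hypothetical, since the project's actual names are not listed here. DATABASE_URL is also an assumption, chosen because Heroku Postgres sets a config var of that name.

```javascript
// Hypothetical variable names; the project's actual names may differ.
const REQUIRED_SECRETS = ["TWITTER_API_KEY", "TWITTER_API_SECRET", "DATABASE_URL"];

// Reads the required secrets from the environment and fails fast, listing
// every missing variable, so a misconfigured deployment is caught at startup.
function loadSecrets(env = process.env) {
  const missing = REQUIRED_SECRETS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(REQUIRED_SECRETS.map((name) => [name, env[name]]));
}
```

Passing the environment as a parameter keeps the function easy to exercise in tests without touching the real process environment.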

System Testing

The project implements system-wide tests (as opposed to unit tests for each module). This means that the test module will work with a sample config file and run the entire system to check if the system is producing the expected results (as opposed to testing each module separately). This was mostly done in the interest of time.

System testing was done using GitHub Actions, which allows automated integration tests to be run every time a new commit is pushed or a pull request is received. Given that GitHub Actions provides for continuous integration, the GitHub workflow is configured to automatically build the application, deploy it on GitHub-hosted virtual machines, set up a database and run integration tests on each commit. Thus, once a commit is pushed to the main branch, the project gets deployed momentarily on GitHub’s virtual machines, specifically to run tests.

Limitations

Here are some of the major limitations, arising out of the design of the project:

Latest Tweet

Here’s the latest tweet published by Cardimom: