Design Decisions

An ever-growing document that holds information about choices we have made, so we can remember why we ended up designing things this way.

nginx

The cookiecutter this repo is based on came with traefik as the reverse proxy for django. We decided to move to nginx because:

  1. nginx handles serving static files for us.
  2. It follows suit with how ESnet generally runs webservers.

We did, however, decide to run nginx inside a container rather than on the host, because we want to avoid depending on the underlying OS as much as possible.

Authentication

We are using OIDC as our authentication protocol of choice for production. The idea is that a group or groups are given admin-level access automatically based on information in their OIDC claims. With this in place, there should almost never be a need to use the admin account. We suggest not even storing the admin password that was autogenerated for you. Instead, use the make pass-reset target to set the password to a new (autogenerated elsewhere) password. This should really only happen if OIDC is currently broken and you need to fix something from the webUI, for instance because you accidentally blocked the OIDC servers.
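
As a hedged illustration, the claims-to-permissions mapping could look like the sketch below if you were using mozilla-django-oidc; the library choice, the "groups" claim, and the group name are assumptions, not a statement of what SCRAM ships with:

```python
# Minimal sketch of claims-based admin provisioning, assuming
# mozilla-django-oidc; the claim and group names are hypothetical.
from mozilla_django_oidc.auth import OIDCAuthenticationBackend

ADMIN_GROUPS = {"scram-admins"}  # hypothetical OIDC group granting admin


class ClaimsBackend(OIDCAuthenticationBackend):
    def update_user(self, user, claims):
        """Grant or revoke admin rights from the OIDC claims on every login."""
        is_admin = bool(ADMIN_GROUPS & set(claims.get("groups", [])))
        user.is_staff = is_admin
        user.is_superuser = is_admin
        user.save()
        return user
```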

Expiration

Expiration is actually handled by a docker-compose healthcheck on the django container. We knew we'd need some sort of job to run at a given interval. We initially considered something like celery, but that seemed far too heavyweight for our needs. Cron came to mind as well, but it seemed like we'd have to use the host OS to run the cronjob, and we were trying to avoid caring about the underlying host as much as possible. The healthcheck lets us run inside docker, as well as verify/autofix our django container if something goes wrong.
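
To sketch the idea: the compose healthcheck shells into the container on an interval and runs a management command along these lines, so the expiration pass rides along with the container health verification. The command, app, model, and field names here are assumptions, not SCRAM's actual code:

```python
# Hypothetical management command the healthcheck could invoke,
# e.g. `python manage.py expire_entries`.
from django.core.management.base import BaseCommand
from django.utils import timezone

from myapp.models import Entry  # hypothetical app/model names


class Command(BaseCommand):
    help = "Deactivate entries whose expiration time has passed."

    def handle(self, *args, **options):
        expired = Entry.objects.filter(
            is_active=True, expiration__lte=timezone.now()
        )
        count = expired.update(is_active=False)
        self.stdout.write(f"Deactivated {count} expired entries.")
```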

Database and State

We quickly identified a need for a highly available, shared postgres server to share data between instances in the same group (i.e. wan-scram, dc-scram, netlab-scram). Initially, we had planned on using Redis Streams to handle communication between instances in the same group; however, we discovered that redis does not handle IP addresses in any form except a string, which would not work for our purposes. We also investigated running our own postgres cluster via docker, using postgresql's new pub/sub features, and sharing the blocks via API. All of these options failed to meet our needs, so we have chosen to use CloudSQL as a highly available, shared postgres database.
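
In practice this just means every instance's settings point at the shared database; a sketch, with the environment variable names as assumptions:

```python
# settings.py sketch: all instances in a group share one CloudSQL postgres.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": os.environ["SCRAM_DB_HOST"],  # CloudSQL instance or proxy
        "NAME": os.environ.get("SCRAM_DB_NAME", "scram"),
        "USER": os.environ["SCRAM_DB_USER"],
        "PASSWORD": os.environ["SCRAM_DB_PASSWORD"],
        "PORT": os.environ.get("SCRAM_DB_PORT", "5432"),
    }
}
```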

We also liked this method because it keeps our state in a hosted postgres that is highly available, easy to reach from anywhere in our network, and backed up regularly. With state so well protected, we can treat each SCRAM instance as ephemeral. In fact, our Ansible deployment fully blows away a running instance and replaces it with a new one.

Websockets/Channels/Redis

While redis did not pan out for data sharing, it serves well as a backend for django channels layers. We use channel layers to set up our "instance groups" of SCRAMs. Websockets are the method by which the translator talks to django to learn which actions should be taken on which entries. We chose websockets because this application inherently deals with ongoing updates arriving at ad-hoc, unknown times, and needs bidirectional communication between the translator and django.
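
For reference, a minimal sketch of what this looks like with channels_redis; the redis host, group name, and message shape are illustrative, not SCRAM's actual values:

```python
# settings.py: redis backs the channel layer ("redis" here is assumed to be
# the compose service name).
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {"hosts": [("redis", 6379)]},
    },
}

# Elsewhere: pushing a message to every translator in an instance group.
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

layer = get_channel_layer()
async_to_sync(layer.group_send)(
    "translator_wan-scram",  # hypothetical per-instance-group group name
    {"type": "translator.add", "message": {"route": "192.0.2.0/24"}},
)
```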

API Client Connectivity

Any API clients connect to the SCRAM REST API over HTTPS, and we rely on this encryption to protect the information being sent from the client to the API. We wanted to centralize the administration of the clients, so it seemed natural to use the built-in django admin site. On first connection, a client is allowed to hit a POST-only endpoint where it registers itself with SCRAM. We only allow the client to tell us its fqdn and a unique identifier. The SCRAM administrator is expected to go to the admin site to toggle the boolean marking the client as authorized, and to choose from the list of available actiontypes. During an entry's creation, we verify that the client is allowed to create an entry with the actiontype it is providing.

This model does allow anyone to "register" a client, but until someone with admin-level permissions goes to the admin site, accepts the registration, and sets authorized actiontypes, the client cannot create any entries and therefore cannot effect any change. Because the endpoint is POST-only, nobody can see which clients exist unless they can access the admin site. Our main remaining security concern is a DoS by constantly POSTing to this endpoint, which can be handled the same way any other DoS would be dealt with (i.e. likely blocked via SCRAM).
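
To make the flow concrete, here is a rough sketch using Django REST Framework; the model and field names are illustrative rather than SCRAM's actual schema:

```python
# Hypothetical registration model and POST-only endpoint.
from django.db import models
from rest_framework import serializers, status
from rest_framework.response import Response
from rest_framework.views import APIView


class Client(models.Model):
    hostname = models.CharField(max_length=255)      # the client's fqdn
    uuid = models.UUIDField(unique=True)             # unique identifier
    authorized = models.BooleanField(default=False)  # flipped by an admin


class ClientSerializer(serializers.Serializer):
    hostname = serializers.CharField(max_length=255)
    uuid = serializers.UUIDField()


class RegisterClient(APIView):
    """POST-only; there is no way to list or read registered clients here."""

    def post(self, request):
        serializer = ClientSerializer(data=request.data)
        serializer.is_valid(raise_exception=True)
        # New clients always start unauthorized; an admin must approve them
        # and assign actiontypes in the admin site before entries are allowed.
        Client.objects.get_or_create(
            uuid=serializer.validated_data["uuid"],
            defaults={"hostname": serializer.validated_data["hostname"]},
        )
        return Response(status=status.HTTP_201_CREATED)
```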

Configurable Payloads

Configurable payloads allow you to override certain data sent to the translator; currently, this is the ASN and the BGP community. To do so, you update the JSON dictionary inside the Web Socket Message entry in the admin page. Note that this dictionary must include a "route" key whose string value names the key in the JSON payload that will carry the route being acted on. If you rename that payload key, you must update the "route" value to match.
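
A hedged illustration of that dictionary (the asn/community values are made-up documentation values, and only the "route" mechanism is as described above):

```python
# Override dictionary where the route travels under the default "route" key.
payload = {
    "asn": 64496,              # overridden ASN (documentation ASN)
    "community": "64496:666",  # overridden BGP community
    "route": "route",          # names the payload key that carries the route
}

# If the translator should instead receive the route under a "network" key,
# point "route" at the renamed field:
payload = {
    "asn": 64496,
    "community": "64496:666",
    "route": "network",        # "network" now carries the route value
}
```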

Syncing

If you want two or more instances of SCRAM to share data between themselves, we have a few ways of making sure that happens.

  1. By depending on postgres, we can use a shared postgres instance to make sure both SCRAM instances have the same data.
  2. When a translator connects, it asks its local Django instance for all routes it already knows about in the DB and announces those.
  3. For normal syncing where both translators have been connected, we currently use process_updates (since it runs regularly) to grab new data from the database that comes from other connected instances and re-announce it locally (see the sketch after this list).
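
The sketch below shows the shape of that step-3 pass; the real process_updates differs, and the model, group, and field names here are assumptions:

```python
# Hedged sketch of the periodic re-announce pass: pull entries added to
# the shared DB by other instances and push them to local translators.
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

from myapp.models import Entry  # hypothetical app/model names


def reannounce_new_entries(last_seen_pk: int) -> int:
    """Announce every active entry newer than the last one we processed."""
    layer = get_channel_layer()
    newest = last_seen_pk
    for entry in Entry.objects.filter(pk__gt=last_seen_pk, is_active=True):
        async_to_sync(layer.group_send)(
            "translator_block",  # hypothetical channel-layer group
            {"type": "translator.add", "message": {"route": str(entry.route)}},
        )
        newest = max(newest, entry.pk)
    return newest  # caller persists this for the next pass
```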

Honestly, step 3 is kind of gross and we realize this. We are probably looking at a task runner or something similar to handle this moving forward, but we needed a fix in the meantime. Status can be tracked in GitHub Issue 125.

Entries Page

We intentionally chose to list only the active entries. Our thinking is that the home page shows the most recent additions; the entries page would then be overwhelmingly huge if it showed every historical entry, including the ones that timed out or were deactivated. If you want to know about a specific entry that is no longer active (to see its history, say), you would likely be using the search anyway.