Chris Mahoney 7eac42d26c add scheduling suggestion 2 months ago


New Mirror Backend?

For some time now, mirror has had issues with maintainability.

  1. There is a lot of technical debt
  2. All of our features are spread among multiple git repos
  3. There is zero testing
  4. There is no backup
  5. It's very difficult to make improvements

I would like to consider "refactoring" mirror's backend to be more centralized, automated, and fun.

What makes current mirror work

rsync scripts

In /usr/local/mirrorscripts/ you'll find a shell script for each individual project we host on mirror. These scripts are run whenever we sync from upstream, so they must be executable.

blender.sh:

#!/bin/bash

distro=blender
data=blender
data_location=/storage/blender
mirror=download.blender.org

source /usr/local/mirrorscripts/globalvars

if [ ! -d ${data_location} ]; then
  ${MKDIR} ${data_location}
fi

echo "Sync started on $(${DATE})." >> ${logfile}

${RSYNC} ${options} --password-file /usr/local/mirrorscripts/blenderpasswd clarksonedu@${mirror}::${data} ${data_location} >> ${logfile}

These scripts are backed up on Gitea. They are mostly uniform, but some projects (like blender and arch) require additional rsync command-line options.
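
The sourced globalvars file isn't shown above. As a sketch, it presumably defines the tool paths and shared options the per-project scripts rely on; only the variable names come from blender.sh, the values below are guesses:

```shell
# Hypothetical reconstruction of /usr/local/mirrorscripts/globalvars.
# Variable names are taken from blender.sh; every value is a guess.
MKDIR="/usr/bin/mkdir -p"
DATE="/usr/bin/date"
RSYNC="/usr/bin/rsync"
# -a archive mode, --delete drop files removed upstream,
# --bwlimit caps transfer rate (KB/s) so one sync can't saturate the link
options="-a --delete --bwlimit=50000"
logfile="/var/log/mirror/${distro}.log"
```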

systemd timers

All of the systemd service and timer files are stored in /etc/systemd/system, like so:

mirror-blender.service:

[Unit]
Description=Mirror Blender

[Service]
User=mirroradmin
Type=simple
ExecStart=/usr/local/mirrorscripts/blender.sh
RuntimeMaxSec=28800

mirror-blender.timer:

[Unit]
Description=Mirror every 6 hours starting at 02:00

[Timer]
OnCalendar=*-*-* 1/6:00:00

[Install]
WantedBy=timers.target

These files are very uniform. The only things I've found that change are OnCalendar, Description, and ExecStart. As far as I can tell, no backups of these files exist.

nginx

nginx is our webserver, configured to host everything. Any file placed in /var/www (and readable by www-data) is served to the public. The config at /etc/nginx is pretty unremarkable, and no known remote backups exist.

One thing of note is our pretty "Index of /" pages are generated by nginx.
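
Those listing pages come from nginx's autoindex module. The relevant part of the server block probably looks something like this (a sketch; the server name and exact layout are assumptions, not our actual config):

```nginx
server {
    listen 80;
    server_name mirror.example.edu;   # placeholder, not our real name

    root /var/www;
    autoindex on;               # generates the "Index of /" pages
    autoindex_exact_size off;   # human-readable file sizes
}
```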

SSL Certificate

Our SSL certificates are handled using https://letsencrypt.org and certbot. It doesn't seem like anything special is going on here.

Frontend

The frontend HTML client is loosely related to https://github.com/COSI-Lab/MirrorWebsite. The idea was that we could write .pug files, compile them into static HTML documents, and simply serve those. However, technical debt has killed those hopes and now we write the HTML by hand.

The stats page is really nice.

Stats

This Go project, https://github.com/COSI-Lab/MirrorBandwidthStats, has a script that reads nginx access logs and pulls out per-distro usage information. That info is saved as JSON to a JavaScript file, which the client then loads to render the stats page.
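
The log-scraping step could be sketched roughly like this in shell (the real logic lives in the Go project above; the assumption here is the default "combined" log format, where the request path like "/blender/..." is whitespace field 7):

```shell
#!/bin/sh
# Sketch: count requests per top-level project directory in an nginx
# access log. Assumes the "combined" log format, where the request
# path is whitespace-separated field 7.
count_hits() {  # usage: count_hits /var/log/nginx/access.log
  awk '{
    split($7, parts, "/")                 # parts[2] is the project directory
    if (parts[2] != "") hits[parts[2]]++
  }
  END {
    for (d in hits) printf "%s %d\n", d, hits[d]
  }' "$1"
}
```

Pointing `count_hits` at an access log would print one "project count" line per hosted project.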

Another tidbit: we use dstat to constantly record bandwidth usage, save it to a CSV, and then use a separate script to consolidate that into the bandwidth usage info. This is comically easy to break.

Ideas

What can we do to make this better or make mirror cooler?

Better Bandwidth Tracking

Instead of using dstat, we could write something that reads data straight from /proc/net/dev every hour and builds a time-series database. This could have awesome future uses.
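
A minimal sketch of that sampler, assuming the standard /proc/net/dev layout (the interface name and log path would be whatever we choose):

```shell
#!/bin/sh
# Sketch: append one "timestamp rx_bytes tx_bytes" sample for a network
# interface to a log file, building a crude time series over time.
sample_iface() {  # usage: sample_iface eth0 /var/lib/mirror/net.log
  now=$(date +%s)
  awk -v iface="$1:" -v now="$now" '$1 == iface {
    # in /proc/net/dev, field 2 is RX bytes and field 10 is TX bytes
    printf "%s %s %s\n", now, $2, $10
  }' /proc/net/dev >> "$2"
}
```

Run hourly from a timer; the deltas between consecutive samples give the per-hour bandwidth.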

Mirror Sync Testing

Most repos include a "lastsync" file with a unix timestamp, which can be used to verify that we are successfully mirroring the repo.
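
A sketch of such a check (the lastsync path and the staleness threshold are up to us; the arch path below is just an example):

```shell
#!/bin/sh
# Sketch: succeed only if a repo's "lastsync" unix timestamp is newer
# than a chosen threshold in seconds.
check_lastsync() {  # usage: check_lastsync /storage/archlinux/lastsync 86400
  last=$(cat "$1" 2>/dev/null) || return 1   # missing file counts as stale
  now=$(date +%s)
  [ $((now - last)) -le "$2" ]
}
```

For example, `check_lastsync /storage/archlinux/lastsync 86400 || echo "arch is stale"`.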

Status Bot

If mirror ever goes down, or a package stops being mirrored, there is no way to find out until one of us stumbles into it or we start getting angry emails. This could be fixed with a discord bot that screams whenever it can't reach mirror.
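
The "is mirror up?" half of that bot could be as small as a curl check posted to a Discord webhook. Both URLs below are placeholders, not real endpoints:

```shell
#!/bin/sh
# Sketch of a liveness check that screams into a Discord channel via a
# webhook when mirror stops answering. URLs are placeholders.
check_mirror() {  # usage: check_mirror <mirror url> <webhook url>
  if ! curl -sf --max-time 30 -o /dev/null "$1"; then
    # mirror didn't answer: post an alert to the webhook
    curl -sf -H "Content-Type: application/json" \
         -d '{"content": "mirror appears to be down!"}' \
         "$2" > /dev/null
  fi
}
```

Run from cron or a timer, e.g. `check_mirror https://mirror.example.edu/ "$WEBHOOK_URL"`.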

Centralized Management

Our frontend, systemd units, and rsync scripts all repeat the same data. We could automate generating all of it so that the data everywhere stays up to date.
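
As a sketch of the idea, one small config file could drive generation of all the systemd units. The `projects.conf` format ("name OnCalendar-expression" per line) and the `gen_units` helper are invented here, not something we have today:

```shell
#!/bin/sh
# Sketch: generate the mirror-<name>.service and mirror-<name>.timer
# pair for every project listed in one config file.
gen_units() {  # usage: gen_units projects.conf /etc/systemd/system
  while read -r name calendar; do
    cat > "$2/mirror-$name.service" <<EOF
[Unit]
Description=Mirror $name

[Service]
User=mirroradmin
Type=simple
ExecStart=/usr/local/mirrorscripts/$name.sh
RuntimeMaxSec=28800
EOF
    cat > "$2/mirror-$name.timer" <<EOF
[Unit]
Description=Mirror $name on schedule

[Timer]
OnCalendar=$calendar

[Install]
WantedBy=timers.target
EOF
  done < "$1"
}
```

The same config file could also feed the rsync script and frontend generation, so each project is described exactly once.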

Scheduling

Take a look at this picture:

Hourly Stats

We have slowly been stacking rsync jobs on top of each other, and the only thing really saving us from slowing our service is the bandwidth limit flag (--bwlimit=50000) attached to all of our rsync commands. All of our scripts also fire exactly on the hour, which makes us a bad downstream.

Basically, rethinking how we schedule rsync tasks could make us both a better upstream and a better downstream.
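
One cheap improvement: systemd timers can already stagger starts for us, e.g. an offset OnCalendar plus RandomizedDelaySec (a sketch, not one of our current units):

```ini
[Timer]
# fire at 02:17, 08:17, 14:17, 20:17 instead of exactly on the hour
OnCalendar=*-*-* 02/6:17:00
# spread the actual start over a 15 minute window so jobs don't pile up
RandomizedDelaySec=15min
```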

Blog & Mailing List

A spot on the mirror site to communicate with the public would be a huge improvement in my opinion.

A mailing list could make it much more convenient to communicate with other admins.

API

The name says it all: it would be a lot easier to build cool things if mirror data were easily accessible.