This is an attempt at optimizing service startup.
The effect is most pronounced when many services are restarted one by one.
The systemd service manager role sometimes does this - for example when `just install-service synapse` runs.
In such cases, a 5-second delay for each Synapse worker service
(or other bridge/bot service that waits on the homeserver) quickly adds up to a lot.
When services are all stopped fully and then started, the effect is not so pronounced, because
`matrix-synapse.service` starts first and pulls all worker services (defined as `Wants=` for it).
Later on, when the systemd service manager role "starts" these worker services, they're started already.
Even if they had a 5-second wait each, it would have happened in parallel.
Many of these do depend on the Synapse master process (`matrix-synapse.service`),
so it makes sense to do it.
Furthermore, we're doing it so that one can stop the `matrix-synapse.service`
service and have systemd cascade this into stopping all the workers as well.
This is useful for easily stopping all of Synapse, so that Postgres
upgrades (`--tags=upgrade-postgres`) can happen cleanly.
Postgres upgrades currently stop `devture_postgres_systemd_services_to_stop_for_maintenance_list` which
includes Synapse, but stopping just the master process and leaving workers running is not safe enough and sometimes leads to errors like:
> ERROR: insert or update on table "event_forward_extremities" violates foreign key constraint "event_forward_extremities_event_id"
With this dependency in place, stopping `matrix-synapse.service` will stop all Synapse processes.
This was mostly affecting the stream writer (events) worker, which was
being reported as unhealthy. It wasn't causing any issues, but it just
looked odd and was confusing people.
As an alternative to hitting the regular `/health` healthcheck route (on
the "client" API which this stream writer does not expose),
we may have went for hitting some "replication" API endpoint instead.
This is more complicated and likely unnecessary.
This allows people to augment the Synapse image with custom tools and
addons without having to rebuild it from scratch.
If customizations are enabled, the playbook will build a new
`localhost/matrixdotorg/synapse:VERSION-customized` image
on top of the default one (`FROM matrixdotorg/synapse:VERSION`)
and with custom Dockerfile build steps.
For servers that self-build the Synapse image, the Synapse image will be
built first, before proceding to extend it the same way.
In the future, we'll also have easy to enable Dockerfile build steps
for modules that the playbook supports.
As stream writer workers are also powered by the `generic_worker`
Synapse app, this necessitated that we provide means for distinguishing
between them and regular `generic_workers`.
I've also taken the time to optimize nginx configuration generation
(more Jinja2 macro usage, less duplication).
Worker names have also changed.
Workers are now named sequentially like this:
- `matrix-synapse-worker-0-generic`
- `matrix-synapse-worker-1-stream-writer-typing`
- `matrix-synapse-worker-2-pusher`
instead of `matrix-synapse-worker_generic_worker-18111` (indexed with a
port number).
People who modify `matrix_synapse_workers_enabled_list` directly will
need to adjust their configuration.
* appservice: add and use matrix_homeserver_* vars
* appservice: use the new vars
* Apply suggestions from code review
Co-authored-by: Slavi Pantaleev <slavi@devture.com>
Co-authored-by: Slavi Pantaleev <slavi@devture.com>
Reverts b1b4ba501fdfaa, 90c9801c560b6, a3c84f78ca9c65a, ..
I haven't really traced it (yet), but on some servers, I'm observing
`ansible-playbook ... --tags=start` completing very slowly, waiting
to stop services. I can't reproduce this on all Matrix servers I manage.
I suspect that either the systemd version is to blame or that some
specific service is not responding well to some `docker kill/rm` command.
`ExecStop` seems to work great in all cases and it's what we've been
using for a very long time, so I'm reverting to that.
We had to remove UID/GID environment variables that we used to pass
to the Synapse container, because it was causing a problem after
https://github.com/matrix-org/synapse/pull/11209
We were using both `--user` and UID/GID environment variables until now.
Fixes a regression caused by a5ee39266c.
If the user id and group id were different than 991:991
(which used to be a hardcoded default for us long ago),
there was a mismatch between what Synapse was trying to use (991:991)
and what it was actually started with (in `--user=..`). It was then
trying to change ownership, which was failing.
This was mostly affecting newer installations which were not using the
991:991 defaults we had long ago (since a1c5a197a9).
This switches the `docker exec` method of spawning
Synapse workers inside the `matrix-synapse` container with
dedicated containers for each worker.
We also have dedicated systemd services for each worker,
so this are now:
- more consistent with everything else (we don't use systemd
instantiated services anywhere)
- we don't need the "parse systemd instance name into worker name +
port" part
- we don't need to keep track of PIDs manually
- we don't need jq (less depenendencies)
- workers dying would be restarted by systemd correctly, like any other
service
- `docker ps` shows each worker separately and we can observe resource
usage