GitLab On-call Run Books#
This project provides guidance for Infrastructure Reliability Engineers and Managers who are starting an on-call shift or responding to an incident. If you haven't yet, review the Incident Management page in the handbook before reading on.
On-Call#
GitLab Reliability Engineers and Managers provide 24x7 on-call coverage to ensure incidents are responded to promptly and resolved as quickly as possible.
Shifts#
We use PagerDuty to manage our on-call schedule and incident alerting. We currently have two escalation policies: one for Production Incidents and the other for Production Database Assistance. They are staffed by SREs and DBREs, respectively, and by Reliability Engineering Managers.
Currently, rotations are weekly, and each day's schedule is split into two 12-hour shifts, with engineers on call as close to daytime hours as their geographical region allows. We hope to hire enough engineers across timezones to split shifts 8/8/8, but we're not staffed sufficiently yet.
Joining the On-Call Rotation#
When a new engineer joins the team and is ready to start shadowing for an on-call rotation, overrides should be enabled for the relevant on-call hours during that rotation. Once they have completed shadowing and are comfortable/ready to be inserted into the primary rotations, update the membership list for the appropriate schedule to add the new team member.
This PagerDuty forum post was referenced when setting up the blank shadow schedule and initial overrides for on-boarding new team members.
Checklists#
To start off on the right foot, let's define a set of tasks that are nice to do before you go any further in your week.
By performing these tasks we will keep the broken-window effect under control, preventing future pain and mess.
Things to keep an eye on#
Issues#
First, check the on-call issues to familiarize yourself with what has been happening lately. Also, keep an eye on the #production and #incident-management channels for discussion around any ongoing issues.
Alerts#
Start by checking how many alerts are in flight right now:
- Go to the fleet overview dashboard and check the number of Active Alerts; it should be 0. If it is not 0:
  - go to the alerts dashboard and check what is being triggered
  - watch the #alerts and #feed_alerts-general channels for alert notifications; each alert there should point you to the right runbook to fix it
  - if they don't, you have more work to do
  - be sure to create an issue, particularly to declare toil, so we can work on it and suppress it
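As a concrete sketch of the first check: the count of firing alerts can be pulled from any Prometheus-compatible HTTP API. In practice you would query the real server (the `$PROMETHEUS_URL` variable below is hypothetical); here a canned sample response is piped through `jq` so the filter itself can be seen in isolation:

```shell
# Sketch only: in practice you would run
#   curl -s "$PROMETHEUS_URL/api/v1/alerts"      ($PROMETHEUS_URL is hypothetical)
# Here we feed jq a canned sample response to show the filter in isolation.
jq '[.data.alerts[] | select(.state == "firing")] | length' <<'EOF'
{"data":{"alerts":[
  {"state":"firing","labels":{"alertname":"TargetDown"}},
  {"state":"pending","labels":{"alertname":"HighMemory"}}
]}}
EOF
```

Anything other than `0` means there are firing alerts to investigate.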
Prometheus targets down#
Check how many targets are not being scraped at the moment. To do this:
- Go to the fleet overview dashboard and check the number of Targets Down; it should be 0. If it is not 0:
  - go to the [targets down list] and check which targets are down
  - try to figure out why there are scraping problems and try to fix them. Note that scraping problems can sometimes be temporary, caused by exporter errors
  - be sure to create an issue, particularly to declare toil, so we can work on it and suppress it
Incidents#
First: don't panic.
If you are feeling overwhelmed, escalate to the IMOC. Whoever is in that role can help you get other people to help with whatever is needed. Our goal is to resolve the incident in a timely manner, but sometimes that means slowing down and making sure we get the right people involved. Accuracy is as important as, or more important than, speed.
Roles for an incident can be found in the incident management section of the handbook.
If you need to declare an incident, follow the instructions in the handbook.
Communication Tools#
If you do end up needing to post an update about an incident, we use Status.io.
On status.io, you can create an incident and tweet, post to Slack, IRC, webhooks, and email via checkboxes when creating or updating the incident.
The incident will also have an affected-infrastructure section, where you can pick components of the GitLab.com application, as well as the underlying services/containers, should we have an incident caused by a provider.
You can update incidents with the Update Status button on an existing incident; again, you can tweet, etc., from that update point.
Remember to close out the incident when the issue is resolved. Also, when possible, add the issue and/or Google Doc in the postmortem link.
Production Incidents#
Reporting an incident#
Roles#
During an incident, we have roles defined in the handbook.
General guidelines for production incidents.#
- Is this an emergency incident?
  - Are we losing data?
  - Is GitLab.com not working or offline?
  - Has the incident affected users for greater than 1 hour?
- Join the #incident-management channel
- If the point person needs someone to do something, give a direct command: "@someone: please run this command"
- Be sure to be in sync - if you are going to reboot a service, say so: "I'm bouncing server X"
- If you have conflicting information, stop and think, bounce ideas, and escalate
- Gather information when the incident is done - logs, samples of graphs, whatever could help figure out what happened
- Use `/security` if you have any security concerns and need to pull in the Security Incident Response team
PostgreSQL#
- PostgreSQL
- more postgresql
- PgBouncer
- PostgreSQL High Availability & Failovers
- PostgreSQL switchover
- Read-only Load Balancing
- Add a new secondary replica
- Database backups
- Database backups restore testing
- Rebuild a corrupt index
- Checking PostgreSQL health with postgres-checkup
- Reducing table and index bloat using pg_repack
Frontend Services#
- GitLab Pages returns 404
- HAProxy is missing workers
- Worker's root filesystem is running out of space
- Azure Load Balancers Misbehave
- GitLab registry is down
- Sidekiq stats no longer showing
- Gemnasium is down
- Blocking a project causing high load
Supporting Services#
Gitaly#
- Gitaly error rate is too high
- Gitaly latency is too high
- Sidekiq Queues are out of control
- Workers have huge load because of cat-files
- Test pushing through all the git nodes
- How to gracefully restart gitaly-ruby
- Debugging gitaly with gitaly-debug
- Gitaly token rotation
- Praefect is down
- Praefect error rate is too high
CI#
Geo#
ELK#
Non-Critical#
Non-Core Applications#
Chef/Knife#
Certificates#
Learning#
Alerting and monitoring#
- GitLab monitoring overview
- How to add alerts: Alerts manual
- How to add/update deadman switches
- How to silence alerts
- Alert for SSL certificate expiration
- Working with Grafana
- Working with Prometheus
- Upgrade Prometheus and exporters
- Use mtail to capture metrics from logs
CI#
Access Requests#
Deploy#
- Get the diff between dev versions
- Deploy GitLab.com
- Rollback GitLab.com
- Deploy staging.GitLab.com
- Refresh data on staging.gitlab.com
- Background Migrations
Work with the fleet and the rails app#
- Reload unicorn with zero downtime
- How to perform zero downtime frontend host reboot
- Gracefully restart sidekiq jobs
- Start a rails console in the staging environment
- Start a redis console in the staging environment
- Start a psql console in the staging environment
- Force a failover with postgres
- Force a failover with redis
- Use aptly
- Disable PackageCloud
- Re-index a package in PackageCloud
- Access hosts in GCP
Restore Backups#
- Deleted Project Restoration
- PostgreSQL Backups: WAL-E, WAL-G
- Work with Azure Snapshots
- Work with GCP Snapshots
- PackageCloud Infrastructure And Recovery
Work with storage#
- Understanding GitLab Storage Shards
- How to re-balance GitLab Storage Shards
- Build and Deploy New Storage Servers
- Manage uploads
Mangle front end load balancers#
- Isolate a worker by disabling the service in the LBs
- Deny a path in the load balancers
- Purchasing/Renewing SSL Certificates
Work with Chef#
- Create users, rotate or remove keys from chef
- Update packages manually for a given role
- Rename a node already in Chef
- Reprovisioning nodes
- Speed up chefspec tests
- Manage Chef Cookbooks
- Chef Guidelines
- Chef Vault
- Debug failed provisioning
Work with CI Infrastructure#
- Runners fleet configuration management
- Investigate Abuse Reports
- Create runners manager for GitLab.com
- Update docker-machine
- CI project namespace check
Work with Infrastructure Providers (VMs)#
- Getting Support from GCP
- Create a DO VM for a Service Engineer
- Create VMs in Azure, add disks, etc
- Bootstrap a new VM
- Remove existing node checklist
Manually ban an IP or netblock#
Dealing with Spam#
Manage Marvin, our infra bot#
ElasticStack (previously Elasticsearch)#
Selected elastic documents and resources:
- elastic/
ElasticStack integration in Gitlab (indexing Gitlab data)#
elasticsearch-integration-in-gitlab.md
Logging#
Selected logging documents and resources:
- logging/
Internal DNS#
Debug and monitor#
Secrets#
Security#
Other#
- Setup oauth2-proxy protection for web based application
- Register new domain(s)
- Manage DNS entries
- Setup and Use my Yubikey
- Purge Git data
- Getting Started with Kubernetes and GitLab.com
- Using Chatops bot to run commands across the fleet
Gitter#
Manage Package Signing Keys#
Other Servers and Services#
Adding runbooks rules#
- Make it quick - add links for checks
- Don't make me think - write clear guidelines, write expectations
- Recommended structure
- Symptoms - how can I quickly tell that this is what is going on
- Pre-checks - how can I be 100% sure
- Resolution - what do I have to do to fix it
- Post-checks - how can I be 100% sure that it is solved
- Rollback - optional, how can I undo my fix
Developing in this repo#
Generating a new runbooks image#
To generate a new image, follow the Git commit guidelines below; this will trigger a semantic version bump, which will then cause a new pipeline that builds and tags the new image.
Git Commit Guidelines#
This project uses Semantic Versioning. We use commit messages to automatically determine the version bumps, so they should adhere to the conventions of Conventional Commits (v1.0.0-beta.2).
TL;DR#
- Commit messages starting with `fix:` trigger a patch version bump
- Commit messages starting with `feat:` trigger a minor version bump
- Commit messages starting with `BREAKING CHANGE:` trigger a major version bump
- If you don't want to publish a new image, do not use the above starting strings.
Automatic versioning#
Each push to `master` triggers a `semantic-release` CI job that determines and pushes a new version tag (if any), based on the last version tagged and the new commits pushed. Notice that this means that if a Merge Request contains, for example, several `feat:` commits, only one minor version bump will occur on merge. If your Merge Request includes several commits, you may prefer to ignore the prefix on each individual commit and instead add an empty commit summarizing your changes, like so:

```
git commit --allow-empty -m '[BREAKING CHANGE|feat|fix]: <changelog summary message>'
```
Tool Versioning#
This project has adopted the asdf version manager for tool versioning.
Installation instructions for asdf can be found at https://asdf-vm.com/#/core-manage-asdf-vm?id=install.
For compatibility, please configure the following line in `~/.asdfrc`:

```
legacy_version_file = yes
```
Dependencies and required tooling#
The following tools and libraries are required to develop dashboards locally:
- Go programming language
- Ruby programming language
- go-jsonnet - a Jsonnet implementation written in Go
- jsonnet-bundler - a package manager for Jsonnet
- jq - a command line JSON processor

You can install most of them using the asdf tool.
Manage your dependencies using asdf#
Our asdf toolset uses the following plugins:
- `golang`: `asdf plugin add golang`
- `ruby`: `asdf plugin add ruby`
- `go-jsonnet`: `asdf plugin add go-jsonnet`
- `jsonnet-bundler`: `asdf plugin add jb`
Once you have installed these plugins, run the following command to install the required versions:

```
$ asdf install
go-jsonnet 0.16.0 is already installed
golang 1.14 is already installed
ruby 2.6.5 is already installed
$ # Confirm everything is working with...
$ asdf current
go-jsonnet 0.16.0 (set by ~/runbooks/.tool-versions)
golang 1.14 (set by ~/runbooks/.tool-versions)
ruby 2.6.5 (set by ~/runbooks/.ruby-version)
```
You don't need to use asdf, but if you don't, you will need to install all dependencies manually and track their versions yourself.
Go, Jsonnet#
We use .tool-versions
to record the version of go-jsonnet that should be used for local development. The asdf
version manager is used by some team members to automatically switch versions based on the contents of this file. It should be kept up to date. The top-level Dockerfile
contains the version of go-jsonnet we use in CI. This should be kept in sync with .tool-versions
, and a (non-gating) CI job enforces this.
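For reference, `.tool-versions` is a plain-text file pairing tool names with versions, one per line. A minimal sketch, using the versions shown in the `asdf install` output above (the repo's actual file may have moved on, and Ruby is tracked separately in `.ruby-version` via the `legacy_version_file` setting):

```
golang 1.14
go-jsonnet 0.16.0
```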
To install go-jsonnet, you have a few options. You could follow that project's README to install it manually, install it via Homebrew (`brew install go-jsonnet`), or, if you use asdf, use an asdf plugin.
Ruby#
In addition to asdf, many developers use rbenv, rvm, or other tooling, so, for convenience, we maintain the standard `.ruby-version` file for the Ruby version. asdf needs to be configured with the `legacy_version_file = yes` setting described in the parent section.
Contributing#
Please see the contribution guidelines.