
posted by Fnord666 on Friday February 03 2017, @06:39AM   Printer-friendly
from the 5-backup-strategies-weren't-enough dept.

Ruby Paulson at BlogVault reports

GitLab, the online tech hub, is facing issues as a result of an accidental database deletion that happened in the wee hours of last night. A tired, frustrated system administrator thought that deleting a database would solve the lag-related issues that had cropped up... only to discover too late that he'd executed the command for the wrong database.

[...] It's certainly freaky that all five backup solutions GitLab had were ineffective, but this incident demonstrates that a number of things can go wrong with backups. The real aim of any backup solution is to be able to restore data with ease... but simple oversights can render backup solutions useless.

Computer Business Review adds

The data loss took place when a system administrator accidentally deleted a directory on the wrong server during a database replication process. A folder containing 300GB of live production data was completely wiped.

[...] The last potentially useful backup was taken six hours before the issue occurred.

However, this is of little help: snapshots are normally taken only every 24 hours, and the data loss occurred six hours after the previous snapshot, resulting in roughly six hours of lost data.

David Mytton, founder and CEO [of] Server Density, said: "This unfortunate incident at GitLab highlights the urgent need for businesses to review and refresh their backup and incident handling processes to ensure data loss is recoverable, and teams know how to handle the procedure."

GitLab has been updating a Google Doc with info on the ongoing incident.

Additional coverage at:
TechCrunch
The Register


Original Submission

 
  • (Score: 0) by Anonymous Coward on Friday February 03 2017, @07:37AM (#462269)

    I just wish fewer serious people would use Google Docs. It makes the document impossible to view for those unwilling to allow Google code on their hardware.

  • (Score: 0) by Anonymous Coward on Friday February 03 2017, @07:42AM (#462270)

    Consider using a VM to browse. I am going to do just that.

  • (Score: 2) by mechanicjay on Friday February 03 2017, @07:59AM

    I tried to post a quote of the Google Doc, but I encountered the lameness filter. Regardless, it's a decent read on how things went bad, and a good lesson on how tired sysadmins just need to hand it off rather than power through.
    --
    My VMS box beat up your Windows box.
    • (Score: 3, Informative) by Anonymous Coward on Friday February 03 2017, @08:27AM (#462282)

      GitLab.com Database Incident - 2017/01/31

      Note: This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis).

      YouTube Live stream - Follow us live debating and problem solving!

      Contents:
      Timeline (all times UTC)
      Recovery - 2017/01/31 23:00 (backup from ±17:20 UTC)
      Problems Encountered
      External help
      Hugops (please add kind reactions here, from twitter and elsewhere)
      Stephen Frost
      Sam McLeod
      Impact

              ±6 hours of data loss
              4613 regular projects, 74 forks, and 350 imports are lost (roughly); 5037 projects in total. Since Git repositories are NOT lost, we can recreate all of the projects whose user/group existed before the data loss, but we cannot restore any of these projects’ issues, etc.
              ±4979 (so ±5000) comments lost
              707 users lost potentially, hard to tell for certain from the Kibana logs
              Webhooks created before Jan 31st 17:20 were restored, those created after this time are lost

      Timeline (all times UTC)

              2017/01/31 16:00/17:00 - 21:00

              YP is working on setting up pgpool and replication in staging, creates an LVM snapshot to get up to date production data to staging, hoping he can re-use this for bootstrapping other replicas. This was done roughly 6 hours before data loss.
              Getting replication to work is proving to be problematic and time-consuming (estimated at ±20 hours just for the initial pg_basebackup sync). The LVM snapshot is not usable on the other replicas as far as YP could figure out. Work is interrupted due to this (as YP needs the help of another colleague who’s not working this day), and due to spam/high load on GitLab.com
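
      A minimal sketch of what "creates an LVM snapshot" looks like in practice; the volume group and logical volume names (vg0/pgdata), the size, and the mount point are illustrative, not GitLab's actual layout:

      # take a copy-on-write snapshot of the logical volume holding the Postgres data directory
      lvcreate --snapshot --size 50G --name pgdata-snap /dev/vg0/pgdata
      # mount it read-only so the files can be copied/streamed to the staging host
      mkdir -p /mnt/pgdata-snap && mount -o ro /dev/vg0/pgdata-snap /mnt/pgdata-snap
      # ...copy the data across, then clean up
      umount /mnt/pgdata-snap && lvremove -f /dev/vg0/pgdata-snap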

              2017/01/31 21:00 - Spike in database load due to spam users - Twitter | Slack

              Blocked users based on IP address
              Removed a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account (causing high DB load). This was communicated with the infrastructure and support team.
              Removed users for spamming (by creating snippets) - Slack
              Database load goes back to normal, some manual PostgreSQL vacuuming is applied here and there to catch up with a large amount of dead tuples.

              2017/01/31 22:00 - Replication lag alert triggered in pagerduty Slack

              Attempts to fix db2, it’s lagging behind by about 4 GB at this point
              db2.cluster refuses to replicate, /var/opt/gitlab/postgresql/data is wiped to ensure a clean replication
              db2.cluster refuses to connect to db1, complaining about max_wal_senders being too low. This setting is used to limit the number of WAL (= replication) clients
              YP adjusts max_wal_senders to 32 on db1, restarts PostgreSQL
              PostgreSQL complains about too many semaphores being open, refusing to start
              YP adjusts max_connections to 2000 from 8000, PostgreSQL starts again (despite 8000 having been used for almost a year)
              db2.cluster still refuses to replicate, though it no longer complains about connections; instead it just hangs there not doing anything
              At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.
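
      For reference, the two settings being adjusted above live in postgresql.conf under the data directory mentioned earlier; a minimal excerpt with the values from the timeline, purely illustrative:

      # postgresql.conf excerpt (illustrative)
      max_wal_senders = 32      # number of WAL-streaming (replication) clients allowed to connect
      max_connections = 2000    # lowered from 8000; each connection slot reserves semaphores/shared memory
      # both settings only take effect after a server restart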

              2017/01/31 23:00-ish

              YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
              2017/01/31 23:27 YP - terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left - Slack

      Recovery - 2017/01/31 23:00 (backup from ±17:20 UTC)

              Suggested recovery solutions:

              Migrate db1.staging.gitlab.com data to GitLab.com (±6 hours old)

              CW: Problem with web hooks, these are removed as part of the staging sync.

              Restore LVM snapshot (6 hours old)
              Sid: try to undelete files?

              CW: Not possible! `rm -Rvf` Sid: OK
              JEJ: Probably too late, but isn't it sometimes possible if you make the disk read-only quickly enough? Also might still have file descriptor if the file was in use by a running process according to http://unix.stackexchange.com/a/101247/213510 [stackexchange.com]
              YP: PostgreSQL doesn't keep all files open at all times, so that wouldn't work. Also, Azure is apparently also really good in removing data quickly, but not at sending it over to replicas. In other words, the data can't be recovered from the disk itself.
              SH: It appears the db1 staging server runs a separate PostgreSQL process under the gitlab_replicator directory that streams production data from db2. Due to replication lag, db2 was killed 2017-01-31 05:53, which caused the gitlab_replicator to stop. The good news is that the data up until that point looks unaltered, so we may be able to recover the WebHook data.
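
      For context on the "file descriptor" idea above: on Linux a deleted file can sometimes be copied back as long as some running process still holds it open, via /proc. A minimal sketch (the PID and fd numbers are placeholders); as YP notes, it doesn't help here because PostgreSQL doesn't keep all of its data files open:

      # list files that are deleted but still held open by a process
      lsof +L1 | grep -i deleted
      # /proc/<pid>/fd/<fd> still references the data even though the name is gone
      cp /proc/12345/fd/42 /tmp/recovered-file   # 12345 and 42 are placeholders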

              Action taken:

              2017/01/31 23:00 - 2017/02/01 00:00: The decision is made to restore data from db1.staging.gitlab.com to db1.cluster.gitlab.com (production). While 6 hours old and without webhooks, it’s the only available snapshot. YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN.
              2017/02/01 00:36 - JN: Backup db1.staging.gitlab.com data
              2017/02/01 00:55 - JN: Mount db1.staging.gitlab.com on db1.cluster.gitlab.com

              Copy data from staging /var/opt/gitlab/postgresql/data/ to production /var/opt/gitlab/postgresql/data/

              2017/02/01 01:05 - JN: nfs-share01 server commandeered as temp storage place in /var/opt/gitlab/db-meltdown
              2017/02/01 01:18 - JN: Copy of remaining production data, including pg_xlog tar’ed up as ‘20170131-db-meltodwn-backup.tar.gz’
              2017/02/01 01:58 - JN: Start rsync from stage to production
              2017/02/01 02:00 - CW: Updated deploy page to explain the situation. Link
              2017/02/01 03:00 - AR: rsync progress approximately 50% (by # of files)
              2017/02/01 04:00 - JN: rsync progress approximately 56.4% (by # of files). Data transfer is slowed by two factors: network I/O between us-east and us-east-2 and disk throughput cap on staging server (60 Mb/s).
              2017/02/01 07:00 - JN: Found a copy of pre-sanitized data on db1 staging in /var/opt/gitlab_replicator/postgresql. Started db-crutch VM in us-east to backup this data to another host. Unfortunately, this system maxes out at 120 GB RAM and cannot support the production load. This copy will be used to check the database state and export the WebHook data.
              2017/02/01 08:07 - JN: Data transfer has been slow: total transfer progress by data size is 42%.
              2017/02/02 16:28 - JN: Data transfer complete
              2017/02/02 16:45 - Following restore procedures below
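
      A minimal sketch of the kind of commands behind the "tar'ed up" and "rsync from stage to production" steps above; the paths and hostnames mirror the ones in the timeline, but the exact invocations are assumptions rather than GitLab's actual scripts:

      # preserve whatever is left of production (including pg_xlog) before anything is overwritten
      tar -czf /var/opt/gitlab/db-meltdown/db-meltdown-backup.tar.gz -C /var/opt/gitlab/postgresql data
      # copy the 6-hour-old staging data directory to production; --archive keeps ownership and
      # permissions, --progress yields the percentages quoted above
      rsync --archive --progress /var/opt/gitlab/postgresql/data/ \
            db1.cluster.gitlab.com:/var/opt/gitlab/postgresql/data/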

              Restore procedure

              [x] - Make a snapshot of the DB1 server - or 2 or 3 - done at 16:36 UTC
              [x] - Upgrade db1.cluster.gitlab.com to PostgreSQL 9.6.1 as it’s still running 9.6.0 while staging uses 9.6.1 (PostgreSQL might not start otherwise) -

              To install 8.16.3-EE.1
              YP move from chef-noop to chef-client (was disabled manually)
              YP running chef-client in the host @16:45

              [x] - Start the DB - 16:53 UTC

              Monitor that it starts and becomes available.
              It’s back up

              [x] - Update the Sentry DSN on the DB so errors don’t go to staging.
              [x] - Increment IDs on all tables by 10k to be on the safe side when creating new projects/notes. Done using https://gist.github.com/anonymous/23e3c0d41e2beac018c4099d45ec88f5, [github.com] which reads from a text file containing all sequences (one per line)
              [x] - Flush Rails/Redis cache
              [x] - Attempt to restore webhooks, if possible

              [x] Start staging using the snapshot prior to webhook removal
              [x] Check if webhooks are present
              [x] Generate SQL dump (data only) of the “web_hooks” table (if there is data)
              [x] Copy SQL dump to the production server
              [x] Import SQL dump into the production database

              [x] - Test if workers can connect via a Rails console
              [x] - Gradually start workers
              [x] - Disable deploy page
              [x] - Tweet from @gitlabstatus
              [x] - Create outage issues describing future plans/actions:

              https://gitlab.com/gitlab-com/infrastructure/issues/1094 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1095 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1096 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1097 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1098 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1099 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1100 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1101 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1102 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1103 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1104 [gitlab.com]
              https://gitlab.com/gitlab-com/infrastructure/issues/1105 [gitlab.com]

              [ ] - Create new Project records for Git repositories without a Project record, where the namespace path corresponds to an existing user/group.

              PC - I’m getting a list of these repos so we can then check on the DB whether they exist or not - it’s there

              [ ] - Delete repositories for unknown (lost) namespaces

              AR - working on script based on previous point data.

              [x] - Remove the spam users again (so they don’t cause problems again)

              [x] CDN user with 47 000 IPs
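
      A minimal sketch of the webhook restore steps in the checklist above (dump only the data of the "web_hooks" table from the pre-sync staging snapshot, then load it into production); the database name, paths, and hostname are assumptions:

      # on the staging copy that still contains the web_hooks rows
      pg_dump --data-only --table=web_hooks gitlabhq_production > /tmp/web_hooks.sql
      # copy the dump to production and import it
      scp /tmp/web_hooks.sql db1.cluster.gitlab.com:/tmp/
      psql -d gitlabhq_production -f /tmp/web_hooks.sql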

              TODO after data restored:

              Create issue to change terminal PS1 format/colours to make it clear whether you’re using production or staging (red production, yellow staging). Show the full hostname in the bash prompt for all users by default (e.g., “db1.staging.gitlab.com” instead of just “db1”): https://gitlab.com/gitlab-com/infrastructure/issues/1094 [gitlab.com]
              Somehow disallow rm -rf for the PostgreSQL data directory? Unsure if this is feasible, or necessary once we have proper backups
              Add alerting for backups: check S3 storage etc. Add a graph with the size of the backups over time, page when it goes down by more than 10%: https://gitlab.com/gitlab-com/infrastructure/issues/1095 [gitlab.com]
              Consider adding a last successful backup time in DB so admins can see this easily (suggested by customer in https://gitlab.zendesk.com/agent/tickets/58274) [zendesk.com]
              Figure out why PostgreSQL suddenly had problems with max_connections being set to 8000, despite it having been set to that since 2016-05-13. A large portion of frustration arose because of this suddenly becoming a problem: https://gitlab.com/gitlab-com/infrastructure/issues/1096 [gitlab.com]
              Look into increasing replication thresholds via WAL archiving / PITR - also will be useful after failed upgrades: https://gitlab.com/gitlab-com/infrastructure/issues/1097 [gitlab.com]
              Create troubleshooting guide for problems users might encounter after we go online
              Experiment with moving data from one datacenter to another via AzCopy: Microsoft says this should be faster than rsync:

              This seems to be a Windows only thing, and we don’t have any Windows experts (or anybody remotely familiar enough with it to try this out properly)
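
      A minimal sketch of the prompt change proposed in the first TODO item above (full hostname in the prompt, red for production, yellow for staging); keying off the hostname suffix and the exact colour codes are illustrative choices, not what GitLab ended up deploying:

      # e.g. in /etc/bash.bashrc or /etc/profile.d/prompt.sh
      case "$(hostname -f)" in
        *.cluster.gitlab.com) PS1='\[\e[1;31m\]\u@\H:\w# \[\e[0m\]' ;;  # production: red
        *.staging.gitlab.com) PS1='\[\e[1;33m\]\u@\H:\w# \[\e[0m\]' ;;  # staging: yellow
        *)                    PS1='\u@\H:\w\$ ' ;;
      esac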

      Problems Encountered

              LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
              Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

              SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.

              Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
              The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
              The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented

              SH: We learned later that the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, pruning the replication configuration, and starting up a separate PostgreSQL server.

              Our backups to S3 apparently don’t work either: the bucket is empty
              We don’t have solid alerting/paging for when backups fail; we are seeing this on the dev host too now.

      So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place. => we're now restoring a backup from 6 hours ago that worked

      http://monitor.gitlab.net/dashboard/db/postgres-stats?panelId=10&fullscreen&from=now-24h&to=now [gitlab.net]
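
      A minimal sketch of how the pg_dump/PG_VERSION mismatch described above could be caught; the omnibus paths are assumptions based on the directories mentioned elsewhere in this document:

      # omnibus selects its PostgreSQL binaries based on this file; if it is missing, the 9.2
      # tools are used, which cannot dump a 9.6 cluster, and the failure is silent
      cat /var/opt/gitlab/postgresql/data/PG_VERSION   # should print 9.6
      which pg_dump && pg_dump --version               # confirm which binary actually runs
      # a dump that is only a few bytes long means the backup job is failing
      ls -lh /var/opt/gitlab/backups/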

      External help
      Hugops (please add kind reactions here, from twitter and elsewhere)

              A Twitter Moment for all the kind tweets: https://twitter.com/i/moments/826818668948549632 [twitter.com]
              Jan Lehnardt https://twitter.com/janl/status/826717066984116229 [twitter.com]
              Buddy CI https://twitter.com/BuddyGit/status/826704250499633152 [twitter.com]
              Kent https://twitter.com/kentchenery/status/826594322870996992 [twitter.com]
              Lead off https://twitter.com/LeadOffTeam/status/826599794659450881 [twitter.com]
              Mozair https://news.ycombinator.com/item?id=13539779 [ycombinator.com]
              Applicant https://news.ycombinator.com/item?id=13539729 [ycombinator.com]
              Scott Hanselman https://twitter.com/shanselman/status/826753118868275200 [twitter.com]
              Dave Long https://twitter.com/davejlong/status/826817435470815233 [twitter.com]
              Simon Slater https://twitter.com/skslater/status/826812158184980482 [twitter.com]
              Zaim M Ramlan https://twitter.com/zaimramlan/status/826803347764043777 [twitter.com]
              Aaron Suggs https://twitter.com/ktheory/status/826802484467396610 [twitter.com]
              danedevalcourt https://twitter.com/danedevalcourt/status/826791663943241728 [twitter.com]
              Karl https://twitter.com/irutsun/status/826786850186608640 [twitter.com]
              Zac Clay https://twitter.com/mebezac/status/826781796318707712 [twitter.com]
              Tim Roberts https://twitter.com/cirsca/status/826781142581927936 [twitter.com]
              Frans Bouma https://twitter.com/FransBouma/status/826766417332727809 [twitter.com]
              Roshan Chhetri https://twitter.com/sai_roshan/status/826764344637616128 [twitter.com]
              Samuel Boswell https://twitter.com/sboswell/status/826760159758262273 [twitter.com]
              Matt Brunt https://twitter.com/Brunty/status/826755797933756416 [twitter.com]
              Isham Mohamed https://twitter.com/Isham_M_Iqbal/status/826755614013485056 [twitter.com]
              Adrià Galin https://twitter.com/adriagalin/status/826754540955377665 [twitter.com]
              Jonathan Burke https://twitter.com/imprecision/status/826749556566134784 [twitter.com]
              Christo https://twitter.com/xho/status/826748578240544768 [twitter.com]
              Linux Australia https://twitter.com/linuxaustralia/status/826741475731976192 [twitter.com]
              Emma Jane https://twitter.com/emmajanehw/status/826737286725455872 [twitter.com]
              Rafael Dohms https://twitter.com/rdohms/status/826719718539194368 [twitter.com]
              Mike San Román https://twitter.com/msanromanv/status/826710492169269248 [twitter.com]
              Jono Walker https://twitter.com/WalkerJono/status/826705353265983488 [twitter.com]
              Tom Penrose https://twitter.com/TomPenrose/status/826704616402333697 [twitter.com]
              Jon Wincus https://twitter.com/jonwincus/status/826683164676521985 [twitter.com]
              Bill Weiss https://twitter.com/BillWeiss/status/826673719460274176 [twitter.com]
              Alberto Grespan https://twitter.com/albertogg/status/826662465400340481 [twitter.com]
              Wicket https://twitter.com/matthewtrask/status/826650119042957312 [twitter.com]
              Jesse Dearing https://twitter.com/JesseDearing/status/826647439587188736 [twitter.com]
              Franco Gilio https://twitter.com/fgili0/status/826642668994326528 [twitter.com]
              Adam DeConinck https://twitter.com/ajdecon/status/826633522735505408 [twitter.com]
              Luciano Facchinelli https://twitter.com/sys0wned/status/826628970150035456 [twitter.com]
              Miguel Di Ciurcio F. https://twitter.com/mciurcio/status/826628765820321792 [twitter.com]
              ceej https://twitter.com/ceejbot/status/826617667884769280 [twitter.com]
              Mark Imbriaco https://twitter.com/markimbriaco/status/826805451245383684 [twitter.com]

      Stephen Frost

      https://twitter.com/net_snow/status/826622954964393984 [twitter.com] @gitlabstatus hey, I'm a PG committer, major contributor, and love what you all do. HMU if I can help in any way, I'd be happy to help.
      Sam McLeod

      (twitter DM): Hey Sid, Sorry to hear about your database / LVM issue, bugger of a thing to happen. Hey we run quite a number of PostgreSQL clusters (master/slave) and I noticed a few things in your report.

      1. You're using Slony - that thing is a POS, and that's not an understatement; have a laugh at it by following http://howfuckedismydatabase.com [howfuckedismydatabase.com] . PostgreSQL's inbuilt binary streaming replication, however, is rock solid and very fast; I suggest switching to that.

      2. No mention of a connection pooler, and mention of having thousands of connections set in postgresql.conf - this is really bad and very inefficient for performance. I suggest using pg_bouncer as a connection pooler - https://pgbouncer.github.io/ [github.io] - and not setting PostgreSQL's max_connections over 512-1024; realistically, if you're using more than 256 active connections you need to scale out, not up.

      3. The report mentions how fragile your failover and backup processes are; we wrote a simple script for PostgreSQL failover and documentation to go with it - would you like me to provide you with it? As far as backups - we use pgbarman to perform many incremental backups during the day and full backups twice daily, both via barman and PostgreSQL's pg_dump command. It's important to have your backup directory on different storage from your PostgreSQL data, for both performance and resiliency / portability.

      4. You're still on Azure?!?! I'd suggest getting off that crudbucket; so many internal DNS, NTP, routing and storage IO issues with Microsoft's platform it's ridiculous. I've heard some horror stories of how it's held together internally too.

      Let me know if you'd like any more advice on tuning PostgreSQL, I've had a lot of experience with it.
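
      A minimal sketch of the pg_bouncer setup suggested in point 2 above: the application connects to pgbouncer, which multiplexes many client connections onto a small pool of real PostgreSQL connections. The database name, pool sizes, and ports are illustrative:

      ; /etc/pgbouncer/pgbouncer.ini (illustrative)
      [databases]
      gitlabhq_production = host=127.0.0.1 port=5432 dbname=gitlabhq_production

      [pgbouncer]
      listen_addr = 127.0.0.1
      listen_port = 6432
      pool_mode = transaction     ; server connections return to the pool after each transaction
      default_pool_size = 100     ; real PostgreSQL connections per database/user pair
      max_client_conn = 2000      ; client connections the application side may open
      ; the application then connects to port 6432 instead of 5432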

        Capt. McLeod

        also - question - how big is your database(s) on disk? like are we talking TB here or still in the GB?

      7 hours ago

        Capt. McLeod

        open sourced my failover / replication sync script:

      7 hours ago

      Also - I see you're looking at pgpool - I would not suggest that, look at pgbouncer instead

        Capt. McLeod

        Pgpool has lots of problems, we tested it thoroughly and then binned it

      5 hours ago

        Capt. McLeod

        Also, let me know if there's anything I can say publicly on twitter or whatever to support GitLab and your transparency through this event, I know how shit these things are, we had SAN level split brain at infoxchange when I first started and I was so nervous I was vomiting!

      4 hours ago

        Sid Sijbrandij

        Hi Sam, thanks for all the help. Mind if I paste it in a public document to share with the rest of the team?

      3 minutes ago

        Capt. McLeod

        The failover script?

      2 minutes ago

        Sid Sijbrandij

        Everything you messaged.

      1 minute ago

      Sure, it's a public repo anyway, but yeah, I'm not saying it's perfect - far from it - but it does work really reliably; I fail between hosts all the time without issue, but YMMV etc etc

      Yeah, absolutely re: other recommendations too

      If you can send me information about your VM that has PostgreSQL on it and your PostgreSQL.conf file I can make comments on any changes / concerns and explain each one

              Sid - Slony was only used for upgrading from 9.2 to 9.6, we use streaming replication for our regular replication needs.

      Comment - Ok that's good to know, FYI, within major versions of PostgreSQL you can just use the built in replication to perform upgrades.

              Rails already pools/reuses connections, with 25 connections per process. With 20-ish processes for 20 hosts this produces 10 000 max connections, though there will only be 400-ish active concurrently (as Unicorn is single threaded)

      Comment - Each PostgreSQL connection uses memory; it's inefficient to have a lot of connections open. This is where pg_bouncer comes into play: it's a fantastically simple, fast connection pooler that does one thing and does it well, whereas pgpool really complicates things - it can rewrite queries and prevent certain queries from working as expected, and it's really designed for people who aren't using an ORM / db framework and have complete control over the SQL that gets run.

      Worth reading: https://wiki.postgresql.org/wiki/Number_Of_Database_Connections [postgresql.org]

              For load balancing, pooling, better failover, etc, we’re looking into pgpool + streaming replication, with synchronous commits (for data consistency). Pgbouncer does not do load balancing (at least out of the box) as far as we’re aware, only connection pooling. https://github.com/awslabs/pgbouncer-rr-patch [github.com] might also be an option.
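
      For the "streaming replication with synchronous commits" setup mentioned above, a minimal postgresql.conf excerpt for the primary; the standby name is illustrative:

      wal_level = replica                  # produce WAL suitable for streaming replicas (9.6 naming)
      max_wal_senders = 32                 # as adjusted earlier in the timeline
      synchronous_standby_names = 'db2'    # a commit waits until this standby has confirmed the WAL
      synchronous_commit = on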

      Question - Are you running multiple, active / active PostgreSQL nodes at present, if not, how are you performing load balancing?

      Question - How many RPM does the site pull on an average day? aka how many page loads and requests per second?

      # START Update from Sam McLeod 02/02/2017:

      Note to GitLab: I (Sam) am more than happy to assist by gathering this information and the answers to these questions myself if you are able to provide me remote access to a host or VM snapshot that is the same as the production database host, but without any client or private data in it, i.e. just a fresh image and schema built to your standard. It'd be very quick for me to answer and document all these questions by doing so, and it'd only leave me requesting information regarding your specific hosting environment that I couldn't get / rely on from a clone (i.e. ‘cat /proc/cpuinfo’).

      Otherwise… where I've requested output from specific commands or files (check files for confidential data first obviously)

      Hi Sam, Sid here, it says 'waiting for information from GL Comments' but Yorick looked through the document and didn't see what he should respond to.

      Ah yeah, the whole block is a work in progress at the moment. The one thing that could be responded to is the output from the commands I've started to write below, for example ‘cat /etc/fstab’ etc., which will give me an idea of how you're mounting filesystems, what fs, what options etc... oh, and whether you're doing sane things like keeping logs and backups in separate directories etc...

      Otherwise happy just for you guys / gals to wait for me to finish my list of questions tomorrow, then reply when you can and I'll look over it and make comments / recommendations from what I've worked with / seen fail etc...

      Really, I just want to do whatever might help you out, if it's actually a burden to have a resource look through and respond to questions about the config or run some (discovery) commands - I totally get that and we can look at it later. But if as part of recovering from the ‘event’ you're reconfiguring things - it might make sense to at least review the PostgreSQL.conf options, filesystem mounting options and kernel tuning.

      Initial observations prior to information being further provided

              Reduce max_connections and use pg_bouncer as a connection pooler
              Investigation into VM tuning and performance configuration required
              Review of postgresql.conf required
              Suggest using binary / streaming replication for the database
              Backups should be restored and decompressed, then tested for consistency, on a regular basis; an alert should create a ticket / whatever for someone to action whenever a failure occurs
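
      A minimal sketch of the last observation above (regularly restore-testing backups), assuming the backups are compressed SQL dumps; the paths, database names, and the sanity query are placeholders:

      # restore the newest dump into a scratch database and run a trivial consistency check;
      # wire any non-zero exit status into alerting so a failure opens a ticket
      set -e
      latest=$(ls -t /backups/*.sql.gz | head -n 1)
      createdb backup_verify
      gunzip -c "$latest" | psql -d backup_verify -v ON_ERROR_STOP=1 > /dev/null
      psql -d backup_verify -c "SELECT count(*) FROM projects;"   # placeholder sanity check
      dropdb backup_verify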

      Further information required from GitLab engineers

      Information required from GitLab for tuning recommendations; before we tune anything we need to form a baseline, both to share knowledge and to make any future changes measurable.

      Question - What are the general specs of the VMs you’re running PostgreSQL on?

      Information Request - Please run the following to gather information about the VMs resources and configuration:

      cat /proc/meminfo

      cat /proc/cpuinfo

      cat /etc/security/limits*

      cat /var/lib/pgsql/9.6/data/postgresql.conf # CentOS/RHEL

      cat /etc/postgresql/9.6/main/postgresql.conf # Debian

      cat /etc/fstab

      cat /etc/redhat-release # CentOS/RHEL

      cat /etc/debian_version # Debian

      mount

      systemctl zram status

      lsblk

      pvdisplay

      vgdisplay

      lvdisplay

      lsmod

      /sbin/sysctl -a

      fdisk -l /dev/

      blockdev --get-ra /dev/ # For each of your mounted disks

      smartctl -a /dev/sda   # or whatever your first disk is

      rpm -qa | grep -iE 'postgres|kernel|psql|e2fs|ext|xfs|btrfs' # CentOS/RHEL
      dpkg --list | grep -iE 'postgres|linux-image|psql|e2fs|ext|xfs|btrfs' # Debian

      Information Request - Benchmark storage performance. (Warning: this should be run on a copy of the production VM, as the settings / backing performance should represent production, but you don't want the application running while the tests are in progress.)

      yum install fio

      fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=10G --readwrite=randread

      fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=10G --readwrite=randwrite

      fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=2048k --iodepth=64 --size=10G --readwrite=read

      fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=2048k --iodepth=64 --size=10G --readwrite=write

      # END Update from Sam McLeod 03/02/2017


  • (Score: 3, Informative) by butthurt (6141) on Friday February 03 2017, @08:23AM (#462278) Journal

    Did you try? I have Javascript disabled, and the document is perfectly readable for me. On other occasions I have experienced what you describe.

    • (Score: 3, Informative) by Anonymous Coward on Friday February 03 2017, @09:19AM (#462291)

      That's just it, you never know whether it will work or not this time, and sooner or later you just stop clicking googledoc links altogether.

  • (Score: 2) by WizardFusion (498) Subscriber Badge on Friday February 03 2017, @10:07AM (#462308) Journal

    At least you have the option of viewing it (or not); my company proxy blocks it.

  • (Score: 0) by Anonymous Coward on Friday February 03 2017, @07:35PM (#462550)

    Google is also impossible for people who care about their privacy.