Search and destroy corrupt pages in Postgres databases

At work I found a very slow and rather huge Postgres database. Over 750GB of Bareos backup information with such incredible slow I/O that the jobs sometime timeout. So I started to clean the thing up and realized that the problem wasn’t just the amount of data but obviously the DB was corrupt.

The corrupt database

During my attempts to clean the database I tried some Bareos cleanup job which all failed.

root@bareos-director-live:~# systemd-run -t bareos-dbcheck -b -f
Query failed: SELECT FileId,Name from File WHERE Name LIKE '%/': ERR=PANIC:  corrupted item lengths: total 22664, available space 7912
SSL SYSCALL error: EOF detected

Then I logged into the Postgres DB directly and manually started a VACUUM task to see what exactly fails. These should normally run automatically as AUTOVACUUM but I the only error report from them which I could find was `autovacuum: found orphan temp table “pg_temp_20”.
Attempting an manual VACUUM FULL which rewrites all data and discards obsolete entries was unsuccessful:

bareos_catalog=# VACUUM FULL;
WARNING:  concurrent delete in progress within table "file"
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Okay the working threat just crashes when doing a VACUUM, that explains why the Bareos cleanup jobs cannot finish and probably also explains why the database is so big. Space with old data just cannot be claimed again and is therefore never overwritten.

So I tried to figure out where exactly the corrupt data lies. Therefore I ran the VACCUUM on every single table in the database and voilà: the ‘file’ table seems to be the culprit.

bareos_catalog=# VACUUM VERBOSE file;
INFO:  vacuuming "public.file"
INFO:  scanned index "file_pkey" to remove 89478200 row versions
DETAIL:  CPU 33.06s/183.07u sec elapsed 298.52 sec
...
INFO:  scanned index "file_pjidpart_idx" to remove 89478227 row versions
DETAIL:  CPU 1.54s/2.08u sec elapsed 13.91 sec
INFO:  "file": removed 89478227 row versions in 2349497 pages
DETAIL:  CPU 32.93s/17.48u sec elapsed 873.81 sec
PANIC:  corrupted item lengths: total 22664, available space 7912
server closed the connection unexpectedly

Attempting to fix it with Postgres tools

Somewhere in this table there seems to be an entry which says it is longer than it can be.

I tried to dump the whole database and hoped that I would be able to reimport it, but no:

root@bareos-dbm-live01:~# sudo -u postgres pg_dump --user bareos_catalog bareos_catalog > /var/lib/postgresql/dump.sql
Password: *************************

pg_dump: error: Dumping the contents of table "file" failed: PQgetResult() failed.
pg_dump: error: Error message from server: PANIC:  corrupted item lengths: total 22664, available space 7912
pg_dump: error: The command was: COPY public.file (fileid, fileindex, jobid, pathid, deltaseq, markid, fhinfo, fhnode, lstat, md5, name) TO stdout;

I search the web for an solution and found out that one approach is to set the option zero_damaged_pages = true and REINDEX the table afterwards.
Dataloss was okay for me at this time, as long as just one entry from the file tables would be lost. The Bareos backups which rely on this database would be still restorable.

But of course it ain’t that simple:

bareos_catalog=# REINDEX table file;
WARNING:  concurrent delete in progress within table "file"
ERROR:  could not access status of transaction 125829127
DETAIL:  Could not open file "pg_subtrans/0780": No such file or directory.
CONTEXT:  while checking uniqueness of tuple (23963062,6) in relation "file"

Now this looks like Postgres searches for data in a non existing file, ugh. Tried to touch that file, made it even worse. Well but at least we’ve got some information from this. The CONTEXT line explains that the error occures when working on a specific tuple (a dataset in the database).

Investigating the bad datasets

So maybe we can work on that thing directly, read it or delete it.

bareos_catalog=# select * from file where ctid='(23963062,6)';
ERROR:  invalid memory alloc request size 18446744073709551613

Nope! Don’t have that much memory installed right now and I don’t think the tuple should be that big. Although it is possible to read the first few columns of the tuple like this:

bareos_catalog=# SELECT ctid,fileid,fileindex,jobid,pathid,deltaseq,markid FROM file WHERE ctid='(23963062,6)';
     ctid     |       fileid        | fileindex  |   jobid    |   pathid   | deltaseq |   markid
--------------+---------------------+------------+------------+------------+----------+------------
 (23963062,6) | 7885913182697562177 | 1109424715 | 1664625506 | 1682055211 |    25968 | 1092632864
(1 row)

But as soon as I tried to read the next column fhinfo I get an abnormal long response and all following columns are unreadable. Also the pointer to the tuple which resides before the corrupt one (23963062,5) works fine.

Then I tried to read all the tuples in the page. The page no. is the 23963062 and the digit after the comma is the tuple. I found out that there are dozens of tuples with the same type of curruption and also one which complains about ERROR: duplicate key value violates unique constraint "file_pkey". Reading data from other pages was possible. Maybe we have to accept that this whole page is lost…

Tracking down the bad pages on disk

Now I really was on new territory. I found two guides which describe more or less the same way to zapp out an entire page from a corrupt Postgres DB.
https://www.endpoint.com/blog/2010/06/01/tracking-down-database-corruption-with
https://www.postgresql.org/message-id/1184245756.24101.178.camel@coppola.muc.ecircle.de
Respect for these howtos, and thaks a lot for sharing!

First it must be clarified in which file the corrupt page resides. Therefore we need to get the OID of the database, the RELFILENODE of the table and the blocksize to calculate the offset:

bareos_catalog=# select oid from pg_database where datname = 'bareos_catalog';
  oid
-------
 16387
(1 row)
bareos_catalog=# select relfilenode from pg_class where relname = 'file';
 relfilenode
-------------
       16402
(1 row)
postgres=# SELECT current_setting('block_size');
 current_setting
-----------------
 8192
(1 row)

With this information the file can be located under /var/lib/postgresql/9.6/main/base/16387/16402 but since the table is several hundred GB in size, the table is splitted into multiple files.

root@syseleven.managementbki1.backupng.live.dbm:/var/lib/postgresql/9.6/main# ls -1 base/16387/16402\* | wc -l
359

So we still need to figure out in which sub file the page resides. Let’s start calculating

Byte offset of the page is the page no. multiplied with the block size: 23963062 * 8192 = 196305403904
Now each split file of the table has the size of 1 GB (I think I read that somewhere). So we’re search for the no. of the file in which out page offset lies.
That is the page no. multiplied with the block size devided by 1GB: 23963062 * 8192 / 1024 / 1024 / 1024 = 182
The page should be located in the 182nd file; now we’re searching for the byte offset inside that file.
To get that I just substracted 182GB from the overall offset: 23963062 * 8192 - ( 182 * 1024 * 1024 * 1024 ) = 884391936
And last but not least we calculate the offset inside that file in page no. instead of bytes by deviding the offset by the block size: 884391936 / 8192 = 107958

Now before you do anything of this, make sure your database is stopped an you have a backup. Things can still get much worse.

We can build a dd command with that no. to read out the corrupt page. If you know what data is to be expected here, you can verify that you’re on the correct page. I didn’t.

dd if=base/16387/16402.182 bs=8192 count=1 skip=107958 of=test.dd

xxd test.dd | head
000001d0: 0000 0000 0700 8007 0080 6941 2041 2049  ..........iA A I
000001e0: 4567 2041 2041 2041 2041 202d 4220 4241  Eg A A A A -B BA
000001f0: 4120 4220 4264 706d 4b7a 2042 6347 6765  A B BdpmKz BcGge
00000200: 6320 4264 7065 7868 2041 2041 2043 5954  c Bdpexh A A CYT
00000210: 6448 776a 5737 6943 7a4a 5870 3941 4f48  dHwjW7iCzJXp9AOH
00000220: 4448 4b4b 3668 3231 6a70 3558 7174 4537  DHKK6h21jp5XqtE7
00000230: 5a48 7255 5059 416e 6738 1363 6f6e 7465  ZHrUPYAng8.conte
00000240: 6e74 7302 a455 ee02 520a c003 0000 0000  nts..U..R.......
00000250: 0000 0000 6d01 b5a5 3b00 0b00 0209 1800  ....m...;.......
00000260: d8fd 5fcd 0100 0000 ef52 0000 d6e9 0000  .._......R......
...

I just overwrote the whole page with zeros.

Deleting the corrupt pages

Yes that means dataloss but there’s no way to repair the corrupt tuples and most probaly the data in the page is already obsolete but could not be reclaimed.

As mentioned in the other guides: notice the conv=notrunc to prevent dd from truncating the rest of the file!

dd if=/dev/zero conv=notrunc of=base/16387/16402.182 bs=8192 count=1 seek=107958

I read the page again and saw 8K of zeros in there. Looks good to me. Time to look confident and starting Postgres again.

bareos_catalog=# select * from file WHERE ctid='(23963062,6)';
 fileid | fileindex | jobid | pathid | deltaseq | markid | fhinfo | fhnode | lstat | md5 | name
--------+-----------+-------+--------+----------+--------+--------+--------+-------+-----+------
(0 rows)
bareos_catalog=# VACUUM VERBOSE file;
INFO:  vacuuming "public.file"
WARNING:  relation "file" page 23963062 is uninitialized --- fixing
INFO:  scanned index "file_pkey" to remove 178956713 row versions
DETAIL:  CPU 36.52s/189.18u sec elapsed 324.00 sec
....
INFO:  "file": found 863859297 removable, 21859469 nonremovable row versions in 25139106 out of 46748205 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 287140903 unused item pointers.
Skipped 0 pages due to buffer pins.
1 page is entirely empty.
...