Complex OpenUpgrade Migration

Day 3: Problems

Graeme Gellatly

After kicking off the migration and watching the logs for a while, things slowed when it hit products.  This database has 270k products and used product_variant_multi extensively.  On top of that, there was already a requirement for company-specific and product-specific costs, a feature Odoo didn't fully introduce until v10, though it was partially released in v8.  Worse, the migration script cannot handle attributes with identical names, and the migration crashes irrecoverably.
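For reference, a query along these lines would flag the duplicate names before they crash the script.  This is a sketch only: product_attribute is the v8 target table, and the actual v7 source table depends on which attribute model product_variant_multi used in your database.

```sql
-- Find attribute names that appear more than once,
-- which the product migration script cannot handle.
-- Table name is an assumption; adjust for your schema.
SELECT name, count(*)
FROM product_attribute
GROUP BY name
HAVING count(*) > 1;
```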

So for the product module's migration to v8 we have a script that is:

  • way too slow (17 hours until crash),

  • inserts incorrect data, and

  • doesn't work.

So if we are going to continue with OpenUpgrade, at the very least the product migration is going to need a big rethink.

Moving on, we removed the script causing the crash and restarted.  This database has millions of workflow records, and at the time of writing the migration has been running for two hours just deleting a couple of million of them.  So again we have a script which is simply too slow.  Luckily, OpenUpgrade wraps this step in a function, delete_model_workflow, and its docstring handily tells us why it is needed: to "Forcefully remove active workflows for obsolete models, to prevent foreign key issues when the orm deletes the model".

Knowing that by our end state there will no longer be any workflow-related tables anyway, as workflows were completely removed in v11, we just need to prevent the foreign key issues from occurring without creating new problems.  The simplest way would be to truncate the tables before starting the first upgrade, but to do that we first need to know they aren't required.
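To get a sense of how much data is involved before truncating anything, a quick query against PostgreSQL's statistics views shows approximate row counts for the legacy workflow tables (a sketch; the wkf_* names are the pre-v11 workflow engine tables):

```sql
-- Approximate live row counts for the legacy workflow tables
SELECT relname, n_live_tup
FROM pg_stat_user_tables
WHERE relname LIKE 'wkf%'
ORDER BY n_live_tup DESC;
```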

To do that we need to check the migration scripts for all models that presently contain a workflow.
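One way to see which models actually have workflow instances in this particular database is to query wkf_instance directly; in the old workflow schema, res_type holds the model name (a sketch, assuming the standard pre-v11 table layout):

```sql
-- Which models have workflow instances, and how many
SELECT res_type AS model, count(*) AS instances
FROM wkf_instance
GROUP BY res_type
ORDER BY instances DESC;
```

The models at the top of this list are the ones whose migration scripts need checking before the tables can be truncated.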

Presently in this database, the major workflows to worry about are Pickings, Procurements, Invoices, Account Voucher, Sales Orders and Purchase Orders.

In v8, Pickings and Procurements have their workflows removed.  Procurements look straightforward to delete, as it is the first thing done.  Pickings look like they might be OK as well, as nothing requiring a workflow appears to occur before then.  This isn't surprising: the workflow entries are removed, so unless some state was moved elsewhere it wouldn't matter.  Invoices, Production Orders and Account Vouchers are all free of any workflow changes; purchase orders, however, make extensive updates.  It seems the purchase order updates are entirely self-contained to the workflow tables without impact elsewhere, so we might still get away with it.  Sales Orders are tangentially affected, but shouldn't be an issue with the workflow shifted from sale_stock to sale.

After a quick check through later OpenUpgrade versions, it seems that just truncating the table is worth a try.  The time savings won't be quite linear, since in the version where workflows are finally removed there are no longer any records to delete, but it cuts at least 3 million rows out now.  To do this we add the file odoo/custom/entrypoint.d/30-truncate-workflow.sh to our docker build with the following content.

#!/bin/bash
# Truncate the legacy workflow work items before OpenUpgrade starts,
# so delete_model_workflow has (almost) nothing left to delete.
result="$(psql -qc 'TRUNCATE TABLE wkf_workitem CASCADE;' 2>&1)"
ok=$?
if [ "$ok" -eq 0 ] && [ -z "$result" ]; then
    log INFO Workflow items removed from database
elif [ "$ok" -ne 0 ]; then
    log WARNING Attempt to remove workflow items in \
        "$PGDATABASE@$PGHOST" failed with this message: "$result"
fi

I've waited long enough to know I'm not waiting any more: for this to be feasible, the migration eventually has to fit in an upgrade window.  So we'll rebuild our image and create a new database to get started on.

So in our compose terminal

docker-compose build
docker-compose down

Then in psql

DROP DATABASE upgrade1;
CREATE DATABASE upgrade1 OWNER upgrade TEMPLATE upgrade_master;

Then back in our compose terminal.

docker-compose run --rm odoo -u all --workers=0 --stop-after-init

We notice again a big hold-up on procurements, this time writing around one million new mail messages.  A check reveals that 99% of the messages are not only identical but also entirely pointless.  Adding 990,000 pointless messages to another table is only going to cause even more delays downstream, so we add another SQL file and repeat.  Initially I added a file like the one below, which dropped the messages to 3,500 or so, but on reflection the procurement model is going away anyway and these messages aren't really necessary, so we just set them all to NULL instead.
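Before blanking anything, a quick aggregate shows just how repetitive the messages are (a sketch; run against the v7 schema, where procurement_order still has a message column):

```sql
-- Count how many times each procurement message repeats
SELECT message, count(*) AS occurrences
FROM procurement_order
WHERE message IS NOT NULL AND message != ''
GROUP BY message
ORDER BY occurrences DESC
LIMIT 5;
```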

#!/bin/bash
# Blank out procurement messages in three passes: old records,
# boilerplate stock-assignment messages, and cancelled procurements.
result="$(psql -qc 'UPDATE procurement_order SET message = NULL WHERE create_date < $$2018-03-31$$ AND message IS NOT NULL AND message != $$$$;' 2>&1)"
ok=$?
if [ "$ok" -eq 0 ] && [ -z "$result" ]; then
    log INFO Messages from old procurements removed
elif [ "$ok" -ne 0 ]; then
    log WARNING Attempt to remove old procurement messages in \
        "$PGDATABASE@$PGHOST" failed with this message: "$result"
fi

result="$(psql -qc 'UPDATE procurement_order SET message = NULL WHERE message IN ($$Products reserved from stock.$$,$$from stock: products assigned.$$);' 2>&1)"
ok=$?
if [ "$ok" -eq 0 ] && [ -z "$result" ]; then
    log INFO Normal procurement messages removed
elif [ "$ok" -ne 0 ]; then
    log WARNING Attempt to remove normal procurement messages in \
        "$PGDATABASE@$PGHOST" failed with this message: "$result"
fi

result="$(psql -qc 'UPDATE procurement_order SET message = NULL WHERE state = $$cancel$$ AND message IS NOT NULL AND message != $$$$;' 2>&1)"
ok=$?
if [ "$ok" -eq 0 ] && [ -z "$result" ]; then
    log INFO Messages from cancelled procurements removed
elif [ "$ok" -ne 0 ]; then
    log WARNING Attempt to remove cancelled procurement messages in \
        "$PGDATABASE@$PGHOST" failed with this message: "$result"
fi

In any case, these messages are non-essential and can be progressively restored later by a shell script, so I just drop the column and re-add it.
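Dropping and re-adding the column is near-instant regardless of table size, since PostgreSQL's DROP COLUMN only marks the column dead in the catalog rather than rewriting the table.  A sketch of what that looks like (the re-added column comes back empty, to be refilled later):

```sql
-- DROP COLUMN is cheap: the column is only marked dropped in the catalog.
ALTER TABLE procurement_order DROP COLUMN message;
-- Re-add it empty so the ORM still finds the column it expects.
ALTER TABLE procurement_order ADD COLUMN message text;
```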

So it's the end of Day 3 and still no completed upgrade for even one version.  But we've cut hours off the eventual upgrade with the procurement messages, and identified where the first big problems will be.  After one last restart, I notice it is now hung up in the post-migration of the account module.  A check reveals it is setting message_last_post, which uses an openupgradelib function.  A look at the function suggests it is very easy to port to SQL.  In the time it takes to fork the lib and test the SQL, the migration is still stuck here after 90 minutes, so we update pip.txt of our scaffolding to use the forked version and rebuild the image.  Account now migrates in 9 minutes, versus the 90 it was at when I killed it.
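The openupgradelib helper sets message_last_post one record at a time through the ORM; a set-based SQL port might look roughly like this.  This is a sketch, not the actual forked code: the table and model names are illustrative, and the real helper handles arbitrary models.

```sql
-- Set message_last_post to the date of each record's latest message,
-- in one UPDATE instead of one ORM write per record.
UPDATE account_invoice ai
SET message_last_post = mm.last_post
FROM (
    SELECT res_id, max(date) AS last_post
    FROM mail_message
    WHERE model = 'account.invoice'
    GROUP BY res_id
) mm
WHERE mm.res_id = ai.id;
```

One UPDATE with a grouped subquery does in a single scan what the ORM loop does with a query per record, which is where the 90-minutes-to-9 improvement comes from.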

As the upgrade is going on, lots of little data inconsistencies and warnings are being picked up which we'll address tomorrow.