GitLab CI: Prevent cache reuploads without changes

In several projects now, I have run into an issue with the way GitLab CI cache policies are implemented. The spec lets you configure a job to only pull, only push, or pull and push (pull-push) a given cache. This makes it possible to have some jobs in your pipeline push downloaded or built data into a cache without first pulling a previous cached state (useful when the job always re-creates the cache contents from scratch), or pull a cache without pushing it back (if the job only consumes the cache contents but never modifies them). However, pull-push jobs always upload the cache after execution, even if the cache contents have not changed.

What I want for some jobs is a pull-and-push-if-changed policy: a cache policy that always pulls (downloads) the cache, but only pushes it back (uploads) when the cache contents have changed during the job execution.

This cannot be done easily with out-of-the-box GitLab CI tooling, but a workaround can be implemented with a few lines of code.

Why is this useful?

Often, I need to cache data across multiple pipelines that only rarely changes.

A good example is the node_modules folder in a Node.js project. It only ever changes when the package-lock.json is modified, either by changing the project dependencies manually or after Renovate pushes a new version update.

Usually, I use a GitLab CI cache for my node_modules folder and run npm ls || npm ci in my ensure_dependencies pre-build job to keep it up to date. This is a fast and efficient way to check that the node_modules folder matches the lockfile, and to rebuild it completely if it does not, such as when some versions were changed or the cached folder is missing.

For this example, a pull-and-push-if-changed policy would have these benefits:

  • The cache is always downloaded when it is available, allowing for the fastest possible execution of the pipeline.
  • The cache is only uploaded when it has changed, which saves execution time (node_modules is often quite massive and slow to compress and upload) and bandwidth.

The solution

The workaround I have found relies on the fact that GitLab CI will not upload a cache if the configured cache paths match no files. The job output shows a warning in this case, but the effect is exactly as desired: the cache is not re-uploaded.

[Screenshot: gitlab-ci-output.png]

So, what I do is delete the cached folder after the job execution, if I know that the contents have not been modified.

Importantly, GitLab CI will also not delete the existing cache in this case. This allows future CI jobs to keep using the last uploaded cache state.

Specific solution for node_modules

As mentioned above, the basic thing my ensure_dependencies job does is run npm ls || npm ci. To delete the node_modules folder when npm ls reports that it is already up to date, I can extend the code to something like this:

(npm ls && echo 'node_modules is up-to-date, deleting it to prevent cache re-upload...' && rm -rf node_modules) || (echo 'node_modules is missing or not up-to-date, reinstalling packages...' && npm ci)

Now, if the latest cache already contains an up-to-date node_modules folder, the folder is deleted before the job finishes and the cache is not re-uploaded. If the cache is missing or outdated, npm ci runs and installs the packages again, and the new cache is then uploaded for subsequent jobs to reuse.
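For readability, the same logic can also be written as an if/else. This is only a sketch: the function name ensure_dependencies_cached is made up for illustration, and it differs from the one-liner in two small ways. The package tree printed by npm ls is discarded (errors are still shown), and a failing rm no longer falls through to npm ci.

```shell
# Hypothetical helper with the same intent as the one-liner above.
ensure_dependencies_cached() {
  # npm ls exits non-zero when node_modules is missing or does not match the lockfile
  if npm ls > /dev/null; then
    echo 'node_modules is up-to-date, deleting it to prevent cache re-upload...'
    rm -rf node_modules
  else
    echo 'node_modules is missing or not up-to-date, reinstalling packages...'
    npm ci
  fi
}
```

The function would be called as the last step of the job script, since node_modules may no longer exist afterwards.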

More generic solution for any cached folders

This same workaround can also be applied to any cache configuration that caches some folder in the build directory.

To do this, I use a script that generates a hash of the complete file tree of the folder in question, once before and once after the job execution. If the hash does not change, the script then deletes the folder and prevents the unnecessary re-upload.

# reusable snippets that can be included in any job
.clear_cache_folder_if_unchanged_before:
  script:
    # check that $CACHE_FOLDER is set
    - if [ -z "$CACHE_FOLDER" ]; then echo "CACHE_FOLDER is not set"; exit 1; fi
    # calculate hash of the file list in $CACHE_FOLDER
    - export CACHE_HASH_1=$(find "$CACHE_FOLDER" -type f -print0 | sort -z | tr '\0' '\n' | sha256sum | awk '{print $1}')

.clear_cache_folder_if_unchanged_after:
  script:
    # recalculate hash of the file list in $CACHE_FOLDER
    - export CACHE_HASH_2=$(find "$CACHE_FOLDER" -type f -print0 | sort -z | tr '\0' '\n' | sha256sum | awk '{print $1}')
    # if the hash has not changed, delete the folder to avoid unnecessary cache updates
    # (GitLab will not upload the cache if the folder is missing)
    # the whole if/else is one multi-line script entry, and uses "=" so it also works in plain sh
    - |
      if [ "$CACHE_HASH_1" = "$CACHE_HASH_2" ]; then
        echo "$CACHE_FOLDER contents have not changed, deleting to avoid unnecessary cache updates"
        rm -rf "$CACHE_FOLDER"
      else
        echo "$CACHE_FOLDER contents have changed, cache will be re-uploaded"
      fi

# example job (incomplete) - Playwright tests that use a cache to avoid downloading the browsers every time
run_playwright_tests:
  variables:
    CACHE_FOLDER: .playwright_browsers
    PLAYWRIGHT_BROWSERS_PATH: $CACHE_FOLDER # this configures Playwright to use the cache folder for the browsers
  cache:
    - key: playwright-browsers
      paths:
        - $CACHE_FOLDER
      policy: pull-push # this is the default, but make it explicit - the cache is always pulled, and pushed if changed
  script:
    - !reference [ .clear_cache_folder_if_unchanged_before, script ]
    - gradle test # or whatever test command, that will run Playwright and also download the browsers if they are missing / outdated
    - !reference [ .clear_cache_folder_if_unchanged_after, script ]

You can see how by using the CACHE_FOLDER variable, this solution can be reused in any CI job with a cached folder.
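Outside of the YAML snippets, the same before/after logic can be sketched as a single shell function that wraps an arbitrary command. The name run_with_cache_guard and its argument order are assumptions made for this illustration. It uses local, so it assumes bash or a compatible shell, and if the wrapped command fails, it leaves the folder untouched and returns the command's exit status.

```shell
# Hypothetical wrapper: run a command, then delete the cache folder
# if the list of file paths inside it is the same as before.
# Usage: run_with_cache_guard <cache-folder> <command> [args...]
run_with_cache_guard() {
  local cache_folder=$1 hash_before hash_after
  shift

  # hash of the sorted file list before the command runs (empty list if the folder is missing)
  hash_before=$(find "$cache_folder" -type f -print0 2>/dev/null | sort -z | tr '\0' '\n' | sha256sum | awk '{print $1}')

  # run the wrapped command; on failure, keep the folder and pass the status on
  "$@" || return $?

  # hash of the sorted file list after the command has run
  hash_after=$(find "$cache_folder" -type f -print0 2>/dev/null | sort -z | tr '\0' '\n' | sha256sum | awk '{print $1}')

  if [ "$hash_before" = "$hash_after" ]; then
    echo "$cache_folder contents have not changed, deleting to avoid unnecessary cache updates"
    rm -rf "$cache_folder"
  else
    echo "$cache_folder contents have changed, cache will be re-uploaded"
  fi
}
```

In a job script this could be called as, for example, run_with_cache_guard "$CACHE_FOLDER" gradle test, assuming the function is sourced from a script file in the repository.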

Important limitation - changes in file contents are not detected

The above script only lists the paths of all files in the cache folder (find -type f, so directories themselves, including empty ones, are not listed) and builds a hash from this listing for comparison. It does not check whether the file contents have changed.

This means that if a file in the cache folder is modified, but the file names and paths in the folder do not change, the cache will be treated as unmodified and not re-uploaded.
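This behaviour can be reproduced locally with a throwaway folder. The sketch below wraps the same pipeline in a helper called hash_file_list (a name introduced here for illustration) and uses made-up file names. It should report that the content change is not detected, while the added file is.

```shell
# Hypothetical helper: same pipeline as in the CI snippets, hashing only the file paths.
hash_file_list() {
  find "$1" -type f -print0 | sort -z | tr '\0' '\n' | sha256sum | awk '{print $1}'
}

demo_dir=$(mktemp -d)                        # stands in for $CACHE_FOLDER
echo "version 1" > "$demo_dir/browser.bin"
before=$(hash_file_list "$demo_dir")

echo "version 2" > "$demo_dir/browser.bin"   # same path, different contents
after_edit=$(hash_file_list "$demo_dir")

touch "$demo_dir/new-file.bin"               # a new path changes the listing
after_add=$(hash_file_list "$demo_dir")

if [ "$before" = "$after_edit" ]; then echo "content change: not detected"; fi
if [ "$before" != "$after_add" ]; then echo "new file: detected"; fi
rm -rf "$demo_dir"
```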

This has not been an issue for my use cases so far, and not checking the entire file contents makes the whole operation quite fast. However, if you need to be sure that the cache contents are identical, you will need a more complex approach that compares the before-and-after file contents and not just the file names and paths.
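One possible content-aware variant, as a sketch only: hash every file with sha256sum and then hash the resulting list of checksums and paths. The helper name hash_tree_contents is hypothetical, and the -r flag of xargs is assumed to be available (it is in GNU and BusyBox xargs). Since every file has to be read, this will be noticeably slower on large folders such as node_modules.

```shell
# Hypothetical helper: hash file paths AND file contents of a folder.
hash_tree_contents() {
  # 1. list all files, NUL-separated, in a stable byte-wise order
  # 2. compute a checksum per file (output lines: "<checksum>  <path>")
  # 3. hash that combined listing into a single value
  find "$1" -type f -print0 2>/dev/null \
    | LC_ALL=C sort -z \
    | xargs -0 -r sha256sum \
    | sha256sum \
    | awk '{print $1}'
}
```

It could replace the find pipeline in the before and after snippets, for example as export CACHE_HASH_1=$(hash_tree_contents "$CACHE_FOLDER"), provided the function is defined earlier in the same job script.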

Summary

By using a simple workaround, we have shown how to implement a cache policy in GitLab CI that only uploads the cache when the contents have changed. This can be useful for large caches that are expensive to upload and do not often need updating, such as node_modules.

This solution is not limited to Node.js projects, but can be used in any GitLab CI job that uses a cache. It is a simple and effective way to optimize your CI pipeline and save time and bandwidth.

I hope that at some point GitLab CI implements a pull-and-push-if-changed cache policy natively, but until then, this workaround is a good solution.