PMPL Sprint 2 — Dockerizing DIGIPUS Application
This article is written as part of the Individual Reflection competency for the Software Quality Assurance course 2020 at the Faculty of Computer Science, University of Indonesia.
This article is both a continuation and the end of the PMPL Sprint Reflection series. You can read the Sprint 1 article here. It explains what PMPL is and the software project involved.
For Sprint 2, I worked on an issue about dockerizing the DIGIPUS application. It is summarized by this user story:
As a developer, I want to make our GitLab CI job faster by creating a Docker image that contains only the required packages, so we can use it for testing jobs.
The gist of this user story is that the project should be dockerized, both for development and for Continuous Integration (CI) purposes. The concern came from the two backgrounds explained below, each of which was given a resolution as a realization of the user story.
Let’s take a look at what is needed to run the project.
- A large collection of Python libraries that must be installed first, including Django. A Python virtual environment handles most of this. But given the sheer number of libraries, there’s no guarantee that the installation will complete in one go. Some packages depend on OS-level libraries, which are beyond the scope of Python itself. In other words, installing this project’s dependencies could cause problems on different platforms.
- A database. By default, running this project without any database configuration falls back to SQLite3. Although that is usable in development, we should still keep the development environment as close as possible to the staging or production environment, which means using PostgreSQL as the original project contributors intended. That, however, requires PostgreSQL to be installed on the host, and installing it can be quite cumbersome.
These two points explain why the application should be bundled in a standardized manner that can be installed easily on different platforms.
Dockerizing the Application for Use in Development
The objective here is that the project’s dependencies (Python libraries and PostgreSQL) can be installed easily without tinkering with the host OS, and that container technology makes distribution, or even deployment, easier.
I won’t explain the nitty-gritty of Docker as it is beyond the scope of this article. I previously wrote an article about Docker and how it was implemented in the Software Project course 2020. You can read it here.
To achieve this, I created a Dockerfile consisting of commands to build the DIGIPUS application image, as shown below
This Dockerfile uses a multi-stage build strategy, in which the image is built in two separate stages. The first stage collects all the Python libraries in “wheel” form. Installing from prebuilt wheels is preferable to a plain old pip install [packages] that may have to build from source, because it offers several advantages:
- Faster installation.
- Avoids running setup.py, which could execute arbitrary code during installation.
- No compiler is required on the target machine to install packages that contain C extensions.
- Provides better caching for testing and Continuous Integration process.
- More consistent installation process across platforms and machines.
The advantages listed above are what we’re trying to achieve for our final Docker image. Thus, the second (final) stage only installs the required Python libraries from the wheels produced by the first stage, with caching disabled, in order to keep the Docker image as small as possible.
I also used Docker Compose to orchestrate the DIGIPUS Docker image together with a PostgreSQL Docker image so they can communicate with each other. The YML file content is shown below:
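To make the two stages concrete, here is a minimal sketch of a wheel-based multi-stage build. It is not the project's actual Dockerfile: the python:3.8-slim base image, the requirements.txt filename, the /build, /wheels and /app paths, the apt packages, and the runserver command are all assumptions chosen for illustration.

```dockerfile
# Stage 1 (builder): compile every dependency into a wheel.
# Assumption: dependencies are listed in requirements.txt at the project root.
FROM python:3.8-slim AS builder

# Compilers and headers live only in this stage.
# Assumption: libpq-dev is needed to build a PostgreSQL driver from source.
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential libpq-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build
COPY requirements.txt .
# Build wheels for all requirements into /wheels without keeping pip's cache.
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2 (final): install from the prebuilt wheels only, no compiler needed.
FROM python:3.8-slim

# Assumption: the PostgreSQL driver needs the libpq runtime library.
RUN apt-get update \
    && apt-get install -y --no-install-recommends libpq5 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY --from=builder /wheels /wheels
COPY requirements.txt .
# --no-index forbids downloads from PyPI; --find-links points pip at the wheels.
RUN pip install --no-cache-dir --no-index --find-links=/wheels -r requirements.txt

# Copy the application source last so dependency layers stay cached.
COPY . .

# Assumption: the Django development server is the entry point.
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]
```

Because the compiler toolchain stays in the builder stage, it never reaches the final image. Copying the source code after the dependency installation also means that editing application code does not invalidate the cached dependency layers.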
What have we achieved here?
- Standardized installation across platforms and machines. Just provide the required environment variables, run docker-compose up -d --build and we’re good to go.
- Fewer extra steps, since compilers for certain Python packages and PostgreSQL no longer need to be installed on the host OS.
- A good start for deploying the application to a production server in a containerized manner.
The Time-Consuming Continuous Integration Process
The DIGIPUS project’s source code is hosted on the faculty’s self-hosted GitLab platform. While it works flawlessly as a Version Control System, the GitLab runners used in its pipelines are known to be slow. Every time the testing job runs, the runner re-downloads all the Python dependencies because nothing is cached between jobs. This can slow down pipelines across branches or projects, especially at “peak” times. Worse, if a job takes too long, the pipeline can fail.
In the screenshot above, the first pipeline in the list lasted more than an hour!
Custom Docker Image to Simplify the Continuous Integration Process
The solution I came up with is to build a Python-based Docker image containing all the Python libraries but not the DIGIPUS source code itself, push it to Docker Hub, then use that image as the base image for the testing job.
The Dockerfile is simple and straightforward as it only installs the required Python libraries, as shown below:
Using the above Dockerfile, I built the image in my local environment and pushed it to Docker Hub under farhanazmi/digipus-base, which can be pulled from here.
Here is the result after using the DIGIPUS base image for the testing job during another “peak” time:
Overall, the pipeline duration became noticeably shorter. This was possible because the base image already has the libraries installed, so the job can proceed straight to running the tests.
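For illustration, a dependency-only base image along these lines could be sketched as follows. This is a minimal sketch rather than the actual file behind farhanazmi/digipus-base: the python:3.8-slim base image, the requirements.txt filename, and the Dockerfile.base filename in the comments are assumptions.

```dockerfile
# Dependency-only base image for the CI testing job.
# Assumption: dependencies are listed in requirements.txt at the project root.
FROM python:3.8-slim

# Only the requirements file is copied; the DIGIPUS source code is left out
# so the image stays valid until the dependency list changes.
COPY requirements.txt /tmp/requirements.txt

# Install every dependency once, at image build time, without pip's cache.
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Build and publish from a local machine
# (Dockerfile.base is a hypothetical filename):
#   docker build -f Dockerfile.base -t farhanazmi/digipus-base .
#   docker push farhanazmi/digipus-base
```

A testing job that uses this image as its base starts with the libraries already present, so it can skip the dependency download and go straight to running the tests.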
However, this solution raises a concern: what if a new Python library is added to the dependency list at some point in the future? Every job that follows will fail if the base image isn’t updated to include the newly added library. A better solution is needed to mitigate this issue, e.g. a “build” stage in the pipeline that updates the base image and pushes it to Docker Hub.
This article discussed the issue that I closed for Sprint 2, including its background and the technical aspects of how I implemented the user story. While the solutions are far from perfect in terms of efficiency, I think they could be a starting point for improving the containerization of the DIGIPUS project. I am humbly open to any comments, concerns, or suggestions.
This wraps up the article. Thank you for reading :)