Having argued for years that the python ecosystem urgently needs to mature its project dependency management tools, I am a big fan of python poetry. Since it has become stable enough to use, I have started transitioning my projects to a pyproject.toml instead of working with conda or setup.py (the latter I perceive as rather complicated). It is important to note that the choice of tool always depends strongly on the particular use case. Working with lots of computer science / programming students, I notice a huge gap in the understanding of software engineering principles. Especially in the realm of Data Science it is very common to do quick-and-dirty prototyping (not to be confused with reasonable prototyping) and to mix lots of tools and principles. My condensed recommendation is to separate project concerns, e.g. into implementation and data analysis. If a tool or library is part of the implementation, it makes a lot of sense to go with something like poetry, while visual analytics is often better done in an interactive jupyter notebook, especially while the learning curve is still steep. In my eyes, many engineers are afraid of splitting a project (such as an experimental thesis) into such parts.
Anyway, when does it make sense to use poetry and when does it make sense to use e.g. conda or pip? It is important to realize that all of these tools serve slightly different purposes. The short answer, however, is still: if you are designing a library or tool and want to publish it (even if just for yourself), poetry is the way to go, provided you have no heavy dependencies which force you to stick with setuptools. For quickly setting up an experiment, or if you need system dependencies, conda still has its place in the python world. In my opinion, pip should simply be avoided at all costs as a direct interface and be wrapped behind poetry or conda.
Quickly setting up an experiment environment
Experiments usually require heavy libraries to hide complexity under the hood. This keeps the experiment design well-arranged, clearly laid out and decoupled from underlying technical changes. In Machine Learning / Data Science such python libraries usually include numpy, pandas, matplotlib, graphviz, networkx, scikit-learn, tensorflow, pytorch or simpy, to mention just a few. Experiments should be fully reproducible, which is still not as easy as one might think. On the dependency management level, conda suits this task best, as it can pin not only python dependencies to exact versions, but also system dependencies. This pinning to exact versions is a major difference to setting up a project such as a library or tool: libraries or tools should have very loose version restrictions to make them fit into as many other projects as possible.
Steps:
- create a new directory and initialize a git repository
- create an environment.yml file
- create an environment auto-completion file
- create a conda environment given the environment specification
A short snippet for your bash:
EXP='exp00-experimenttitle' && mkdir $EXP && cd $EXP && echo 'This file is for conda env name auto-completion' >> 'sur-'$EXP && printf 'name: sur-'$EXP'\nchannels:\n- defaults\ndependencies:\n- python>=3.8' >> environment.yml && git init && conda env create -f environment.yml && conda activate 'sur-'$EXP
To pin your environment specification to exact dependency versions, you can use conda env export to generate a specification from the versions resolved in your environment. However, this also includes indirect dependencies, which might not be desired.
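A minimal sketch of the export (the output file name is just an example); the --from-history flag, available in newer conda versions, is one way to leave out the indirect dependencies:

# pin everything that is currently resolved in the active environment
conda env export --no-builds > environment.lock.yml
# only export the packages you explicitly requested, leaving out indirect dependencies
conda env export --from-history > environment.lock.yml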
sur-exp00-experimenttitle: I usually create auto-completion files for my conda environments, as with dozens of projects it can get quite tricky to remember the names. The prefix “sur-” (for surrounding) is a custom choice and a prefix I seldom encounter elsewhere, which makes the auto-completion work nicely.
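For illustration, assuming plain bash filename completion (i.e. no dedicated completion script registered for conda), the marker file in the experiment directory lets the shell fill in the environment name:

conda activate sur-<TAB>   # filename completion expands this to sur-exp00-experimenttitle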
environment.yml: Usually I do heavy computations based on existing libraries (sometimes also on my own ones). This requires little engineering except for persisting data. Analyzing this data should then be decoupled from the computation code, e.g. by loading the pre-processed data from the CSV or JSON files in which the results have been persisted. This allows you to switch between computation results and to quickly run statistical analyses and visualizations. Visual analytics then usually involves the following packages:
name: sur-exp00-experimenttitle
channels:
- defaults
dependencies:
- jupyter
- networkx
- numpy
- scikit-learn>=0.19.1
- scipy>=1.0.0
- seaborn>=0.9.0
- pip>=10.0
- plotly
- python>=3.6
- requests
- pip:
  - filelock
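When the environment.yml is extended later on, the existing environment can be brought in sync instead of being recreated; a minimal sketch (the --prune flag removes packages that are no longer listed in the file):

conda env update -f environment.yml --prune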
Setting up a project environment
Shorthand code for your bash: PROJ='projecttitle' && mkdir $PROJ && cd $PROJ && git init && poetry init -n --name $PROJ && poetry install
A purely python-based project does not require anything more than that.
You can add packages, lock versions and manage your virtual environment completely with poetry.
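A few typical follow-up commands, as a sketch (the package names are arbitrary examples):

# add a runtime dependency with a loose version constraint
poetry add requests
# add a development-only dependency, e.g. for testing
poetry add --dev pytest
# re-resolve and pin all versions into poetry.lock
poetry lock
# run a tool inside the project's virtual environment
poetry run pytest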
Setting up GitLab CI with poetry and pytest
It took me some time to get continuous integration with pytest working.
My result is inspired by a .gitlab-ci.yml of github.com/pawamoy.
To avoid unnecessary commits to your repository, you can install gitlab-runner locally and test your configuration with e.g. gitlab-runner exec docker test-python3.8 (where the last argument is the job name from your .gitlab-ci.yml).
Install gitlab-runner in version 12.3 or higher:
# For Debian/Ubuntu/Mint
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | sudo bash
apt-get update
apt-get install gitlab-runner

# For RHEL/CentOS/Fedora
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.rpm.sh | sudo bash
yum install gitlab-runner
$ gitlab-runner -v
Version: 12.3.0
To reproduce the poetry virtual environment, you have to make sure that your base image has python and pip available. This is why the base images here are based on e.g. python:3.6; otherwise you would need to add package installations for python3-dev and python3-pip:
script:
  - apt update -qy
  - apt install -y python3-dev python3-pip
Then you can install poetry e.g. from pip: pip install poetry (or use the officially recommended installer: curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python).
Once poetry is installed as a global command within your image, you can use it to install the virtual environment for your particular project.
To make packages such as pytest available, I found two tricks helpful:
- installing the virtual environment within the project folder, which can be done via the poetry config option poetry config virtualenvs.in-project true
- calling pytest through poetry, which resolves the packages within the virtual environment: poetry run pytest tests/
Templating in the gitlab-ci configuration furthermore helps to test different base images without duplicating configuration blocks. Also note that custom stage names have been defined (the default ones are something like test and deploy).
The full content of my .gitlab-ci.yml:
cache:
  key: "project-${CI_JOB_NAME}"
  paths:
    - .cache/pip
    - .venv

stages:
  - stage-quality
  - stage-tests

.install-deps-template: &install-deps
  before_script:
    - pip install poetry
    - poetry --version
    - poetry config virtualenvs.in-project true
    - poetry install -vv

.test-template: &test
  <<: *install-deps
  stage: stage-tests
  script:
    - poetry run pytest tests/

# Test Jobs
test-python3.6:
  <<: *test
  image: python:3.6

test-python3.7:
  <<: *test
  image: python:3.7

test-python3.8:
  <<: *test
  image: python:3.8