3. Contributing guide

3.1. Contributing to Detectree2

Thanks for taking the time to contribute to detectree2!

3.1.1. Code of Conduct

All contributors are expected to abide by the Code of Conduct

3.1.2. Getting Started - Issues

The “to do” task list for detectree2 starts at the Issues page in Github.

Note: Before submitting a new issue, you should search the existing issues to make sure that it hasn’t already been reported.

3.1.3. Filing a bug report

Bug reports are important in helping identify things that aren’t working in detectree2. Filing a bug report is straightforward: Head to the Issues page, click on the New issue button, and select the Bug Report template to get started.

3.1.4. Detectree2 Github tips

There are many excellent resources for learning Git. See Git/Github, for a brief overview of Git and Github. For tips relating to detectree2 see detectree2 tips.

3.1.5. Tips for creating a detectree2 PR

This guide assumes that most developers do not have permission to contribute to the repo directly and therefore need to fork the project. For users with privileges to contribute to detectree2 directly the concepts are the same, but a PR can be initiated from a feature branch within the detectree2 project directly.

First consult Git/Github or any of the above resources if you are unclear on how to create a good PR. See issue/6 for some recommendatations on best practices when forking a project.

Once a PR has been created, it is good practice to keep the PR’s feature branch up-to-date with upstream master during the PR submission and review process. This can be done simply with the Sync fork link on github which can synchronise upstream master to any remote branch, and then pull the changes using the command line to continue developing locally. This is the simplest way to keep the PR branch updated.

It’s also possible to do all this locally using Git. I.e. In your local clone do the following:

git remote add upstream https://github.com/PatBall1/detectree2
git fetch upstream
git checkout master # if not already there
git rebase upstream/master # or git merge upstream/master
git checkout <feature-branch>
git rebase master # or git merge master
# now push to your remote repo
git push --force-with-lease origin <feature-branch>

This process can be abbreviated to a couple of commands for advanced Git users. We can modify commits using git rebase -i (interactive rebase) to clean up the commit history. We can squash commits, remove unnecessary commits or edit commits.

TIP: The process of rebasing on master may need to be done multiple times during a PR.

At the end of the PR we can use GitHub’s UI to commit. The available options are explained here: Pull request merges.

3.1.5.1. Using GitHub’s UI to commit PR

Whether to squash commits or not is explained well in this article: https://blog.mergify.com/what-is-the-difference-between-a-merge-commit-a-squash/

The github UI will give the squash and merge option when committing a PR to master, but one should proceed with caution. It squashes all of the commits down to one commit in the base branch with an option to edit a commit summary (please modify the commit summary from the default one provided with a more concise message of the PR’s contributions). One could argue that this leads to a linear history (i.e. doesn’t contain merges) but often it can result in large commits that are difficult to read, so if following this approach try to ensure that PRs have a single focus. Squashing also loses valuable information from individual commits.

Alternatively, you can select the rebase and merge option - in this case all commits from the head branch are added onto the base branch individually without a merge commit. If you have conflicts and you still wish to rebase and merge, these need to be resolved locally using the command line as described here and the next section. My advice is to ensure that PRs have a single focus and to reduce unnecessary style commits with pre-commit hooks. git rebase -i is an extremely useful command that every developer should be aware of to neaten history when ready to merge.

Finally, GitHub provides the option to Create a merge commit which simply merges branch into main. I would advise against this as it creates complicated git histories that are difficult to read.

For Detectree2, it’s up to you which approach to take. Given that the project is quite small and there are few contributors I’d suggest using pre-commit hooks and interactive rebases to improve commit quality in favour of squashing. Squashing is also generally fine (and the easier approach of the two), but try to avoid doing it all the time.

3.1.5.2. Using command line to rebase

If you want to rebase the commits but are unable to rebase and merge automatically on GitHub.com you must:

  • Rebase the PR branch onto master locally on the command line

  • Resolve any merge conflicts on the command line

  • Force push to the PR branch

There is plenty of useful information on this online in the official docs and this stackoverflow post. This makes the GitHub UI process trivial and one can select rebase and merge. Rebasing is more involved than merging but leads to a linear history.

TIP: If following this approach, make sure to use pre-commit hooks to improve quality of individual commits.

3.1.5.3. General tips

TIP: Try to avoid merging the PR to a dev branch. This is considered bad practice since when it comes to merge to master the eventual PR can be large and difficult to understand. PRs should have a single focus.

TIP: Delete the branch after merging to master.

TIP: Always nominate a collaborator to review the PR before merging.

TIP: Do not merge unless all tests are passing.

Todo

  • Many of the recommendations above can be made default in Github’s settings:

    • Prohibit commits direct to master.

    • Automatically squash on merge.

    • Prevent merge unless all tests are passing.

    • Only allow a merge if approved by assigned Reviewers.

  • Change master to main. GitHub provides a step-by-step walkthrough.

3.2. Detectree2 setup for developers

For projects with many contributors it is good practice to adhere to a programming style and testing framework. The programming style is enforced with a combination of pre-commit hooks and CI checks. Settings for detectree2’s programming style components are given in setup.cfg, in the project root.

We adopt GitHub actions to deploy software development workflows detectree2/actions, workflows are steered automatically using github actions in .github/workflows directory, a good example is: python-app.yaml. The code checks are triggered automatically on pushing to a branch. The workflows detail the required dependencies for developing and testing detectree2, and should be consulted if anything in the following section is unclear.

For reference, the relevant detectree2 subprojects are:

Todo

  • Publish model on model_zoo

3.2.1. Setting up development environment

Using conda or pip (below we show pip):

pip install flake8 flake8-docstrings mypy autopep8 isort

Todo

  • Create dev-environment.yaml.

3.2.2. Programming style

Detectree2 currently utilises the following tools to check code style and consistency.

  • yapf: Autoformatter ensures consistent formatting of Python files.

  • flake8: Multiple checks for - linting - syntax errors or anti-patterns - (lack of) executable flags on files - docstring validation - function complexity.

  • mypy: Validate Python type hints.

  • isort: Checks that imports are correctly sorted.

A number of other style choices have been enforced across the project:

  • Line length = 120 characters

  • Google style docstrings

  • Indent width = 4 spaces per tab.

This style is enforced both locally (using pre-commit hooks) and remotely (using github workflows). It is possible to use remote checks only, but it is often faster to check locally first than to wait for the github workflow to execute.

3.2.2.1. Pre-commit hooks

Pre-commit hooks ensure that each commit passes a minimum set of style checks. To set up pre-commit hooks do:

pip install pre-commit
pre-commit install
pre-commit run --all-files # it is a good idea to do this over all files not just the files that have changed

# or run on the files that have changed:

git add -u  # e.g
git commit -m "your message"

If tests do not pass, correct errors and then do git add -u again so that changes are staged.

If it is desirable to avoid pre-commit hooks:

git commit -m "your message" --no-verify

With checks configured in the .pre-commit-config.yaml file in the project root. Note that a commit will not be made unless the tests pass. This generally has the effect of improving the quality of individual commits without needing to rely too much on server side checks for code quality.

As an alternative to running pre-commit hooks, one can still run the checks manually but the programmer must be careful that all checks pass. If everything is setup correctly, the CI should not permit a commit to master unless all tests are successful. To see up-to-date commands, consult the relevant workflow. Note that different versions of python (+packages) may give different errors to the CI, so correcting errors may take a few attempts. There may also be discrepancies between the client pre-commit hooks and server CI checks. It is best to update the pre-commit hooks if possible in this case.

Todo

  • Prevent commits unless all tests pass.

  • Convert setup.cfg to pyproject.toml (if using black - does not support setup.cfg)

  • Consider using style-guide instead of black (i.e. autopep8) - black ensures consistency at the detriment of readability.

  • Add dev-requirements file. flake8, flake8-docstrings, mypy, black, isort.

  • Function arguments on individual lines may be preferred to make diffs slightly clearer. But I recommend writing a comprehensive style-guide (by extending the above) rather than using a strict autoformatter like black.

3.2.3. Linting

Flake8 includes linting, syntax errors, and McCabe function complexity analysis.

The are several instances where Flake8 errors have been purposely ignored in Detectree2 using noqa: <CODE> annotations to allow flake8 CI to pass. This is not a permanent fix and the errors should eventually be addressed. For example: noqa: E501 ensures that line lengths beyond (120 characters) are ignored by the linter and noqa: 901 ignores the McCabe complexity measure.

These can also be set globally in setup.cfg, but fewer the better. It is also possible to set continue-on-error in the flake8 workflow or --exit-zero flake8 argument to allow other checks to continue. In practice it was found that developers tend to ignore flake8 errors as a result of these two options, so the noqa solution is preferred.

McCabe function complexity analysis is useful for detecting over-complex code (as determined by the amount of branching - if, else statements). A value of 10 is set as default. This is perhaps overkill and may be removed.

3.2.4. Docstrings

We adopt google docstrings (https://google.github.io/styleguide/pyguide.html)

Other dependencies include flake8-docstrings,

Todo

  • Remove pydocstyle

3.2.5. Autoformatters

We adopt yapf for this project, but others are listed for completion.

3.2.5.1. YAPF

From the YAPF docs:

Most of the current formatters for Python — e.g., autopep8, and pep8ify — are made to remove lint errors from code. This has some obvious limitations. For instance, code that conforms to the PEP 8 guidelines may not be reformatted. But it doesn’t mean that the code looks good.

YAPF is highly customisable and shares a similar philosophy to black. It is possible to customise behaviour of any autoformatter like autopep8 or black with project modifications.

3.2.5.2. Black

From the Black docs:

Black is the uncompromising Python code formatter. By using it, you agree to cede control over minutiae of hand-formatting. In return, Black gives you speed, determinism, and freedom from pycodestyle nagging about formatting. You will save time and mental energy for more important matters.

It favours consistency, meaning it is guaranteed to give the same results across the team - a style guide is not needed.

3.2.5.3. Autopep8

Autopep8 is an autoformatter (like Black) with enforces the PEP8 style guide. Autopep8 is a loose formatter, which will fix PEP8 errors but will not make the code uniform. It relies a little more on the programmer, whereas black, which also produces PEP8 compatible code, is more opinionated in its approach.:

pip install --upgrade autopep8 # if not already installed
autopep8 --in-place --aggressive --aggressive <filename>

It is possible to configure vscode to autoformat with autopep8 on save if desired.

Todo

  • Consider configuring YAPF with pep8 settings to create unformity for project contributors.

3.2.5.4. Formatting aside

The difference between the strict autoformatter yapf and the official demonstrated for function arguments with the example below. Both examples are PEP8 compliant and will pass flake8 linting checks. The former is better for diffs and typing clarity, whereas the latter has fewer lines.

# black or yapf:
def tile_data(
    data: DatasetReader,
    out_dir: str,
    buffer: int = 30,
    tile_width: int = 200,
    tile_height: int = 200,
    dtype_bool: bool = False
) -> None:


# Python's official `style`

def tile_data(data: DatasetReader, out_dir: str, buffer: int = 30, tile_width: int = 200, tile_height: int = 200,
              dtype_bool: bool = False) -> None:

3.2.6. Static typing

From the Mypy docs:

Mypy is an optional static type checker for Python that aims to combine the benefits of dynamic (or ‘duck’) typing and static typing. Mypy combines the expressive power and convenience of Python with a powerful type system and compile-time type checking.

The general idea is to add typing to functions that are most frequently used. It is not necessary to apply across the entire codebase.

The mypy syntax adopted in Detectree2 supports python3.7 and above, but could be updated as the project moves towards more modern python3 (I see no reason not to adopt python 3.10). mypy will attempt to type check all third-party libraries - which might not be desirable. It is possible to install stubs for third-party libraries (i.e. pandas, openCV) if type-checking is desired, but it is easier to suppress all missing import errors libraries by adding ignore_missing_imports = True in setup.cfg.

3.3. Continuous integration

The idea of Continuous integration (CI) is to frequently commit code to a shared repo. This has the effect of detecting errors sooner thereby reducing the amount of code a developer needs to debug when finding an error. Frequent updates also make it easier to merge changes from different members of the software development team. This is especially powerful when paired automated code building and testing. Testing can include code linters, as well as unit and integration tests.

Building and testing code requires a server. CI using GitHub actions offers workflows that can build the repository code and run tests. We can run on GitHub’s own virtual machines (using GitHub-hosted runners), or on machines that we host ourselves (or on compute clusters). The latter is desirable as GitHub does not currently support access to GPU resources.

Currently there are three files that steer workflows. The schedule is set at the top of the file. The workflows are found here

  • python-app.yml: All style CI - builds the code on Ubuntu-20.04

  • dockertest.yml: All style CI - uses docker image for dependencies and installs detectree2 using pip.

  • documentation.yml: Generates documentation and hosts on github pages. Builds code first for sphinx-apidoc.

The dockertest.yml workflow is an attempt to utilise docker to speed up deployment and testing of detectree2. It pulls the docker image: ma595/detectree-cpu-20.04:latest (Python3.8) and installs detectree2 on top. A more up to date docker container, utilising python3.10 and ubuntu 22.04 has been successfully built but has yet to be integrated into the workflow, the file can be found in github:ma595/detectree2-docker/Dockerfile-22.04.

All dockerfiles are in github:ma595/detectree2-docker, which uses github:ma595/detectree2-data to store the data required for the workflow.

Todo

  • Harmonise sphinx.yml and python-app.yml into single file where appropriate. There is no good reason to separate.

  • Add GPU testing to workflow (currently unsupported on Github, but we can use CSD3’s A100 resources).

  • Prevent merge unless all tests are passing

  • Build docker image as part of an action and push to dockerhub (or use github’s docker features)

  • Check 22.04 docker image

  • Move dockerfiles into detectree2 project.

  • Style check documentation.

3.4. Automatic Documentation

Documentation is generated automatically using Sphinx and GitHub actions in documentation.yaml.

Documentation can be generated locally to test rendering. It is better to develop locally rather than rely on the CI and hosted docs as a check, as it can take quite some time to build using the workflow.

To generate locally it is necessary to install the following dependencies (either in pip or conda):

pip install sphinx sphinx_rtd_theme

Then generate api documentation, and build the html.:

sphinx-apidoc -o ./docs/source/ detectree2/
sphinx-build -b html docs/source/ docs/build/html

Then using your favourite browser open docs/build/html/index.html. It’s often necessary to delete the build output to remove old html.

Todo

  • Style checks on documentation.

3.5. Tests

Test-driven development stipulates that tests should be written as new features are introduced to the code. To run the tests simply do:

# install Pytest if haven't already done so.
pip install pytest
# pytest should be run from the project root:
pytest .

It can sometimes be helpful to run individual tests:

pytest -rP -v detectree2/tests/test_preprocessing.py -k test_to_traintest_folders

Which runs the test_to_traintest_folders function in the test_preprocessing.py module, and captures whatever output that may be produced.

A few unit tests have been implemented but there is definitely scope to add more. test_preprocessing.py executes a few functions like tile_data_train, and to_traintest_folders, but does not test if the output is correct. pytest.mark.order is used to force dependencies but better approaches may exist. Further TODOs are listed within the code.

test_prediction.py executes some quite trivial tests. test_evaluation.py computes the area intersection over union (with dummy .geojson data containing square shapes with known areas). The test is still incomplete as some of the code in evaluation.py needs refactoring slightly.

As of August 2022, an integration test has been written which demos the tiling, and training steps. The integration test will run the training on the CPU only. It is possible to run tests on other systems using GitHub, but this will take more work.

TIP: Always write tests for newly introduced logic when contributing code.

Todo

  • Write more unit tests for existing code.

3.6. Building Detectree2

3.6.1. GDAL complexities

GDAL presents a number of complexities. The issue is covered in gdal/issue We must point to the location of the preinstalled GDAL headers, and the GDAL version must match the pip package version. https://github.com/OSGeo/gdal/issues/2293 For instance, on my cluster:

gdal-config -v  # gives 3.0.4

So this means we must install the corresponding pip version: GDAL==3.0.4.

In the event that GDAL does not exist on the system, install it as so (assuming root access):

sudo apt install libgdal-dev gdal-bin

3.6.2. Using pip

It is relatively straightforward to install detectree2 on Colab. Simply pip install and all dependencies will be installed automatically.

On other systems the process is more involved especially if root access is not available. See workflow python-app.yaml workflow for a working CPU deployment.

First we need to install pytorch, torchvision and torchaudio (compatible versions https://pypi.org/project/torchvision/).

This can be done inside virtualenv (if root access is unavailable):

python3 -m venv ./venv # (check version of python is sufficiently high >=3.7)
. venv/bin/activate
pip install --upgrade pip
pip install wheel
pip install opencv-python
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

Then point to preinstalled GDAL header files:

export CPLUS_INCLUDE_PATH=/usr/include/gdal
export C_INCLUDE_PATH=/usr/include/gdal

then:

pip install .  # (add -e flag to allow editable installs)

3.6.2.1. Fixing detectron2 version

We can fix the version of detectron2 by pointing to the pre-built wheel using pip:

python -m pip install detectron2==0.6 -f \ https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html

Or by changing the detectron2 line in setup.py (which will build the latest version from source):

detectron2@https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/detectron2-0.6%2Bcu113-cp38-cp38-linux_x86_64.whl

It may be preferable to do this as errors have a tendency to be introduced into the detectron2 codebase and may take a day or two to fix. We can also point to a specific working commit:

pip install git+https://github.com/facebookresearch/detectron2.git@5aeb252b194b93dc2879b4ac34bc51a31b5aee13

# or within setup.py (not tested):
detectron2@git+https://github.com/facebookresearch/detectron2.git@5aeb252b194b93dc2879b4ac34bc51a31b5aee13

3.6.3. Building Detectree2 using Conda

Many of the aforementioned complexities can be solved using Conda. This is currently working for python 3.9.13, in branch matt/conda. The most important file is environment.yaml which specifies the required dependencies.

Install miniconda, and source (usually ~/.miniconda/bin/activate if not in .bashrc already). Begin by installing mamba:

conda install mamba -c conda-forge
mamba env create -f envrironment.yaml
mamba activate detectree2env

Alternatively we may use a conda lock file which has transitive dependencies pinned. This improves reproducibility.:

mamba create --name detectree2env --file conda-linux-64.lock

and if we modify our environment, we can update the lock file as so:

conda-lock -k explicit --conda mamba

and then update conda packages based on the regenerated lock file:

mamba update --file conda-linux-64.lock

The downside of this approach is that it takes much longer to install compared to pip, even with Mamba’s improved dependency resolution.

Todo

  • Determine how this can be integrated into current pip install without breaking colab pip deployment.

  • Investigate use of poetry as it is easier to package a distribution. But detectron2 is not PEP517 compliant.

  • It is possible to combine Conda and Poetry, where Conda is used for packages like GDAL / detectron2 / openCV.

3.6.4. Distributing detectree2 using Conda

In the matt/conda branch, the conda/meta.yaml packages detectree2. An initial attempt can be found here: ma595/detectree2. To install, do the following:

conda install -c ma595 detectree2

To rebuild from meta.yaml:

conda-build . -c conda-forge -c pytorch --output-folder ./conda-bld

Then upload to anaconda:

anaconda login
anaconda upload <path-to-tar.bz2>

Todo

  • Automate distribution of package to anaconda using workflow.

3.7. Python development environment

Todo

  • Setting up visual studio.

  • Create dev-environment.yaml file.