==================================
In-Depth Guide: Data Preparation
==================================

This guide provides a detailed look at preparing your data for use with ``detectree2``, 
covering everything from basic tiling to advanced options and multi-class data handling.

------------------------------------------
Core Tiling Concepts (RGB & Multispectral)
------------------------------------------

An example of the recommended file structure when training a new model is as follows:

.. code-block:: text

   ├── Danum                                       (site directory)
   │   ├── rgb
   │   │   └── Dan_2014_RGB_project_to_CHM.tif     (RGB orthomosaic in local UTM CRS)
   │   └── crowns
   │       └── Danum.gpkg                          (Crown polygons readable by geopandas e.g. Geopackage, shapefile)
   │ 
   └── Paracou                                     (site directory)
       ├── rgb                                     
       │   ├── Paracou_RGB_2016_10cm.tif           (RGB orthomosaic in local UTM CRS)
       │   └── Paracou_RGB_2019.tif                (RGB orthomosaic in local UTM CRS)
       ├── ms
       │   └── Paracou_MS_2016.tif                 (Multispectral orthomosaic in local UTM CRS)
       └── crowns
           └── UpdatedCrowns8.gpkg                 (Crown polygons readable by geopandas e.g. Geopackage, shapefile)

Here we have two sites available to train on (Danum and Paracou). Only a single site directory is required, but
several can be included in the training and testing phases. If available, several RGB orthomosaics can be
included in a single site directory (see e.g. ``Paracou/rgb``).

For Paracou, we also have a multispectral scan available (5 bands). For this data, the ``mode`` parameter of the
``tile_data`` function should be set to ``"ms"``. This calls a different tiling routine that retains the
``.tif`` format instead of converting to ``.png`` as in ``"rgb"`` mode. This makes later steps slightly slower
but is necessary to retain all the multispectral information.

.. code-block:: python
   
   from detectree2.preprocessing.tiling import tile_data
   import geopandas as gpd
   import rasterio

   # Set up input paths (these follow the file structure shown above)
   site_path = "/path/to/data/Paracou"
   img_path = site_path + "/rgb/Paracou_RGB_2016_10cm.tif"
   crown_path = site_path + "/crowns/UpdatedCrowns8.gpkg"

   # Read the crowns and reproject them to the CRS of the orthomosaic
   data = rasterio.open(img_path)
   crowns = gpd.read_file(crown_path)
   crowns = crowns.to_crs(data.crs)

   # Set tiling parameters
   buffer = 30
   tile_width = 40
   tile_height = 40
   threshold = 0.6
   out_dir = site_path + "/tiles/"

   tile_data(img_path, out_dir, buffer, tile_width, tile_height, crowns, threshold, mode="rgb")
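
As a rough illustration of how ``buffer``, ``tile_width`` and ``tile_height`` interact, the sketch below lays out a
plain grid of tiles over a bounding box. It assumes that each tile image covers its core area plus ``buffer`` on
every side and that all values are in CRS units (metres for a UTM CRS). The ``grid_tiles`` helper is hypothetical
and is not part of ``detectree2``.

.. code-block:: python

   def grid_tiles(bounds, tile_width, tile_height, buffer):
       """Yield (core, buffered) bounding boxes as (minx, miny, maxx, maxy) tuples."""
       minx, miny, maxx, maxy = bounds
       y = miny
       while y < maxy:
           x = minx
           while x < maxx:
               core = (x, y, x + tile_width, y + tile_height)
               # Assumption: the buffer is added on all four sides of the core area
               buffered = (x - buffer, y - buffer,
                           x + tile_width + buffer, y + tile_height + buffer)
               yield core, buffered
               x += tile_width
           y += tile_height

   # A 120 m x 80 m extent with the parameters used above
   tiles = list(grid_tiles((0, 0, 120, 80), tile_width=40, tile_height=40, buffer=30))
   core, buffered = tiles[0]
   print(len(tiles))                  # 6 tiles (3 columns x 2 rows)
   print(buffered[2] - buffered[0])   # 100: each tile image is 100 m across

Under these assumptions, a 30 m buffer around 40 m tiles means that neighbouring tile images overlap substantially,
which is worth keeping in mind when separating training and test data.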

.. warning::
   If tiles are written out as blank images, set ``dtype_bool=True`` in the ``tile_data`` call. This is a known
   bug that we are working to fix.

.. note::
   You will want to relax (i.e. lower) the ``threshold`` value if your trees are sparsely distributed across your
   landscape or if you want to include non-forest areas (e.g. rivers, roads). Remember, ``detectree2`` was initially
   designed for dense, closed canopy forests, so some of the default assumptions reflect that and parameters will
   need to be adjusted for different systems.
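
The effect of ``threshold`` can be pictured with a small ``shapely`` sketch. It assumes that ``threshold`` is the
minimum fraction of a tile's area that must be covered by crowns for the tile to be kept. The ``crown_coverage``
helper and the rectangular "crowns" are purely illustrative and are not part of ``detectree2``.

.. code-block:: python

   from shapely.geometry import box
   from shapely.ops import unary_union

   def crown_coverage(tile, crowns):
       """Return the fraction of the tile area covered by the union of the crowns."""
       covered = unary_union(crowns).intersection(tile)
       return covered.area / tile.area

   tile = box(0, 0, 40, 40)              # 40 m x 40 m tile
   crowns = [box(0, 0, 20, 20),          # two overlapping toy "crowns"
             box(10, 10, 30, 30)]

   coverage = crown_coverage(tile, crowns)
   print(round(coverage, 4))             # 0.4375
   print(coverage >= 0.6)                # False: rejected with threshold = 0.6
   print(coverage >= 0.4)                # True: kept with a relaxed threshold of 0.4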

------------------------
Advanced Tiling Options
------------------------

The ``tile_data`` function exposes many parameters to control how tiles are created. Here are some of the most useful ones in more detail:

- ``tile_placement``: Choose how tile origins are generated.

  - ``"grid"`` (default): Lays tiles on a fixed grid across the image bounds. Fast and predictable.
  - ``"adaptive"``: A more efficient method for training. It works by first creating a single polygon that is the union of all your training crowns, then intelligently places tiles only in rows that intersect this polygon. This avoids creating empty tiles in areas where you have no training data. Requires supplying ``crowns``; if ``crowns`` is ``None``, it falls back to ``"grid"`` with a warning.

- ``overlapping_tiles``: When ``True``, adds a second set of tiles shifted by half a tile's width and height, creating a "checkerboard" pattern. This is useful for ensuring crowns that fall on a tile boundary are fully captured in at least one tile and can help reduce prediction artifacts at tile edges.

- ``ignore_bands_indices``: Zero-based indices of bands to skip (multispectral only). These bands are ignored both when computing image statistics and when writing the output tiles. For example, to exclude band 0 and band 4 in a 5-band raster, pass ``ignore_bands_indices=[0, 4]``.

- ``nan_threshold``: The maximum proportion of a tile that can be NaN (or other no-data values) before it is discarded.

- ``use_convex_mask``: When ``True``, this creates a tight "wrapper" polygon (a convex hull) around all the training crowns within a tile. Any pixels outside this wrapper are masked out. This is a way to reduce noise by forcing the model to ignore parts of the tile that are far from any labeled object.

- ``enhance_rgb_contrast``: When ``True`` (RGB images only), applies a percentile contrast stretch. The 0.2 and 99.8 percentile pixel values are calculated and the image is rescaled to the 1-255 range, with 0 reserved for masked-out areas. This is effective for normalizing hazy, dark, or washed-out imagery and makes it easier for the model to differentiate between tree crowns.

- ``additional_nodata``: Provide a list of pixel values that should be treated as "no data". This is a data cleaning tool for real-world rasters that may have multiple invalid or uncommon values (e.g., -9999, 0, 65535) from sensor errors or previous processing steps.

- ``mask_path``: Path to a vector file (e.g., a GeoPackage) that defines your area of interest. If provided, no tiles will be created outside of this area.

- ``multithreaded``: When ``True``, uses multiple CPU cores to process tiles in parallel, significantly speeding up the tiling process for large orthomosaics. Currently, this can cost a linear amount of added memory.
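
To make the percentile stretch described under ``enhance_rgb_contrast`` more concrete, here is a minimal ``numpy``
sketch for a single band. It is not the code used by ``tile_data``; the ``percentile_stretch`` helper, its
arguments and the synthetic input are assumptions made for illustration.

.. code-block:: python

   import numpy as np

   def percentile_stretch(band, mask=None, low=0.2, high=99.8):
       """Rescale one band to 1-255, keeping 0 for masked-out pixels."""
       band = band.astype(np.float64)
       valid = np.ones(band.shape, dtype=bool) if mask is None else ~mask
       lo, hi = np.percentile(band[valid], [low, high])
       if hi <= lo:
           # A flat band carries no contrast; map every valid pixel to 1
           scaled = np.ones(band.shape)
       else:
           # Clip to the percentile range, then map it linearly onto 1-255
           scaled = 1.0 + (np.clip(band, lo, hi) - lo) / (hi - lo) * 254.0
       out = np.rint(scaled).astype(np.uint8)
       out[~valid] = 0                                  # 0 marks masked-out pixels
       return out

   band = np.arange(10000, dtype=np.float64).reshape(100, 100)   # synthetic band
   mask = np.zeros(band.shape, dtype=bool)
   mask[0, 0] = True                                    # pretend one pixel is masked out
   stretched = percentile_stretch(band, mask)
   print(stretched.dtype, stretched[0, 0], stretched[~mask].min(), stretched.max())
   # uint8 0 1 255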


----------------------------------
Practical Recipes for Tiling
----------------------------------

**Recipe 1: Batch Tiling from Multiple Orthomosaics**

To create a larger, more diverse training dataset, you can tile data from several orthomosaics at once and combine them into a single output directory. This can be done by iterating through your data sources in Python.

.. code-block:: python

   from detectree2.preprocessing.tiling import tile_data
   import geopandas as gpd
   import rasterio

   sites = [
       {
           "img_path": "/path/to/data/SiteA/ortho.tif",
           "crown_path": "/path/to/data/SiteA/crowns.gpkg",
       },
       {
           "img_path": "/path/to/data/SiteB/ortho.tif",
           "crown_path": "/path/to/data/SiteB/crowns.gpkg",
       },
   ]

   output_dir = "/path/to/my-combined-training-data/"

   for site in sites:
       # Read crowns and ensure CRS matches the raster
       with rasterio.open(site["img_path"]) as raster:
           crowns = gpd.read_file(site["crown_path"])
           crowns = crowns.to_crs(raster.crs)
           tile_data(
               img_path=site["img_path"],
               out_dir=output_dir,
               crowns=crowns,
               tile_placement="adaptive",
               mode="ms",
               # other parameters...
               buffer=30,
               tile_width=40,
               tile_height=40,
               threshold=0.6,
           )
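
If your data follow the site directory layout shown at the top of this guide, the ``sites`` list does not have to
be written by hand. The sketch below is one way to build it; ``find_sites`` is a hypothetical helper (not part of
``detectree2``) and it assumes that each site keeps its orthomosaics as ``.tif`` files in one sub-directory and a
single ``.gpkg`` crown file in ``crowns``.

.. code-block:: python

   from pathlib import Path

   def find_sites(data_root, image_dir="rgb"):
       """Pair every orthomosaic in <site>/<image_dir> with the first crown file in <site>/crowns."""
       sites = []
       for site_dir in sorted(Path(data_root).iterdir()):
           if not site_dir.is_dir():
               continue
           crown_files = sorted((site_dir / "crowns").glob("*.gpkg"))
           if not crown_files:
               continue  # skip sites without crown polygons
           for img in sorted((site_dir / image_dir).glob("*.tif")):
               sites.append({"img_path": str(img), "crown_path": str(crown_files[0])})
       return sites

Calling ``find_sites("/path/to/data")`` would then return a list of dictionaries in the same format as ``sites``
above, with one entry per orthomosaic.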

**Recipe 2: Tiling Noisy Multispectral Rasters**

This recipe is ideal for large, real-world multispectral datasets that may contain various "no data" artifacts.

.. code-block:: python

   from detectree2.preprocessing.tiling import tile_data
   import geopandas as gpd
   import rasterio

   img_path = "/path/to/your/large_ms_ortho.tif"
   crown_path = "/path/to/your/crowns.gpkg"
   output_dir = "/path/to/ms_tiles"

   # Read crowns and ensure CRS matches the raster
   with rasterio.open(img_path) as raster:
       crowns = gpd.read_file(crown_path)
       crowns = crowns.to_crs(raster.crs)

       tile_data(
           img_path=img_path,
           out_dir=output_dir,
           crowns=crowns,
           mode="ms",
           tile_placement="adaptive",
           additional_nodata=[-10000, -20000],
           tile_width=80,
           tile_height=80,
           buffer=10,
           threshold=0.6,
           # other parameters...
       )
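
The interplay between ``additional_nodata`` and ``nan_threshold`` can be sketched for a single band with ``numpy``.
This is an illustration under stated assumptions rather than the library's implementation: ``invalid_fraction`` is a
hypothetical helper, and the ``nan_threshold`` value is chosen only for the example.

.. code-block:: python

   import numpy as np

   def invalid_fraction(band, additional_nodata=()):
       """Fraction of pixels that are NaN or equal to one of the extra no-data values."""
       band = np.asarray(band, dtype=np.float64)
       invalid = np.isnan(band) | np.isin(band, list(additional_nodata))
       return float(invalid.mean())

   band = np.array([[0.12, -10000.0],
                    [np.nan, 0.34]])
   frac = invalid_fraction(band, additional_nodata=[-10000, -20000])
   nan_threshold = 0.4                      # illustrative value
   print(frac)                              # 0.5
   print(frac > nan_threshold)              # True: this tile would be discarded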

-----------------------------
Handling Multi-Class Data
-----------------------------

For multi-class problems (e.g., species or disease mapping), you need to provide a class label for each crown polygon. 

First, ensure your crowns GeoDataFrame has a column specifying the class for each polygon.

.. code-block:: python

    import geopandas as gpd

    crown_path = "/path/to/crowns/Danum_lianas_full2017.gpkg"
    crowns = gpd.read_file(crown_path)
    
    # The 'status' column here indicates the class of each crown
    print(crowns.head())
    class_column = 'status'

Next, use the ``record_classes`` function to create a class mapping file. This JSON file stores the relationship between class names and their integer indices, which is crucial for training.

.. code-block:: python

    from detectree2.preprocessing.tiling import record_classes

    out_dir = "/path/to/tiles/"
    record_classes(
        crowns=crowns,          # Geopandas dataframe with crowns
        out_dir=out_dir,        # Output directory to save class mapping
        column=class_column,    # Column to be used for classes
        save_format='json'      # Choose between 'json' or 'pickle'
    )
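
Conceptually, the mapping is a dictionary from class name to integer index. The sketch below shows the idea in plain
Python; it is not the implementation of ``record_classes``, and both the ordering of the indices (order of first
appearance) and the example labels are assumptions.

.. code-block:: python

   import json
   import tempfile
   from pathlib import Path

   def build_class_mapping(labels):
       """Map each distinct label to an integer index, in order of first appearance."""
       mapping = {}
       for label in labels:
           if label not in mapping:
               mapping[label] = len(mapping)
       return mapping

   labels = ["tree", "liana", "tree", "dead", "liana"]   # stand-in for crowns[class_column]
   class_to_idx = build_class_mapping(labels)

   # Write the mapping to a temporary directory and read it back
   out_file = Path(tempfile.mkdtemp()) / "class_to_idx.json"
   out_file.write_text(json.dumps(class_to_idx))
   print(json.loads(out_file.read_text()))   # {'tree': 0, 'liana': 1, 'dead': 2}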

This creates a ``class_to_idx.json`` in your output directory. When you tile the data, provide the ``class_column`` argument to embed this class information into the training tiles.

.. code-block:: python

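    from detectree2.preprocessing.tiling import tile_data

    # Placeholder path to the orthomosaic that the crowns were delineated on
    img_path = "/path/to/orthomosaic.tif"

    # Tile the data with class information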
    tile_data(
        img_path=img_path,
        out_dir=out_dir,
        crowns=crowns,
        class_column=class_column, # Specify the column with class labels
        # ... other parameters
        buffer=30,
        tile_width=40,
        tile_height=40,
        threshold=0.6,
    )

----------------------------------
Utilities for Tiled Data
----------------------------------

**Converting Multispectral Tiles to RGB**

If you have multispectral (MS) tiles but want to use them with an RGB-trained model or simply visualize them easily, you can use the ``create_RGB_from_MS`` utility. This function converts a folder of MS tiles into a new folder of 3-band RGB tiles.

.. note::
  In addition to converting the images, this utility copies all ``.geojson`` annotation files and the ``train/test`` folder structure, and updates the image paths inside the ``.geojson`` files to point to the new RGB ``.png`` files.

The function offers two conversion methods:

- ``conversion="pca"``: Performs a Principal Component Analysis to find the 3 most important components and maps them to R, G, and B. This is great for visualization.
- ``conversion="first-three"``: Simply takes the first three bands of the MS image.

Here is how you would use it in Python:

.. code-block:: python

    from detectree2.preprocessing.tiling import create_RGB_from_MS

    # Path to the folder containing your multispectral .tif tiles
    ms_tile_folder = "/path/to/ms_tiles/"

    # Path for the new RGB tiles
    rgb_output_folder = "/path/to/rgb_tiles_from_ms/"

    # Convert the tiles using PCA
    create_RGB_from_MS(
        tile_folder_path=ms_tile_folder,
        out_dir=rgb_output_folder,
        conversion="pca"
    )
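
To give a feel for what a PCA-based conversion involves, here is a minimal ``numpy`` sketch. It is not the code used
by ``create_RGB_from_MS``; the ``pca_to_rgb`` helper, the min-max rescaling of each component and the random input
are assumptions made for illustration.

.. code-block:: python

   import numpy as np

   def pca_to_rgb(ms):
       """Project an (H, W, B) array with B >= 3 onto its top 3 principal components."""
       h, w, b = ms.shape
       flat = ms.reshape(-1, b).astype(np.float64)
       flat -= flat.mean(axis=0)                          # centre each band
       cov = np.cov(flat, rowvar=False)                   # B x B band covariance
       eigvals, eigvecs = np.linalg.eigh(cov)
       top3 = eigvecs[:, np.argsort(eigvals)[::-1][:3]]   # largest variance first
       comps = flat @ top3                                # (H*W, 3) component scores
       lo, hi = comps.min(axis=0), comps.max(axis=0)
       span = np.where(hi > lo, hi - lo, 1.0)             # avoid division by zero
       scaled = (comps - lo) / span * 255.0               # min-max rescale to 0-255
       return np.rint(scaled).reshape(h, w, 3).astype(np.uint8)

   rng = np.random.default_rng(0)
   ms_tile = rng.random((8, 8, 5))                        # synthetic 5-band tile
   rgb_tile = pca_to_rgb(ms_tile)
   print(rgb_tile.shape, rgb_tile.dtype)                  # (8, 8, 3) uint8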

**Splitting Data into Train/Test/Validation Folds**

After tiling, distribute the ``.geojson`` files into a train folder (with sub-folders for k-fold cross-validation) and a test folder.

.. code-block:: python
   
   from detectree2.preprocessing.tiling import to_traintest_folders

   data_folder = "/path/to/tiles/"
   to_traintest_folders(data_folder, data_folder, test_frac=0.15, strict=False, folds=5)

.. note::
   If ``strict=True``, the ``to_traintest_folders`` function will automatically remove training/validation geojsons
   that have any overlap with test tiles (including the buffers), ensuring strict spatial separation of the test data.
   However, this can remove a significant proportion of the data available to train on. If validation accuracy is a 
   sufficient test of model performance, you can either not create a test set (``test_frac=0``) or allow for 
   overlap in the buffers between test and train/val tiles (``strict=False``).
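
The basic idea of a test split followed by k folds can be sketched in plain Python. This is only an illustration of
the non-strict case: ``split_tiles`` is a hypothetical helper, it ignores spatial overlap entirely, and it is not
how ``to_traintest_folders`` is implemented.

.. code-block:: python

   import random

   def split_tiles(names, test_frac=0.15, folds=5, seed=0):
       """Randomly hold out a test set, then deal the remaining tiles into folds."""
       names = sorted(names)
       random.Random(seed).shuffle(names)                 # reproducible shuffle
       n_test = round(len(names) * test_frac)
       test, train = names[:n_test], names[n_test:]
       fold_lists = [train[i::folds] for i in range(folds)]
       return test, fold_lists

   names = [f"tile_{i}.geojson" for i in range(100)]      # stand-in file names
   test, fold_lists = split_tiles(names)
   print(len(test), [len(f) for f in fold_lists])         # 15 [17, 17, 17, 17, 17]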

----------------------------------
Visually Inspecting Your Tiles
----------------------------------

It is recommended to visually inspect the tiles before training to ensure that the tiling has worked as expected and
that crowns and images align. This can be done with the inbuilt ``detectron2`` visualisation tools. For RGB tiles
(``.png``), the following code can be used to visualise the training data.

.. code-block:: python
   
   from detectron2.data import MetadataCatalog
   from detectron2.utils.visualizer import Visualizer
   from detectree2.models.train import combine_dicts
   from IPython.display import display
   from PIL import Image
   import cv2

   name = "Danum"
   # Suffix of the tiles directory; set this to match the name of your own tiles folder
   appends = "40_30_0.6"  # illustrative value only
   train_location = "/content/drive/Shareddrives/detectree2/data/" + name + "/tiles_" + appends + "/train"
   dataset_dicts = combine_dicts(train_location, 1) # The number gives the fold to visualise
   trees_metadata = MetadataCatalog.get(name + "_train")

   for d in dataset_dicts:
      img = cv2.imread(d["file_name"])
      visualizer = Visualizer(img[:, :, ::-1], metadata=trees_metadata, scale=0.3)
      out = visualizer.draw_dataset_dict(d)
      image = cv2.cvtColor(out.get_image()[:, :, ::-1], cv2.COLOR_BGR2RGB)
      display(Image.fromarray(image))


.. image:: ../../../report/figures/trees_train1.png 
   :width: 400
   :alt: Training tile 1
   :align: center

|

.. image:: ../../../report/figures/trees_train2.png
   :width: 400
   :alt: Training tile 2
   :align: center


|

Alternatively, with some adaptation the ``detectron2`` visualisation tools can also be used to visualise the
multispectral (``.tif``) tiles.

.. code-block:: python
   
   import rasterio
   from detectron2.utils.visualizer import Visualizer
   from detectree2.models.train import combine_dicts
   from detectron2.data import DatasetCatalog, MetadataCatalog
   from PIL import Image
   import numpy as np
   import cv2
   import matplotlib.pyplot as plt
   from IPython.display import display

   val_fold = 1
   name = "Paracou"
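   # Suffix of the tiles directory; set this to match the name of your own tiles folder
   appends = "40_30_0.6"  # illustrative value only
   tiles = "/tilesMS_" + appends + "/train"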
   train_location = "/content/drive/MyDrive/WORK/detectree2/data/" + name + tiles
   dataset_dicts = combine_dicts(train_location, val_fold)
   trees_metadata = MetadataCatalog.get(name + "_train")

   # Normalize the first three bands to 0-255 so they can be displayed as an RGB image
   def prepare_image_for_visualization(image):
      # For images with more than 3 bands, keep the first 3 (or select specific bands)
      image = image[:, :, :3]
      image = np.stack([
            cv2.normalize(image[:, :, i], None, 0, 255, cv2.NORM_MINMAX)
            for i in range(3)
      ], axis=-1).astype(np.uint8)

      return image

   # Visualize each image in the dataset
   for d in dataset_dicts:
      with rasterio.open(d["file_name"]) as src:
         img = src.read()  # Read all bands
         img = np.transpose(img, (1, 2, 0))  # Convert to HWC format
         img = prepare_image_for_visualization(img)  # Normalize and prepare for visualization

      # The image is already scaled to 0-255 (uint8), so it is passed on without further scaling
      visualizer = Visualizer(img[:, :, ::-1], metadata=trees_metadata, scale=0.5)
      out = visualizer.draw_dataset_dict(d)
      image = out.get_image()[:, :, ::-1]
      display(Image.fromarray(image))