Feature engineering

Reading image data#

Many applications in environmental sciences involve working with image data. Here we will briefly look at reading and manipulating image data using the Python Imaging Library (PIL, now maintained as the Pillow package).

Other useful libraries include rasterio for reading and writing raster data, and opencv and scikit-image for image processing, but we won’t cover those here.

from PIL import Image
img = Image.open("_images/advanced_supervised.jpg")
img

Handling PIL images#

PIL image objects have a number of useful methods and attributes. Here we’ll look at a few of them.

We can determine the image size, format, and mode using the size, format, and mode attributes, respectively:

img.size, img.format, img.mode

((787, 832), 'JPEG', 'RGB')

img.resize((800, 128))

img.resize((128, 128)).rotate(45) # degrees counter-clockwise

img.resize((128, 128)).convert('L')

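As a minimal sketch of the methods used above, the following builds a small synthetic image (so it does not depend on any file on disk) and checks the effect of each call. The name `img_demo` and the chosen sizes are illustrative assumptions, not part of the original example.

```python
from PIL import Image

# Synthetic stand-in image; PIL sizes are given as (width, height)
img_demo = Image.new("RGB", (200, 100), color=(255, 0, 0))

small = img_demo.resize((64, 32))        # returns a new image, the original is unchanged
rotated = small.rotate(90, expand=True)  # expand=True grows the canvas to fit the rotated image
grey = small.convert("L")                # 'L' is single-channel 8-bit greyscale

print(img_demo.size, small.size, rotated.size, grey.mode)
```

Note that `size` and `resize` both use (width, height) ordering, and that each method returns a new image rather than modifying the one it is called on.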

Handling image data#

We can also treat these images as numpy arrays. This allows us to perform operations on the image data using numpy functions:

import numpy as np
print(np.asarray(img).shape)  # (height, width, channels)

(832, 787, 3)

np.asarray(img.convert('RGBA')).shape

(832, 787, 4)
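
The cells that follow use an xarray DataArray called da_img with dimensions y, x and channel. Its construction is not shown here, so the following is only a sketch of one way to build such an object, assuming a descending y coordinate (so that row 0 is the top of the image) and channel labels 'r', 'g' and 'b'. The names `img_demo` and `da_demo` are hypothetical.

```python
import numpy as np
import xarray as xr
from PIL import Image

# Tiny synthetic RGB image standing in for the photo (width=4, height=3)
img_demo = Image.new("RGB", (4, 3), color=(10, 20, 30))

arr = np.asarray(img_demo)  # shape is (height, width, channels)
height, width, _ = arr.shape

da_demo = xr.DataArray(
    arr,
    dims=("y", "x", "channel"),
    coords={
        "y": np.arange(height, 0, -1),  # assumed: y decreases down the rows
        "x": np.arange(width),
        "channel": ["r", "g", "b"],
    },
)

# Label-based selection by channel name, and a reduction over a named dimension
green = da_demo.sel(channel="g")
spread = da_demo.std("channel")
print(da_demo.dims, green.shape, spread.shape)
```

With a descending y coordinate, label-based slices along y must also run from the larger to the smaller value.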

Once the image is wrapped in an xarray DataArray (here da_img, with dimensions y, x and channel), we can easily perform xarray operations on it. For example, we can calculate the standard deviation of the image data across the channels:

da_img.std('channel').plot.imshow()

Or pick out specific regions:

da_img.sel(x=slice(100,200), y=slice(500,300), channel='r').plot()

Handling missing data#

Notice that because the two images above have slightly different shapes, xarray has automatically aligned them on their coordinates, expanding the smaller image to the extent of the larger one. This automatic alignment is a very useful feature of xarray that makes working with image data much easier.

In doing so, it has filled the missing data with NaN values. This is a common way to represent missing data in xarray and is useful for many applications, but it can cause problems in ML pipelines, e.g. when calculating the loss function.

The solution is very much problem dependent, but some common approaches include:

Removing the missing data, by e.g. cropping the images
Resizing the data to the same shape before concatenating them
Filling the missing data with a constant value (da.fillna(0))
Filling the missing data with the mean or median value (da.fillna(da.mean()))
Interpolating the missing data along a dimension (da.interpolate_na(dim='x'))
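
Cropping and filling can be sketched on a toy example. Here two small arrays with different widths are concatenated, which produces the NaN padding described above. The array names and values are illustrative, and `join="outer"` is passed explicitly rather than relying on the default.

```python
import numpy as np
import xarray as xr

# Two toy "images" on a shared grid; the second is one column wider
a = xr.DataArray(np.array([[1.0, 2.0], [3.0, 4.0]]), dims=("y", "x"),
                 coords={"y": [0, 1], "x": [0, 1]})
b = xr.DataArray(np.array([[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]), dims=("y", "x"),
                 coords={"y": [0, 1], "x": [0, 1, 2]})

# Outer-join alignment pads the narrower array with NaN at x=2
combined = xr.concat([a, b], dim="sample", join="outer")
n_missing = int(combined.isnull().sum())

cropped = combined.dropna("x")                  # drop every column that contains a NaN
filled_const = combined.fillna(0.0)             # constant fill
filled_mean = combined.fillna(combined.mean())  # mean() skips NaN by default

print(n_missing, cropped.shape, float(combined.mean()))
```

Which option is appropriate depends on the problem: cropping discards real data from the larger image, while filling introduces values that were never observed.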

One-hot encoding#

When handling multiple classes in a classification problem, it is often useful to encode the classes as one-hot vectors. A one-hot vector has one element per class: all elements are zero except the one at the index of the sample's class, which is one.

This is in contrast to a “label” encoding where the classes are simply encoded as integers.
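
The difference between the two encodings can be sketched with plain numpy. The class names below are made up for illustration.

```python
import numpy as np

# Hypothetical class labels for four samples
classes = np.array(["forest", "water", "urban", "water"])

# Label encoding: each class becomes an integer index into the sorted unique classes
categories, labels = np.unique(classes, return_inverse=True)

# One-hot encoding: row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(categories), dtype=int)[labels]

print(categories)
print(labels)
print(one_hot)
```

Label encoding implies an ordering between classes (here alphabetical), which most models will treat as meaningful; one-hot encoding avoids this at the cost of one column per class.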

sklearn provides a host of useful tools for these kinds of tasks, which can be combined with models to form a ‘Pipeline’. These can be very powerful, but I find them quite cumbersome to work with, so I often use pandas and xarray to handle these tasks instead.

For example, we can use the pd.get_dummies function to convert a pandas series of labels to a one-hot encoded matrix:

import pandas as pd
# Sample pandas DataFrame with a 'category' column
df = pd.DataFrame({
    'data': [1, 2, 3, 1, 2, 4],
    'category': ['A', 'B', 'C', 'A', 'B', 'D']
})
df

	data	category
0	1	A
1	2	B
2	3	C
3	1	A
4	2	B
5	4	D

# Apply one-hot encoding
encoded_df = pd.get_dummies(df, columns=['category'], dtype=int)
encoded_df

	data	category_A	category_B	category_C	category_D
0	1	1	0	0	0
1	2	0	1	0	0
2	3	0	0	1	0
3	1	1	0	0	0
4	2	0	1	0	0
5	4	0	0	0	1

One-hot encoding images#

Now we convert that to a dataframe:

df=da_img2_bit.to_dataframe(name='img')
df

y	x	img
833	0	225
833	1	225
833	2	225
833	3	225
833	4	225
...	...	...
1	774	225
1	775	225
1	776	225
1	777	225
1	778	225

648907 rows × 1 columns

And then one-hot encode the data and convert it back to an xarray object:
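
A minimal sketch of this round trip, assuming a small integer-valued DataArray named `da_demo` as a stand-in for the quantised image (the image itself and the exact code used are not shown here). pd.get_dummies creates one column per distinct pixel value, and to_xarray followed by to_array stacks those columns along a new dimension, here called `level`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical quantised image with four distinct pixel values
da_demo = xr.DataArray(
    np.array([[0, 85], [170, 85], [255, 0]]),
    dims=("y", "x"),
    coords={"y": [3, 2, 1], "x": [0, 1]},
    name="img",
)

df_demo = da_demo.to_dataframe()  # MultiIndex (y, x), single column 'img'
encoded = pd.get_dummies(df_demo, columns=["img"], dtype=int)

# Back to xarray: one variable per pixel value, stacked along a new 'level' dimension
da_onehot = encoded.to_xarray().to_array(dim="level")

print(da_onehot.dims, da_onehot.shape)
print(da_onehot["level"].values)
```

Each pixel now has exactly one non-zero entry along `level`, so summing over that dimension gives one everywhere.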

Data scaling and normalization#

Again, sklearn has tools and pipelines for this, but with pandas and xarray it is very easy to do manually:

X_scaled = (X_data.max('sample') - X_data) / (X_data.max('sample') - X_data.min('sample'))

X_scaled.min(['x', 'y', 'channel']), X_scaled.max(['x', 'y', 'channel'])

(<xarray.DataArray (sample: 2)> Size: 8B
 array([0., 0.], dtype=float32)
 Dimensions without coordinates: sample,
 <xarray.DataArray (sample: 2)> Size: 8B
 array([1., 1.], dtype=float32)
 Dimensions without coordinates: sample)

X_scaled.plot.imshow(col='sample', col_wrap=2)

/Users/watson-parris/miniconda3/envs/sio209_dev/lib/python3.10/site-packages/matplotlib/cm.py:494: RuntimeWarning: invalid value encountered in cast
  xx = (xx * 255).astype(np.uint8)

X_normed = (X_data - X_data.mean('sample')) / X_data.std('sample')

/Users/watson-parris/miniconda3/envs/sio209_dev/lib/python3.10/site-packages/numpy/lib/nanfunctions.py:1879: RuntimeWarning: Degrees of freedom <= 0 for slice.
  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,

X_scaled.mean(['x', 'y', 'channel']), X_scaled.std(['x', 'y', 'channel'])

(<xarray.DataArray (sample: 2)> Size: 8B
 array([0.62178177, 0.37821826], dtype=float32)
 Dimensions without coordinates: sample,
 <xarray.DataArray (sample: 2)> Size: 8B
 array([0.48494247, 0.48494244], dtype=float32)
 Dimensions without coordinates: sample)

X_normed.plot.imshow(col='sample', col_wrap=2)

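For reference, the conventional form of min-max scaling is (X - min) / (max - min). The expression used for X_scaled above has max - X in the numerator, which also maps to [0, 1] but with the ordering reversed. The sketch below applies the conventional form, and the same standardisation used for X_normed, to a toy array with three samples; `X_demo` and its values are illustrative assumptions.

```python
import numpy as np
import xarray as xr

# Toy data: three samples of two features
X_demo = xr.DataArray(
    np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]]),
    dims=("sample", "feature"),
)

# Min-max scaling: each feature is mapped to [0, 1] across the samples
x_min = X_demo.min("sample")
x_max = X_demo.max("sample")
X_minmax = (X_demo - x_min) / (x_max - x_min)

# Standardisation: each feature gets zero mean and unit standard deviation
X_standard = (X_demo - X_demo.mean("sample")) / X_demo.std("sample")

print(X_minmax.values)
print(X_standard.mean("sample").values, X_standard.std("sample").values)
```

If a feature takes the same value in every sample, both denominators are zero and the result is NaN, which is a likely source of runtime warnings such as those shown above.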

Feature selection#

In machine learning we usually have a lot of data, and not all of it is useful. Feature selection is the process of selecting a subset of relevant features for use in model construction. Feature selection techniques are used for several reasons:

simplification of models to make them easier to interpret by researchers/users,
shorter training times,
to avoid the curse of dimensionality,
to improve the accuracy of a model by removing irrelevant features or noise.

There are many different feature selection techniques, but here we’ll look at a few simple ones:

Removing features with low variance
Univariate feature selection such as SelectKBest
Akaike / Bayesian information criterion (AIC / BIC)
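
The first two techniques can be sketched with scikit-learn's VarianceThreshold and SelectKBest. The synthetic data below (one informative feature, one pure-noise feature and one constant feature) and the choice of f_regression as the scoring function are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression

rng = np.random.default_rng(0)
n = 200

informative = rng.normal(size=n)  # drives the target
noise = rng.normal(size=n)        # unrelated to the target
constant = np.full(n, 3.0)        # zero variance, carries no information

X = np.column_stack([informative, noise, constant])
y = 2.0 * informative + 0.1 * rng.normal(size=n)

# Step 1: drop features whose variance does not exceed the threshold
variance_filter = VarianceThreshold(threshold=0.0)
X_var = variance_filter.fit_transform(X)
kept_by_variance = variance_filter.get_support()

# Step 2: keep the k features with the highest univariate score against y
k_best = SelectKBest(score_func=f_regression, k=1)
X_best = k_best.fit_transform(X_var, y)
kept_by_score = k_best.get_support()

print(kept_by_variance, kept_by_score, X_best.shape)
```

Univariate scores consider each feature in isolation, so they can miss features that are only informative in combination with others.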

Each of these makes assumptions about the data, so it is important to understand both the data and the assumptions of the technique before using it. For deep learning models, explicit feature selection is often unnecessary because the model can learn the relevant features itself.
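
For the information criteria, a minimal sketch assuming an ordinary least-squares model with Gaussian errors, for which (up to an additive constant) AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), where n is the number of samples, k the number of fitted coefficients and RSS the residual sum of squares. The helper `gaussian_aic_bic` and the synthetic data are hypothetical.

```python
import numpy as np


def gaussian_aic_bic(X, y):
    """AIC and BIC of a least-squares fit, up to a constant shared by all models."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic


rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)  # relevant feature
x2 = rng.normal(size=n)  # irrelevant feature
y = 1.5 * x1 + rng.normal(scale=0.5, size=n)
ones = np.ones(n)        # intercept column

# Candidate feature subsets, each expressed as a design matrix
candidates = {
    "intercept only": np.column_stack([ones]),
    "x1": np.column_stack([ones, x1]),
    "x1 + x2": np.column_stack([ones, x1, x2]),
}

scores = {name: gaussian_aic_bic(X, y) for name, X in candidates.items()}
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (aic, bic) in scores.items():
    print(f"{name}: AIC={aic:.1f}, BIC={bic:.1f}")
```

Lower values indicate a better trade-off between goodness of fit and model complexity; BIC penalises extra features more heavily than AIC once n exceeds about 7.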
