tf.keras is a well-integrated version of Keras that takes advantage of most TensorFlow efficiencies and optimizations, including distributed training and model deployment, while still abstracting away low-level coding details. As of TensorFlow 2.0, all creation and use of layers and models goes through tf.keras. The rest of the TensorFlow framework covers both lower-level details, such as certain data-processing methods, and other tools, such as uncertainty estimation and model deployment.
TensorFlow allows the creation of custom layers, which then work similarly to built-in ones like Dense and Conv2D. To do this, one should extend the keras.layers.Layer class, ideally defining the __init__ (initialization), build (weight creation) and call (forward pass) methods. Here's an example:
```python
import tensorflow as tf
from tensorflow import keras

class Linear(keras.layers.Layer):
    def __init__(self, units=32):
        super(Linear, self).__init__()
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer="random_normal",
            trainable=True,
        )
        self.b = self.add_weight(
            shape=(self.units,), initializer="random_normal", trainable=True
        )

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b
```
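As a quick usage sketch (shapes chosen arbitrarily): the layer builds its weights lazily on the first call, inferring the input dimension from the data it receives.

```python
import tensorflow as tf
from tensorflow import keras

class Linear(keras.layers.Layer):
    """Same layer as above, repeated so this snippet runs standalone."""
    def __init__(self, units=32):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="random_normal", trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer="random_normal", trainable=True)

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b

layer = Linear(units=4)
y = layer(tf.ones((2, 3)))   # build() runs here, inferring input dim 3
print(y.shape)               # (2, 4)
print(len(layer.weights))    # 2 weights: w and b
```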
Additionally, one can define a custom model. The advantage of doing so is being able to override the training (fit), evaluation (evaluate), inference (predict) and saving (save and save_weights) methods. This is done by extending the tf.keras.Model class. Example:
```python
import tensorflow as tf
from tensorflow.keras import layers

class ResNet(tf.keras.Model):
    def __init__(self, num_classes=10):
        super(ResNet, self).__init__()
        self.block_1 = ResNetBlock()  # assumed to be defined elsewhere
        self.block_2 = ResNetBlock()
        self.global_pool = layers.GlobalAveragePooling2D()
        self.classifier = layers.Dense(num_classes)

    def call(self, inputs):
        x = self.block_1(inputs)
        x = self.block_2(x)
        x = self.global_pool(x)
        return self.classifier(x)

resnet = ResNet()
dataset = ...
# the model must be compiled before fit() can be called
resnet.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
resnet.fit(dataset, epochs=10)
resnet.save(filepath)
```
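ResNetBlock is not defined in the snippet above; a minimal hypothetical sketch of such a block (the filter counts and layout here are assumptions, not the original code) could look like:

```python
import tensorflow as tf
from tensorflow.keras import layers

class ResNetBlock(tf.keras.layers.Layer):
    """A minimal residual block: two conv layers plus a skip connection."""
    def __init__(self, filters=64, kernel_size=3):
        super().__init__()
        self.filters = filters
        self.conv1 = layers.Conv2D(filters, kernel_size, padding="same",
                                   activation="relu")
        self.conv2 = layers.Conv2D(filters, kernel_size, padding="same")
        self.proj = None

    def build(self, input_shape):
        # Project the shortcut if the input channel count differs from `filters`.
        if input_shape[-1] != self.filters:
            self.proj = layers.Conv2D(self.filters, 1, padding="same")

    def call(self, inputs):
        x = self.conv1(inputs)
        x = self.conv2(x)
        shortcut = self.proj(inputs) if self.proj is not None else inputs
        return tf.nn.relu(x + shortcut)

block = ResNetBlock(filters=8)
out = block(tf.ones((1, 16, 16, 3)))
print(out.shape)  # (1, 16, 16, 8)
```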
So if you're wondering, "should I use the Layer class or the Model class?", ask yourself: will I need to call fit() on it? Will I need to call save() on it? If so, go with Model. If not (either because your class is just a block in a bigger system, or because you are writing training & saving code yourself), use Layer.
To be confirmed: custom layers might not work directly with TensorFlow's usual model-construction APIs, such as the Sequential model. When custom layers are involved, we might be required to define a custom model as well, even if we don't need to change fit or the other methods (we can override just the __init__ and call methods).
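On that point: in recent TensorFlow versions a subclassed layer does appear to plug directly into keras.Sequential, like any built-in layer. A quick check, repeating the Linear layer from earlier so the snippet stands alone:

```python
import tensorflow as tf
from tensorflow import keras

class Linear(keras.layers.Layer):
    """The custom layer from earlier, repeated so this snippet runs standalone."""
    def __init__(self, units=32):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="random_normal", trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer="random_normal", trainable=True)

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b

# Custom layers mixed with built-in ones inside a Sequential model.
model = keras.Sequential([
    Linear(units=16),
    keras.layers.Activation("relu"),
    Linear(units=1),
])
out = model(tf.ones((4, 8)))
print(out.shape)  # (4, 1)
```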
Making new Layers and Models via subclassing | TensorFlow Core
There might be times when it's useful to define a custom training pipeline. In that case, a couple of nuances should be taken into account:
Ideally, one should define two functions with the @tf.function decorator, one corresponding to a single training step and another to a single testing step. The @tf.function decorator is useful to improve performance and to allow the model to be exported for saving. But beware that debugging is easier without it:
Only use tf.function to decorate high-level computations - for example, one step of training or the forward pass of your model.
The expression with tf.GradientTape() as tape: is needed during training to record operations so that gradients can be computed. It is roughly the opposite of PyTorch's with torch.no_grad(): PyTorch tracks gradients by default and no_grad() disables that, whereas TensorFlow only records operations inside a GradientTape context.
In order to run the optimization step, we call tape.gradient to compute the gradients and then the optimizer's apply_gradients method to update the weights.
The training boolean parameter, passed when calling the model, switches the behavior of components that act differently during training and inference. For example, dropout should be active during training, via model(input, training=True), and shut off during testing, via model(input, training=False). This is similar to PyTorch's model.train() and model.eval() modes.
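A small illustration with a Dropout layer: with training=True roughly half the units are zeroed (and the survivors rescaled by 1/(1-rate)), while training=False leaves the input untouched.

```python
import tensorflow as tf
from tensorflow import keras

drop = keras.layers.Dropout(rate=0.5)
x = tf.ones((1, 10))

train_out = drop(x, training=True)   # each value is either 0.0 or 2.0 (= 1/(1-0.5))
test_out = drop(x, training=False)   # identity: dropout is disabled at inference
print(test_out.numpy())
```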
When starting a new epoch, the loss and the remaining metrics should be reset through reset_states (renamed reset_state in recent Keras versions); otherwise, their values keep accumulating across epochs.
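A sketch of that reset between epochs, using a Keras accuracy metric (the predictions and labels here are made up for illustration):

```python
import tensorflow as tf

acc = tf.keras.metrics.SparseCategoricalAccuracy()

# "Epoch 1": both predictions (argmax 0 and 1) match the labels.
acc.update_state([0, 1], [[0.9, 0.1], [0.2, 0.8]])
print(float(acc.result()))   # 1.0

# Reset before "epoch 2"; otherwise the new batches would be
# averaged together with epoch 1's.
acc.reset_state()            # named reset_states in older tf.keras versions
acc.update_state([1, 1], [[0.9, 0.1], [0.2, 0.8]])
print(float(acc.result()))   # 0.5: only the second prediction is correct
```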