Feedback Report 03.05.2024
While reading this, I made the assumption that the current content is close to final and missing content will be added later. Generally, the report is very good and the only thing that you need to take care of is pointing out the relationships between the different parts of the report (e.g. objects don't change with translation -> CNNs should be translation invariant; however, sometimes positional information is important -> translation equivariant, etc.).
Writing is excellent, no complaints there.
I also suggest you add the results for one study right now, doesn't matter if they change again, just to calibrate the writing, formal and stylistic aspects.
Here is a list of a few things that should still be included imo:
- a few more references, especially for the background section
- more background for CNNs: architecture (at least from alexnet to resnet, including the evolution away from FC layers), backpropagation, application, maybe an overview of important dataset (collection techniques) as we also work in that space a bit
Structure
- generally good, "Results and discussion" -> "Results" as you have a separate Discussion chapter anyway
- why have another section on RQs in the background? I like them at the beginning, the introduction should provide just enough context to understand them and from then on you can reference them whenever you like
Introduction
- the introduction is good, I like where it starts - in medias res but it still gives a good overview of why we believe weight sharing is important in CNNs
- maybe qualify the "good results in tasks" (e.g. "surpassed human performance in task 1, task 2, task 3...") with respective references
- Maybe also mention that ever since resnet, CNNs have global pooling but there are few ablation studies into the effects
- "tackle the problem of translation non invariance" - that's true but we also tackle the more general problem of translation non-equivariance
- RQs:
- develop toy dataset to expose translation non-equivariance problems in contrast with real-world images
- quantify advantage of global pooling in CNNs
- investigate effect of augmentation on translation equivariance
- employ positional encoding and attention to improve equivariance
Background
- very good, take care to not have "orphan headings" that don't have any content below them (e.g. "Translation non invariance in CNNs") - you should at least always explain what comes next and why
- I think that you can already have a diagram to explain the pixel grid and later re-use the same style etc. for visual consistency
- Next should be another visualization of the convolution operation only
- maybe have a separate section on just the convolution operation, then another on architecture with the other layers and how the backprop works etc.
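To make the convolution section concrete, a minimal numpy sketch of just the operation itself could look like this (direct loop, "valid" padding, single channel, no strides - a didactic sketch, not an efficient implementation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Direct 2D cross-correlation (what CNNs call "convolution"), 'valid' padding."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output pixel is the sum of an elementwise product
            # between the kernel and the patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])  # simple diagonal-difference filter
print(conv2d_valid(image, kernel))  # 3x3 output, every entry -5.0
```

A figure built from exactly this sliding-window view would pair nicely with the pixel-grid diagram suggested above.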
- Figure 2.1 needs a reference if it's not your own, but it's best to have your own anyway
- similar for the other figures, 2.2 is especially ugly and should be redone - I mostly use draw.io, which works ok for this kind of stuff
- another important concept wrt. the capability of CNNs is the increasing feature complexity with depth, I think the resource for that is the deconv paper - might be important as in the last layers, where we pool, features should be pretty high-level
- chapter 2.1.3 could have more content on the effects / downsides of translation invariance, e.g. the loss of spatial information, which can be desired and useful in some scenarios (e.g. OCR, scene segmentation)
- then revisit this in the beginning of section 2.1.5
- For data augmentation I want lots and lots of examples - how the augmentation affects the image but maybe also a visualization of how they can affect the label. e.g. a rotated 7 is still a 7 but might look more like a 1 - so label-preserving is more of a distribution than a fixed criterion
- maybe also add a table which classifies the augmentation purpose (e.g. translation, rotation, color space, etc.) and the effect on the label
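As a minimal illustration of the "label-preserving is a distribution" point (toy numpy arrays as stand-ins for real digit images):

```python
import numpy as np

# Toy 5x5 "image" of a diagonal stroke (values in [0, 1]).
img = np.eye(5)

# Translation: np.roll shifts the content; the object itself is unchanged,
# so the label is preserved (modulo wrap-around artifacts at the border).
shifted = np.roll(img, shift=1, axis=1)

# Rotation: np.rot90 rotates counterclockwise; whether the label survives
# depends on the class (a rotated "7" may start to resemble a "1").
rotated = np.rot90(img)

# All three contain the same pixel mass - only the spatial layout differs.
assert img.sum() == shifted.sum() == rotated.sum()
```

The suggested table could then have one row per such transform, with columns for the parameter range and whether / when the label is affected.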
- "
This conceptThis aspect necessitated investigations" (p. 9) - form: if you cite a paper title, maybe italicize it (e.g. "The paper Group equivariant convolutional networks [CW16] leverages [...]")
- 2.2 has two duplicate paragraphs
- "Orthogonal moments" is missing the actual orthogonality property
- the last paragraph ("In summary, the papers cited in this section [...]") is a bit misplaced imo, as in its current position it's the last part of 2.2.3. Rather, move it to the beginning of 2.3 and expand it a bit:
- instead of RQ, maybe point out a Research Gap: no foundational investigation of translation invariance, no connection to pooling methods or augmentation strategies
Proposed Methods
- maybe change "Dataset preparation" to something like "Translation Invariance Testing Dataset" (TITd, lol) - also, the dataset should be much better motivated here, but also in the introduction, the RQs and the research gap section mentioned above
- 3.1.2 needs a few more diagrams (e.g. class distribution, background generation)
- Some of your citations might have gone wrong, e.g. [Kai+16] uses a first name, not a last name - best practice: copy the Google Scholar bibtex string (e.g. via a browser plugin)
- 3.2.2 sparse activation should also mention that in conjunction with batch norm, half of the activations are zeroed on average, providing some regularization
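That claim is easy to demonstrate numerically (assuming a roughly symmetric pre-activation distribution, which batch norm pushes towards zero mean):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)  # raw pre-activations

# Batch norm (inference-style, single feature): zero mean, unit variance.
x_bn = (x - x.mean()) / x.std()

# ReLU after a zero-mean normalization zeroes roughly half the activations,
# since about half of a symmetric zero-mean distribution is negative.
relu = np.maximum(x_bn, 0.0)
frac_zero = np.mean(relu == 0.0)
print(f"fraction zeroed: {frac_zero:.3f}")  # close to 0.5
```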
- "are known to be enhancing" -> "enhance"
- For the spatial encodings I would like to see a small comparison at the beginning, maybe a table with some images?
- Also, it should really start with the transformer-style sinusoidal PE, as it is the most basic and easiest to understand, and it is feature-independent
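A minimal sketch of such a PE extended to 2D (my own extension of the 1D transformer formula - splitting channels between the y and x coordinate; the thesis may well choose a different split):

```python
import numpy as np

def sinusoidal_pe_2d(h, w, channels):
    """Transformer-style sinusoidal PE in 2D: the first half of the
    channels encodes the y coordinate, the second half the x coordinate."""
    assert channels % 4 == 0
    c = channels // 2
    pe = np.zeros((h, w, channels))
    # Geometric frequency progression, as in the original transformer.
    div = np.exp(-np.log(10000.0) * np.arange(0, c, 2) / c)  # (c/2,)
    y = np.arange(h)[:, None] * div[None, :]                 # (h, c/2)
    x = np.arange(w)[:, None] * div[None, :]                 # (w, c/2)
    pe[:, :, 0:c:2] = np.sin(y)[:, None, :]   # y channels, even: sin
    pe[:, :, 1:c:2] = np.cos(y)[:, None, :]   # y channels, odd: cos
    pe[:, :, c::2] = np.sin(x)[None, :, :]    # x channels, even: sin
    pe[:, :, c + 1::2] = np.cos(x)[None, :, :]  # x channels, odd: cos
    return pe

pe = sinusoidal_pe_2d(32, 32, 64)
print(pe.shape)  # (32, 32, 64)
```

Plotting a few channels of this would already give the comparison table its first row.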
- "In the context of achieving translation equi-invariance, [...]" - I think it's also important to mention how and what the encoding is used for, e.g. how it's combined with the feature maps - e.g. aggregation function (avg, concat, product?), normalization, etc.
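For the combination question, the two most common options could be sketched like this (random arrays as stand-ins for real feature maps and a real PE):

```python
import numpy as np

h, w, c = 32, 32, 64
features = np.random.rand(h, w, c)  # CNN feature maps
pe = np.random.rand(h, w, c)        # positional encoding, same spatial size

# Option 1: additive (transformer-style) - keeps the channel count, but
# assumes features and PE live on comparable scales, hence normalize first.
feat_norm = (features - features.mean()) / (features.std() + 1e-8)
added = feat_norm + pe              # (32, 32, 64)

# Option 2: concatenation - keeps features and position cleanly separate,
# at the cost of more input channels for the next layer.
concatenated = np.concatenate([features, pe], axis=-1)  # (32, 32, 128)

print(added.shape, concatenated.shape)
```

Stating explicitly which of these the thesis uses (and why) would answer the "how is it combined" question.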
- How does the laplacian achieve positional encoding if it's just a filter? or, in other words, how does it do so more than just the learned filters we have anyway?
- Further, if we use laplacian, should we also use other edge detection like sobel - at least that has two components
- we could also use edge detection on the original image (as it should have the same size) or on an aggregated representation of all the filters (maybe via attention pooling), to achieve a more general positional encoding that is not tied to the specific filter - this would also be more in line with the original idea of positional encoding in transformers
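A sketch of that idea (scipy's Sobel filter on the original input; stacking the two gradient components into a two-channel "encoding" is my assumption of how it would be used):

```python
import numpy as np
from scipy import ndimage

image = np.random.rand(32, 32)  # stand-in for the original grayscale input

# Sobel yields two components (gradient along y and along x), unlike the
# scalar Laplacian, so the resulting "encoding" carries directional
# information and is independent of any learned filter.
gy = ndimage.sobel(image, axis=0)
gx = ndimage.sobel(image, axis=1)
edge_pe = np.stack([gy, gx], axis=-1)  # (32, 32, 2), same spatial size as input

print(edge_pe.shape)
```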
- For the zernike encoding, maybe add a more detailed rationale - it is rotation invariant, but how does it encode position? Do we use it in addition to normal PE or is it just a sanity check to investigate the effect of rotation invariance?
- The PE in fig. 3.10 is not symmetric, is that intended? If so, why? If not, how can we fix it? Right now, there is no y component in the left half of the image
- I think that in the most basic case, we should start with aligning it with the top left corner - that is closest to the original definition. Then we can think of another variant that is aligned to the image center, and maybe also rotation invariant (e.g. by using a polar coordinate system) - a bit like this
- Then we could also simply combine the two, to get both perspectives?
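The coordinate variants could be prototyped in a few lines (the normalizations are my own choice, just to keep the channels on comparable scales):

```python
import numpy as np

h, w = 32, 32
ys, xs = np.mgrid[0:h, 0:w].astype(float)

# Variant 1: top-left aligned coordinates (closest to the original
# transformer definition, position 0 at the first pixel).
corner = np.stack([ys / (h - 1), xs / (w - 1)], axis=-1)  # (32, 32, 2)

# Variant 2: center-aligned polar coordinates - the radius is rotation
# invariant, the angle carries the orientation.
cy, cx = (h - 1) / 2, (w - 1) / 2
r = np.hypot(ys - cy, xs - cx)
theta = np.arctan2(ys - cy, xs - cx)
polar = np.stack([r / r.max(), theta / np.pi], axis=-1)  # (32, 32, 2)

# Combining the two perspectives is a simple channel-wise concatenation.
combined = np.concatenate([corner, polar], axis=-1)  # (32, 32, 4)
print(combined.shape)
```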
- "The Attention Pooling mechanism follows several steps. Initially, the layer computes the global average across the input feature maps," - I don't quite see why it should be global average, as we want to keep the spatial information, right? The attention scores should be computed based on the full feature representation, i.e. the concatenation of all filters for each pixel location. We might have a representation size of [32, 32, 64] and we want to compute the attention scores for each of the 32x32 locations, so we get a [32, 32] tensor of scores for each location. Then we use this to get a weighted average of size [64], which is the final representation of the image.
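What I have in mind, as a numpy sketch (the query vector is a hypothetical stand-in for whatever learned parameterization the layer actually uses):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h, w, c = 32, 32, 64
features = np.random.rand(h, w, c)  # full feature representation
query = np.random.rand(c)           # hypothetical learned query vector

# Score each of the 32x32 locations from its full 64-dim feature vector,
# not from a pre-averaged global value - this keeps the spatial information.
scores = features.reshape(-1, c) @ query   # (1024,)
weights = softmax(scores).reshape(h, w)    # (32, 32) attention map

# Weighted average over all locations -> one 64-dim image representation.
pooled = np.tensordot(weights, features, axes=([0, 1], [0, 1]))  # (64,)
print(pooled.shape)
```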