Feedback Report 24.05.2024

  • I like the introductory blurb; I assume it will get more detailed as you finish this chapter.

3.1

  • The hardware and software environment is low priority and should definitely not come first. It is debatable whether it is even part of the experimental setup - it's best to move it to the appendix or to the end of this section.
  • The experimental setup section should describe your experimental setup - how do you measure the effect of your interventions? Empirically, on two datasets, to test two different scenarios. You report test accuracy (across 3 runs?). So you need at least subsections for the datasets and the evaluation methodology.
    • This should also include (the results of) your discussion on efficiency comparison
  • You are measuring model performance, so measurement error is to be expected. In your setup there are multiple sources of error, i.e. dataset selection bias and initial network configuration. These are both randomly controlled, so it is advantageous (i.e. effectively computationally free) to test them at the same time (i.e. vary the seed for dataset splitting and network init simultaneously). If we want to estimate the magnitude of this error, single point estimates won't do. It helps to visualize your measurement as drawn from a measurement distribution - so we need multiple samples to at least report the estimated location (i.e. mean) and scale (i.e. std) of this distribution. The best setup in your case might be something like a 3-fold shufflesplit (so you can set the test size) or cross validation. If this is not computationally feasible, we can at least perform a sensitivity or robustness study that reports the measurement error as estimated by performing multiple runs.
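
To make the last point concrete, here is a minimal sketch of such a protocol - assuming scikit-learn for the splitting; `train_and_evaluate` is a hypothetical stand-in for your actual training pipeline:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def train_and_evaluate(train_idx, test_idx, seed):
    """Hypothetical stand-in: train a freshly initialised network (seeded)
    on train_idx and return its test accuracy on test_idx."""
    return 0.0  # placeholder

num_samples = 10_000                 # placeholder dataset size
indices = np.arange(num_samples)

accuracies = []
for seed in range(3):
    # One seed controls both the dataset split and the network init,
    # so both sources of measurement error are varied simultaneously.
    splitter = ShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(indices))
    accuracies.append(train_and_evaluate(train_idx, test_idx, seed=seed))

# Report the location (mean) and scale (std) of the measurement distribution.
print(f"test accuracy: {np.mean(accuracies):.3f} ± {np.std(accuracies):.3f}")
```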

3.2

  • We notice the model with global pooling exhibits the best test accuracy in these experiments. (Amine: done)
  • Figures 3.1 and 3.2 need to be combined, otherwise it's impossible to compare the learning curves. If you want to compare efficiency here, you can do it in terms of number of epochs to reach equivalent performance (in validation loss / accuracy) and mark it accordingly. (Amine: done)
  • As it stands, I'm not sure about the interpretation, but if the pooling model converges earlier, mention that. (Amine: done)
  • Do you vary the learning rate during training? Both models have a jump in performance at around epoch 20. (Amine: done)
  • Comparing the training accuracy / loss is a bit of a moot point because we don't really care about them. It would be more informative to compare the generalization gap (i.e. the difference between training and test performance) - see the small sketch after this list.
  • What is "bloc pooling"? Is it the normal spatial pooling in the ResNet blocks? If so, it should be moved to a separate section concerning the (conv stem) architecture. It might be a similar operation, but it serves a completely different purpose.
  • Table 3.4 is excellent, though the caption should mention which operation the inference time and FLOPs are for - I assume one forward pass? Then the inference time should probably be divided by the batch size to get the time for a single image. Alternatively, you could also measure the throughput (i.e. images per second), as that is sometimes easier to interpret - see the timing sketch after this list.
  • It would probably make more sense to combine tables 3.5 - 3.8, as the interesting aspect is how these two models react to increased translation. You can still split them into subtables for each metric. Also, this should really have a baseline result without translation. (Amine: half done)
  • Overall, Figure 3.5 is the most interesting presentation of these results, so it should probably go first - though I'm not sure about the scale of the y-axis. Showing absolute zero gives a lot of context, but it is slightly misleading because no model would actually be that bad. You can also just show the relevant section of the y-axis; that is not considered misleading for ML results. (Amine: done)
  • Then maybe also add a sub-plot or second curve for validation accuracies to fig 3.5? The learning curves seem to show a big difference in validation acc, but it's hard to see at a glance.
  • Also maybe try combining the learning curves, at least have the curves for pooling and no pooling for each scenario in the same plot. (Amine: done)
  • Otherwise the results are interesting; I didn't expect an almost constant gap between the two models - definitely add a baseline result without translation to further study this gap. My assumption was that the pooling model would be unaffected by increased translation - though it probably isn't, because the objects might be clipping the image borders. This brings me to my previous points about reporting effective translation; I've pasted them below:
    • Another thing: what does a translation range of 50px mean? Sure, I can look it up in the previous chapters, but also mention it here and discuss the effect - if you draw from a uniform distribution, the mean translation should be 25px, but this might be limited by the image size in some cases. If you want to stick with this schema, please keep track of the effective (i.e. actually applied) translation range (e.g. mean ± std, maybe also show the distributions) and report it. Still, this probably neglects another aspect: scale - for a 50px object in a 100px image, a 50px translation is huge; for a 500px object in a 4000px image, not so much. So maybe also report the scale of the object in the image and the translation range as a ratio of the two - see the logging sketch after this list.
  • Here the efficiency difference between the two models is negligible - is this because the backbone is more expensive due to the larger image size? (Amine: done)
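
On the generalization-gap point above, just to fix the definition - a trivial sketch, with illustrative numbers only (not real results):

```python
def generalization_gap(train_metric, test_metric):
    """Per-epoch difference between training and test performance."""
    return [tr - te for tr, te in zip(train_metric, test_metric)]

# Illustrative per-epoch accuracies only, not real results.
print(generalization_gap(train_metric=[0.95, 0.97, 0.98],
                         test_metric=[0.86, 0.87, 0.87]))
```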
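
On the inference time in Table 3.4, a minimal timing sketch - assuming PyTorch on a CUDA device; `model` and `batch` are placeholders for your network and an input batch:

```python
import time
import torch

@torch.no_grad()
def measure_inference(model, batch, n_iters=50, warmup=10):
    """Time forward passes; return per-image latency (s) and throughput (img/s)."""
    model.eval()
    for _ in range(warmup):        # warm-up passes are excluded from timing
        model(batch)
    torch.cuda.synchronize()       # make sure all queued kernels are done
    start = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    per_image = elapsed / n_iters / batch.shape[0]   # divide by batch size
    return per_image, 1.0 / per_image                # throughput = images per second
```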
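
And on the effective translation range, a small sketch of how the actually applied offsets could be logged and reported - the clipping at the image border is an assumption about how your augmentation limits the shift; the sizes match the 50px-object / 100px-image example above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_translation(max_range, image_size, object_size):
    """Draw an offset and clip it so the object stays inside the image (assumption)."""
    dx = rng.uniform(-max_range, max_range)
    limit = (image_size - object_size) / 2     # room left before the border
    return float(np.clip(dx, -limit, limit))

# Log the effective (actually applied) offsets for one configuration.
offsets = np.array([sample_translation(max_range=50, image_size=100, object_size=50)
                    for _ in range(10_000)])
abs_off = np.abs(offsets)
print(f"effective translation: {abs_off.mean():.1f} ± {abs_off.std():.1f} px")
print(f"relative to object size: {abs_off.mean() / 50:.2f}")
```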

3.3

  • This still needs an introductory blurb.
  • Is it T-COCO or t-COCO? (Amine: done)
  • The same point about the tables holds here as before: we really want to compare the response to increased translation.
  • It's a bit hard to say because we still only have single point estimates (FIX), but it looks like the sinusoidal PE is the only one that significantly improves over the pooling baseline. For this reason, the table should probably also have indicators for statistical significance, e.g. stars - see the significance sketch after this list.
  • I'm missing a plot like Figure 3.5 for this section; it would be very helpful to see the response to increased translation. Of course, this should also include a result for no translation and respect the effective translation range, just as before.
    • This plot should also include error bars or a shaded error region (usually ±1 std or quartiles) to show the measurement error - see the plotting sketch after this list.
  • Still, I assume that the PE is slightly helpful across the board, and at basically no cost in terms of efficiency. This is a good result, but it would be nice to see a more detailed analysis of the effect of the PE - maybe it is only helpful for certain translation ranges or only for certain objects. This could be done by splitting the dataset into different object-size bins and measuring the effect of the PE on each of these splits (see the binning sketch after this list). This would also be a good way to show the effect of the PE on the generalization gap, i.e. the difference between training and test performance. (this was also auto-generated 🤯)
  • The efficiency comparison should probably also include convergence efficiency - as before with combined plots of the learning curves. Of course, the learning curves should ideally be averaged across the 3 runs with a shaded error region.
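
For the significance stars, a minimal sketch - with only 3 runs per configuration the test is weak, but it makes the comparison explicit. Assumes SciPy; the accuracies below are illustrative only, not real results:

```python
from scipy.stats import ttest_ind

def significance_stars(acc_a, acc_b, levels=(0.05, 0.01, 0.001)):
    """Welch's t-test between two sets of per-run accuracies -> '', '*', '**' or '***'."""
    p = ttest_ind(acc_a, acc_b, equal_var=False).pvalue
    return "*" * sum(p < alpha for alpha in levels)

# Illustrative per-run test accuracies only, not real results.
pooling_baseline = [0.712, 0.706, 0.709]
sinusoidal_pe    = [0.731, 0.728, 0.735]
print(significance_stars(pooling_baseline, sinusoidal_pe))
```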
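
For the translation-response plot (and the averaged learning curves), a minimal matplotlib sketch of a mean curve with a ±1 std shaded region - the array below is a random placeholder standing in for the per-run results:

```python
import numpy as np
import matplotlib.pyplot as plt

translation = np.array([0, 10, 20, 30, 40, 50])   # px, including the no-translation baseline
acc_runs = np.random.rand(3, len(translation))    # placeholder: (n_runs, n_translation_levels)

mean = acc_runs.mean(axis=0)
std = acc_runs.std(axis=0)

plt.plot(translation, mean, label="pooling + sinusoidal PE (placeholder)")
plt.fill_between(translation, mean - std, mean + std, alpha=0.3)  # ±1 std region
plt.xlabel("translation range (px)")
plt.ylabel("test accuracy")
plt.legend()
plt.show()
```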
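
And for the per-object-size analysis, a small sketch of splitting the test results into size bins - `object_sizes` and `correct` are placeholders for per-sample object sizes and per-sample correctness:

```python
import numpy as np

# Placeholder inputs: object size in px and 0/1 correctness per test sample.
object_sizes = np.random.uniform(20, 200, size=1000)
correct = np.random.randint(0, 2, size=1000)

bins = [0, 50, 100, 150, np.inf]              # size bins in px
bin_ids = np.digitize(object_sizes, bins)     # bin index per sample

for b in range(1, len(bins)):
    mask = bin_ids == b
    if mask.any():
        print(f"objects in [{bins[b-1]}, {bins[b]}) px: "
              f"accuracy {correct[mask].mean():.3f} (n={mask.sum()})")
```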