`vignettes/shuffling_with_constraints.Rmd`

`shuffling_with_constraints.Rmd`

This example demonstrates that by using customized shuffling functions, it is possible to restrain the design optimization to only score sample arrangements that fit the given constraints.

Key idea is that every reshuffling produces a ‘valid’ sample permutation that is not violating those constraints, even if the suggested solution may be quite bad. During optimization, we pick the best design solution from the possible ones by appropriate scoring.

The user is free to implement custom shuffling functions and pass
those to the optimizer. However, some knowledge is required regarding
the internal workings of the optimization and batch container setup.
Therefore the package provides a little generic constructor for
customized shufflings **shuffle_grouped_data** in which
certain types of constraints that relate to grouped samples may be
specified by the passed parameters.

We refer to a simplified version of the *in vivo* example
which is examined deeper in a dedicated vignette.

```
data("invivo_study_samples")
invivo_study_samples <- dplyr::mutate(invivo_study_samples,
Litter_combine_females = ifelse(Sex == "F", "female_all", Litter)
)
str(invivo_study_samples)
#> 'data.frame': 59 obs. of 9 variables:
#> $ AnimalID : chr "F1" "F2" "F3" "F4" ...
#> $ Strain : chr "Strain B" "Strain B" "Strain B" "Strain B" ...
#> $ Sex : chr "F" "F" "F" "F" ...
#> $ BirthDate : Date, format: "2021-05-24" "2021-03-01" ...
#> $ Earmark : chr "R" "2L" "2L1R" "L" ...
#> $ ArrivalWeight : num 19.4 26.5 20.8 22.1 22.9 ...
#> $ Arrival.weight.Unit : chr "g" "g" "g" "g" ...
#> $ Litter : chr "Litter 1" "Litter 2" "Litter 2" "Litter 2" ...
#> $ Litter_combine_females: chr "female_all" "female_all" "female_all" "female_all" ...
invivo_study_samples |>
dplyr::count(Strain, Sex, Litter_combine_females) |>
gt::gt()
```

Strain | Sex | Litter_combine_females | n |
---|---|---|---|

Strain A | F | female_all | 7 |

Strain A | M | Litter 10 | 3 |

Strain A | M | Litter 11 | 4 |

Strain A | M | Litter 12 | 5 |

Strain A | M | Litter 13 | 4 |

Strain A | M | Litter 14 | 6 |

Strain B | F | female_all | 7 |

Strain B | M | Litter 1 | 3 |

Strain B | M | Litter 3 | 5 |

Strain B | M | Litter 4 | 4 |

Strain B | M | Litter 5 | 4 |

Strain B | M | Litter 6 | 4 |

Strain B | M | Litter 7 | 3 |

We will use the litter as a factor to form cages in our design.
However, in order to indicate the compatibility of female animals (see
*in vivo* study vignette), a pseudo-litter
`female_all`

is created here to group all the females
together, marking them as interchangeable for the subgroup (i.e. cage)
allocation.

In the simplified setup we want to assign two treatments to those animals, balancing for strain and sex as the primary suspected confounders. The batch container is prepared as follows:

```
treatments <- factor(rep(c("Treatment A", "Treatment B"), c(30, 29)))
table(treatments)
#> treatments
#> Treatment A Treatment B
#> 30 29
bc <- BatchContainer$new(locations_table = data.frame(Treatment = treatments, Position = seq_along(treatments)))
bc <- assign_in_order(bc, invivo_study_samples)
scoring_f <- osat_score_generator(batch_vars = "Treatment", feature_vars = c("Strain", "Sex"))
bc
#> Batch container with 59 locations and 59 samples (assigned).
#> Dimensions: Treatment, Position
```

As noted, we have to assign animals to cages in this example. The cage is thus acting as the grouping factor for the samples (animals) on which we may want to put further constraints. Concretely:

- We want to form cages with ideally 3 animals each (tolerated/preferred range is from 2-5)
- Variables Strain, Sex and Treatment must be homogeneous within cage
- Animals of different litters must not be put into the same cage
- If at all possible, avoid putting animals with the same ear markings into one cage

We will tackle the usual factor balancing (using the osat score) and the additional constraints at the same time, combined in one conceptional framework.

As stated, the main idea is to provide a customized shuffling function that ensures that only ‘suitable’ design proposals are generated and passed to the scoring function which will then identify a good one.

Also keep in mind that what is the cage here could be any subgroup into which samples have to be partitioned.

The wrapper `shuffle_grouped_data`

allows to construct a
shuffling function that satisfies all constraints defined above at the
same time. It can be passed to the optimizer together with other user
defined options such as the scoring or acceptance functions.

```
bc2 <- optimize_design(
bc,
scoring = scoring_f,
shuffle_proposal_func = shuffle_grouped_data(bc,
allocate_var = "Treatment",
keep_together_vars = c("Strain", "Sex"),
keep_separate_vars = c("Earmark"),
subgroup_var_name = "Cage",
n_min = 2, n_ideal = 3, n_max = 5,
strict = TRUE,
report_grouping_as_attribute = TRUE
),
max_iter = 600
)
#>
#> Formed 4 homogeneous groups using 59 samples.
#> 22 subgroups needed to satisfy size constraints.
#>
#> Finding possible ways to allocate variable of interest with 2 levels ...
#>
#> Aborted after 1000010 recursive calls (maxCalls limit).
#> 176364 allocations found.
#> Usage of sample attributes --> executing first shuffling call.
#> Checking variances of 1-dim. score vector.
#> ... (238.222) - OK
#> Initial score: 453.202
#> Adding 4 attributes to samples.
#> Achieved score: 355.575 at iteration 2
#> No permutations fulfilling the 'keep_separate' constraints in 1000 iters!
#> Increasing number of tolerated violations to 1
#> Achieved score: 305.134 at iteration 19
#> Achieved score: 272.592 at iteration 20
#> Achieved score: 231.507 at iteration 34
#> Achieved score: 181.473 at iteration 52
#> Achieved score: 143.032 at iteration 116
#> Achieved score: 122.49 at iteration 123
#> Achieved score: 105.405 at iteration 165
#> Achieved score: 104.999 at iteration 356
#> Achieved score: 79.371 at iteration 384
#> Achieved score: 52.931 at iteration 525
#> Achieved score: 44.388 at iteration 546
design <- bc2$get_samples()
```

`allocate_var`

is the batch container variable that should
be primarily assigned to individual samples.

`keep_together_vars`

is a list of variables that must be
homogeneous within a subgroup (here: cage).

`keep_separate_vars`

lists variables which should have
different values within a subgroup (here: cage), if at all possible.
This is a soft constraint and will be relaxed in a stepwise way until
solutions can be found.

`subgroup_var_name`

allows to give the generated subgroup
variable a useful name.

`n_min`

, `n_max`

and `n_ideal`

specify the minimal, maximal and ideal group sizes, respectively. It is
often necessary to release the `strict`

criterion to find any
solution at all that satisfies those size criteria.

`report_grouping_as_attribute`

allows, if TRUE, to add the
updated group variable into the batch container at each iteration, so
that scoring functions could make use of this variable (here: cage)!

Following the output of the optimizer, we see that a solution was identified that satisfies all constraints, with the exception of tolerating one violation of earmark-uniqueness within a cage.

The following cages (homogeneous in strain, sex and treatment) have been generated in the process:

Cage | Strain | Sex | Treatment | n |
---|---|---|---|---|

1_1 | Strain A | F | Treatment A | 3 |

1_2 | Strain A | F | Treatment A | 2 |

1_3 | Strain A | F | Treatment A | 2 |

2_1 | Strain A | M | Treatment A | 3 |

2_2 | Strain A | M | Treatment A | 3 |

2_3 | Strain A | M | Treatment A | 3 |

2_4 | Strain A | M | Treatment A | 3 |

2_5 | Strain A | M | Treatment B | 3 |

2_6 | Strain A | M | Treatment B | 3 |

2_7 | Strain A | M | Treatment B | 2 |

2_8 | Strain A | M | Treatment B | 2 |

3_1 | Strain B | F | Treatment B | 3 |

3_2 | Strain B | F | Treatment A | 2 |

3_3 | Strain B | F | Treatment B | 2 |

4_1 | Strain B | M | Treatment A | 3 |

4_2 | Strain B | M | Treatment A | 3 |

4_3 | Strain B | M | Treatment A | 3 |

4_4 | Strain B | M | Treatment B | 3 |

4_5 | Strain B | M | Treatment B | 3 |

4_6 | Strain B | M | Treatment B | 3 |

4_7 | Strain B | M | Treatment B | 3 |

4_8 | Strain B | M | Treatment B | 2 |

`shuffle_grouped_data`

is a wrapper that consecutively
calls other helper function. As an addendum, let us break the whole
procedure down into parts that show what is happening internally at each
step.

We have to divide our animal cohort into subgroups with same strain
and sex, meeting size constraints as stated above. Since 2-5 animals
should go into one cage, we specify `n_min`

and
`n_max`

accordingly. `n_ideal`

would be selected by
default as the mean of those two, but we specify it explicitly here,
too.

The homogeneity of subgroups regarding strain and sex is achieved by
listing those two parameters as `keep_together_vars`

.

Assignment of treatments should be performed as well at some point.
We thus specify Treatment as the `allocation variable`

.

Note that the `Treatment`

variable is technically a batch
container location and not a part of the sample list. This distinction
does not matter at this point. However, all required variables must
exist in the batch container object.

The following call to `form_homogeneous_subgroups()`

produces an object that holds all relevant information about the
samples, the allocation variable and the sizes of the subgroups that
have to be formed. It is NOT decided, however, which animal will end up
in which subgroup. This will be a matter of optimization later on.

```
subg <- form_homogeneous_subgroups(
batch_container = bc, allocate_var = "Treatment",
keep_together_vars = c("Strain", "Sex", "Litter_combine_females"),
subgroup_var_name = "Cage",
n_min = 2, n_ideal = 3, n_max = 5
)
#>
#> Formed 13 homogeneous groups using 59 samples.
#> 18 subgroups needed to satisfy size constraints.
```

In this example, 18 subgroups have to be formed to meet all constraints.

It is possible to obtain more information from the returned list
object. Inspection of element `Subgroup_Sizes`

tells us that
13 ‘animal pools’ have to be formed which are homogeneous in the
relevant parameters (here: strain and sex). Each of those groups happens
to be split in subgroups with a size between 2 and 4 animals , which
will later constitute the individual cages.

```
subg$Subgroup_Sizes
#> $`Strain A/F/female_all`
#> [1] 3 4
#>
#> $`Strain A/M/Litter 10`
#> [1] 3
#>
#> $`Strain A/M/Litter 11`
#> [1] 4
#>
#> $`Strain A/M/Litter 12`
#> [1] 3 2
#>
#> $`Strain A/M/Litter 13`
#> [1] 4
#>
#> $`Strain A/M/Litter 14`
#> [1] 3 3
#>
#> $`Strain B/F/female_all`
#> [1] 3 4
#>
#> $`Strain B/M/Litter 1`
#> [1] 3
#>
#> $`Strain B/M/Litter 3`
#> [1] 3 2
#>
#> $`Strain B/M/Litter 4`
#> [1] 4
#>
#> $`Strain B/M/Litter 5`
#> [1] 4
#>
#> $`Strain B/M/Litter 6`
#> [1] 4
#>
#> $`Strain B/M/Litter 7`
#> [1] 3
```

Each subgroup of animals receives one specific treatment. Or more generally: subgroups have to be homogeneous regarding the allocation variable.

This introduces another type of constraint, since numbers have to add
up to 10 ‘Control’ and 10 ‘Compound’ cases, as given by the
`treatments`

variable. As a next step, we have to find all
possible combinations of subgroups which produce valid options for the
treatment allocation. That’s done with the next call.

This will find a large number of different ways to assign treatments to subgroups that lead to the correct overall number of treated animals.

```
possible <- compile_possible_subgroup_allocation(subg)
#>
#> Finding possible ways to allocate variable of interest with 2 levels ...
#>
#> Finished with 108296 recursive calls.
#> 14660 allocations found.
```

So far we only know the sizes of subgroups (i.e. cages). Thus, in a last step we have to assign specific animals to the various subgroups. Ideally each group of ‘equivalent animals’ (in terms of strain and sex) is split up into more than one subgroup, so there’s many potential ways to assign animals to those.

To allow optimization as usual, we want to generate a shuffling
function that produces only valid solutions in terms of our constraints,
so that optimization can iterate efficiently over this solution space.
The function can be generated by calling
*shuffle_with_subgroup_formation()* with the previously created
subgrouping object and the list of possible treatment allocations.

Every call to this shuffling function will return a permutation index (of the original samples) that constitutes a valid solution to be scored.

The permutation function actually also constructs a ‘Cage’ variable
(see parameter `subgroup_var_name`

in the call to
*form_homogeneous_subgroups()*). To make this parameter available
and join it to the samples in the batch container, use flag
`report_grouping_as_attribute`

in the construction of the
permutation function.

```
shuffle_proposal <- shuffle_with_subgroup_formation(subg, possible, report_grouping_as_attribute = TRUE)
shuffle_proposal()
#> $location_assignment
#> [1] 9 12 14 8 10 11 13 38 39 40 41 42 43 44 46 47 48 45 49 50 51 52 53 54 56
#> [26] 59 2 3 4 6 55 57 58 1 5 7 35 36 37 19 26 27 20 21 15 16 17 18 22 23
#> [51] 24 25 28 29 30 31 32 33 34
#>
#> $samples_attr
#> # A tibble: 59 × 4
#> alloc_var_level group subgroup Cage
#> <chr> <int> <int> <chr>
#> 1 Treatment B 7 1 7_1
#> 2 Treatment A 7 2 7_2
#> 3 Treatment A 7 2 7_2
#> 4 Treatment A 7 2 7_2
#> 5 Treatment B 7 1 7_1
#> 6 Treatment A 7 2 7_2
#> 7 Treatment B 7 1 7_1
#> 8 Treatment A 1 2 1_2
#> 9 Treatment A 1 1 1_1
#> 10 Treatment A 1 2 1_2
#> # ℹ 49 more rows
```

Calling the shuffle proposal function repeatedly produces a valid (constraint-aware) sample arrangement each time, with the grouping variable (here: Cage) reported alongside. (The optimizer will merge the ‘Cage’ variable into the batch container after each iteration, so that it can be used for scoring as if it would have been in the container from the beginning!)

We can finally use the customized shuffling function in the optimization.

```
bc3 <- optimize_design(
bc,
scoring = scoring_f,
shuffle_proposal_func = shuffle_proposal,
max_iter = 300
)
#> Usage of sample attributes --> executing first shuffling call.
#> Checking variances of 1-dim. score vector.
#> ... (470.777) - OK
#> Initial score: 289.541
#> Adding 4 attributes to samples.
#> Achieved score: 189.066 at iteration 9
#> Achieved score: 139.439 at iteration 15
#> Achieved score: 105.405 at iteration 34
#> Achieved score: 52.931 at iteration 55
#> Achieved score: 51.304 at iteration 70
#> Achieved score: 32.863 at iteration 206
design <- bc3$get_samples()
# Obeying all constraints does not lead to a very balanced sample allocation:
dplyr::count(design, Treatment, Strain) |> gt::gt()
```

Treatment | Strain | n |
---|---|---|

Treatment A | Strain A | 17 |

Treatment A | Strain B | 13 |

Treatment B | Strain A | 12 |

Treatment B | Strain B | 17 |

Treatment | Sex | n |
---|---|---|

Treatment A | F | 10 |

Treatment A | M | 20 |

Treatment B | F | 4 |

Treatment B | M | 25 |