# Top 3/4 annotated hours

In most cases, the top 3/4 hours are the 3/4 highest-ranked subregions. Because of silences and skips, this isn't always so, and we had to annotate additional extra and makeup regions. We then had to retrospectively assign the extra/makeup regions to either the top 3 or the top 4 hours so that the total duration of the assigned regions was within 15 minutes of 3 or 4 hours, respectively. The code that took care of that is in the `top3_top4_surplus` module of the `blabpy` package. [Link to the module](https://github.com/BergelsonLab/blabpy/blob/main/blabpy/seedlings/regions/top3_top4_surplus.py). The procedure it implements is described below.

We first sorted all the annotated regions for each recording: we started with the regions that came from subregions (or their parts) that were originally planned to be annotated (ranked 1-4 for months 6-13, and 1-3 for months 14-17) in order of increasing rank, and continued with the makeup/extra regions in chronological order. We then added these regions one by one until the total duration was within 15 minutes of the targeted amount of time (3 or 4 hours, depending on the month of the audio recording). For months 6-7, for which subregions were demarcated post-hoc, there were no makeup regions, so we effectively treated subregions ranked 5 as *makeup*.
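The sorting and accumulation steps above can be sketched as follows. This is a minimal illustration, not the code from `blabpy`: the `Region` class, its field names, and both function names are hypothetical, and durations are assumed to be in milliseconds.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TOLERANCE_MS = 15 * 60 * 1000  # 15 minutes


@dataclass
class Region:
    # Hypothetical structure; not the blabpy representation.
    start_ms: int
    end_ms: int
    subregion_rank: Optional[int] = None  # None for makeup/extra regions

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def sort_regions(regions: List[Region]) -> List[Region]:
    """Planned subregions by increasing rank, then makeup/extra chronologically."""
    planned = sorted(
        (r for r in regions if r.subregion_rank is not None),
        key=lambda r: (r.subregion_rank, r.start_ms),
    )
    other = sorted(
        (r for r in regions if r.subregion_rank is None),
        key=lambda r: r.start_ms,
    )
    return planned + other


def select_regions(regions: List[Region], target_hours: int) -> Tuple[List[Region], int]:
    """Add regions one by one until the total is within 15 minutes of the target."""
    target_ms = target_hours * 60 * 60 * 1000
    selected, total = [], 0
    for region in sort_regions(regions):
        if total >= target_ms - TOLERANCE_MS:
            break
        selected.append(region)
        total += region.duration_ms
    return selected, total
```

The sketch only handles reaching the lower bound of the 15-minute window; it does not deal with a region that overshoots the target.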

In a handful of cases, *extra* regions were further added to reach the targeted length of time. For months 6-7, if an added region made the total duration exceed 3 or 4 hours, that region was truncated. (The *surplus* regions for months 6-7 were determined programmatically: anything that wasn't part of the top 4 hours or *silences* was marked as surplus.) For months 8-17, parts of the *makeup*/*extra* regions were manually marked as *surplus* so that the duration stayed within 15 minutes of the targeted length (3 or 4 hours).
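The truncation used for months 6-7 could look roughly like the sketch below. The function name and the `(start_ms, end_ms)` tuple format are hypothetical. It also assumes that the beginning of the overshooting region is kept and its end is cut, which the description above does not specify.

```python
from typing import List, Tuple


def truncate_to_target(
    regions: List[Tuple[int, int]], target_ms: int
) -> List[Tuple[int, int]]:
    """Keep regions, given in priority order, until target_ms is reached.

    The region that would push the total over the target is shortened
    (assumption: its end is cut), and later regions are dropped.
    """
    kept, total = [], 0
    for start, end in regions:
        remaining = target_ms - total
        if remaining <= 0:
            break
        end = min(end, start + remaining)  # shorten only if it would overshoot
        kept.append((start, end))
        total += end - start
    return kept
```

Under this sketch, whatever is cut off or dropped would correspond to *surplus*.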

The *regions* table contains the list of the boundaries of all annotated regions together with additional information that would permit interested parties to analogously identify the top $x$ *high-talk* hours in audio recordings for other values of $x$.

For months 6-13, $x$ can go up to 4; for all months, $x$ can go up to 3. With some adjustment to account for the fact that some videos were shorter than 1 hour, the same procedure can be applied to determine the single most *high-talk* hour in all recordings, including the video ones.

Columns *is\_top\_3\_hours* and *is\_top\_4\_hours* should be used to identify the top 3 and top 4 *high-talk* hours, respectively. The *region\_id* column can be used to join the *regions* table with the *seedlings-nouns* table to identify which regions the tokens came from. In the case of *is\_top\_3\_hours*, we used its contents to pre-populate the column of the same name in the *seedlings-nouns* table. Note that for the video recordings, its value is always *NA*.

For subregions ranked 1-4 for months 6-13, the *regions* table additionally records that fact itself (column *is\_part\_of\_subregion*), the subregion rank (column *subregion\_rank*), and the number of the subregion in chronological order (column *position*). Columns *subregion\_rank* and *position* are *NA* for all regions for which *is\_part\_of\_subregion* is false.
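As an illustration of the join on *region\_id*, the sketch below selects the tokens that fall within the top 4 hours using only the Python standard library. The rows are made up, the `token` column and the function name are hypothetical, and a token without a matching region is assumed to have no *region\_id*.

```python
# Made-up rows standing in for the regions and seedlings-nouns tables.
regions = [
    {"region_id": "r1", "is_top_3_hours": True, "is_top_4_hours": True},
    {"region_id": "r2", "is_top_3_hours": False, "is_top_4_hours": True},
    {"region_id": "r3", "is_top_3_hours": False, "is_top_4_hours": False},
]
nouns = [
    {"token": "ball", "region_id": "r1"},
    {"token": "dog", "region_id": "r2"},
    {"token": "cup", "region_id": "r3"},
    {"token": "sock", "region_id": None},  # assumption: no matching region
]


def tokens_in_top_hours(nouns, regions, flag_column):
    """Return the noun rows whose region has flag_column set to True."""
    flags = {r["region_id"]: r[flag_column] for r in regions}
    # Tokens whose region is missing or not flagged are left out.
    return [n for n in nouns if flags.get(n["region_id"]) is True]


top4 = tokens_in_top_hours(nouns, regions, "is_top_4_hours")
```

With data frames, the same lookup would typically be a left join of the nouns onto the regions by *region\_id*, followed by a filter on the flag column.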
