Top 3/4 annotated hours

In most cases, the top 3/4 hours are simply the 3 or 4 highest-ranked subregions. Because of silences and skips, this is not always so, and we had to annotate additional material by creating extra and makeup regions. We then retrospectively assigned the extra/makeup regions to either the top 3 or the top 4 hours so that the total duration of the assigned regions was within 15 minutes of 3 or 4 hours, respectively. The code that took care of that is in the top3_top4_surplus module of the blabpy package. Link to the module. The procedure, which was implemented programmatically, is described below.

We first sorted all the annotated regions for each recording. We started with the regions that came from subregions (or their parts) that were originally planned to be annotated (ranked 1-4 for months 6-13, and 1-3 for months 14-17), in order of increasing rank, and continued with the makeup/extra regions in chronological order. We then added these regions one by one until the total duration was within 15 minutes of the targeted amount of time (3 or 4 hours, depending on the month of the audio recording). For months 6-7, for which subregions were demarcated post hoc, there were no makeup regions, so we effectively treated subregions ranked 5 as makeup.
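The ordering and accumulation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the top3_top4_surplus module: the Region class, its field names, and both function names are hypothetical, and times are assumed to be in minutes. The sketch stops adding regions as soon as the running total comes within 15 minutes of the target and does not deal with overshoot.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TOLERANCE_MIN = 15  # tolerance stated in the text


@dataclass
class Region:
    # Hypothetical fields; the real regions table may use other names/units.
    start_min: float
    end_min: float
    subregion_rank: Optional[int] = None  # None marks a makeup/extra region

    @property
    def duration(self) -> float:
        return self.end_min - self.start_min


def order_regions(regions: List[Region]) -> List[Region]:
    """Planned subregions by increasing rank, then makeup/extra chronologically."""
    planned = sorted(
        (r for r in regions if r.subregion_rank is not None),
        key=lambda r: (r.subregion_rank, r.start_min),
    )
    other = sorted(
        (r for r in regions if r.subregion_rank is None),
        key=lambda r: r.start_min,
    )
    return planned + other


def select_until_target(
    regions: List[Region], target_hours: int
) -> Tuple[List[Region], float]:
    """Add regions one by one until the total is within 15 min of the target."""
    target_min = target_hours * 60
    selected: List[Region] = []
    total = 0.0
    for region in order_regions(regions):
        if total >= target_min - TOLERANCE_MIN:
            break  # close enough to the target; stop adding
        selected.append(region)
        total += region.duration
    return selected, total
```

Calling select_until_target(regions, 3) or select_until_target(regions, 4) on the same list mirrors the two assignments (top 3 and top 4 hours) described in the text.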

In a handful of cases, extra regions were further added to reach the targeted length of time. For months 6-7, if an added region made the total duration exceed 3 or 4 hours, that region was truncated. (The surplus regions for months 6-7 were determined programmatically: anything that wasn't part of the top 4 hours or silences was marked as surplus.) For months 8-17, parts of the makeup/extra regions were manually marked as surplus so that the duration stayed within 15 minutes of the targeted length (3 or 4 hours).
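The truncation rule for months 6-7 can be illustrated with a short sketch. It is not the blabpy implementation: the function name is hypothetical, regions are plain (start, end) pairs in whole minutes, and the sketch assumes that the end of the overshooting region is what gets cut, which the text does not specify.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_min, end_min), hypothetical representation


def truncate_to_target(
    ordered_regions: List[Interval], target_min: int
) -> Tuple[List[Interval], List[Interval]]:
    """Split already-ordered regions into kept and surplus parts.

    The region that would push the total past target_min is cut so that
    the kept total equals target_min exactly (assumption: cut at the end).
    """
    kept: List[Interval] = []
    surplus: List[Interval] = []
    total = 0
    for start, end in ordered_regions:
        remaining = target_min - total
        if remaining <= 0:
            # Target already reached: the whole region is surplus.
            surplus.append((start, end))
        elif end - start > remaining:
            # Region overshoots: keep the first part, mark the rest surplus.
            cut = start + remaining
            kept.append((start, cut))
            surplus.append((cut, end))
            total = target_min
        else:
            kept.append((start, end))
            total += end - start
    return kept, surplus
```

For months 8-17 the text describes manual marking of surplus with a 15-minute tolerance, so this exact-cut logic would not apply there.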

The regions table contains the list of the boundaries of all annotated regions together with additional information that would permit interested parties to analogously identify the top $x$ high-talk hours in audio recordings for other values of $x$.

For months 6-13, $x$ can go up to 4; for all months, $x$ can go up to 3. With some adjustment to account for the fact that some videos were shorter than 1 hour, the same procedure can be applied to determine the single most high-talk hour in all recordings, including the video ones.

Columns is_top_3_hours and is_top_4_hours should be used to identify the top 3 and top 4 high-talk hours, respectively. The region_id column can be used to join the regions table with the seedlings-nouns table to identify which regions the tokens came from. In the case of is_top_3_hours, we used its contents to pre-populate the column of the same name in the seedlings-nouns table. Note that for the video recordings, the value of this column is always NA.

For subregions ranked 1-4 for months 6-13, the regions table additionally records that fact itself (column is_part_of_subregion), the subregion rank (column subregion_rank), and the number of the subregion in chronological order (column position). Columns subregion_rank and position are NA for all regions for which is_part_of_subregion is false.
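As an illustration of the join on region_id, here is a minimal sketch in plain Python. The rows are made up for the example, the "token" key is a hypothetical stand-in for whatever column holds the word in the seedlings-nouns table, and NA is represented as None; only region_id, is_top_3_hours, and is_top_4_hours are column names taken from the text.

```python
from typing import Dict, List

# Made-up rows standing in for the regions table.
regions_rows: List[Dict] = [
    {"region_id": "r1", "is_top_3_hours": True, "is_top_4_hours": True},
    {"region_id": "r2", "is_top_3_hours": False, "is_top_4_hours": True},
    {"region_id": "r3", "is_top_3_hours": False, "is_top_4_hours": False},
    {"region_id": "r4", "is_top_3_hours": None, "is_top_4_hours": None},  # NA
]

# Made-up rows standing in for the seedlings-nouns table.
token_rows: List[Dict] = [
    {"token": "ball", "region_id": "r1"},
    {"token": "dog", "region_id": "r2"},
    {"token": "cup", "region_id": "r3"},
    {"token": "shoe", "region_id": "r4"},
]


def tokens_in_top_hours(
    tokens: List[Dict], regions: List[Dict], flag: str = "is_top_4_hours"
) -> List[Dict]:
    """Keep tokens whose region has the given flag set to True."""
    by_id = {row["region_id"]: row for row in regions}
    # `is True` excludes both False and NA (None) values.
    return [t for t in tokens if by_id.get(t["region_id"], {}).get(flag) is True]


top3 = tokens_in_top_hours(token_rows, regions_rows, "is_top_3_hours")
top4 = tokens_in_top_hours(token_rows, regions_rows, "is_top_4_hours")
```

With a dataframe library the same step would be an ordinary merge on region_id followed by a filter on the flag column.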
