8 - 17 months
Subregion files
Audio recordings from subjects' 8-month to 17-month home visits were not listened to or annotated in their entirety. Instead, five one-hour regions, each called a "subregion," were identified programmatically in each file. Subregions are numbered 1 - 5 in the order they occur in the recording and, separately, ranked 1 - 5 by talkativeness. For example, the subregion that occurs earliest in the day (first in the recording) is Subregion 1, but it could be ranked 3 out of 5.
Talkativeness was estimated from automated measures calculated by the LENA software. For each audio file, the software output a table that contained estimates of, among other measures, adult word count (AWC), conversational turn count (CTC), and child vocalization count (CVC). We used the average of the raw CTC and CVC scores as the measure of talkativeness. Subregions were selected iteratively: we chose the 12 consecutive 5-minute intervals with the highest talkativeness, then chose another 12 from the intervals not yet selected, and so on until five subregions had been selected.
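The iterative selection described above can be sketched as a greedy search. This is only an illustration under stated assumptions, not the script that was actually used: the function name `select_subregions`, the input format (two equal-length lists of per-interval CTC and CVC values), scoring a window by the sum of its interval scores, and the tie-breaking (earliest window wins) are all hypothetical.

```python
def select_subregions(ctc, cvc, n_subregions=5, window=12):
    """Greedily pick non-overlapping windows of `window` consecutive intervals.

    ctc, cvc: per-5-minute-interval counts (assumed input format).
    Returns one dict per subregion, in rank order (rank 1 = most talkative).
    """
    # Talkativeness per interval: average of raw CTC and CVC.
    talk = [(c + v) / 2 for c, v in zip(ctc, cvc)]
    taken = [False] * len(talk)
    chosen = []  # window start indices, in order of selection

    for _ in range(n_subregions):
        best_start, best_score = None, None
        for start in range(len(talk) - window + 1):
            # Skip windows that reuse an already selected interval.
            if any(taken[start:start + window]):
                continue
            score = sum(talk[start:start + window])
            if best_score is None or score > best_score:
                best_start, best_score = start, score
        if best_start is None:
            break  # no free window of the required length is left
        for i in range(best_start, best_start + window):
            taken[i] = True
        chosen.append(best_start)

    # Position is chronological order; rank is selection order.
    by_time = sorted(chosen)
    return [
        {
            "rank": rank + 1,
            "position": by_time.index(start) + 1,
            "start_interval": start,
            "end_interval": start + window,  # exclusive
        }
        for rank, start in enumerate(chosen)
    ]
```

Because each round searches only the intervals left over from earlier rounds, the window scores cannot increase from one round to the next, so selection order doubles as the talkativeness rank in this sketch.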
8 - 13 month files
In 8 - 13 month files, 4 hours of the file are annotated. The annotated time should come from subregions ranked 1 - 4, with the subregion ranked 5 reserved for make-up time for skips and silences. Anything above 4 hours of annotated time is marked as surplus coding.
14 - 17 month files
In 14 - 17 month files, 3 hours of the file are annotated. The annotated time should come from subregions ranked 1 - 3, with the subregions ranked 4 - 5 reserved for make-up time for skips and silences. Anything above 3 hours of annotated time is marked as surplus coding.
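The age-dependent rules for both file groups can be summarized in a small lookup. The sketch below is illustrative only; the function names `annotation_plan` and `surplus_hours` and the returned dictionary layout are hypothetical and do not come from the project's tooling.

```python
def annotation_plan(age_months):
    """Return target hours and which subregion ranks are primary vs. make-up."""
    if 8 <= age_months <= 13:
        target_hours, primary = 4, [1, 2, 3, 4]
    elif 14 <= age_months <= 17:
        target_hours, primary = 3, [1, 2, 3]
    else:
        raise ValueError("subregion files cover 8 - 17 months only")
    # Remaining ranks are reserved as make-up time for skips and silences.
    makeup = [rank for rank in range(1, 6) if rank not in primary]
    return {
        "target_hours": target_hours,
        "primary_ranks": primary,
        "makeup_ranks": makeup,
    }


def surplus_hours(age_months, annotated_hours):
    """Hours annotated beyond the target, which count as surplus coding."""
    target = annotation_plan(age_months)["target_hours"]
    return max(0.0, annotated_hours - target)
```

For instance, `annotation_plan(15)` lists ranks 1 - 3 as primary and ranks 4 - 5 as make-up, and `surplus_hours(15, 3.5)` reports half an hour of surplus coding.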
Details about subregions
Subregions are non-overlapping but may be adjacent to one another (e.g. subregion 3's end can be the same time as subregion 4's start).
Subregions are marked at the beginning and end with comments that are formatted as follows:
Measured by their start and end times in milliseconds, the subregions are each 60 minutes long. Subregion boundaries could not be placed exactly on CLAN timestamps, as this would make the subregions irregular in length. Thus, the comments contain the start and end times (in milliseconds) and were inserted immediately after the CLAN tier whose timestamp includes the actual subregion start time, e.g.:
Subregions at the end of the file might be somewhat shorter than 60 minutes because of an imprecision in how we selected the subregions. We treated each row in the LENA software output as corresponding to 5 minutes of the recording, while in reality each row corresponded to 5 minutes of clock time. For example, if the recording ran continuously from 06:04:00 am to 10:04:00 pm, the 5-minute intervals would run from 06:00:00 am to 10:05:00 pm, and the subregion spanning the last 12 intervals would end 5 minutes after the recording did.
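The arithmetic behind this example can be checked with a short calculation. The sketch assumes, as described above, that output rows are aligned to 5-minute clock boundaries and that the recording is continuous; the helper names are hypothetical.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)


def floor_to_interval(t):
    """Round a clock time down to the previous 5-minute boundary."""
    return t - timedelta(
        minutes=t.minute % 5, seconds=t.second, microseconds=t.microsecond
    )


def ceil_to_interval(t):
    """Round a clock time up to the next 5-minute boundary."""
    floored = floor_to_interval(t)
    return floored if floored == t else floored + INTERVAL


def assumed_overrun(start, end):
    """Return (number of rows, assumed duration minus actual duration)."""
    n_rows = (ceil_to_interval(end) - floor_to_interval(start)) // INTERVAL
    assumed = n_rows * INTERVAL  # each row treated as 5 minutes of audio
    actual = end - start
    return n_rows, assumed - actual


# Recording from 06:04:00 am to 10:04:00 pm (the date is arbitrary).
n_rows, overrun = assumed_overrun(
    datetime(2020, 1, 1, 6, 4), datetime(2020, 1, 1, 22, 4)
)
print(n_rows, overrun)
```

With these inputs the clock-aligned rows span 16 hours 5 minutes while the recording lasts 16 hours, so a subregion built from the last 12 rows is assumed to end 5 minutes past the actual end of the audio.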