8 - 17 months

Subregion files

Audio recordings from subjects' 8- to 17-month home visits were not listened to or annotated in their entirety. Instead, five one-hour regions were identified programmatically in each file. Each one-hour region is called a "subregion." Subregions are numbered 1 - 5 in the order they occur in the recording, and they are separately ranked 1 - 5 by a measure of talkativeness. For example, the subregion that occurs earliest in the day (first in the recording) is Subregion 1, but it could be ranked 3 out of 5.

Talkativeness was estimated from automated measures calculated by the LENA software. For each audio file, the software output a table containing estimates of, among other measures, adult word count (AWC), conversational turn count (CTC), and child vocalization count (CVC). We used the average of the raw CTC and CVC scores as the measure of talkativeness. Subregions were selected iteratively: we first chose the 12 consecutive 5-minute intervals with the highest talkativeness, then chose the best 12 consecutive intervals from those not yet selected, and so on until five subregions had been selected.
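The iterative selection described above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the function names are hypothetical, and it assumes that a window's score is the sum of its interval scores, that ties go to the earliest window, and that rank corresponds to the order of selection.

```python
from typing import Dict, List

WINDOW = 12  # 12 five-minute intervals = one hour


def talkativeness(ctc: List[float], cvc: List[float]) -> List[float]:
    """Per-interval talkativeness: average of raw CTC and CVC."""
    return [(c + v) / 2 for c, v in zip(ctc, cvc)]


def select_subregions(scores: List[float], n_regions: int = 5,
                      window: int = WINDOW) -> List[Dict[str, int]]:
    """Greedily pick non-overlapping windows of consecutive intervals.

    Assumptions (not confirmed by the documentation): window score is
    the sum of interval scores, ties go to the earliest window, and
    rank is the order in which windows were picked.
    """
    taken = [False] * len(scores)
    picked = []  # (start_index, total), in order of selection
    for _ in range(n_regions):
        best = None
        for start in range(len(scores) - window + 1):
            if any(taken[start:start + window]):
                continue  # overlaps an already selected subregion
            total = sum(scores[start:start + window])
            if best is None or total > best[1]:
                best = (start, total)
        if best is None:
            break  # no room left for another full window
        picked.append(best)
        for i in range(best[0], best[0] + window):
            taken[i] = True

    # Number subregions chronologically; keep selection order as rank.
    order = sorted(range(len(picked)), key=lambda k: picked[k][0])
    return [
        {"number": number, "rank": k + 1, "start_interval": picked[k][0]}
        for number, k in enumerate(order, start=1)
    ]
```

With real data, `scores` would hold one value per 5-minute row of the LENA output, and `start_interval` multiplied by 5 minutes would give the subregion's offset under the assumptions stated above.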

8 - 13 month files

In 8 - 13 month files, 4 hours of the file are annotated. The annotated time should come from subregions ranked 1 - 4, with the subregion ranked 5 reserved for make-up time for skips and silences. Anything above 4 hours of annotated time is marked as surplus coding.

14 - 17 month files

In 14 - 17 month files, 3 hours of the file are annotated. The annotated time should come from subregions ranked 1 - 3, with the subregions ranked 4 - 5 reserved for make-up time for skips and silences. Anything above 3 hours of annotated time is marked as surplus coding.
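The two annotation rules above (8 - 13 months and 14 - 17 months) can be summarized in a small lookup. This is only an illustrative sketch; the function names and the returned structure are hypothetical and are not part of any existing tool.

```python
HOUR_MS = 60 * 60 * 1000  # one hour in milliseconds


def annotation_plan(age_months: int) -> dict:
    """Return which ranked subregions are annotated and which are make-up."""
    if 8 <= age_months <= 13:
        primary, makeup = [1, 2, 3, 4], [5]
    elif 14 <= age_months <= 17:
        primary, makeup = [1, 2, 3], [4, 5]
    else:
        # Ages outside 8 - 17 months are not covered by this section.
        raise ValueError(f"no subregion rule for age {age_months} months")
    return {
        "target_hours": len(primary),  # one hour per primary subregion
        "primary_ranks": primary,
        "makeup_ranks": makeup,
    }


def surplus_ms(age_months: int, annotated_ms: int) -> int:
    """Annotated time beyond the target, which counts as surplus coding."""
    target_ms = annotation_plan(age_months)["target_hours"] * HOUR_MS
    return max(0, annotated_ms - target_ms)
```

For example, `annotation_plan(15)` gives a 3-hour target drawn from ranks 1 - 3, with ranks 4 - 5 held back as make-up time.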

Details about subregions

Subregions are non-overlapping but may be adjacent to one another (e.g. subregion 3's end can be the same time as subregion 4's start).

Subregions are marked at the beginning and end with comments that are formatted as follows:

%xcom: subregion [1-5] of 5 (ranked [1-5] of 5) starts at [time from beginning of file in milliseconds] -- previous timestamp adjusted: was [time in milliseconds used in prior version of the file] *optional addendum [contains silent region: [silence start time in ms, silence end time in ms] ]
%xcom: subregion [1-5] of 5 (ranked [1-5] of 5) ends at [time from beginning of file in milliseconds] -- previous timestamp adjusted: was [time in milliseconds used in prior version of the file] *optional addendum [contains silent region: [silence start time in ms, silence end time in ms] ]

Measured by the start and end times given in milliseconds, the subregions are each 60 minutes long. Subregions were not made to start and end exactly on CLAN timestamps, as this would make them irregular in length. Instead, the subregion comments state the exact start and end times (in milliseconds), and each was inserted immediately after the CLAN tier whose timestamp includes the actual subregion start time, e.g.:

*FAN: &=w5_52 . •32399810_32400960• 
%xcom: subregion 4 of 5 (ranked 4 of 5) starts at 32400000 -- previous timestamp adjusted: was 32400960
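For scripted checks, a subregion comment like the one above could be read with a regular expression. The sketch below is an assumption about how one might do this, not an existing utility. The pattern and function name are hypothetical, the "previous timestamp adjusted" part is treated as optional, and the optional silent-region addendum is deliberately not parsed.

```python
import re
from typing import Optional

# Hypothetical pattern based on the comment format documented above.
SUBREGION_COMMENT = re.compile(
    r"%xcom:\s*subregion (?P<number>[1-5]) of 5 "
    r"\(ranked (?P<rank>[1-5]) of 5\) "
    r"(?P<edge>starts|ends) at (?P<time_ms>\d+)"
    r"(?: -- previous timestamp adjusted: was (?P<previous_ms>\d+))?"
)


def parse_subregion_comment(line: str) -> Optional[dict]:
    """Return the fields of a subregion comment, or None if no match."""
    match = SUBREGION_COMMENT.search(line)
    if match is None:
        return None
    previous = match.group("previous_ms")
    return {
        "number": int(match.group("number")),  # chronological position
        "rank": int(match.group("rank")),      # talkativeness rank
        "edge": match.group("edge"),           # "starts" or "ends"
        "time_ms": int(match.group("time_ms")),
        "previous_ms": int(previous) if previous is not None else None,
    }
```

Pairing the "starts" and "ends" comments with the same subregion number and subtracting their `time_ms` values gives the subregion length, which should be 3,600,000 ms for a full 60-minute subregion.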

Subregions at the end of the file might be somewhat shorter than 60 minutes because of an imprecision in how we selected the subregions. We treated each row in the LENA software output as corresponding to 5 minutes of the recording, while in reality each row corresponded to 5 minutes of clock time. For example, if the recording ran continuously from 06:04:00 am to 10:04:00 pm, the 5-minute intervals would run from 06:00:00 am to 10:05:00 pm. That is 193 intervals, which we treated as 965 minutes of recording although the recording lasted only 960 minutes, so the subregion spanning the last 12 intervals would end 5 minutes after the recording did.
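The arithmetic behind this example can be checked with a short calculation. The sketch assumes, consistent with the example, that the output rows are aligned to 5-minute clock marks and that partial first and last rows still count as rows. The helper names are hypothetical.

```python
from datetime import datetime, timedelta

FIVE_MIN = timedelta(minutes=5)


def floor_to_5min(t: datetime) -> datetime:
    """Round a clock time down to the previous 5-minute mark."""
    return t - timedelta(minutes=t.minute % 5, seconds=t.second,
                         microseconds=t.microsecond)


def ceil_to_5min(t: datetime) -> datetime:
    """Round a clock time up to the next 5-minute mark."""
    floored = floor_to_5min(t)
    return floored if floored == t else floored + FIVE_MIN


def end_overshoot(rec_start: datetime, rec_end: datetime):
    """Return (number of rows, assumed duration minus actual duration)."""
    first_mark = floor_to_5min(rec_start)
    last_mark = ceil_to_5min(rec_end)
    n_rows = (last_mark - first_mark) // FIVE_MIN
    assumed = n_rows * FIVE_MIN   # each row treated as 5 min of recording
    actual = rec_end - rec_start  # real length of the recording
    return n_rows, assumed - actual


rows, overshoot = end_overshoot(datetime(2000, 1, 1, 6, 4),
                                datetime(2000, 1, 1, 22, 4))
print(rows, overshoot)
```

For the 06:04:00 am to 10:04:00 pm recording this yields 193 rows and an overshoot of 5 minutes, matching the example in the text.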
