I don't know about YouTube but the chunks are often a fixed length. For example 1 or 2 seconds. So as long as the ad itself is an even number of seconds (which YouTube can require, or just pad the add to the nearest second) so there is no concrete difference between the 1s "content" chunks vs the 1s "ad" chunks.
If you are trying to predict the ad chunks you are probably better off doing things like detecting sudden loudness changes, different colour tones or similar. But this will always be imperfect as these could just be scene changes that happened to be chunk aligned in the content.
I don't know about YouTube but the chunks are often a fixed length. For example 1 or 2 seconds. So as long as the ad itself is an even number of seconds (which YouTube can require, or just pad the add to the nearest second) so there is no concrete difference between the 1s "content" chunks vs the 1s "ad" chunks.
If you are trying to predict the ad chunks you are probably better off doing things like detecting sudden loudness changes, different colour tones or similar. But this will always be imperfect as these could just be scene changes that happened to be chunk aligned in the content.