AAC encoding using AudioConverter and writing to AVAssetWriter

Question

I'm struggling to encode audio buffers received from AVCaptureSession using AudioConverter and then appending them to an AVAssetWriter.

I'm not getting any errors (including OSStatus responses), and the CMSampleBuffers generated seem to have valid data, however the resulting file simply does not have any playable audio. When writing together with video, the video frames stop getting appended a couple of frames in (appendSampleBuffer() returns false, but with no AVAssetWriter.error), probably because the asset writer is waiting for the audio to catch up. I suspect it's related to the way I'm setting up the priming for AAC.

The app uses RxSwift, but I've removed the RxSwift parts so that it's easier to understand for a wider audience.

See the comments in the code below for more detail.

Given a settings struct:

import Foundation
import AVFoundation
import CleanroomLogger

public struct AVSettings {

let orientation: AVCaptureVideoOrientation = .Portrait
let sessionPreset                          = AVCaptureSessionPreset1280x720
let videoBitrate: Int                      = 2_000_000
let videoExpectedFrameRate: Int            = 30
let videoMaxKeyFrameInterval: Int          = 60

let audioBitrate: Int                      = 32 * 1024

/// Settings that are `0` mean variable rate.
/// The `mSampleRate` and `mChannelsPerFrame` are overwritten at run-time
/// to values based on the input stream.
let audioOutputABSD = AudioStreamBasicDescription(
                            mSampleRate: AVAudioSession.sharedInstance().sampleRate,
                            mFormatID: kAudioFormatMPEG4AAC,
                            mFormatFlags: UInt32(MPEG4ObjectID.AAC_Main.rawValue),
                            mBytesPerPacket: 0,
                            mFramesPerPacket: 1024,
                            mBytesPerFrame: 0,
                            mChannelsPerFrame: 1,
                            mBitsPerChannel: 0,
                            mReserved: 0)

let audioEncoderClassDescriptions = [
    AudioClassDescription(
        mType: kAudioEncoderComponentType,
        mSubType: kAudioFormatMPEG4AAC,
        mManufacturer: kAppleSoftwareAudioCodecManufacturer) ]

}

Some helper functions:

public func getVideoDimensions(fromSettings settings: AVSettings) -> (Int, Int) {
  switch (settings.sessionPreset, settings.orientation)  {
  case (AVCaptureSessionPreset1920x1080, .Portrait): return (1080, 1920)
  case (AVCaptureSessionPreset1280x720, .Portrait): return (720, 1280)
  default: fatalError("Unsupported session preset and orientation")
  }
}

public func createAudioFormatDescription(fromSettings settings: AVSettings) -> CMAudioFormatDescription {
  var result = noErr
  var absd = settings.audioOutputABSD
  var description: CMAudioFormatDescription?
  withUnsafePointer(&absd) { absdPtr in
      result = CMAudioFormatDescriptionCreate(nil,
                                              absdPtr,
                                              0, nil,
                                              0, nil,
                                              nil,
                                              &description)
  }

  if result != noErr {
      Log.error?.message("Could not create audio format description")
  }

  return description!
}

public func createVideoFormatDescription(fromSettings settings: AVSettings) -> CMVideoFormatDescription {
  var result = noErr
  var description: CMVideoFormatDescription?
  let (width, height) = getVideoDimensions(fromSettings: settings)
  result = CMVideoFormatDescriptionCreate(nil,
                                          kCMVideoCodecType_H264,
                                          Int32(width),
                                          Int32(height),
                                          [:],
                                          &description)

  if result != noErr {
      Log.error?.message("Could not create video format description")
  }

  return description!
}

This is how the asset writer is initialized:

guard let audioDevice = defaultAudioDevice() else
{ throw RecordError.MissingDeviceFeature("Microphone") }

guard let videoDevice = defaultVideoDevice(.Back) else
{ throw RecordError.MissingDeviceFeature("Camera") }

let videoInput      = try AVCaptureDeviceInput(device: videoDevice)
let audioInput      = try AVCaptureDeviceInput(device: audioDevice)
let videoFormatHint = createVideoFormatDescription(fromSettings: settings)
let audioFormatHint = createAudioFormatDescription(fromSettings: settings)

let writerVideoInput = AVAssetWriterInput(mediaType: AVMediaTypeVideo,
                                        outputSettings: nil,
                                        sourceFormatHint: videoFormatHint)

let writerAudioInput = AVAssetWriterInput(mediaType: AVMediaTypeAudio,
                                        outputSettings: nil,
                                        sourceFormatHint: audioFormatHint)

writerVideoInput.expectsMediaDataInRealTime = true
writerAudioInput.expectsMediaDataInRealTime = true

let url = NSURL(fileURLWithPath: NSTemporaryDirectory(), isDirectory: true)
        .URLByAppendingPathComponent(NSProcessInfo.processInfo().globallyUniqueString)
        .URLByAppendingPathExtension("mp4")

let assetWriter =  try AVAssetWriter(URL: url, fileType: AVFileTypeMPEG4)

if !assetWriter.canAddInput(writerVideoInput) {
throw RecordError.Unknown("Could not add video input") }

if !assetWriter.canAddInput(writerAudioInput) {
throw RecordError.Unknown("Could not add audio input") }

assetWriter.addInput(writerVideoInput)
assetWriter.addInput(writerAudioInput)

And this is how audio samples are being encoded; the problem area is most likely around here. I've rewritten this so that it doesn't use any Rx-isms.

var outputABSD = settings.audioOutputABSD
var outputFormatDescription: CMAudioFormatDescription! = nil
CMAudioFormatDescriptionCreate(nil, &outputABSD, 0, nil, 0, nil, nil, &outputFormatDescription)

var converter: AudioConverter?

// Indicates whether priming information has been attached to the first buffer
var primed = false

func encodeAudioBuffer(settings: AVSettings, buffer: CMSampleBuffer) throws -> CMSampleBuffer? {

  // Create the audio converter if it's not available
  if converter == nil {
      var classDescriptions = settings.audioEncoderClassDescriptions
      var inputABSD = CMAudioFormatDescriptionGetStreamBasicDescription(CMSampleBufferGetFormatDescription(buffer)!).memory
      var outputABSD = settings.audioOutputABSD
      outputABSD.mSampleRate = inputABSD.mSampleRate
      outputABSD.mChannelsPerFrame = inputABSD.mChannelsPerFrame

      var converter: AudioConverterRef = nil
      var result = noErr
      result = withUnsafePointer(&outputABSD) { outputABSDPtr in
          return withUnsafePointer(&inputABSD) { inputABSDPtr in
          return AudioConverterNewSpecific(inputABSDPtr,
                                          outputABSDPtr,
                                          UInt32(classDescriptions.count),
                                          &classDescriptions,
                                          &converter)
          }
      }

      if result != noErr { throw RecordError.Unknown }

      // At this point I made an attempt to retrieve priming info from
      // the audio converter assuming that it will give me back default values
      // I can use, but ended up with `nil`
      var primeInfo: AudioConverterPrimeInfo? = nil
      var primeInfoSize = UInt32(sizeof(AudioConverterPrimeInfo))

      // The following returns `noErr` but `primeInfo` is still `nil`
      AudioConverterGetProperty(converter, 
                              kAudioConverterPrimeInfo,
                              &primeInfoSize, 
                              &primeInfo)

      // I've also tried to set `kAudioConverterPrimeInfo` so that it knows
      // the leading frames that are being primed, but the set didn't seem to work
      // (`noErr` but getting the property afterwards still returned `nil`)
  }

  let converter = converter!

  // Need to give a big enough output buffer.
  // The assumption is that it will always be <= to the input size
  let numSamples = CMSampleBufferGetNumSamples(buffer)
  // This becomes 1024 * 2 = 2048
  let outputBufferSize = numSamples * Int(inputABSD.mBytesPerPacket)
  let outputBufferPtr = UnsafeMutablePointer<Void>.alloc(outputBufferSize)

  defer {
      outputBufferPtr.destroy()
      outputBufferPtr.dealloc(1)
  }

  var result = noErr

  var outputPacketCount = UInt32(1)
  var outputData = AudioBufferList(
  mNumberBuffers: 1,
  mBuffers: AudioBuffer(
                  mNumberChannels: outputABSD.mChannelsPerFrame,
                  mDataByteSize: UInt32(outputBufferSize),
                  mData: outputBufferPtr))

  // See below for `EncodeAudioUserData`
  var userData = EncodeAudioUserData(inputSampleBuffer: buffer,
                                      inputBytesPerPacket: inputABSD.mBytesPerPacket)

  withUnsafeMutablePointer(&userData) { userDataPtr in
      // See below for `fetchAudioProc`
      result = AudioConverterFillComplexBuffer(
                      converter,
                      fetchAudioProc,
                      userDataPtr,
                      &outputPacketCount,
                      &outputData,
                      nil)
  }

  if result != noErr {
      Log.error?.message("Error while trying to encode audio buffer, code: \(result)")
      return nil
  }

  // See below for `CMSampleBufferCreateCopy`
  guard let newBuffer = CMSampleBufferCreateCopy(buffer,
                                                  fromAudioBufferList: &outputData,
                                                  newFromatDescription: outputFormatDescription) else {
      Log.error?.message("Could not create sample buffer from audio buffer list")
      return nil
  }

  if !primed {
      primed = true
      // Simply picked 2112 samples based on convention, is there a better way to determine this?
      let samplesToPrime: Int64 = 2112
      let samplesPerSecond = Int32(settings.audioOutputABSD.mSampleRate)
      let primingDuration = CMTimeMake(samplesToPrime, samplesPerSecond)

      // Without setting the attachment the asset writer will complain about the
      // first buffer missing the `TrimDurationAtStart` attachment. Is there a way
      // to infer the value from the given `AudioBufferList`?
      CMSetAttachment(newBuffer,
                      kCMSampleBufferAttachmentKey_TrimDurationAtStart,
                      CMTimeCopyAsDictionary(primingDuration, nil),
                      kCMAttachmentMode_ShouldNotPropagate)
  }

  return newBuffer

}

Below is the proc that fetches samples for the audio converter, and the data structure that gets passed to it:

private class EncodeAudioUserData {
  var inputSampleBuffer: CMSampleBuffer?
  var inputBytesPerPacket: UInt32

  init(inputSampleBuffer: CMSampleBuffer,
      inputBytesPerPacket: UInt32) {
      self.inputSampleBuffer   = inputSampleBuffer
      self.inputBytesPerPacket = inputBytesPerPacket
  }
}

private let fetchAudioProc: AudioConverterComplexInputDataProc = {
  (inAudioConverter,
  ioDataPacketCount,
  ioData,
  outDataPacketDescriptionPtrPtr,
  inUserData) in

  var result = noErr

  if ioDataPacketCount.memory == 0 { return noErr }

  let userData = UnsafeMutablePointer<EncodeAudioUserData>(inUserData).memory

  // If it's already been processed
  guard let buffer = userData.inputSampleBuffer else {
      ioDataPacketCount.memory = 0
      return -1
  }

  var inputBlockBuffer: CMBlockBuffer?
  var inputBufferList = AudioBufferList()
  result = CMSampleBufferGetAudioBufferListWithRetainedBlockBuffer(
              buffer,
              nil,
              &inputBufferList,
              sizeof(AudioBufferList),
              nil,
              nil,
              0,
              &inputBlockBuffer)

  if result != noErr {
      Log.error?.message("Error while trying to retrieve buffer list, code: \(result)")
      ioDataPacketCount.memory = 0
      return result
  }

  let packetsCount = inputBufferList.mBuffers.mDataByteSize / userData.inputBytesPerPacket
  ioDataPacketCount.memory = packetsCount

  ioData.memory.mBuffers.mNumberChannels = inputBufferList.mBuffers.mNumberChannels
  ioData.memory.mBuffers.mDataByteSize = inputBufferList.mBuffers.mDataByteSize
  ioData.memory.mBuffers.mData = inputBufferList.mBuffers.mData

  if outDataPacketDescriptionPtrPtr != nil {
      outDataPacketDescriptionPtrPtr.memory = nil
  }

  return noErr
}

This is how I am converting AudioBufferLists to CMSampleBuffers:

public func CMSampleBufferCreateCopy(
    buffer: CMSampleBuffer,
    inout fromAudioBufferList bufferList: AudioBufferList,
    newFromatDescription formatDescription: CMFormatDescription? = nil)
    -> CMSampleBuffer? {

  var result = noErr

  var sizeArray: [Int] = [Int(bufferList.mBuffers.mDataByteSize)]
  // Copy timing info from the previous buffer
  var timingInfo = CMSampleTimingInfo()
  result = CMSampleBufferGetSampleTimingInfo(buffer, 0, &timingInfo)

  if result != noErr { return nil }

  var newBuffer: CMSampleBuffer?
  result = CMSampleBufferCreateReady(
      kCFAllocatorDefault,
      nil,
      formatDescription ?? CMSampleBufferGetFormatDescription(buffer),
      Int(bufferList.mNumberBuffers),
      1, &timingInfo,
      1, &sizeArray,
      &newBuffer)

  if result != noErr { return nil }
  guard let b = newBuffer else { return nil }

  CMSampleBufferSetDataBufferFromAudioBufferList(b, nil, nil, 0, &bufferList)
  return newBuffer

}

Is there anything that I am obviously doing wrong? Is there a proper way to construct CMSampleBuffers from AudioBufferList? How do you transfer priming information from the converter to CMSampleBuffers that you create?

For my use case I need to do the encoding manually as the buffers will be manipulated further down the pipeline (although I've disabled all transformations after the encode in order to make sure that it works.)

Any help would be much appreciated. Sorry that there's so much code to digest, but I wanted to provide as much context as possible.

Thanks in advance :)

Some related questions:

CMSampleBufferRef kCMSampleBufferAttachmentKey_TrimDurationAtStart crash
Can I use AVCaptureSession to encode an AAC stream to memory?
Writing video + generated audio to AVAssetWriterInput, audio stuttering
How do I use CoreAudio's AudioConverter to encode AAC in real-time?

Some references I've used:

Apple sample code demonstrating how to use AudioConverter
Note describing AAC encoder delay

Nathan Kot · Accepted Answer

Turns out there were a variety of things that I was doing wrong. Instead of posting a garble of code, I'm going to try to organize this into bite-sized pieces covering what I discovered.

Samples vs Packets vs Frames

This had been a huge source of confusion for me:

Each CMSampleBuffer can contain 1 or more samples (discovered via CMSampleBufferGetNumSamples).
Each sample in a CMSampleBuffer represents a single audio packet.
Therefore, CMSampleBufferGetNumSamples(sample) will return the number of packets contained in the given buffer.
Packets contain frames. This is governed by the mFramesPerPacket field of the buffer's AudioStreamBasicDescription. For linear PCM buffers, the total size of each sample buffer is frames * bytes per frame. For compressed buffers (like AAC), there is no relationship between the total size and frame count.
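To make the size and duration arithmetic above concrete, here is a minimal sketch in current Swift syntax (not the Swift 2 of the question). `StreamFormat` is a hypothetical, simplified stand-in for the few AudioStreamBasicDescription fields involved, and the 16-bit mono PCM input is an assumption taken from the "1024 * 2 = 2048" comment in the question.

```swift
// Hypothetical stand-in for the AudioStreamBasicDescription fields used here.
struct StreamFormat {
    var sampleRate: Double
    var framesPerPacket: Int
    var bytesPerFrame: Int   // 0 for compressed formats such as AAC
}

/// Byte size of `packets` packets, or nil when the size cannot be derived
/// from the frame count (compressed formats).
func byteSize(ofPackets packets: Int, in format: StreamFormat) -> Int? {
    guard format.bytesPerFrame > 0 else { return nil }
    return packets * format.framesPerPacket * format.bytesPerFrame
}

/// Duration in seconds covered by `packets` packets.
func duration(ofPackets packets: Int, in format: StreamFormat) -> Double {
    return Double(packets * format.framesPerPacket) / format.sampleRate
}

// Assumed formats: 16-bit mono PCM in, AAC out, both at 44.1 kHz.
let pcm = StreamFormat(sampleRate: 44_100, framesPerPacket: 1, bytesPerFrame: 2)
let aac = StreamFormat(sampleRate: 44_100, framesPerPacket: 1024, bytesPerFrame: 0)

let pcmBytes = byteSize(ofPackets: 1024, in: pcm)  // 1024 PCM packets of 2 bytes each
let aacBytes = byteSize(ofPackets: 1, in: aac)     // nil: size is not derivable
```

Note that 1024 PCM packets and 1 AAC packet cover the same duration here, even though only the PCM size can be computed from the frame count.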

`AudioConverterComplexInputDataProc`

This callback is used to retrieve more linear PCM audio data for encoding. You must supply at least the number of packets specified by ioNumberDataPackets. Since I've been using the converter for real-time push-style encoding, I needed to ensure that each data push contains at least that minimum number of packets. Something like this (pseudo-code):

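// Pseudo-code: `getTotalSize` is assumed to return the total number of
// packets queued so far, and `getNextBuffer` the next captured buffer.
let minimumPackets = outputFramesPerPacket / inputFramesPerPacket
var buffers: [CMSampleBuffer] = []
while getTotalSize(buffers) < minimumPackets {
  buffers.append(getNextBuffer())
}
AudioConverterFillComplexBuffer(...)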
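The same idea can be sketched without any Core Audio types. `PCMAccumulator` below is a hypothetical helper (not part of any Apple framework) that assumes mono 16-bit PCM, so one `Int16` is one frame and one packet. It only hands data out once a full encoder packet's worth of frames has been pushed.

```swift
/// Hypothetical helper: queues mono 16-bit PCM frames until enough are
/// available to fill one encoder packet.
struct PCMAccumulator {
    let framesPerOutputPacket: Int
    private var pending: [Int16] = []

    init(framesPerOutputPacket: Int) {
        self.framesPerOutputPacket = framesPerOutputPacket
    }

    /// Number of frames currently waiting to be encoded.
    var pendingFrameCount: Int { return pending.count }

    /// Appends newly captured frames to the queue.
    mutating func push(_ frames: [Int16]) {
        pending.append(contentsOf: frames)
    }

    /// Returns exactly `framesPerOutputPacket` frames, or nil if fewer are queued.
    mutating func popPacket() -> [Int16]? {
        guard pending.count >= framesPerOutputPacket else { return nil }
        let packet = Array(pending.prefix(framesPerOutputPacket))
        pending.removeFirst(framesPerOutputPacket)
        return packet
    }
}

var accumulator = PCMAccumulator(framesPerOutputPacket: 1024)
accumulator.push([Int16](repeating: 0, count: 700))
let tooEarly = accumulator.popPacket()   // nil: only 700 frames queued
accumulator.push([Int16](repeating: 0, count: 700))
let ready = accumulator.popPacket()      // 1024 frames, 376 stay queued
```

In a real pipeline the popped frames would be what the input data proc hands to the converter; the leftover frames wait for the next capture callback.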

Slicing `CMSampleBuffer`'s

You can actually slice CMSampleBuffers if they contain multiple samples. The tool to do this is CMSampleBufferCopySampleBufferForRange. This lets you provide the AudioConverterComplexInputDataProc with the exact number of packets that it asks for, which makes handling timing information for the resulting encoded buffer easier: if you give the converter 1500 frames of data when it expects 1024, the resulting sample buffer will have a duration of 1024/sampleRate as opposed to 1500/sampleRate.
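As a rough illustration of the bookkeeping, the sketch below works out which sample ranges to cut from a PCM buffer. `sliceRanges` is a hypothetical helper; with linear PCM one sample is one frame, so each range's start and length could be used as the location and length of the CFRange passed to CMSampleBufferCopySampleBufferForRange.

```swift
/// Hypothetical helper: splits `totalFrames` into ranges of exactly
/// `sliceLength` frames, plus an optional leftover range to keep for later.
func sliceRanges(totalFrames: Int, sliceLength: Int) -> (full: [Range<Int>], leftover: Range<Int>?) {
    precondition(sliceLength > 0 && totalFrames >= 0)
    var full: [Range<Int>] = []
    var start = 0
    while start + sliceLength <= totalFrames {
        full.append(start..<(start + sliceLength))
        start += sliceLength
    }
    var leftover: Range<Int>? = nil
    if start < totalFrames {
        leftover = start..<totalFrames
    }
    return (full, leftover)
}

// A 1500-frame capture buffer when the encoder wants 1024 frames per packet.
let slices = sliceRanges(totalFrames: 1500, sliceLength: 1024)
// slices.full holds one range (frames 0 to 1023); slices.leftover covers the
// remaining 476 frames, which would be combined with the next capture buffer.
```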

Priming and trim duration

When doing AAC encoding, you must set the trim duration like so:

CMSetAttachment(buffer,
                kCMSampleBufferAttachmentKey_TrimDurationAtStart,
                CMTimeCopyAsDictionary(primingDuration, kCFAllocatorDefault),
                kCMAttachmentMode_ShouldNotPropagate)

One thing I did wrong was that I added the trim duration at encode time. This should be handled by your writer so that it can guarantee the information gets added to your leading audio frames.

Also, the value of kCMSampleBufferAttachmentKey_TrimDurationAtStart should never be greater than the duration of the sample buffer. An example of priming:

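Priming frames: 2112
Sample rate: 44100
Priming duration: 2112 / 44100 = ~0.0479s
First buffer, frames: 1024, trim duration: 1024 / 44100 (1088 priming frames remain)
Second buffer, frames: 1024, trim duration: 1024 / 44100 (64 priming frames remain)
Third buffer, frames: 1024, trim duration: 64 / 44100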
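Following the rule that the trim duration may never exceed the duration of the buffer it is attached to, the priming frames have to be spread across the leading buffers. A minimal sketch of that arithmetic, where `trimFrames` is a hypothetical helper and the 2112 priming frames are the conventional value used in the question:

```swift
/// Hypothetical helper: for each buffer (given by its frame count), returns how
/// many leading frames should be trimmed so that no trim exceeds its buffer.
func trimFrames(primingFrames: Int, bufferFrameCounts: [Int]) -> [Int] {
    var remaining = primingFrames
    return bufferFrameCounts.map { (frames: Int) -> Int in
        let trim = min(remaining, frames)
        remaining -= trim
        return trim
    }
}

let trims = trimFrames(primingFrames: 2112,
                       bufferFrameCounts: [1024, 1024, 1024, 1024])
// trims == [1024, 1024, 64, 0]
```

Dividing each non-zero value by the sample rate gives the duration to attach under kCMSampleBufferAttachmentKey_TrimDurationAtStart for that buffer; buffers with a trim of 0 need no attachment.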

Creating the new `CMSampleBuffer`

AudioConverterFillComplexBuffer has an optional final parameter, outPacketDescription. You should use it. Pass it an array with room for one entry per requested output packet, and the converter fills it with packet descriptions that contain sample size information. You need this sample size information to construct the new compressed sample buffer:

let bufferList: AudioBufferList
let packetDescriptions: [AudioStreamPacketDescription]
var newBuffer: CMSampleBuffer?

// dataBuffer is nil and dataReady is false, so the encoded bytes still
// have to be attached afterwards (the question uses
// CMSampleBufferSetDataBufferFromAudioBufferList for that).
CMAudioSampleBufferCreateWithPacketDescriptions(
  kCFAllocatorDefault, // allocator
  nil, // dataBuffer
  false, // dataReady
  nil, // makeDataReadyCallback
  nil, // makeDataReadyRefCon
  formatDescription, // formatDescription
  packetDescriptions.count, // numSamples (one per packet description)
  CMSampleBufferGetPresentationTimeStamp(buffer), // sbufPTS (first PTS)
  packetDescriptions, // packetDescriptions
  &newBuffer)
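To show what the "sample size information" amounts to, here is a small framework-free sketch. `PacketInfo` is a hypothetical struct that mirrors only the mStartOffset and mDataByteSize fields of AudioStreamPacketDescription, and the packet sizes are made-up example values, since AAC packet sizes vary from packet to packet.

```swift
/// Hypothetical mirror of the offset and size fields of AudioStreamPacketDescription.
struct PacketInfo: Equatable {
    var startOffset: Int   // byte offset of the packet within the output data
    var byteSize: Int      // size of the packet in bytes
}

/// Lays out back-to-back packets of the given sizes and returns their descriptions.
func layout(packetSizes: [Int]) -> [PacketInfo] {
    var offset = 0
    var result: [PacketInfo] = []
    for size in packetSizes {
        result.append(PacketInfo(startOffset: offset, byteSize: size))
        offset += size
    }
    return result
}

// Three encoded packets of different sizes (example values only).
let packets = layout(packetSizes: [180, 212, 97])
let totalBytes = packets.reduce(0) { $0 + $1.byteSize }
```

The number of descriptions is the number of samples in the compressed sample buffer, and the per-packet sizes are what cannot be derived from the frame count alone.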
