I used to think that I did not have to worry about thread safety because of Ruby’s Global Interpreter Lock. Unfortunately the GIL is there to protect the Interpreter, not to save me from my bugs. This post is based on a bug I found on production.

The Setup

Let’s say that I need to send an SMS to everyone in a community and that I have to check for a communty’s available SMS credits before sending the message:

community.users.each do |recipient|
  sms_credits = community.sms_credits

  if sms_credits > 0
    send_sms recipient, 'Everything is fine.'

    # take off a credit
    community.sms_credits = sms_credits - 1
    community.save
  end
end

def send_sms(recipient, message)
  # does a REST call to Twilio
  # which takes a while
end

The Queue

Since send_sms is slow I can speed up the execution of the entire program with a queue. Queuing is a great way to separate the main program from the slow parts of the program.

Fig 1: A Marxist take on software queues Fig 1: A Marxist take on software queues

Each item on the queue is a job to send an SMS. Rich Uncle Pennybags (the main program) can chuck a job into a queue and immediately get back to doing what he was doing. At the same time separate worker processes can pick it up on the other side of the queue to send each SMS. Rails has a framework for this pattern called Active Jobs that can be used this way,

commnunity.users.each do |recipient|
  # this chucks the job into a queue and returns immediately!
  # it no longer takes a while to finish.
  SmsSenderJob.perform_later recipient, community, 'Everything is fine.'
end

# Here's where you define the job that gets run by the workers
class SmsSenderJob < ApplicationJob
  def perform(recipient, community, message)
    sms_credits = community.sms_credits

    if sms_credits > 0
      send_sms recipient, message # this still takes a while.

      # take off a credit
      community.sms_credits = sms_credits - 1
      community.save
    end
  end
end

The Bug

It was really easy to move all that code into the job but doing so caused a bug. To understand why, note that Active Job let’s you plug in different queue adapters and I used Sidekiq. Sidekiq is fast, memory efficient and fantastic. However Sidekiq jobs have to be thread safe which they warn you about all, the, time. The way the job was written above was a classic case of the check-then-set race condition which meant that it wasn’t thread safe.

Looking again at SmsSenderJob,

class SmsSenderJob < ApplicationJob
  def perform(recipient, community, message)
    sms_credits = community.sms_credits

    if sms_credits > 0
      send_sms recipient, message # this still takes a while.

      # take off a credit
      community.sms_credits = sms_credits - 1
      community.save
    end
  end
end

Suppose the community started with 100 credits, had plenty of users and ran with more than one worker. The code might deduct multiple credits. Or it might deduct 1. There is no way of determining how many credits it was going to end up with because Ruby interleaves the code when multiple threads are used. My mistake was assuming that every line in a job would run in a block. This is how I thought it would run,

# Job 1
sms_credits = community.sms_credits # start with 100 credits 
if sms_credits > 0
  send_sms recipient, message

  # take off a credit
  community.sms_credits = sms_credits - 1
  community.save # 99 credits
end
# Switch! -------------------------------------------------------------------
                      # Job 2
                      sms_credits = community.sms_credits # 99 credits
                      if sms_credits > 0
                        send_sms recipient, message

                        # take off a credit
                        community.sms_credits = sms_credits - 1
                        community.save # 98 credits
                      end

when it was just as likely to run this way,

# Job 1
sms_credits = community.sms_credits # start with 100 credits
# Switch! -------------------------------------------------------------------
                      # Job 2
                      sms_credits = community.sms_credits # still 100 credits!!
                      if sms_credits > 0
                        send_sms recipient, message

                        # take off a credit
                        community.sms_credits = sms_credits - 1
                        community.save # 99 credits
                      end
# Switch! -------------------------------------------------------------------
# Back to Job 1
if sms_credits > 0
  send_sms recipient, message

  # take off a credit
  community.sms_credits = sms_credits - 1
  community.save # also 99 credits!
end

The Easy Solution

The simple fix is to remove all shared mutable state from the multithreaded jobs. If there is nothing to check-and-set in a job then there is no way to have a check-and-set race condition. Like so,

commnunity.users.each do |recipient|
  sms_credits = community.sms_credits

  if sms_credits > 0
    # this chucks the job into a queue and returns immediately!
    SmsSenderJob.perform_later recipient, 'Everything is fine.'

    # take off a credit
    community.sms_credits = sms_credits - 1
    community.save
  end
end

class SmsSenderJob < ApplicationJob
  def perform(recipient, message)
    send_sms(recipient, message)
  end
end

The Hard Solution

But that only works if the main program is not running more than once at the same time. If I run the code on multiple processes, e.g. in multiple Unicorn instances on a web app, there might still be issues. Check-then-set race conditions don’t just happen with threads, they happen whenever there is shared mutable state, i.e. community.sms_credits. There’s less of a chance that it will happen at this level but it’s still worth fixing. The correct solution will probably involve some form of locking or journaling. Which would be fun to write about once it’s fixed.

Conclusion

Coding for thread safety is hard. I hope this post helps you reason about the most common mistake when trying to write multithreaded programs.

I would love you thank Moritz Neeb and Kamal Marhubi for reviewing earlier drafts of this.