
Playing with Python Pickle #1


In our recent memcached investigations (a blog post is still in the wings) we came across numerous caches storing serialized data. The caches were not homogeneous, so the data was quite varied: Java objects, ActiveRecord objects from RoR, JSON, pre-rendered HTML, .NET serialized objects and serialized Python objects. Serialized objects can be useful to an attacker in two ways. First, they can expose data where naive developers use the objects to hold secrets and rely on the user to proxy the objects to various parts of an application. Second, altering a serialized object can affect the deserialization process, leading to compromise of the system on which the deserialization takes place.

In all the caches we examined, the most common data format found (apart from HTML snippets) was serialized Python, and this prompted a brief investigation into the possible attacks against serialized Python objects. We’ve put together a couple of posts explaining how one might go about exploiting Pickle strings. The obvious vector is memcached; however, any time Pickle strings pass through the hands of an untrusted party, the attacks described here become useful.

Background

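Python implements a default serialization technique called Pickle. Now, I don’t pretend to be a Pickle expert; Python is not my scripting language of choice for starters, and serialization is not a particularly interesting subject to me. However, the pickle documentation carries a warning that the module is not secure against maliciously constructed data and that data from untrusted sources should never be unpickled, and seeing that in any docs is cause for further digging.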

A little further down the same page, we find a trivial example of how to execute code from a Pickle stream and a quick Google leads to a blog post in which Pickle insecurities are fleshed out in more detail. Both are worthwhile reads.

From these sources emerge the following factoids:

  • Pickle streams (or strings) are not simply data formats, they can reconstruct arbitrary Python objects.
  • Objects are described as a sequence of instructions and data stored in a stream of mostly 7-bit chars (newer versions of the Pickle protocol support 8-bit opcodes too).
  • The stream is deserialized by a simple virtual machine. Features of the machine are that it is stack-based, includes memo storage (these are registers accessible in any scope), and can call Python callables. There are no branching or looping instructions.
  • Once the virtual machine reaches the terminating instruction, the deserialized object returned to the caller is whatever object is on top of the stack. Errors are produced if the stack is empty at that point, or if the instruction sequence is malformed or ends before a terminating instruction is reached. A well-formed stream leaves exactly one object on the stack; pickle.loads does not enforce this and silently ignores leftover items, though the pickletools.dis disassembler will complain about them.
  • Since Python 2.3, any semblance of protection in the Pickle code has been removed. Python developers have explicitly stated that the effort required to implement proper security in Pickle exceeds the usefulness of such an exercise and to underline this point they have removed all security controls that were present.
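To make the "sequence of instructions" point concrete, here is a minimal sketch that pickles a small list with the text-based protocol 0 and lists the opcodes in the resulting stream. It assumes a Python 3 interpreter and uses the standard library's pickletools.genops, which walks a stream without executing it; the variable names are mine.

```python
import pickle
import pickletools

# Protocol 0 is the original text-based format discussed in this post.
data = pickle.dumps(['Hello', 'World'], protocol=0)

# genops() walks the stream without executing it, yielding
# (opcode, argument, byte offset) for each instruction.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]
print(opcode_names)

# The unpickling VM runs those same instructions to rebuild the list.
restored = pickle.loads(data)
print(restored)
```

The exact opcodes emitted can differ between Python versions, so treat the printed list as illustrative rather than fixed.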

This last point was particularly intriguing; developers purposely removed any semblance of security from the depickling mechanism and exhort users to never deserialize untrusted data. However, the memcached work showed that if one could find memcached instances, it was trivial to overwrite data within the cache. If the data inside a cache consists of Pickle strings, then by overwriting them an attacker is able to inject untrusted Pickle objects into a deserialization operation.

We’ve had a bit of fun with this, seeing how far it can be pushed, and over the coming days I’ll post three more entries on this topic. In the meantime, here’s some background and a few simple examples to get things going.

Following along

In order to understand the Pickle objects below, you’ll need to follow a few basic opcodes and their arguments:

  • c<module>\n<function>\n -> push <module>.<function> onto the stack. There are subtleties here, but for the most part it works.
  • ( -> push a MARK object onto the stack.
  • S'<string>'\n -> Push <string> object onto the stack.
  • V<string>\n -> Push Unicode <string> object onto the stack. Unlike S, the argument is not wrapped in quotes.
  • l -> pop everything off the stack up to the topmost MARK object, create a list with the objects (excl MARK) and push the list back onto the stack
  • t -> pop everything off the stack up to the topmost MARK object, create a tuple with the objects (excl MARK) and push the tuple back onto the stack
  • R -> pop two objects off the stack; the top object is treated as the argument tuple and the lower object is a callable (function object). Apply the function to the arguments and push the result back onto the stack
  • p<index>\n -> Peek at the top stack object and store it in memo <index>.
  • g<index>\n -> Grab an object from memo <index> and push onto the stack.
  • 0 -> Pop and discard the topmost stack item.
  • . -> Terminate the virtual machine and return the topmost stack item
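As a rough illustration of how these opcodes combine, the sketch below hand-assembles two streams. The first exercises the memo opcodes (p, g and 0); the second uses c, (, t and R to call a function. I've picked os.getcwd as a deliberately harmless callable with no arguments; it is my choice for the demonstration, not one of the original examples. The streams are written as bytes so that the snippet runs on Python 3.

```python
import os
import pickle

# Memo round trip: push a string, store it in memo 0, discard it from the
# stack, then fetch it twice inside a MARK frame and build a tuple.
memo_stream = b"S'Hello'\np0\n0(g0\ng0\nt."
memo_result = pickle.loads(memo_stream)
print(memo_result)

# Calling a callable: push os.getcwd, build an empty argument tuple
# with MARK + t, then apply the callable to it with R.
call_stream = b"cos\ngetcwd\n(tR."
call_result = pickle.loads(call_stream)
print(call_result)
```

Note that the second stream really does call into the os module during unpickling, which is the whole point of the posts that follow.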

With these simple instructions it’s possible to execute arbitrary Python code, call OS commands and delve into the currently running Python process, as we’ll show in the next couple of posts. I should also mention that the virtual machine supports a bunch of other instructions, and these are well documented in pickletools.py. For the sake of keeping things simple, I’ve only mentioned instructions that we’ll actually touch.

Getting started

Testing out Pickle objects is pretty simple:

import pickle
data = b"""S'Hello world'
."""
pickle.loads(data)

(All the Pickle strings we’ll play with can be substituted in for “data”. The b prefix keeps the snippet working on Python 3, where pickle.loads expects bytes. Note that Pickle is sensitive to spacing and newlines, so don’t introduce extras.)

The pickled data “S'Hello world'” simply instructs the VM to push a “Hello world” string object onto the stack. The final “.” pops the stack and returns whatever is present.

An important instruction is the MARK opcode “(”, which is used to signify frames on the stack. It is normally used in conjunction with opcodes that have to pop multiple objects off the stack, for example opcodes that build lists, tuples or dicts. The two examples below show how a list and a tuple are produced:

(S'Hello'
S'World'
l.

produces

['Hello', 'World']

and

(S'Hello'
S'World'
t.

produces

('Hello', 'World')
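If you'd rather check these than take my word for it, the following sketch runs both streams through pickle.loads. It also tries the dict-building opcode “d”, which isn't in the opcode list above; as far as I can tell it pairs up the items above the MARK as key, value. The streams are written as bytes so the snippet runs on Python 3.

```python
import pickle

# Same two frames as above: MARK, two strings, then a builder opcode.
list_result = pickle.loads(b"(S'Hello'\nS'World'\nl.")
tuple_result = pickle.loads(b"(S'Hello'\nS'World'\nt.")

# 'd' consumes the items above the MARK as alternating key, value.
dict_result = pickle.loads(b"(S'Hello'\nS'World'\nd.")

print(list_result)
print(tuple_result)
print(dict_result)
```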

Final example

The canonical example of why unpickling untrusted data is bad, given in a number of places including the official Python docs, is:

cos
system
(S'echo hello world'
tR.

The intent is clear; however, the interesting bit is twofold: decoding the instructions used, and realizing that for an attacker, “hello world” isn’t all that useful. In the next post I’ll introduce the basics behind calling functions and see whether we can extend the canonical example into something a little more evil.
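For the decoding half, the stream can be inspected without running it. This is a small sketch using pickletools.genops from the standard library, which only parses the opcodes and never calls os.system; the variable names are mine and a Python 3 interpreter is assumed.

```python
import pickletools

# The canonical example as bytes. It is parsed here, never unpickled.
canonical = b"cos\nsystem\n(S'echo hello world'\ntR."

# Collect (opcode name, argument) pairs in stream order.
decoded = [(opcode.name, arg) for opcode, arg, pos in pickletools.genops(canonical)]
for name, arg in decoded:
    print(name, arg)
```

Reading the names top to bottom maps directly onto the opcode list from earlier: push os.system, open a frame, push the command string, close the frame into a tuple, apply, stop.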