
pysimdjson

Quick-and-dirty Python bindings for simdjson, written to see whether this approach yields parse-time improvements in real-world applications. So far the results are promising, especially when only part of a document is of interest.

Bindings are currently tested on OS X, Linux, and Windows.

See the latest documentation at http://pysimdjson.tkte.ch.

Installation

There are binary wheels available for py3.6/py3.7 on OS X 10.12 & Windows. On other platforms you’ll need a C++17-capable compiler.

pip install pysimdjson

If you’re getting errors when installing from pip, there’s probably no binary package available for your combination of platform and Python version. As long as you have a C++17 compiler installed you can still use pip; you just need to provide a few extra compiler flags. The most common are:

  • gcc/clang: CFLAGS="-march=native -std=c++17" pip install pysimdjson

  • msvc (Visual Studio 2017):

    SET CL="/std:c++17 /arch:AVX2"
    pip install pysimdjson
    

or from git:

git clone https://github.com/TkTech/pysimdjson.git
cd pysimdjson
python setup.py install

Example

import simdjson

with open('sample.json', 'rb') as fin:
    doc = simdjson.loads(fin.read())

However, this doesn’t gain you much over, say, ujson. You’re still loading the entire document and converting all of it into Python objects, which is very expensive. You can instead use items() to pull only part of a document into Python.

Example document:

{
    "type": "search_results",
    "count": 2,
    "results": [
        {"username": "bob"},
        {"username": "tod"}
    ],
    "error": {
        "message": "All good captain"
    }
}

And now let’s try some queries…

import simdjson

with open('sample.json', 'rb') as fin:
    # Calling ParsedJson with a document is a shortcut for
    # calling pj.allocate_capacity(<size>) and pj.parse(<doc>). If you're
    # parsing many JSON documents of similar sizes, you can allocate
    # a large buffer just once and keep re-using it instead.
    pj = simdjson.ParsedJson(fin.read())

    pj.items('.type') #> "search_results"
    pj.items('.count') #> 2
    pj.items('.results[].username') #> ["bob", "tod"]
    pj.items('.error.message') #> "All good captain"
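
To make the query syntax concrete, here is a pure-Python model of what the queries above select. It is only a sketch: query_items is a hypothetical helper, not part of pysimdjson, and it covers only the forms used in this README (dotted keys and a trailing [] on a key). It also works on Python objects that have already been built, so it has none of the performance benefit of items().

```python
import json

def query_items(document, query):
    """Evaluate a small subset of the query syntax on a Python object.

    Hypothetical helper for illustration only; not part of pysimdjson.
    """
    current = [document]
    fanned_out = False
    for part in query.strip('.').split('.'):
        is_array = part.endswith('[]')
        key = part[:-2] if is_array else part
        next_values = []
        for value in current:
            value = value[key]
            if is_array:
                # "key[]" selects every element of the array under key.
                next_values.extend(value)
            else:
                next_values.append(value)
        current = next_values
        fanned_out = fanned_out or is_array
    # A query that never used [] yields a single value, not a list.
    return current if fanned_out else current[0]

SAMPLE = json.loads('''
{
    "type": "search_results",
    "count": 2,
    "results": [{"username": "bob"}, {"username": "tod"}],
    "error": {"message": "All good captain"}
}
''')

usernames = query_items(SAMPLE, '.results[].username')
```

Here usernames holds the same list that the '.results[].username' query is shown returning above.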

AVX2

simdjson requires AVX2 support to function. Check to see if your OS/processor supports it:

  • OS X: sysctl -a | grep machdep.cpu.leaf7_features
  • Linux: grep avx2 /proc/cpuinfo
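
The same checks can be scripted from Python, for example to fail early in a setup script. This is a minimal sketch using only the standard library; cpuinfo_has_avx2 and has_avx2 are hypothetical names, and the function returns None on platforms it does not know how to inspect.

```python
import platform
import subprocess

def cpuinfo_has_avx2(cpuinfo_text):
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(':')
        if name.strip() == 'flags' and 'avx2' in value.split():
            return True
    return False

def has_avx2():
    """Best-effort AVX2 check: True, False, or None if unknown."""
    system = platform.system()
    if system == 'Linux':
        with open('/proc/cpuinfo') as fin:
            return cpuinfo_has_avx2(fin.read())
    if system == 'Darwin':
        # Same sysctl key as the OS X command above.
        result = subprocess.run(
            ['sysctl', '-n', 'machdep.cpu.leaf7_features'],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
        return 'AVX2' in result.stdout.split()
    return None
```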

Low-level interface

You can use the low-level simdjson Iterator interface directly, but be aware that this interface can change at any time. If you depend on it, pin to a specific version of simdjson. You may need this interface if you’re dealing with odd JSON, such as a document with repeated, non-unique keys.

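import simdjson

with open('sample.json', 'rb') as fin:
    pj = simdjson.ParsedJson(fin.read())
    # Named "it" to avoid shadowing the built-in iter().
    it = simdjson.Iterator(pj)
    if it.is_object():
        # down() enters the object; the first element is its first key.
        if it.down():
            print(it.get_string())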
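
For the repeated-key case, a walk over one object might look like the sketch below. It rests on assumptions not stated in this document: that inside an object the iterator visits keys and values alternately as siblings, and that to_obj() leaves the iterator where it was. values_for_key is a hypothetical helper, and it accepts any object exposing the Iterator methods documented in the API reference.

```python
def values_for_key(it, key):
    """Collect every value stored under `key` in the object `it` points at.

    `key` is a byte string, since get_string() returns bytes.
    Assumes keys and values alternate as siblings within an object.
    """
    values = []
    if not it.is_object() or not it.down():
        # Not an object, or an empty one.
        return values
    while True:
        name = it.get_string()  # current element is a key
        it.next()               # step to the value for that key
        if name == key:
            values.append(it.to_obj())
        if not it.next():       # step to the next key, if any
            break
    it.up()  # leave the iterator back on the enclosing object
    return values
```

With pysimdjson installed, this would be called as values_for_key(simdjson.Iterator(pj), b'username') on a parsed document pj.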

Early Benchmark

Comparing loads() from the built-in json module on py3.7 with simdjson.loads().

File                              json time            pysimdjson time
jsonexamples/apache_builds.json   0.09916733999999999  0.074089268
jsonexamples/canada.json          5.305393378          1.6547515810000002
jsonexamples/citm_catalog.json    1.3718639709999998   1.0438697340000003
jsonexamples/github_events.json   0.04840242700000097  0.034239397999998644
jsonexamples/gsoc-2018.json       1.5382746889999996   0.9597240750000005
jsonexamples/instruments.json     0.24350973299999978  0.13639699600000021
jsonexamples/marine_ik.json       4.505123285000002    2.8965093270000004
jsonexamples/mesh.json            1.0325923849999974   0.38916503499999777
jsonexamples/mesh.pretty.json     1.7129034710000006   0.46509220500000126
jsonexamples/numbers.json         0.16577519699999854  0.04843887400000213
jsonexamples/random.json          0.6930746310000018   0.6175370539999996
jsonexamples/twitter.json         0.6069602610000011   0.41049074900000093
jsonexamples/twitterescaped.json  0.7587005720000022   0.41576198399999953
jsonexamples/update-center.json   0.5577604210000011   0.4961777420000004

Getting subsets of the document is significantly faster. For canada.json, getting .type using the naive approach and the items() approach, averaged over N=100:

Python                                             Time
json.loads(canada_json)['type']                    5.76244878
simdjson.loads(canada_json)['type']                1.5984486990000004
simdjson.ParsedJson(canada_json).items('.type')    0.3949587819999998

This approach avoids creating Python objects for fields that aren’t of interest. When you only care about a small part of the document, it will always be faster.
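
A comparison like this can be approximated with the standard timeit module. The sketch below is not the script that produced the numbers above: average_seconds is a hypothetical helper, the document is a small synthetic stand-in rather than canada.json, and only the built-in json module is timed so that it runs without pysimdjson installed.

```python
import json
import timeit

def average_seconds(func, runs=100):
    """Average wall-clock seconds per call of func over `runs` calls."""
    return timeit.timeit(func, number=runs) / runs

# Synthetic stand-in document; substitute the bytes of a real file.
payload = json.dumps({
    "type": "example",
    "points": [[i, i * 0.5] for i in range(1000)],
}).encode("utf-8")

candidates = {
    "json.loads(...)['type']": lambda: json.loads(payload)["type"],
}
# With pysimdjson installed, further candidates could be added, e.g.:
#   "items('.type')": lambda: simdjson.ParsedJson(payload).items(".type")

timings = {name: average_seconds(func) for name, func in candidates.items()}
for name, seconds in timings.items():
    print("{:<30} {:.6f}".format(name, seconds))
```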

API Reference

simdjson.loads(s)

Deserialize and return the entire JSON object in s (bytes).

Note

Unlike the built-in Python json.loads, this method only accepts byte strings, because simdjson only works on UTF-8 encoded input.
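
If your JSON arrives as a Python str, it has to be encoded before it is handed to loads(). A minimal sketch, where as_json_bytes is a hypothetical helper name:

```python
def as_json_bytes(document):
    """Return the document as UTF-8 bytes, encoding it if it is a str."""
    if isinstance(document, str):
        return document.encode('utf-8')
    return bytes(document)

encoded = as_json_bytes('{"name": "caf\u00e9"}')
```

The result can then be passed on as simdjson.loads(as_json_bytes(text)).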

ParsedJson

class simdjson.ParsedJson(source=None)

Low-level wrapper for simdjson.

Providing a source document is a shortcut for calling allocate_capacity() and parse().

allocate_capacity(self, size, max_depth=DEFAULT_MAX_DEPTH)

Resize the document buffer to size bytes.

is_valid(self) → bool

True if the internal state of the parsed json is valid.

items(self, query)

Given a query string, find matching elements in the document and return them.

If you only need part of a document, this method offers significant opportunities for performance gains, as it avoids creating Python objects for anything other than the matching objects. If you have a situation where you check a boolean, such as:

{"results": ["...50MB..."], "success": true}

… you could check just the success field before wasting time loading the entire document into Python objects.

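from simdjson import ParsedJson

with open("myjson.json", "rb") as source:
    # ParsedJson expects the document as bytes, not a file object.
    pj = ParsedJson(source.read())
    if pj.items(".success"):
        document = pj.to_obj()

parse(self, source)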

Parse the given document (as bytes).

Note

It’s up to the caller to ensure that allocate_capacity has been called with a sufficiently large size before this method is called.
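
One way to satisfy this when handling a batch of documents is to size the buffer once for the largest document and then reuse the same parser. The sketch below assumes that calling parse() repeatedly on one ParsedJson is supported, as the buffer-reuse comment in the README example suggests, and that is_valid() reports parse failures. query_documents is a hypothetical helper that takes the parser as an argument.

```python
def query_documents(parser, documents, query):
    """Run one query against many byte-string documents with one parser."""
    documents = list(documents)
    if not documents:
        return []
    # Allocate once, large enough for the biggest document in the batch.
    parser.allocate_capacity(max(len(doc) for doc in documents))
    results = []
    for doc in documents:
        parser.parse(doc)
        if not parser.is_valid():
            raise ValueError('document could not be parsed')
        results.append(parser.items(query))
    return results
```

With pysimdjson installed, this would be called as query_documents(simdjson.ParsedJson(), docs, '.type').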

to_obj(self)

Recursively convert a parsed json document to a Python object and return it.

Iterator

class simdjson.Iterator

A wrapper around the internal simdjson ParsedJson::iterator object.

Typically, it’s only useful to use this object if you have very specific needs, such as handling JSON with duplicate keys.

Note

This is a _very_ thin wrapper around the underlying simdjson structures. This means that it is possible for this interface to change between versions. If you depend on this, you should pin the version of simdjson you are using until you can confirm that the update works (which is just good practice in general!).

High-level interfaces like loads() and items() are reliable and will always be available.

down(self) → bool

Enter the current scope and move down a level in the document.

get_depth(self) → size_t

The current depth of the iterator in the tree.

get_double(self) → double

Return the current element as a double. This is only valid if is_double() is True.

get_integer(self) → int64_t

Return the current element as an integer. This is only valid if is_integer() is True.

get_scope_type(self) → size_t

Like get_type(), except it returns the type of the containing scope. For example, given a state like this:

{
    "hello": "world"
}

… and the iterator is currently on “world”, this method would return {, as it is contained within an object/dict.

get_string(self) → bytes

Return the current element as a byte string. This is only valid if is_string() is True.

Note

Internally, all strings are UTF-8 encoded. To use this byte string in Python as unicode, call get_string().decode('utf-8').

get_tape_length(self) → size_t

The total length of the underlying tape structure.

The length of the tape is _not_ the same as the number of elements in the document. Some elements consume more than a single entry on the tape.

get_tape_location(self) → size_t

The iterator’s current location within the underlying tape structure.

get_type(self) → uint8_t

The type of the current element the iterator is pointing to. This can be one of the characters " { } [ ] t f n r.
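
Since the return type is uint8_t, the value presumably arrives in Python as a small integer rather than a character. The sketch below turns such a code into a readable name. The names are an interpretation of simdjson's tape markers and are not taken from this document, which lists only the characters; TYPE_NAMES and describe_type are hypothetical.

```python
# Readable names for the type codes listed above (interpretation, see lead-in).
TYPE_NAMES = {
    '"': 'string',
    '{': 'object start',
    '}': 'object end',
    '[': 'array start',
    ']': 'array end',
    't': 'true',
    'f': 'false',
    'n': 'null',
    'r': 'root',
}

def describe_type(code):
    """Map a type code (int, bytes, or str) to a readable name."""
    if isinstance(code, int):
        code = chr(code)
    elif isinstance(code, bytes):
        code = code.decode('ascii')
    return TYPE_NAMES.get(code, 'unknown ({!r})'.format(code))
```

With a real iterator this would be used as describe_type(it.get_type()).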

isOk(self) → bool

True if the internal state of the iterator is valid.

is_array(self) → bool

True if the current element is an array.

is_double(self) → bool

True if the current element is a double.

is_integer(self) → bool

True if the current element is an integer.

is_object(self) → bool

True if the current element is an object/dict.

is_object_or_array(self) → bool

True if the current element is an object/dict or an array (elements for which get_type() returns either { or [).

is_string(self) → bool

True if the current element is a string.

move_forward(self, const char *key) → bool

Move forward along the tape in document order. This will enter and exit scopes automatically, so it can be used to traverse an entire document.

move_to_key(self, const char *key) → bool

Move to the given key within the current scope. Returns False if the key was not found.

next(self) → bool

Move to the next element in the document. This will return False if the end of the current scope has been reached.

prev(self) → bool

Move to the previous element in the document. This will return False if already at the start of the current scope.

to_obj(self)

Convert the current iterator and all of its children into Python objects and return them.

to_start_scope(self) → void

Move to the start of the current scope.

up(self) → bool

Exit the current scope and move up a level in the document.