Welcome to pandas-msgpack’s documentation!

The pandas_msgpack module provides an interface between pandas (https://pandas.pydata.org) and the msgpack library. msgpack is a lightweight, portable binary format, similar to binary JSON, that is highly space efficient and performs well for both writing (serialization) and reading (deserialization).

Contents:

Installation

You can install pandas-msgpack with conda, pip, or by installing from source.

Conda

# not enabled YET
$ conda install pandas-msgpack --channel conda-forge

This installs pandas-msgpack and all common dependencies, including pandas.

Pip

To install the latest version of pandas-msgpack:

$ pip install pandas-msgpack -U

This installs pandas-msgpack and all common dependencies, including pandas.

Install from Source

$ pip install git+https://github.com/pydata/pandas-msgpack.git
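
Whichever method is used, it can be handy to confirm from Python that the package is actually visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and is not part of pandas-msgpack.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when pandas-msgpack is installed, otherwise None.
print(installed_version("pandas-msgpack"))
```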

Dependencies

The blosc <https://pypi.python.org/pypi/blosc> library can be optionally installed as a compressor.
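
Because blosc is optional, code that wants to use it when available may need a fallback. A minimal sketch, assuming the fallback should be zlib (which ships with Python); the helper name pick_compressor is hypothetical and not part of pandas-msgpack.

```python
import importlib.util


def pick_compressor():
    """Prefer blosc when the optional package is importable, else use zlib."""
    if importlib.util.find_spec("blosc") is not None:
        return "blosc"
    return "zlib"


compress = pick_compressor()
print(compress)
```

The returned name could then be passed as the compress argument described in the Compression section.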

Tutorial

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: from pandas_msgpack import to_msgpack, read_msgpack

In [3]: df = pd.DataFrame(np.random.rand(5,2), columns=list('AB'))

In [4]: to_msgpack('foo.msg', df)

In [5]: read_msgpack('foo.msg')
Out[5]: 
          A         B
0  0.943178  0.429336
1  0.875800  0.785710
2  0.665361  0.438875
3  0.718998  0.387223
4  0.937270  0.907659

In [6]: s = pd.Series(np.random.rand(5),index=pd.date_range('20130101',periods=5))

You can pass several objects as positional arguments; they are returned as a list on deserialization.

In [7]: to_msgpack('foo.msg', df, 'foo', np.array([1,2,3]), s)

In [8]: read_msgpack('foo.msg')
Out[8]: 
[          A         B
 0  0.943178  0.429336
 1  0.875800  0.785710
 2  0.665361  0.438875
 3  0.718998  0.387223
 4  0.937270  0.907659, 'foo', array([1, 2, 3]), 2013-01-01    0.022462
 2013-01-02    0.025367
 2013-01-03    0.881180
 2013-01-04    0.480632
 2013-01-05    0.326956
 Freq: D, dtype: float64]
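
Note that the two reads above return different shapes: a bare DataFrame when one object was stored, and a list when several were. Code that has to handle both cases may want to normalize the result. The sketch below assumes exactly that behaviour, as suggested by the outputs in this tutorial; the helper name as_list is hypothetical and not part of pandas-msgpack.

```python
def as_list(result):
    """Wrap a single unpacked object in a list; pass a list through as-is."""
    # Assumption: multiple stored objects come back as a list, and a single
    # stored object comes back bare, as in the tutorial outputs.
    if isinstance(result, list):
        return result
    return [result]


print(as_list("foo"))
print(as_list(["foo", 1]))
```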

You can pass iterator=True to iterate over the unpacked results.

In [9]: for o in read_msgpack('foo.msg',iterator=True):
   ...:     print(o)
   ...: 
          A         B
0  0.943178  0.429336
1  0.875800  0.785710
2  0.665361  0.438875
3  0.718998  0.387223
4  0.937270  0.907659
foo
[1 2 3]
2013-01-01    0.022462
2013-01-02    0.025367
2013-01-03    0.881180
2013-01-04    0.480632
2013-01-05    0.326956
Freq: D, dtype: float64

You can pass append=True to the writer to append to an existing pack.

In [10]: to_msgpack('foo.msg', df, append=True)

In [11]: read_msgpack('foo.msg')
Out[11]: 
[          A         B
 0  0.943178  0.429336
 1  0.875800  0.785710
 2  0.665361  0.438875
 3  0.718998  0.387223
 4  0.937270  0.907659, 'foo', array([1, 2, 3]), 2013-01-01    0.022462
 2013-01-02    0.025367
 2013-01-03    0.881180
 2013-01-04    0.480632
 2013-01-05    0.326956
 Freq: D, dtype: float64,           A         B
 0  0.943178  0.429336
 1  0.875800  0.785710
 2  0.665361  0.438875
 3  0.718998  0.387223
 4  0.937270  0.907659]

Furthermore, you can pass in arbitrary Python objects.

In [12]: to_msgpack('foo2.msg', { 'dict' : [ { 'df' : df }, { 'string' : 'foo' }, { 'scalar' : 1. }, { 's' : s } ] })

In [13]: read_msgpack('foo2.msg')
Out[13]: 
{'dict': ({'df':           A         B
   0  0.943178  0.429336
   1  0.875800  0.785710
   2  0.665361  0.438875
   3  0.718998  0.387223
   4  0.937270  0.907659},
  {'string': 'foo'},
  {'scalar': 1.0},
  {'s': 2013-01-01    0.022462
   2013-01-02    0.025367
   2013-01-03    0.881180
   2013-01-04    0.480632
   2013-01-05    0.326956
   Freq: D, dtype: float64})}
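
In the output above, the list stored under 'dict' comes back as a tuple. If the calling code needs lists, a small recursive helper can convert the structure after reading. This is a sketch only; tuples_to_lists is a hypothetical name, not part of pandas-msgpack, and it leaves other objects (such as DataFrames and Series) untouched.

```python
def tuples_to_lists(obj):
    """Recursively replace tuples with lists inside dicts, lists and tuples."""
    if isinstance(obj, dict):
        return {key: tuples_to_lists(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [tuples_to_lists(item) for item in obj]
    # Anything else is returned unchanged.
    return obj


restored = tuples_to_lists({"dict": ({"string": "foo"}, {"scalar": 1.0})})
print(restored)  # {'dict': [{'string': 'foo'}, {'scalar': 1.0}]}
```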

Compression

Passing the optional compress argument compresses the resulting bytes. The available compressors are zlib and blosc.

Compression generally increases the writing time, although in the single-run timings below the blosc write is faster than the uncompressed one.

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: from pandas_msgpack import to_msgpack, read_msgpack

In [3]: df = pd.DataFrame({'A': np.arange(100000),
   ...:                    'B': np.random.randn(100000),
   ...:                    'C': 'foo'})
   ...: 
In [4]: %timeit -n 1 -r 1 to_msgpack('uncompressed.msg', df)
1 loop, best of 1: 51 ms per loop
In [5]: %timeit -n 1 -r 1 to_msgpack('compressed_blosc.msg', df, compress='blosc')
1 loop, best of 1: 28 ms per loop
In [6]: %timeit -n 1 -r 1 to_msgpack('compressed_zlib.msg', df, compress='zlib')
1 loop, best of 1: 135 ms per loop

Compressed data is detected automatically and decompressed on reading.

In [7]: %timeit -n 1 -r 1 read_msgpack('uncompressed.msg')
1 loop, best of 1: 20.9 ms per loop
In [8]: %timeit -n 1 -r 1 read_msgpack('compressed_blosc.msg')
1 loop, best of 1: 21.8 ms per loop
In [9]: %timeit -n 1 -r 1 read_msgpack('compressed_zlib.msg')
1 loop, best of 1: 26.9 ms per loop
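
One common way for a reader to infer compression without being told is for the writer to record the compressor name next to the payload. The sketch below illustrates that general idea with the standard-library zlib module. The record layout and the names pack and unpack are hypothetical; this is not the pandas-msgpack storage format or implementation.

```python
import zlib


def pack(data, compress=None):
    """Return a record holding the payload and the compressor that was used."""
    if compress == "zlib":
        data = zlib.compress(data)
    elif compress is not None:
        raise ValueError("unsupported compressor: %r" % (compress,))
    return {"compress": compress, "data": data}


def unpack(record):
    """Decompress the payload if the record says it was compressed."""
    if record["compress"] == "zlib":
        return zlib.decompress(record["data"])
    return record["data"]


payload = b"foo" * 1000
# The reader needs no compress argument: the record carries the information.
assert unpack(pack(payload, compress="zlib")) == payload
assert unpack(pack(payload)) == payload
```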

Compression can save storage space.

In [10]: !ls -ltr *.msg
-rw-r--r-- 1 docs docs 2000582 Mar 30 20:12 uncompressed.msg
-rw-r--r-- 1 docs docs 1187978 Mar 30 20:12 compressed_blosc.msg
-rw-r--r-- 1 docs docs 1320531 Mar 30 20:12 compressed_zlib.msg
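
To put the listing above in relative terms, the savings can be computed from the file sizes. The sizes are copied from the listing and apply only to this particular DataFrame; other data may compress quite differently.

```python
# File sizes in bytes, taken from the directory listing above.
sizes = {"uncompressed": 2000582, "blosc": 1187978, "zlib": 1320531}


def percent_saved(compressed, baseline):
    """Percentage by which the compressed file is smaller than the baseline."""
    return 100.0 * (1 - compressed / baseline)


savings = {
    name: percent_saved(sizes[name], sizes["uncompressed"])
    for name in ("blosc", "zlib")
}
for name, saved in savings.items():
    print(f"{name}: {saved:.1f}% smaller")
# blosc: 40.6% smaller
# zlib: 34.0% smaller
```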

Read/Write API

Msgpack data can also be read from and written to byte strings.

In [1]: import numpy as np
   ...: import pandas as pd

In [2]: from pandas_msgpack import to_msgpack, read_msgpack

In [3]: df = pd.DataFrame({'A': np.arange(10),
   ...:                    'B': np.random.randn(10),
   ...:                    'C': 'foo'})
   ...: 

In [4]: to_msgpack(None, df)
Out[4]: b'\x84\xa5klass\xa9DataFrame\xa3typ\xadblock_manager\xa6blocks\x93\x86\xa6values\xc7P\x00XD\x95\x00\xc1n\xe8\xbf\xf6\x10\xcf9\xac\xd7\xe2?\x93\xf4\xc9Z\x88]\xdb?\x8b\x84w%wJ\xb4?\xbd\n\xb0\xc8Tb\xd4?\xd7\xdd\xcd/\x8f>\xf5\xbf\x15|\x9fN\xb8X\xbc\xbf\xb2\x9bc\xc5,\xdd\xf3?\xdb\xff\xcf\x7f\x9a\xa5\xb6?!\x93\xe2\n^o\xe6?\xa8compress\xc0\xa5dtype\xa7float64\xa5klass\xaaFloatBlock\xa5shape\x92\x01\n\xa4locs\x86\xa5dtype\xa5int64\xa4ndim\x01\xa8compress\xc0\xa4data\xd7\x00\x01\x00\x00\x00\x00\x00\x00\x00\xa3typ\xa7ndarray\xa5shape\x91\x01\x86\xa6values\xc7P\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\xa8compress\xc0\xa5dtype\xa5int64\xa5klass\xa8IntBlock\xa5shape\x92\x01\n\xa4locs\x86\xa5dtype\xa5int64\xa4ndim\x01\xa8compress\xc0\xa4data\xd7\x00\x00\x00\x00\x00\x00\x00\x00\x00\xa3typ\xa7ndarray\xa5shape\x91\x01\x86\xa6values\x9a\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa8compress\xc0\xa5dtype\xa6object\xa5klass\xabObjectBlock\xa5shape\x92\x01\n\xa4locs\x86\xa5dtype\xa5int64\xa4ndim\x01\xa8compress\xc0\xa4data\xd7\x00\x02\x00\x00\x00\x00\x00\x00\x00\xa3typ\xa7ndarray\xa5shape\x91\x01\xa4axes\x92\x86\xa4name\xc0\xa5dtype\xa6object\xa8compress\xc0\xa4data\x93\xa1A\xa1B\xa1C\xa5klass\xa5Index\xa3typ\xa5index\x86\xa4name\xc0\xa4stop\n\xa5klass\xaaRangeIndex\xa3typ\xabrange_index\xa5start\x00\xa4step\x01'

Furthermore you can concatenate the strings to produce a list of the original objects.

In [5]: read_msgpack(to_msgpack(None, df) + to_msgpack(None, df.A))
Out[5]: 
[   A         B    C
 0  0 -0.763520  foo
 1  1  0.588827  foo
 2  2  0.427584  foo
 3  3  0.079261  foo
 4  4  0.318502  foo
 5  5 -1.327773  foo
 6  6 -0.110729  foo
 7  7  1.241498  foo
 8  8  0.088464  foo
 9  9  0.701095  foo, 0    0
 1    1
 2    2
 3    3
 4    4
 5    5
 6    6
 7    7
 8    8
 9    9
 Name: A, dtype: int64]
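
Concatenation works because each serialized object is a self-delimiting record, so a reader can consume them one after another. The sketch below illustrates the same idea using pickle from the standard library as a stand-in for msgpack; it does not use pandas-msgpack, and the helper names dumps_all and loads_all are hypothetical.

```python
import io
import pickle


def dumps_all(*objects):
    """Serialize each object separately and concatenate the byte strings."""
    return b"".join(pickle.dumps(obj) for obj in objects)


def loads_all(payload):
    """Read records back one at a time until the buffer is exhausted."""
    buffer = io.BytesIO(payload)
    objects = []
    while True:
        try:
            objects.append(pickle.load(buffer))
        except EOFError:
            return objects


combined = pickle.dumps({"A": [0, 1, 2]}) + pickle.dumps("foo")
print(loads_all(combined))  # [{'A': [0, 1, 2]}, 'foo']
```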

API Reference

read_msgpack(path_or_buf[, encoding, iterator])    Load msgpack pandas object from the specified file path
to_msgpack(path_or_buf, *args, **kwargs)           msgpack (serialize) object to input file path

pandas_msgpack.read_msgpack(path_or_buf, encoding='utf-8', iterator=False, **kwargs)

Load msgpack pandas object from the specified file path

THIS IS AN EXPERIMENTAL LIBRARY and the storage format may not be stable until a future release.

Parameters:

path_or_buf : string file path, BytesIO-like object, or string

encoding : encoding for decoding msgpack str type

iterator : boolean, if True, return an iterator to the unpacker (default is False)

Returns:

obj : type of object stored in file

pandas_msgpack.to_msgpack(path_or_buf, *args, **kwargs)

Serialize (msgpack) an object to the given file path

Parameters:

path_or_buf : string file path, buffer-like object, or None; if None, return the generated string

args : an object or objects to serialize

encoding : encoding for unicode objects

append : boolean, whether to append to an existing msgpack (default is False)

compress : type of compressor (zlib or blosc), default is None (no compression)

Changelog

0.1.3 / 2017-03-30

Initial release of code transferred from pandas.

Includes the following patches made to pandas since its 0.19.2 release:

  • Bug in read_msgpack() in which Series categoricals were being improperly processed, see pandas-GH#14901
  • Bug in read_msgpack() which did not allow loading of a dataframe with an index of type CategoricalIndex, see pandas-GH#15487
  • Bug in read_msgpack() when deserializing a CategoricalIndex, see pandas-GH#15487
