Joblib: Running Python functions as pipeline jobs (joblib.readthedocs.io)
125 points by gjvc on April 12, 2023 | 15 comments


I am not a Pythonista, but joblib is such an easy wrapper to use.

I use joblib at work (it handles edge cases that pickle does not) to package up trained ML models for deployment by a colleague. I train the models outside of a Python environment, pull them into the Python version of the ML library I use, and then serialize them to a compressed joblib file. The joblib object takes up less space than the text-based model file and deploys quickly for almost instant predictions.

To ensure that the input fields for training line up with the input fields for classification/regression, we add a hash to the joblib object, which is trivially easy. Dump the object to S3 using the hash as the key and it's easy to have a library of ML models ready to deploy very quickly.
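
A minimal sketch of what that packaging step can look like (the model, field names and hashing scheme below are made up for illustration):

    import hashlib

    import joblib

    # Hypothetical stand-ins: in the real workflow these come from the training
    # step (e.g. a fitted estimator and the ordered list of input columns).
    trained_model = {"weights": [0.1, 0.2, 0.3]}
    feature_names = ["age", "income", "tenure_days"]

    # Hash the ordered input fields so the serving side can verify that it feeds
    # the model the same columns, in the same order, as at training time.
    schema_hash = hashlib.sha256(",".join(feature_names).encode()).hexdigest()

    artifact = {
        "model": trained_model,
        "feature_names": feature_names,
        "schema_hash": schema_hash,
    }

    # Compressed dump; the hash doubles as the S3 key for the model library.
    joblib.dump(artifact, f"{schema_hash}.joblib", compress=3)

    loaded = joblib.load(f"{schema_hash}.joblib")
    assert loaded["schema_hash"] == schema_hash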


At my previous job, we used joblib + a hand-built file format to persist ML model blobs across versions.

The file format was a zipfile with:

- a joblib dump of the Python class wrapping the model and its data-processing pipeline code

- a small csv (~100 rows) of test X data, test Y data and the expected outputs of the model on that data

- a little json with the important library versions (pandas, numpy, etc.) so that the process loading the blob could check it was running the same versions as the one that produced it

Then the loader would try to load the file and verify it, and if that failed it would keep the existing model in memory and raise an alert in Slack, so engineers could fix it and upload a new model without downtime.
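
Roughly, the idea looked like this (the file names, helper functions and tolerance below are simplified and invented for illustration, not the exact format we shipped):

    import io
    import json
    import zipfile

    import joblib
    import numpy as np
    import pandas as pd

    def write_model_blob(path, model, test_x, expected_y):
        """Package the model plus a tiny regression-test set and library versions."""
        with zipfile.ZipFile(path, "w") as zf:
            buf = io.BytesIO()
            joblib.dump(model, buf)
            zf.writestr("model.joblib", buf.getvalue())
            zf.writestr("test_x.csv", test_x.to_csv(index=False))
            zf.writestr("expected_y.csv", pd.Series(expected_y).to_csv(index=False))
            zf.writestr("versions.json", json.dumps({
                "pandas": pd.__version__,
                "numpy": np.__version__,
            }))

    def load_and_verify(path):
        """Return the model only if it reproduces the bundled expected outputs."""
        with zipfile.ZipFile(path) as zf:
            model = joblib.load(io.BytesIO(zf.read("model.joblib")))
            test_x = pd.read_csv(io.BytesIO(zf.read("test_x.csv")))
            expected_y = pd.read_csv(io.BytesIO(zf.read("expected_y.csv"))).iloc[:, 0].to_numpy()
        if not np.allclose(model.predict(test_x), expected_y):
            raise ValueError("model blob failed verification against bundled test data")
        return model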

The whole thing was inspired by the talk "Alex Gaynor: Pickles are for Delis, not Software - PyCon 2014" [1]

At my current job we don't need such infrastructure because lambda/kubeflow gateways take care of it for us.

[1] https://www.youtube.com/watch?v=7KnfGDajDQw


I'm curious what edge cases you know about or have bitten you with pickle. The main ones I know about are that it's not blazingly fast (though cPickle mitigates this a bit), and, as a possibly related issue, pickle actually uses multiple incompatible serialization protocols that are subject to change whenever a new Python version comes out.

It's also insecure in the sense that you can pickle and unpickle more or less arbitrary Python objects and object structures, so there's no way to trust a pickle from a source you don't control, unless you want to start down the rabbit hole of encrypting and signing your pickles. But that's actually never come up for me in practice.

The incompatibility issue has always been the bigger deal for me. You don't want to upgrade Python and then have to deal with converting all your model weights to a new serialization format, too. An unstable data interchange format is absolutely an oxymoron, and all but rules out the idea of using pickle in production for much of anything. The best use I can think of is for RPC between clients you control and trust, but I just can't bring myself to have that level of trust, even for a service or host that I'm sure is fully under my control.
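
For context, the insecurity is easy to demonstrate: an object's __reduce__ can smuggle in any callable, which runs at load time, so a hostile pickle is effectively a program. A standard illustration (nothing joblib-specific):

    import pickle

    class Malicious:
        def __reduce__(self):
            # On unpickling, this tells pickle to call print(...);
            # a real attack would call os.system or similar instead.
            return (print, ("arbitrary code ran during pickle.load",))

    payload = pickle.dumps(Malicious())
    pickle.loads(payload)  # prints the message: code execution on load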


TBH, my knowledge of pickle edge cases is limited to my colleague's recommendation. ChatGPT seems to be able to construct a plausible list that matches several of your points, but who knows if the rest was hallucinated.


Great to hear. Scikit-learn also typically uses joblib for model persistence: https://scikit-learn.org/0.18/modules/model_persistence.html

Joblib is also used internally for models to be run in parallel, e.g. https://github.com/scikit-learn/scikit-learn/blob/1834cd6b76...

(End users typically just set n_jobs to a value other than 0 or 1 to make use of it.)
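
For example (standard scikit-learn usage; the dataset and estimator here are just illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

    # n_jobs=-1 asks joblib to fan the trees out over all available cores.
    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    clf.fit(X, y)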


Note that joblib serialization is pickle-based and therefore has the same security implications as any pickle file: treat loading a joblib or pickle file like running a compiled executable, and never do it if you do not trust the source.

A new safer alternative for scikit-learn model persistence is skops:

- https://skops.readthedocs.io/en/stable/persistence.html

It lets you declare a list of Python types that are safe to load and refuses to load skops files containing untrusted types.
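
A minimal sketch of the skops flow, based on its documented dump/load API (exact argument names may differ between versions):

    from sklearn.linear_model import LogisticRegression
    from skops import io as sio

    model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
    sio.dump(model, "model.skops")

    # Inspect which types the file wants to load before trusting them...
    unknown = sio.get_untrusted_types(file="model.skops")
    print(unknown)

    # ...then load, explicitly allowing only the types you have reviewed.
    model = sio.load("model.skops", trusted=unknown)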


Also note that with Python 3.8+ and pickle protocol 5, it's now just as efficient to do:

  import pickle

  with open("model.pkl", mode="wb") as f:
      pickle.dump(trained_model, f, protocol=pickle.HIGHEST_PROTOCOL)

  with open("model.pkl", mode="rb") as f:
      trained_model = pickle.load(f)
Standard-library pickle with protocol 5 can store and load the large data buffers often found as attributes of scikit-learn models (typically large numpy arrays) without extra memory copies, which is what joblib.dump and joblib.load were originally designed to do with a few hacks that violate the official pickle protocol.


For reference pickle protocol 5 was specified and implemented as part of:

- https://peps.python.org/pep-0574/

and it also provides an extra API for handling large data buffers externally ("out-of-band") via custom callbacks, in addition to the no-copy optimization when loading/storing such arrays "in-band" without custom callbacks.
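
A short sketch of the out-of-band API from PEP 574 (buffer_callback and buffers are standard-library pickle arguments; numpy supports them natively):

    import pickle

    import numpy as np

    data = np.zeros(1_000_000)

    # Serialize: large buffers are handed to the callback instead of being
    # copied into the pickle stream.
    buffers = []
    payload = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)

    # Deserialize: pass the same buffers back in, in the same order.
    restored = pickle.loads(payload, buffers=buffers)
    assert np.array_equal(data, restored)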


I've used joblib before and it does some nice things like making it easy to switch between different backends (loky, multiprocessing, threading), but recently I've started to use Python's newer built-in way:

ThreadPoolExecutor if IO-bound

ProcessPoolExecutor if CPU-bound

for example

    import shutil
    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=4) as e:
        e.submit(shutil.copy, 'src1.txt', 'dest1.txt')
        e.submit(shutil.copy, 'src2.txt', 'dest2.txt')
    print("everything done!")
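
For comparison, the joblib equivalent with an explicit backend choice (a minimal sketch; loky is joblib's default process-based backend):

    from joblib import Parallel, delayed

    def square(x):
        return x * x

    # backend can be "loky" (processes), "multiprocessing" or "threading".
    results = Parallel(n_jobs=4, backend="threading")(
        delayed(square)(i) for i in range(10)
    )
    print(results)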


My go-to for embarrassingly parallel problems that are still single-computer scale, using the multiprocessing backend. Like computing features across a bunch of inputs. Does the job in a simple and pain-free manner. I have a wrapper for progress indicators using tqdm.

PS: The synchronous backend switch is great for getting better backtraces, like in a debugger.
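
One simple way to wire in a progress bar (it tracks task dispatch rather than completion) and to drop back to sequential execution for debugging; the feature function here is just a stand-in:

    from joblib import Parallel, delayed
    from tqdm import tqdm

    def extract_features(item):
        return item * 2  # stand-in for real feature computation

    inputs = range(1_000)

    # tqdm wraps the input generator, so progress reflects tasks being dispatched.
    results = Parallel(n_jobs=4)(delayed(extract_features)(x) for x in tqdm(inputs))

    # For debugging, n_jobs=1 runs everything in-process and sequentially,
    # which gives ordinary tracebacks and works with pdb.
    results = Parallel(n_jobs=1)(delayed(extract_features)(x) for x in inputs)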


Is this similar to spotify's Luigi? https://luigi.readthedocs.io/en/stable/

I've used Luigi in a project before; I wasn't too crazy about the class-based syntax at first, but once you get used to it, it's very expressive, and since it's effectively plain Python, you can customise your pipelines in a very flexible way.
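
For readers who haven't seen it, the class-based style looks roughly like this (a made-up two-step pipeline, not from the project above):

    import luigi

    class ExtractFeatures(luigi.Task):
        input_path = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget(f"{self.input_path}.features.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("feature_1,feature_2\n")  # stand-in for real extraction

    class TrainModel(luigi.Task):
        input_path = luigi.Parameter()

        def requires(self):
            return ExtractFeatures(input_path=self.input_path)

        def output(self):
            return luigi.LocalTarget("model.txt")

        def run(self):
            with self.input().open() as f:
                _ = f.read()  # stand-in for real training
            with self.output().open("w") as f:
                f.write("model")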


Cool. Looks like a more lightweight alternative to Temporal workflows.

A tool in a similar vein is pydoit. There you need to be a bit more explicit about declaring the DAG; in return it parallelizes well and can tell you which steps are available.
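
A minimal dodo.py sketch of pydoit's explicit task/dependency declaration (the scripts and file names are made up; `doit list` shows the available steps):

    # dodo.py -- run with `doit`, or `doit list` to see the declared tasks

    def task_extract():
        return {
            "actions": ["python extract.py raw.csv features.parquet"],
            "file_dep": ["raw.csv"],
            "targets": ["features.parquet"],
        }

    def task_train():
        # file_dep on the previous target is what wires up the DAG.
        return {
            "actions": ["python train.py features.parquet model.joblib"],
            "file_dep": ["features.parquet"],
            "targets": ["model.joblib"],
        }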


Looks similar to Bazel’s Starlark language[1].

[1]: https://bazel.build/rules/language


But while Starlark tries to mimic Python (without allowing lots of useful features such as f-strings), joblib _is_ Python, so you can make use of the whole ecosystem around Python.


A small but excellent example of the advantage of using an embedded DSL over a custom one: it lets you use all of the language's features, and f-strings are oh so useful.



