I attended PyCon US 2021 on May 14-15. I especially liked the sessions that presented new tools and best practices, and I will cover them in two blog posts. This first post contains notes from the first day (May 14th), while the second covers talks from the next day (May 15th).
I focused on talks about data science topics, with a few other talks that piqued my interest. For each talk, I’ve included short notes about the talk and links to associated resources. Hopefully, you will find these notes useful!
Update: the talks are now available for free on YouTube here.
Background
PyCon US is the annual national conference for Python in the United States. It includes workshops, coding sprints, talks, an expo, and more. Pre-COVID, PyCon was an in-person conference; this year, for the second year in a row, it was held online. At $50 to $150, the registration fee was much lower than for an in-person conference. The proceeds benefit the Python Software Foundation, which funds Python core development, outreach, and other activities. If you have not already registered, you may still be able to sign up. Recordings of all the conference talks are available to registered attendees on the conference website.
Friday Talks
Analyze, Govern, and Approve Model Training Experiments with Rubicon
Speakers: Mike, Joe, Ryan Soley, and Sri
Rubicon is an ML experiment tracking system developed by a data science team at Capital One. There are many experiment tracking systems these days; see, for example, the list of similar projects we are tracking at DataHut.ai. Rubicon seems fairly straightforward and non-invasive. It includes both experiment tracking and a nice visualization dashboard of experimental results. From this talk, I also learned about fsspec, an effort to create a common filesystem API with many backends. Rubicon leverages it to provide both local and Amazon S3 storage of experiments.
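To give a flavor of what fsspec enables, here is a minimal sketch (not from the talk) of its uniform file API. The file paths and bucket name are hypothetical, and the S3 case assumes the s3fs backend is installed:

```python
import fsspec

# The same open() call works across backends; only the URL scheme changes.
# Local filesystem:
with fsspec.open("experiments/run-001.json", "w") as f:
    f.write('{"accuracy": 0.93}')

# Amazon S3 (requires the s3fs package; the bucket name is hypothetical):
with fsspec.open("s3://my-experiments-bucket/run-001.json", "w") as f:
    f.write('{"accuracy": 0.93}')
```

This is how a library like Rubicon can offer multiple storage backends without writing backend-specific I/O code.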
Testing Stochastic AI Models with Hypothesis
Speaker: Marina Shvartz
This talk was about Hypothesis, a property-based testing library. Property-based testing, popularized by Haskell’s QuickCheck library, involves defining pre-condition properties on a function’s input and post-condition properties on the function’s result. The library then automatically generates input data satisfying the pre-conditions and validates that the results obey the post-conditions. To address the random nature of the generated test data, Hypothesis maintains a database of failed test examples. These exact examples can be rerun in the future to validate that a given problem was fixed.
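As a minimal (non-ML) illustration of the pattern, here is a sketch of a Hypothesis test for Python’s built-in sorted. The pre-condition is encoded as a strategy and the post-conditions as assertions:

```python
from collections import Counter

from hypothesis import given
from hypothesis import strategies as st

# Pre-condition: the input is any list of integers. Hypothesis generates
# the examples automatically, including edge cases like the empty list.
@given(st.lists(st.integers()))
def test_sorted_properties(xs):
    result = sorted(xs)
    # Post-condition: the output is ordered...
    assert all(a <= b for a, b in zip(result, result[1:]))
    # ...and is a permutation of the input.
    assert Counter(result) == Counter(xs)
```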
Although Hypothesis is not limited to ML applications, Marina gave some example techniques for using it with ML models. In particular, Metamorphic Testing techniques make sense for testing non-deterministic ML applications. In Metamorphic Testing, a program is run twice, once with an input and once with a transformation of that input. The output of the transformed run must obey some relation with respect to the output of the non-transformed run. For example, a test for a search engine might add an extra keyword as the transformed input; we would expect the transformed output to be a subset of the original output. Example transformations for ML training include removing or transforming labels and classes.
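Here is a rough sketch (not from the talk) of how such a metamorphic test might look with Hypothesis. The toy search function, keyword vocabulary, and documents are hypothetical stand-ins for a real search engine:

```python
from hypothesis import given
from hypothesis import strategies as st

KEYWORDS = ["python", "docker", "testing", "pipelines"]
DOCUMENTS = {"doc1": {"python", "testing"},
             "doc2": {"python", "docker"},
             "doc3": {"pipelines"}}

def search(query):
    """Toy stand-in for a real search engine: return the documents
    containing every keyword in the query."""
    return [doc for doc, words in DOCUMENTS.items()
            if set(query) <= words]

@given(st.lists(st.sampled_from(KEYWORDS), min_size=1),
       st.sampled_from(KEYWORDS))
def test_adding_a_keyword_narrows_results(query, extra):
    original = search(query)
    transformed = search(query + [extra])
    # Metamorphic relation: a more specific query should return a
    # subset of the original results.
    assert set(transformed) <= set(original)
```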
Narrative-focused Video Games Development with Ren’Py, an Open source Engine
Speaker: Susan Shu Chang
Ren’Py is a modern cross-platform game engine for building interactive fiction games. Its capabilities go far, far beyond the old text-based Zork games: Ren’Py games feature rich graphics and sound while remaining story-driven. Susan, the speaker, has created a commercial game with this engine that has sold over six million copies.
I do not have much background in game engines, but I thought this was really cool! I might try it out over the summer with my daughter, who is interested in writing computer games.
Reproducible and Maintainable Data Science Code with Kedro
Speaker: Yetunde Dada
Kedro is another open source tool for managing ML experiments and pipelines (see Rubicon above). Kedro comes from QuantumBlack, a data science consulting company that was acquired by McKinsey. Some unique features of Kedro include an opinionated project layout template (e.g., specific directories for configuration, data, code, notebooks, and documentation) and the concept of a data catalog. The data catalog is a configuration file that references all the data sets used by a Kedro project. Like Rubicon, Kedro leverages fsspec to support many file backends. Kedro also provides a simple API for creating data pipelines and a pipeline visualizer.
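To show the flavor of the pipeline API, here is a minimal sketch based on Kedro’s documented Pipeline and node constructs. The function bodies and dataset names are hypothetical:

```python
from kedro.pipeline import Pipeline, node

def preprocess(raw_df):
    """Stand-in cleaning step; a real node would do more."""
    return raw_df.dropna()

def train(clean_df):
    """Stand-in training step; a real node would fit a model."""
    return {"n_rows": len(clean_df)}

# "raw_data", "clean_data", and "model" would be declared in the
# project's data catalog (conf/base/catalog.yml), which Kedro
# resolves to concrete storage locations via fsspec.
pipeline = Pipeline([
    node(preprocess, inputs="raw_data", outputs="clean_data"),
    node(train, inputs="clean_data", outputs="model"),
])
```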
Zero to Production-Ready: A Best-Practices Process for Docker Packaging
Speaker: Itamar Turner-Trauring
In this talk, Itamar provides an overview of his recommended process for Docker-izing an application. The steps are:
- Get something working
- Security
- Running in Continuous Integration
- Operational correctness and debuggability
- Reproducible builds
- Faster builds and smaller images
The talk is aimed at intermediate Docker users who already understand the basic ideas behind Dockerfiles, images, containers, etc. There was a lot of good advice in the talk, and the above process is a good way to get to the next level. Itamar has more content on his website.
I learned something new in the talk: to improve the startup time for Python-based containers, you can precompile your Python code and library dependencies to .pyc files. This matters for short-lived uses of Docker containers, like AWS Lambda.
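In a Dockerfile you would typically run `python -m compileall` during the image build; the same functionality is available from the standard library, as in this sketch (the paths are illustrative and depend on your image layout):

```python
import compileall

# Precompile application code and installed dependencies to .pyc so the
# interpreter can skip bytecode compilation at container startup.
compileall.compile_dir("/app", quiet=1)
compileall.compile_dir("/usr/local/lib/python3.9/site-packages", quiet=1)
```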
Coming up
In part 2, I cover the talks I attended on May 15th. The topics include testing and best practices in data science, using Sphinx for static websites, and CircuitPython.