In this blog post, I share my notes from the PyCon US 2021 talks I attended on Saturday May 15th. See also my blog post from last week, where I provided background on the conference and covered the first day of the conference talks, May 14th.
Update: the talks are now available for free on YouTube here.
Speaker: Jeremy Paige
I maintain a few Python packages, and my usual approach to packaging is to copy and paste a setup.py file from a previous project and adjust it as needed. Thus, I was interested to see what the current best practices are. Jeremy does a good job of covering the highlights. His slides are here. Apparently, the packaging world is moving away from the executable setup.py. The recommended approach is to have a pyproject.toml file that names the build tool to be used and lists that build tool's dependencies. Then, you replace setup.py with a setup.cfg file that contains the usual project metadata. Jeremy describes how to migrate an existing project to the new approach.
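As a rough illustration of the layout described above, here is a minimal sketch of the two files, held as strings in a short Python script. The project name, version, author, and dependency are placeholders I made up, and the setuptools version pin is an assumption rather than something taken from Jeremy's slides. Because setup.cfg is plain declarative data, the standard library's configparser can read the metadata without executing any project code, which is part of the appeal over an executable setup.py.

```python
import configparser

# pyproject.toml: names the build backend and what must be installed to run it.
# The version pin is an assumption; check the setuptools docs for your case.
PYPROJECT_TOML = """\
[build-system]
requires = ["setuptools>=42", "wheel"]
build-backend = "setuptools.build_meta"
"""

# setup.cfg: the metadata that used to be passed to setup() in setup.py.
# All values below are hypothetical placeholders.
SETUP_CFG = """\
[metadata]
name = example-package
version = 0.1.0
description = A placeholder project used for illustration
author = Jane Doe

[options]
packages = find:
python_requires = >=3.6
install_requires =
    requests
"""

# Read the metadata as plain data; no project code is executed.
parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

metadata = dict(parser["metadata"])
install_requires = parser["options"]["install_requires"].split()

print(metadata["name"], metadata["version"], install_requires)
```

In a real project the two strings would simply be two files in the repository root, next to the package directory.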
The packaging situation is a bit confusing: there are three roles for tools (build backend, build frontend, and integration frontend). I did not fully understand the differences between these roles, but maybe it only matters if you are developing a build/packaging tool. Also, it is odd that they chose TOML as the pyproject file format: there is no TOML parser available in the standard library! Supposedly, they will create a PEP to add one, but I believe it is too late for Python 3.10, so it cannot be included in Python until October 2022.
Anyway, I personally do not want to put too many mental cycles into packaging. 🤷 I think I got enough from Jeremy’s talk that I can at least modernize my package configuration.
Dask-SQL: Empowering Pythonistas for Scalable End-to-end Data Engineering and Data Science
Speaker: Adam Breindel
Dask is a distributed analytics environment written in Python. It provides deep integrations with NumPy, Pandas, and Scikit-Learn. Although I have not used it, it seems like a lighter weight and more Python-focused alternative to Apache Spark. Until recently, the big advantage for Spark was its SQL interface. Now, Dask also supports SQL with Dask-SQL.
In this talk, Adam introduced the capabilities of Dask-SQL and provided several example workflows. I was impressed with its ability to consume ad-hoc data files and Hive tables as well as its integration with Pandas DataFrames (you can return a query result as a DataFrame). When Adam described the implementation, I was a little disappointed to learn that Dask-SQL does not completely escape from the JVM: it uses the Apache Calcite library under the covers for SQL parsing and conversion to a relational algebra. That is probably a sensible choice, as Calcite is a widely used query parser in the Big Data space. They do hide the JVM install issues from you via the conda package manager, which is better than pip and friends for multi-language packages. Anyway, if you have some Big Data processing to do, and you don't want to spin up a Spark cluster, check it out!
Static Sites with Sphinx and Markdown
Speaker: Paul Everitt
I was excited to see this talk because 1) my open source site datahut.ai is a static site created via Sphinx, and 2) Paul is a great speaker and a co-founder of one of the first Python web companies, Digital Creations (later Zope Corporation), founded in 1995.
The focus of this talk was on using Markdown (.md) as the Sphinx file format instead of reStructuredText (.rst). The tool that lets you do this is MyST. Paul demonstrated how to set this up and how to use it in building your content. From his examples, it is clear that all the specialized Sphinx "directives" are available when using Markdown. Sphinx is clearly a powerful tool for building static sites, and hopefully the Markdown functionality will attract more users and theme developers.
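For reference, the setup step mostly comes down to a few lines in the Sphinx conf.py file. The following is a minimal sketch, not Paul's exact configuration: it assumes the myst-parser package is installed, and the project name and author are placeholders.

```python
# conf.py: Sphinx configuration (a sketch; project details are placeholders)

project = "example-site"  # hypothetical project name
author = "Jane Doe"       # hypothetical author

# Enable the MyST parser so Sphinx can read Markdown sources.
extensions = ["myst_parser"]

# Accept both formats, so existing .rst pages keep working during a migration.
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}

# Optional MyST syntax extensions; "colon_fence" allows ::: fences for directives.
myst_enable_extensions = ["colon_fence"]
```

With this in place, a directive is written in a Markdown file as a fenced block whose info string is the directive name in braces, for example `{note}`.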
More Fun with Hardware and CircuitPython – IoT, Wearables, and More!
Speaker: Nina Zakharenko
I watched this talk with my daughter, who is just finishing up the second grade and is a total electronics geek. The talk was a high-level overview of the kinds of "maker" projects you can build. CircuitPython is a Python 3 implementation designed for microcontrollers; it is Adafruit's fork of MicroPython. Adafruit provides built-in support for most of their microcontroller boards and sensors. Even if you are already familiar with CircuitPython (as my daughter is), the talk was full of resources and ideas. Links to everything may be found in the slides, which are here. My daughter's favorite resource from the talk was iCircuit 3D, a $15 iPhone/iPad app that simulates electronic circuits. It looks great for trying out ideas before building them.
Statistical Typing: A Runtime Typing System for Data Science and Machine Learning
Speaker: Niels Bantilan
In this talk, Niels describes Pandera, a data validation library that he has written and made available on GitHub. His slides are here. The basic idea is to extend primitive data types with deterministic properties over the domain of possible values (e.g. x>0) and probabilistic properties held by a collection of data points (e.g. the mean and standard deviation). These properties are then used to validate raw and processed data every time a data pipeline is run. This approach can catch some simple bugs (e.g. values must be greater than zero) and some very subtle bugs (e.g. the data is drawn from the wrong distribution). Furthermore, Niels demonstrates how Pandera can be used to generate data for property-based testing via the Hypothesis library (I also discussed Hypothesis in my blog post on the Friday talks). Pandera can also infer validation rules from example data sets to bootstrap your rules. Finally, Niels talked about his future plans to use a Generative Adversarial Network (GAN) as a schema to take these ideas to the next level, validating real-world data and generating synthetic data with complex patterns. Overall, this looks very promising.
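To make the two kinds of properties concrete, here is a small plain-Python sketch. It is not Pandera's API: the function name, its parameters, and the sample readings are all made up for illustration, and the statistical check is a simple tolerance on the sample mean rather than the proper hypothesis tests a real library would use.

```python
from statistics import mean


def validate_column(values, *, lower_bound, expected_mean, tolerance):
    """Return values unchanged if both checks pass; raise ValueError otherwise."""
    # Deterministic property: each value on its own must satisfy x > lower_bound.
    offenders = [x for x in values if not x > lower_bound]
    if offenders:
        raise ValueError(f"values not greater than {lower_bound}: {offenders}")

    # Statistical property: holds for the collection as a whole, not for any
    # single value. Here: the sample mean must lie close to the expected mean.
    sample_mean = mean(values)
    if abs(sample_mean - expected_mean) > tolerance:
        raise ValueError(
            f"sample mean {sample_mean:.2f} is more than {tolerance} "
            f"away from expected mean {expected_mean}"
        )
    return values


# Hypothetical sensor readings that satisfy both properties.
readings = [9.8, 10.1, 10.3, 9.9, 10.0]
validate_column(readings, lower_bound=0, expected_mean=10, tolerance=0.5)
```

A column such as [19.8, 20.1, 20.3] would pass the first check, since every value is positive, but fail the second. That is the "drawn from the wrong distribution" kind of bug that a per-value check alone cannot see.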
The online PyCon lacks the social aspect of an in-person conference, but it was well worth the time and the nominal fee. PyCon has always been a conference with high information content, without a lot of talks from marketing people pushing their commercial projects (a little is OK, but I did not invest my personal time and money to listen to all-marketing talks).
I hope these summaries are useful in selecting talks and following up on online resources. I'm looking forward to PyCon US 2022, which will be held in person in Salt Lake City, Utah. Hope to see you there!