A Map-reduce Example in Ray

Introduction: Last fall, we had several blog posts on Ray, a new distributed infrastructure focused on machine learning and data engineering. In this blog post, I explore a map-reduce example of using Ray with large-scale applications. Although Ray’s documentation already has a simple map-reduce example, I want to look at a more complex problem to better understand what Ray’s capabilities are. The map-reduce algorithm is one of the most…

Diagram of the flow in the DataHut pipeline.

Data Hut’s Data Pipeline

Datahut.ai is a site we launched this spring. It combines manual curation and automated analytics to provide insight into data science and big data open source projects. We currently cover 128 different projects, from Airflow to XGBoost. The code for DataHut started last year with a single Python script to download Git commits and a Jupyter notebook for data analysis and visualization. This allowed us to gain insight into our…

PyCon 2021 Notes, Part 2

In this blog post, I share my notes from the PyCon US 2021 talks I attended on Saturday, May 15th. See also my blog post from last week, where I provided background on the conference and covered the first day of the conference talks, May 14th. Update: the talks are now available for free on YouTube here. Saturday Talks: Packaging Python in 2021 (Speaker: Jeremy Paige). I maintain a few…

Talk on Data Hut Tonight

Tonight, May 19th, I will be giving a talk at SF Python about our open source directory, datahut.ai. If you are interested, you can sign up here. Talk Abstract: My wife and I recently launched datahut.ai, a free website that tracks open source projects in the data science and data engineering space. In this talk, I will provide an overview of the site and then describe the data pipeline used…

PyCon 2021 Notes, Part 1

I attended PyCon US 2021 on May 14-15. I liked the sessions that presented new tools or best practices, which I cover in two blog posts. This first blog post contains notes from the first day (May 14th), while the second blog post covers talks from the next day (May 15th). I focused on talks about data science topics, with a few other talks that piqued my interest. For each…

Data Hut May 2021 Update

We are pleased to announce the May update to DataHut, our curated site with analysis on popular data science and data engineering projects. We have added two new categories: Time Series Analysis and Workflow Management. Time Series Analysis is a sub-category of Data Science. It includes projects which perform feature extraction, prediction, trend analysis, and machine learning on time series data. We are tracking four projects in this category. Workflow…

Announcing Data Hut™: A Free Resource for Data Science and Big Data

We are pleased today to announce datahut.ai, a free web-based directory that provides statistics and analysis on the most popular data science and data engineering projects, from Apache Age to XGBoost. By combining machine learning techniques with expert knowledge, we help you navigate the open source landscape and pick the best software for your needs. Data Hut organizes over 100 projects by category and community, visually comparing size and trends…

Ray: Beyond Serverless Computing

Series Outline. Note: this is part of an ongoing series of posts about Ray. The full series is: (1) Ray: A Distributed Computing Platform For Machine Learning; (2) Ray’s Ecosystem; (3) Ray: Core Architecture; (4) this post, where we show how Ray improves upon Serverless Computing. Serverless Computing: Most cloud providers have a Serverless Computing offering, also known as Function as a Service. These include AWS Lambda, Google Cloud Functions, and Azure Functions. Serverless…

Ray: Core Architecture

Series Outline. Note: this is part of an ongoing series of posts about Ray. The full series is: (1) Ray: A Distributed Computing Platform For Machine Learning; (2) Ray’s Ecosystem; (3) this post, where we start getting into technical details. Ray and Its Programming Model: Ray, an open source project which originated at UC Berkeley, provides a high-level programming model and supporting infrastructure for distributed machine learning applications. For the application…

Ray’s Ecosystem

As part of our blog series on Ray, this post analyzes the ecosystem that Ray has built around its platform. If you missed our first blog post on Ray, you may want to read it first. Ray is a relatively young open source project, created in 2016 as part of a research project at UC Berkeley. Nevertheless, Ray has created an impressive ecosystem around its platform. The graph below shows the…