Jeff Fischer

A Map-reduce Example in Ray

Introduction: Last fall, we published several blog posts on Ray, a new distributed infrastructure focused on machine learning and data engineering. In this blog post, I use a map-reduce example to explore using Ray with large-scale applications. Although Ray’s documentation already has a simple map-reduce example, I want to look at a more complex problem to better understand Ray’s capabilities. The map-reduce algorithm is one of the most…

Diagram of the flow in the Data Hut pipeline.

Data Hut’s Data Pipeline

Datahut.ai is a site we launched this spring. It combines manual curation and automated analytics to provide insight into open source projects for data science and big data. We currently cover 128 different projects, from Airflow to XGBoost. The code for Data Hut started last year as a single Python script to download Git commits and a Jupyter notebook for data analysis and visualization. This allowed us to gain insight into our…

PyCon 2021 Notes, Part 2

In this blog post, I share my notes from the PyCon US 2021 talks I attended on Saturday, May 15th. See also my blog post from last week, where I provided background on the conference and covered the first day of talks, May 14th. Update: the talks are now available for free on YouTube here. Saturday Talks: Packaging Python in 2021 (Speaker: Jeremy Paige). I maintain a few…

Talk on Data Hut Tonight

Tonight, May 19th, I will be giving a talk at SF Python about our open source directory, datahut.ai. If you are interested, you can sign up here. Talk Abstract: My wife and I recently launched datahut.ai, a free website that tracks open source projects in the data science and data engineering space. In this talk, I will provide an overview of the site and then describe the data pipeline used…

PyCon 2021 Notes, Part 1

I attended PyCon US 2021 on May 14-15. I liked the sessions that presented new tools or best practices, which I cover in two blog posts. This first blog post contains notes from the first day (May 14th), while the second covers talks from the next day (May 15th). I focused on talks about data science topics, along with a few other talks that piqued my interest. For each…

Announcing Data Hut™: A Free Resource for Data Science and Big Data

Today we are pleased to announce datahut.ai, a free web-based directory that provides statistics and analysis on the most popular data science and data engineering projects, from Apache Age to XGBoost. By combining machine learning techniques with expert knowledge, we help you navigate the open source landscape and pick the best software for your needs. Data Hut organizes over 100 projects by category and community, visually comparing size and trends…

Ray: Beyond Serverless Computing

Series Outline: This is part of an ongoing series of posts about Ray. The full series is: (1) Ray: A Distributed Computing Platform For Machine Learning; (2) Ray’s Ecosystem; (3) Ray: Core Architecture; (4) this post, where we show how Ray improves upon Serverless Computing. Serverless Computing: Most cloud providers have a Serverless Computing offering, also known as Function as a Service. These include AWS Lambda, Google Cloud Functions, and Azure Functions. Serverless…

Ray: Core Architecture

Series Outline: This is part of an ongoing series of posts about Ray. The full series is: (1) Ray: A Distributed Computing Platform For Machine Learning; (2) Ray’s Ecosystem; (3) this post, where we start getting into technical details. Ray and Its Programming Model: Ray, an open source project that originated at UC Berkeley, provides a high-level programming model and supporting infrastructure for distributed machine learning applications. For the application…

How To Select Open Source Software

We are often asked by our clients to help select the best open source software for a given use case. For example, we might be asked to select a data warehouse for analytics queries, a time series database for marketing events, or an ML library for a recommendation engine. Through a combination of research, data analysis, and prototyping, we work with our clients to make the decision for their business.…

The Open Source Software Life Cycle

Open source projects come and go over time. A project may rise from obscurity to become the hottest thing, and then, in a few years, go back to obscurity. Before betting on an open source project for your next solution, you want to make sure that project will be around for a while. With most open source projects using Git as their source control system, it is easy to grab…