Data Science

Diagram of the flow in the DataHut pipeline.

Data Hut’s Data Pipeline

Datahut.ai is a site we launched this spring. It combines manual curation and automated analytics to provide insight into open source data science and big data projects. We currently cover 128 different projects, from Airflow to XGBoost. The code for DataHut started last year with a single Python script to download Git commits and a Jupyter notebook for data analysis and visualization. This allowed us to gain insight into our…

PyCon 2021 Notes, Part 2

In this blog post, I share my notes from the PyCon US 2021 talks I attended on Saturday, May 15th. See also my blog post from last week, where I provided background on the conference and covered the first day of talks, May 14th. Update: the talks are now available for free on YouTube here. Saturday Talks: Packaging Python in 2021 (Speaker: Jeremy Paige). I maintain a few…

Talk on Data Hut Tonight

Tonight, May 19th, I will be giving a talk at SF Python about our open source directory, datahut.ai. If you are interested, you can sign up here. Talk Abstract: My wife and I recently launched datahut.ai, a free website that tracks open source projects in the data science and data engineering space. In this talk, I will provide an overview of the site and then describe the data pipeline used…

PyCon 2021 Notes, Part 1

I attended PyCon US 2021 on May 14-15. I liked the sessions that presented new tools or best practices, which I cover in two blog posts. This first post contains notes from the first day (May 14th), while the second covers talks from the next day (May 15th). I focused on talks about data science topics, along with a few other talks that piqued my interest. For each…

Data Hut May 2021 Update

We are pleased to announce the May update to DataHut, our curated site with analysis on popular data science and data engineering projects. We have added two new categories: Time Series Analysis and Workflow Management. Time Series Analysis is a sub-category of Data Science. It includes projects which perform feature extraction, prediction, trend analysis, and machine learning on time series data. We are tracking four projects in this category. Workflow…

Announcing Data Hut™: A Free Resource for Data Science and Big Data

Today we are pleased to announce datahut.ai, a free web-based directory that provides statistics and analysis on the most popular data science and data engineering projects, from Apache Age to XGBoost. By combining machine learning techniques with expert knowledge, we help you navigate the open source landscape and pick the best software for your needs. Data Hut organizes over 100 projects by category and community, visually comparing size and trends…

Ray’s Ecosystem

As part of our blog series on Ray, this post analyzes the ecosystem that Ray has built around its platform. If you missed our first blog post on Ray, you may want to read it first. Ray is a relatively young open source project, created in 2016 as part of a research project at UC Berkeley. Nevertheless, Ray has created an impressive ecosystem around its platform. The graph below shows the…

Ray: A Distributed Computing Platform For Machine Learning

Ray is an open source project originating from the UC Berkeley RISELab in 2016. The creators of Ray launched a commercial company, Anyscale, in 2019. The Ray project has been a superstar from its inception: it received two NSF grants and sponsorships from Alibaba, Amazon Web Services, Ant Financial, ARM, Capital One, Ericsson, Facebook, Google, Huawei, Intel, Microsoft, Scotiabank, Splunk, and VMware. Unsurprisingly, Anyscale successfully raised $60M from two…