Plurrrr

a tumblelog
week 14, 2021

Bloom filters

A Bloom filter is a probabilistic data structure present in many common applications. Its purpose is answering the question: "is this item in the set?" very fast and not using a lot of space. The answers can be NO, or MAYBE YES.

Source: Bloom filters, an article by Ricardo Ander-Egg Aguilar.

Using PyTorch + NumPy? You're making a mistake.

Bugs in ML code are notoriously hard to fix - they don’t cause compile errors but silently regress accuracy. Once you have endured the pain and fixed one of these, the lesson is forever etched into your brain, right? Wrong. Recently, an old foe made a comeback - a familiar bug bit me again! As before, the performance improved significantly after fixing it.

The bug was subtle and easy to make. How many others has it done damage to? Curious, I downloaded over a hundred thousand repositories from GitHub that import PyTorch, and analysed their source code. I kept projects that define a custom dataset, use NumPy’s random number generator with multi-process data loading, and are more-or-less straightforward to analyse using abstract syntax trees. Out of these, over 95% of the repositories are plagued by this problem. It’s inside PyTorch’s official tutorial, OpenAI’s code, and NVIDIA’s projects. Even Karpathy admitted falling prey to it.

Source: Using PyTorch + NumPy? You're making a mistake, an article by Tanel Pärnamaa.
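
The excerpt does not spell out the bug. As far as I understand it, forked data-loading workers each inherit an identical copy of the parent's NumPy generator state, so they all produce the same "random" numbers. The sketch below only simulates that mechanism with Python's standard `random` module as a stand-in for NumPy. The function name `simulate_workers` and the base seed are assumptions made for illustration, not code from the article.

```python
import random

BASE_SEED = 1234  # hypothetical seed, chosen only for this illustration


def simulate_workers(num_workers, reseed):
    """Return the first three draws of each simulated worker."""
    # The parent owns one generator. A fork hands every worker an identical
    # copy of its state, which is simulated here with getstate/setstate.
    parent_state = random.Random(BASE_SEED).getstate()
    draws = []
    for worker_id in range(num_workers):
        rng = random.Random()
        rng.setstate(parent_state)  # what the worker inherits
        if reseed:
            # Give each worker its own stream by mixing in the worker id.
            rng.seed(BASE_SEED + worker_id)
        draws.append(tuple(rng.randint(0, 999) for _ in range(3)))
    return draws


print(simulate_workers(4, reseed=False))  # four identical tuples
print(simulate_workers(4, reseed=True))   # streams that differ per worker
```

In a real PyTorch project the per-worker reseeding would typically live in a function passed to the `DataLoader` as `worker_init_fn`; the details depend on the PyTorch version in use.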

Rust async is colored, and that’s not a big deal

In this article I will show that Rust async functions are colored, by both the original definition and in practice. This is not meant as a criticism of Rust async, though – I don’t see function colors as an insurmountable issue, but as a reflection of the fundamental difference of async and sync models of the world. Languages that hide that difference do so by introducing compromises that might not be acceptable in a systems language like Rust or C++ – for example, by entirely forbidding the use of system threads, or by complicating the invocation of foreign or OS-level blocking calls. Colored functions are also present in at least C#, Python, Kotlin, and C++, so they’re not a quirk of JavaScript and Rust. And additional features of Rust async do make it easier to connect async code with traditional blocking code, something that is just not possible in JavaScript.

Source: Rust async is colored, and that’s not a big deal.
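
Since the excerpt lists Python among the languages with colored functions, here is a minimal Python sketch of what "colored" means in practice. It is not taken from the article, and the function names are made up for illustration. Calling an async function from sync code only yields a coroutine object, and blocking code has to be bridged explicitly, here with a worker thread.

```python
import asyncio
import time


async def fetch_value():
    # An "async-colored" function: calling it only creates a coroutine.
    await asyncio.sleep(0)
    return 42


def blocking_io():
    # A traditional blocking function.
    time.sleep(0.01)
    return "done"


async def main():
    value = await fetch_value()  # async code can await async code
    # Bridge to blocking code by running it in the default thread pool.
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, blocking_io)
    return value, result


# Sync code cannot call fetch_value() and get 42 back directly.
coroutine = fetch_value()
print(asyncio.iscoroutine(coroutine))  # True
coroutine.close()  # discard it without running it

# Sync code needs an event loop to drive the async function.
print(asyncio.run(main()))  # (42, 'done')
```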

Best practices for writing SQL queries

This article covers some best practices for writing SQL queries for data analysts and data scientists. Most of our discussion will concern SQL in general, but we’ll include some notes on features specific to Metabase that make writing SQL a breeze.

Source: Best practices for writing SQL queries.

"this" Is Weird: Understanding JavaScript "this" Keyword

The this keyword is a fundamental concept in JavaScript. However its behavior may appear very strange, especially if you are familiar with similar constructs in other languages, for example this in Java, or self in Python.

Source: "this" Is Weird: Understanding JavaScript "this" Keyword, an article by Linton Ye.

Nix is the ultimate DevOps toolkit

At Channable we use Nix to build and deploy our services and to manage our development environments. This was not always the case: in the past we used a combination of ecosystem-specific tools and custom scripts to glue them together. Consolidating everything with Nix has helped us standardize development and deployment workflows, eliminate “works on my machine”-problems, and avoid unnecessary rebuilds. In this post we want to share what problems we encountered before adopting Nix, how Nix solves those, and how we gradually introduced Nix into our workflows.

Source: Nix is the ultimate DevOps toolkit, an article by Ruud van Asseldonk, Reinier Maas, Falco Peijnenburg, Fabian Thorand, and Robert Kreuzer.

Why Do Long Options Start with Two Dashes?

Around 1990, Richard Stallman (RMS) and I were writing the GNU C library getopt() and he wanted to extend it to support long (multi-character) option names for user-friendliness.

Source: Why Do Long Options Start with Two Dashes?

3 reasons I use the Git cherry-pick command

Here is an example to help you understand the importance of cherry-picking. Suppose you have made several commits in a branch, but you realize it's the wrong branch! What do you do now? Either you repeat all your changes in the correct branch and make a fresh commit, or you merge the branch into the correct branch. Wait, the former is too tedious, and you may not want to do the latter. So, is there a way? Yes, Git's got you covered. Here is where cherry-picking comes into play. As the term suggests, you can use it to hand-pick a commit from one branch and transfer it into another branch.

Source: 3 reasons I use the Git cherry-pick command, an article by Manaswini Das.

What is a Diagonally Dominant Matrix?

Matrices arising in applications often have diagonal elements that are large relative to the off-diagonal elements. In the context of a linear system this corresponds to relatively weak interactions between the different unknowns. We might expect a matrix with a large diagonal to be assured of certain properties, such as nonsingularity. However, to ensure nonsingularity it is not enough for each diagonal element to be the largest in its row.

Source: What is a Diagonally Dominant Matrix?, an article by Nick Higham.
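
To make the last sentence of the excerpt concrete, here is a small sketch, assuming the usual definition of strict diagonal dominance by rows: each |a_ii| exceeds the sum of the absolute values of the other entries in its row. The function name and both example matrices are my own. In the first matrix every diagonal element is the largest in its row, yet the rows sum to zero, so the matrix is singular.

```python
def is_strictly_diagonally_dominant(matrix):
    """Check |a_ii| > sum of |a_ij| over j != i, for every row i."""
    for i, row in enumerate(matrix):
        off_diagonal = sum(abs(x) for j, x in enumerate(row) if j != i)
        if abs(row[i]) <= off_diagonal:
            return False
    return True


# Each diagonal entry is the largest in its row, but every row sums to
# zero, so the vector of ones is in the null space: the matrix is singular.
largest_in_row = [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]]

# Strictly diagonally dominant by rows, hence nonsingular.
dominant = [[3, -1, -1], [-1, 3, -1], [-1, -1, 3]]

print(is_strictly_diagonally_dominant(largest_in_row))  # False
print(is_strictly_diagonally_dominant(dominant))        # True
```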

What is Werkzeug?

This article explains what Werkzeug is and how Flask uses it for its core HTTP functionality. Along the way, you'll develop your own WSGI-compatible application using Werkzeug to create a Flask-like web framework!

Source: What is Werkzeug?, an article by Patrick Kennedy.
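
For readers unfamiliar with WSGI, the interface the excerpt refers to is just a callable that takes an environ dictionary and a `start_response` function. The sketch below uses only the standard library, not Werkzeug itself. The helper `call_app` is a hypothetical test harness of my own rather than anything from the article.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    # A bare WSGI application: build a body, report status and headers,
    # and return an iterable of bytes.
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path):
    # Hypothetical helper: invoke the app once without starting a server.
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


print(call_app(application, "/hello"))  # ('200 OK', b'Hello from /hello')
```

A library like Werkzeug wraps this raw environ and response handling in request and response objects, which is the layer Flask builds on according to the excerpt.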

Steps to release curl

I have a lot of different hats and roles in the curl project. One of them is “release manager” and in this post I’ve tried to write down pretty much all the steps I do to prepare and ship a curl release at the end of every release cycle in the project.

Source: Steps to release curl, an article by Daniel Stenberg.

What Is Categorical Data?

Data comes in two flavors: Numeric and Categorical. Numeric data is easy, it’s numbers. Categorical data is everything else.

As the name suggests, categorical data is information that comes in categories—which means each instance of it is distinct from the others. Names are an example of categorical data, and my name is distinct from your name. On the unlikely chance that your name is the same as mine, I’m sure our government-issued ID numbers, phone numbers, and email addresses are distinct—which are also categorical data.

Source: What Is Categorical Data?, an article by Ryan Wright.

A practical guide on calling Go from Python using ctypes

When working in Python, it sometimes makes sense to implement parts of the program in a static, high-performance language. Go can be a great choice for that because it is fast, simple and cross platform.

Source: A practical guide on calling Go from Python using ctypes, an article by Amit Lavon.

How to make an awesome Python package in 2021

If you are like me, every once in a while you write a useful Python utility and want to share it with your colleagues. The best way to do this is to make a package: it is easy to install and saves you from copy-pasting.

If you are like me, you might be thinking that creating packages is a real headache. Well, that’s not the case anymore. And I am going to prove it with this step-by-step guide. Just three main steps (and a bunch of optional ones) accompanied by a few GitHub links.

Source: How to make an awesome Python package in 2021, an article by Anton Zhiyanov.

A look under the hood: how branches work in Git

Git has won the race for the most popular version control system. But why exactly is it so popular? The answer, at least in my opinion, is pretty clear: branches! They allow you to keep different versions of your code cleanly separated—ideal when collaborating in a team with other people, but also for yourself when you begin working on a new feature.

Although other version control systems also offer some form of branching, Git’s concept and implementation are just stunning. It has made working with branches so quick and easy that many developers have adopted the concept for their daily work.

Source: A look under the hood: how branches work in Git, an article by Tobias Günther.

Introducing Swift Collections

I’m thrilled to announce Swift Collections, a new open-source package focused on extending the set of available Swift data structures. Like the Swift Algorithms and Swift Numerics packages before it, we’re releasing Swift Collections to help incubate new functionality for the Swift Standard Library.

The Swift Standard Library currently implements the three most essential general-purpose data structures: Array, Set and Dictionary. These are the right tool for a wide variety of use cases, and they are particularly well-suited for use as currency types. But sometimes, in order to efficiently solve a problem or to maintain an invariant, Swift programmers would benefit from a larger library of data structures.

We expect the Collections package to empower you to write faster and more reliable programs, with less effort.

Source: Introducing Swift Collections, an article by Karoy Lorentey.

Understanding Bloom Filters

A Bloom filter is a probabilistic data structure. It tells you if an element is in a set or not in a very fast and memory-efficient way. A Bloom filter can tell if an element is not in the set (“being 100% sure”) or that it may be in the set, but not “being 100% sure”. It only has 2 operations: add, to add an element, and query, to check if an element exists in the set or not.

Source: Understanding Bloom Filters, an article by Ricardo Ander-Egg Aguilar.
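
The two operations named in the excerpt, add and query, can be sketched in a few lines of Python. This is a minimal illustration and not the article's implementation. The bit-array size, the number of hash functions, and the use of salted SHA-256 digests are all assumptions picked for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # False means "definitely not in the set"; True means "maybe".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("apple")
bf.add("banana")
print(bf.query("apple"))   # True: added items always report "maybe"
print(bf.query("cherry"))  # usually False; a false positive is possible
```

A negative answer is always reliable because `add` never clears a bit; false positives become more likely as more of the bit array fills up.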

Deep learning model compression

Each year, larger and larger models are able to find methods for extracting signal from the noise in machine learning. In particular, language models get larger every day. These models are computationally expensive (in both runtime and memory), which can make them costly to serve to customers, or too slow or large to function in edge environments like a phone.

Researchers and practitioners have come up with many methods for optimizing neural networks to run faster or with less memory usage. In this post I’m going to cover some of the state-of-the-art methods. If you know of another method you think should be included, I’m happy to add it. This has a slight PyTorch bias (haha) because I’m most familiar with it.

Source: Deep learning model compression, an article by Rachit Singh.

Loading SQL data into Pandas without running out of memory

You have some data in a relational database, and you want to process it with Pandas. So you use Pandas’ handy read_sql() API to get a DataFrame—and promptly run out of memory.

The problem: you’re loading all the data into memory at once. If you have enough rows in the SQL query’s results, it simply won’t fit in RAM.

Pandas does have a batching option for read_sql(), which can reduce memory usage, but it’s still not perfect: it also loads all the data into memory at once!

So how do you process larger-than-memory queries with Pandas? Let’s find out.

Source: Loading SQL data into Pandas without running out of memory, an article by Itamar Turner-Trauring.
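
The excerpt stops before giving the answer, so the following is not the article's solution. It only illustrates the general idea of processing a query result in batches instead of materialising every row, using the standard library's `sqlite3` module rather than Pandas. The table, the helper `iter_batches`, and the batch size are made up for the example.

```python
import sqlite3


def iter_batches(connection, query, batch_size):
    """Yield lists of at most batch_size rows from the query result."""
    cursor = connection.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value INTEGER)")
conn.executemany(
    "INSERT INTO measurements VALUES (?)", [(i,) for i in range(10)]
)

# Aggregate incrementally so only one batch is held at a time.
total = 0
batch_count = 0
for batch in iter_batches(conn, "SELECT value FROM measurements", batch_size=4):
    total += sum(value for (value,) in batch)
    batch_count += 1

print(total, batch_count)  # 45 3
```

In Pandas, the batching option the excerpt mentions is the `chunksize` argument of `read_sql()`. Whether the underlying database driver also streams rows, rather than fetching them all first, is the caveat the excerpt raises.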

Pterinochilus murinus outside its burrow

In the evening I managed to take a few photos of the Pterinochilus murinus I keep. Because the specimen is very skittish, the only way I could take those photos was through the plastic container, carefully providing some lighting from above with a flashlight.

Pterinochilus murinus outside its burrow.