A Bloom filter is a probabilistic data structure present in many
common applications. Its purpose is to answer the question "is this
item in the set?" very quickly while using very little space. The
answers can be NO or MAYBE YES.
Source: Bloom filters, an article by Ricardo Ander-Egg Aguilar.
Bugs in ML code are notoriously hard to fix - they don’t cause
compile errors but silently regress accuracy. Once you have endured
the pain and fixed one of these, the lesson is forever etched into
your brain, right?
Wrong. Recently, an
old foe made a comeback - a familiar bug bit me again! As before,
the performance improved significantly after fixing it.
The bug was subtle and easy to make. How many others had it bitten?
Curious, I downloaded over a hundred thousand
repositories from GitHub that import PyTorch, and analysed their
source code. I kept projects that define a custom dataset, use
NumPy’s random number generator with multi-process data loading, and
are more-or-less straightforward to analyse using abstract syntax
trees. Out of these, over 95% of the repositories are plagued by
this problem. It’s inside PyTorch’s official
tutorial,
OpenAI’s
code,
and NVIDIA’s
projects. Even
Karpathy
admitted
falling prey to it.
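The excerpt never spells the bug out, but the pattern it searches for (NumPy’s random number generator combined with multi-process data loading) matches a well-known pitfall: forked DataLoader workers inherit identical NumPy random state, so the "random" augmentations repeat across workers. A minimal sketch of the pitfall and one common fix (my own toy code, not the article’s):

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader

    class NoisyDataset(Dataset):
        """Toy dataset that draws a "random" augmentation with NumPy."""

        def __len__(self):
            return 8

        def __getitem__(self, idx):
            # Pitfall: with num_workers > 0 and fork-based workers, each worker
            # inherits a copy of the parent's NumPy RNG state, so different
            # workers can produce identical "random" numbers here.
            noise = np.random.random()
            return idx, noise

    def seed_worker(worker_id):
        # One common fix: give every worker its own NumPy seed, derived from
        # PyTorch's per-worker seed (torch.initial_seed() differs per worker).
        np.random.seed(torch.initial_seed() % 2**32)

    if __name__ == "__main__":
        loader = DataLoader(NoisyDataset(), num_workers=2, worker_init_fn=seed_worker)
        for idx, noise in loader:
            print(idx.item(), float(noise))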
In this article I will show that Rust async functions are colored,
by both the original definition and in practice. This is not meant
as a criticism of Rust async, though – I don’t see function colors
as an insurmountable issue, but as a reflection of the fundamental
difference between the async and sync models of the world. Languages that
hide that difference do so by introducing compromises that might not
be acceptable in a systems language like Rust or C++ – for example,
by entirely forbidding the use of system threads, or by complicating
the invocation of foreign or OS-level blocking calls. Colored
functions are also present in at least C#, Python, Kotlin, and C++,
so they’re not a quirk of JavaScript and Rust. Also, additional
features of Rust async make it easier to connect async code with
traditional blocking code, something that is just not possible in
JavaScript.
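The article’s examples are in Rust; since it notes that Python has colored functions too, here is a minimal Python illustration of the idea (my own sketch, not the article’s code): calling an async function from plain sync code only produces a coroutine object, which has to be awaited or handed to an event loop.

    import asyncio

    async def fetch_value():            # a "red" (async) function
        await asyncio.sleep(0.1)        # await is only legal inside async functions
        return 42

    def sync_caller():                  # a "blue" (sync) function
        # fetch_value() here merely creates a coroutine object; nothing runs.
        # A sync caller has to hand it to an event loop to get the result:
        return asyncio.run(fetch_value())

    async def async_caller():
        # An async caller can simply await it.
        return await fetch_value()

    print(sync_caller())                 # 42
    print(asyncio.run(async_caller()))   # 42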
This article covers some best practices for writing SQL queries for
data analysts and data scientists. Most of our discussion will
concern SQL in general, but we’ll include some notes on features
specific to Metabase that make writing SQL a breeze.
The this keyword is a fundamental concept in JavaScript. However,
its behavior may appear very strange, especially if you are familiar
with similar constructs in other languages, for example this in
Java, or self in Python.
At Channable we use Nix to build and deploy our services and to
manage our development environments. This was not always the case:
in the past we used a combination of ecosystem-specific tools and
custom scripts to glue them together. Consolidating everything with
Nix has helped us standardize development and deployment workflows,
eliminate “works on my machine” problems, and avoid unnecessary
rebuilds. In this post we want to share what problems we encountered
before adopting Nix, how Nix solves those, and how we gradually
introduced Nix into our workflows.
Source: Nix is the ultimate DevOps
toolkit,
an article by Ruud van Asseldonk, Reinier Maas, Falco Peijnenburg,
Fabian Thorand, and Robert Kreuzer.
Around 1990, Richard Stallman (RMS) and I were writing the GNU C
library getopt() and he wanted to extend it to support long
(multi-character) option names for user-friendliness.
Here is an example to help you understand the importance of
cherry-picking. Suppose you have made several commits in a branch,
but you realize it's the wrong branch! What do you do now? Either
you repeat all your changes in the correct branch and make a fresh
commit, or you merge the branch into the correct branch. Wait, the
former is too tedious, and you may not want to do the latter. So, is
there a way? Yes, Git’s got you covered. This is where
cherry-picking comes into play. As the term suggests, you can use it
to hand-pick a commit from one branch and apply it to another
branch.
Matrices arising in applications often have diagonal elements that
are large relative to the off-diagonal elements. In the context of a
linear system this corresponds to relatively weak interactions
between the different unknowns. We might expect a matrix with a
large diagonal to be assured of certain properties, such as
nonsingularity. However, to ensure nonsingularity it is not enough
for each diagonal element to be the largest in its row.
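A concrete example (mine, not necessarily the article’s): in the matrix below every diagonal entry is the largest element in its row, yet the rows sum to zero, so the matrix is singular. The diagonal entries only equal, rather than exceed, the sum of the off-diagonal magnitudes in each row, which is why a stronger condition such as strict diagonal dominance is used to guarantee nonsingularity.

    import numpy as np

    # Each diagonal entry (2) is larger in magnitude than either off-diagonal
    # entry (-1) in its row, but 2 = |-1| + |-1|, so the matrix is only weakly
    # diagonally dominant. The rows sum to zero, hence A @ [1, 1, 1] = 0.
    A = np.array([[ 2, -1, -1],
                  [-1,  2, -1],
                  [-1, -1,  2]], dtype=float)

    print(A @ np.ones(3))      # [0. 0. 0.]  -> a nonzero null vector
    print(np.linalg.det(A))    # 0.0 (up to rounding), i.e. A is singular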
This article explains what
Werkzeug is and how
Flask uses it for its core
HTTP functionality. Along the way, you'll develop your own
WSGI-compatible application using Werkzeug to create a Flask-like
web framework!
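To give a flavour of what that looks like, here is a minimal WSGI application built directly on Werkzeug (a generic sketch, not code from the article):

    from werkzeug.wrappers import Request, Response

    @Request.application
    def application(request):
        # Werkzeug parses the raw WSGI environ into a Request object and turns
        # the returned Response back into a proper WSGI reply.
        name = request.args.get("name", "world")
        return Response(f"Hello, {name}!")

    if __name__ == "__main__":
        from werkzeug.serving import run_simple
        run_simple("localhost", 5000, application)  # development server only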
I have a lot of different hats and roles in the curl project. One of
them is “release manager” and in this post I’ve tried to write down
pretty much all the steps I do to prepare and ship a curl release at
the end of every release cycle in the project.
Data comes in two flavors: Numeric and Categorical. Numeric data is
easy: it’s numbers. Categorical data is everything else.
As the name suggests, categorical data is information that comes in
categories—which means each instance of it is distinct from the
others. Names are an example of categorical data, and my name is
distinct from your name. In the unlikely event that your name is
the same as mine, I’m sure our government-issued ID numbers, phone
numbers, and email addresses, which are also categorical data, are
distinct.
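As a small illustration (my own, not from the article), pandas makes the distinction explicit by letting you store such columns with a categorical dtype:

    import pandas as pd

    df = pd.DataFrame({
        "name": ["Ada", "Grace", "Ada"],                 # categorical: distinct labels
        "phone": ["555-0100", "555-0101", "555-0102"],   # digits, but still categorical
        "age": [36, 45, 36],                             # numeric: arithmetic makes sense
    })

    # Marking a column as categorical stores it as labels rather than free-form
    # strings; averaging a phone number would be meaningless anyway.
    df["name"] = df["name"].astype("category")
    print(df.dtypes)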
When working in Python, it sometimes makes sense to implement parts
of the program in a static, high-performance language. Go can be a
great choice for that because it is fast, simple, and cross-platform.
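The general recipe (sketched below with hypothetical file and function names; the article has the real details) is to compile the Go code into a C shared library and load it from Python with ctypes:

    import ctypes

    # Hypothetical shared library built from Go with:
    #   go build -buildmode=c-shared -o libadder.so adder.go
    # where adder.go marks the function with //export Add and imports "C".
    lib = ctypes.cdll.LoadLibrary("./libadder.so")

    lib.Add.argtypes = [ctypes.c_longlong, ctypes.c_longlong]
    lib.Add.restype = ctypes.c_longlong

    print(lib.Add(2, 3))  # 5, computed by the Go implementation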
If you are like me, every once in a while you write a useful Python
utility and want to share it with your colleagues. The best way to
do this is to make a package: it’s easy to install and saves you
from copy-pasting.
If you are like me, you might be thinking that creating packages is
a real headache. Well, that’s not the case anymore, and I am going
to prove it with this step-by-step guide: just three main steps (and
a bunch of optional ones), accompanied by a few GitHub links.
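As a minimal sketch of the core step (placeholder names; newer guides often use a pyproject.toml instead), a setuptools-based package needs little more than a setup.py next to your code:

    # setup.py -- minimal packaging metadata (placeholder names)
    from setuptools import setup, find_packages

    setup(
        name="my-utility",            # the name used with `pip install`
        version="0.1.0",
        packages=find_packages(),     # picks up the my_utility/ package directory
        install_requires=[],          # runtime dependencies, if any
        entry_points={
            # optional: expose a command-line entry point
            "console_scripts": ["my-utility=my_utility.cli:main"],
        },
    )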
Git has won the race for the most popular version control
system. But why exactly is it so popular? The answer, at least in my
opinion, is pretty clear: branches! They allow you to keep
different versions of your code cleanly separated—ideal when
collaborating in a team with other people, but also for yourself
when you begin working on a new feature.
Although other version control systems also offer some form of
branching, Git’s concept and implementation are just stunning. It
has made working with branches so quick and easy that many
developers have adopted the concept for their daily work.
I’m thrilled to announce Swift
Collections, a new
open-source package focused on extending the set of available Swift
data structures. Like the Swift
Algorithms and Swift
Numerics packages before
it, we’re releasing Swift Collections to help incubate new
functionality for the Swift Standard Library.
The Swift Standard Library currently implements the three most
essential general-purpose data structures: Array, Set and
Dictionary. These are the right tool for a wide variety of use
cases, and they are particularly well-suited for use as currency
types. But sometimes, in order to efficiently solve a problem or to
maintain an invariant, Swift programmers would benefit from a larger
library of data structures.
We expect the Collections package to empower you to write faster
and more reliable programs, with less effort.
A Bloom filter is a probabilistic data structure. It tells you if an
element is in a set or not in a very fast and memory-efficient
way. A Bloom filter can tell you that an element is definitely not
in the set (with 100% certainty), or that it may be in the set,
without being 100% sure. It has only two operations: add, to add an
element, and query, to check whether an element is in the set.
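A toy illustration of those two operations (a simplistic sketch, not the article’s implementation):

    import hashlib

    class BloomFilter:
        def __init__(self, size=1024, num_hashes=3):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = [False] * size

        def _positions(self, item):
            # Derive k bit positions from salted hashes of the item.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def query(self, item):
            # False means "definitely not in the set"; True means "maybe".
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("curl")
    print(bf.query("curl"))   # True: maybe in the set
    print(bf.query("wget"))   # almost certainly False: definitely not in the set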
Each year, larger and larger models prove better at extracting
signal from the noise in machine learning. In particular, language
models get larger every day. These models are computationally
expensive (in both runtime and memory), which makes them costly to
serve to customers and often too slow or too large to run in edge
environments like a phone.
Researchers and practitioners have come up with many methods for
optimizing neural networks to run faster or with less memory
usage. In this post I’m going to cover some of the state-of-the-art
methods. If you know of another method you think should be included,
I’m happy to add it. This has a slight PyTorch bias (haha) because
I’m most familiar with it.
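As a taste of one such method (a generic sketch, not necessarily one of the post’s examples), PyTorch can dynamically quantize the linear layers of a trained model in a single call, trading a little accuracy for a smaller, faster model on CPU:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

    # Dynamic quantization: weights of the listed module types are stored as
    # int8 and dequantized on the fly during inference.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 128)
    print(quantized(x).shape)  # torch.Size([1, 10])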
You have some data in a relational database, and you want to process
it with Pandas. So you use Pandas’ handy read_sql() API to get a
DataFrame—and promptly run out of memory.
The problem: you’re loading all the data into memory at once. If you
have enough rows in the SQL query’s results, it simply won’t fit in
RAM.
Pandas does have a batching option for read_sql(), which looks like
it should reduce memory usage, but it’s still not perfect: it also
loads all the data into memory at once!
So how do you process larger-than-memory queries with Pandas? Let’s
find out.
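One approach (a sketch of the general idea, with an assumed database URL and table name, not necessarily the article’s exact solution) is to combine read_sql()’s chunksize with a SQLAlchemy connection that streams results from the server instead of buffering them client-side:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://localhost/mydb")  # hypothetical database

    with engine.connect() as conn:
        # stream_results=True requests a server-side cursor, so rows arrive in
        # batches instead of the whole result set being buffered in RAM.
        conn = conn.execution_options(stream_results=True)
        total = 0
        for chunk in pd.read_sql("SELECT * FROM big_table", conn, chunksize=10_000):
            # Each chunk is an ordinary DataFrame; aggregate it, then let it go.
            total += len(chunk)

    print(total)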
In the evening I managed to take a few photos of the Pterinochilus
murinus I keep. Because the specimen is very skittish, the only way
I could take those photos was through the plastic container, while
carefully providing some lighting from above with a flashlight.