week 32, 2022

Java Heap Dump Analysis with Examples

I am a big fan of Java Memory Management, and in this article I will try to explain how to take and analyze a heap dump with examples. But first, let’s refresh our minds and remember what we know about this domain. After some theoretical information, we will take a heap dump of a simple application and analyze it.

Source: Java Heap Dump Analysis with Examples, an article by Huseyin Babal.

Loading Dangerously: PyYAML and Safety by Design

The Python standard library json.load does not have “side effects” besides reading a stream of text input. Because I assumed YAML was equivalent to JSON and had not read the 23,000+ word spec, I assumed that PyYAML’s yaml.load had the same properties. Last June, I learned that this was incorrect.

In tip #7 of 10 Common Security Gotchas in Python, I learned that using yaml.load could run arbitrary code. While the danger of this possibility is limited only by your imagination, the article provided the very plausible example of having your passwords emailed to a hacker.

Source: Loading Dangerously: PyYAML and Safety by Design, an article by Cameron Yick.
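
A minimal sketch of the safer route, assuming PyYAML is installed. The helper name `parse_untrusted` and both sample documents are made up for illustration. `yaml.safe_load` only builds plain data and refuses tags that ask for Python objects to be constructed.

```python
import yaml  # third-party: PyYAML


def parse_untrusted(text):
    """Parse YAML from an untrusted source, refusing Python-specific tags."""
    try:
        return yaml.safe_load(text)
    except yaml.YAMLError as exc:
        # safe_load has no constructor for python/object tags, so it raises.
        return f"rejected: {type(exc).__name__}"


# Hypothetical inputs: one plain document, one that asks for a function call.
plain_doc = "name: demo\nports: [80, 443]"
hostile_doc = "!!python/object/apply:builtins.eval ['6 * 7']"

plain_result = parse_untrusted(plain_doc)
hostile_result = parse_untrusted(hostile_doc)
```

The plain document comes back as an ordinary dict, while the hostile one is rejected instead of being evaluated.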

Código Emperador (2022)

Follows Juan, an agent working for the intelligence services, who also reports to a parallel unit involved in illegal activities.

In the evening Esme and I watched Código Emperador. I liked the movie and give it a 7 out of 10.

Rob Pike’s simple C regex matcher in Go

Back in 1998, Rob Pike – of Go and Plan 9 fame – wrote a simple regular expression matcher in C for The Practice of Programming, a book he wrote with fellow Unix hacker Brian Kernighan. If you haven’t read Kernighan’s “exegesis” of this code, it’s definitely worth the 30-minute time investment it takes to go through that slowly.

With Go’s C heritage (and Pike’s influence on the Go language), I thought I’d see how well the C code would translate to Go, and whether it was still elegant.

Source: Rob Pike's simple C regex matcher in Go, an article by Ben Hoyt.
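
For a feel of how small the matcher is, here is my own rough port to Python (not the article's Go code). It follows the structure of the C original from The Practice of Programming and supports literal characters, `.`, `*`, `^` and `$`. The function names are mine.

```python
def match(regexp, text):
    """Return True if regexp matches anywhere in text."""
    if regexp.startswith("^"):
        return match_here(regexp[1:], text)
    # Try every starting position, including the empty tail of the text.
    for start in range(len(text) + 1):
        if match_here(regexp, text[start:]):
            return True
    return False


def match_here(regexp, text):
    """Return True if regexp matches at the beginning of text."""
    if not regexp:
        return True
    if len(regexp) >= 2 and regexp[1] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return not text
    if text and regexp[0] in (".", text[0]):
        return match_here(regexp[1:], text[1:])
    return False


def match_star(char, regexp, text):
    """Match zero or more occurrences of char, followed by regexp."""
    i = 0
    while True:
        if match_here(regexp, text[i:]):
            return True
        if i < len(text) and char in (".", text[i]):
            i += 1
        else:
            return False
```

Slicing copies strings, so this is slower than pointer arithmetic in C, but it keeps the three-function shape of the original.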

Don't Pickle Your Data

Pretty much every Python programmer out there has broken down at one point and used the ‘pickle’ module for writing objects out to disk.

The advantage of using pickle is that it can serialize pretty much any Python object, without having to add any extra code. It’s also smart in that it will only write out any single object once, making it effective to store recursive structures like graphs. For these reasons pickle is usually the default serialization mechanism in Python, used in modules like python-memcached.
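
A small sketch of the "writes each object once" behaviour, using only the standard library. The `node` structure is a made-up example of a graph that points back at itself.

```python
import pickle

# A hypothetical graph node that links back to itself.
node = {"name": "a", "links": []}
node["links"].append(node)

restored = pickle.loads(pickle.dumps(node))

# The cycle survives the round trip: the link points at the restored
# object itself, not at a copy of it.
cycle_preserved = restored["links"][0] is restored
```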

However, using pickle is still a terrible idea that should be avoided whenever possible.

Source: Don't Pickle Your Data, an article by Ben Frederickson.
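
One reason for the warning, sketched with the standard library only: unpickling can call any callable the pickle names. The `Payload` class is a harmless stand-in of my own. A real attacker would pick something nastier than arithmetic.

```python
import json
import pickle


class Payload:
    """Illustrative class: tells pickle to call eval() when loaded."""

    def __reduce__(self):
        # pickle stores "call this callable with these arguments".
        return (eval, ("6 * 7",))


malicious_bytes = pickle.dumps(Payload())

# Loading runs eval("6 * 7") instead of rebuilding a Payload instance.
unpickled = pickle.loads(malicious_bytes)

# JSON, by contrast, only ever yields plain data.
parsed = json.loads('{"expression": "6 * 7"}')
```

Here `unpickled` ends up as the result of the evaluated expression, while the JSON parser hands back the string untouched.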

Test against what won't change

When we write tests, we've inevitably got to choose an interface to write against - for a unit test, this is the interface of whatever unit is under test, usually the type signature of a function or the public methods on a class. These unit interfaces tend to change often in response to refactoring, optimisations, new requirements and so on, and they should be able to change quickly too, or we can't easily make any of these important improvements.

Instead what we want to do is write our tests against interfaces that seldom change, and thus we should target public, external interfaces, which are (generally) better designed and much slower to change than the non-exposed interfaces on units. For example, in our scenario above, we could treat our system as a black box and use its HTTP API as the interface that we use to test it (e.g. using supertest), use a mocked HTTP API for the web service it calls (e.g. using nock), and run it against a real database that we reset after each test.

Source: Test against what won't change, an article by Alex Gilleran.
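
The article's examples use JavaScript tools. As a rough Python equivalent of the idea, here is a sketch that tests a toy service only through its HTTP interface, using the standard library. The handler, the `_build_greeting` helper, and the `/Ada` path are all invented for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def _build_greeting(name):
    # Internal helper: free to change, because the test never calls it.
    return {"greeting": f"Hello, {name or 'world'}!"}


class GreetingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(_build_greeting(self.path.strip("/"))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def run_black_box_check(path="/Ada"):
    """Start the service on a free port and query it over HTTP."""
    server = HTTPServer(("127.0.0.1", 0), GreetingHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection(
            "127.0.0.1", server.server_port, timeout=5
        )
        conn.request("GET", path)
        response = conn.getresponse()
        result = (response.status, json.loads(response.read()))
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

The check depends only on the URL and the JSON shape, so `_build_greeting` can be renamed or rewritten without touching the test.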

How to Choose the Right Python Concurrency API

The Python standard library offers 3 concurrency APIs.

How do you know which API to use in your project?

In this tutorial you will discover a helpful step-by-step procedure and questions to guide you to the most appropriate concurrency API.

After reading this guide, you will also know how to choose the right Python concurrency API for current and future projects.

Source: Choose the Right Python Concurrency API, an article by Jason Brownlee.

Flexible JSON transformations in Rust

The JSON format remains one of the most popular text data formats for Data-in-Transition. You can encounter JSON data on every stack level of your application: from the database to UI, from IoT sensor data to the mobile app’s payload. And it is not a coincidence; the format strikes a good balance between developer convenience and decent payload density. In the Rust ecosystem, the de-facto standard for dealing with JSON is Serde. Although it is the best choice for most cases, there can be alternative approaches that work better for your application. We are going to cover one of these approaches in this article.

Source: Flexible JSON transformations in Rust, an article by Alexander Galibey.

Creating a JSON logger for Flask

By default Flask writes logs to the console in plain-text format. This can be limiting if you intend to store your logs in a text file and periodically send them to a central monitoring service. For example, Kibana only accepts JSON logs by default.

You might also want to enrich your logs with additional metadata, e.g. timestamps, method names, log type (Warn, Debug, etc.). In this post we will use the Python logging library to modify Flask's logging format and write the logs to a text file. In the end we will see how to periodically send these logs to an external service using Flume.

Source: Creating a JSON logger for Flask, an article by Adeel Ahmad.
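
A minimal sketch of the core idea with just the standard `logging` module, without Flask or Flume. The `JsonFormatter` class, the `build_json_logger` helper, and the field names are my own choices, not the article's.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "function": record.funcName,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_json_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate plain-text output
    return logger


# Usage: log into an in-memory stream and read the entry back.
stream = io.StringIO()
logger = build_json_logger("demo.json", stream)
logger.warning("disk usage at %d%%", 91)
entry = json.loads(stream.getvalue())
```

Swapping the `StringIO` for an open file gives one JSON object per line, which is the shape most log shippers expect.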

Avoiding space leaks at all costs

Haskell programs are infamous for having lots of space leaks. This is the result of Haskell choosing the lazy evaluation model and not designing the language around preventing this type of memory usage error.

Investigating and fixing space leaks brought tons of frustration to Haskell developers. Believe it or not, I’m not a fan of space leaks either. However, instead of fighting the fire later, you can use several techniques to prevent the catastrophe in the first place.

Source: Avoiding space leaks at all costs, an article by Dmitrii Kovanikov.

Best practices for inclusive textual websites

I realize not everybody’s going to ditch the Web and switch to Gemini or Gopher today (that’ll take, like, at least a month /s). Until that happens, here’s a non-exhaustive, highly-opinionated list of best practices for websites that focus primarily on text. I don’t expect anybody to fully agree with the list; nonetheless, the article should have at least some useful information for any web content author or front-end web developer.

Source: Best practices for inclusive textual websites, an article by Rohan Kumar.

unblob - extract everything!

unblob is an accurate, fast, and easy-to-use extraction suite. It parses unknown binary blobs for more than 30 different archive, compression, and file-system formats, extracts their content recursively, and carves out unknown chunks that have not been accounted for.

unblob is free to use and licensed under the MIT license. It has a command line interface and can be used as a Python library. This turns unblob into the perfect companion for extracting, analyzing, and reverse engineering firmware images.

Source: unblob - extract everything!.

See also: binwalk.

Using unwrap() in Rust is Okay

One day before Rust 1.0 was released, I published a blog post covering the fundamentals of error handling. A particularly important but small section buried in the middle of the article is named “unwrapping isn’t evil”. That section briefly described that, broadly speaking, using unwrap() is okay if it’s in test/example code or when panicking indicates a bug.

I generally still hold that belief today. That belief is put into practice in Rust’s standard library and in many core ecosystem crates. (And that practice predates my blog post.) Yet, there still seems to be widespread confusion about when it is and isn’t okay to use unwrap(). This post will talk about that in more detail and respond specifically to a number of positions I’ve seen expressed.

Source: Using unwrap() in Rust is Okay, an article by Andrew Gallant.

A friendly introduction to Principal Component Analysis

Principal component analysis (PCA) is probably the most magical linear method in data science. Unfortunately, while it's always good to have a sense of wonder about mathematics, if a method seems too magical it usually means that there is something left to understand. After years of almost, but not quite fully understanding PCA, here is my attempt to explain it fully, hopefully leaving some of the magic intact.

Source: A friendly introduction to Principal Component Analysis, an article by Peter Bloem.