friso.lol

I'm being serious here.

The Anatomy of a Tiny Software Product

TL;DR

  • I made a tiny software product, called Walk from Home.
  • It is a web application that has exactly one feature and is made up of about 2000 lines of code.
  • Nonetheless, the implementation uses 24 different technology concepts, each of which has entire books dedicated to it.
  • Seemingly trivial product decisions can easily double or halve that number, which is why product engineering is hard.
  • A software product is not the same thing as a computer program. Confusing these may lead some to believe that small products are easy.
  • Software products (not computer programs) are hard because, amongst other things, users expect more than ever before, non-functional concerns (e.g. security) have grown to previously unimaginable proportions, and distribution needs to take into account devices that you didn't know existed.
  • They will become harder, because machine learning will be the norm, and proprietary data sets are the new unfair competitive advantage.

Prelude

I made a tiny software product called Walk from Home (WfH). It has one feature: given a list of participants and their postal codes, generate pairs of participants that minimise the total distance participants have to travel in order to meet their pairing partner. The list of participants is provided by uploading a Microsoft Excel spreadsheet, and the resulting pairings are obtained by downloading a different Microsoft Excel spreadsheet.

I did not create WfH because I think it will be hugely successful. In fact, I would be surprised if anyone uses it. The need was not researched, I did not talk to any prospective users, and I have no clue if there is a market for the solution. The odds of launching a product without research and it being a hit are against me… I just created WfH to see what it is like to single-handedly build a usable software product in 2021.

The Anatomy of WfH

---------------------------------
Language           files     code
---------------------------------
Jupyter Notebook       8      736
JavaScript            11      734
Python                 6      391
Markdown               1       65
Dockerfile             1       30
HTML                   2       26
CSS                    1        3
JSON                   1        3
---------------------------------
SUM:                  31     1988
---------------------------------

To appreciate how tiny WfH is as a product, consider the table above: a grand total of about 2000 lines of code were written.

The Code

To write and deploy those 2000 lines, you need working knowledge of 3 languages, 2 development frameworks, 6 development tools, 3 cloud services, and 4 notable third party libraries.

The list of required working knowledge or necessitated research:

  • JavaScript, JSX, HTML, CSS
  • Python
  • Bash scripting
  • ReactJS
  • FastAPI
  • git
  • Docker
  • Webpack
  • npm
  • pip
  • pipenv
  • Google Cloud Run
  • Google Cloud Container Registry
  • Google Cloud DNS
  • Notable JS libraries: LeafletJS
  • Notable Python libraries: Pandas, NumPy, xlsxwriter

Additionally, WfH contains a couple of neat challenges under the hood: combinatorial optimisation, and geocoding.

Matching Pairs

The number of distinct pairs you can create from n participants is n(n - 1) / 2. For 200 participants, that is roughly 20 thousand candidate pairs. WfH attempts to find the pairings that minimise the total distance participants have to travel. The number of candidate pairs is quite manageable: it grows quadratically with the number of participants. Yet, the evaluation of a single candidate solution is computationally expensive, as each unique pairing requires a distance calculation between two geocoded postal codes. With a heavily optimised implementation, brute force could be an option. But in the end a greedy hill climbing approach was used, allowing for a non-optimised pure Python implementation, which is easier to write.
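To make the idea concrete, here is a minimal sketch of what a greedy hill climbing approach to pairing could look like in pure Python. It is not the actual WfH code: the function names, the random starting point, and the "swap partners between two pairs" move are all assumptions chosen for illustration, and hill climbing of this kind can get stuck in a local optimum.

```python
import math
import random
from itertools import combinations


def total_distance(pairs, coords):
    """Sum of straight-line distances over all pairs."""
    return sum(math.dist(coords[a], coords[b]) for a, b in pairs)


def hill_climb_pairs(coords, seed=0, max_rounds=100):
    """Greedy hill climbing over pairings (illustrative, not the WfH code).

    coords maps a participant id to an (x, y) tuple. With an odd number of
    participants, the last one after shuffling is simply left unpaired.
    """
    rng = random.Random(seed)
    ids = list(coords)
    rng.shuffle(ids)
    # Start from an arbitrary pairing of neighbours in the shuffled list.
    pairs = [(ids[i], ids[i + 1]) for i in range(0, len(ids) - 1, 2)]

    def d(a, b):
        return math.dist(coords[a], coords[b])

    for _ in range(max_rounds):
        improved = False
        for i, j in combinations(range(len(pairs)), 2):
            (a, b), (c, e) = pairs[i], pairs[j]
            current = d(a, b) + d(c, e)
            # The two other ways to pair up these four participants.
            options = [((a, c), (b, e)), ((a, e), (b, c))]
            best = min(options, key=lambda o: d(*o[0]) + d(*o[1]))
            # Accept the swap only if it strictly shortens the total.
            if d(*best[0]) + d(*best[1]) < current - 1e-9:
                pairs[i], pairs[j] = best
                improved = True
        if not improved:
            break  # no swap helps any more: a local optimum
    return pairs
```

Calling `hill_climb_pairs({"A": (0, 0), "B": (0, 1), "C": (10, 0), "D": (10, 1)})` returns a list of two tuples. Running the search from several seeds and keeping the best result is a cheap way to reduce the chance of ending in a poor local optimum.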

WfH uses the distance along a straight line between geocoded postal codes, but a more useful implementation would use travel time as a distance measure. Simply switching the calculation from straight line distance to travel time, however, would have easily doubled, if not tripled, the development effort.

The straight line distance calculation is simplified by geocoding the postal codes to the Dutch RD coordinate system, which is approximately Euclidean within its valid area. With RD coordinates, centroid calculations are a single NumPy operation (np.mean(...)). They also avoid the Haversine formula during evaluation.
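As a rough illustration of why projected coordinates are convenient, the sketch below treats RD coordinates as plain (x, y) values in metres. The coordinate values and function names are made up for the example and are not taken from WfH.

```python
import numpy as np


def centroid(points):
    """Centroid of an (n, 2) array of projected (x, y) coordinates."""
    return np.mean(np.asarray(points, dtype=float), axis=0)


def straight_line_distance(p, q):
    """Euclidean distance in metres between two projected points."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.linalg.norm(diff))


# Hypothetical RD-style coordinates in metres (illustrative values only).
home = (121000.0, 487000.0)
office = (121300.0, 487400.0)

distance_m = straight_line_distance(home, office)
middle = centroid([home, office])
```

With latitude and longitude in degrees, neither the mean nor the difference of coordinates would be meaningful in metres without a spherical formula such as Haversine; in a projected system both reduce to ordinary vector arithmetic.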

Adding to the list of required working knowledge or necessitated research:

  • Approaches to combinatorial optimisation problems
  • Basic understanding of geo coordinate systems / map projections
  • Dutch RD coordinate system
  • Distance calculations

Geocoding

The few vendors (including Google) who provide geocoding of entities (postal codes, addresses, points of interest) as a service charge handsomely for their work. A free product like WfH would not be viable when relying on a geocoding API. Luckily, WfH only requires a minimal subset of all geocoding features: convert a Dutch postal code into a single coordinate that is roughly at the center of the area. Instead of full geocoding, it needs a geocoded Dutch postal code table. Simply switching the location input from postal codes to full addresses would have made the entire product financially infeasible.

How do you obtain a geocoded Dutch postal code table? It turns out you can buy one; the price tag is several hundred Euro. Not prohibitively expensive, but just out of range for a hobby project. Besides, Dutch addresses, postal codes, and building geometries are open data. Navigating open data portals provided by the Dutch government is an art form in and of itself, but after some digging you should be able to find a downloadable version of the complete BAG, the registration data for addresses and buildings. From there on, it is only a matter of parsing tens of gigabytes of XML, extracting the postal codes from all addresses, joining with building geometries (a separate set of XML files), and determining the centroid for each unique postal code. The solution for WfH involves a combination of Python and Bash scripts. The essential trick is to make sure that the XML parser is a streaming (event based) parser, so you can pass over the gigabytes of XML once without running out of memory. At this point, I could have gone into the business of selling postal code tables, but I chose to finish WfH instead.
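The streaming trick can be sketched with the standard library's `xml.etree.ElementTree.iterparse`. The element names `address` and `postcode` below are placeholders for the example; the real BAG files use their own schema and namespaces, and this is not the script used for WfH.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace: '{urn:example}postcode' -> 'postcode'."""
    return tag.rsplit("}", 1)[-1]


def iter_postcodes(source, record_tag="address", field_tag="postcode"):
    """Yield postal codes one record at a time without loading the file.

    record_tag and field_tag are hypothetical names for this sketch.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != record_tag:
            continue
        for child in elem.iter():
            if local_name(child.tag) == field_tag and child.text:
                yield child.text.strip()
        # Drop the record's children and text once it has been handled.
        # An empty element shell stays attached to its parent.
        elem.clear()


# Tiny in-memory stand-in for a multi-gigabyte file.
sample = io.BytesIO(
    b'<root xmlns="urn:example">'
    b"<address><postcode>1011AB</postcode></address>"
    b"<address><postcode>1012CD</postcode></address>"
    b"</root>"
)
postcodes = list(iter_postcodes(sample))
```

Because the function is a generator, its output can be written straight to a text file and handed to command line tools for sorting and joining, which keeps the Python side small.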

Adding to the list of required working knowledge or necessitated research:

  • The BAG dataset and its entities
  • Geography Markup Language (GML)
  • Streaming XML parsing
  • Effective use of the cut, sort, uniq, and join CLI tools

Distribution

WfH is a web application. It could also have been a single Python script in a GitHub repo somewhere and provided exactly the same functionality. It could also have been a mobile app that all participants would need to install to create a schedule. Choice of distribution matters a lot when it comes to adoption. The distribution for WfH has one user do a lot of work to make it easier for all other participants. A web application also holds a nice middle ground in terms of complexity. No app stores, approval processes, etc., but definitely not as trivial as leaving a script on GitHub.

Additionally, I set a number of hard constraints for WfH: no user accounts, no email, and no server side state. These help stay within the budget for a free solution (zero), avoid compliance risk with data privacy or other legislation, and maintain simplicity. Such constraints also help guide the decision on which distribution to use. Simply switching from web application to mobile would have easily quadrupled development effort. Just leaving a script on GitHub does not make a software product; it just makes a computer program, which is not the same thing.

Conclusion

Tallying the total of tools, techniques, programming languages, concepts, frameworks, and libraries listed above, we find that WfH depends on a grand total of 24. Each of these has entire books dedicated to it. Any seemingly minimal change in the product specification could have doubled or halved that list. That is why product engineering is hard: a minimal change in specification can mean the difference between fairly doable and completely infeasible. The number of features does not matter. For WfH with only one feature, we have highlighted two design decisions that made the difference between feasibility and failure.

Whenever someone asks me "How much does a good mobile app cost?", my default answer is "one million per year (EUR)". My response is typically met with astonishment. Or contempt, thinking I am not taking the question seriously. But let's see what you can buy for about a million a year.

  • 2 computer programmers: EUR 195000,- / year
  • 1 UI / UX designer: EUR 78000,- / year
  • 1 senior product manager: EUR 104000,- / year
  • Cloud hosting and tooling: EUR 15000,- / year
  • Need machine learning? Add 1 ML engineer: EUR 117000,- / year
  • Want a game? Add 1 game developer: EUR 97500,- / year
  • Need integrations? Add 1 additional computer programmer: EUR 97500,- / year

It adds up to roughly half a million per year, give or take, so one million is clearly exaggerating. Unless you want anyone to use your app; then that is where the other half of the million goes: promotion, and driving engagement.
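For those who want to check the arithmetic, the figures from the list can be tallied in a few lines. The split into a base team and optional extras is my reading of the list, not something it states explicitly.

```python
# Yearly costs in EUR, copied from the list above.
base_team = {
    "2 computer programmers": 195_000,
    "1 UI / UX designer": 78_000,
    "1 senior product manager": 104_000,
    "cloud hosting and tooling": 15_000,
}
optional_extras = {
    "1 ML engineer": 117_000,
    "1 game developer": 97_500,
    "1 additional computer programmer": 97_500,
}

base_total = sum(base_team.values())
full_total = base_total + sum(optional_extras.values())

print(f"base team:       EUR {base_total:,}")
print(f"with all extras: EUR {full_total:,}")
```

The base team comes to EUR 392,000 and the team with every extra to EUR 704,000, so "roughly half a million" sits between the two, depending on which extras you need.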

Things were not always this way. There was a time when software and computer programs were the same thing. One boss person would manage the programmers, and the programmers would deliver the software. Distribution was shipping a floppy disk. What has happened since? Well, here are a couple of things that happened:

  • Users expect more. Way more. For example, search used to be a database query with wildcards. Now you are expected to do auto completion, spelling correction, synonym matching, some minimal learning to rank implementation, and deliver all of it with sub-100 ms p99 latencies.
  • Security is a really big deal. If the wrong group of people is bored for an afternoon, and your property is not fronted with some form of DDoS protection, all bets are off. If you care about sleeping at night, hosting on a single VPS and forgetting about it is no longer an option.
  • Devices, devices, devices. Just when you think you understand where people run web browsers, someone puts one in the console of a car, or on a refrigerator.

Today, when we talk about software, we talk about products. Under the hood, there could be many computer programs, which are of no concern to our users. Someone asking for "just a simple app" often has the computer program in mind, not the software product.

The evolution is likely to continue. Here are two things that I believe are likely to happen soon:

  • Machine learning is the norm. Any piece of interaction or user experience that can be improved with ML will be improved with ML. In the same way that implementing good search was once a very hard problem, until things like Lucene, Solr, and Elasticsearch commoditised it, there will be use case focused, off the shelf ML stacks that bring it to the masses of developers.
  • Proprietary data as the basis for product. Perhaps not for everyone, but there will be a substantial number of products that can exist exclusively because the creator put effort into the acquisition or construction of proprietary (annotated) data sets. This effort is not to be underestimated.

If you care about doing products, I am afraid it is time to look into these on top of everything you already have to know…