TinyDB: A Fast and Easy Document DB without a Server
TinyDB is a pure-Python NoSQL database. It stores all data in a JSON file, with a document id for each record saved. It also has a nice set of CRUD functions and uses caching. TinyDB is, however, a really cut-down database system, with only the basic CRUD features for reading and writing to a JSON file. There are no ACID guarantees.
So why use TinyDB? Well, it’s standalone Python code, so it’s encapsulated in your code base. No server software needs to be installed or maintained. Also, since it’s a document database, no tables or schemas need to be defined. This makes it great for cases where you have chunks of data you want to access at random, without having to load and search the whole data blob yourself.
TinyDB is also TINY. According to the project, the source is about 1,800 lines of code, of which roughly 40% is documentation, plus about 1,600 lines of tests. This makes it really easy to use within your own code.
My Use Case
The FDA has a database of all drugs (medications) available in the US. Each of them has NDC codes, not just for the product but also for the different ways it is packaged. This data also includes drug classes, active ingredients, brand names and generic names. The zip file containing the JSON with all this data is 233 MB, and it holds data on 130,000 medications.
Now I want to be able to access this data to create test records, and also to find medications of selected classes or by active ingredient. The code for this needs to be used by multiple applications, and I don’t want to install an additional database server. Since my use cases are really read-only, I’m not bothered about write speeds or updating the records.
With TinyDB, I’m also using one of the available extensions, BetterJSONStorage. This swaps the standard JSON library for orjson. Orjson is a LOT faster and the saved records are smaller. This highlights one of the positive attributes of TinyDB: it’s REALLY easy to extend. All the code is available to you. Overriding classes, or adding new ones to the code base, is very simple. BetterJSONStorage is an example of this. It’s really just a single class that replaces the standard storage class with one using orjson. You can write your own storage classes with ease, and there is documentation on how to do this.
Writing to TinyDB
You need to be careful when writing to your TinyDB database, as this is not the fastest thing in the world. Every write goes back to the file on disk and rewrites the stored data, so writing one record at a time is VERY slow.
With that said, TinyDB has an insert_multiple() method, which can take a lot of records at once. I wrote out my records in batches of 1,000, and it took 37.74 seconds. Since I’m only using this for reading, this is more than fast enough. My database only really needs to be updated once a year or so.
I did try writing one record at a time. I got to 3 hours and gave up. I suspect I could write batches of more than 1000 and speed things up, but the result is more than fast enough for my needs.
I will add that this timing does include pushing each medication record through a Pydantic BaseModel to extract and reformat the records to my needs. (I like Pydantic!!)
Reading Records
A big negative for TinyDB is the lack of an indexing system. Anybody who has dealt with document databases knows that fast retrieval of records depends on indexes.
Getting a record based on a query looking for a matching product NDC code took 0.016053 seconds. This is not too bad when you’re looking for a match in 130,000 records.
I decided to create my own indexes. Retrieving a record based on the document id assigned to each record is super fast, just like in any database. Product NDC codes are an important data element. They are unique to each medication in the FDA database and make up part of the package NDC code, which is given to a product sold in a particular package (number of tabs etc.). The product NDC is made of two numbers with a dash between them, for example 72839-217.
I split this into two strings of numbers and created a hierarchical index. Creation of the index took a couple of seconds, using the following code:
def create_ndc_index(self):
    # Empty the index database so it can be rebuilt from scratch.
    self.ndc_index.truncate()
    ndc_index = {}
    for doc in self.db.all():
        # "72839-217" becomes first-level key "72839" and second-level key "217".
        product1, product2 = doc["product_ndc"].split("-")
        if product1 not in ndc_index:
            ndc_index[product1] = {product2: doc.doc_id}
        else:
            ndc_index[product1][product2] = doc.doc_id
    # One index document per first number, written in a single batch.
    self.ndc_index.insert_multiple([{"name": k, "records": v} for k, v in ndc_index.items()])

The index is stored in its own TinyDB database instance and not in the main one. To find a record, we take the first number, find it in the index, then within that record find the second number in a dictionary, which will return the document id.
When retrieving the same record using the index, it took 0.000064 seconds, about 250 times faster than the unindexed query. This makes for a pretty fast database. I then created indexes for medication classes, generic names, and ingredients, again in their own TinyDB instances. Indexes took between 1.5 and 2.7 seconds to create since, as you can see, I just empty the database and recreate the index.
This also highlights that retrieval speeds with TinyDB depend on the number of records. Since it’s easy to do, you can split data across multiple TinyDB instances.
Conclusion
TinyDB works extremely well for my use case. Data access is fast, and I don’t have to set up database software and load my data into it on every system where I want to use my code.
I can also save the database in the codebase since the database files are small.
I can also see TinyDB being used in other use cases. The code base is very easy to write extensions for, and it’s easy to spread data over multiple databases. It’s also nicely encapsulated in your code base, with no need to install and configure external database servers. Async support is also available through a prewritten extension that uses the aiofiles library.
TinyDB is a simple idea done well. Kudos to its developers. In most of the cases where I have had to use NoSQL databases, I think TinyDB would have done a better job.
