Why we really need to think about Reporting and Data Warehouses for smaller apps


Why we really need to think about Reporting and Data Warehouses for smaller apps

I have been building Web applications for over a decade and nearly all of them require some sort of report and analytics, but I missed something.

I was doing my usual wandering around looking at concepts and ideas and spent some time looking more deeply into Data Warehouses. This isn’t an area I’ve had any dealings with, as I mostly deal with startups, but there were points made about the warehouses that made me rethink my ideas on reporting.

OLAP and OLTP

When looking are warehouses, we first need to understand why they are different from your box standard Relational Database. RDB’s are defined as OLTP, or Online Transactional Processing. In simpler terms, this is a database where we expect a lot of writes happening. Writing to a database needs to have all sorts of protections to ensure that records are added property, write are either completed entirely or not at all, and of course dealing with nasty little things like race conditions.

Postgresql, MySQL etc, have a lot of their processing power pointed towards ensuring this is all done safe and secure. This includes NoSQL databases as well.

Now there is OLAP, or Online Analytical Processing. This is where nearly all of the work in on reading records, and the reads have to be done faster, as we are looking at a lot of records. We can do this with an OLTP based database, but they are optimized for protecting data when writing, and this can and does slow down the reads. Optimizing for reading and analytics, is one of the core concepts of Data warehouses.

So how does this effect how we build applications. I have recently been working on applications in the medical clinic area, so I will use this as an example. If we want to run a report of patient activity over the past year, with several aggregates, so the data can be used in charts, the processing power is very CPU and memory intensive. I know in at least a couple of cases, reports can only be run outside of working hours, or they create a whole copy of the database and run reports on that.

So what do Data Warehouses do differently and why should we take note.

First off, lets look at an example of file and process management for Data Warehouses. There are multiple types. but Hadoop is one of the first and is quite easy to understand. Records are spread over multiple files and the Warehouse starts a process for each file, so extraction of data is done in an asynchronous fashion. The data from each process goes through a process they call MapReduce. In basic terms, this means filtering out records, and removing fields that are not needed. The data is then run through a GroupBy type process, where records are compressed and values that are to be used in computations are moved to be in lists. Data from the various processes are then merged together, final aggregates calculated and the results returned. For the online services, they are multiple instances used and so a LOT of processes can be started and run at the same time.

This makes for a very fast read system. It should also be noted at the data is unnormalized, so all data is stored in a single table. So if a patient is proscribed a medication, the patient record and medication record are added together and then put in the database.

This brings up the last, and I think, most important concept. Records are added to the warehouse every time a change is made to the main database, along with a timestamp of when the change occurred.

Why do I think this is important. Imagine you want to run a report of patients prescribed a medication and lived in a specific area 5 years ago. You create a query saying get patients who’s address is that area, with a prescription created in the date range you set. Your results are wrong!

The problem is your looking at where the patients live now, not where they lived 5 years ago. So you say you will update your database to keep old Addresses, but this makes the query even more complex.

In the data warehouse, because its unnormalized and uses time stamps. you can just grab records for your date range, for the area and for the medication. The record in the warehouse holds the address of the patient at the time they got the medication.

This makes a data warehouse, not only faster at reading and analytics, but also makes its easier to produce queries and returns better results.

This works even better when we look at requirements for dashboards. They all need to have time related events and need to access the data fast.

So Why are we not all using data warehouses

At this current time, Data Warehouses are nearly all online. They are all made to house REALLY big data. Our clinics kinda don’t have that much data. A hospital might, but not the smaller guys. The same goes for many business applications. Also businesses like clinics etc, are not going to take well to using online/cloud services

The second thing is that Data Warehouses are difficult to design and set-up. They take a lot of work and the people with the skills are not cheap.

So what can we do.

An RDB does not make the best Data Warehouse, but it can be used as such. Postgresql has additional modules for dealing with time series data and more advanced analytics.

The Database however needs to be seen as separate form the application database. OLAP and OLTP. You can’t mix them. I’ve also noticed that NoSQL databases generally don’t seem to react well for analytics of this size. For those who can use the cloud services, I do recommend looking into and maybe utilizing them for reporting instead of your OLTP database. For those you can’t, consider having an RDB set to one side configured like a Data Warehouse and use this for reporting. Reporting against your OLTP database will always be problematic.