Ask HN: Is our data warehouse setup normal or over-complicated?
4 points | 2 hours ago | 1 comment
I've been pulled onto a new feature that replaces some of our existing customer-facing reports with reports from the data warehouse. This isn't the first report from the data team that we've integrated into the product, but because it involves existing reports that I'm the local expert on, I'm getting drawn into the process. The current reports don't have any performance issues, but the decision to change has been made anyway.

From what I've been able to gather, the data goes from the production MySQL database to a secondary MySQL database using DMS. Then come the Glue jobs that ship the data out to a data lake in S3. After that there are several transformation jobs that I've been told convert the data into a "canonical" form, smoothing out all the differences between verticals. I think they said that the data then goes into a second data lake, where additional transformations are performed. Finally, the data reaches its final resting place in Redshift, where QuickSight is used to create reports. I'm fairly certain I missed a couple of steps, because I just couldn't figure out the purpose of each step as they were describing the process.

Getting reports out of that process seems painful. Showing a report for an internal customer (sales or customer support for instance) means they need a QuickSight account and access to the specific report. Getting access to that for myself was not straightforward, which makes me think it is hand-managed by a dev.

For showing a report in the product, it feels worse. First, the data team are about the only people who can create these reports: not only do the product devs not know this "canonical" form, but getting the development environment running consistently for product devs has been like pulling teeth. Once someone has written the report, they have to promote it by copying it exactly, including an identical report id, to another region. Finally, the report id is given to the product team to put into the product. Adding the report id to the product is the easiest part, but the data journey doesn't stop there. The product has to pass that report id and user information to a lambda the data team maintains, which generates a URL for the product to embed in an iframe. And after all of that, the report doesn't come close to matching the look of the site.
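For anyone unfamiliar with that last hop, here is a minimal sketch of what such a lambda might look like. It assumes the reports are QuickSight dashboards and that the lambda uses boto3's `generate_embed_url_for_registered_user` call; I don't know which API the data team actually uses. The event field names (`account_id`, `user_arn`, `report_id`) and the `build_embed_request` helper are hypothetical.

```python
def build_embed_request(account_id, user_arn, dashboard_id, lifetime_minutes=60):
    """Assemble the keyword arguments for the QuickSight embed-URL call."""
    return {
        "AwsAccountId": account_id,
        "UserArn": user_arn,
        "SessionLifetimeInMinutes": lifetime_minutes,
        "ExperienceConfiguration": {
            "Dashboard": {"InitialDashboardId": dashboard_id}
        },
    }


def handler(event, context, client=None):
    """Hypothetical Lambda entry point: report id and user in, embed URL out."""
    if client is None:
        # Imported lazily so the sketch can be exercised without AWS access.
        import boto3
        client = boto3.client("quicksight")

    request = build_embed_request(
        account_id=event["account_id"],   # assumed event field
        user_arn=event["user_arn"],       # assumed event field
        dashboard_id=event["report_id"],  # the "report id" handed to the product team
    )
    response = client.generate_embed_url_for_registered_user(**request)
    # The product would place this URL in the iframe's src attribute.
    return {"embed_url": response["EmbedUrl"]}
```

The optional `client` parameter is only there so the function can be tested with a stub; a real deployment would also need IAM permissions and an allowed embedding domain, which this sketch leaves out.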

Is this data warehouse setup normal? Is this a common way to handle in-product reports after a company invests in a data warehouse? There are a lot of what seem like redundant steps, as well as a lot of custom code for what I would expect to be built into these products.

icedchai
1 hour ago
Without understanding the differences between the "source" and "canonical" forms, it is tough to say. Also, how much data are we actually talking about? The pipeline you describe may be entirely reasonable, or it may be an over-engineered, convoluted contraption that could be replaced with a single DB replica and a few views to simplify queries.
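To make the "replica plus a few views" alternative concrete, here is a rough sketch. It uses an in-memory SQLite database purely as a stand-in for a MySQL read replica, and the `orders` table, its columns, and the `customer_totals` view are all made up for illustration; none of them come from the setup described above.

```python
import sqlite3


def build_demo_db():
    """Create a stand-in 'replica' with one source table and one reporting view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (
            customer_id  INTEGER,
            vertical     TEXT,
            amount_cents INTEGER
        );
        INSERT INTO orders VALUES
            (1, 'retail', 1200),
            (1, 'retail', 800),
            (2, 'hospitality', 500);

        -- The view hides the aggregation so report queries stay simple.
        CREATE VIEW customer_totals AS
            SELECT customer_id,
                   vertical,
                   COUNT(*)          AS order_count,
                   SUM(amount_cents) AS total_cents
            FROM orders
            GROUP BY customer_id, vertical;
        """
    )
    return conn


def report_for(conn, customer_id):
    """Fetch one customer's report rows straight from the view."""
    cur = conn.execute(
        "SELECT vertical, order_count, total_cents "
        "FROM customer_totals WHERE customer_id = ?",
        (customer_id,),
    )
    return cur.fetchall()


conn = build_demo_db()
print(report_for(conn, 1))  # [('retail', 2, 2000)]
```

The point is only that the app keeps querying ordinary SQL against a copy of production, with no intermediate lakes or transformation jobs; whether that is enough depends on what the canonical form is really buying you.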

My experience with QuickSight has been pretty negative. The overall UI/UX is pretty meh. If you're embedding it in your product, you may be better off generating your own reports in-app.

ealready_value
1 hour ago
The source form is the production database, which is what the current reports pull from. The canonical form is the form that, in theory, all of the verticals get rolled into, but many of the nuances that our customers are used to having end up getting replaced with something similar but not quite the same. Right now that's my biggest concern: that customers are not going to get the data they need because of this canonical form.

We're talking about a few hundred megabytes of data across all of the customers that these reports pull from, and that covers the past 15 years. We do have like 25k customers, which shrinks how much any one customer can pull even further. One last point is that we already de-normalize the report data into its own table specifically for these reports, so that's not something the data warehouse is doing for us.
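As a back-of-the-envelope check on those numbers, here is the arithmetic spelled out. The 300 MB total is a placeholder I picked for "a few hundred megabytes", not a measured figure, and the result is an average that ignores how unevenly data is spread across customers.

```python
def avg_bytes_per_customer(total_megabytes, customer_count):
    """Average data volume per customer, using decimal megabytes (1 MB = 1,000,000 bytes)."""
    return total_megabytes * 1_000_000 / customer_count


ASSUMED_TOTAL_MB = 300  # placeholder for "a few hundred megabytes"; not measured
CUSTOMERS = 25_000      # "like 25k customers"
YEARS = 15              # span of history the reports cover

per_customer = avg_bytes_per_customer(ASSUMED_TOTAL_MB, CUSTOMERS)
per_customer_per_year = per_customer / YEARS

print(f"{per_customer / 1000:.0f} KB per customer, "
      f"{per_customer_per_year:.0f} bytes per customer-year")
# prints "12 KB per customer, 800 bytes per customer-year"
```

At that scale, the average customer's entire report history is on the order of kilobytes.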

I agree with your take on QuickSight; it matches my experience exactly. My preference is to continue using the reports we generate in the app, but I'm trying to wrap my head around cases where this ends up being the better direction.
