May 9, 2023

Redshift: Body Hiding as a Service

When I started writing this post, it was to be a single, helpful guide for myself. Before long, it turned into a pretty gnarly rant about Redshift as a product, rather than a truly marvelous technical solution. The helpful guide still does exist, for what it's worth. Check it out as a sibling post.

In full disclosure: I do work with an organization that has adopted Redshift as an internal tool for various reporting purposes, as of mid-to-late 2021. In hindsight, the decision to bring it in as a tool likely had little to do with its prowess and capabilities, and more to do with certain personnel confusing the term "growth" for "infinite amounts of cash to blow".

We leveraged I/O as follows:

Frequency	Operation	Mechanism
Once	WRITE all tables from external, primary datasource(s)	AWS DMS + Vendor integration
Ad-hoc	WRITE custom views to cache aggregate data used solely for reporting purposes	AWS Redshift Console + DBMS Tools
Ad-hoc	WRITE DDL changes that have occurred on the primary datasource(s)	AWS DMS
Ad-hoc	READ from various tables and views	DBMS Tools + other query tools
Regularly	READ from various tables and views	Scheduled, and systemic processes

Those WRITE operations are inevitable, regardless of the destination data source, sure. But they specifically require going through AWS managed toolchains, either S3 for a COPY or via AWS Database Migration Service for table mapping or other ETL jobs.

Allow me to pause my uppity browbeating for a moment. I personally find the tooling behind Redshift to be badass. It:

is run on its own custom flavor of SQL
can [allegedly] act as a decent "big-data" solution, offering scaling up to PB of data quickly
comes with training wheels:
- a phenomenal documentation layer
- a web-based query console

Being badass never lets one off the hook, though. I'm sure you could ask Bezos.

PaaS vs. DBaaS

Look at those WRITEs again. None of them are from an SDK. None of them are from a code base. In fact, it seems that (obligatory: at the time of writing this) Redshift really still doesn't cater to development teams as a primary data source for server applications.

Its SDKs cater to managing Redshift clusters and its assets, rather than invoking queries
The official Data API documentation includes instructions on how to map result sets by hand to JDBC objects
There is no official supported ORM for Redshift
The client-side SDK seems to be exactly what it would have taken to construct the Amazon internal cluster management UI within the AWS management console

I'm not trying to build a case that Redshift was at all designed for the purpose of acting as a server application's primary datasource. It's important to acknowledge that it's a profoundly accessible data storage solution with seemingly great potential for scale, and capability of catering to various use-cases. Oh, also, by accessible I mean, "widely adoptable for those with a credit card burning a hole in their pocket."

However, I think it seems obvious as to why this purpose was not its goal in the first place. In fact, recalling the I/O operations I expressed above, it's truly no mystery as to why there are no official Amazon-led connectors/SDKs/ORM libraries that bind any particular language to Redshift capabilities:

If the programmatic capabilities a developer has with a platform are to create and manage new infrastructure, then the platform remains a platform to be managed, rather than a tool to be leveraged.

It might be in Amazon's best interest, strictly from a margins and maintenance cost perspective, that Redshift remain a pure PaaS product. Unfortunately, when developers (and not data scientists) are the ones responsible for the operations involving Redshift I/O, it becomes painfully clear that it's simply a SQL database with a bunch of added cost.

DBaaS...

Amazon's own Relational Database Service (RDS) contrasts with Redshift in ways that really helped me arrive to my conclusion.

RDS virtual instances have smaller downtime on average over Redshift clusters during re-scaling/elastic resizing
RDS is cheaper, on average, per use case (at the time of writing)
1. An RDS db.t3.large (1 core, 2vCPUs, 8GB RAM) is $0.136/hour, vs. the minimal cluster charge of $0.25/hour on Redshift
Allegedly, Redshift's query optimizer, famous for making its "big data" capabilities a reality, make it slower on average than RDS for queries under the several million record count

Redshift was marketed well, and perhaps even had the best of intentions to disrupt the market of big data tools. It certainly smells like a big custom SQL database cloud with purposeful dependencies that hike the cost of using it effectively. It seems that it is most effectively serving entities who have so much data to play with that dumping it in a central place to stream outwards in creative ways. I hope their pockets are deep. So does Amazon, I'm sure.

References

Redshift Rank
Redshift Data API
Redshift SDK Examples
Redshift Java SDK Overview
Redshift Python SDK Overview
Redshift Client-side SDK
RDS vs. Redshift Analysis
- This is from Hevo, an ETL and "data pipeline" platform. Not necessarily a competitor to, and lists Redshift as a source they work with out of the box. Take that as you will...