May 9, 2023
When I started writing this post, it was meant to be a single, helpful guide for myself. Before long, it turned into a pretty gnarly rant about Redshift as a product, rather than a truly marvelous technical solution. The helpful guide still exists, for what it's worth. Check it out as a sibling post.
In full disclosure: I do work with an organization that adopted Redshift as an internal tool for various reporting purposes, as of mid-to-late 2021. In hindsight, the decision to bring it in likely had little to do with its prowess and capabilities, and more to do with certain personnel confusing "growth" with "infinite amounts of cash to blow".
We leveraged I/O as follows:
| Frequency | Operation | Mechanism |
|---|---|---|
| Once | WRITE all tables from external, primary datasource(s) | AWS DMS + Vendor integration |
| Ad-hoc | WRITE custom views to cache aggregate data used solely for reporting purposes | AWS Redshift Console + DBMS Tools |
| Ad-hoc | WRITE DDL changes that have occurred on the primary datasource(s) | AWS DMS |
| Ad-hoc | READ from various tables and views | DBMS Tools + other query tools |
| Regularly | READ from various tables and views | Scheduled, and systemic processes |
Those WRITE operations are inevitable regardless of the destination datasource, sure. But here they specifically require going through AWS-managed toolchains: either S3 for a COPY, or AWS Database Migration Service for table mappings and other ETL jobs.
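To make that S3 dependency concrete, here is a hypothetical sketch of what a Redshift "write" looks like from code: there is no data-loading SDK, so you stage files in S3 and hand the cluster a plain-SQL COPY statement. The bucket, table, and IAM role below are placeholders, not real resources.

```python
# Hypothetical sketch: ingesting data into Redshift means staging files
# in S3, then issuing a COPY statement against the cluster. All resource
# names below are made up for illustration.

def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build the COPY statement Redshift needs to ingest staged S3 files."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS CSV\n"
        f"IGNOREHEADER 1;"
    )

sql = build_copy_statement(
    table="reporting.orders",
    s3_prefix="s3://example-bucket/orders/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
)
print(sql)
```

You would then execute that string over the Postgres wire protocol with any generic Postgres driver, which is rather the point: the "integration" amounts to a SQL string and an S3 path.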
Allow me to pause my uppity browbeating for a moment. I personally find the tooling behind Redshift to be badass.
Being badass never lets one off the hook, though. I'm sure you could ask Bezos.
Look at those WRITEs again. None of them are from an SDK. None of them are from a code base.
In fact, it seems that (obligatory: at the time of writing this) Redshift still doesn't cater to development teams as a primary datasource for server applications.
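A hypothetical sketch of what the READ side looks like from application code: with no official SDK or ORM, you fall back to a generic Postgres driver, since Redshift speaks the Postgres wire protocol (default port 5439). The host, database, and view names below are placeholders.

```python
# Hypothetical sketch: reading from Redshift in application code means
# treating it as a plain Postgres endpoint. Nothing below is
# Redshift-specific except the default port. All names are placeholders.

conn_params = {
    "host": "example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "port": 5439,  # Redshift's default port
    "dbname": "analytics",
    "user": "reporting_ro",
    "password": "...",
}

# A typical reporting read against a cached aggregate view.
query = "SELECT report_date, total_orders FROM reporting.daily_order_totals;"

# With any generic Postgres driver (e.g. psycopg2), this is then just:
#
#   with psycopg2.connect(**conn_params) as conn, conn.cursor() as cur:
#       cur.execute(query)
#       rows = cur.fetchall()
#
# i.e. nothing Redshift-aware at all -- which is the point.
```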
I'm not trying to build a case that Redshift was at all designed to act as a server application's primary datasource. It's important to acknowledge that it's a profoundly accessible data storage solution with seemingly great potential for scale, capable of catering to various use-cases. Oh, also, by accessible I mean, "widely adoptable for those with a credit card burning a hole in their pocket."
However, I think it's obvious why this was never its goal in the first place. In fact, recalling the I/O operations above, it's truly no mystery why there are no official Amazon-led connectors/SDKs/ORM libraries that bind any particular language to Redshift's capabilities:
If the programmatic capabilities a developer has with a platform are to create and manage new infrastructure, then the platform remains a platform to be managed, rather than a tool to be leveraged.
It might be in Amazon's best interest, strictly from a margins and maintenance cost perspective, that Redshift remain a pure PaaS product. Unfortunately, when developers (and not data scientists) are the ones responsible for the operations involving Redshift I/O, it becomes painfully clear that it's simply a SQL database with a bunch of added cost.
Amazon's own Relational Database Service (RDS) contrasts with Redshift in ways that really helped me arrive at my conclusion.
A db.t3.large instance (1 core, 2 vCPUs, 8GB RAM) runs $0.136/hour, versus the minimum cluster charge of $0.25/hour on Redshift.

Redshift was marketed well, and perhaps even had the best of intentions to disrupt the market of big data tools. But it certainly smells like a big, custom SQL database cloud with purposeful dependencies that hike the cost of using it effectively. It seems to most effectively serve entities with so much data to play with that dumping it all in a central place to stream outward in creative ways is worthwhile. I hope their pockets are deep. So does Amazon, I'm sure.
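To put those hourly rates in perspective, a quick back-of-the-envelope monthly comparison, assuming on-demand pricing at the rates quoted above and a 730-hour month:

```python
# Back-of-the-envelope monthly cost comparison between a small RDS
# instance and the smallest Redshift cluster, at the on-demand hourly
# rates quoted above. 730 is roughly the average hours in a month.
HOURS_PER_MONTH = 730

rds_hourly = 0.136       # db.t3.large on RDS
redshift_hourly = 0.25   # minimum Redshift cluster charge

rds_monthly = rds_hourly * HOURS_PER_MONTH
redshift_monthly = redshift_hourly * HOURS_PER_MONTH

print(f"RDS db.t3.large:    ${rds_monthly:.2f}/month")
print(f"Redshift (minimum): ${redshift_monthly:.2f}/month")
print(f"Redshift costs {redshift_monthly / rds_monthly:.0%} of the RDS price")
```

Even before touching DMS, S3 staging, or storage, the floor for an always-on cluster is nearly double the comparable RDS instance.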