What is Redshift?
Amazon Redshift is a fully-managed, petabyte-scale data warehouse, offered only in the cloud through AWS. Like other analytical data warehouses, Redshift is a columnar store, making it particularly well-suited to large analytical queries against massive datasets.
Redshift is also fast enough for interactive querying against large-scale datasets. Redshift makes it easy to scale clusters up or down through the AWS Console, the command line, or an API, and can support clusters with as few as one node and as many as 128. Queries run on Redshift Spectrum are allocated compute resources dynamically, with no pre-planning necessary and no maximum.
Redshift is a hugely popular data warehouse, offering a nice balance between easy set-up/maintenance and robust customization options. As organizations build the knowledge to monitor and optimize their Redshift cluster for their specific workloads, they can achieve even greater throughput.
Reasons to choose Redshift
Redshift strikes a great balance between the robust customization of on-premise MPP databases and the simplicity of cloud MPP databases. For data-savvy companies, Amazon Redshift hits the sweet spot with respect to cost, flexibility, ease of use, and performance.
Redshift is fully managed in the cloud, making it very easy to scale your Redshift instance up or down completely through your browser. Redshift’s Spectrum feature also gives you interactive query access to data stored directly in S3 — without importing it into Redshift — via a dynamically scaled compute engine, making resources available on-demand in seconds and allowing virtually unlimited scalability.
Amazon offers powerful options to optimize your Redshift cluster for your specific use case, including different types of nodes, many types of compression, and several ways to partition your data tables. Smartly wielded, these customizations can result in blazingly fast performance from your cluster.
What use cases is Amazon Redshift great for?
Companies with large amounts of growing data
If your company is used to querying transactional databases, you’ll be flabbergasted at the data volumes that Redshift can handle with ease. Queries on millions (or even billions) of rows return in milliseconds instead of minutes or hours.
For companies switching from an existing enterprise data warehouse, Redshift achieves the same amazing performance without the DevOps headaches and often at a fraction of the cost. Additionally, its cloud architecture means that scaling for growing data volumes is painless.
With the addition of Spectrum, Redshift has grown even more powerful. Redshift can now access data in your S3 data lake directly. Because S3 storage is inexpensive and essentially limitless, you can architect a multi-layered data warehousing solution with your most-queried data in tables and less-frequently queried data always available via Spectrum.
Companies looking for a familiar Data Warehousing system and SQL dialect
Amazon Redshift uses an architecture familiar from many on-premise EDW systems (clusters of nodes that contain storage and compute resources, with a leader node for query planning), but optimized for the cloud. This helps with migrations from on-prem environments, since data engineers already know the architecture. Redshift’s SQL dialect is based on PostgreSQL, which is hugely popular and well known by analysts around the world.
Companies looking to tap into a huge ecosystem of services and partners
Redshift belongs to a huge (and growing) roster of solutions within the AWS ecosystem, which provides tools for all kinds of use cases, ranging from raw developer tools to fully managed services. Amazon Redshift is only one of the cloud data solutions Amazon offers, and it integrates fully with other solutions within the larger AWS ecosystem.
Redshift in particular has an incredible partner network, with each company in the network certified to work with Redshift. Partners within Redshift’s partner network assist users with everything from data integration to business intelligence. This allows Redshift to provide unparalleled support and integrations for its product.
Companies working with semi-structured files
With Redshift Spectrum, companies are able to run queries against exabytes of structured and semi-structured data sitting in Amazon S3 without any data movement. Spectrum uses Redshift’s query planning resources to optimize the query beyond simple file access, and it supports multiple file formats including CSV, TSV, Parquet, Sequence, and RCFile.
Companies needing predictable pricing
Because you’re essentially renting cloud resources from AWS, it’s easy to predict what your costs will be with Redshift and get the most performance you can out of your cluster given your cost constraints, whether that be a few hundred dollars per month, or many thousands.
Companies needing workload management
Redshift has killer workload management configuration options, meant to ensure that organizations can appropriately prioritize the queries they want to return first, so that urgent queries aren’t stuck behind slow-moving processes.
Before you commit
AWS offers both on-demand and reserved pricing. For proof-of-concepts, on-demand pricing is a great way to test different cluster configurations to get optimal performance; however, it’s much more cost-effective to choose reserved pricing if you have already selected Amazon Redshift as your preferred data warehouse technology.
The price you pay for Redshift is fixed depending on your hardware configuration. Spectrum, which queries data directly on S3, is priced on a pay-per-usage model. For Redshift especially, it’s important to right-size your cluster, so that you get the performance you need, but don’t pay for capacity you won’t use.
Trying it for free
Redshift comes with a free trial that includes 750 hours per month on a DC1.Large node for 2 months, for a total of 1500 hours. We generally recommend running a trial on several nodes to get a realistic sense of performance under production circumstances. If you are interested in seeing how Looker and Redshift perform together, Looker periodically has free trial codes that allow you to extend past the 750 hours and build a full Proof of Concept. Contact Looker for more information.
Redshift database architecture
Redshift is a distributed MPP cloud database designed with a shared nothing architecture, which means that nodes contain both compute (in the form of CPU and memory), and storage (in the form of disk space). Redshift is composed of two types of nodes: leader nodes and compute nodes.
Leader nodes in Redshift receive queries from client applications and parcel those queries out to compute nodes where the relevant data is stored. It’s important to note that while the leader node compiles the query plan and sends it out, the actual work on Redshift is mostly performed on compute nodes.
Compute nodes in Redshift store portions of table data and execute the code received from the leader node, then send the results back upstream to be aggregated by the leader node.
There are two classes of compute nodes from which to choose. Each cluster must be composed entirely of one kind or the other. The node types are:
- Dense compute (SSD): The smaller dense compute option (DC1.Large) provides 0.16TB of storage per node; the larger dense compute option (DC1.8XLarge) provides 2.56TB of storage per node. Both are optimized for performance and are associated with faster I/O. The tradeoff is that space is limited, so the cluster needs constant monitoring and scaling if large amounts of data are continually flowing into the cluster.
- Dense storage (HDD): The smaller dense storage option (DS2.XLarge) provides 2TB of storage per node; the larger dense storage option (DS2.8XLarge) provides 16TB of storage per node. While I/O is slower than the dense compute alternatives, there is typically plenty of disk space, so the cluster’s disk space may require less constant monitoring than dense compute nodes.
Upsizing the Redshift cluster involves either upgrading the individual types of nodes (vertically scaling the cluster), or adding more nodes to the cluster itself (horizontally scaling the cluster).
For more information, Amazon gives a great overview of the Redshift architecture on this page.
Types of data
Amazon Redshift stores its data in columnar format. It can ingest tabular data from delimited text files, JSON, or Avro, and can also ingest data directly from other AWS services like EMR and DynamoDB. Amazon Redshift queries relational data using SQL, converting all incoming raw data into a relational-columnar format, the specifics of which are abstracted from the user. (For those familiar with Parquet, the format is conceptually similar.)
Amazon Redshift performs well on flattened, denormalized data and on normalized schemas (provided that distribution and sort keys are well-defined to make joins efficient). Because of Redshift’s efficient compression, both redundancy and sparseness of data are less of a concern in Amazon Redshift than they are in many other databases. Semi-structured and unstructured data can’t be imported into a Redshift table, but stored in S3 files, they can be accessed directly with Redshift Spectrum via an Athena data catalog or a Hive Metastore.
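As a sketch of how Spectrum access is set up, you first register an external schema against a data catalog and then define external tables over S3 files (all names, the IAM role ARN, and the bucket path below are hypothetical placeholders):

```sql
-- Register an external schema backed by a data catalog
-- (hypothetical database and role names)
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files in S3
CREATE EXTERNAL TABLE spectrum_schema.raw_events (
    event_id BIGINT,
    payload  VARCHAR(65535)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/raw-events/';
```

Once defined, `spectrum_schema.raw_events` can be queried (and joined against local Redshift tables) with ordinary SQL, while the data itself stays in S3.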
AWS Redshift uses a similar dialect to that of PostgreSQL, with many of the same analytical capabilities. As such, it’s relatively easy to pick up and adapt to its few minor differences.
Maximum recommended size of data
The largest Amazon Redshift cluster can store 2 petabytes of data in tables, which is an enormous amount of data. In all likelihood, you’d run up against concurrent query limitations far before you run out of space.
With the addition of Spectrum, Redshift’s query limit essentially disappears, since Spectrum can query buckets in S3, the size of which is basically unlimited. Additionally, because Spectrum dynamically pulls in compute resources as needed per-query, concurrency limitations aren’t an issue for queries run through Spectrum.
AWS Redshift implementation & use
Implementing Amazon Redshift: a couple of things to keep in mind
AWS Redshift is a fully-managed solution, which means it’s easy to spin up a cluster with the exact specifications you have in mind through the AWS console. Amazon takes care of all the hardware considerations for you.
However, Redshift gives you many options to optimize your schema and tables, and so it’s important to gain a good understanding of best practices (which AWS helpfully outlines).
Loading data: doing it yourself
Data is most commonly loaded into AWS Redshift via the COPY command, which leverages the Amazon Redshift MPP architecture to read and write files in parallel. Using Amazon Redshift with Spectrum, you can also access data directly where it lives in S3 without importing it. That gives you five methods to access data in Redshift:
- Using the Redshift COPY command to copy data from Amazon S3
- Using Redshift COPY to copy data from DynamoDB
- Copying data from Hadoop Amazon Elastic MapReduce (EMR)
- Using Redshift COPY with SSH to copy data from remote hosts
- Creating an Athena data catalog or a Hive metastore to access data stored in S3 with Spectrum and query it directly from Redshift
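For the most common of these, an S3 load via COPY might look like the following sketch (the table name, bucket path, and IAM role ARN are hypothetical):

```sql
-- Load gzipped CSV files from S3 into an existing table
-- (hypothetical table, bucket, and role names)
COPY events
FROM 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
GZIP;
```

Because COPY reads files in parallel across the cluster, splitting large inputs into multiple files generally loads faster than a single large file.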
Think about table design
If you’re loading your data yourself, it’s important to think about table design before you begin since some parameters must be chosen at table creation that will strongly impact performance down the line. We recommend Amazon’s Tuning Table Design tutorial to understand how to best design tables that will be loaded into your Redshift instance for maximum performance.
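To make the point concrete, a table definition that locks in these choices at creation time might look like this sketch (the columns and key choices are hypothetical, for illustration only):

```sql
-- Distribution and sort keys must be chosen at table creation
-- (hypothetical schema)
CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type VARCHAR(64),
    created_at TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_id)      -- rows with the same user_id land on the same node
SORTKEY (created_at);  -- data stored in time order for range filters
```

Changing these keys later typically means rebuilding the table, which is why it pays to think them through before the first load.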
Loading data: utilizing a data integration partner
There are many third-party ETL and data integration services endorsed by AWS that make the process of loading data into Redshift much easier for organizations. Depending on exact needs, we tend to recommend partners like Stitch, Fivetran, Matillion, Talend, and Alooma.
Here are some questions to consider when choosing a data integration partner:
Which data sources do you need to replicate? Make a list of which data sources (e.g. Salesforce, HubSpot, Zendesk, Google Analytics, etc.) you want to centralize in your Redshift cluster and then find a partner that can support them. Some providers have APIs to make it easy to build custom integrations as well.
How is the service priced? If you have lots of integrations with a little data in each, a service priced by data volume may be best. If you have only a few integrations with a lot of data, then a service priced by integration would be better.
Do you want to transform the data before query time? If masking sensitive data or transforming data before it’s loaded into Redshift is important, make sure to choose an integration partner that supports the types of transformations you need.
Do you have any special requirements around data hosting or certifications (e.g. HIPAA)? The AWS cloud has top notch security but some industries and countries mandate that customer data reside within a specific geography or that vendors comply with specific protocols.
Tools for optimizing Redshift performance
Redshift offers many resources and tools to help optimize the performance of your Redshift cluster. Understanding how these tools work can rapidly increase performance, while forgoing their use can lead to suboptimal performance.
Every table in Redshift can have one or more sort keys. A sort key is like an index: Imagine looking up a word in a dictionary that’s not alphabetized — that’s what Redshift is doing if you don’t set up sort keys. Redshift stores data in 1MB blocks, storing the min and max values for each sort key present in that block. The main benefit of sort keys is that Redshift can skip over irrelevant blocks of data when a sort key is present and the query is limited by that column, vastly boosting performance when querying large data sets.
If you have a 50 billion-row event table with three years of data, you may often need to run queries on just “today” or “last 7 days”. If you have a sort key on created_at and include a filter on that field, Redshift will be able to skip over 99% of rows (i.e., blocks of data) when executing the query.
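That filter pattern can be sketched as follows (assuming a hypothetical events table with a sort key on created_at):

```sql
-- Redshift can skip blocks whose min/max created_at
-- fall outside the last 7 days
SELECT event_type, COUNT(*)
FROM events
WHERE created_at >= DATEADD(day, -7, GETDATE())
GROUP BY event_type;
```

The key is that the filter is on the sort key column itself; wrapping the sort key in a function or expression can prevent Redshift from pruning blocks.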
Joins and distribution keys
Join performance in Redshift can be greatly improved when tables are distributed efficiently across nodes. Using distribution keys to minimize data transfers during joins can significantly improve query performance.
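A common pattern is to distribute two frequently joined tables on the same key, so matching rows are co-located on the same node and the join requires no data shuffling. A minimal sketch, with hypothetical table names:

```sql
-- Both tables distributed on the join column (hypothetical schemas)
CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR(128)
)
DISTKEY (customer_id);

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12, 2)
)
DISTKEY (customer_id);

-- This join can run node-locally, with no cross-node data movement
SELECT c.name, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.name;
```

For small lookup tables, DISTSTYLE ALL (a full copy on every node) is another way to avoid shuffling during joins.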
For large, immutable datasets that grow over time (e.g. log data, web tracking data), separating tables by day, week, or month is a good way to speed up query times when not all queries require full history. Redshift offers views as a way to union all of the underlying tables so that they can effectively be queried as if they were a single table.
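Such a union view might be sketched like this (the monthly table names are hypothetical):

```sql
-- Present time-partitioned tables as one logical table
-- (hypothetical monthly tables)
CREATE VIEW events_all AS
SELECT * FROM events_2023_01
UNION ALL
SELECT * FROM events_2023_02
UNION ALL
SELECT * FROM events_2023_03;
```

Queries that only need recent data can hit the latest monthly table directly, while full-history queries go through the view.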
Encoding & compression
Compression can vastly reduce the overall amount of data stored on disk in Redshift, which subsequently reduces the amount of data read for each query. Redshift does a good job automatically selecting appropriate compression encodings if you let it, but you can also set them manually.
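You can ask Redshift to recommend encodings for an existing table, or set them explicitly per column at creation. A sketch, with hypothetical table and encoding choices:

```sql
-- Ask Redshift to recommend encodings based on a sample of the data
ANALYZE COMPRESSION events;

-- Or set encodings explicitly per column (hypothetical choices)
CREATE TABLE events_encoded (
    event_id   BIGINT      ENCODE zstd,
    event_type VARCHAR(64) ENCODE bytedict,
    created_at TIMESTAMP   ENCODE delta32k
);
```

If you load with COPY into an empty table, Redshift will also sample the data and apply automatic compression by default, which is a reasonable starting point before hand-tuning.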
Vacuum & analyze
In order to reclaim space from deleted rows and properly sort data that was loaded out of order, you should periodically vacuum your Redshift tables. Keeping table statistics up to date with the ANALYZE command is also critical for optimal query planning.
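A typical maintenance pass might look like the following sketch (the table name is hypothetical; scheduling is up to you):

```sql
-- Reclaim space from deleted rows and re-sort out-of-order data
VACUUM FULL events;

-- Refresh table statistics so the planner has accurate row counts
ANALYZE events;
```

Both commands can be scoped to a single table as above or run database-wide; many teams schedule them during off-peak hours since VACUUM in particular can be I/O-intensive.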