How to leverage AWS to build affordable big data solutions?

Dovydas Stankevicius
4 min read · Mar 4, 2021

When designing big data solutions it is important to understand that not every system is the same. There is no "one size fits all" solution for all of your business needs. In general, businesses implement two types of data analytics: batch analytics and stream processing. Batch analytics is used for processing data sets over large periods of time, for example to generate monthly or quarterly reports. Stream processing helps businesses take real-time data feeds, compute insights and take immediate action, for example by continuously analyzing order data for an e-commerce website and automatically promoting less popular items to boost their sales.

Data streaming

Streaming data can be defined as a continuous flow of data generated by various sources. Using stream processing technology, data streams can be processed, stored, analyzed, and acted upon as they are generated, in real time. Common streaming use cases are log analytics, IoT sensor data and clickstream data.

Kinesis Data Streams

Kinesis Data Streams is used for capturing, processing and storing data feeds. Using data streams requires custom code to implement data producers and consumers. Records put on the stream can be retained for up to a week, and it is up to the consumer to read the stream and apply custom business logic. The service does not auto scale, so it is up to the developer to provision stream capacity: one shard provides ingest capacity of 1 MB/sec or 1,000 records/sec.

There are two techniques for consuming stream data: enhanced fan-out or the GetRecords API. When data consumers opt in to enhanced fan-out, each shard provides up to 2 MB/sec of data output for each consumer, with roughly 70 ms latency. Do note that this option is more expensive, so use it only if your data is time sensitive. With the GetRecords API, each shard provides up to 2 MB/sec of data output shared across all consumers reading from that shard in parallel. Stream capacity can be altered at any time using shard splitting and shard merging operations.

If you are producing a large number of small events, you might be throttled by the 1,000 records/sec limit. This would require adding more shards to the stream, making it more expensive. Using the KPL (Kinesis Producer Library) and KCL (Kinesis Client Library) for producing and consuming messages can be a major price optimization: the KPL automatically aggregates small records into a single API call over a specified time window. This increases your shard throughput but introduces a longer delay before data delivery. For example, producing 3,000 messages per second at 300 bytes each would require 3 shards because of the records-per-second limit. With KPL aggregation a single shard is enough, as the total throughput is under 1 MB/sec. This simple optimization results in a threefold cost saving.
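As a rough illustration of the batching idea, here is a minimal boto3 producer sketch that groups small events into PutRecords calls; the stream name and the order_id field are placeholders. Note that plain PutRecords batching only reduces API call overhead, while the KPL's record aggregation is what actually packs many small events into a single Kinesis record and lifts the effective records-per-second ceiling.

```python
import json
import boto3

# Hypothetical stream name; one shard accepts up to 1 MB/sec or 1,000 records/sec.
STREAM_NAME = "orders-stream"

kinesis = boto3.client("kinesis")

def put_batched(events, batch_size=500):
    """Send small events in batches via PutRecords instead of one call per record.

    PutRecords accepts up to 500 records per request, which cuts API overhead;
    KPL aggregation goes further by packing many events into one Kinesis record.
    """
    for i in range(0, len(events), batch_size):
        chunk = events[i:i + batch_size]
        response = kinesis.put_records(
            StreamName=STREAM_NAME,
            Records=[
                {
                    "Data": json.dumps(event).encode("utf-8"),
                    "PartitionKey": str(event["order_id"]),  # assumed field
                }
                for event in chunk
            ],
        )
        # Records throttled by shard limits come back with FailedRecordCount > 0.
        if response["FailedRecordCount"]:
            print(f"{response['FailedRecordCount']} records throttled; retry or add shards")
```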

Kinesis Firehose

Kinesis Data Firehose should be used if you are looking for a managed consumer that does not require any custom code. Typical use cases are continuous data delivery into an S3 data lake or streaming application data directly into an ELK stack. Kinesis Data Firehose charges for the amount of data transferred over the stream, and it autoscales, which makes it a great choice if you have irregular data flow patterns. This can be a great cost saver, because you don't have to pay for over-provisioned resources or for stream idle time.
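A minimal sketch of how little producer code Firehose needs, assuming a delivery stream (named clickstream-to-s3 here purely for illustration) that is already configured to buffer and deliver into S3:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream configured to buffer and write into an S3 data lake.
DELIVERY_STREAM = "clickstream-to-s3"

def send_event(event: dict) -> None:
    """Push a single event; Firehose handles buffering, scaling and S3 delivery."""
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_event({"page": "/checkout", "user_id": 42})
```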

Kinesis Data Analytics

Kinesis Data Analytics allows real-time data analysis using simple SQL statements. This technology is often used in the e-commerce space to answer questions such as "how many times has a specific product been sold in the last thirty minutes?". To answer this question, order data is ingested into a data stream and inspected by a Data Analytics application using a thirty-minute sliding window aggregation. This method can help identify the most popular products and drive strategic business decisions, such as increasing promotions for less popular products to boost their sales. Data Analytics is priced per hour of processing capacity, which can add up to a significant cost. Before choosing this technology, consider other processing methods; sometimes a simple Lambda function is enough.
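As a hedged sketch of that Lambda alternative: the handler below consumes a Kinesis-triggered batch and counts orders per product. The event fields are assumptions, and a true thirty-minute sliding window would still need external state such as DynamoDB; this only aggregates within one invocation.

```python
import base64
import json
from collections import Counter

def handler(event, context):
    """Lambda triggered by a Kinesis stream: count orders per product in this batch.

    A real thirty-minute sliding window needs external state (e.g. DynamoDB);
    this sketch only aggregates the records delivered to one invocation.
    """
    counts = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        counts[payload["product_id"]] += payload.get("quantity", 1)  # assumed fields

    # Downstream you might persist these partial counts or emit CloudWatch metrics.
    print(dict(counts))
    return {"batchItemFailures": []}
```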

Data lakes and batch processing

Amazon S3 should always be the foundation when building a data lake on AWS. Amazon provides native S3 integration for the major big data tools such as Spark, Athena, Redshift and Hadoop. S3 should be used to decouple storage from compute resources. This allows you to run transient EMR clusters on EC2 Spot Instances, which is a major price optimization for data processing costs.
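A minimal boto3 sketch of such a transient cluster, using Spot Instances for the core nodes and terminating itself once the step completes; the cluster name, IAM roles and S3 paths are placeholders:

```python
import boto3

emr = boto3.client("emr")

# A transient cluster: it runs the submitted step and terminates, so you only
# pay for the processing time. Names, roles and S3 paths are placeholders.
response = emr.run_job_flow(
    Name="nightly-spark-batch",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    LogUri="s3://my-data-lake/emr-logs/",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the steps finish
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT"},  # Spot Instances for the workers
        ],
    },
    Steps=[{
        "Name": "daily-aggregation",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-data-lake/jobs/aggregate.py"],
        },
    }],
)
print(response["JobFlowId"])
```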

Choosing the right file format

Choosing the right file format for storing data is another important optimization when designing big data systems. For example, using Amazon Athena for ad hoc data exploration on classic CSV or JSON-based formats can result in poorly performing and very expensive queries, because Athena's pricing model is based on the amount of data scanned by the service. Large cloud bills are often a direct result of using an AWS service incorrectly. The solution is to store data in a columnar format such as ORC or Parquet. This simple optimization lets Athena read only the parts of each file a query needs, significantly improving query performance and reducing cost. Further savings can be achieved by partitioning data in S3 under different key prefixes that match your business data access patterns.
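As a simple illustration (one of several possible approaches), raw CSV can be rewritten as partitioned Parquet with pandas and pyarrow before Athena queries it; the bucket, prefix and partition column are placeholders, and reading s3:// paths assumes s3fs is installed:

```python
import pandas as pd

# Convert a raw CSV export into partitioned Parquet before querying with Athena.
# Bucket, prefix and the partition column are placeholders for illustration.
orders = pd.read_csv("s3://my-data-lake/raw/orders.csv", parse_dates=["order_date"])
orders["order_month"] = orders["order_date"].dt.strftime("%Y-%m")

orders.to_parquet(
    "s3://my-data-lake/curated/orders/",
    engine="pyarrow",
    partition_cols=["order_month"],  # becomes S3 prefixes Athena can prune
    index=False,
)
```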

S3 data tiering

Choosing the right S3 storage class can be another major price optimization. Amazon S3 Standard should be used for frequently accessed data, while Amazon S3 Standard-IA suits less frequently accessed data. Amazon S3 Glacier Deep Archive is the cheapest storage class Amazon provides and should be used for long-term archives that are accessed only once or twice a year. Glacier is also often used for compliance: its Vault Lock policies can act as a WORM (write once, read many) control, preventing data deletion for regulatory purposes.
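Tiering is usually automated with a lifecycle rule rather than done by hand. A minimal boto3 sketch, with the bucket name, prefix and transition days as placeholder values:

```python
import boto3

s3 = boto3.client("s3")

# Lifecycle rule (bucket and prefix are placeholders): keep fresh data in
# S3 Standard, move it to Standard-IA after 30 days and to Glacier Deep
# Archive after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```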

Produced by: cloudblatic.eu
