AWS Glue and Athena: Using Partition Projection to perform real-time queries on highly partitioned data

Ravi Intodia
10 min read · Jul 4, 2021


Many organizations use S3 as storage for their IoT, telematics, or other high-volume data-generating use-cases. There are several ways to get your data into S3, for which you can refer to the “Moving data to Amazon S3” section of https://aws.amazon.com/s3/getting-started/.

Many development teams focus heavily on getting the data to S3 and don’t put enough thought into how the data is stored and organized within S3. When we say how data is stored in S3, it means:

  1. Folder structure
  2. File formats
  3. File compression
  4. Encryption
  5. Several other factors which impact usage of data in S3

Since there are many use-cases for which S3 can be used, we will focus on one specific use-case for the rest of this article. Our use-case highlights are:

  1. Store JSON data coming from thousands of IoT devices
  2. Provide an efficient and fast way of querying the data in real time whenever needed
  3. Provide a cost-effective and optimized way of querying historic data whenever needed

Now, in order to implement the above use-case, we will have to focus on different aspects of S3 data storage and retrieval:

Encryption:

We recommend encrypting all your data stored in the cloud, be it in S3 or anywhere else. Let’s go with S3 default encryption.

Folder Structure:

Organizing your data in the right folder structure is important in order to avoid hitting S3 rate limits for GET/PUT/POST/COPY/DELETE requests. Refer to the AWS documentation on S3 request rate limits for an understanding of throttling issues.

For our use-case, a partitioned folder structure of the form deviceid=&lt;id&gt;/year=&lt;yyyy&gt;/month=&lt;mm&gt;/day=&lt;dd&gt; will be a good fit.

Based on this folder structure:

  1. A parent folder for each device id reduces throttling issues, as rate limits are effectively increased by distributing the files of different devices separately.
  2. A well-partitioned folder structure organizes all files per device per day into separate folders that are easily identifiable using the year/month/day folder values.
  3. Folder naming like fieldname=value further optimizes querying of this data using services like Athena.
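Putting these points together, the source bucket layout might look like the sketch below (the bucket and file names are hypothetical):

```
s3://iot-device-data-json/
  deviceid=1000001/
    year=2021/
      month=01/
        day=01/
          device_1000001_202101010005.json
          device_1000001_202101010010.json
          ...
```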

File Compression:

JSON files can be compressed to gzip or any other compression format to reduce file size and hence save on storage costs in the long term.

File Format:

Here we are storing multiple JSON files per device per day, which will result in a large volume of small files and, in turn, issues when trying to query the data. For querying large amounts of data, it is recommended to limit the number of files to be queried as well as to use more optimized formats like Apache Parquet and Avro.

Data Transformation for Optimized Retrieval:

Since countless small JSON files are not a recommended source for querying large volumes of data, it’s better to reorganize the small data files into bigger files, along with using optimized formats.

To begin with, we need to set up a Glue Crawler to create a table on the JSON data bucket, which will be used to query the JSON data. For creating Glue Crawlers, refer to the AWS Glue documentation.

While creating the Glue Crawler, provide the target database name as “iot_device_data_db” and let the table be created based on the S3 bucket name, which will be “iot_device_data_json”. Also, the Glue Crawler needs to be scheduled to run periodically to update partitions in the Glue Catalog table.

Once the Glue Catalog table is updated, the data can be queried using Athena. Below is a sample query for fetching data for device id “1000001” for date “1-Jan-2021”.
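A query of that shape might look like the following sketch. The partition column names (deviceid, year, month, day) follow from the fieldname=value folder convention and are assumptions; crawler-inferred partition types are typically strings, hence the quoted values:

```sql
-- Fetch data for device 1000001 on 1-Jan-2021 from the crawled JSON table.
-- Partition column names/types are assumptions based on the folder layout.
SELECT *
FROM iot_device_data_db.iot_device_data_json
WHERE deviceid = '1000001'
  AND year = '2021'
  AND month = '01'
  AND day = '01';
```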

Now an AWS Glue ETL job can be scheduled to convert the source JSON data into an optimized format (Parquet/Avro/ORC, etc.) and store it in the target S3 location.

We will consider Parquet for our scenario. For Parquet, the optimized file size is between 512 MB and 1024 MB, so we need to look at our data volumes to decide how we want to combine the data:

a. One Parquet file per day per device

b. One Parquet file per day for all devices

c. More than one Parquet file per day for all devices: files can be split using criteria fields like device type, device model, series, etc.

d. For even lower data volumes, we can create one file per month per device, or one file per month for all devices

For our scenario, let’s consider a high data volume per day per device and create one Parquet file per day per device. There are a couple of advantages to this grouping:

a. Less data is scanned when querying a single device’s data.

b. Less overhead of re-creating files in case of an issue with historic device data (e.g., data received was corrupt and fixed later, or data was not received in time and arrived after a few days). Only the affected device-level data needs to be re-created in Parquet.

Based on the above points, the folder structure in our target S3 bucket will look similar to the source.
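The target layout mirrors the source bucket, with one Parquet file per device per day (the bucket and file names are hypothetical):

```
s3://iot-device-data-parquet/
  deviceid=1000001/
    year=2021/
      month=01/
        day=01/
          part-00000.snappy.parquet
```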

Querying on Parquet Data:

So far we have set up a Glue ETL job which converts our source JSON data into Parquet as one file per day per device. As the job runs, it will continue to create new folders/files in S3, which need to be made available for querying.

We can set up a Glue Crawler on the Parquet data bucket, just as we did earlier for the JSON data. The crawler will crawl the Parquet data bucket and create a Glue Data Catalog table in the same target database “iot_device_data_db”, with the table created based on the Parquet S3 bucket name, which will be “iot_device_data_parquet”. This Glue Crawler also needs to be scheduled to run periodically to update partitions in the Glue Catalog table.

Once the Glue Catalog table is updated, the data can be queried using Athena. Below is a sample query for fetching data for device id “1000001” for date “1-Jan-2021”.
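As with the JSON table, a sketch of such a query (partition column names/types remain assumptions):

```sql
-- Same search criteria, now against the Parquet-backed table.
SELECT *
FROM iot_device_data_db.iot_device_data_parquet
WHERE deviceid = '1000001'
  AND year = '2021'
  AND month = '01'
  AND day = '01';
```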

The response when querying the Parquet data table will be much faster, and the data scanned when querying the “iot_device_data_parquet” table will be 80 to 90% less compared to querying the “iot_device_data_json” table for the same search criteria. Hence we start saving time and money by using the Parquet data table.

For Athena query pricing, refer to: https://aws.amazon.com/athena/pricing/

Optimization Techniques:

As the number of devices and the age of the data increase, the number of partitions that need to be stored in the Glue Catalog table will grow, and as the table grows the performance of Athena queries will fall considerably.

To understand how performance is impacted, it’s important to know that whenever we query a Glue table using Athena, the Glue Data Catalog table is scanned to identify the potential partitions that need to be looked at for data, along with the metadata associated with those partitions. Hence, when Glue Data Catalog performance decreases, it takes more time to identify which locations/files need to be parsed for data, thereby increasing the overall Athena query time considerably.

One simple solution can be to reduce the number of partitions, for example (with respect to our scenario):

  1. Change the Glue ETL job to create Parquet files partitioned at the day level: one partition per day, including data for all devices
  2. Change the Glue ETL job to aggregate data for each device at the month level: one partition per device per month

But it might not always be possible to reduce the number of partitions, in which case “Partition Projection” comes into the picture.

Reference : https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html

“Partition Projection” operates much faster than the conventional AWS Glue Data Catalog, especially in the case of highly partitioned tables. “Partition Projection” is able to skip the AWS Glue Data Catalog and query S3 folders/files directly based on the partition projection configuration for the given table.

Now we can either edit the existing table to use partition projection or create a new table on the same Parquet data source and then enable partition projection on it. Let’s create a new table by setting up a new Glue Crawler, and this time we will use the prefix “projection_” in the table name, so the new table will be “projection_iot_device_data_parquet”.

Once the table is created, follow below steps to configure partition projection:

  1. Go to the AWS Glue Console and open the “projection_iot_device_data_parquet” table
  2. Click “Edit table”
  3. Go to the table properties and add the projection configuration for the partition fields year/month/day, along with enabling projection
  4. For the device id, which in our case we have considered to be an 8-digit integer, add a corresponding integer projection configuration

It is extremely important to consider the type of values in partition fields when planning to use partition projection. For example, if you use alphanumeric device ids with no fixed pattern, then defining an optimal partition projection will be difficult: we would have to specify the device id as type “injected” and would be able to query only one device id at a time using any projection-enabled table.

For details on the types supported for partition projection, refer to the partition projection documentation linked above.

  5. Click “Apply” to save the updated table properties
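For steps 3 and 4, the table properties might look like the sketch below. The property keys follow Athena’s partition projection naming scheme (projection.enabled, projection.&lt;column&gt;.type, projection.&lt;column&gt;.range); the actual ranges are assumptions and should be adjusted to your data:

```
projection.enabled=true
projection.year.type=integer
projection.year.range=2020,2030
projection.month.type=integer
projection.month.range=1,12
projection.month.digits=2
projection.day.type=integer
projection.day.range=1,31
projection.day.digits=2
projection.deviceid.type=integer
projection.deviceid.range=0,99999999
```

Since the folders already follow the Hive-style fieldname=value convention, Athena can map the projected partitions onto the S3 prefixes without an explicit storage.location.template.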

Once the table is updated, any query on this table will be executed using the “partition projection” settings and will return results faster by directly scanning the folders/files based on the partition information provided in the query itself. For example:
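A query of the following shape illustrates this; the partition columns are integers here per the projection configuration, and the names are assumptions carried over from the earlier examples:

```sql
-- With partition projection enabled, these predicates are resolved
-- directly to S3 prefixes instead of Glue Catalog partition lookups.
SELECT *
FROM iot_device_data_db.projection_iot_device_data_parquet
WHERE deviceid = 1000001
  AND year = 2021
  AND month = 1
  AND day = 1;
```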

Such a query skips scanning the entire Glue Catalog table metadata to identify which locations/files need to be scanned, and directly scans the Parquet files present at the corresponding folder location, hence providing a considerable performance improvement.

Querying Real Time Data:

Now that we have successfully optimized our process of querying data from Parquet files, which are generated on a fixed schedule, the question arises of how to effectively query data which is yet to be converted to Parquet. Some of the use-cases can be:

  1. Need the last 30 minutes of data for a specific device for debugging a current issue
  2. Need data from date X until now for comparing historic values with the current trend of values
  3. Many other data analytics or data visualization use-cases

We will follow the same process as we did for the Parquet data, i.e., set up a new Glue Crawler to create a new projection table on the “iot-device-data-json” bucket and configure partition projection similar to the “projection_iot_device_data_parquet” table. We will then have a “projection_iot_device_data_json” table available for querying data which is not yet converted to Parquet.

Another important feature of “partition projection” tables is that even if partitions have not yet been added to the catalog metadata, Athena can still query newly created partitions (S3 folders) immediately. This heavily reduces the dependency on the Glue Crawler having completed crawling the data before it is available for querying using Athena.

Important Note: If you are looking to query any newly added field, then a one-time completed run of the Glue Crawler is a must for that field to become available for querying.

Now that we have the partition projection configuration in place for both the JSON and Parquet data, we can use both tables together to query historic as well as real-time data in an optimal way.

Assume today is 11-Jan-2021 and our Glue ETL job runs once a day to convert the previous day’s JSON data to Parquet. When querying on 11-Jan, we can’t be 100% sure whether the Glue ETL run for 11-Jan has completed and the data for 10-Jan is available in Parquet. To be safe, we query data up to 9-Jan from the Parquet table and the remainder from the JSON table. A sample query will look like the one below.
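A sketch of such a combined query, with table, column, and partition names carried over as assumptions from the earlier examples:

```sql
-- Historic data (up to 9-Jan) from the optimized Parquet table ...
SELECT *
FROM iot_device_data_db.projection_iot_device_data_parquet
WHERE deviceid = 1000001
  AND year = 2021 AND month = 1 AND day <= 9
UNION ALL
-- ... plus recent data (10-Jan onwards) still available only as JSON.
SELECT *
FROM iot_device_data_db.projection_iot_device_data_json
WHERE deviceid = 1000001
  AND year = 2021 AND month = 1 AND day >= 10;
```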

Important Note: Keeping column names the same across the JSON and Parquet data is a prerequisite for using * in a UNION query. If the column names differ, we will have to apply aliases to make them consistent in the query output.

Conclusion:

Based on the above analysis and implementation recommendations, it is evident that one needs to keep many factors in mind when using S3 as a source for storing and querying IoT-type data. Performing due diligence upfront, to start with optimal folder structures, partition data types, file formats, table configurations, etc., is a must; otherwise the solution has to be re-implemented after every new roadblock is encountered.


Written by Ravi Intodia

Solution Architect working on designing and implementing AWS Cloud based solutions.