AWS Glue: Optimize crawler runtimes for continuously increasing data (using exclude patterns)
AWS Glue is a serverless, managed ETL service widely used to scan and transform large volumes of data for analytics and machine learning purposes. There are multiple ways AWS Glue can support you in performing ETL operations on large datasets; refer to https://aws.amazon.com/glue/ for high-level use cases.
Typically, development teams start using Glue with a smaller dataset during the early days of their service/application/product development or roll-out. As the service/application/product becomes popular, the associated data volume increases significantly. Even with a serverless, managed service like Glue, handling a large volume of historic data alongside incoming data becomes difficult if the configuration is not optimized for the data volume, data patterns, frequency of schema changes, flow of data, etc.
If the Glue crawler runtime keeps increasing, eventually there is a risk that:
- It hits the 24-hour timeout while listing the files
- It gets throttled by S3 rate limits, thereby impacting other functionality in the application landscape
To better understand how the Glue crawler functions, it’s important to know that every time the crawler runs, it lists all the files in the target, including files that were already crawled in previous runs as well as the new files added since the last run. However, the crawler will only end up reading data (typically only the first megabyte) from the new files.
Below are some best practices and suggestions for reducing Glue crawler runtime:
- Combine/merge smaller files into larger files: Reduces crawling time, as fewer files need to be listed, and even fewer files need to be read, in each run.
- Split datasets between multiple crawlers: Running multiple crawlers for a short time is usually more effective than running a single crawler for a much longer duration.
- Use exclude patterns: Exclude patterns can be used to ignore file paths/folders that have already been crawled before (or files whose data is not required). Since this reduces the number of files the crawler needs to list every time, crawler runtime is reduced accordingly.
Each of the above best practices needs detailed analysis and implementation based on your data characteristics, such as file sizes, number of files, data format, and the folder structure of files.
For the rest of this article we will focus on “using exclude patterns” as a sustainable practice for historic and incoming data. As an example, consider data coming in from thousands of IoT devices, with files stored in a well-partitioned folder structure: all files per device per day go into separate folders, easily identifiable using year/month/day folder values.
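A hypothetical layout for such a dataset (bucket, prefix, device, and file names below are purely illustrative) could look like:
s3://iot-data-bucket/device-data/device-0001/2021/08/14/telemetry-001.json
s3://iot-data-bucket/device-data/device-0001/2021/08/15/telemetry-001.json
s3://iot-data-bucket/device-data/device-0002/2021/08/15/telemetry-001.json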
In most IoT-based setups we expect data to be transmitted in near real-time or at a scheduled frequency of a few hours, hence the data for a given day is expected to be received by the next calendar day. However, for use cases like mining or marine, where connectivity might be an issue, we can expect data to arrive a few days late, depending on the nature of the devices, network and usage.
Now coming back to the Glue crawler: for such applications/data there is no need to crawl all folders every day/every run, as the setup expects data only in recent folders (the last few days or months), hence older folders/partitions can be excluded using “Exclude patterns”.
As shown in the screenshot above, there is an option to provide “Exclude patterns” in the Data store section (the values above are indicative), which can be used to exclude folders/partitions in which we don’t expect any new data, or which we don’t want to be part of our Glue catalog.
Effective use of “Exclude patterns” can bring Glue crawler runtime and cost down by more than 50%. e.g., If 2018, 2019 and 2020 data is excluded, then the Glue crawler runtime/cost for crawling only 2021 data will be much lower than crawling all data from 2018 to 2021.
Now the next big question is how to add the exclude patterns. There are broadly two options:
- Update the Glue crawler via CFT, AWS CLI or AWS Console: All of these options need manual intervention at regular intervals to update the exclude patterns and keep the “folders to be crawled” relevant to a given date. e.g., When the year changes to 2022 and we don’t expect any more data in the 2021 folders, we need to manually add “*/2021/**” to the exclude patterns using CFT/CLI/Console (see the CLI sketch after this list).
- Update the Glue crawler programmatically: A simple piece of code can be written and scheduled to run at a given frequency, updating the “exclude patterns” property of the Glue crawler as per the current date. e.g., A Python Lambda function using boto3 can be scheduled via CloudWatch Events to execute at defined intervals (once a month, once a week, etc.) and update the “Exclude patterns”, ensuring the “folders to be crawled” are always relevant on any given date (a scheduling sketch follows below).
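For option 1, a rough sketch of a one-off manual update using the AWS CLI could look like the below (the crawler name and S3 path are placeholders; note that --targets replaces the crawler’s existing target list, so all required paths and exclusions must be included in one go):
aws glue update-crawler \
    --name iot-data-crawler \
    --targets '{"S3Targets": [{"Path": "s3://iot-data-bucket/device-data/", "Exclusions": ["*/2021/**"]}]}'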
Option 2 doesn’t need any manual intervention, hence it is the recommended option.
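For the scheduling piece of option 2, below is a minimal sketch using boto3 (the rule name, Lambda function name and ARN are placeholder assumptions; the Lambda function itself is shown further below):
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Trigger the updater Lambda once a week
rule = events.put_rule(
    Name='refresh-crawler-exclusions',
    ScheduleExpression='rate(7 days)'
)

# Allow CloudWatch Events to invoke the (placeholder) Lambda function
lambda_client.add_permission(
    FunctionName='update-crawler-exclusions',
    StatementId='allow-events-invoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)

# Point the rule at the Lambda function
events.put_targets(
    Rule='refresh-crawler-exclusions',
    Targets=[{
        'Id': '1',
        'Arn': 'arn:aws:lambda:<region>:<account-id>:function:update-crawler-exclusions'
    }]
)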
Please find below a sample Python Lambda function for excluding all years and months except the current and last three months.
import os
from datetime import datetime

import boto3

# Create boto3 Glue client
client = boto3.client('glue')

# Fetch crawler details from environment variables
crawler_name = os.environ["crawler_name"]
database_name = os.environ["database_name"]
s3_path = 's3://' + os.environ["s3_path"] + '/'

###
# Mention any default patterns to be
# excluded which are not dependent on date
###
exclusions_array = []

# Oldest year for which data is available
# (cast to int, since environment variables are strings)
start_year = int(os.environ["start_year"])


def calculate_exclusions(exclusions_array):
    # Get current date details
    current_date = datetime.today()
    current_year = current_date.year
    current_month = current_date.month
    temp_year = start_year
    # Exclude all years except the current and the last year
    while temp_year < current_year - 1:
        exclusions_array.append("*/" + str(temp_year) + "**")
        temp_year += 1
    # Exclude years/months from the current and the last year
    if current_month > 3:
        # The last three months all fall in the current year,
        # so the whole of last year can be excluded as well
        exclusions_array.append("*/" + str(temp_year) + "**")
        temp_month = 1
        while temp_month < (current_month - 3):
            exclusions_array.append("*/" + str(current_year) + "/" + f'{temp_month:02}' + "**")
            temp_month += 1
    else:
        # The last three months spill into the previous year,
        # so exclude only that year's older months
        temp_month = 1
        while temp_month < (9 + current_month):
            exclusions_array.append("*/" + str(temp_year) + "/" + f'{temp_month:02}' + "**")
            temp_month += 1


def lambda_handler(event, context):
    # Start from the default (date-independent) patterns on every run,
    # so warm Lambda containers don't accumulate duplicate entries
    exclusions = list(exclusions_array)
    # Call calculation function
    calculate_exclusions(exclusions)
    # Update the Glue crawler with the freshly calculated patterns
    response = client.update_crawler(
        Name=crawler_name,
        DatabaseName=database_name,
        Targets={
            'S3Targets': [
                {
                    'Path': s3_path,
                    'Exclusions': exclusions,
                },
            ]
        }
    )
    return response
When the above code is executed in August 2021, it will exclude all years from the start year through 2020 and all months of 2021 up to April. e.g., If the start year is 2018, the Glue crawler will end up with “Exclude patterns” like the below:
*/2018**
*/2019**
*/2020**
*/2021/01**
*/2021/02**
*/2021/03**
*/2021/04**
Important Note: The above code needs to be adapted to the folder/partition structure of your data and to your exclusion needs. e.g., A use case might need to exclude all folders/partitions older than 3 days instead of 3 months, as sketched below.
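As an illustration of such an adaptation, here is a minimal sketch of a day-level variant (the function name, the days_to_keep parameter and the keep-window of “current plus last three days” are assumptions; it presumes the same year/month/day folder layout as above):
from datetime import datetime, timedelta

def calculate_day_exclusions(start_year, days_to_keep=3):
    # Keep the cutoff day and everything after it;
    # exclude all older years, months and days
    exclusions = []
    cutoff = datetime.today() - timedelta(days=days_to_keep)
    # Exclude all whole years before the cutoff year
    for year in range(start_year, cutoff.year):
        exclusions.append("*/" + str(year) + "**")
    # Exclude the cutoff year's months before the cutoff month
    for month in range(1, cutoff.month):
        exclusions.append("*/" + str(cutoff.year) + "/" + f'{month:02}' + "**")
    # Exclude the cutoff month's days before the cutoff day
    for day in range(1, cutoff.day):
        exclusions.append("*/" + str(cutoff.year) + "/" + f'{cutoff.month:02}' + "/" + f'{day:02}' + "**")
    return exclusions
Because the cutoff date is computed first, year and month boundaries are handled naturally; e.g., in early January the current year is simply not excluded at all.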
Conclusion:
Based on the above analysis, it is evident that “Exclude patterns” play a vital role in keeping Glue crawler runtime/cost in check, and it is recommended to have a process to automatically update the “Exclude patterns” at regular intervals.