
AWS Glue provides a serverless environment to prepare (extract and transform) and load large datasets from a variety of sources for analytics and data processing with Apache Spark ETL jobs.
The first post of the series, Best practices to scale Apache Spark jobs and partition data with AWS Glue, discusses best practices to help developers of Apache Spark applications and Glue ETL jobs, big data architects, data engineers, and business analysts automatically scale their data processing jobs running on AWS Glue.

This post shows how to incrementally load data from data sources in an Amazon S3 data lake and from databases using JDBC. It also shows how to scale AWS Glue ETL jobs by reading only newly added data using job bookmarks, and how to process late-arriving data by resetting the job bookmark to the end of a prior job run. The post also reviews best practices for using job bookmarks with complex AWS Glue ETL scripts and workloads. Finally, it shows how to use the custom AWS Glue Parquet writer, which is optimized for performance by avoiding extra passes over the data and computing the schema at runtime.
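Resetting a bookmark does not require changes to the ETL script itself. As a minimal sketch, assuming a hypothetical job name and run ID, the Glue ResetJobBookmark API can rewind the bookmark so the next run reprocesses data from the end of a prior run:

```python
import boto3

glue = boto3.client("glue")

# Rewind the job bookmark to the state recorded for a prior run so the
# next execution picks up late-arriving data from that point onward.
# Both values below are hypothetical placeholders.
glue.reset_job_bookmark(
    JobName="incremental-load-job",
    RunId="jr_0123456789abcdef",
)
```

Calling reset_job_bookmark without a RunId clears the bookmark entirely, so the next run reprocesses the source from the beginning.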

AWS Glue job bookmarks persist state from previous runs as checkpoints. Subsequent job runs on the same data source process only the data newly added since the last checkpoint.
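As a minimal sketch of what a bookmark-enabled job looks like (the Data Catalog database, table, and S3 path here are hypothetical), the script names a checkpoint for each source with transformation_ctx and persists the new state with job.commit(); it also uses the glueparquet format to select the optimized Parquet writer mentioned above:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # restores bookmark state from prior runs

# transformation_ctx names the checkpoint for this source; with bookmarks
# enabled (--job-bookmark-option job-bookmark-enable), only data added
# since the last committed run is read.
events = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",            # hypothetical Data Catalog database
    table_name="events",            # hypothetical table over an S3 location
    transformation_ctx="events_src",
)

# "glueparquet" selects the optimized Parquet writer, which computes the
# schema at runtime rather than making an extra pass over the data.
glue_context.write_dynamic_frame.from_options(
    frame=events,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/processed/"},
    format="glueparquet",
)

job.commit()  # persists the advanced bookmark checkpoint
```

Without the job.commit() call at the end, the bookmark does not advance and the next run reprocesses the same data.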
