AWS Glue Serverless Cloud Computing – Amazon Web Services Tutorial
This Amazon Web Services Glue tutorial on serverless cloud computing shows how powerful fully managed, serverless services are and how easy it is to get up and running with them.
# What is AWS Glue
# What is AWS ETL
# Serverless ETL Pipelines
# Data Catalog
# Data Pipelines
# Discover dark data that you are currently not analysing
# Analyze dark data without moving it into your data warehouse
# Visualize the results of your dark data analytics
This AWS Glue tutorial gives an overview of AWS Glue and its features. AWS Glue is built on top of Apache Spark, which provides the underlying engine that processes data records and scales to deliver high throughput, all of which is transparent to AWS Glue users.
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize data, clean it, enrich it, and move it reliably between various data stores.
The new service has three main components:
Data Catalog—A central repository for storing, accessing, and managing metadata such as databases, tables, schemas, and partitions. Creating an ETL job to organize, cleanse, validate, and transform data in AWS Glue is a simple process. When connections to data sources are set up, “intelligent” crawlers infer the schema of the objects within those sources and create the corresponding metadata tables in the AWS Glue Data Catalog.
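As a rough sketch of this step, the snippet below registers an S3 data source with a crawler through the Glue API. The database name, S3 path, and IAM role are illustrative placeholders (not from the tutorial); the calls mirror the `CreateCrawler`/`StartCrawler` operations exposed by boto3's Glue client.

```python
# Sketch: register a data source in the Glue Data Catalog via a crawler.
# All names (crawler, database, S3 path, role ARN) are made-up examples.

def build_crawler_request(name, role_arn, database, s3_path):
    """Build the request body for glue.create_crawler(); the crawler infers
    the schema of the files under s3_path and writes metadata tables into
    the given catalog database."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_start_crawler(glue, request):
    """Create the crawler, then kick off its first run."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])

# Usage (requires AWS credentials and an IAM role with Glue/S3 access):
# import boto3
# glue = boto3.client("glue")
# req = build_crawler_request(
#     "sales-crawler",
#     "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#     "sales_db",
#     "s3://example-bucket/sales/",
# )
# create_and_start_crawler(glue, req)
```

Once the crawler finishes, the inferred tables appear in the Data Catalog and can be used as sources or targets for ETL jobs.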
ETL Engine—Once the metadata is available in the catalog, data analysts can create an ETL job by selecting source and target data stores from the AWS Glue Data Catalog. There is no need to know the source or target schema in advance, as both are available from the catalog. The next step is to have AWS Glue generate the required PySpark code for the job; developers can then customize this code to meet their validation and transformation requirements.
Scheduler—Once the ETL job is created, it can be run on demand, at a scheduled time, or upon completion of another job. AWS Glue provides a flexible and robust scheduler that can even retry failed jobs.
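The scheduling options above map onto Glue trigger types. The sketch below builds request bodies for the Glue `CreateTrigger` API for the time-based and job-completion cases (trigger and job names are illustrative); retries themselves are configured on the job, not the trigger.

```python
# Sketch: Glue trigger definitions for the two non-interactive scheduling
# modes, as CreateTrigger request bodies. Names are made-up examples.

def scheduled_trigger(name, job_name, cron):
    """Run job_name on a cron schedule,
    e.g. 'cron(0 2 * * ? *)' for 2 AM daily."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

def conditional_trigger(name, job_name, upstream_job):
    """Run job_name when upstream_job completes successfully."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [{"JobName": upstream_job,
                                      "State": "SUCCEEDED",
                                      "LogicalOperator": "EQUALS"}]},
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

# Usage (needs AWS credentials):
# import boto3
# boto3.client("glue").create_trigger(**scheduled_trigger(
#     "nightly-sales", "sales-etl-job", "cron(0 2 * * ? *)"))
```

On-demand runs need no trigger at all (`start_job_run` suffices), and automatic retries come from the job's `MaxRetries` setting in `CreateJob`.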