
At Cloud Practice we aim to encourage the adoption of the cloud as a way of working in the IT world. To help with this task, we will be publishing numerous articles: some on good practices and use cases, others covering the key services within the cloud.
Below we present the basic concepts of AWS Glue.
AWS Glue is one of those relatively new AWS services with enormous potential. In particular, it can be very useful to companies that work with data but do not yet have a powerful Big Data infrastructure.
Basically, Glue is a fully AWS-managed, pay-as-you-go ETL service that requires no instance provisioning. To achieve this, it combines the speed and power of Apache Spark with the data organisation offered by the Hive Metastore.
The Glue Data Catalogue is where the metadata about all the data sources and destinations used by Glue jobs is stored.
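Tables are normally registered in the catalogue by crawlers, which scan a data store and infer its schema. As a minimal sketch using the AWS CLI (the database name, crawler name and S3 path are illustrative, and the Glue_DefaultRole role is assumed to exist):
aws glue create-database --database-input '{"Name": "sales_db"}'
aws glue create-crawler --name sales-crawler --role Glue_DefaultRole \
    --database-name sales_db \
    --targets '{"S3Targets": [{"Path": "s3://SOME_BUCKET/sales/"}]}'
aws glue start-crawler --name sales-crawler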
An ETL in AWS Glue consists primarily of scripts and other tools that use the metadata configured in the Data Catalogue to extract, transform and load data into a defined destination.
Sample ETL script in Python:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

## Initialise the Glue job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

## Read data from an RDS database through the JDBC driver
## (credentials are inlined here for brevity; in practice use a Glue
## connection or AWS Secrets Manager)
connection_option = {
    "url": "jdbc:mysql://mysql-instance1.123456789012.us-east-1.rds.amazonaws.com:3306/database",
    "user": "test",
    "password": "password",
    "dbtable": "test_table",
    "hashexpression": "column_name",  # column used to split the reads
    "hashpartitions": "10"            # number of parallel JDBC partitions
}
source_dynamic_df = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_option,
    transformation_ctx="source_dynamic_df")

## Convert to a Spark DataFrame for any transformations, then back to
## an AWS Glue DynamicFrame for writing
source_df = source_dynamic_df.toDF()
dynamic_df = DynamicFrame.fromDF(source_df, glueContext, "dynamic_df")

## Write the DynamicFrame to S3 in CSV format
datasink = glueContext.write_dynamic_frame.from_options(
    frame=dynamic_df,
    connection_type="s3",
    connection_options={"path": "s3://glueuserdata"},
    format="csv",
    transformation_ctx="datasink")
job.commit()
Creating a job from the command line:
aws glue create-job --name python-job-cli --role Glue_DefaultRole \
--command '{"Name" : "my_python_etl", "ScriptLocation" : "s3://SOME_BUCKET/etl/my_python_etl.py"}'
Running a job from the command line:
aws glue start-job-run --job-name python-job-cli
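start-job-run returns a JobRunId that can then be used to check the state of the run (the run ID below is illustrative):
aws glue get-job-run --job-name python-job-cli --run-id jr_0123456789abcdef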
AWS has also published a repository with numerous example ETLs for AWS Glue.
Like all AWS services, it is designed and implemented to provide the greatest possible security. Among the security features that AWS Glue offers are:
- Encryption at rest of the Data Catalogue and of job artefacts using AWS KMS keys.
- Encryption in transit using SSL/TLS, for example on JDBC connections.
- Fine-grained access control over catalogue resources and jobs through IAM policies.
- Network isolation, since jobs can run inside your own VPC.
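As an illustration, encryption at rest for the Data Catalogue can be enabled with a single call (a minimal sketch; the KMS key alias is a placeholder):
aws glue put-data-catalog-encryption-settings \
    --data-catalog-encryption-settings '{"EncryptionAtRest": {"CatalogEncryptionMode": "SSE-KMS", "SseAwsKmsKeyId": "alias/my-glue-key"}}'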
AWS bills for the execution time of crawlers and ETL jobs, measured in DPU-hours, and for the storage of and requests to the Data Catalogue.
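For example, at the us-east-1 price of around $0.44 per DPU-hour at the time of writing, a job that runs for 15 minutes on 10 DPUs costs roughly 10 × 0.25 h × $0.44 ≈ $1.10; Data Catalogue storage and requests are billed separately once the free tier is exceeded.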
Although it is a young service, it is quite mature and is already being used by AWS customers all over the world. The most important features it offers are:
- It is serverless: there are no clusters to provision or manage, and you only pay while crawlers and jobs are running.
- Crawlers automatically discover data sources and register their schemas in the Data Catalogue.
- It can generate ETL code in Python or Scala as a starting point, which you can then customise.
- Jobs can be scheduled or triggered, for example when another job completes.
- The Data Catalogue is compatible with the Hive Metastore and can be shared with services such as Amazon Athena, EMR and Redshift Spectrum.
Like any database, tool or service, AWS Glue has certain limitations that need to be weighed up before adopting it as your ETL service. You should therefore bear in mind that:
- Jobs run on Apache Spark, so the service is oriented towards batch processing rather than low-latency workloads.
- ETL scripts can only be written in Python or Scala.
- Out of the box it connects mainly to S3 and to JDBC-accessible databases; other sources require additional work.
- Jobs incur a start-up delay while resources are provisioned on demand.
- You have little control over the underlying infrastructure: essentially you only tune the number of DPUs assigned to a job.
My name is Álvaro Santos and I have been working as a Solution Architect for over 5 years. I am certified in AWS, GCP, Apache Spark and a few others. I joined Bluetab in October 2018, and since then I have been involved in cloud projects in Banking and Energy, and I also act as a Cloud Master Practitioner. I am passionate about new distributed patterns, Big Data, open-source software and anything else cool in the IT world.