Bluetab

Data Governance in the Media sector

septiembre 10, 2020 by Bluetab

Data Governance in the Media sector

In one of our clients in the «media» sector, we participate with our data governance solutions in the process of digital transformation of data architecture on Cloud environments.

Within this initiative we have incorporated different pieces on its Cloud platform, providing a complete view of the data from a single point of access.

For this, a Data Catalog and mechanisms for metadata discovery have been designed and implemented. Furthermore, our solutions allow us to manage the data definitions from a business point of view through the Business Glossary where we measure and control the quality of the information.

In the project, the most used technologies have been the different Azure cloud services (ADLS, Blob, ADLA, etc.)

SUCCESS STORIES

Almacén Cloud en Banca

septiembre 10, 2020 by Bluetab

Almacén Cloud en Banca

El camino de los bancos a ser “Data driven companies” hace que crezca exponencialmente el volumen de datos a almacenar y procesar.

Realizamos uno de los proyectos más estratégicos de Data para facilitar el escalado de la infraestructura en una de las mayores entidades financieras de España. gestionando un entorno Cloud para soportar la elasticidad que demandan los usuarios.

A través de este servicio se proporciona un reporting que permite controlar y detectar los factores que impactan en la calidad de los servicios ofrecidos por la Entidad. El cuadro de mando facilita la toma de decisiones para implementar procesos de mejora continua, específicamente a través de lógicas de detección, preventivas y correctivas que permitan minimizar los tiempos de respuesta para mejorar la calidad de los servicios digitales.

El proyecto se basa en bots que simulan peticiones virtuales de servicio para todos los canales, permitiendo conocer el estado de todos los procesos implicados en cada momento. Adicionalmente se proporciona una visión cliente, trazando la información de los ficheros (logs) que contienen información propia de cliente para seleccionar los KPIS más relevantes y realizar extracciones agregadas para su explotación.

Tecnologías: Cloud AWS, PostgreSQL, S3, Grafana, MicroStrategy

CASOS DE ÉXITO

Cloud Warehouse in Banking

septiembre 10, 2020 by Bluetab

Cloud Warehouse in Banking

The path of banks to be “Data driven companies” causes the volume of data to be stored and processed to grow exponentially.

We carry out one of the most strategic Data projects to facilitate infrastructure scaling in one of the largest financial entities in Spain. Managing a Cloud environment to support the elasticity that users demand.

With this service, reporting that allows controlling and detecting the factors that impact the quality of the services offered by the Entity is provided. The dashboard facilitates decision-making to implement continuous improvement processes, specifically through detection, preventive and corrective logics that allow minimizing response times to improve the quality of digital services.

The project is based on bots that simulate virtual service requests for all channels, allowing to know the status of all the processes involved at all times. Additionally, a customer view is provided, tracing the information in the files (logs) that contain customer information to select the most relevant KPIS and carry out aggregated extractions for their exploitation.

Technologies: Cloud AWS, PostgreSQL, S3, Grafana, MicroStrategy

SUCCESS STORIES

Introducción a los productos de HashiCorp

agosto 25, 2020 by Bluetab

Introducción a los productos de HashiCorp

Desde la Práctica Cloud queremos impulsar el uso de los productos Hashicorp y por ello vamos a publicar artículos temáticos sobre cada uno de ellos.

Debido a la multitud de posibilidades que ofrecen sus productos, en éste artículo abordaremos una visión de conjunto y en publicaciones posteriores, entraremos en detalle de cada uno de ellos, aportando casos de uso no convencionales que muestren el potencial que tienen los productos de Hashicorp.

¿Por qué Hashicorp?

Durante estos últimos años, Hashicorp viene desarrollando diferentes productos Open Source los cuales ofrecen una gestión transversal de la infraestructura en entornos Cloud y on-premises. Estos productos, han marcado estándares en la automatización de infraestructura.

En la actualidad, sus productos aportan soluciones robustas en los ámbitos del aprovisionamiento, seguridad, interconexión y coordinación de cargas de trabajo.

El código fuente de sus productos está liberado bajo licencia MIT, lo cual ha tenido una gran acogida dentro de la comunidad Open Source (cuentan con más de 100 desarrolladores aportando mejoras de forma continua)

Además de los productos que expondremos en este artículo, disponen de interesantes soluciones Enterprise.

Respecto al impacto que tienen sus soluciones, poniendo como ejemplo Terraform, se ha convertido en un referente en el mercado. Esto significa que Hashicorp está haciendo las cosas bien, por lo que es conveniente entender y aprender a usar sus soluciones.

Aunque nos estemos centrando en su uso en el ámbito Cloud, sus soluciones tiene gran presencia en los entornos on-premises, pero alcanzan todo su potencial cuando trabajamos en la nube.

¿Cuáles son estos productos?

Hashicorp cuenta con los siguientes productos:

Terraform: infraestructura como código.
Vagrant: máquinas virtuales para entornos de pruebas.
Packer: construcción de imágenes de forma automatizada.
Nomad: “orquestador” de cargas de trabajo.
Vault: gestión de secretos y protección de datos.
Consul: gestión y descubrimiento de servicios en entornos distribuidos.

A continuación, resumimos las principales características de los productos. En próximas publicaciones entraremos en detalle de cada uno de ellos.

Terraform se ha posicionado como el producto más extendido en el ámbito del aprovisionamiento de Infraestructura como Código.

Utiliza un lenguaje específico (HCL) para desplegar infraestructura a través del uso de los diferentes proveedores de Cloud. Terraform también permite gestionar recursos de otras tecnologías o servicios, como Gitlab o Kubernetes.

Terraform hace uso de un lenguaje declarativo y se basa en tener desplegado exactamente lo que pone en el código.

El típico ejemplo que se muestra para entender la diferencia con el paradigma que sigue por ejemplo Ansible (procedimental) es:

En un primer momento queremos desplegar 5 instancias EC2 en AWS:

Ansible:

- ec2:
    count: 5
    image: ami-id
    instance_type: t2.micro

Terraform:

resource "aws_instance" "ejemplo" {
  count         = 5
  ami           = "ami-id"
  instance_type = "t2.micro"
}

Como puede observarse, no existen prácticamente diferencias entre los códigos. Ahora necesitamos desplegar dos instancias más para hacer frente a una carga de trabajo mayor:

Ansible:

- ec2:
    count: 2
    image: ami-id
    instance_type: t2.micro

Terraform:

resource "aws_instance" "ejemplo" {
  count         = 7
  ami           = "ami-id"
  instance_type = "t2.micro"
}

En este momento, podemos ver que mientras en Ansible estableceríamos en 2 el número de instancias a desplegar para que en total haya 7 desplegadas, en Terraform estableceríamos el número directamente a 7 y Terraform sabe que tiene que desplegar 2 más porque ya hay 5 desplegadas.

Otro aspecto importante es que Terraform no necesita un server Master como Chef o Puppet. Es una herramienta distribuida cuyo elemento común y centralizado es el tfstate (se explica en el siguiente párrafo). Se lanza desde cualquier sitio con acceso al tfstate (que puede ser remoto, guardado en un almacenamiento común como puede ser AWS S3) y terraform instalado, el cual se distribuye como un binario descargable desde la web de Hashicorp.

El último punto a comentar sobre Terraform es que se basa en un archivo denominado tfstate en el que va guardando y actualizando la información relativa al estado de la infraestructura, y el cual va a consultar para conocer si es necesario realizar cambios sobre la misma. Es muy importante tener en cuenta que este estado es lo que conoce Terraform. Terraform no va a conectarse a AWS para ver qué hay desplegado. Por ello es necesario no realizar cambios a mano sobre infraestructura desplegada por Terraform (ni nunca. Cambios a mano, nunca), ya que no se actualiza el tfstate y por tanto se crearán inconsistencias.

Vagrant permite desplegar entornos de test de manera local de forma rápida y muy simple, basados en código. Por defecto utiliza por debajo VirtualBox, pero es compatible con otros providers como por ejemplo VMWare o Hyper-V. Es posible desplegar máquinas sobre proveedores Cloud como AWS instalando plugins. Personalmente no encuentro la ventaja de usar Vagrant para esta función.

Las máquinas que despliega se basan en boxes, y desde el código se indica qué box se quiere desplegar. En este aspecto podría compararse en modo de funcionamiento con Docker. Sin embargo, la base es totalmente distinta. Vagrant levanta máquinas virtuales con una herramienta de virtualización por debajo (VirtualBox) mientras que Docker despliega contenedores y su soporte es una tecnología de contenerización Vagrant VS Docker.

Casos de uso típicos pueden ser probar playbooks de Ansible, recrear un entorno de laboratorio de manera local en muy pocos minutos, realizar alguna demo, etc.

Vagrant se basa en un archivo Vagrantfile. Una vez situado en el directorio donde se encuentra este Vagrantfile, con ejecutar vagrant up, la herramienta comienza a desplegar lo que se indique dentro de dicho archivo.

Los pasos para lanzar una máquina virtual con Vagrant son:

1- Instalar Vagrant. Se distribuye como un binario al igual que el resto de productos de Hashicorp.

2- Con un Vagrantfile de ejemplo:

Vagrant.configure("2") do |config|
  config.vm.box = "gbailey/amzn2"
end

PD: (el parámetro 2 dentro de configure se refiere a la versión de Vagrant, que me ha tocado mirarlo para el post)

3- Ejecutar dentro del directorio donde se encuentra el Vagrantfile

vagrant up

4- Entrar por ssh a la máquina

vagrant ssh

5- Destruir la máquina

vagrant destroy

Vagrant se encarga de gestionar el acceso ssh creando una clave privada y almacenándola en el directorio .vagrant, aparte de otros metadatos. Se puede acceder a la máquina ejecutando vagrant ssh. También es posible visualizar las máquinas desplegadas en la aplicación de VirtualBox.

Con Packer puedes construir imágenes de máquinas de forma automatizada. Puede utilizarse para, por ejemplo, construir una imagen en un proveedor Cloud como AWS con una configuración inicial ya realizada y poder desplegarla un número indeterminado de veces. De esta forma solo es necesario aprovisionar una sola vez y al desplegar la instancia con esa imagen ya tendrá la configuración deseada sin tener que invertir mayor tiempo en aprovisionarla.

Un pequeño ejemplo sería:

1- Instalar Packer. De la misma forma, es un binario que habrá que colocar en una carpeta que se encuentre en nuestro path.

2- Crear un archivo, por ejemplo builder.json. También se crea un pequeño script en bash (el link que se muestra en builder.json es dummy):

{
  "variables": {
    "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
    "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
    "region": "eu-west-1"
  },
  "builders": [
    {
      "access_key": "{{user `aws_access_key`}}",
      "ami_name": "my-custom-ami-{{timestamp}}",
      "instance_type": "t2.micro",
      "region": "{{user `region`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "source_ami_filter": {
        "filters": {
            "virtualization-type": "hvm",
            "name": "amzn2-ami-hvm-2*",
            "root-device-type": "ebs"
        },
        "owners": ["amazon"],
        "most_recent": true
      },
      "ssh_username": "ec2-user",
      "type": "amazon-ebs"
    }
  ],
  "provisioners": [
    {
        "type": "shell",
        "inline": [
            "curl -sL https://raw.githubusercontent.com/example/master/userdata/ec2-userdata.sh | sudo bash -xe"
        ]
    }
  ]
}

Los proveedores utilizan software integrado y de terceros para instalar y configurar la imagen de la máquina después del arranque. Nuestro ejemplo ec2-userdata.sh

yum install -y \
  python3 \
  python3-pip \

pip3 install cowsay

3- Ejecutar:

packer build builder.json

Y ya tendríamos una AMI aprovisionada con cowsay instalado. Ahora nunca más será necesario que, como primer paso tras lanzar una instancia, sea necesario instalar cowsay porque ya lo tendremos de base, como debe ser.

Como es de esperar, Packer no solo funciona con AWS sino con cualquier proveedor Cloud como Azure o GCP. También funciona con VirtualBox y VMWare y una gran lista de builders. Además, se puede crear una imagen anidando builders en el archivo de configuración de Packer que sea igual para distintos proveedores _Cloud. Esto es muy interesante para entornos multi-cloud en los que haya que replicar diferentes configuraciones.

Nomad es un orquestador de cargas de trabajo. A través de sus Task Drivers ejecutan una tarea en un entorno aislado. El caso de uso más común es orquestar contenedores Docker. Tiene dos actores básicos: Nomad Server y Nomad Client. Ambos actores se ejecutan con el mismo binario pero diferente configuración. Los Servers organizan el cluster mientras que los Agents ejecutan las tareas.

Un pequeño “Hello-world” en nomad podría ser:

1- Descargar e instalar Nomad y ejecutarlo (en modo development, este modo nunca debe usarse en producción):

nomad agent -dev

Una vez hecho esto ya es posible acceder a la UI de nomad en localhost:4646

2- Ejecutar un job de ejemplo que lanzará una imagen de grafana

nomad job run grafana.nomad

grafana.nomad:

job "grafana" {
    datacenters = [
        "dc1"
    ]

    group "grafana" {
        task "grafana"{
            driver = "docker"

            config {
                image = "grafana/grafana"
            }

            resources {
                cpu = 500
                memory = 256

            network {
                mbits = 10

                port "http" {
                    static = "3000"
                }
            }
            }
        }
    }
}

3- Acceder a localhost:3000 para comprobar que se puede acceder a grafana y acceder a la UI de Nomad (localhost:4646) para ver el job

4- Destruir el job

nomad stop grafana

5- Parar el agente de Nomad. Si se ha ejecutado tal y como se ha indicado aquí, con pulsar Control-C en la terminal donde se está ejecutando bastará

En resumen, Nomad es un orquestador muy ligero, al contrario que Kubernetes. Lógicamente no tiene las mismas funcionalidades que Kubernetes pero ofrece la posibilidad de ejecutar cargas de trabajo en alta disponibilidad de forma sencilla y sin necesidad de usar gran cantidad de recursos. Se tratarán ejemplos con más detalle en el post de Nomad que se publicará en un futuro.

Vault se encarga de gestionar secretos. A través de su api los usuarios o aplicaciones pueden pedir secretos o identidades. Estos usuarios se autentican en Vault, el cual tiene conexión a un proveedor de identidad de confianza como un Active Directory, por ejemplo. Vault funciona con dos tipos de actores al igual que otras de las herramientas mencionadas, Server y Agent.

Al inicializar Vault, comienza en estado sellado (Seal) y se generan diferentes tokens para deshacer el sellado (Unseal). Un caso de uso real sería demasiado largo para este artículo y se verá en detalle en el artículo destinado únicamente a Vault. Sin embargo, se puede descargar y probar en modo dev al igual que se ha mecionado antes con Nomad. En este modo, Vault se inicializa en modo Unsealed , almacenando todo en memoria sin necesidad de autenticación, sin utilizar TLS y con una única clave de sellado. De esta forma podremos jugar con sus funcionalidades y explorarlo de una manera rápida y sencilla.

Consultar en este enlace para mayor información.

1- Instalar Vault. Al igual que el resto de productos de Hashicorp, se distribuye como un binario a través de su web. Una vez hecho, lanzar (este modo nunca debe usarse en producción):

vault server -dev

2- Vault mostrará el token de root que sirve para autenticarse contra Vault. También mostrará la única clave de sellado que crea. Ahora Vault es accesible a través de localhost:8200

3- En otra terminal, exportar las siguientes variables para poder hacer pruebas:

export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_DEV_ROOT_TOKEN_ID="token-showed-when-running-vault-dev"

4- Crear un secreto (En la terminal en la que se han exportado las variables anteriores)

vault kv put secret/test supersecreto=iewdubwef293bd8dn0as90jdasd

5- Recuperar ese secreto

vault kv get secret/test

Estos pasos son el Hello World de Vault, para comprobar su funcionamiento. En el artículo de Vault veremos de forma detallada el funcionamiento y la instalación de Vault, así como más características como las policies, explorar la UI, etc.

Consul sirve para crear una red de servicios. Funciona de forma distribuida y un cliente de consul se ejecutará en una localización en la que existan servicios que se quieran registrar. Consul incluye Health Checking de servicios, una base de datos de clave valor y funcionalidades para securizar el tráfico de red. Al igual que Nomad o Vault, Consul tiene dos actores principales, Server y Agent.

De la misma manera que hemos hecho con Nomad y Vault, vamos a ejecutar Consul en modo dev para mostrar un pequeño ejemplo definiendo un servicio:

1- Instalar Consul. Crear una carpeta de trabajo y dentro de la misma crear un archivo json con un servicio a definir (ejemplo de nombre de carpeta: consul.d).

Archivo web.json (ejemplo de la web de Hashicorp):

{
  "service": {
    "name": "web",
    "tags": [
      "rails"
    ],
    "port": 80
  }
}

2- Ejecutar el agente de Consul indicándole el directorio de configuración creado antes donde se encuentra web.json (este modo nunca debe usarse en producción):

consul agent -dev -enable-script-checks -config-dir=./consul.d

3- A partir de ahora, aunque no hay nada realmente ejecutándose en el puerto indicado (80), se ha creado un servicio con un nombre DNS al que se puede preguntar para poder acceder al mismo a través de dicho nombre DNS. Su nombre será NAME.service.consul. En nuestro caso NAME es web. Ejecutar para preguntar:

dig @127.0.0.1 -p 8600 web.service.consul

Nota: Consul por defecto se ejecuta en el puerto 8600.

De la misma manera se puede preguntar por un record de tipo SRV:

dig @127.0.0.1 -p 8600 web.service.consul SRV

También se puede preguntar a través de los tags del servicio (TAG.NAME.service.consul):

dig @127.0.0.1 -p 8600 rails.web.service.consul

Y se puede consultar a través de la API:

curl http://localhost:8500/v1/catalog/service/web

Nota final: Consul se utiliza como complemento a otros productos, ya que sirve para crear servicios sobre otras herramientas ya desplegadas. Es por ello que este ejemplo no es demasiado ilustrativo. Aparte del artículo específico de Consul, se encontrará como ejemplo en otros artículos como en el de Nomad.

En futuros artículos se explicará también Consul Service Mesh para interconexión de servicios a través de sidecars proxies, la federación de Datacenters de Nomad en Consul para realizar despliegues o las intentions, que sirven para definir reglas de con qué servicios puede comunicarse otro servicio.

Conclusiones

Todos estos productos tienen un gran potencial en sus respectivos ámbitos y ofrecen una gestión transversal que puede ayudar a evitar el famoso vendor lock-in de los proveedores de Cloud. Pero lo más importante es su interoperabilidad y compatibilidad.

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

Jorge de Diego

Cloud DevOps Engineer

Mi nombre es Jorge de Diego y estoy especializado en entornos Cloud. Trabajo habitualmente con AWS aunque tengo conocimientos sobre GCP y Azure. Entré en Bluetab en septiembre de 2019 y desde entonces trabajo de Cloud DevOps y tareas de seguridad. Me interesa todo lo relacionado con la tecnología y en especial, los ámbitos de modelos de seguridad e Infraestructura. Podréis identificarme en la oficina por los pantalones cortos.

SOLUCIONES, SOMOS EXPERTOS

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Te puede interesar

MICROSOFT FABRIC: Una nueva solución de análisis de datos, todo en uno

octubre 16, 2023

Data Mesh

julio 27, 2022

De documentos en papel a datos digitales con Fastcapture y Generative AI

junio 7, 2023

$ docker run 2021

febrero 2, 2021

Los Incentivos y el Desarrollo de Negocio en las Telecomunicaciones

octubre 9, 2020

¿Existe el Azar?

noviembre 10, 2021

Introduction to HashiCorp products

agosto 25, 2020 by Bluetab

Introduction to HashiCorp products

At Cloud Practice we look to encourage the use of HashiCorp products and we are going to publish thematic articles on each of them to do this.

Due to the numerous possibilities their products offer, this article will provide an overview, and we will go into detail on each of them in later publications, showing unconventional use cases that show the potential of HashiCorp products.

Why HashiCorp?

HashiCorp has been developing a variety of Open Source products over recent years that offer cross-cutting infrastructure management in cloud and on-premises environments. These products have set standards in infrastructure automation.

Its products now provide robust solutions in provisioning, security, interconnection and workload coordination fields.

The source code for its products is released under MIT licence, which has been well received within the Open Source community (over 100 developers are continually providing improvements)

In addition to the products we present in this article, they also have attractive Enterprise solutions.

As regards the impact of its solutions, Terraform has become a standard on the market. This means HashiCorp is doing things right, so understanding and learning about its solutions is appropriate.

Although we are focusing on their use in the Cloud environment, their solutions are widely used in on-premises environments, but they reach their full potential when working in the cloud.

What are their products?

HashiCorp has the following products:

Terraform: infrastructure as code.
Vagrant: virtual machines for test environments.
Packer: automated building of images.
Nomad: workload “orchestrator”.
Vault: management of secrets and data protection.
Consul: management and discovery of services in distributed environments.

We summarise the main product features below. We will go into detail on each of them in later publications.

Terraform has positioned itself as the most widespread product in the field of provisioning of Infrastructure as Code.

It uses a specific language (HCL) to deploy infrastructure through the use of various Cloud providers. Terraform also lets you manage resources from other technologies or services, such as GitLab or Kubernetes.

Terraform makes use of a declarative language and is based on having deployed exactly what is present in the code.

The typical example shown to see the difference with the paradigm followed, for example, by Ansible (procedural) is:

Initially we want to deploy 5 EC2 instances on AWS:

Ansible:

- ec2:
    count: 5
    image: ami-id
    instance_type: t2.micro

Terraform:

resource "aws_instance" "ejemplo" {
  count         = 5
  ami           = "ami-id"
  instance_type = "t2.micro"
}

As you can see, there are virtually no differences between the coding. Now we need to deploy two more instances to cope with a heavier workload:

Ansible:

- ec2:
    count: 2
    image: ami-id
    instance_type: t2.micro

Terraform:

resource "aws_instance" "ejemplo" {
  count         = 7
  ami           = "ami-id"
  instance_type = "t2.micro"
}

At this point, we can see that while in Ansible we would set the number of instances to be deployed at 2 so that a total of 7 are deployed, in Terraform we would set the number directly to 7 and Terraform knows that it needs to deploy 2 more because there are already 5 deployed.

Another important aspect is that Terraform does not need a Master server such as Chef or Puppet. It is a distributed tool whose common, centralised element is the tfstate file (explained in the next paragraph). It is launched from any site with access to the tfstate (which can be remote, stored in common storage such as AWS S3) and with Terraform installed, which is distributed as a binary downloadable from the HashiCorp website.

The last point to note about Terraform is that it is based on a file called tfstate, in which the information relating to infrastructure status is progressively stored and updated, which it consults to see whether changes need to be made to the infrastructure. It is very important to note that this state is what Terraform knows. Terraform will not connect to AWS to see what is deployed. This means it is necessary to avoid making changes manually to the infrastructure deployed by Terraform (and never Manual changes, ever), as the tfstate file is not updated and inconsistencies will therefore be created.

Vagrant enables you to deploy test environments locally, quickly and very simply, based on code. By default it is used with VirtualBox, but it is compatible with other providers such as VMware or Hyper-V. Machines can be deployed on Cloud providers such as AWS by installing plug-ins. I personally cannot find advantages in using Vagrant for this function.

The machines it deploys are based on boxes, and you indicate which box you want to deploy from the code. Its mode of operating can be compared with Docker in this aspect. However, the basis is completely different. Vagrant creates virtual machines with a virtualisation tool below it (VirtualBox), while Docker deploys containers and its support is a containerisation technology, Vagrant versus Docker.

Typical use cases can be to test Ansible playbooks, recreate a laboratory environment locally in just a few minutes, making a demo, etc.

Vagrant is based on a Vagrantfile. Once located in the directory where this Vagrantfile is located, by running vagrant up, the tool begins to deploy what is indicated within that file.

The steps to launch a virtual machine with Vagrant are:

1. Install Vagrant. It is distributed as a binary file, like all other HashiCorp products.

2. With an example Vagrantfile:

Vagrant.configure("2") do |config|
  config.vm.box = "gbailey/amzn2"
end

PS: (the parameter 2 within configure refers to the version of Vagrant, which I had to check for the post)

3. Run within the directory where the Vagrantfile is located

vagrant up

4. Enter the machine using ssh

vagrant ssh

5. Destroy the machine

vagrant destroy

Vagrant manages ssh access by creating a private key and storing it in the .vagrant directory, apart from other metadata. The machine can be accessed by running vagrant ssh. The machines deployed can also be displayed in the VirtualBox application.

With Packer you can build automated machine images. It can be used, for example, to build an image on a Cloud provider, such as AWS, already with an initial configuration, and to be able to deploy it an indefinite number of times. This means that you only need to provision it once, and when you deploy the instance with that image you already have the desired configuration without having to spend more time provisioning it.

A simple example would be:

1. Install Packer. Similarly, it is a binary that will need to be placed in a folder located on our path.

2. Create a file, e.g. builder.json. A small script is also created in bash (the link shown in builder.json is dummy):

{
  "variables": {
    "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
    "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
    "region": "eu-west-1"
  },
  "builders": [
    {
      "access_key": "{{user `aws_access_key`}}",
      "ami_name": "my-custom-ami-{{timestamp}}",
      "instance_type": "t2.micro",
      "region": "{{user `region`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "source_ami_filter": {
        "filters": {
            "virtualization-type": "hvm",
            "name": "amzn2-ami-hvm-2*",
            "root-device-type": "ebs"
        },
        "owners": ["amazon"],
        "most_recent": true
      },
      "ssh_username": "ec2-user",
      "type": "amazon-ebs"
    }
  ],
  "provisioners": [
    {
        "type": "shell",
        "inline": [
            "curl -sL https://raw.githubusercontent.com/example/master/userdata/ec2-userdata.sh | sudo bash -xe"
        ]
    }
  ]
}

Provisioners use builtin and third-party software to install and configure the machine image after booting. Our example ec2-userdata.sh

yum install -y \
  python3 \
  python3-pip \

pip3 install cowsay

3. Run:

packer build builder.json

And we would already have an AMI provisioned with cowsay installed. Now it will never be necessary for you, as a first step after launching an instance, to install cowsay because we will already have it as a base, as it should be.

As you would expect, Packer not only works with AWS, but also with any Cloud provider such as Azure or GCP. It also works with VirtualBox and VMware and a long list of builders. You can also create an image by nesting builders in the Packer configuration file that is the same for various Cloud providers. This is very interesting for multi-cloud environments, where different configurations need to be replicated.

Nomad is a workload orchestrator. Through its Task Drivers they execute a task in an isolated environment. The most common use case is to orchestrate Docker containers. It has two basic actors: Nomad Server and Nomad Client. The two actors are run with the same binary, but different configurations. Servers organise the cluster, while Agents run the tasks.

A little “Hello world” in Nomad could be:

1. Download and install Nomad and run it (in development mode; this mode must never be used in production):

nomad agent -dev

Once this has been done, you can access the Nomad UI at localhost:4646

2. Run an example job that will launch a Grafana image

nomad job run grafana.nomad

grafana.nomad:

job "grafana" {
    datacenters = [
        "dc1"
    ]

    group "grafana" {
        task "grafana"{
            driver = "docker"

            config {
                image = "grafana/grafana"
            }

            resources {
                cpu = 500
                memory = 256

            network {
                mbits = 10

                port "http" {
                    static = "3000"
                }
            }
            }
        }
    }
}

3. Access localhost:3000 to check that you can access grafana and access the Nomad UI (localhost:4646) to view the job

4. Destroy the job

nomad stop grafana

5. Stop the Nomad agent. If it has been run as indicated here, pressing Control-C on the terminal where it is running will suffice

In short, Nomad is a very light orchestrator, unlike Kubernetes. Logically, it does not have the same features as Kubernetes, but it offers the ability to run workloads at high availability easily and without the need for a large amount of resources. We will deal with examples in greater detail in the post on Nomad to be published in the future.

Vault manages secrets. Users or applications can request secrets or identities through its API. These users are authenticated in Vault, which has a connection to a trusted identity provider such as an Active Directory, for example. Vault works with two types of actors, like the other tools mentioned, Server and Agent.

When Vault starts up, it starts in a sealed state (Seal) and various tokens are generated for unsealing (Unseal). An actual use case would be too long for this article and we will look at this in detail in the article on Vault alone. However, you can download and test it in dev mode as noted previously for Nomad. In this mode, Vault starts up in Unsealed mode, storing everything in memory without need for authentication, without using TLS, and with a single sealing key. You can play with its features and explore it quickly and easily in this way.

See this link for further information.

1. Install Vault. Like the rest of HashiCorp’s products, it is distributed as a binary through its website. Once done, launch (this mode must never be used in production):

vault server -dev

2. Vault will display the root token used to authenticate against Vault. It will also display the single sealing key it creates. Vault is now accessible via localhost:8200

3. In another terminal, export the following variables to be able to conduct tests:

export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_DEV_ROOT_TOKEN_ID="token-showed-when-running-vault-dev"

4. Create a secret (in the terminal where the above variables were exported)

vault kv put secret/test supersecreto=iewdubwef293bd8dn0as90jdasd

5. Recover that secret

vault kv get secret/test

These steps are the Hello world of Vault, to check its operation. We will look in detail at the operation and installation of Vault in the article on it, as well as at further features such as the policies, exploring the UI, etc.

Consul is used to create a services network. It operates in a distributed manner and a Consul client will run in a location where there are services that you want to register. Consul includes Health Checking of services, a key value database and features to secure network traffic. Like Nomad and Vault, Consul has two primary actors, Server and Agent.

Just as we did with Nomad and Vault, we are going to run Consul in dev mode to show a small example defining a service:

1. Install Consul. Create a working folder and within it create a json file with a service to be defined (example folder name: consul.d).

web.json file (HashiCorp website example):

{
  "service": {
    "name": "web",
    "tags": [
      "rails"
    ],
    "port": 80
  }
}

2. Run the Consul agent by pointing it to the configuration directory created before where web.json is located (this mode must never be used in production):

consul agent -dev -enable-script-checks -config-dir=./consul.d

3. From now, although there is nothing actually running on the indicated port (80), a service has been created with a DNS name that can be requested to access it through that DNS name. Its name will be NAME.service.consul. In our case NAME is web. Run to query:

dig @127.0.0.1 -p 8600 web.service.consul

Note: Consul runs by default on port 8600.

Similarly, you can dig for a record of SRV type:

dig @127.0.0.1 -p 8600 web.service.consul SRV

You can also query through the service tags (TAG.NAME.service.consul):

dig @127.0.0.1 -p 8600 rails.web.service.consul

And you can query through the API:

curl http://localhost:8500/v1/catalog/service/web

Final note: Consul is used as a complement to other products, as it is used to create services on other tools already deployed. That is why this example is not too illustrative. Apart from the specific article on Consul, it will be found as an example in other articles, such as the one on Nomad.

Future articles will also explain Consul Service Mesh for interconnection of services via Sidecar proxies, federation of Nomad Datacentres in Consul to perform deployments or intentions, which are used to define rules under which services can communicate with another service.

Conclusions

All these products have great potential in their respective areas and offer cross-cutting management that can help to avoid the famous vendor lock-in by Cloud providers. But the most important thing is their interoperability and compatibility.

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

Jorge de Diego

Cloud DevOps Engineer

My name is Jorge de Diego and I specialise in Cloud environments. I usually work with AWS, although I have knowledge of GCP and Azure. I joined Bluetab in September 2019 and have been working on Cloud DevOps and security tasks since then. I am interested in everything related to technology, especially security models and Infrastructure fields. You can spot me in the office by my shorts.

SOLUTIONS, WE ARE EXPERTS

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Empoderando a las decisiones en diversos sectores con árboles de decisión en AWS

junio 4, 2024

Serverless Microservices

octubre 14, 2021

DataOps

octubre 24, 2023

Data-Drive Agriculture; Big Data, Cloud & AI aplicados

noviembre 4, 2020

Análisis de vulnerabilidades en contenedores con trivy

marzo 22, 2024

Características esenciales que debemos tener en cuenta al adoptar un paradigma en la nube

septiembre 12, 2022

Basic AWS Glue concepts

julio 22, 2020 by Bluetab

Basic AWS Glue concepts

At Cloud Practice we aim to encourage adoption of the cloud as a way of working in the IT world. To help with this task, we are going to publish numerous good practice articles and use cases and others will talk about those key services within the cloud.

We present the basic concepts AWS Glue below.

What is AWS Glue?

AWS Glue is one of those AWS services that are relatively new but have enormous potential. In particular, this service could be very useful to all those companies that work with data and do not yet have powerful Big Data infrastructure.

Basically, Glue is a fully AWS-managed pay-as-you-go ETL service without the need for provisioning instances. To achieve this, it combines the speed and power of Apache Spark with the data organisation offered by Hive Metastore.

AWS Glue Data Catalogue

The Glue Data Catalogue is where all the data sources and destinations for Glue jobs are stored.

Table is the definition of a metadata table on the data sources and not the data itself. AWS Glue tables can refer to data based on files stored in S3 (such as Parquet, CSV, etc.), RDBMS tables…
Database refers to a grouping of data sources to which the tables belong.
Connection is a link configured between AWS Glue and an RDS, Redshift or other JDBC-compliant database cluster. These allow Glue to access their data.
Crawler is the service that connects to a data store, it progresses through a prioritised list of classifiers to determine the schema for the data and to generate the metadata tables. They support determining the schema of complex unstructured or semi-structured data. This is especially important when working with Parquet, AVRO, etc. data sources.

ETL

An ETL in AWS Glue consists primarily of scripts and other tools that use the data configured in the Data Catalogue to extract, transform and load the data into a defined site.

Job is the main ETL engine. A job consists of a script that loads data from the sources defined in the catalogue and performs transformations on them. Glue can generate a script automatically or you can create a customised one using the Apache Spark API in Python (PySpark) or Scala. It also allows the use of external libraries which will be linked to the job by means of a zip file in S3.
Triggers are responsible for running the Jobs. They can be run according to a timetable, a CloudWatch event or even a cron command.
Workflows is a set of triggers, crawlers and jobs related to each other in AWS Glue. You can use them to create a workflow to perform a complex multi-step ETL, but that AWS Glue can run as a single entity.
ML Transforms are specific jobs that use Machine Learning models to create custom transforms for data cleaning, such as identifying duplicate data, for example.
Finally, you can also use Dev Endpoints and Notebooks, which make it faster and easier to develop and test scripts.

Examples

Sample ETL script in Python:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()

glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

## Read Data from a RDS DB using JDBC driver
connection_option = {
"url": "jdbc:mysql://mysql–instance1.123456789012.us-east-1.rds.amazonaws.com:3306/database",
"user": "test",
"password": "password",
"dbtable": "test_table",
"hashexpression": "column_name",
"hashpartitions": "10"
}

source_df = glueContext.create_dynamic_frame.from_options('mysql', connection_options = connection_option, transformation_ctx = "source_df")

job.init(args['JOB_NAME'], args)

## Convert DataFrames to *AWS Glue* 's DynamicFrames Object
dynamic_df = DynamicFrame.fromDF(source_df, glueContext, "dynamic_df")

## Write Dynamic Frame to S3 in CSV format
datasink = glueContext.write_dynamic_frame.from_options(frame = dynamic_df, connection_type = "s3", connection_options = {
"path": "s3://glueuserdata"
}, format = "csv", transformation_ctx = "datasink")

job.commit()

Creating a Job using a command line:

aws glue create-job --name python-job-cli --role Glue_DefaultRole \

--command '{"Name" : "my_python_etl", "ScriptLocation" : "s3://SOME_BUCKET/etl/my_python_etl.py"}'

Running a Job using a command line:

aws glue start-job-run --job-name my_python_etl

AWS has also published a repository with numerous example ETLs for AWS Glue.

Security

Like all AWS services, it is designed and implemented to provide the greatest possible security. Here are some of the security features that AWS GLUE offers:

Encryption at Rest: this service supports data encryption (SSE-S3 or SSE-KMS) at rest for all objects it works with (metadata catalogue, connection password, writing or reading of ETL data, etc.).
Encryption in Transit: AWS offers Secure Sockets Layer (SSL) encryption for all data in motion, AWS Glue API calls and all AWS services, such as S3, RDS…
Logging and monitoring: is tightly integrated with AWS CloudTrail and AWS CloudWatch.
Network security: is capable of enabling connections within a private VPC and working with Security Groups.

Price

AWS bills for the execution time of the ETL crawlers / jobs and for the use of the Data Catalogue.

Crawlers: only crawler run time is billed. The price is $0.44 (eu-west-1) per hour of DPU (4 vCPUs and 16 GB RAM), billed in hourly increments.
Data Catalogue: you can store up to one million objects at no cost and at $1.00 (eu-west-1) per 100,000 objects thereafter. In addition, $1 (eu-west-1) is billed for every 1,000,000 requests to the Data Catalogue, of which first million is free.
ETL Jobs: billed only for the time the ETL job takes to run. The price is $0.44 (eu-west-1) per hour of DPU (4 vCPUs and 16 GB RAM), billed by the second.

Benefits

Although it is a young service, it is quite mature and is being used a lot by clients all over the AWS world. The most important features it offers us are:

It automatically manages resource escalation, task retries and error handling.
It is a Serverless service, AWS manages the provisioning and scaling of resources to execute the commands or queries in the Apache Spark environment.
The crawlers are able to track your data, suggest schemas and store them in a centralised catalogue. They also detect changes in them.
The Glue ETL engine automatically generates Python / Scala code and has a programmer including dependencies. This facilitates development of the ETLs.
You can directly query the S3 data using Athena and Redshift Spectrum using the Glue catalogue.

Conclusions

Like any database, tool, or service offered, AWS Glue has certain limitations that would need to be considered to adopt it as an ETL service. You therefore need to bear in mind that:

It is highly focused on working with data sources in S3 (CSV, Parquet, etc.) and JDBC (MySQL, Oracle, etc.).
The learning curve is steep. If your team comes from the traditional ETL world, you will need to wait for them to pick up understanding of Apache Spark.
Unlike other ETL tools, it lacks default compatibility with many third-party services.
It is not a 100% ETL tool in use and, as it uses Spark, code optimisations need to be performed manually.
Until recently (April 2020), AWS Glue did not support streaming data. It is too early to use AWS Glue as an ETL tool for real-time data.

Do you want to know more about what we offer and to see other success stories?

Álvaro Santos

Senior Cloud Solution Architect

My name is Álvaro Santos and I have been working as Solution Architect for over 5 years. I am certified in AWS, GCP, Apache Spark and a few others. I joined Bluetab in October 2018, and since then I have been involved in cloud Banking and Energy projects and I am also involved as a Cloud Master Partitioner. I am passionate about new distributed patterns, Big Data, open-source software and anything else cool in the IT world.

SOLUTIONS, WE ARE EXPERTS

Bluetab

Data Governance in the Media sector

Almacén Cloud en Banca

Cloud Warehouse in Banking

Introducción a los productos de HashiCorp

¿Por qué Hashicorp?

¿Cuáles son estos productos?

Ansible:

Terraform:

Ansible:

Terraform:

Conclusiones

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Introduction to HashiCorp products

Why HashiCorp?

What are their products?

Ansible:

Terraform:

Ansible:

Terraform:

Conclusions

¿Quieres saber más de lo que ofrecemos y ver otros casos de éxito?

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Basic AWS Glue concepts

What is AWS Glue?

AWS Glue Data Catalogue

ETL

Examples

Security

Price

Benefits

Conclusions

Do you want to know more about what we offer and to see other success stories?

DATA STRATEGY

DATA FABRIC

AUGMENTED ANALYTICS

Footer