Abdullah Şamil Güser

S3 Data Lake Architecture - Data Consolidation

Introduction to Data Lake (Lesson 188)

Concept of Data Lakes and Data Warehouses

Differences Between Data Lakes and Data Warehouses

Building a Data Lake on AWS

  1. Storage Options:
    • S3: Primary storage solution for data lakes.
    • Amazon Glacier: For data backup and long-term archival.
  2. Data Ingestion/Consolidation:
    • Kinesis Firehose: For real-time streaming data capture and loading.
    • Storage Gateway: Integrates on-premises data with S3.
    • Snowball & Snowmobile: Physical appliances for large-scale data transfer to AWS.
    • SDK, CLI and more: AWS SDKs, the Command Line Interface, and third-party tools for loading data into S3.
  3. Metadata and Data Catalog: Essential for data discoverability and preventing data swamps.
    • Do-it-yourself: Manual logic for metadata collection.
    • AWS Glue: Automated building and maintaining of data catalog.
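
The "do-it-yourself" option above boils down to keeping your own metadata records. A minimal sketch (dataset name, bucket, and schema below are hypothetical; AWS Glue automates this same bookkeeping, plus schema inference):

```python
# Minimal "do-it-yourself" data catalog: a mapping from dataset name
# to location, format, and schema. Without entries like this, data in
# the lake becomes undiscoverable -- the "data swamp" problem.
# The bucket and dataset names are hypothetical examples.

def register_dataset(catalog, name, s3_path, fmt, columns):
    """Record where a dataset lives and what its schema looks like."""
    catalog[name] = {"location": s3_path, "format": fmt, "columns": columns}
    return catalog

catalog = {}
register_dataset(
    catalog,
    "iris",
    "s3://aws-glue-yourname/iris/csv/",
    "csv",
    ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"],
)
print(catalog["iris"]["location"])
```

Glue replaces this manual logic with crawlers that scan S3 and maintain the catalog automatically.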

Data Lake vs. Data Swamp

Kinesis - Streaming and Batch Processing (Lesson 189)

Overview of Amazon Kinesis

Understanding Streaming Data

Comparison: Streaming vs. Batch Processing

  1. Batch Processing:
    • Data collected and stored in databases or data lakes.
    • Analyzed periodically (hourly, daily, weekly).
    • Suitable for non-time-sensitive analytics.
    • Tools: Spark on EMR, machine learning, etc.
  2. Stream Processing:
    • Data analyzed as it arrives.
    • Responds within seconds or minutes.
    • Essential for real-time insights and actions.
    • Use Cases: GPS navigation, billing alerts, medical emergency alerts.
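
The distinction above can be shown on the same event sequence: batch computes once over stored data, while streaming updates its answer as each event arrives.

```python
# Batch vs. stream processing over the same events.

events = [3, 1, 4, 1, 5]

# Batch: collect everything first, then run one computation
# (e.g., an hourly or daily job over the data lake).
batch_total = sum(events)

# Stream: update per event, so an answer is available within
# seconds of each arrival instead of at the end of the period.
running_totals = []
total = 0
for event in events:
    total += event
    running_totals.append(total)

# Both converge to the same final answer; the difference is latency.
print(batch_total, running_totals)
```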

Kinesis Product Family

  1. Kinesis Video Streams:
    • Securely stream video from various devices.
    • Applications: Video playback, security monitoring, analytics.
  2. Kinesis Data Streams:
    • Offers control over streaming data for custom real-time applications.
    • Integrates with Kinesis Data Analytics, Spark on EMR, EC2, Lambda.
  3. Kinesis Firehose:
    • Simplest solution for capturing and processing data streams.
    • Automatically loads data into AWS services (S3, Redshift, Elasticsearch, Splunk).
    • Enables analysis using existing BI tools.
  4. Kinesis Data Analytics:
    • Query streaming data using SQL.
    • Process and route data to AWS data stores.
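
As a concrete sketch of the Firehose option, the function below builds the parameter dict for a PutRecord call (the stream name and event are hypothetical; the actual boto3 call, shown commented, needs AWS credentials):

```python
import json

def firehose_record(stream_name, event):
    """Build the parameters for a Kinesis Data Firehose PutRecord call.
    Firehose does not add delimiters between records, so a trailing
    newline is appended to keep records separable once they land in S3."""
    return {
        "DeliveryStreamName": stream_name,
        "Record": {"Data": (json.dumps(event) + "\n").encode("utf-8")},
    }

params = firehose_record("demo-stream", {"sensor": "s1", "temp": 21.5})

# With credentials configured, the delivery itself would be:
#   import boto3
#   boto3.client("firehose").put_record(**params)
print(params["DeliveryStreamName"])
```

Firehose then batches, optionally transforms, and loads these records into the configured destination (S3, Redshift, etc.) without any consumer code on your side.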

Data Formats and Tools for Data Format Conversion (Lesson 190)

Data Formats and Their Strengths

Tools for Data Transformation

In-Place Analytics and Portfolio of Tools (Lesson 191)

In-Place Querying Tools

  1. Amazon Athena:
    • Interactive query service for S3 data using SQL.
    • Serverless, pay-per-query model.
    • Supports CSV, JSON, Parquet, ORC, Avro.
    • Use Case: Ideal for ad-hoc data discovery and SQL querying.
  2. Redshift Spectrum:
    • Queries data on S3 directly.
    • Advanced query optimization, distributed across nodes.
    • Integrates with Redshift data warehouse.
    • Use Case: Suited for complex queries and large user bases.
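
Athena's serverless model shows up directly in its API: a query needs only SQL, a catalog database, and an S3 location for results. A sketch of the StartQueryExecution parameters (table and bucket names are illustrative):

```python
def athena_query_params(sql, database, output_s3):
    """Build the parameters for Athena's StartQueryExecution API.
    Athena is serverless: point it at a catalog database and an S3
    results location, and pay per query scanned."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = athena_query_params(
    "SELECT class, count(*) AS n FROM iris_csv GROUP BY class",
    "demo_db",
    "s3://aws-athena-query-results-1234567890-us-east-1/",
)

# With credentials configured:
#   import boto3
#   boto3.client("athena").start_query_execution(**params)
```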

Querying Streaming Data

  1. Kinesis Data Analytics:
    • SQL querying for streaming data.
    • Continuously running queries for real-time monitoring.
    • Use Case: Real-time analytics on streaming data.

Broader Analytics Portfolio

  1. Amazon EMR:
    • Runs Hadoop workloads (Spark, Hive, HBase).
    • Consumes data from S3.
  2. Amazon SageMaker:
    • Machine learning with supervised, unsupervised, and reinforcement learning.
    • Trains on data in S3, provides real-time and batch predictions.
  3. Amazon AI Services:
    • Pre-built services for video and image analysis, NLP.
    • Analyzes data in S3.
  4. Amazon QuickSight:
    • BI tool for interactive dashboards.
    • Connects to Redshift, Athena, databases, S3.
  5. Amazon Redshift:
    • Petabyte-scale data warehouse.
    • Loads data from S3, extends with Redshift Spectrum.
  6. AWS Lambda:
    • Executes business logic for data lake.
    • Integrates with S3 data lake.
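
For the Lambda entry above, a handler wired to the data lake typically receives S3 event notifications. The sketch below only extracts which object arrived; real business logic would go where the comment indicates (bucket and key are the hypothetical names used in the labs):

```python
# Parse an S3 event notification inside a Lambda handler.
# The event structure follows S3's notification format.

def handler(event, context=None):
    results = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # ... run business logic on the new object here ...
        results.append(f"s3://{bucket}/{key}")
    return results

sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "aws-glue-yourname"},
                "object": {"key": "iris/csv/iris_all.csv"}}}
    ]
}
print(handler(sample_event))
```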

Monitoring and Optimization (Lesson 192)

Monitoring Tools

  1. AWS CloudWatch:
    • Monitors health of data lake components.
    • Tracks metrics, sets alarms, and automates responses.
    • Supports both AWS-generated and custom application metrics.
    • Includes CloudWatch Logs for log file consolidation and event monitoring.
  2. AWS CloudTrail:
    • Provides an audit trail of all AWS API activities.
    • Captures actions across web console, CLI, SDKs.
    • CloudTrail logs delivered to S3 can be queried with standard SQL using Athena.
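
The "custom application metrics" point above maps to CloudWatch's PutMetricData API. A sketch of the parameters (namespace and metric name are hypothetical examples for a data lake pipeline):

```python
def custom_metric(namespace, name, value, unit="Count"):
    """Build the parameters for CloudWatch's PutMetricData API,
    which is how an application publishes custom metrics."""
    return {
        "Namespace": namespace,
        "MetricData": [{"MetricName": name, "Value": value, "Unit": unit}],
    }

params = custom_metric("DataLake/Ingestion", "RecordsLoaded", 1250)

# With credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**params)
```

An alarm on such a metric (e.g., RecordsLoaded dropping to zero) is what turns monitoring into automated response.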

Data Storage Optimization

Security and Protection (Lesson 193)

Access Control

Data Encryption

Data Durability and Versioning

Tag-Based Security

Lab - Glue Data Catalog (Lesson 194 & 195)

1. Setting Up Permissions for Glue

  1. Open IAM Console
  2. Create a role and select “Glue” as the trusted service.
  3. On the Permissions page, search for “Glue”. Select AWSGlueServiceRole policy.
  4. Name the role AWSGlueServiceRoleDefault.
  5. Create the role.
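
Selecting “Glue” in step 2 attaches a trust policy behind the scenes that lets the Glue service assume the role. The document looks like this:

```python
import json

# Trust (assume-role) policy created when "Glue" is chosen as the
# trusted service: it permits glue.amazonaws.com to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
print(json.dumps(trust_policy, indent=2))
```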

2. S3 Bucket Setup

  1. Create S3 Bucket. Use naming convention: aws-glue-yourname.
  2. In the bucket, create a folder named iris.
  3. Inside iris, create a subfolder csv.
  4. In the course distribution, find iris_all.csv.
  5. Place iris_all.csv in the iris/csv folder of your S3 bucket.
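
The console steps above are equivalent to one upload call; the sketch below derives the object key from the lab's folder convention (replace the bucket name with your own):

```python
def object_key(dataset, fmt, filename):
    """Derive the S3 object key from the lab's folder convention:
    <dataset>/<format>/<file>."""
    return f"{dataset}/{fmt}/{filename}"

bucket = "aws-glue-yourname"   # replace with your own bucket name
key = object_key("iris", "csv", "iris_all.csv")

# With credentials configured, the console upload is equivalent to:
#   import boto3
#   boto3.client("s3").upload_file("iris_all.csv", bucket, key)
print(f"s3://{bucket}/{key}")
```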

3. Configure Glue Crawler

  1. Open Glue Console. Ensure the region matches your S3 bucket.
  2. Create Crawler. Name the crawler iris_csv_crawler.
  3. Choose “Data Stores” as the source and set “Crawl all folders”.
  4. Select “S3” as the data store.
  5. Choose “Specified path in my account”. Include path: s3://aws-glue-yourname/iris/csv/.
  6. Set IAM Role. Select the AWSGlueServiceRoleDefault role.
  7. Add a new database named demo_db.
  8. Prefix tables with iris_.
  9. Crawler Frequency: Choose “Run on demand”.
  10. Run the crawler.
  11. Wait for the crawler to complete (check for “Tables added” count).
  12. Go to “Tables” in Glue. Select the iris_csv table. Review schema details and data types.
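
The crawler configured above can also be expressed as Glue's CreateCrawler API parameters; the sketch below mirrors the console choices (omitting a schedule corresponds to “Run on demand”):

```python
def crawler_params(name, role, database, prefix, s3_path):
    """Build the parameters for Glue's CreateCrawler API, mirroring
    the console steps. No Schedule key means run on demand."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "TablePrefix": prefix,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

params = crawler_params(
    "iris_csv_crawler",
    "AWSGlueServiceRoleDefault",
    "demo_db",
    "iris_",
    "s3://aws-glue-yourname/iris/csv/",
)

# With credentials configured:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**params)
#   glue.start_crawler(Name=params["Name"])
```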

Lab - Query with Athena (Lesson 196 & 197)

Athena Configuration

  1. Open Athena Console. From the AWS console, access the Athena service.
  2. In Athena, expand the left navigation and select “Query editor”.

Setup Query Results

  1. Before running queries, set up an S3 location for query results.
  2. Click on “View Settings”, then “Manage”.
  3. Follow the recommended naming convention: aws-athena-query-results-MyAcctID-MyRegion.
  4. Example S3 location: s3://aws-athena-query-results-1234567890-us-east-1/.
  5. Save the changes.

Running Queries

  1. Select Data Source & Database:
    • In the Query Editor, ensure the Editor tab is selected.
    • For Data source, choose AWSDataCatalog.
    • Select demo_db as the database.
  2. Preview Table:
    • Find the iris_csv table under Tables.
    • Click the three dots next to iris_csv, select “Preview table”.
    • This generates a sample SQL query: SELECT * FROM "demo_db"."iris_csv" limit 10;.
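
The generated preview query follows a simple pattern (double-quoted identifiers plus a row cap), which can be reproduced for any catalog table:

```python
def preview_query(database, table, limit=10):
    """Reproduce the SQL that Athena generates for "Preview table":
    double-quoted identifiers, with rows capped by a LIMIT clause."""
    return f'SELECT * FROM "{database}"."{table}" limit {limit};'

sql = preview_query("demo_db", "iris_csv")
print(sql)
```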

Example Queries

Glue ETL - Pandas DataFrame vs Spark DataFrame vs Glue DynamicFrame (Lesson 198)

Introduction

Underlying Concepts

Dynamic Frame Features
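
The awsglue library only runs inside a Glue environment, but the key DynamicFrame feature - tolerating records whose fields disagree on type (a "choice" type), resolved later - can be sketched in plain Python. This is a simplified stand-in for resolveChoice with a cast, not the real awsglue API:

```python
# A Spark or Pandas DataFrame requires one type per column up front;
# a Glue DynamicFrame records a "choice" type when values disagree
# and lets you resolve it later (e.g., resolveChoice("cast:double")).

records = [
    {"id": 1, "petal_width": 0.2},    # arrived as a number
    {"id": 2, "petal_width": "1.4"},  # same field arrived as a string
]

def field_types(rows, field):
    """A choice type exists when a field carries more than one type."""
    return {type(row[field]).__name__ for row in rows}

def resolve_choice_cast(rows, field, cast=float):
    """Resolve the choice by casting every value to a single type."""
    return [{**row, field: cast(row[field])} for row in rows]

assert field_types(records, "petal_width") == {"float", "str"}
resolved = resolve_choice_cast(records, "petal_width")
assert field_types(resolved, "petal_width") == {"float"}
```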

Bookmarks in Glue

Types of Jobs in Glue

  1. Spark Jobs: Batch data processing.
  2. Streaming ETL Jobs: Continuous processing of streaming data.
  3. Python Shell Jobs: Run Python scripts without Apache Spark.
  4. Ray Jobs: Newer capability for scaling Python workloads (e.g., machine learning data processing) with the Ray framework.

Summary

Labs (Lecture 199 - 205)

  1. Glue ETL - Convert format to Parquet (Lecture 199)
  2. Query Amazon Customer Reviews with Athena (Lecture 200)
  3. Sentiment of the Customer Review (Lecture 201)
  4. Query Sentiment of Customer Reviews using Athena (Lecture 202)
  5. Serverless Customer Review Solution Part 1 (Lecture 204)
  6. Serverless Customer Review Solution Part 2 (Lecture 205)