HDFS DCM Matchtables V2
Overview
This project contains the code I use to pull daily DoubleClick Campaign Manager (DCM) v2 log files from Google Cloud Storage (GCS) and load them into the Hadoop Distributed File System (HDFS) for use in Hive queries. Each day, the script fully replaces the matchtable_v2 files on HDFS with that day's new files. Match tables are used as the demo here, but other log files change more often and should be downloaded more frequently (for example, clicks and impressions on an hourly basis).
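As a rough illustration of the flow described above (copy from GCS, then fully replace the files on HDFS), here is a minimal Python sketch that shells out to the gsutil and hdfs command line tools. It is not the project's actual script. The bucket name, object prefix, local staging directory, and HDFS target directory are all hypothetical placeholders, and the helper names (gcs_copy_command, hdfs_replace_commands, run_all) are made up for this example.

```python
import subprocess
from typing import List, Sequence


def gcs_copy_command(bucket: str, prefix: str, local_dir: str) -> List[str]:
    """Build a gsutil command that copies every object whose name starts
    with `prefix` from the bucket into a local staging directory.
    gsutil expands the trailing wildcard itself, so no shell is needed."""
    return ["gsutil", "-m", "cp", f"gs://{bucket}/{prefix}*", local_dir]


def hdfs_replace_commands(local_files: Sequence[str], hdfs_dir: str) -> List[List[str]]:
    """Build the commands for a full replace: drop the old HDFS directory,
    recreate it, then upload the freshly downloaded files."""
    return [
        ["hdfs", "dfs", "-rm", "-r", "-f", hdfs_dir],    # remove yesterday's files
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],       # recreate the target directory
        ["hdfs", "dfs", "-put", "-f", *local_files, hdfs_dir],  # upload the new files
    ]


def run_all(commands: Sequence[Sequence[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(list(cmd), check=True)


if __name__ == "__main__":
    # Hypothetical values; substitute your own bucket, prefix, and paths.
    download = gcs_copy_command("my-dcm-bucket", "match_table_", "/tmp/dcm_staging")
    upload = hdfs_replace_commands(
        ["/tmp/dcm_staging/match_table_example.csv.gz"],
        "/data/dcm/matchtable_v2",
    )
    run_all([download, *upload])
```

Building the commands as plain lists keeps them easy to inspect or log before anything is executed. Removing the HDFS directory first is what makes this a full replace rather than an append, so point hdfs_dir only at a directory dedicated to these files.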
Google Cloud Storage
DCM v2 log files are stored in a GCS bucket that this program accesses. Cloud Storage lets you store unstructured objects in containers called buckets. You can serve static data directly from Cloud Storage, or you can use it to store data for other Google Cloud Platform services. For more information, see the Google Cloud Storage documentation.
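To make the bucket and object terminology concrete, the following sketch lists the objects in a bucket with gsutil ls and splits each gs:// URL into its bucket and object name. It assumes a flat listing with one URL per line. The helper names (split_gs_url, parse_listing, list_bucket) and the bucket name in the usage section are hypothetical.

```python
import subprocess
from typing import List, Tuple


def split_gs_url(url: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object name)."""
    if not url.startswith("gs://"):
        raise ValueError(f"not a gs:// URL: {url!r}")
    bucket, _, object_name = url[len("gs://"):].partition("/")
    return bucket, object_name


def parse_listing(output: str) -> List[Tuple[str, str]]:
    """Parse listing text (one gs:// URL per line), skipping blank lines."""
    return [split_gs_url(line.strip()) for line in output.splitlines() if line.strip()]


def list_bucket(bucket: str, prefix: str = "") -> List[Tuple[str, str]]:
    """Run 'gsutil ls' for objects whose names start with `prefix`."""
    result = subprocess.run(
        ["gsutil", "ls", f"gs://{bucket}/{prefix}*"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_listing(result.stdout)


if __name__ == "__main__":
    # Hypothetical bucket name; requires gsutil to be installed and authenticated.
    for bucket_name, object_name in list_bucket("my-dcm-bucket", "match_table_"):
        print(bucket_name, object_name)
```

Keeping the parsing separate from the gsutil call means the parsing logic can be checked without network access or credentials.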
DCM data transfer files (log files) documentation
Documentation for v2 of the data transfer files can be found here.
Other helpful docs:
- DCM log file formats
- Match tables
- A mapping of old v1 data fields to the new v2 data fields
- gsutil, a Python tool that facilitates command line access to GCS