We are looking for a Data Engineer responsible for moving data from flat files in JSON and XML formats, covering structured, semi-structured, and unstructured data. This person will use PySpark to load the data into Hive tables (Hive is part of the Hadoop ecosystem) and ultimately move the data from the Hadoop ecosystem to Snowflake. We already have utilities for moving data from the Hadoop ecosystem to Snowflake; what we need is somebody who can write PySpark code to ingest the data into Hive. This is 100% on UNIX systems, so we are not looking for anybody to deploy code on AWS or write it in Scala. We are only looking for pure PySpark experience processing data from flat files and loading it into the Hadoop ecosystem.
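For context on what this kind of PySpark-to-Hive ingestion typically looks like, here is a minimal sketch: reading JSON and XML flat files and appending them to Hive tables. All paths, table names, and the XML rowTag below are hypothetical stand-ins, and the XML reader assumes the spark-xml package (or Spark 4's built-in XML source) is available; this is an illustration, not the team's actual utility code.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("flat-file-to-hive")
        .enableHiveSupport()          # needed to write managed Hive tables
        .getOrCreate()
    )

    # JSON: Spark infers the schema directly from the files.
    json_df = spark.read.json("/data/landing/events/*.json")

    # XML: each <record> element becomes one row (rowTag is hypothetical;
    # requires the spark-xml package on the classpath, or Spark 4+).
    xml_df = (
        spark.read.format("xml")
        .option("rowTag", "record")
        .load("/data/landing/events/*.xml")
    )

    # Light cleansing before load, e.g. dropping rows that are entirely null.
    json_df = json_df.dropna(how="all")

    # Append into Hive tables; saveAsTable creates them on first run
    # (the "staging" database is assumed to exist).
    json_df.write.mode("append").saveAsTable("staging.events_json")
    xml_df.write.mode("append").saveAsTable("staging.events_xml")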
Location: Onsite in McLean, VA (hybrid: onsite Tuesday through Thursday). Candidates must be local to the McLean, VA area (no more than a 90 min. commute) and able to attend in-person interviews.

Must Haves:
• 6+ years of hands-on Python development experience using PySpark, XML, and JSON.
• On the database side, experience with SQL using Hive.
• 6+ years of experience with Hadoop is mandatory.
• Snowflake experience is preferred.
• Hands-on experience with cloud technologies is a preference and could lead to contract extensions supporting future projects.

Description:
• Cleanse, manipulate, and analyze large datasets (structured and unstructured data: XMLs, JSONs, PDFs) using the Hadoop platform.
• Develop Python, PySpark, and Spark scripts to filter, cleanse, map, and aggregate data (a sketch of this kind of work follows this list).
• Manage and implement data processes (data quality reports).
• Develop data profiling, deduping, and matching logic for analysis.
• Apply programming experience in Python, PySpark, and Spark for data ingestion.
• Apply programming experience on a big data platform, using Hadoop.
• Present ideas and recommendations to management on the best use of Hadoop and other technologies.

Qualifications:
• Bachelor's degree in Computer Science, Statistics, Data Science, or a related quantitative field.
• 5+ years of experience processing large volumes and varieties of data (structured and unstructured data, writing code for parallel processing, XMLs, JSONs, PDFs).
• 5+ years of programming experience in Hadoop, Spark, and Python for data processing and analysis.
• Strong SQL experience is a must.
• 5+ years of experience using the Hadoop platform and performing analysis; familiarity with Hadoop cluster environments and resource-management configuration for analysis work.
• 2+ years of prior experience working in cloud platforms such as AWS; Kubernetes experience is highly desirable.
• Hands-on work experience with technologies for manipulating structured and unstructured big data. Big data technologies may include, but are not limited to, Hadoop, Hive, Spark, relational databases, and NoSQL.
• Prior experience with MPP databases like Snowflake is desired.
• Detail oriented, with superb verbal and written communication skills.
• Must be able to prioritize and meet deadlines.
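The Description bullets above center on filtering, cleansing, deduping, and aggregating data already landed in Hive. Below is a minimal, hedged sketch of what such a script might look like; the table names (staging.events_json, reports.dq_event_counts) and columns (customer_id, email, load_date) are hypothetical stand-ins, not part of the posting.

    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .appName("dq-dedupe")
        .enableHiveSupport()
        .getOrCreate()
    )

    df = spark.table("staging.events_json")

    cleansed = (
        df.filter(F.col("customer_id").isNotNull())             # drop rows missing the key
          .withColumn("email", F.lower(F.trim(F.col("email")))) # normalize before matching
          .dropDuplicates(["customer_id", "email"])             # simple exact-match dedupe
    )

    # A data-quality style aggregate: record counts per load date.
    dq_report = cleansed.groupBy("load_date").agg(
        F.count("*").alias("row_count"),
        F.countDistinct("customer_id").alias("distinct_customers"),
    )
    dq_report.write.mode("overwrite").saveAsTable("reports.dq_event_counts")

The dedupe rule here is deliberately simple (exact match on normalized key columns); the "matching logic" the posting mentions would typically layer fuzzier comparisons on top of this kind of normalization.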