How to Start in Data Engineering?
I started working with data a few years ago, messing with databases and SQL while building dashboards that my boss used to report in meetings. One thing I noticed was how hard it was to get quality data at a decent frequency. I could clean a few CSV files in Excel, but if I had to do it every week, I would be repeating the same manual process over and over again.
This need pushed me to learn more about data engineering, because I wanted to make my life easier at the time. The more I studied, the more fascinating I found the subject, so I decided to make the jump to working full-time as a data engineer.
In this article, I share the main resources I used in this process and a few tips to help you land your first position in data engineering.
Just to be clear, most of the items here are suggestions. Your company might use different programming languages or tools, so you don't need to follow everything here to the letter. I'm just compiling the main resources that I believe will be useful in your journey.
I will focus on the more “junior” level in this article: people who have zero to two years of experience in data engineering as a whole, or people who have worked in similar areas of programming or data and want to dedicate themselves to engineering.
Roadmap and Tools
Since you are possibly someone who doesn’t know much about the area or the necessary tools, my suggestion is to start with the fundamentals. This will pave the way for great results in your future career as a data engineer.
The main resources that I recommend for this phase are the following books:
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - An excellent book about large systems and how their parts can communicate with each other to achieve scalability. Essential reading even for backend engineers.
- The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling - This book is a must-read that explains good practices and ways to structure the information in your data warehouse.
- Business Intelligence Guidebook: From Data Integration to Analytics - This book is a good complement after reading the two above. The objective here is to learn more about the wider data area, like dashboards and communication with business stakeholders.
By now you have done some reading and learned a bit more about the essentials of the area. Now it is time to learn the tools that you are going to use in the next phase:
- Programming:
  - The suggestion is Python. It is the market standard for most companies that already have data practices.
  - You could learn Scala or Java, but they are not as widely used in the industry as a whole. These languages are more useful for specific markets or really big companies (like FAANGs).
- Database:
  - For handling information in databases, use SQL. It is the language of relational databases, which make up most data warehouse structures.
  - To work better with data warehouses and keep your transformations under version control, use DBT. It is an amazing tool, and it is being adopted more and more in data and analytics engineering.
- Development:
  - Study Docker. It is an incredible tool for developing code and deploying it to production.
- Cloud:
  - Study the cloud provider that your company currently uses. If it is Azure or Google, learn more about that one.
  - If you don’t work directly with any specific cloud provider, the safest option is AWS. Although not perfect, it is the biggest player in this market at the moment, so studying this provider will make it easier to get a position.
  - Learn the basics: IAM and permissions, data warehouses in the cloud, storage, VMs, and containers.
  - When you have more time, learn the specific tools: data pipelines, big data solutions, and machine learning models.
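To make the Python and SQL suggestions a bit more concrete, here is a minimal sketch of the two working together. It uses only Python's built-in `sqlite3` module with an in-memory database, so nothing needs to be installed. The `orders` table, the `top_customers` function, and the sample rows are all hypothetical names I made up for illustration; a real warehouse would use a different driver, but the SQL idea (group, aggregate, sort) is the same.

```python
import sqlite3


def top_customers(rows, limit=2):
    """Load (customer, amount) rows into a temporary table and rank customers by total spend."""
    conn = sqlite3.connect(":memory:")  # throwaway database that lives only in RAM
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        # Parameterized inserts avoid building SQL strings by hand.
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer
            ORDER BY total DESC, customer ASC
            LIMIT ?
            """,
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


sample = [("ana", 10.0), ("bob", 5.0), ("ana", 2.5), ("cy", 7.0)]
print(top_customers(sample))  # [('ana', 12.5), ('cy', 7.0)]
```

This kind of small script is roughly what replaces a weekly manual cleanup in a spreadsheet: the query is written once and can be rerun whenever new rows arrive.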
Now, after learning most of the elements above, you should do some personal projects to show exactly what you can do. Here are a few projects to consider:
- Free API to data lake:
  - Consume a free API resource (options here) and send the files into the data lake (Example: S3 and AWS Lake Formation). Transform the files into a columnar format (e.g. Parquet or ORC; Avro is also common in data lakes, but it is row-based) and run a solution to read those files (for example, Athena).
  - The objective here is to show that you can build a proper ETL pipeline, from start to finish.
  - Pro-tip: Don’t focus too much on structuring the data lake according to the best and most secure practices. Just pay attention to moving the data and serving it in the appropriate format for consumption.
- Clean big dataset to dashboard:
  - Get a large open dataset (options here) and clean it to serve the results in a dashboard (e.g. Tableau Public, Google Data Studio).
  - The objective here is to show that you can clean and handle data that is genuinely large, possibly using Spark, Dask, or AWS Glue. It also shows that you can serve this information in a form that is useful for the business, as data visualization.
  - Pro-tip: Don’t worry too much about the type of data. Just pick a large dataset and try to imagine some visualizations for it. Once you have a few in mind, start cleaning the data, which is the most time-consuming part.
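For the first project (free API to data lake), the overall shape of the pipeline can be sketched in a few functions. This is only a skeleton under some loud assumptions: the function names, the `id` and `name` fields, and the output path are hypothetical, and it writes newline-delimited JSON to a local folder as a stand-in for writing Parquet to S3, so that it runs with the standard library alone. In a real project you would probably swap the `load` step for a Parquet writer and an upload to your storage bucket.

```python
import json
import tempfile
import urllib.request
from pathlib import Path


def extract(url):
    """Download a JSON array of records from an HTTP endpoint."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def transform(records):
    """Drop rows without an id and normalize the remaining fields."""
    cleaned = []
    for record in records:
        if record.get("id") is None:
            continue  # skip rows that cannot be identified
        cleaned.append({
            "id": int(record["id"]),
            "name": str(record.get("name", "")).strip().lower(),
        })
    return cleaned


def load(records, path):
    """Write one JSON object per line and return the number of rows written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return len(records)


def run_pipeline(fetch, source, destination):
    """Chain the three steps; `fetch` is injectable so the pipeline can run offline."""
    return load(transform(fetch(source)), destination)


# Offline demo: a stub replaces the real API call (pass `extract` and a URL instead).
fake_api = lambda _source: [{"id": "1", "name": " Ana "}, {"name": "missing id"}]
with tempfile.TemporaryDirectory() as tmp:
    written = run_pipeline(fake_api, "unused", Path(tmp) / "lake" / "users.jsonl")
    print(written)  # 1
```

Passing the fetch step in as a parameter is a small design choice that makes the pipeline easy to test without hitting the API every time, which is handy when the free API has rate limits.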
Conclusion
Most of the resources mentioned here give you a good sense of what to expect in your day-to-day job as a data engineer. Keep in mind that every company has specific needs and business demands. At some startups, you might do more machine learning engineering than data cleaning; at others, you might build more models and analytics for dashboards. Adjust your plan to fit your current situation and perform better in it.
A final tip for anyone who is looking for a position, at any level:
Study the positions that you are interested in: what are the common tools, experience, and expectations? Then adjust your learning plan and resume to reflect what you find.
For example: if most of the junior data engineer positions in your city demand Spark, then adjust your plan to learn it, even though I didn’t include it in the resources above. Modify your plan to better fit your goals and reach your objective faster.