Now that we have covered the foundation of data infrastructure (storage, compute, and databases), let's put it all together to see how the pieces relate to one another. We will go over an example from Robinhood and then walk through what a system design question might look like. It's important to note that you will never be asked to write code for this topic – there isn't enough time, and that's what the SQL and Python questions are for. However, you might be asked to draw something out. The next section will have some practice problems for you.
1) Take a look through this case study: Data Lake at Robinhood.
You'll notice that Robinhood broke their data infrastructure into multiple layers: Ingestion, Storage, Processing, Validation, and User-Facing. You can also see that Robinhood uses many of the tools we have already discussed: S3 to store their data and, at one point, Redshift as the data warehouse, which sits at the intersection of the Storage and Query layers. Let's go over each layer and cover the tools we haven't discussed yet. More importantly, pay extra attention to what might be asked in an interview that isn't included in the article.
Ingestion: This section is important because it contains a lot of new information, and how you answer will depend heavily on the question your interviewer asks. Robinhood points out Kafka. Kafka is an open-source tool that lets you stream data in real time. Think of it as the water pipeline in your home: when you turn on your faucet, you expect water to stream through immediately. Real time may not be a requirement for every company, so you need to think through your specific business use case. For example, a financial company running ML models on historical data may only need to update its data once a day. For that use case, the company might just pull its data from an FTP server or have its data provider drop the data directly into its S3 bucket.
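To make the streaming idea concrete, here is a rough sketch of pushing a single trade event into Kafka using the open-source kafka-python client. The broker address and the "trades" topic are placeholders invented for this example, not anything from the Robinhood article.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Connect to a Kafka broker (the address is a placeholder for this sketch).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Stream one event per trade as it happens -- the "water pipeline" idea.
event = {"user_id": 42, "ticker": "AMZN", "price": 130.00}
producer.send("trades", value=event)  # "trades" is a hypothetical topic
producer.flush()  # block until the event is actually delivered
```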
The AWS equivalent of Kafka is Kinesis. Kinesis is also a real-time streaming tool, but the difference is that it lives in the AWS ecosystem and integrates well with other AWS tools (like S3). Kinesis is quicker and easier to set up. So why did Robinhood choose Kafka? Because it's cheaper! Sure, it takes longer and more engineering time to set up, but because Kafka is open source and Robinhood has the budget for engineers, it was probably in Robinhood's best interest to spend the time and benefit from the cost savings in the long run.
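For comparison, here is roughly what the same event looks like going into Kinesis with boto3. Notice how little setup there is because AWS manages the infrastructure for you; the sketch assumes a stream named "trades" has already been created and that your AWS credentials are configured.

```python
import json
import boto3  # pip install boto3

kinesis = boto3.client("kinesis")

event = {"user_id": 42, "ticker": "AMZN", "price": 130.00}
kinesis.put_record(
    StreamName="trades",                 # hypothetical, pre-created stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["user_id"]),  # controls which shard gets the record
)
```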
This is what the interviewer is looking for from you: they want to see you discuss your rationale for using real time (the business use case) and the tradeoffs between what you did use (Kafka) and what you didn't (Kinesis).
Storage: This one is straightforward. When you think storage for raw data, think S3. Notice how Robinhood even mentions Glacier, which we previously discussed. Robinhood also placed Redshift at the intersection of Storage and Query, which is correct because, if you recall, Redshift has both a storage and a compute layer.
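To make this concrete, landing a raw file in S3 is essentially a one-liner with boto3 (the bucket and key names below are made up for illustration):

```python
import boto3

s3 = boto3.client("s3")

# Drop a raw data file into the data lake; bucket and key are hypothetical.
s3.upload_file(
    Filename="trades_2023-01-01.parquet",
    Bucket="my-data-lake",
    Key="raw/trades/2023-01-01.parquet",
)
```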
Processing: This is another important section containing new material. Glue (an AWS service) and Airflow (an open-source Apache tool that AWS also offers as a managed service) both let you do ETL and build your pipelines. So, when you think ETL, most likely there is some "processing" involved. To transfer data from your S3 bucket into Redshift, you might need to get the data into the right format, as we discussed in the Compute section. While Airflow and Glue are different tools, both run on EC2 servers behind the scenes! As with the Ingestion layer, your job is to discuss the tradeoffs between the two (Robinhood happens to use both).
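For example, one common way to load processed data from S3 into Redshift is Redshift's COPY command. The sketch below issues it through the psycopg2 driver; the cluster endpoint, credentials, table, bucket path, and IAM role are all placeholders.

```python
import psycopg2  # pip install psycopg2-binary

# Connection details are placeholders for this sketch.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="...",
)

# Redshift's COPY loads files from S3 in parallel, assuming the data
# has already been processed into a format Redshift accepts (here, Parquet).
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY trades
        FROM 's3://my-data-lake/processed/trades/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """)
```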
Glue is cheaper and "serverless", meaning you don't need to do any work configuring how much compute power you are going to use. Airflow is an orchestration tool that visualizes your entire pipeline. For example, if your pipeline has 10 steps and it fails on step 6, you can see exactly where the error happened, fix it, and rerun the rest of the pipeline from the point of failure. Airflow is more expensive and requires a little more configuration. Either way, chances are you will need to do some sort of ETL, and the important thing is that you mention this during the interview.
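Here is a minimal sketch of what that multi-step idea looks like as an Airflow DAG, trimmed to three stub tasks. The task bodies, DAG name, and daily schedule are all assumptions for illustration.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    """Pull raw files into S3 (stub for this sketch)."""

def transform():
    """Reformat the raw data, e.g. into Parquet (stub)."""

def load():
    """COPY the processed data into Redshift (stub)."""

with DAG(
    dag_id="trades_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # assumption: a once-a-day batch is enough
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # If "transform" fails, you can fix it and rerun from that point
    # without repeating "extract".
    t1 >> t2 >> t3
```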
Query: The query layer is not something you need to worry about. It is simply a matter of setting up a tool so that users can query the databases. SQL Workbench is one example of an open-source tool, but most likely your company will already have one.
Validation: The Validation layer can happen in a lot of tools. Airflow, which we discussed in the Processing layer, is a common one. It's an important concept, as we have discussed before – you want checks in place that identify any problems with the data. You should mention validation somewhere in your design, but it isn't necessarily its own section; it just happens along the way.
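A validation check can be as simple as a step in your pipeline that fails loudly when the data looks wrong. Here is a minimal sketch, assuming the freshly loaded trades sit in a pandas DataFrame with a hypothetical price column:

```python
import pandas as pd

def validate_trades(df: pd.DataFrame) -> None:
    """Raise if the freshly loaded data fails basic sanity checks."""
    if df.empty:
        raise ValueError("No rows loaded -- upstream ingestion may have failed")
    if df["price"].isnull().any():
        raise ValueError("Found trades with missing prices")
    if (df["price"] <= 0).any():
        raise ValueError("Found trades with non-positive prices")
```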
User Facing Layers: If you are not familiar with JupyterHub or Looker, don't worry. Like the query layer, these are tools that are easy to set up and are simply ways for your coworkers (data analysts and data scientists) to interact with your data warehouse. Interviewers probably won't ask you about this, but it's good to know in case you want to mention it. Tableau dashboards are another way for data scientists to show their work and analysis to the rest of the team.
2) Outline of a System Design Question
Not all companies do what I am about to show you, but I think it’s important that you get familiar with at least the format of what a system design question can look like.
In the next section, there will be two system design questions. I want you to imagine that you have a blank Word doc with four quadrants, and each quadrant has one of the following categories: Notes, Schema Design, System Design, API Design.
Notes | Schema Design
System Design | API Design
For this book, don't worry too much about the API Design section. Data engineers won't have to worry much about it. It is essentially just "if the user asks us for x, what do we return to them?" In the Robinhood example, a user might ask us for the price of Amazon stock, and we would return, say, $130.
What is important are the other three sections. When an interviewer asks you to design a system (like Robinhood's), the first place you might head is the Notes section. This is where you ask your questions, take notes, write down the question being asked, etc. Remember the tips from section 5? Apply those tips! Make sure you understand the requirements before you start drawing arrows like the ones you see in the Robinhood article.
Once you have a clear understanding of the question being asked, head over to the Schema Design quadrant and start drawing out the tables you would want to include in your system. Whether virtual or on a whiteboard, you won't need to draw out an entire star schema like you did in section 4, but including a few tables (both fact and dim) with the appropriate columns and the logic to justify them will be more than sufficient.
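For instance, in a Robinhood-style question, your Schema Design quadrant might contain something like the sketch below. The table and column names are invented for illustration; what matters is showing a fact table, a dimension table, and your reasoning for the columns.

```python
# Hypothetical star-schema sketch for a Robinhood-style question.
# In the interview you would draw these as boxes; here they are DDL strings.
FACT_TRADES = """
CREATE TABLE fact_trades (
    trade_id   BIGINT,
    user_id    BIGINT,          -- joins to dim_users
    ticker     VARCHAR(10),     -- joins to dim_stocks
    price      DECIMAL(18, 2),
    traded_at  TIMESTAMP
);
"""

DIM_STOCKS = """
CREATE TABLE dim_stocks (
    ticker       VARCHAR(10),
    company_name VARCHAR(255),
    exchange     VARCHAR(50)
);
"""
```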
Next, head into the System Design quadrant. This is where you draw something along the lines of what you saw in the Robinhood example. Explain what data goes into S3, how you used a real-time tool to get it there (if you needed one), and how it eventually got into Redshift to create the tables written out in the Schema Design quadrant. Again, don't worry about making it pretty! As long as you can explain your rationale for each part, the interviewer will like what they see!
Exercise
You work at a music streaming company called Spotify. You have access to all the data that any music streaming service would have. With every song that a user listens to, the data scientists on the team can use machine learning to learn quickly and give song recommendations. However, they need the proper data and infrastructure to make their ML work. How would you design a recommendation engine?