Stories
Slash Boxes
Comments

SoylentNews is people

Submission Preview

10 Best Practices to Write Readable and Maintainable SQL Code

Rejected submission by upstart at 2021-07-31 22:04:34
News

████ # This file was generated bot-o-matically! Edit at your own risk. ████

10 Best Practices to Write Readable and Maintainable SQL Code [towardsdatascience.com]:

10 Best Practices to Write Readable and Maintainable SQL Code How to write SQL queries that your team can easily read and maintain?

David Martins

3 hours ago·8 min read

It’s easy to mess with SQL without proper guidelines. Since everybody in your team might have their own habits of writing SQL, you can quickly end up with a confusing code that nobody understands.

You probably realized the importance of following a set of good practices.May this article give you the guidance you’re looking for!

1. Use Uppercase for the Keywords

Let’s start with a basic: use uppercase for the SQL keywords [w3schools.com], and lowercase for your tables and columns. It’s also a good practice to use uppercase for the SQL functions (FIRST_VALUE(), DATE_TRUNC(), etc) even though it’s more debatable.

Avoid

select id, name from company.customers

Prefer

SELECT id, name FROM company.customers2. Use Snake Case for the schemas, tables, columns

Programming languages have their best practices when it comes to case types: camelCase, PascalCase, kebab-case, and snake_case are the most common.

When it comes to SQL, Snake Case (sometimes referred to as underscore case) is the most widely used convention.

Avoid

SELECT Customers.id, Customers.name, COUNT(WebVisit.id) as nbVisitFROM COMPANY.CustomersJOIN COMPANY.WebVisit ON Customers.id = WebVisit.customerIdWHERE Customers.age = 30GROUP BY Customers.id, Customers.name

Prefer

SELECT customers.id, customers.name, COUNT(web_visit.id) as nb_visitFROM company.customersJOIN company.web_visit ON customers.id = web_visit.customer_idWHERE customers.age = 30GROUP BY customers.id, customers.name

Although some people like to include variations to differentiate between schemas, tables, and columns, I would recommend sticking with snake case.

3. Use aliases when it improves readability

It’s well known, aliases are a convenient way to rename tables or columns which doesn’t make sense. Don’t hesitate to give an alias to your tables and columns when their names aren’t meaningful, and to alias your aggregates.

Avoid

SELECT customers.id, customers.name, customers.context_col1, nested.f0_FROM company.customersJOIN (SELECT customer_id, MIN(date)FROM company.purchasesGROUP BY customer_id ) ON customer_id = customers.id

Prefer

SELECT customers.id, customers.name, customers.context_col1 as ip_address, first_purchase.date as first_purchase_dateFROM company.customersJOIN (SELECT customer_id, MIN(date) as dateFROM company.purchasesGROUP BY customer_id ) AS first_purchase ON first_purchase.customer_id = customers.id

I usually alias the columns with a lower case as, and the tables with an uppercase AS.

4. Formatting: Carefully use Indentation & White spaces

Even though it’s a basic principle, it’s a quick win to make your code more readable. As you would do with python, you should ident your SQL code.

Ident after a keyword, and when you use a subquery or a derived table.

Avoid

SELECT customers.id, customers.name, customers.age, customers.gender, customers.salary, first_purchase.dateFROM company.customersLEFT JOIN ( SELECT customer_id, MIN(date) as date FROM company.purchases GROUP BY customer_id ) AS first_purchase ON first_purchase.customer_id = customers.id WHERE customers.age=30

Prefer

SELECT customers.id, customers.name, customers.age, customers.gender, customers.salary, first_purchase.dateFROM company.customersLEFT JOIN (SELECT customer_id, MIN(date) as date FROM company.purchasesGROUP BY customer_id ) AS first_purchase ON first_purchase.customer_id = customers.idWHERE customers.age = 30

Also, note how we used the white spaces in the where clause.

Avoid

SELECT id WHERE customers.age=30

Prefer

SELECT id WHERE customers.age = 305. Avoid Select *

It is worth nothing to remind this good practice. You should be explicit about what you want to Select, therefore avoid using Select *.

Select * make your request unclear since it hides the intentions behind the query. Also, remember that your tables might evolve and impact Select *. That’s why I’m not a big fan of the EXCEPT() instruction.

Avoid

SELECT * EXCEPT(id) FROM company.customers

Prefer

SELECT name, age, salaryFROM company.customers6. Go for the ANSI-92 JOIN Syntax

…instead of the SQL WHERE Clause for Joining Tables.Even though you can use both a WHERE clause and a JOIN clause to join tables, it’s a best practice to use the JOIN / ANSI-92 syntax.

Although there is no difference in terms of performance, the JOIN clause separates the relationship logic from the filters and improves the readability.

Avoid

SELECT customers.id, customers.name, COUNT(transactions.id) as nb_transactionFROM company.customers, company.transactionsWHERE customers.id = transactions.customer_idAND customers.age = 30GROUP BY customers.id, customers.name

Prefer

SELECT customers.id, customers.name, COUNT(transactions.id) as nb_transactionFROM company.customersJOIN company.transactions ON customers.id = transactions.customer_idWHERE customers.age = 30GROUP BY customers.id, customers.name

The “Where clause based” syntax — also known as ANSI-89 — is older than the new ANSI-92, which is why it’s still very common. Nowadays most Developers and Data analysts use the JOIN syntax.

7. Use Common Table Expression (CTE)

A CTE allows you to define and execute a query, of which the result exists temporarily and can be used within a larger query. CTEs are available on most modern databases.

It works like a derived table, with 2 advantages :

  • Using CTE improves the readability of your query
  • A CTE is defined once then can be referred multiple times

You declare a CTE with the instruction WITH … AS:

WITH my_cte AS( SELECT col1, col2 FROM table)SELECT * FROM my_cte

Avoid

SELECT customers.id, customers.name, customers.age, customers.gender, customers.salary, persona_salary.avg_salary as persona_avg_salary, first_purchase.dateFROM company.customersJOIN (SELECT customer_id, MIN(date) as date FROM company.purchasesGROUP BY customer_id ) AS first_purchase ON first_purchase.customer_id = customers.idJOIN (SELECT age, gender, AVG(salary) as avg_salaryFROM company.customersGROUP BY age, gender ) AS persona_salary ON persona_salary.age = customers.ageAND persona_salary.gender = customers.genderWHERE customers.age = 30

Prefer

WITH first_purchase AS(SELECT customer_id, MIN(date) as date FROM company.purchasesGROUP BY customer_id),persona_salary AS(SELECT age, gender, AVG(salary) as avg_salaryFROM company.customersGROUP BY age, gender)SELECT customers.id, customers.name, customers.age, customers.gender, customers.salary, persona_salary.avg_salary as persona_avg_salary, first_purchase.dateFROM company.customersJOIN first_purchase ON first_purchase.customer_id = customers.idJOIN persona_salary ON persona_salary.age = customers.ageAND persona_salary.gender = customers.genderWHERE customers.age = 308. Sometimes, it might be worth splitting into multiple queries

Be careful with this one. Let’s give some context:

I often use AirFlow to execute SQL queries on Bigquery, transform data, and prepare data visualizations. We have a workflow orchestrator (Airflow) that executes requests in a defined order. In some situations, we choose to split complex queries into multiple, smaller queries.

Instead Of

CREATE TABLE customers_infos ASSELECT customers.id, customers.salary, traffic_info.weeks_since_last_visit, category_info.most_visited_category_id, purchase_info.highest_purchase_valueFROM company.customersLEFT JOIN ([..]) AS traffic_infoLEFT JOIN ([..]) AS category_infoLEFT JOIN ([..]) AS purchase_info

You Could Use

## STEP1: Create initial tableCREATE TABLE public.customers_infos ASSELECT customers.id, customers.salary, 0 as weeks_since_last_visit, 0 as most_visited_category_id, 0 as highest_purchase_valueFROM company.customers## STEP2: Update traffic infosUPDATE public.customers_infosSET weeks_since_last_visit = DATE_DIFF(CURRENT_DATE, last_visit.date, WEEK)FROM ( SELECT customer_id, max(visit_date) as date FROM web.traffic_info GROUP BY customer_id ) AS last_visitWHERE last_visit.customer_id = customers_infos.id## STEP3: Update category infosUPDATE public.customers_infosSET most_visited_category_id = [...]WHERE [...]## STEP4: Update purchase infosUPDATE public.customers_infosSET highest_purchase_value = [...]WHERE [...]

WARNING: Even though this method is great when it comes to simplifying complex queries, it might come with a readability/performances trade-off.

It’s especially true if you work with an OLAP — or any column-oriented — database, optimized for aggregations and analytics queries (SELECT, AVG, MIN, MAX, …), but less performant when it comes to transactions (UPDATE).

Although in some cases, it might also improve your performance. Even with a modern column-oriented database, too many JOINs will result in memory or performance issues. In those situations, splitting your request usually helps with performance and memory.

Also, it is worth nothing to mention that you need some sort of program or orchestrator to execute your queries in a defined order.

9. Meaningful names based on your own conventions

Rightfully naming your schemas and tables is difficult. Which naming conventions to use is debatable, but choosing one and sticking to it is not.You should define your own convention and have it adopted by your team.

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

Here are examples of conventions I use:

Schemas

If you work with an analytics database that serves multiple purposes, it’s a good practice to organize your tables in meaningful schemas.

In our Bigquery database, we’ve one schema per data source. More important, we output the results in different schemas depending on their purpose.

  • Any table which should be accessible by a third-party tool lays in the public schema. Dataviz tools like DataStudio or Tableau get their data from here.
  • Since we use Machine Learning with BQML [towardsdatascience.com], we’ve got a dedicated machine_learning schema.

Tables

Tables themselves should be names according to conventions.At Agorapulse, we have several Dashboards for data visualization, each with its own purpose: a Marketing dashboard, a Product dashboard, an Executive dashboard, to name a few.

Each table in our public schema is prefixed by the name of the dashboard. Some examples might include:

product_inbox_usageproduct_addon_competitor_statsmarketing_acquisition_agenciesexecutive_funnel_overview

When working with a team it is worth taking the time to define your conventions. When it comes to naming a new table, never go with the quick and dirty name that you’ll “change later”: you probably won’t.

Feel free to use those examples to define your conventions.

10. Finally, write useful comments… but not too much

I agree with the idea that a well-written and rightfully named code shouldn’t need comments. Someone who reads your code should understand the logic and intention even before the code itself.

Still, comments might be useful in some situations. But you should definitely avoid the pitfall of commenting too much.

Avoid

WITH fp AS(SELECT c_id, # customer id MIN(date) as dt # date of first purchaseFROM company.purchasesGROUP BY c_id),ps AS(SELECT age, gender, AVG(salary) as avgFROM company.customersGROUP BY age, gender)SELECT customers.id, ct.name, ct.c_age, # customer age ct.gender, ct.salary, ps.avg, # average salary of a similar persona fp.dt # date of first purchase for this clientFROM company.customers ct# join the first purchase on client idJOIN fp ON c_id = ct.id# match persona based on same age and genreJOIN ps ON ps.age = c_ageAND ps.gender = ct.genderWHERE c_age = 30

Prefer

WITH first_purchase AS(SELECT customer_id, MIN(date) as date FROM company.purchasesGROUP BY customer_id),persona_salary AS(SELECT age, gender, AVG(salary) as avg_salaryFROM company.customersGROUP BY age, gender)SELECT customers.id, customers.name, customers.age, customers.gender, customers.salary, persona_salary.avg_salary as persona_avg_salary, first_purchase.dateFROM company.customersJOIN first_purchase ON first_purchase.customer_id = customers.idJOIN persona_salary ON persona_salary.age = customers.ageAND persona_salary.gender = customers.genderWHERE customers.age = 30Conclusion

SQL is great. It’s one of the bases of Data Analysis, Data Science, Data Engineering, and even software development: it won’t go await. Its flexibility is a strength but can be a pitfall.

You might not realize it at first, especially if you’re the only person in charge of your own code. But at some point, when working with a team or if someone has to continue your work, a SQL code without a set of best practices will become a burden.

In this article, I summarized the most common best practices to write SQL.I hope it’ll help you take your SQL quality to the next level!

In this article, I summarized the most common best practices to write SQL.I hope it’ll help you take your SQL quality to the next level!


Original Submission